## Abstract

The visualization capability of a light field display is uniquely determined by its angular and spatial resolution, together referred to as the display passband. In this paper we use a multidimensional sampling model to describe the display-camera channel. Based on the model, for a given display passband, we propose a methodology for determining the optimal distribution of ray generators in a projection-based light field display. We also discuss the camera setup required to provide data with the necessary amount of detail for such a display, maximizing the visual quality while minimizing the amount of data.

© 2016 Optical Society of America

## 1. Introduction

Most commercially available 3D displays, stereoscopic as well as autostereoscopic, concentrate on reproducing the binocular visual cue for single or multiple observers, thereby giving the illusion of 3D [1]. However, a typical consumer display is incapable of, or has difficulty in, reproducing other cues important for 3D vision, the most notable being continuous head parallax [2]. There are two practical ways of achieving the illusion of continuous head parallax. The first is to track the user's eyes and render parallax-correct views depending on the user's position. This can be achieved by using either a head-mounted display (e.g. Oculus Rift, Samsung Gear VR, Zeiss VR) or a custom-built display with eye-tracking capabilities (e.g. zSpace). The disadvantage of these solutions is that, typically, only one user is supported. The second way of achieving a reasonably convincing continuous parallax is to use so-called light field (LF) displays [1,3,4].

An LF display strives to reproduce the underlying plenoptic function describing the scene that is visualized [5]. It can be observed by multiple users simultaneously without the need for user tracking or glasses. In order to support continuous parallax, a large and dense set of light rays has to be generated to reconstruct the underlying LF function. In today's LF displays this is achieved by using projection-based systems [3,4]. There are two major drawbacks of such LF displays. First, only a finite number of light rays can be generated in practice. Based on the properties of the human visual system (HVS) it is possible to estimate the optimal (required) number of rays needed to achieve a level of detail that is sufficient for a human observer [6,7]. Unfortunately, achieving that level of detail is impractical with today's technology. Second, due to the multiple sources of rays, it is very difficult to achieve the desired uniform density of rays (position-wise as well as intensity-wise) on the screen surface [4]. Both drawbacks reduce the perceived resolution of the display. Therefore, it is important to optimize the display setup and properly preprocess the data sent to the display in order to mitigate these two drawbacks as much as possible.

We have shown earlier [8] that by performing a frequency domain analysis of a typical LF display it is possible to determine the throughput of the display in terms of its spatial and angular resolution. This enables one to calculate the optimal amount of data that has to be captured and sent to the display to maximally utilize its visual capability. Moreover, it gives the user a good idea of what to expect from the display in terms of visual quality. In this paper, we build on some of the ideas presented in [8] in order to achieve a deeper understanding of the relations between the various hardware and software parts of an LF display. We present an analysis assuming a desired ray-sampling pattern at the screen plane, which defines the display specifications. For such a display, we estimate the throughput in terms of its angular-spatial bandwidth. Given the display specifications, we develop a methodology for determining the optimal distribution of ray generators that will result in the desired display properties, as well as a camera setup that can provide data with the required amount of detail. This is achieved by developing an optimization / estimation method for determining the required display / camera parameters.

The outline of this paper is as follows. Section 2 introduces the LF concepts and notation, followed by a description of the principle of operation of projection-based LF displays and their properties in the spatial and frequency domains. The proposed display-camera system optimization is introduced in Section 3, with several examples given in Section 4. Finally, concluding remarks are given in Section 5.

## 2. Light field displays

#### 2.1. Light field basics

In the most general case, by using ray-optics assumptions, the propagation of light in space can be described by a 7D continuous plenoptic function *R*(*θ*, *φ*, *λ*, *τ*, $A_x$, $A_y$, $A_z$), where ($A_x$, $A_y$, $A_z$) is a location in the 3D space, (*θ*, *φ*) are directions (angles) of observation, *λ* is wavelength, and *τ* is time [5]. For practical reasons, the continuous plenoptic function is typically simplified to its 4D version, which describes the static and monochromatic light ray propagation in half space. This 4D approximation of the plenoptic function is referred to as LF [9]. In this approximation, the LF ray positions are indexed either by their Cartesian coordinates on two parallel planes, the so-called two-plane parameterization *L*(*x*, *y*, *s*, *t*), or by their one plane and direction coordinates *L*(*x*, *y*, *φ*, *θ*) [9,10].

In this paper, without loss of generality and in line with today's display technology, we concentrate on the so-called horizontal parallax only (HPO) case, ignoring the vertical parallax and subsequently dropping the variables *t* or *θ* in the aforementioned parameterizations. Furthermore, we assume that the relation between the planes parameterized by (*x*, *s*) and (*x*, *φ*) is given by *s* = tan *φ*, with *x* being the same in both representations. In this parameterization, the origin of the *s* axis is relative to the given *x* coordinate.

The position of the two parallel planes *x* and *s* can be chosen depending on the application. Two such positions, where the distance between the parameterizing planes is taken as unit, are given in Fig. 1. According to the figure, the propagation of light rays through space can be mathematically expressed by Eqs. (1) and (2) [8,11], with $L_1$ and $L_2$ referring to LFs on plane position 1 and plane position 2, respectively, and *d* being the distance between the plane positions along the *z* axis. As can be seen from Eq. (2), when considering propagation of light rays in the plane and direction representation, the relation between parameters on both planes is not strictly linear. However, for small angles, this nonlinearity can be ignored. A more detailed evaluation of light ray propagation can be found in [8,11].
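For orientation, one form of these two relations that is consistent with the definitions above is sketched here. It assumes only that a ray travels in a straight line between the two plane positions, so that its direction is unchanged and its horizontal position shifts in proportion to *d*; the argument convention (expressing $L_2$ through $L_1$) is an assumption, and the notation used in [8,11] may differ.

$$L_2\left(x, s\right) = L_1\left(x - d\,s,\; s\right)$$

$$L_2\left(x, \varphi\right) = L_1\left(x - d\tan\varphi,\; \varphi\right)$$

The first relation is a linear shear of the (*x*, *s*) plane, whereas the second contains tan *φ* and is therefore linear only in the small-angle approximation tan *φ* ≈ *φ*.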

The continuous LF function has to be sampled in a way that allows its reconstruction from samples. The plenoptic sampling theory, which considers the LF as a multidimensional bandlimited function, has been developed in [12]. In general, it states that the LF frequency support depends on the minimum and maximum depth of the visual scene, and that sampling along *x* and *s* creates the usual replication of the baseband, which should be taken into account when designing the end-to-end LF camera-to-display system. While the sampling physically occurs at the LF acquisition (sensing) stage, it is the LF display that recreates the LF originating from a visual 3D scene. In the sampling theory formalism, an LF display can be considered as a discrete-to-continuous (D/C) converter that converts a sampled LF into its continuous version, thereby achieving a continuous visualization of a 3D scene, with continuous parallax being part of it. Consequently, we can consider an LF display as a multidimensional sampling system and, as such, apply multidimensional sampling theory when analyzing LF displays.

#### 2.2. Light field display as sampling-reconstruction system

In our general model, we consider the LF display to be composed of a set of ray generators and a continuous LF reconstruction optical module. The ray generators act as discrete sources of light rays, and the module is the D/C converter that converts the set of samples (rays) into the continuous representation observed by a viewer. While different display settings can fall into this general model, we specifically concentrate on an LF display consisting of a set of projection engines and a special screen, dubbed the holographic screen, as illustrated in Fig. 2(a). Each light ray generated by a ray generator hits this screen from a different angle at a different position, and the screen converts (diffuses) each light ray into an angular beam around the main direction of the ray. The span of the beam after diffusion is anisotropic, with a narrow horizontal angle $\delta_x$ and a wide vertical angle $\delta_y$, as illustrated in Fig. 2(b) [3]. The screen does not have an explicit pixel structure. A finite area on it emits different light rays in different directions. The properties of such a screen are described in more detail in [3]. In this paper we will assume that the screen is a perfect D/C converter. In practice it introduces some low-frequency selectivity that additionally smooths the reconstructed LF. However, this can be ignored for the purposes of our work.

From the observer viewpoint, a point (object) in space is reconstructed by the interaction of rays originating from different sources (i.e. coming from different directions). This is illustrated in Fig. 2(a) for two observers and several points in space. Each ray can be traced from its origin (ray generator) to the screen surface, and it is uniquely described by its starting position and angle, or by its starting position and the place where it hits the screen surface. This is reminiscent of the two-plane LF parameterization discussed in the previous section.

The overall throughput of the display is directly related to the number of light rays the display can generate. A denser set of rays produces finer spatial and angular details. Technology limitations prevent us from achieving the resolution power of the HVS [6,7]. Therefore, it is important to take these limits into account when building the display and/or processing the visual data to be represented on it. Frequency domain analysis of the sampled and reconstructed light field is the proper tool for doing this.

#### 2.3. Spatial and frequency domain analysis of light field displays

A typical LF display under consideration is illustrated in Fig. 3. It consists of $N_p$ projection engines uniformly distributed on the ray generators (RG) plane (*p*-plane) over distance $d_p$, thereby making the distance between engines $x_p = d_p/(N_p - 1)$. Each projection engine generates $N_x$ rays over its field of view $FOV_p$. We assume that the rays hit a certain plane (screen plane, *s*-plane) parallel to the RG plane at equidistant points. As a consequence, the angular distribution of rays inside the FOV is not uniform. Nevertheless, for small angles we can assume that it is uniform and approximate the angular resolution at the RG plane as $\alpha_p = FOV_p/N_x$.

The ‘trajectory’ of a ray can be uniquely defined by its origin ${x}_{0}^{\left(r\right)}$ at the RG plane and its direction determined by angle ${\phi}^{\left(r\right)}$. The position of the ray at a distance *z* from the display is given by Eq. (3), which follows the (*x*, *φ*) LF parameterization; see Eq. (2).
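The geometry described above can be illustrated with a short sketch. It assumes straight-line ray propagation, so that a ray starting at $x_0^{(r)}$ with direction $\phi^{(r)}$ is found at $x_0^{(r)} + z\tan\phi^{(r)}$ at distance *z*; the function names and numerical values are hypothetical and serve only as an illustration.

```python
import math


def ray_generator_spacing(d_p, n_p):
    """Distance between adjacent projection engines: x_p = d_p / (N_p - 1)."""
    return d_p / (n_p - 1)


def angular_resolution(fov_p_deg, n_x):
    """Small-angle approximation of the angular step: alpha_p = FOV_p / N_x."""
    return fov_p_deg / n_x


def ray_position(x0, phi_deg, z):
    """Horizontal position of a ray at distance z from the RG plane.

    Assumes straight-line propagation: x_z = x0 + z * tan(phi).
    """
    return x0 + z * math.tan(math.radians(phi_deg))


# Hypothetical setup: 11 engines over 1000 mm, 400 rays over a 40 degree FOV.
x_p = ray_generator_spacing(d_p=1000.0, n_p=11)
alpha_p = angular_resolution(fov_p_deg=40.0, n_x=400)

# The same ray crosses candidate screen planes at different positions.
crossings = [ray_position(x0=0.0, phi_deg=10.0, z=z) for z in (0.0, 500.0, 1000.0)]
```

Evaluating `ray_position` for several values of *z* reproduces the behavior discussed next: the same ray lands on different parts of the screen depending on where the screen is placed.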

The screen of the display is where rays recombine to reconstruct the desired continuous LF function to be observed by a viewer. In Fig. 3, several positions for the screen are illustrated with thick black lines. As seen in the figure, the ray (*r*) crosses those ‘screens’ at different horizontal positions (${x}_{{z}_{p2}}^{\left(r\right)},{x}_{{z}_{p3}}^{\left(r\right)},{x}_{{z}_{p4}}^{\left(r\right)}$) and, due to the finite width of the screen $d_s$, it does not even contribute to the screen at distance $z_{p1}$. In practice this means that the ray would contribute to a different part of the screen depending on the screen position. Moreover, at different screen positions, it intersects with different rays originating from different ray generators; that is, depending on the screen position, a different combination of rays will be responsible for forming a multiview pixel at that position. As a consequence, the uniform distribution of rays we had at the RG plane is lost.

Rays are indexed (parameterized) by their spatial position and direction (*x*, *φ*) and thus represented as samples in the corresponding ray space. This parameterization has been selected among several possibilities (e.g. *φ* vs. *x*, tan *φ* vs. *x*, *z* tan *φ* vs. *x*) since both ray-space axes can be expressed in measurable (quantifiable) units (position in mm and angle in degrees) that are easy for a user to understand. Consequently, at the screen plane, the display can be quantified by its spatial resolution (e.g. number of pixels per mm or per screen size) and its angular resolution (e.g. number of rays per degree or FOV of the display, $FOV_{disp}$).

For the purposes of frequency analysis, each ray is considered as a sample positioned in the 2D ray-space plane for fixed *z* (in the case of full parallax, this turns into a 4D space). Since the position of the ray changes along *z*, as given by Eq. (3), for a set of ray generators, different sampling patterns are obtained at different distances from the screen. This is illustrated by means of an example in Fig. 4 (see also Fig. 3). The figures in the top row, for $z = 0$, $z_{p2}$, $z_{p4}$, show how the whole LF that the display is capable of generating is sheared along the *x*-axis as the screen plane moves away from the RG plane. For better visualization, one set of rays is marked in blue. The figures in the bottom row show zoomed-in versions of the LF at different distances from the RG plane. One can observe that for every distance, the sampling pattern is regular although not rectangular. The fact that the sampling patterns are regular enables us to utilize multidimensional sampling theory [13,14].

Any regular 2D pattern can be uniquely described through the notion of a sampling lattice $\Lambda$. The elements of the lattice are calculated as integer linear combinations of two linearly independent vectors $v_1$ and $v_2$, written as column vectors with $^T$ being the transpose operator. The vectors building the lattice can be expressed in matrix form as $V = \left[v_1\; v_2\right]$, with $V$ being referred to as the sampling matrix. It is important to point out that the sampling matrix is not unique for a given sampling pattern, since $\Lambda\left(V\right)=\Lambda\left(EV\right)$, where $E$ is any integer matrix with $\left|\det E\right|=1$. Consequently, there are multiple sets of basis vectors describing the same lattice. In practice the set of basis vectors with minimum length (norm) is preferred. Therefore, given a set of basis vectors (${v}_{1},{v}_{2}$), one should find a pair of vectors (${\tilde{v}}_{1},{\tilde{v}}_{2}$) such that $\Lambda\left(V\right)=\Lambda\left(\tilde{V}\right)$. Here, the tilde denotes the sampling matrix with minimized basis vectors (the length $\Vert {v}_{1}\Vert +\Vert {v}_{2}\Vert $ is minimized); see Fig. 5(a). The problem of finding such vectors is known in the literature as the lattice basis reduction problem [15]. The solution applicable to our 2D case can be obtained by applying Lagrange's algorithm to the pair of basis vectors ($v_1$, $v_2$).
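A minimal sketch of such a reduction is given below. It follows the textbook Lagrange (Gauss) procedure for 2D lattices and is not necessarily identical, step by step, to the formulation used in this paper; the function name and the tuple representation of the vectors are assumptions made for illustration.

```python
def lagrange_reduce(v1, v2):
    """Lagrange (Gauss) reduction of a 2D lattice basis.

    Takes two linearly independent vectors and returns a basis of the same
    lattice with |v1| <= |v2| and |v1 . v2| <= |v1|^2 / 2.
    """
    def dot(a, b):
        return a[0] * b[0] + a[1] * b[1]

    v1, v2 = tuple(v1), tuple(v2)
    # Keep the shorter vector in v1.
    if dot(v1, v1) > dot(v2, v2):
        v1, v2 = v2, v1
    while True:
        # Remove from v2 the integer multiple of v1 that shortens it the most.
        mu = round(dot(v1, v2) / dot(v1, v1))
        v2 = (v2[0] - mu * v1[0], v2[1] - mu * v1[1])
        if dot(v2, v2) >= dot(v1, v1):
            return v1, v2
        # v2 became the shorter vector: swap and repeat.
        v1, v2 = v2, v1
```

For instance, the strongly sheared basis `(1, 0), (5, 1)` spans the same lattice as the unit square, and the reduction recovers `(1, 0), (0, 1)`. The result is unique only up to the sign and order of the vectors.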

For a regular grid described with a lattice $\text{\Lambda}$, one can also define a unit cell *P* that is a set in ${\mathbb{R}}^{2}$ such that the union of all cells centered on each lattice point covers the whole sampling space without overlapping or leaving empty space. Similar to the basis vectors, the unit cell is not unique, as illustrated in Fig. 6. The figure illustrates three possibilities out of an infinite set of valid unit cells describing the same lattice. The shapes become even stranger if the underlying sampling pattern is not rectangular.

In this paper we use the Voronoi cell as the unit cell representing a given sampling pattern [16]. As illustrated in Fig. 5(b), the Voronoi cell, denoted by *P* (green shaded area in the figure), is a set in ${\mathbb{R}}^{2}$ such that all elements of the set are closer, based on Euclidean distance, to the one lattice point that is inside the cell than to any other lattice point. This makes it the most compact unit cell. In the literature, Voronoi cells are also known as Wigner-Seitz cells, e.g. in solid-state physics [17]. By using the minimum-length basis vectors, the construction of the Voronoi cell is straightforward and is illustrated in Fig. 5(b) (in Fig. 6 the Voronoi cell is the one shown by the leftmost example).
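As an illustration of this definition, the following sketch tests whether a point of ray space belongs to the Voronoi cell around the origin. It assumes that the basis passed in is already a minimum-length (reduced) basis, in which case only the eight nearest lattice neighbors need to be checked; the function name is hypothetical.

```python
def in_voronoi_cell(point, v1, v2):
    """True if `point` is at least as close to the origin as to any other lattice point.

    Assumes (v1, v2) is a reduced basis, so that checking the neighbours
    n1*v1 + n2*v2 with n1, n2 in {-1, 0, 1} is sufficient.
    """
    px, py = point
    d0 = px * px + py * py  # squared distance to the lattice point at the origin
    for n1 in (-1, 0, 1):
        for n2 in (-1, 0, 1):
            if n1 == 0 and n2 == 0:
                continue
            qx = n1 * v1[0] + n2 * v2[0]
            qy = n1 * v1[1] + n2 * v2[1]
            if (px - qx) ** 2 + (py - qy) ** 2 < d0:
                return False
    return True
```

For a rectangular lattice with basis `(1, 0), (0, 2)` the cell is the rectangle of width 1 and height 2 centered on the origin, so `(0.4, 0.9)` is inside while `(0.6, 0.0)` belongs to the neighboring cell.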

The samples in ray space forming regular sampling patterns at different depths *z* represent a bandlimited function. In the frequency domain, it has a periodic structure with multiple replicas of the baseband. The periodicity, and at the same time the baseband frequency support, is defined through the reciprocal lattice ${\text{\Lambda}}^{*}$, which can be evaluated as given in Eq. (7) [8,14].
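The reciprocal lattice can be computed directly from the sampling matrix. The sketch below uses the common convention in which the reciprocal basis is given by the columns of the inverse transpose of $V$, with frequencies expressed in cycles per unit and no factor of 2π; this convention and the function name are assumptions, and the normalization used in [8,14] may differ.

```python
def reciprocal_basis(v1, v2):
    """Basis of the reciprocal lattice for the sampling matrix V = [v1 v2].

    Returns the columns u1, u2 of inverse(transpose(V)), which satisfy
    u_i . v_j = 1 if i == j and 0 otherwise.
    """
    det = v1[0] * v2[1] - v2[0] * v1[1]
    if det == 0:
        raise ValueError("basis vectors must be linearly independent")
    u1 = (v2[1] / det, -v2[0] / det)
    u2 = (-v1[1] / det, v1[0] / det)
    return u1, u2


# Hypothetical rectangular pattern: 2 mm spatial step, 4 degree angular step.
u1, u2 = reciprocal_basis((2.0, 0.0), (0.0, 4.0))
```

In this example the replicas of the baseband are spaced 0.5 cycles/mm apart along the spatial frequency axis and 0.25 cycles/degree apart along the angular frequency axis, so a coarser sampling step directly narrows the available baseband.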

There are many possible unit cells for a given lattice ${\text{\Lambda}}^{*}$. Each possible unit cell describes a set of bandlimited functions that can be represented by the sampling pattern and can be reconstructed from a given discrete representation, assuming that the reconstruction filter has the shape of the selected unit cell. Furthermore, this also means that an arbitrary continuous function has to be pre-filtered with a filter that removes all frequency content outside of the selected unit cell in order to prevent aliasing errors during sampling. This can be achieved either by using (if possible) a proper continuous-domain filter before sampling the function, or by first oversampling the continuous function and then performing filtering and downsampling in the discrete domain. It should be pointed out that in the case under consideration, it might not be possible to perform pre-filtering in the continuous domain, since this would require an optical filter in the spatial and angular directions. Therefore, in this paper we assume that we oversample the continuous function at the sampling stage and perform all filtering in the discrete domain. If the scene is captured by sparse cameras, the dense (oversampled) LF can be reconstructed by compressive sensing approaches, e.g. [18].

The most compact (isotropic) unit cell for a given sampling pattern is, as in the spatial domain, a Voronoi cell, denoted in this paper as $P^*$. The importance of this unit cell is twofold. First, it represents a frequency support that treats both directions equally (the spatial and angular directions in the ray space representation), which is beneficial from the HVS viewpoint. Second, the screen in the display that performs the D/C conversion has, for practical reasons, a ‘low-pass’ type characteristic (typically it is rectangular with Gaussian type weights [3]) that has to be matched to the available ray distribution or vice versa. As such, the Voronoi cell is the most convenient unit cell to match the screen reconstruction filter.

The Voronoi cell of a sampling pattern can be considered equivalently in the spatial or frequency domain. Given its isotropic behavior, it is precisely the quantity that characterizes the properties of the reconstructed bandlimited function. Therefore, the estimation of the optimal display and camera setup can be done by comparing Voronoi cells formed on the screen plane. On one side, there is the sampling pattern of the rays generated by the display; on the other side, there is the sampling pattern of the rays as captured by the cameras. Both sampling patterns and the respective bandlimited LFs are compared for similarity through their Voronoi cells in the ray-space domain at the screen plane. This makes the overall optimization procedure computationally less demanding and thus faster. The frequency bandwidth of the system can be easily estimated once the optimal configuration is determined.

## 3. Light field display–camera configuration optimization

In an ideal case, one would require that a display perfectly reconstructs the underlying plenoptic function or at least up to the level of detail supported by the HVS. With limited resources, one can target the best possible continuous LF approximation out of a given discrete set of rays. In such a case, it is important to determine the optimal display and camera setup that maximizes the visual capabilities of the display.

We tackle the problem in two steps. First, we evaluate the optimal setup of ray generators for a given or desired density of rays at the screen plane. Second, we estimate the bandwidth for such system from the perspective of the scene, that is, what kind of capture setup and pre-processing is required to sense enough data for the given display setup. It should be pointed out that step two can be applied to an arbitrary display setup, as long as the basic setup parameters (ray generators, distances, screen properties, etc.) are available. The complete display-camera setup considered in this paper, with all adjustable parameters, is illustrated in Fig. 7 and is discussed in more detail in the following two sections. To streamline the text in the rest of this paper, we use the notations for various sampling patterns emerging from the display setup as in Fig. 7. Subscripts *p*, *s*, and *c* are used to denote the parameters related to the RG, screen, and camera/viewer plane, respectively, with *z* increasing in the direction of the observer and *z* = 0 being relative to the parameter’s origin, e.g. for parameters originating on the screen plane, *z* = 0 is on the screen plane. Practical angles are denoted by $\alpha $ in contrast to ‘theoretical’ angles *φ* used in the LF parameterization. The estimated and optimized parameters are denoted by hat and bar, respectively, e.g. ${\widehat{\alpha}}_{p}$ and ${\overline{x}}_{p}$. Finally, tilde is used to denote parameters after the lattice basis reduction operation.

The proposed optimization technique can be extended to display-camera configurations other than the one shown in Fig. 7, as long as those configurations result in regular sampling patterns in the angular-spatial domain and, as such, can be described by sampling matrices, as illustrated next for the cases under consideration.

#### 3.1. Light field display configuration optimization

For the purpose of stating the problem under consideration, we start from the center of Fig. 7, namely, the screen plane. We require that the display should be able to reproduce an LF with a desired bandwidth or, equivalently, an LF with a given density at the screen plane, the density being defined by the spatial and angular resolution. This determines the values ($x_s$, $\alpha_s$) in the ray space representation and, in turn, the desired sampling pattern at the screen plane. We assume that the pattern is rectangular. This is a realistic assumption due to the properties of the screen and the requirement that both directions (spatial and angular) should be treated in a similar manner. The sampling pattern is uniquely defined through the following sampling matrix:

$$V\left({x}_{s},{\alpha}_{s}\right)=\left[\begin{array}{cc}{x}_{s}& 0\\ 0& {\alpha}_{s}\end{array}\right]. \qquad (8)$$

The task is to find the ray generator parameters ($x_p$, $\alpha_p$) and the distance between the RG plane and the screen plane $z_p$ for which the sampling pattern mapped to the screen plane will match the desired one, that is, the grids described by the sampling lattices $\Lambda\left(V\left({x}_{p},{\alpha}_{p},{z}_{p}\right)\right)$ and $\Lambda\left(V\left({x}_{s},{\alpha}_{s}\right)\right)$ should match. With reference to Eq. (7), this will ensure the same Fourier domain bandwidth of the desired LF. Mismatches between the lattices $\Lambda\left(V\left({x}_{p},{\alpha}_{p},{z}_{p}\right)\right)$ and $\Lambda\left(V\left({x}_{s},{\alpha}_{s}\right)\right)$ will manifest themselves either as aliasing effects in the reconstructed continuous LF or as inefficient utilization of the display bandwidth. The targeted lattice matching is done by the optimization technique presented below, aimed at mitigating these two problems.

The ray generators' sampling matrix mapped to the screen plane, $V\left({x}_{p},{\alpha}_{p},{z}_{p}\right)$, is defined in Eq. (9). The optimization problem is then formulated as follows: for given ($x_s$, $\alpha_s$), find ($x_p$, $\alpha_p$, $z_p$) that minimizes the measure $\delta_p$ of Eq. (10), with $V\left({x}_{s},{\alpha}_{s}\right)$ being the desired sampling matrix at the screen plane and $\tilde{V}\left({x}_{p},{\alpha}_{p},{z}_{p}\right)$ being the lattice basis reduced sampling matrix of the ray generators $V\left({x}_{p},{\alpha}_{p}\right)$ mapped to the screen plane. It should be pointed out that $V\left({x}_{s},{\alpha}_{s}\right)=\tilde{V}\left({x}_{s},{\alpha}_{s}\right)$. Furthermore, when implementing Eq. (10), it should be kept in mind that the reduced matrix is unique only up to the sign and order of the basis vectors, that is, $\Lambda\left(V\left({v}_{1},{v}_{2}\right)\right)\equiv \Lambda\left(V\left(\pm {v}_{1},\pm {v}_{2}\right)\right)\equiv \Lambda\left(V\left(\pm {v}_{2},\pm {v}_{1}\right)\right)$.
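The matching of the two lattices can be sketched in code as follows. Two ingredients of this sketch are assumptions rather than the paper's exact definitions: the sheared sampling matrix is taken as the RG-plane basis with a horizontal offset $z_p\tan\alpha_p$ added to the angular basis vector (the linearized shear), and the mismatch is measured as a Frobenius distance normalized by $x_s$ and $\alpha_s$. All function names are hypothetical. The sign and order ambiguity of the reduced basis is handled by trying all combinations.

```python
import math


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]


def lagrange_reduce(v1, v2):
    """Return a minimum-length basis of the lattice spanned by v1 and v2."""
    if dot(v1, v1) > dot(v2, v2):
        v1, v2 = v2, v1
    while True:
        mu = round(dot(v1, v2) / dot(v1, v1))
        v2 = (v2[0] - mu * v1[0], v2[1] - mu * v1[1])
        if dot(v2, v2) >= dot(v1, v1):
            return v1, v2
        v1, v2 = v2, v1


def sheared_basis(x_p, alpha_p, z_p):
    """ASSUMED form of the RG sampling matrix mapped to the screen plane.

    x_p and z_p are in mm, alpha_p in degrees.
    """
    return (x_p, 0.0), (z_p * math.tan(math.radians(alpha_p)), alpha_p)


def lattice_mismatch(x_s, alpha_s, x_p, alpha_p, z_p):
    """ASSUMED mismatch measure between desired and obtained screen lattices.

    Normalized Frobenius distance between diag(x_s, alpha_s) and the reduced
    sheared basis, minimized over sign and order of the reduced vectors.
    """
    r1, r2 = lagrange_reduce(*sheared_basis(x_p, alpha_p, z_p))
    best = math.inf
    for a, b in ((r1, r2), (r2, r1)):
        for sa in (1, -1):
            for sb in (1, -1):
                dx1 = (sa * a[0] - x_s) / x_s
                da1 = (sa * a[1]) / alpha_s
                dx2 = (sb * b[0]) / x_s
                da2 = (sb * b[1] - alpha_s) / alpha_s
                err = math.sqrt(dx1 ** 2 + da1 ** 2 + dx2 ** 2 + da2 ** 2)
                best = min(best, err)
    return best


# Hypothetical example: desired 1 mm x 1 degree grid, engines 10 mm apart
# with 0.1 degree angular step, screen placed so that z_p * tan(alpha_p) = 1 mm.
z_p = 1.0 / math.tan(math.radians(0.1))
delta = lattice_mismatch(1.0, 1.0, 10.0, 0.1, z_p)
```

In this example the reduced sheared basis is approximately `(1, 0.1)` and `(0, 1)`: the angular basis vector is matched exactly, while the spatial one retains a residual angular offset of one RG angular step, which is why the mismatch cannot reach zero.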

The lattice basis reduced sampling matrix of the ray generators mapped to the screen plane can be expressed as in Eq. (11), where $\Delta x_k$ and $\Delta\alpha_k$ (for $k = 1,2$) denote the deviations of its basis vectors from the desired ones. Matching $V\left({x}_{s},{\alpha}_{s}\right)$ and $\tilde{V}\left({x}_{p},{\alpha}_{p},{z}_{p}\right)$ then corresponds to minimizing $\Delta x_k$ and $\Delta\alpha_k$, which depend on the unknowns ($x_p$, $\alpha_p$, $z_p$) and in the ideal case should be zero (in practice they can never be zero, but one can attempt to make them small enough). The minimization of the measure $\delta_p$ in Eq. (12) is illustrated in Fig. 8(a). The lattice basis reduction procedure $V\left({x}_{p},{\alpha}_{p},{z}_{p}\right)\to \tilde{V}\left({x}_{p},{\alpha}_{p},{z}_{p}\right)$ is an iterative procedure with no analytical solution, and there is no analytical relation between $\Delta x_k$ and $\Delta\alpha_k$ (for $k = 1,2$) and the unknowns ($x_p$, $\alpha_p$, $z_p$).

The above problem can be tackled by fixing one of the parameters ($x_p$, $\alpha_p$, $z_p$) and finding a solution for the other two that achieves the smallest $\delta_p$. Unfortunately, the optimization problem is not convex and has multiple local minima, which complicates finding the global minimum. However, since there are only three unknowns, a good practical approach is to do a grid search over a reasonable range of the unknown variables. This is a time-consuming yet reliable way to obtain the global minimum. We will illustrate this by an example in Section 4.
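A grid search of this kind can be sketched as follows, with $\alpha_p$ fixed and ($x_p$, $z_p$) scanned exhaustively. The sheared sampling matrix (linearized horizontal shear by $z_p\tan\alpha_p$) and the normalized Frobenius mismatch used here are assumptions standing in for Eqs. (9) and (12), and all names and grid values are hypothetical.

```python
import math


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1]


def reduce_basis(v1, v2):
    """Lagrange reduction: minimum-length basis of the lattice spanned by v1, v2."""
    if _dot(v1, v1) > _dot(v2, v2):
        v1, v2 = v2, v1
    while True:
        mu = round(_dot(v1, v2) / _dot(v1, v1))
        v2 = (v2[0] - mu * v1[0], v2[1] - mu * v1[1])
        if _dot(v2, v2) >= _dot(v1, v1):
            return v1, v2
        v1, v2 = v2, v1


def mismatch(x_s, alpha_s, x_p, alpha_p, z_p):
    """ASSUMED measure: normalized distance between desired and reduced sheared basis."""
    shear = z_p * math.tan(math.radians(alpha_p))  # assumed linearized shear
    r1, r2 = reduce_basis((x_p, 0.0), (shear, alpha_p))
    best = math.inf
    # The reduced basis is unique only up to sign and order, so try all variants.
    for a, b in ((r1, r2), (r2, r1)):
        for sa in (1, -1):
            for sb in (1, -1):
                err = math.sqrt(
                    ((sa * a[0] - x_s) / x_s) ** 2
                    + (sa * a[1] / alpha_s) ** 2
                    + (sb * b[0] / x_s) ** 2
                    + ((sb * b[1] - alpha_s) / alpha_s) ** 2
                )
                best = min(best, err)
    return best


def grid_search(x_s, alpha_s, alpha_p, x_p_grid, z_p_grid):
    """Exhaustive search over (x_p, z_p) for fixed alpha_p.

    Returns the tuple (delta, x_p, z_p) with the smallest mismatch.
    """
    return min(
        (mismatch(x_s, alpha_s, x_p, alpha_p, z_p), x_p, z_p)
        for x_p in x_p_grid
        for z_p in z_p_grid
    )


# Hypothetical search around a candidate setup for a 1 mm x 1 degree target.
z0 = 1.0 / math.tan(math.radians(0.1))
result = grid_search(
    1.0, 1.0, 0.1,
    x_p_grid=[9.0, 9.5, 10.0, 10.5, 11.0],
    z_p_grid=[0.9 * z0, z0, 1.1 * z0],
)
```

Because every candidate is evaluated independently, the search is insensitive to the local minima mentioned above; its cost grows with the product of the grid sizes, which is why narrowing the search range with good initial estimates is worthwhile.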

Practical limitations of a projection-based light field display are related to the physical size and resolution of the ray generators, the number of generated rays, and other screen properties; see [3] for more details. These limitations translate to a finite number of ray sources with high angular and lower spatial density at the RG plane, i.e. small $\alpha_p$ and larger $x_p$. The desired spatial resolution at the screen plane is higher, which can be achieved by reducing the angular resolution. This leads to the practical limitations that $x_p$ is larger than $x_s$ and $\alpha_p$ is smaller than $\alpha_s$.

The sampling grid at the RG plane is described by $V\left({x}_{p},{\alpha}_{p}\right)$. After shearing that grid by distance $z_p$, it turns into the sampling grid at the screen plane described by the sampling matrix $V\left({x}_{p},{\alpha}_{p},{z}_{p}\right)$ as given in Eq. (9). The question is: which sampling points in the original grid contribute to the basis vectors after shearing and lattice basis reduction? The approach for finding a good candidate can be graphically visualized as shown in Fig. 9. The original pattern corresponding to $V\left({x}_{p},{\alpha}_{p}\right)$ in Fig. 9(a) is sheared to position $z = z_p$ in Fig. 9(b). The condition under which the pattern $V\left({x}_{s},{\alpha}_{s}\right)$ is best approximated (see also Fig. 8(b) for illustration) yields the estimates of $x_p$ and $z_p$ given by Eqs. (15) and (16), with $\alpha_p$ selected according to Eq. (17). The reason for this selection lies in the fact that the sampling grid on the screen plane is a sheared version of the sampling grid at the RG plane, with shearing performed only in the horizontal direction according to Eq. (3). Under these circumstances, the selection of $\alpha_p$ according to Eq. (17) will ensure that there exists a point in the sheared grid that approximately matches the desired sampling vector ${\left[\begin{array}{cc}0& {\alpha}_{s}\end{array}\right]}^{T}$, thereby minimizing $\Delta {\alpha}_{2}$ and $\Delta {x}_{2}$. This is illustrated in Fig. 9(b).

The estimated parameters, as illustrated in Section 4 by means of examples, will be very close to the optimal ones, e.g. the optimal value of ${\tilde{x}}_{p}$ will be in the range ${\widehat{x}}_{p}\pm {x}_{s}/2$. Based on this, we can formulate the optimization technique as follows:

- 1. Select a value ${\widehat{\alpha}}_{p}$ according to available hardware resources and Eq. (17).
- 2. Use estimation formulas given by Eqs. (15) and (16) to get ${\widehat{x}}_{p}$ and ${\widehat{z}}_{p}$.
- 3. Refine the result by applying iterative search / general purpose optimization in range ${\widehat{x}}_{p}\pm {x}_{s}/2$ thereby obtaining an optimal set of parameters $({\overline{x}}_{p},{\overline{\alpha}}_{p},{\overline{z}}_{p})$.

The evaluated sampling density in ray space $({\overline{x}}_{p},{\overline{\alpha}}_{p})$ determines the spatial and angular resolution of the display. This technique will be illustrated by means of examples in Section 4.

#### 3.2. Camera setup optimization

The camera setup should provide the rays required by the display for proper recreation of the LF of the scene. While the display is a band-limited device, the 3D visual scene is not (except for simple scenes with low-frequency spatial content, limited depth, and no occlusions). This means that when discussing the optimization of the camera setup, we have two problems to consider. The first is how to estimate the optimal camera setup in terms of the minimal amount of data that will provide the information needed for rendering all rays generated by the display. The second is how to ensure an alias-free capture of the scene to be recreated by the display.

Both problems are directly related to the display parameters and the corresponding display bandwidth they determine. The ultimate goal is to match that bandwidth with an optimal camera setup that allows rendering all rays needed by the display in an anti-aliased manner within the display passband. The solution goes through matching the sampling patterns of the display and cameras at the screen plane. With reference to Fig. 7, the optimization problem is formulated as follows: for a given display sampling pattern described by ($x_p$, $\alpha_p$), find ($x_c$, $\alpha_c$, $z_c$) that minimizes the mismatch between the display and camera sampling lattices at the screen plane. In the corresponding measure, the argument $-z_c$ indicates that the camera sampling matrix $V\left({x}_{c},{\alpha}_{c}\right)$ is mapped to the screen plane, with the minus being there due to the orientation of the *z* axis. Two comments regarding the above optimization criterion are in order. First, we are doing the matching on the screen plane since this is the place where the D/C conversion takes place, so the sampling criteria must be satisfied at that plane. Second, in order to speed up the optimization, instead of $\tilde{V}({x}_{p},{\alpha}_{p},{z}_{p})$ we could also use $V\left({x}_{s},{\alpha}_{s}\right)$, assuming that the display sampling grid at the screen plane approximates the desired one well enough. This is perfectly fine in practice since it is expected that practical limitations (e.g. anti-aliasing filter, screen's D/C conversion) will affect the overall visual performance much more than a mismatch between the desired and obtained display properties.

By applying an iterative optimization as described in the previous section, we can determine the optimal camera setup in terms of camera parameters $({\overline{x}}_{c},{\overline{\alpha}}_{c},{\overline{z}}_{c})$. In comparison to display optimization, there are additional restrictions that have to be taken into account, e.g. a reasonable distance of a viewer from the screen, practical camera resolutions, $FO{V}_{c}\ge FO{V}_{p}$, and a camera-to-camera distance that cannot be too small.
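Under the assumption that the mapping between planes is the same horizontal shear used for the ray generators (in its linearized form), the camera sampling matrix mapped back to the screen plane could take the form sketched below. This is an illustrative reconstruction consistent with the sign convention described above, not necessarily the exact expression used in this paper.

$$V\left(x_c,\alpha_c,-z_c\right)=\left[\begin{array}{cc}x_c & -z_c\tan\alpha_c\\ 0 & \alpha_c\end{array}\right]$$

Its lattice basis reduced version $\tilde{V}\left(x_c,\alpha_c,-z_c\right)$ is then the quantity compared with $\tilde{V}\left(x_p,\alpha_p,z_p\right)$, or with $V\left(x_s,\alpha_s\right)$ when the faster variant of the matching is used.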

After determining the minimal camera sampling pattern $\Lambda\left(V\left(\overline{x}_c, \overline{\alpha}_c\right)\right)$ for a given $\overline{z}_c$, we map the optimized display unit cell in the frequency domain at the screen plane, $P^{*}\left(V\left(\overline{x}_p, \overline{\alpha}_p, \overline{z}_p\right)\right)$, to the camera plane, where it turns into $S_{\overline{z}_c}\left(P^{*}\left(V\left(\overline{x}_p, \overline{\alpha}_p, \overline{z}_p\right)\right)\right)$, with $S_z\left(P^{*}\right)$ being the mapping (shearing) operator. Since $P^{*}$ is a convex set with points being the vertices of the unit cell in frequency domain that can be defined as

1. Capture the scene with a sampling rate (a large number of cameras) that ensures alias-free capture. The required rate depends on the scene. However, the smallest bandwidth that has to be captured is marked by $P^{*}\left(V\left(x_c^{BIG}, \alpha_c^{BIG}\right)\right)$.
2. Filter the captured content with a filter having the passband determined by $S_{\overline{z}_c}\left(P^{*}\left(V\left(\overline{x}_p, \overline{\alpha}_p, \overline{z}_p\right)\right)\right)$.
3. Down-sample the filtered signal to $P^{*}\left(V\left(\overline{x}_c, \overline{\alpha}_c\right)\right)$.

This sensing procedure results in a properly pre-processed, minimal amount of data that at the same time maximally utilizes the visualization capabilities of the display.
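As a rough illustration of steps 2 and 3 of the procedure above, the following sketch pre-filters and downsamples a one-dimensional stand-in signal. It is not the implementation used in this paper: the actual passband is the sheared two-dimensional cell in the angular-spatial frequency domain, whereas here a plain Hamming-windowed sinc low-pass filter is assumed. The function names, the tap count, and the test signal are hypothetical choices made only for this example.

```python
import math

def lowpass_taps(cutoff, num_taps=63):
    """Hamming-windowed sinc low-pass FIR; cutoff in cycles per dense sample."""
    mid = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        t = n - mid
        if t == 0:
            ideal = 2.0 * cutoff
        else:
            ideal = math.sin(2.0 * math.pi * cutoff * t) / (math.pi * t)
        window = 0.54 - 0.46 * math.cos(2.0 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    gain = sum(taps)  # normalise to unit gain at zero frequency
    return [h / gain for h in taps]

def prefilter_and_downsample(dense, factor, num_taps=63):
    """Low-pass to the Nyquist band of the coarse grid, then keep every factor-th sample."""
    taps = lowpass_taps(0.5 / factor, num_taps)
    half = num_taps // 2
    last = len(dense) - 1
    out = []
    for i in range(0, len(dense), factor):
        acc = 0.0
        for k, h in enumerate(taps):
            j = min(max(i + k - half, 0), last)  # clamp indices at the borders
            acc += h * dense[j]
        out.append(acc)
    return out

# Step 1 stand-in: a densely "captured" signal with in-band and out-of-band content.
N, FACTOR = 512, 4
in_band = [math.sin(2.0 * math.pi * i / 64.0) for i in range(N)]
dense = [s + 0.5 * math.sin(2.0 * math.pi * 0.4 * i) for i, s in enumerate(in_band)]

coarse = prefilter_and_downsample(dense, FACTOR)  # steps 2 and 3
naive = dense[::FACTOR]                           # no pre-filter: out-of-band content aliases
```

Away from the borders, `coarse` follows the in-band component closely, while `naive` carries the folded high-frequency component. This is the reason why the filtering step has to precede the downsampling step.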

## 4. Examples

We first illustrate the proposed optimization procedure on a ‘realistic’ display of reasonable quality that can be built today, then illustrate the optimization of the capture setup, and finally discuss the setup of a display matching the requirements of the HVS.

#### 4.1. Display optimization examples

First we illustrate the proposed display configuration optimization on a display having the desired spatial and angular resolution at the screen plane of $x_s$ = 1 mm and $\alpha_s$ = 1°. We fix the angular resolution of the ray generators at the RG plane to $\alpha_p$ = 0.0391° (this corresponds to 1024 pixels over a FOV of 40°). For the fixed $\alpha_p$, we evaluate the matching error $\delta_p$ on the screen plane for various values of $x_p$ $(10\,\text{mm} \le x_p \le 40\,\text{mm})$ and $z_p$ $(600\,\text{mm} \le z_p \le 1800\,\text{mm})$. The results of the optimization are shown in Fig. 11. In the figure, the left column shows the overall optimization range and the right column shows a zoomed-in range around the minimal value. The top row shows the result of the overall optimization, whereas the middle and bottom rows show the best solutions for a given $z_p$ and $x_p$. As can be seen, there is a dominant minimum at $(\overline{x}_p, \overline{z}_p) = (26.01\,\text{mm}, 1465.90\,\text{mm})$ with an error value of $\delta_p$ = 0.04249. The Voronoi unit cell $P\left(V\left(\overline{x}_p, \overline{\alpha}_p, \overline{z}_p\right)\right)$ of such an optimized display is shown in Fig. 12. As can be seen, the match with the desired $P\left(V\left(x_s, \alpha_s\right)\right)$ is almost perfect. The downside of such a grid-based search is the need to evaluate many combinations of ($x_p$, $z_p$) without knowing which one will result in the optimal solution. This can be sped up considerably by using the estimation approach described in Section 3.1. Following Eqs. (15) and (16), the estimated values for the problem under consideration are $(\widehat{x}_p, \widehat{z}_p) = (25.58\,\text{mm}, 1465.36\,\text{mm})$. As can be seen, they are very close to the ones obtained above by the grid search. By performing a single gradient-based optimization starting from the estimate, we end up with $(\overline{x}_p, \overline{z}_p) = (26.00\,\text{mm}, 1465.34\,\text{mm})$ with an error value of $\delta_p$ = 0.04249. This is almost identical to the result of the grid-based search and is obtained with little computation: a fraction of a second instead of the 10-15 min needed by the grid-based approach. Since both approaches give almost identical results, we conclude that the proposed estimation method is correct and useful.
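The numbers quoted in this example can be cross-checked with a few lines of arithmetic. The sketch below assumes that the angular step of a ray generator is the FOV divided uniformly by the pixel count, which is what the quoted value of 0.0391° suggests. The helper names are hypothetical, and the matching error itself is not evaluated here.

```python
def angular_step_deg(fov_deg, pixels):
    """Angular step when `pixels` rays are spread uniformly over `fov_deg` (assumption)."""
    return fov_deg / pixels

def relative_gap(estimate, reference):
    """Relative deviation of an estimate from a reference value."""
    return abs(estimate - reference) / abs(reference)

alpha_p = angular_step_deg(40.0, 1024)  # 0.0390625 deg, quoted as 0.0391 deg

# Values quoted in the text, in mm: (x_p, z_p).
grid = (26.01, 1465.90)      # full grid search
estimate = (25.58, 1465.36)  # closed-form estimate, Eqs. (15) and (16)
refined = (26.00, 1465.34)   # gradient-based refinement started from the estimate

gap_x = relative_gap(estimate[0], grid[0])  # about 1.7 % in x_p
gap_z = relative_gap(estimate[1], grid[1])  # about 0.04 % in z_p
```

The estimate therefore already lies within roughly 2 % of the grid-search optimum in $x_p$ and well below 0.1 % in $z_p$, which is consistent with a single refinement step being sufficient.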

By using the fast estimation method, we can easily calculate optimal display setups for various screen parameters. First, Fig. 13 shows display optimization results for $x_s$ = 1 mm and $\alpha_s$ = 1° for various values of $\alpha_p$. It is seen that a good approximation requires small values of $\alpha_p$. However, very small values of $\alpha_p$ require impractically large values of $z_p$ and $x_p$. Therefore, in practice, a compromise between those has to be made. For illustration, the unit cells of the optimal solutions for several values of $\alpha_p$ are shown in Fig. 14.

Next, we investigate the influence of different values of ($x_s$, $\alpha_s$) on ($z_p$, $x_p$) for fixed $\alpha_p$ = 0.0391°. As seen in Fig. 15, a similar reconstruction error $\delta_p$ can be achieved independently of the choice of $x_s$ and $\alpha_s$. Furthermore, the distance $z_p$ is influenced only by the desired $x_s$, and finally, $x_p$ has to be increased if either $x_s$ or $\alpha_s$ is increased. These figures give a good understanding of the relations between the involved parameters and can help in making proper design decisions.

#### 4.2. Camera optimization examples

For an optimized display as described in the previous section, the display bandwidth is uniquely defined by $P\left(V\left(\overline{x}_p, \overline{\alpha}_p, \overline{z}_p\right)\right)$, with a good approximation given by $P\left(V\left(x_s, \alpha_s\right)\right)$. Content captured by any means has to be pre-filtered to this bandwidth. The question here is what the optimal camera/viewer setup is, that is, which parameters ($x_c$, $\alpha_c$, $z_c$) support the display bandwidth in the best possible way. In comparison to display optimization, where it was logical to fix the parameter $\alpha_p$, here it is more convenient to fix the screen-to-viewer distance $z_c$, since the viewer distance is typically ‘fixed’ / ‘selected’ by user preferences or general recommendations for ‘TV’ watching. For a fixed distance $z_c$ = 2000 mm, the results of the optimization are shown in Fig. 16. Matching is again performed at the screen plane. There is a dominant minimum at $(\overline{x}_c, \overline{\alpha}_c) = (35.72\,\text{mm}, 0.0284^{\circ})$ with an error value of $\delta_c$ = 0.06832. For comparison, the optimized ray generator and camera unit cells are shown in Fig. 17.

The grid search can be made faster by a better initial estimate. This can be done by assuming that the unit cell at the screen distance is ideal, that is, it is defined by ($x_s$, $\alpha_s$). By following the approach presented in Section 3.2, we obtain $(\widehat{x}_c, \widehat{\alpha}_c) = (34.91\,\text{mm}, 0.0286^{\circ})$. This is very close to the aforementioned optimal solution. Due to the high nonlinearity of the error (see Fig. 16, middle row, right), one cannot use gradient-based optimization but can perform a grid search restricted to the vicinity of the estimated values. Since this drastically limits the search space, it can be performed much faster than the full grid search.
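A grid search restricted to the vicinity of an estimate can be sketched as follows. The function `toy_error` is a hypothetical, non-smooth stand-in whose minimum is simply placed at the optimum reported above; it is not the matching error $\delta_c$ of this paper, and the search spans and grid size are likewise assumptions made for illustration.

```python
import math

def local_grid_search(error, x_hat, a_hat, x_span, a_span, steps=41):
    """Exhaustively evaluate `error` on a steps-by-steps grid centred on the estimate."""
    best = (math.inf, x_hat, a_hat)
    for i in range(steps):
        x = x_hat - x_span + 2.0 * x_span * i / (steps - 1)
        for j in range(steps):
            a = a_hat - a_span + 2.0 * a_span * j / (steps - 1)
            e = error(x, a)
            if e < best[0]:
                best = (e, x, a)
    return best

def toy_error(x_c, alpha_c):
    """Hypothetical stand-in with kinks (NOT delta_c); minimum placed at (35.72 mm, 0.0284 deg)."""
    return abs(x_c - 35.72) / 35.72 + abs(alpha_c - 0.0284) / 0.0284

# Start from the estimate quoted in the text and search +-2 mm and +-0.001 deg around it.
err, x_best, a_best = local_grid_search(toy_error, 34.91, 0.0286, x_span=2.0, a_span=0.001)
```

With 41 steps per axis this evaluates 1681 candidates. The accuracy of the result is limited by the grid step (here 0.1 mm and 0.00005°), so in practice the spans and step count would be chosen from the expected accuracy of the estimate.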

The sampling pattern in the spatial domain can be converted to the frequency domain by using Eq. (7). By converting the frequency domain unit cell belonging to the optimized display pattern from the screen plane to the camera plane, we obtain the bandwidth of the display, shown in blue in Fig. 18. As discussed before, one should sample the scene with a wide enough bandwidth to avoid aliasing, then pre-filter, and then downsample. After downsampling, one obtains the maximum amount of data the display can use: the display cannot show more, so there is no point in providing more. It should be pointed out that this is in line with similar analyses performed for autostereoscopic displays [19,20].

#### 4.3. Consideration related to an ‘ideal’ HPO 3D display

An ideal display should deliver the resolution required by the HVS. For estimating the required angular-spatial resolution of the display, in this section we follow the discussion presented in [7]. It is assumed that an eye at distance $z_c$ from the display can differentiate spatial changes equal to 1/60°, which is equivalent to a resolution of 30 cpd (cycles per degree). This maps to a required spatial resolution $x_s$ at the screen plane. The required angular resolution $\alpha_s$ is related to the pupil size $d_p$ and can be estimated as in [7]. Assuming that the average pupil size, as reported in the literature, is $d_p$ = 3 mm and the viewing distance is fixed at $z_c$ = 2000 mm, we end up with a required display resolution of ($x_s$, $\alpha_s$) = (0.58 mm, 0.086°). This means that an ‘ideal’ HPO display with a 60° FOV, for the assumed fixed distance, is required to reproduce at least $2\cdot10^{9}$ rays per square meter of the screen surface.
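The quoted values can be reproduced with simple geometry. The sketch below is an assumption rather than the formulation of [7]: it takes $x_s$ as the screen length that subtends the acuity angle at distance $z_c$, takes $\alpha_s$ as the angle subtended by the pupil as seen from the screen, and assumes the same spatial step $x_s$ in the vertical direction. The function names are hypothetical.

```python
import math

def required_resolution(z_c_mm, pupil_mm, acuity_deg=1.0 / 60.0):
    """Assumed geometry: spatial step from the acuity angle, angular step from the pupil size."""
    x_s = z_c_mm * math.tan(math.radians(acuity_deg))
    alpha_s = math.degrees(math.atan(pupil_mm / z_c_mm))
    return x_s, alpha_s

def rays_per_square_meter(x_s_mm, alpha_s_deg, fov_deg):
    """HPO display: fov/alpha_s horizontal directions for each x_s-by-x_s screen point."""
    points = (1000.0 / x_s_mm) ** 2
    directions = fov_deg / alpha_s_deg
    return points * directions

x_s, alpha_s = required_resolution(z_c_mm=2000.0, pupil_mm=3.0)
rays = rays_per_square_meter(x_s, alpha_s, fov_deg=60.0)
```

Under these assumptions the sketch gives $x_s \approx 0.58$ mm, $\alpha_s \approx 0.086°$, and slightly more than $2\cdot10^{9}$ rays per square meter, in agreement with the values stated above.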

Following the proposed display optimization, one can determine that for fixed $\alpha_p$ = 0.0313°, the optimal parameters of the ray generators are $(\overline{x}_p, \overline{z}_p) = (1.74\,\text{mm}, 1062.86\,\text{mm})$, with the matching error being $\delta_p$ = 0.0322. By mapping these values to the camera plane (cf. Fig. 19), we can determine the necessary sampling rates as discussed in the previous section.

## 5. Concluding remarks

In this paper we presented a sampling model of the LF display-camera channel. We have shown that, from the sampling theory viewpoint, we can start with the required properties of the display specified in the ray space at the screen plane and then calculate the display setup fulfilling those requirements. Having the display setup, we can estimate the minimal set of data that the display needs to maximally utilize its visualization capabilities, together with the filter bandwidth for data pre-filtering aimed at alias-free reproduction.

Several points beyond the scope of this paper should be emphasized. First, we did not discuss all practical (hardware) aspects of implementing such displays, e.g. additional design limitations might be imposed by the available components and space, such as overall size, available ray sources, etc. Nevertheless, the methodology presented in the paper still applies. Second, we assumed ideal D/C properties of the (holographic) screen. In practice, the screen will introduce additional smoothing that will further band-limit the content the display can reproduce. Third, it should always be kept in mind that while the display is a band-limiting device, a typical visual scene is not band-limited. This means that special care has to be taken when sensing a scene and when applying anti-aliasing filtering to the content prior to its visualization on the LF display.

The discussion in the paper concentrated on projection-based displays employing a diffusion-based holographic screen for reconstructing the continuous light field. This specific setting makes it possible to demonstrate clearly the importance of ray sampling patterns for characterizing the display bandwidth and to relate it directly to the ray acquisition setting. However, the proposed approach can be used with any type of display system that attempts to recreate a continuous light field and has an underlying (not necessarily uniform) sampling of the input light field in the angular-spatial domain. Examples include autostereoscopic [21] and super multi-view displays [22,23]. A more comprehensive analysis is required for displays having a non-uniform density of input light rays and/or ones that are capable of changing the density of light rays based on content (e.g. tensor displays [24]); this will be a topic of further research.

The estimates of ‘ideal’ projection-based LF display parameters obtained in Section 4.3 were based on geometrical assumptions about the resolving power of the human eye. They show that a projection-based LF display matching the sampling density of the HVS is possible provided that the individual optical modules are spaced 1.74 mm apart and each module delivers rays with an angular step of about 0.03°. Such a display is still difficult to produce and can be attempted in the future. Any other display design with lower resolution will greatly benefit from the solution presented in this paper for delivering alias-free imagery. While the resolving power of the HVS estimated from geometrical assumptions is quite high, further studies are needed to characterize the perceptual threshold of continuous parallax, in the way the window of visibility in the disparity domain has been estimated [25]. Such a perceptual characterization of continuous parallax would be more instructive when specifying the desired LF display bandwidth. A similar problem exists with LF content creation. To cope with the anti-aliasing requirements, a very dense set of cameras is required for capturing content to be further processed for the specific display. Future development of intermediate view generation from a set of sparsely captured views, employing signal processing sparsification approaches, is of great interest.

## Acknowledgments

The research leading to these results has received funding from the PROLIGHT-IAPP Marie Curie Action of the People programme of the European Union’s Seventh Framework Programme, REA grant agreement 32449 and from the Academy of Finland, grant No. 137012: High-Resolution Digital Holography: A Modern Signal Processing Approach.

## References and links

**1. **A. Boev, R. Bregović, and A. Gotchev, “Signal processing for stereoscopic and multi-view 3D displays,” in *Handbook of Signal Processing Systems, 2nd edition*, S. Bhattacharyya, E. Deprettere, R. Leupers, and J. Takala, eds. (Springer, 2013).

**2. **W. A. IJsselsteijn, P. J. H. Seuntiëns, and L. M. J. Meesters, “Human factors of 3D displays,” in *3D Video Communication*, O. Schreer, P. Kauff, and T. Sikora, eds. (Wiley, 2005).

**3. **T. Balogh, “The HoloVizio system,” Proc. SPIE **6055**, 60550U (2006).

**4. **J. H. Lee, J. Park, D. Nam, S. Y. Choi, D. S. Park, and C. Y. Kim, “Optimal projector configuration design for 300-Mpixel multi-projection 3D display,” Opt. Express **21**(22), 26820–26835 (2013).

**5. **E. Adelson and J. Bergen, “The plenoptic function and the elements of early vision,” in *Computational Models of Visual Processing*, M. Landy and J.A. Movshon, eds. (MIT, 1991).

**6. **A. Stern, Y. Yitzhaky, and B. Javidi, “Perceivable light fields: Matching the requirements between the human visual system and autostereoscopic 3-D displays,” Proc. IEEE **102**(10), 1571–1587 (2014).

**7. **S. A. Benton and V. M. Bove, *Holographic Imaging* (Wiley, 2008).

**8. **R. Bregović, P. T. Kovács, T. Balogh, and A. Gotchev, “Display-specific light-field analysis,” Proc. SPIE **9117**, 911710 (2014).

**9. **M. Levoy and P. Hanrahan, “Light field rendering,” in SIGGRAPH ‘96 Proceedings of the 23rd Annual Conference on Computer Graphics and Interactive Techniques (Computer Graphics) (1996), pp. 31–42.

**10. **S. J. Gortler, R. Grzeszczuk, R. Szeliski, and M. F. Cohen, “The lumigraph,” in SIGGRAPH ‘96 Proceedings of the 23rd Annual Conference on Computer Graphics and Interactive Techniques (Computer Graphics) (1996), pp. 43–54.

**11. **C. K. Liang, Y. C. Shih, and H. H. Chen, “Light field analysis for modeling image formation,” IEEE Trans. Image Process. **20**(2), 446–460 (2011).

**12. **C. Zhang and T. Chen, “Spectral analysis for sampling image-based rendering data,” IEEE Trans. Circ. Syst. Video Tech. **13**(11), 1038–1050 (2003).

**13. **E. Dubois, “The sampling and reconstruction of time-varying imagery with application in video systems,” Proc. IEEE **73**(4), 502–522 (1985).

**14. **E. Dubois, “Video sampling and interpolation,” in *The Essential Guide to Video Processing*, A. Bovik, ed. (Academic Press, 2009).

**15. **P. Q. Nguyen and D. Stehlé, “Low-dimensional lattice basis reduction revisited,” ACM Trans. Algorithms **5**(4), 338–357 (2009).

**16. **F. Aurenhammer, “Voronoi diagrams – A survey of a fundamental geometric data structure,” ACM Comput. Surv. **23**(3), 345–405 (1991).

**17. **E. B. Tadmor and R. E. Miller, *Modeling Materials: Continuum, Atomistic and Multiscale Techniques* (Cambridge University, 2011).

**18. **X. Cao, Z. Geng, and T. Li, “Dictionary-based light field acquisition using sparse camera array,” Opt. Express **22**(20), 24081–24095 (2014).

**19. **M. Zwicker, W. Matusik, F. Durand, and H. Pfister, “Antialiasing for automultiscopic 3D displays,” Proc. Eurographics Symposium Rendering, 1–10 (2006).

**20. **A. Boev, R. Bregović, and A. Gotchev, “Methodology for design of antialiasing filters for autostereoscopic displays,” IET Signal Process. **5**(3), 333–343 (2011).

**21. **N. Holliman, N. Dodgson, G. Favalora, and L. Pockett, “Three-dimensional displays: A review and application analysis,” IEEE Trans. Broadcast **57**(2), 362–371 (2011).

**22. **Y. Takaki and N. Nago, “Multi-projection of lenticular displays to construct a 256-view super multi-view display,” Opt. Express **18**(9), 8824–8835 (2010).

**23. **Y. Takaki, Y. Urano, S. Kashiwada, H. Ando, and K. Nakamura, “Super multi-view windshield display for long-distance image information presentation,” Opt. Express **19**(2), 704–716 (2011).

**24. **G. Wetzstein, D. Lanman, M. Hirsch, and R. Raskar, “Tensor displays: Compressive light field synthesis using multilayer displays with directional backlighting,” ACM Trans. Graph. **31**(4), 1–11 (2012).

**25. **D. Kane, P. Guan, and M. S. Banks, “The limits of human stereopsis in space and time,” J. Neurosci. **34**(4), 1397–1408 (2014).