
Systematic method for modeling and characterizing multilayer light field displays


Abstract

Conventional stereoscopic displays are subject to the well-known vergence-accommodation conflict (VAC) problem because they are unable to render correct focus cues of a 3D scene. A computational multilayer light field display has been explored as one approach that can potentially overcome the VAC problem, owing to its promise of rendering a true 3D scene by sampling the directions of the light rays apparently emitted by that scene. Several pioneering works have demonstrated working prototypes of multilayer light field displays and their potential capability of rendering nearly correct focus cues. However, there has been no systematic investigation into methods for modeling and analyzing such displays, which is essential for further optimization and development of high-performance multilayer light field display systems. In this paper, we propose a systematic analysis method for multilayer light field displays that simulates the perceived retinal image while taking the display factors, the view dependency of the reference light field, the diffraction effect, and the visual factors into consideration. We then apply this model to investigate the accommodative response when observing the display engine.

© 2020 Optical Society of America under the terms of the OSA Open Access Publishing Agreement

1. Introduction

The ultimate goal of three-dimensional (3D) displays is to create a vivid viewing experience for users by rendering high-fidelity visual cues that stimulate human visual perception as if they were looking at a natural 3D scene. A stereoscopic 3D (S3D) display is one of the most commonly adopted approaches; it stimulates the perception of 3D space and shapes by showing a pair of two-dimensional (2D) images, one for each eye, with binocular disparities and other pictorial depth cues. Such an S3D-based display method often fails to correctly render some of the important visual cues that contribute to the 3D perception of the human visual system (HVS) when viewing a natural 3D scene, leading to significant visual conflicts and discomfort. For instance, in most existing S3D displays, the pair of binocular images is typically rendered on a screen at a fixed focal distance from the viewer, which leads to the broadly recognized problem of vergence-accommodation conflict (VAC). To elaborate, the binocular disparities rendered by the pair of images offer the vergence cue, which drives the two eyes to converge at the proper depth of the object of interest to fuse the two separate images into one. Meanwhile, the fixed screen distance from the viewer offers the accommodation cue, which drives the change of the refractive power of the eye lenses to perceive sharply focused images of the rendered scene. The depth of the vergence cue and that of the accommodation cue are generally mismatched unless the rendered scene is located at the same depth as the screen (i.e., zero disparity), which leads to visual cue conflicts and reported visual discomfort and misperception [1,26].

Many methods have been explored to address the VAC problem [2–18,26], among which light field 3D (LF-3D) displays can potentially render correct or nearly correct focus cues by reproducing the light rays emitted by the actual 3D scene [20]. Each small bundle of the sampled light rays captures the subtle difference of an object seen from slightly different viewing positions, and its luminance is modulated by a LF-3D display engine to replicate the luminance of a real ray bundle in the same direction from the natural 3D scene. The projection of the 3D scene through each of the sampled viewing positions is often regarded as an elemental view. To enable a viewer’s eye to accommodate at the depth of a reconstructed 3D object, rather than the source from which the rays actually originate, and thus effectively mitigate the VAC problem, a true LF-3D display requires that multiple elemental views from different viewing directions enter the eye pupil at the same time to form the perceived retinal image. In this sense, the retinal image blur is the summed effect of these elemental views and is determined by both the intensity distributions and the lateral displacements of the elemental views, while the displacements of the elemental views on the retina vary with the state of eye accommodation.

Several different optical architectures have been explored to implement LF-3D displays, such as super multi-view systems utilizing an array of projectors [17] or integral imaging methods using a lenslet array or pinhole array [13,19]. A computational multilayer light field display is a relatively new, emerging method of implementing LF-3D displays. It mainly consists of a stack of light attenuation layers and a directional or uniform backlight [16,22]. The light field of a 3D scene is computationally decomposed into a number of masks representing the transmittance or reflectance of each layer of the light attenuators. When rendering a 3D scene, the light rays from the backlight propagate through the attenuation layers, and their luminance is modulated in a multiplicative fashion by varying the transmittance or reflectance of the pixels on each layer intercepted by the rays, to simulate the genuine rays emitted by the actual 3D scene.

Engineering prototypes of computational LF-3D display systems have been demonstrated, and some aspects of their imaging properties, such as the depth of field (DOF), have been investigated [14–16,19,21,22,27]. For instance, adapting the well-known parallax-barrier autostereoscopic displays by introducing content-adaptive optimization to compute the proper attenuation values for the mask layers, Wetzstein et al. demonstrated a new tensor display composed of a stack of time-multiplexed light-attenuating layers illuminated by a directional backlight [16]. Maimone and Fuchs pioneered the application of the multilayer computational light field display technique to HMDs and demonstrated the first computational multilayer AR display [27]. More recently, Wetzstein et al. extended their multi-layer factored light field autostereoscopic display method and demonstrated a light field stereoscope for immersive VR applications [15]. The pinlight display, consisting of a spatial light modulator (SLM) and an array of point light sources, namely the pinlights, may also be considered a computational multi-layer display in which the back layer of light attenuators next to the backlight is replaced by a pinhole array or an array of point sources [19].

Despite these pioneering works, developing and optimizing multi-layer LF-3D displays confronts many critical challenges, such as understanding the angular-spatial resolution tradeoffs, determining the optimal number of layers and layer separations, or finding the optimal modulation patterns to achieve high-performance LF rendering. Unfortunately, none of the pioneering works has attempted to rigorously address these challenges. To fill this gap, the ultimate goal of our research is threefold: (1) to develop a generalized framework to analytically model the retinal image formation process of a computational LF display; (2) to utilize this analytical framework to establish relevant performance metrics in relation to the engineering parameters of a system, to predict the accommodative response of a standard observer to a multilayer LF-3D display, and even to predict the performance on visual tasks such as depth perception; and (3) to utilize the above framework to develop a general optimization method for engineering a multi-layer LF system and to develop engineering guidelines for different application domains in the future.

This paper mainly focuses on the first of our three objectives; we will present follow-up papers to address the other two. To create a generalized framework with a wide range of utilities, it is necessary to develop a systematic analysis method that links the key engineering parameters for constructing a multi-layer LF-3D display with the performance metrics of such a system, such as angular and spatial resolution, image contrast, etc. Unlike a conventional 2D display, the perceived results and artifacts of a LF-3D display largely depend on the ocular parameters of an observer, such as the accommodative state. Therefore, a sound analytical model shall consider not only the effects of the key engineering parameters, the LF factorization algorithms, and the view-dependency properties of a reference light field, but also offer the capability of simulating the perceived retinal images and the visual artifacts unique to a computational LF display. Furthermore, the model shall offer the flexibility to scale the display parameters, such as the pixel pitch or fill factor and the number of layers and layer separations, or to adopt different LF factorization algorithms. Through the analytical models, we shall be able to examine the effects of engineering parameters such as the pixel pitch and spacing of the attenuation layers on the perceptual quality of the rendered image and on the effectiveness of stimulating accommodative responses.

The rest of the paper is organized as follows. Section 2 describes the development of a generalized framework for modeling the retinal image formation process of a multilayer LF-3D system. The implementation process of the modeling framework is presented in Section 3, where the simulation of perceived retinal images and the accumulated PSF of a rendered light field are quantitatively demonstrated. The remaining sections of the paper demonstrate the utility of the analytical model. Section 4 examines the effects of display parameters such as the pixel pitch of the modulation layers and the layer separation on the rendered LF quality. Section 5 further investigates the effects of view dependency in a multilayer LF-3D display by examining how the content of a target light field and its depth position, as well as its subtle view-dependent appearance, affect the quality of reconstruction. Section 6 examines the accommodative response of a standard observer for targets rendered at different distances and quantitatively analyzes the accommodation error.

2. Methods

Huang and Hua proposed a generalized framework for modeling the image formation process of a LF-3D display and demonstrated its utility for characterizing the perceived retinal image properties, as well as the accommodative response of an observer to an integral-imaging (InI) based LF-3D display, concerning the directional and positional sampling properties of a 4-D light field [24,25]. Adapting their generalized model, Fig. 1(a) illustrates the systematic model for a computational multilayer light field display and its retinal image formation process. The adapted model consists of a multilayer light field display engine and an eye model. The display engine simulates the mechanism and process of reconstructing the light field of a 3D scene through a stack of attenuation layers. The eye model, which simulates the optical properties of the eye, is placed so that its entrance pupil matches the location of the viewing window created by the display engine, through which a viewer observes the reconstructed scene. Similar to the work by Huang and Hua [24,25], in this paper the perceived retinal image, along with the accumulated retinal point spread function (PSF) and its corresponding modulation transfer function (MTF), are modeled to characterize the image quality of a LF-3D system, accounting for the unique properties of LF sampling and the dependence on the ocular parameters of the observer. The accumulated PSF characterizes the image formation of a point source rendered by the system, and its width indicates the resolution limit. The MTF measures the normalized contrast response of the display for rendering targets of different spatial frequencies. By calculating the MTF through a range of rendering depths, the depth of field (DOF) of a system may be examined by locating the depth range within which the MTF values at the boundary spatial frequencies are equal to or above a threshold value.
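As a concrete illustration of this DOF criterion, the short Python/NumPy sketch below locates the depth range over which the MTF at a chosen boundary spatial frequency stays at or above a threshold. The paper's own computations are in MATLAB; the function name, the threshold value, and the stand-in Gaussian MTF curve are our illustrative assumptions.

```python
import numpy as np

def depth_of_field(depths_diopters, mtf_at_freq, threshold=0.1):
    """Estimate the DOF as the range of rendering depths whose MTF (sampled
    at the boundary spatial frequency) stays at or above `threshold`.
    Returns the first and last above-threshold depths (assumes the
    above-threshold region is contiguous)."""
    above = np.where(mtf_at_freq >= threshold)[0]
    if above.size == 0:
        return None
    return depths_diopters[above[0]], depths_diopters[above[-1]]

# Hypothetical example: MTF at a boundary frequency, sampled over 0-2 diopters.
depths = np.linspace(0.0, 2.0, 81)
mtf = np.exp(-((depths - 1.0) / 0.35) ** 2)   # stand-in MTF-vs-depth curve
print(depth_of_field(depths, mtf, threshold=0.1))
```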

Fig. 1. The generalized systematic analysis model of a two-layer light field display system with a schematic eye model, and the perceived retinal image formation of this display for a reconstruction point B located at a distance z from the eye pupil that is (a) equal to, (b) smaller than, and (c) larger than the eye accommodative distance zA. (d) The viewing window, or the eye entrance pupil, with footprints of multiple elemental views.

For the convenience of characterizing the imaging properties and eye accommodative response from the point of view of a viewer, the center of the viewing window is set as the origin O of the coordinate system OXYZ, the Z-axis is along the viewing direction pointing straight toward the display engine, and the OXY plane is parallel to the viewing window. We further define a reference frame, O’X’Y’Z’, for the retinal image plane, with the origin O’ located at a distance z’retina, the effective distance from the entrance pupil of the eye to the retinal plane. The corresponding axes of the two reference frames are parallel to each other. For consistency, the distances and coordinates in the visual space will be defined with respect to the reference frame OXYZ, while all the coordinates in the retinal image space will be defined with respect to the reference frame O’X’Y’Z’.

Without loss of generality, the schematics here assume, for simplicity, that the display engine is constructed from two attenuation layers illuminated by a uniform backlight, though the model can be readily extended to more than two attenuation layers or to a non-uniform backlight. Specifically, for configurations with more than two layers, the same process described here can be followed to calculate the perceived retinal image as in the dual-layer configuration. The minor differences are to trace more rays based on the pixels on the extra layers for factorization and to define the elemental views according to the directional sampling determined by the pixels on all the layers.

The front layer is situated at a distance z1 and the back layer at a distance z2 from the entrance pupil of the eye model. In addition, we define the separation between the two layers as h, the pixel pitch of the layers as p, and the viewing window size as Eb. The dimensions of the layers directly affect the apparent field of view (FOV) of the system and have a major impact on the computational time required for rendering the attenuation masks. To avoid excessive computational overhead, the FOV in the remainder of the paper is limited to about 3 degrees centered on the fovea of the retina, though the model itself does not prevent simulating a wider FOV. To simulate the perceived retinal image with adequate accuracy for standard observers, we need to choose a schematic eye model that not only correctly models the paraxial optical properties of a biological eye but also reproduces clinical levels of aberrations for both the on- and off-axis fields. Furthermore, to evaluate an observer’s response in viewing a LF-3D display, we require that the model be capable of adjusting its optical parameters, such as the radius of curvature, surface thickness, and refractive index of the eye lens, according to the eye’s accommodation state. To fulfill these requirements, the Arizona Eye Model was selected [23].

As illustrated in Fig. 1(a), to reconstruct a point B located at a distance z from the entrance pupil, all of the light rays passing through the target point B from the backlight integrally contribute to the reconstruction of its light field, and each of the rays intersects with pixels on the attenuation layers. Due to the finite pixel pitch of the attenuation layers, the continuously-distributed light rays from the backlight are sampled into discrete ray bundles with a finite aperture, each of which is considered an elemental view of the target point. Because the pixel pitch of the attenuation layers is small compared to the layer separation, all the rays within each bundle can be adequately approximated by a parallel beam, as if they originated from optical infinity; this is a significant difference from an integral-imaging based LF-3D display engine, where the rays of each elemental view can appear to be diverging, converging, or parallel [24]. The luminance of each ray bundle is sequentially modulated by the transmittance or reflectance of the corresponding pixel pairs to render the luminance of the rays emitted by a reference light field in the same direction. In the process of forming the retinal image of the target point, only a subset of the ray bundles passing through the target point B enters the eye pupil and thus contributes to the perceived retinal image of the reconstructed point. The resulting retinal image depends on the accommodative state of the eye model. When the accommodation depth of the eye, zA, matches the rendering distance z, as shown in Fig. 1(a), the chief rays of the different elemental views intersect the retina at the same location, although each ray bundle may appear to have a small amount of defocus blur, due to its parallel nature, if there is a large depth separation between the rendered target and the infinite depth of the ray bundle. On the other hand, if zA is larger (Fig. 1(b)) or smaller (Fig. 1(c)) than z, the chief rays of the elemental views no longer converge toward the same location on the retina but are displaced from each other. It is important to note that both the displacement and the elemental view image vary with the state of eye accommodation. In this sense, the accommodative state of the eye model may be varied to determine the sharpest retinal image, and the corresponding eye accommodation distance is considered the actual accommodative response of the eye for the rendered object.

According to the retinal image formation process described above, the retinal image of a reconstructed target point is the summation of all the modulated elemental views entering the eye pupil, and its normalized intensity can be expressed as

$${I_{retina}}(x^{\prime},y^{\prime},{z_A}) = \frac{{\sum\limits_{n = 1}^N {\sum\limits_{m = 1}^M {\sum\limits_{q = 1}^Q {L(m,n,{\lambda _q}) \cdot w({\lambda _q}) \cdot s({d_{xm}},{d_{yn}}) \cdot {{|{PS{F_{mnq}}(x^{\prime},y^{\prime},{z_A})} |}^2}} } } }}{{\sum\limits_{n = 1}^N {\sum\limits_{m = 1}^M {\sum\limits_{q = 1}^Q {L(m,n,{\lambda _q}) \cdot w({\lambda _q}) \cdot s({d_{xm}},{d_{yn}})} } } }},$$
where M and N are the total numbers of elemental views entering the eye pupil along the X- and Y-directions, respectively, Q is the number of sampled wavelengths in the model, PSFmnq is the retinal PSF of a given elemental view indexed as (m,n) at a given wavelength λq, and L is the luminance value of the elemental view indexed as (m,n) received by the eye from the point of reconstruction. Moreover, two additional weighting functions, w and s, are included in Eq. (1). w accounts for the relative visual response to different wavelengths and the wavelength mixing factor of the light source. s is an apodizing filter accounting for the well-known Stiles-Crawford effect, whereby the relative efficiency of an elemental view indexed as (m,n) depends on its entry position, dxm and dyn, on the eye pupil, as illustrated in Fig. 1(d). Due to this effect, an elemental view entering the central portion of the eye pupil contributes more to the retinal image than elemental views entering the pupil periphery.
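The weighted sum in Eq. (1) is straightforward to evaluate once the elemental-view PSFs are available. Below is a minimal Python/NumPy sketch, assuming a precomputed stack of complex PSFs; the array shapes and the function name are our own illustrative assumptions (the paper's implementation is in MATLAB):

```python
import numpy as np

def retinal_image(psf_stack, L, w, s):
    """Normalized retinal image of Eq. (1).

    psf_stack : complex array (M, N, Q, H, W) of elemental-view PSFs
    L         : luminance per view and wavelength, shape (M, N, Q)
    w         : photopic/wavelength weights, shape (Q,)
    s         : Stiles-Crawford weight per view entry position, shape (M, N)
    """
    weights = L * w[None, None, :] * s[:, :, None]            # (M, N, Q)
    num = np.einsum('mnq,mnqhw->hw', weights, np.abs(psf_stack) ** 2)
    return num / weights.sum()                                # Eq. (1)
```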

Although the retinal image of a reconstructed point given in Eq. (1) for a multi-layer LF-3D display appears to share great similarity with that of an InI-based display engine, several major differences set the two methods apart. InI-based systems have two distinctive planes, a rendering plane and a modulation plane, which perform the distinct functions of positional and directional sampling, respectively. Each pixel on the rendering plane, which defines a luminance value, L, in Eq. (1), is a unique elemental view of a 3D point of reconstruction, and each element on the modulation plane has a unique footprint on the viewing window. As a result, an accumulated point spread function (PSF) can be readily obtained by normalizing the weighted sum of a set of known elemental pixels at known entry positions to model the system response, which can then be directly utilized to characterize and examine the effects of ray directional and positional sampling, as demonstrated by Huang and Hua [24,25]. In a computational multi-layer LF-3D system, however, all the attenuation layers collectively sample the ray directions of a light field, and they share equivalent functions and effects in the sampling of a rendered light field. More importantly, each pixel on a modulation layer modulates all the rays that intersect with it and therefore does not represent a unique elemental view of a 3D point. Consequently, there is no unique mapping between a pixel and an elemental view of a 3D point of reconstruction. Instead, a pixel that contributes to the reconstruction of one 3D point likely modulates the light field of many other points, and the extent to which a pixel may impact other points largely depends on the designed viewing window and the viewer’s eye position. In this sense, the luminance value of each elemental view, L, in Eq. (1) becomes highly dependent on the rendered content and the computational algorithms for rendering the attenuation layers, and the perceived retinal image of a reconstructed light field varies with the observer’s eye position within the viewing window. To analytically model a multi-layer LF-3D system and simulate the perceived retinal image of a reconstructed scene, we therefore have to develop methods to characterize content-independent system response functions, model the light field rendering process, and model the retinal image formation and reconstruction process.

2.1 Modeling the retinal response of a multi-layer display

To account for the unique optical construction and computational nature of a multilayer LF-3D system described above, we separate, based on Eq. (1), the imaging effects of the physical parameters inherent to a multi-layer display engine from the computational effects of the light field rendering. An accumulated PSF is defined to model the imaging effects of a multi-layer display engine by assuming uniform luminance values of 1 for all the elemental views in Eq. (1), expressed as

$$PS{F_{Acc}}(x^{\prime},y^{\prime},{z_A}) = \frac{{\sum\limits_{n = 1}^N {\sum\limits_{m = 1}^M {\sum\limits_{q = 1}^Q {w({\lambda _q}) \cdot s({d_{xm}},{d_{yn}}) \cdot {{|{PS{F_{mnq}}(x^{\prime},y^{\prime},{z_A})} |}^2}} } } }}{{\sum\limits_{n = 1}^N {\sum\limits_{m = 1}^M {\sum\limits_{q = 1}^Q {w({\lambda _q}) \cdot s({d_{xm}},{d_{yn}})} } } }}.$$
By considering the parallel nature of the ray bundle from each elemental view, the monochromatic, coherent retinal PSF of a given elemental view indexed as (m,n) can be adapted from the analytical model in [24] and expressed as
$$\begin{array}{l} PS{F_{mnq}}(x^{\prime},y^{\prime},{z_A}) = \frac{{{e^{j\frac{{2\pi }}{{{\lambda _q}}}{{z^{\prime}}_{retina}}}}{e^{j\frac{\pi }{{{\lambda _q}{{z^{\prime}}_{retina}}}}({{x^{\prime}}^2} + {{y^{\prime}}^2})}}}}{{j{\lambda _q}{{z^{\prime}}_{retina}}}}\cdot \int {\int\limits_{ - \infty }^\infty {P(x - {d_{xm}},y - {d_{yn}})\cdot {e^{j{{\vec{k}}_{mn}} \cdot \vec{r}}}} } \\ \textrm{ }\cdot \exp [j\frac{{2\pi }}{{{\lambda _q}}}{W_{eye}}({d_{xm}},{d_{yn}},x,y,{\lambda _q},{z_A})]\cdot \exp [j\frac{\pi }{{{\lambda _q}}}( - \frac{1}{{{z_A}}})({x^2} + {y^2})]\\ \textrm{ }\cdot \exp [ - j\frac{{2\pi }}{{{\lambda _q}}}(\frac{{x^{\prime}}}{{{{z^{\prime}}_{retina}}}}x + \frac{{y^{\prime}}}{{{{z^{\prime}}_{retina}}}}y)]dxdy \end{array}$$
where ${\vec{k}_{mn}}$ represents the direction vector of the incident ray bundle for an elemental view indexed as (m,n). The direction vector of a ray bundle may be defined either by the coordinates of the pixel pair with which the ray bundle intersects, or by the coordinates of the target point B and the intersection point of the ray on the viewing window. Weye is the wavefront aberration of the eye optics accommodated at the depth zA, which can model not only the residual aberrations of a standard observer but also observer-specific refractive errors such as myopia or hyperopia. P is a binary aperture function defining the footprint of the ray bundle from each elemental view on the pupil of the eye model; its size depends on the pixel pitch, p, of the attenuation layers projected on the viewing window, as well as the shape and fill factor of each pixel. By assuming the footprints of the elemental views are evenly distributed on the viewing window in a rectangular array symmetric about the optical axis and are perfectly square like the pixels, the aperture function P can be expressed as
$$P(x - {d_{xm}},y - {d_{yn}}) = rect(\frac{{\alpha (x - {d_{xm}})}}{p},\frac{{\alpha (y - {d_{yn}})}}{p}).$$
where α is the fill factor of the pixels; a fill factor of 1 is assumed for the purpose of this paper. For a given elemental view indexed as (m,n), its entry position, (dxm, dyn), on the viewing window can be expressed as
$${d_{xm}} = z\frac{{{{\vec{k}}_{mn}} \cdot {{\vec{e}}_x}}}{{{{\vec{k}}_{mn}} \cdot {{\vec{e}}_z}}};\qquad {d_{yn}} = z\frac{{{{\vec{k}}_{mn}} \cdot {{\vec{e}}_y}}}{{{{\vec{k}}_{mn}} \cdot {{\vec{e}}_z}}},$$
where ${\vec{e}_x}$, ${\vec{e}_y}$, and ${\vec{e}_z}$ are unit vectors along the X-, Y-, and Z-directions. Based on the accumulated PSF in Eq. (2), the corresponding MTF, which identifies the contrast modulation for each spatial frequency of the perceived retinal image, can be expressed as a Fourier transform of the accumulated PSF:
$$MT{F_{Acc}}(\xi ^{\prime},\eta ^{\prime}) = \frac{{\int {\int_{ - \infty }^\infty {PS{F_{Acc}}(x^{\prime},y^{\prime})\exp [ - j(\xi ^{\prime}x^{\prime} + \eta ^{\prime}y^{\prime})]dx^{\prime}dy^{\prime}} } }}{{\int {\int_{ - \infty }^\infty {PS{F_{Acc}}(x^{\prime},y^{\prime})dx^{\prime}dy^{\prime}} } }},$$
where $\xi ^{\prime}$ and $\eta ^{\prime}$ are the spatial frequencies on the retina along the X’- and Y’-directions.
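To make Eqs. (2)-(6) concrete, the following monochromatic Python/NumPy sketch builds an elemental-view PSF as the Fourier transform of a small square sub-aperture carrying the tilt and defocus phases of Eq. (3) (with the eye aberration term Weye omitted), then accumulates a few views and takes the MTF. The grid sizes, the 3×3 view layout, the uniform weights, and the sign conventions are simplifying assumptions of ours:

```python
import numpy as np

lam = 550e-9         # wavelength [m]
p = 0.29e-3          # elemental-view footprint on the pupil [m] (~pixel pitch)
zA = 1.0             # eye accommodation [diopters]; ray bundles are collimated
n, half = 512, 2e-3  # pupil-plane samples and half-width [m]
x = np.linspace(-half, half, n)
X, Y = np.meshgrid(x, x)

def elemental_psf(dx, dy, kx, ky):
    """|PSF_mn|^2: FFT of the square sub-aperture at (dx, dy) with a tilt
    phase for the ray direction (kx, ky) and the defocus phase of Eq. (3)."""
    P = (np.abs(X - dx) < p / 2) & (np.abs(Y - dy) < p / 2)
    phase = (2 * np.pi / lam) * (kx * X + ky * Y) \
            - (np.pi / lam) * zA * (X ** 2 + Y ** 2)   # defocus term, -(1/z_A)
    return np.abs(np.fft.fftshift(np.fft.fft2(P * np.exp(1j * phase)))) ** 2

# Accumulated PSF (Eq. (2)) over a 3x3 grid of views across a 3 mm pupil,
# uniform w and s; ray directions aim at an on-axis target 1 m away.
acc = sum(elemental_psf(dx, dy, -dx, -dy)
          for dx in (-1e-3, 0.0, 1e-3) for dy in (-1e-3, 0.0, 1e-3))
acc /= acc.sum()

# MTF (Eq. (6)): normalized magnitude of the Fourier transform of the PSF.
mtf = np.abs(np.fft.fft2(acc))
mtf /= mtf[0, 0]
```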

The viewing density, defined as the number of views per unit area on the viewing window [24], of a multi-layer LF-3D display is inversely proportional to the pixel pitch, p, and is typically significantly higher than that of an integral-imaging based LF-3D system, as a small pixel pitch is highly desirable for high image quality. Consequently, the footprint size of each elemental view is typically much smaller, and the total number of views integrated by the eye pupil is higher than in an InI-based engine. It is worth further pointing out that the accumulated PSF and MTF of the perceived light field on the retina, calculated by Eqs. (2) and (6) respectively, vary mostly with the depth of reconstruction, z, the ocular factors such as the eye accommodation state and eye pupil diameter, and the pixel pitch of the attenuation layers. Being independent of the display content, the accumulated PSF and MTF are valuable means for characterizing the general effects of some of the main display parameters and ocular factors on the perceived spatial resolution, depth of field, and eye accommodative response, but they do not provide an accurate prediction of the perceived retinal image specific to display contents.
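As a rough order-of-magnitude check on this point, assuming the footprint of each elemental view on the pupil is comparable to the pixel pitch (our simplifying assumption), the test configuration of Sec. 3 admits on the order of a hundred views into a 3 mm pupil:

```python
pupil = 3.0       # eye entrance pupil diameter [mm]
p = 0.29          # assumed elemental-view footprint on the pupil [mm]
views_across = int(pupil // p)
print(views_across, views_across ** 2)   # ~10 across, ~100 views in the pupil
```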

2.2 Rendering light fields

To accurately simulate the perceived retinal image of a multi-layer LF-3D system, it is necessary to model the computational effects of light field rendering. Figure 2 illustrates the two-step process of light field rendering and reconstruction in a multi-layer LF-3D engine. The first step is to computationally generate the transmittance values for each pixel on the attenuation layers to render a reference light field, and the second step is to computationally simulate the light field reconstructed by the display engine. For a given display configuration and a target reference light field, the rendering of the attenuation maps is adapted from the established light field factorization algorithm of [22]. Instead of defining the rays of a 4-D light field with pairs of pixels on the attenuation layers as in [24], we define an array of dense sampling points in the viewing window as view positions and then trace rays from each view sampling point to every pixel on each of the display layers. Each of the cast rays intersects with the reference target and yields a luminance measurement of the reference light field, and all the rays collectively yield a linear vector, $\overline {\textbf l} $, of the luminance samples of the reference light field. Consider the red dashed line in Fig. 2(a) as an example, representing a ray cast from a sampling point indexed as (m, n) on the viewing window. The ray passes through a target point, B, of the reference target and intersects with a pair of pixels marked by two red squares. Provided that L0 is the luminance of the uniform backlight and t1 and t2 are the transmittance values of the two pixels intersected by the ray on the front and back layers, respectively, the luminance value of the reconstructed ray, L, is expressed as

$$L(m,n) = {L_0} \cdot {t_1}(m,n) \cdot {t_2}(m,n),$$
By applying the logarithmic operation, Eq. (7) can be rewritten as:
$$- \log (\frac{{L(m,n)}}{{{L_0}}}) ={-} \log ({t_1}(m,n)) - \log ({t_2}(m,n)).$$

Fig. 2. Factorization process to calculate the transmittance of the pixels on the display layers and the reconstruction of the target plane rendered by the multilayer light field display engine.

If we define $l ={-} \log (L/{L_0})$ as the reconstructed light field and a = -log(t) as the absorbance of a pixel on the display layers, we can simply state that the reconstructed light field is the summation of the absorbances of the attenuation layers. For convenience, we can organize all the ray-tracing results and express the reconstructed light field in matrix-vector form as ${\textbf l} = {\textbf P}a$, where P is a 2D projection matrix with one dimension indexed by all of the rays being cast and the other dimension indexed by all of the pixels of the attenuation layers, and a is a vector of the absorbances of all pixels, as illustrated in Fig. 2(b). The absorbance a of the pixels on the layers can then be formulated and solved as the non-negative least-squares problem $\min ||\bar{{\textbf l}} - {\textbf P}a|{|^2}$ through iterative algorithms similar to those in [22], and the transmittance value of each pixel, t, can then be calculated as $t = \exp ( - a)$.
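A compact sketch of this factorization step is given below. Reference [22] solves the non-negative least-squares problem with its own iterative updates; here we substitute a simple NMF-style multiplicative update, which also preserves non-negativity (the function name and the update rule are our own, not the authors'):

```python
import numpy as np

def factor_layers(P, l, n_iter=200, eps=1e-9):
    """Solve min_a ||l - P a||^2 with a >= 0 by multiplicative updates.

    P : (n_rays, n_pixels) 0/1 projection matrix from ray tracing
    l : (n_rays,) target log-domain light field, l = -log(L / L0) >= 0
    Returns the per-pixel transmittances t = exp(-a)."""
    a = np.full(P.shape[1], 0.5)            # initial absorbances
    for _ in range(n_iter):
        a *= (P.T @ l + eps) / (P.T @ (P @ a) + eps)
    return np.exp(-a)
```

The returned transmittance vector is then reshaped into one attenuation map per layer.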

As an example, Fig. 2(b) demonstrates the formulation of the projection matrix and absorbance vector for rendering a reference light field in which the reconstructed geometry is a 2D planar object textured by a black-white grating pattern. To demonstrate the capability of rendering the subtle view-dependent appearance of a 3D object when viewed from slightly different positions, the local luminance amplitude of the grating pattern is modulated by a Gaussian function of the angle between the surface normal and the observation direction. The standard deviation of the Gaussian modulation used in the example of Fig. 2(b) is 1°. Figure 2(b) further shows the attenuation maps of the two attenuation layers obtained from the computational method above.

2.3 Modeling the perceived retinal image of reconstructed light fields

Following the computational steps in Sec. 2.2 to obtain the attenuation maps for each of the display layers, another key step toward the retinal perception is to computationally simulate the light field reconstructed by the attenuation maps. It is important to note that the light field reconstructed by the display engine is an approximation of the reference light field, because each pixel on the attenuation layers is likely used by multiple rays for reconstructing different object points. Furthermore, the reconstructed light field varies with the viewing position and the size of the observation window due to its inherent view dependency. Finally, and most importantly, the perceived retinal image of a reconstructed light field depends on the state of eye accommodation. As demonstrated in [16,21,22], the reconstructed luminance value of a ray for a given point of reconstruction and a given point of observation can be computed straightforwardly with Eq. (7) by multiplying the backlight luminance with the transmittance values of the corresponding pixels intersected by the ray. However, obtaining the perceived retinal image of a reconstructed light field requires considering the integration effect of an eye pupil of finite size and the accommodation state of the observer’s eye.

To obtain the perceived retinal image of a 3D scene reconstructed by a multi-layer LF-3D display, one method is to compute the perceived retinal image of every sampled point on the target of reconstruction using Eq. (1), where the luminance value of each ray reconstructing the point of interest is obtained with Eq. (7) and the corresponding PSF is computed with Eq. (3). The perceived retinal image of the entire scene can thus be obtained by iterating the process over all the sampled points. This process is naturally quite computationally expensive and slow. Alternatively, we can divide the process into two sub-steps. The first step is to compute the intensity image of a reconstructed scene received by a finite observation window equivalent to the eye pupil, without considering the imaging effects of the eye model. Figure 2(c) illustrates this process. An observation position within the viewing window is first specified, at which the center of the observation window is placed. We then define an array of observation sampling points within the observation window and an array of sampling points on the target geometry for reconstruction. The observation sampling points are different from, and denser than, the view sampling points used for rendering. Rays are traced from each of the observation sampling points to every sampling point on the target. The observed intensity of a reconstructed 3D point, IRec, can be obtained by integrating the luminance values of all the rays cast from the observation sampling points within the observation window through the point of interest, expressed as

$${I_{Rec}} = \frac{1}{{NM}}\sum\limits_{n = 1}^N {\sum\limits_{m = 1}^M {L(m,n)} },$$
where N and M are the total numbers of observation sampling points within the observation window in the horizontal and vertical directions, respectively. The intensity image of a 3D reconstructed scene observed at the observation window can be obtained by repeating the computation of Eq. (9) for every sampled point on the target. Compared with the previous methods for computing the reconstructed light field of a multilayer light field display system [16,21,22], which consider only the reconstruction from a single observation point and neglect the integration effect of an observation area such as the eye pupil, the reconstruction result given in Eq. (9) integrates the reconstruction over all observation sampling positions within the eye pupil and reflects more accurately how the eye perceives the reconstructed light field. Figure 2(d) demonstrates the reconstructed image of the grating target viewed from a series of discrete observation points, in comparison to the integrated image viewed through an eye pupil of finite size.
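For illustration, a 1-D Python/NumPy sketch of Eqs. (7) and (9) is given below: each ray from an observation sampling point through a target point is intersected with the two layers, attenuated multiplicatively, and the results are averaged over the observation window. All names, and the reduction to one transverse dimension, are our own simplifications:

```python
import numpy as np

def reconstructed_image(t1, t2, grid1, grid2, targets, obs_pts, z1, z2, z, L0=1.0):
    """Eqs. (7) and (9) in 1-D: average, over all observation sampling
    points, of L0*t1*t2 along the ray from each observation point through
    each target point. grid = (origin, pitch) of a layer's pixel centers."""
    img = np.zeros(len(targets))
    for i, xt in enumerate(targets):            # target positions at depth z
        acc = 0.0
        for xo in obs_pts:                      # observation positions at z=0
            x1 = xo + (xt - xo) * z1 / z        # intersection with front layer
            x2 = xo + (xt - xo) * z2 / z        # intersection with back layer
            i1 = int(round((x1 - grid1[0]) / grid1[1]))
            i2 = int(round((x2 - grid2[0]) / grid2[1]))
            if 0 <= i1 < len(t1) and 0 <= i2 < len(t2):
                acc += L0 * t1[i1] * t2[i2]     # Eq. (7)
        img[i] = acc / len(obs_pts)             # Eq. (9)
    return img
```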

Finally, the perceived retinal image of the reconstructed scene under a given state of eye accommodation can be computed by the convolution of the image of the reconstructed object with the accumulated PSF, expressed as

$${I^{\prime}_{retina}}(x^{\prime},y^{\prime},{z_A}) = {I^{\prime}_{Rec}}(x^{\prime},y^{\prime}) \otimes PS{F_{Acc}}(x^{\prime},y^{\prime},{z_A}),$$
where I’Rec is the intensity of the reconstructed scene, obtained by scaling the coordinates of IRec from Eq. (9) by the magnification factor mp of the eye optics, which maps the reconstructed object from the visual space OXYZ to the retinal space O’X’Y’Z’. This factor can be expressed as the ratio of the image distance z’retina to the reconstructed object distance z, with the proper sign. This convolution method assumes that the accumulated PSF of a multi-layer display given in Eq. (2), which mainly depends on the physical parameters of the display layers and the ocular factors of the eye model, is shift-invariant. It is therefore an approximation of the exact calculation of the perceived retinal image described by Eq. (1).
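A minimal sketch of Eq. (10) in Python using SciPy; the resampling step that applies the magnification mp, and all parameter names, are our own simplifying assumptions:

```python
import numpy as np
from scipy.ndimage import zoom
from scipy.signal import fftconvolve

def perceived_retinal_image(I_rec, psf_acc, z, z_retina, px_obj, px_ret):
    """Eq. (10): resample the reconstructed intensity I_rec from object-plane
    coordinates to retinal coordinates with m_p = z'_retina / z, then convolve
    with the accumulated PSF (treated as shift-invariant).
    px_obj / px_ret are the sample pitches of the two grids."""
    m_p = z_retina / z                     # paraxial magnification (Sec. 2.3)
    scale = abs(m_p) * px_obj / px_ret     # object samples -> retinal samples
    I_scaled = zoom(I_rec, scale, order=1)
    out = fftconvolve(I_scaled, psf_acc, mode='same')
    return out / out.max()
```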

This convolution method of obtaining the perceived retinal image of a light field reconstructed by a multi-layer display separates the computational effects of rendering a target light field from the inherent effects of the display parameters and ocular factors on the retinal image. It thereby offers a way to investigate how the physical parameters of the display and the ocular factors, such as eye aberrations and accommodation state, affect the retinal image, which in turn offers an opportunity to understand the accommodative responses and visual performance and to develop engineering guidelines for designing a multi-layer LF-3D system.

3. Implementation process and the test setup

Based on the analytical models described in Sec. 2, Fig. 3 illustrates the implementation process for simulating the perceived retinal image and the accumulated PSF. The simulation of the perceived retinal image is divided into several major steps. Following the initialization of the display system setup, including the number of display layers, the locations of the display layers, the pixel pitch, the rendering area (FOV), and the viewing window size, the first step is to computationally render the light field of a target scene based on the factorization described in Sec. 2 and calculate the transmittance values of the attenuation layers; these transmittance values are then utilized to compute the luminance value of each ray of the reconstructed light field. Based on Eq. (3), the second step is to compute the PSF of each elemental view under a given eye accommodative state, which allows us to compute the retinal image of a 3D reconstructed scene based on Eq. (1) and to simulate the accumulated PSF that models the effects of the display engine based on Eq. (2). Meanwhile, we can computationally simulate the integrated intensity distribution of the reconstructed light field observed at a given observation position over the finite area of an eye pupil, from which the perceived retinal image of the reconstructed scene can be obtained based on Eq. (10). All the computations described above are implemented in MATLAB. For the purpose of verification, we also modeled the display engine and eye model in Zemax, which allows us to obtain the elemental view PSF through the optical design software and compare it against the result obtained directly via Eq. (3).

Fig. 3. The implementation process for simulating the perceived retinal image and the accumulated PSF of the multilayer light field display.

To demonstrate the implementation process and characterize the perceived retinal image of a multi-layer LF-3D system, we created a test setup of a two-layer light field display on which the simulation results in the following sections are all based. The key parameters of the test setup are shown in Table 1. The dioptric center of the display is 1 diopter (or 1 meter) away from the eye entrance pupil. The front layer is located at 1.05 diopters and the back layer at 0.95 diopters from the viewing window. The separation h between the layers is around 100 mm. Unless otherwise specified, the pixel pitch of the display layers is set to 0.29 mm, which was intentionally chosen to yield an angular resolution of 1 arcmin per pixel when viewed at a 1 diopter observing distance; 1 arcmin per pixel is considered the nominal resolution target of a high-performance display and also the limit of a standard observer with 20/20 normal vision. The rendering area on the display layers is restricted to match a 3-degree field of view (FOV) to reduce the amount of computation and limit the simulated retinal image to the fovea region. In this simulation, we use the wavelengths of 611 nm, 549 nm, and 464 nm to simulate a full-color light field display, corresponding to the dominant wavelengths of the primaries in the sRGB color space. The relative weights, w, for these wavelengths in Eqs. (1) and (2) were set according to the human eye’s photopic response curve [23]. The viewing window size is set to 10 mm in diameter, and a total of 101 points, at an interval of 0.1 mm, were sampled along each radial direction of the viewing window as the observation sampling.
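The quoted pixel pitch can be verified with one line of arithmetic (a trivial Python check of the stated numbers):

```python
import numpy as np
p, d = 0.29e-3, 1.0                        # pixel pitch [m], viewing distance [m]
print(np.degrees(np.arctan(p / d)) * 60)   # ~1.0 arcmin per pixel (20/20 limit)
```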


Table 1. The parameters of the test setup

In addition, the Arizona Eye Model is selected as the schematic eye model. The entrance pupil diameter of the eye model is assumed to be 3 mm, corresponding to the average pupil size when viewing displays of a typical luminance around 200 cd/m2. The accommodative state, zA, of the eye model is controlled by varying the shape, position, and refractive index of the crystalline lens [23]. A normal observer with 20/20 vision is assumed. To compute the retinal PSF of each elemental view in Eq. (3), the residual wavefront aberration of the eye model, Weye, is calculated based on exact ray tracing of the Arizona Eye Model for normal vision. The weighting function, s, modeling the Stiles-Crawford effect in Eqs. (1) and (2) was modeled by a transmission filter expressed as $s(x,y) = {e^{ - \beta [{{(x - {x_0})}^2} + {{(y - {y_0})}^2}]}}$, where the coefficient β is 0.116 mm−2 and the filter is decentered by 0.47 mm nasally (x0) and 0.2 mm superiorly (y0).
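The Stiles-Crawford apodization filter s, with the parameter values quoted above, is simple to reproduce (a Python sketch of ours; coordinates in mm):

```python
import numpy as np

beta, x0, y0 = 0.116, 0.47, 0.2            # mm^-2, mm (nasal), mm (superior)

def stiles_crawford(x_mm, y_mm):
    """Relative efficiency of a ray entering the pupil at (x_mm, y_mm)."""
    return np.exp(-beta * ((x_mm - x0) ** 2 + (y_mm - y0) ** 2))

print(stiles_crawford(0.0, 0.0))   # near the pupil center: ~0.97
print(stiles_crawford(1.5, 0.0))   # edge of a 3 mm pupil: ~0.88
```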

In order to demonstrate the effects of viewing position in a multi-layer LF-3D system, the eye model position can be shifted within the viewing window. As examples, we simulated the retinal image at eight different eye positions as shown in Fig. 4, namely view 0 to view 7 from the center to the edge of the viewing window.

Fig. 4. The eye positions in the viewing window used for simulation, namely view 0 to view 7.

Figure 5 demonstrates a few simulation results of the test system described above, based on the analytical models in Sec. 2. In this example, the target light field is a simple 2D planar object textured by a black-white grating pattern with a spatial frequency of 5 cycles/degree (cpd), and the target is located 1 diopter away from the viewing window. The target light field was computationally rendered to obtain the attenuation maps using the rendering method described in Sec. 2. The eye model was placed at the center of the viewing window, and its accommodative depth, zA, was set to coincide with the depth of the target object. For the purpose of validation, Fig. 5(a) plots the normalized monochromatic retinal PSF of the on-axis elemental view obtained independently through the numerical calculation implemented in MATLAB based on Eq. (3) and through ray tracing simulation in Zemax. The PSF results from the two methods matched very well, and therefore all the results in the remainder of the paper are based on the analytical methods, for the convenience of integrating the steps of computational light field rendering and retinal image simulation. Figures 5(b) and 5(c) plot the normalized polychromatic accumulated PSF and MTF calculated from Eqs. (2) and (6), respectively. For the purpose of comparison, Fig. 5(d) shows the simulated retinal images of the target object at different stages of the simulation chain. More specifically, the top sub-image is the target light field before being imaged through the eye model, and the second sub-image is the simulated retinal image of the target light field, assuming a real physical target placed at the same depth or an ideally reconstructed light field. The third sub-image is the intensity image of the light field reconstructed on the target plane by the display engine, obtained via Eq. (9), and the bottom sub-image shows the perceived retinal image of the reconstructed light field computed via Eq. (1). All the sub-images are properly scaled to match the retinal image scale. In the following sections, all the perceived retinal images will be illustrated in the same format. In this particular example, it is clear that the display engine is able to reconstruct the target properly, although the perceived retinal image of the reconstructed light field shows noticeably degraded image contrast.

Fig. 5. Implementation results: (a) an example of the normalized elemental view PSF obtained computationally via Eq. (3) and through Zemax simulation; (b) an example of the normalized accumulated PSF computationally obtained via Eq. (2); (c) an example of the MTF of the accumulated PSF computed via Eq. (6); and (d) the simulated retinal images of the center view for a reference light field composed of a 5 cpd square wave.

4. Characterizing the effects of display parameters

As analyzed in Sec. 2, the physical parameters of a multi-layer LF-3D display engine play significant roles in the perceived quality and accuracy of a reconstructed light field. Among the display parameters, the pixel pitch, p, and the layer separation, h, are the two most important in a multi-layer system. One important advantage of analytically simulating the imaging chain is the ability to investigate how each of the physical display parameters affects the perceived light field image without the need to build many variations of physical prototypes. In this section, we demonstrate the use of the model to examine how the physical parameters of a multi-layer system affect the rendered LF quality in terms of spatial resolution, depth, and artifacts.

In a multilayer LF display engine, the pixel pitch of the attenuation layers not only affects the spatial resolution of the attenuation maps, and thus the quality and accuracy of the reconstructed light field due to ray sampling, but also affects the limiting resolution of the perceived retinal image due to the diffraction effects modeled by Eq. (3). Based on the analytical model in Sec. 2, we simulated and compared the performance of two display configurations, one with a pixel pitch of 0.29 mm and another with a pixel pitch of 2 mm for comparison. The pixel pitch of 0.29 mm, corresponding to an angular resolution of 1 arcmin per pixel when viewed at a distance of 1 m, represents the target performance of very high-resolution systems, while the pitch of 2 mm, corresponding to about 6.9 arcmin per pixel at the same distance, represents the typical resolution of many state-of-the-art commercial VR systems. All the other parameters of the display engine and eye model are the same as described in Sec. 3. For each display configuration, the accommodative state of the eye model was set at four different depths, 0, 0.8, 1, and 1.2 diopters, respectively, to examine the effects of accommodation. Figures 6(a) and (c) compare the normalized polychromatic retinal PSFs of an on-axis elemental view, while Figs. 6(b) and (d) compare the accumulated PSFs of the two display configurations. Each figure contains four sub-figures corresponding to the four eye accommodation states of 0, 0.8, 1, and 1.2 diopters, respectively.

Fig. 6. Effects of pixel pitch: (a) and (c) are the normalized elemental view PSFs for two display configurations with pixel pitches of 0.29 mm and 2 mm, respectively, at four eye accommodation distances of 0, 0.8, 1, and 1.2 diopters; (b) and (d) are the accumulated PSFs of the two pixel pitch configurations (0.29 mm and 2 mm) for reconstructing an on-axis object point located at four depths of 0, 0.8, 1, and 1.2 diopters, matching the accommodation distances of the eye model.

It is clear that when the pixel pitch is adequately small (e.g. 0.29 mm), the footprint of each elemental view projected on the eye pupil is small and the retinal PSF of an elemental view is diffraction dominated. As seen from the sub-figures in Fig. 6(a), the eye accommodation state and aberrations have negligible effects on the shapes of the elemental view PSFs and their corresponding accumulated PSFs, as shown in Fig. 6(b), which suggests a large depth of field for 3D reconstruction. On the other hand, when the pixel pitch is relatively large (e.g. 2 mm), diffraction is no longer the dominant effect, and the retinal PSF of an elemental view and the corresponding accumulated PSFs vary dramatically with the state of eye accommodation. As the eye accommodative distance shifts from infinity (0 diopters), corresponding to a parallel beam, closer to the viewer (e.g. 1 diopter), the defocus phase term in Eq. (3) becomes more significant and the accumulated PSFs broaden, as can be easily observed from Figs. 6(c) and 6(d). Naturally, not only a narrow depth of field but also degraded image quality is expected for 3D reconstruction.
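This crossover between diffraction-dominated and defocus-dominated blur can be illustrated with a back-of-envelope comparison, under our simplifying assumptions that the view footprint on the pupil equals the pixel pitch, the ray bundles are collimated, the diffraction blur angle is ~λ/p, and the geometric defocus blur angle is ~p·ΔD for an accommodation mismatch of ΔD diopters:

```python
import numpy as np

lam = 550e-9                              # wavelength [m]
for p in (0.29e-3, 2e-3):                 # view footprint on the pupil [m]
    for dD in (0.0, 1.0):                 # accommodation mismatch [diopters]
        diff_arcmin = np.degrees(lam / p) * 60      # diffraction blur ~ lam/p
        defoc_arcmin = np.degrees(p * dD) * 60      # geometric blur ~ p*dD
        print(f"p={p*1e3:.2f} mm, dD={dD:.0f} D: "
              f"diffraction {diff_arcmin:.1f}', defocus {defoc_arcmin:.1f}'")
# For p=0.29 mm, diffraction (~6.5') dominates regardless of accommodation;
# for p=2 mm, defocus (~6.9' at 1 D) dominates, matching Figs. 6(c)-(d).
```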

To further demonstrate the effects of pixel pitch, we simulated the perceived retinal image of a target light field identical to the one in Fig. 5(d). The reference target was placed at three different depths, 0.8, 1, and 1.2 diopters, respectively. Figures 7(a)-(b) show the simulated images of the reconstructed target for the center view (i.e. view #0) for the two display configurations with 0.29 mm and 2 mm pixel pitch, respectively. Each figure contains three sub-figures corresponding to the three target depths. The eye accommodation depth was always set to coincide with the corresponding target depth. The layout of each sub-figure is organized in the same way as Fig. 5(d). As demonstrated in Fig. 7(a), the small pixel pitch of 0.29 mm can nearly correctly reconstruct the target light fields at the dioptric center of the display engine (1 diopter), but the targets away from the display layers (e.g. 0.8 and 1.2 diopters) are barely reconstructed, with reduced retinal image contrast and noticeable artifacts. For instance, besides the low image contrast, although the period of the grating appears correct, the ratio between the black and white bars is not accurate. On the other hand, the configuration with a large pixel pitch of 2 mm can hardly reconstruct the targets away from the layers. Even for the target very near the layers (e.g. 1 diopter), the reconstructed light field shows more prominent artifacts. The large pixel pitch, however, yields sharper elemental view PSFs at suitable accommodation levels and higher contrast of the retinal images, as shown in Fig. 7(b). Clearly, the trade-offs between the pixel pitch, the elemental view PSF, and the retinal image contrast of the reconstruction shall be considered.

Fig. 7. Simulated retinal images of display configurations with different pixel pitches for three reference targets at depths of 0.8, 1, and 1.2 diopters, respectively, with a frequency of 5 cycles/degree: (a) is for the display with 0.29 mm pixel pitch, where the accommodation distance of the eye model matches the target depth in each sub-figure; (b) shows the perceived retinal images for the display with 2 mm pixel pitch.

As mentioned above, the attenuation layer separation is another important parameter in a multilayer LF display engine. It affects the accuracy of the reconstructed light field and thus the contrast of the perceived retinal image. Based on the analytical model in Sec. 2, we calculated and compared the quality of the perceived retinal images for two display configurations, one with a layer separation of about 100 mm and another with a layer separation of 533 mm. The first configuration places the front layer at 1.05 diopters and the back layer at 0.95 diopters with a 100 mm separation in-between, which is similar to the 80 mm separation of the multilayer LF-3D display prototype in [16]. This configuration represents a direct-view type display, such as a desktop monitor, where a small separation is preferred to keep the physical volume compact. The second configuration places the front layer at 1.25 diopters and the back layer at 0.75 diopters with a 533 mm layer separation. It is chosen to represent a magnified-view type display, such as an HMD, where magnifying optics is placed between the layers and the eye. Viewed through the optics, a small separation of a few millimeters between the physical SLM layers is magnified into a large separation of hundreds of millimeters between the virtual attenuation layers observed by the eye. Both display engines have their dioptric center 1 diopter away from the viewing window. All the remaining parameters of the display engines and the eye model are the same as described in Section 3. The target was placed at three different depths, 0.8, 1, and 1.2 diopters, respectively. The eye accommodation depth is set to match the corresponding target depth. Figures 8(a)-(b) demonstrate the center view of the reconstructed light field for the two display configurations with the different layer separations mentioned above. Each figure includes three sub-figures for the three target depths, and the layout of each sub-figure is the same as in Fig. 5(d). As shown in Fig. 8(a), the configuration with the smaller layer separation (100 mm) can correctly reconstruct the target light field located at the depth of 1 diopter, but can barely reconstruct the targets away from the layers (0.8 and 1.2 diopters), yielding an incorrect ratio between the black and white bars of the reconstruction and low retinal image contrast. On the contrary, as demonstrated in Fig. 8(b), the configuration with the larger layer separation (533 mm) can nearly correctly reconstruct the targets at all three depths, with some minor artifacts. For instance, the reconstruction of the target at the display dioptric center (1 diopter) shows an incorrect ratio between the black and white bars, and its corresponding retinal image shows slightly reduced contrast compared with the center sub-figure of Fig. 8(a). Obviously, the reconstruction performance for targets in-between and away from the display layers is highly dependent on the layer separation. Therefore, the layer separation should be carefully chosen for the required target light field.

Fig. 8. Simulated retinal images of display configurations with different layer separations for three reference targets at depths of 0.8, 1, and 1.2 diopters, respectively, with a frequency of 5 cycles/degree: (a) is the reconstruction for display layers separated by 100 mm and (b) for display layers separated by 533 mm. The dioptric center of both display configurations is located 1 diopter away from the viewing window.

5. Characterizing the effects of view dependency

Equation (1) shows that the perceived retinal image is the integration of the modulated elemental-view PSFs. As analyzed in Sec. 2, the perceived retinal image quality depends not only on the content of a target light field and the computational algorithms used to render it, but also on the viewing position within the viewing window and on how sensitive the target light field is to that position. This section examines these effects on the perceived retinal images of a reconstructed light field. The display configuration used for the simulations is the same as the one described in Sec. 3.
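For concreteness, the weighted integration of Eq. (1) can be expressed as a short numerical sketch. The array names and shapes below are our own illustrative choices; only the weighting and normalization structure follows Eq. (1):

```python
import numpy as np

def perceived_retinal_image(L, w, s, psf):
    """Minimal sketch of Eq. (1): the retinal image as the normalized sum of
    elemental-view PSF intensities weighted by the rendered luminance
    L(m, n, lambda_q), the spectral weight w(lambda_q), and the pupil
    sampling weight s(d_xm, d_yn).

    L   : (M, N, Q) luminance of each elemental view per wavelength sample
    w   : (Q,)      spectral weighting function
    s   : (M, N)    weight of each view position within the pupil
    psf : (M, N, Q, H, W) intensity PSF |PSF_mnq|^2 of each elemental view
    """
    weights = L * w[None, None, :] * s[:, :, None]    # (M, N, Q)
    image = np.einsum('mnq,mnqhw->hw', weights, psf)  # weighted accumulation
    return image / weights.sum()                      # Eq. (1) normalization
```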

We began by investigating how the content of a target light field and its depth affect the quality of reconstruction, and how the perceived retinal image varies with the viewing position. In this simulation, the reference light fields are planar objects textured with Lambertian gratings of spatial frequencies ranging from 1 up to 30 cycles per degree (cpd), and the targets are placed at two depths, 0.8 diopters and 1 diopter. The simulated retinal images make clear that a display engine with an angular resolution of 1 arc minute (30 cpd) and a layer separation of about 100 mm can correctly reconstruct high-frequency content (e.g. 10 cpd) when the reconstruction plane is near the display layers, but only low-frequency content (e.g. 1 cpd) when the reconstruction plane is farther from the layers. Figures 9(a) through 9(c) show examples of the simulated images of reconstructed targets with spatial frequencies of 1, 5, and 7.5 cpd located at 1 diopter from the viewing window (the midpoint of the display layers), while Figs. 9(d) through 9(f) show reconstructed targets with spatial frequencies of 1, 2, and 5 cpd located at 0.8 diopters from the viewing window (about 200 mm away from the back layer). Each figure consists of four sub-figures showing the perceived retinal images at four viewing positions, view #7, #4, #2, and #0. As shown in Figs. 9(a) through 9(c), targets of low frequencies (1 and 5 cpd) were correctly reconstructed with good image contrast across all viewing positions, while targets of higher frequency (7.5 cpd) were nearly correctly reconstructed, with reduced contrast and a slightly incorrect ratio between the black and white bars. On the contrary, as demonstrated in Figs. 9(d) through 9(f), through the same display engine the 1 cpd targets are correctly reconstructed across all viewing positions, the 2 cpd targets are reconstructed with the correct period but an incorrect ratio between the black and white bars, and the 5 cpd targets are barely reconstructed at any viewing position.
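The reference targets in this simulation are square-wave gratings defined in angular units. A minimal sketch of generating such a target, with a hypothetical helper name and an assumed field of view, might look like:

```python
import numpy as np

def square_wave_target(freq_cpd, fov_deg=2.0, samples=1024):
    """Hypothetical helper: a 1D Lambertian square-wave grating defined in
    angular units (cycles per degree), as used for the reference light
    fields in this section. The field of view and sampling are assumptions."""
    theta = np.linspace(-fov_deg / 2, fov_deg / 2, samples)  # visual angle (deg)
    return theta, 0.5 * (1 + np.sign(np.sin(2 * np.pi * freq_cpd * theta)))

theta, target = square_wave_target(freq_cpd=5.0)  # 5 cycles/degree test grating
```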


Fig. 9. Perceived retinal images for targets at 1 diopter and 0.8 diopters viewed from different eye positions. (a)-(c) show the reconstructions for a target distance of 1 diopter with tested frequencies of 1, 5, and 7.5 cpd, respectively; (d)-(f) show the reconstructions for a target distance of 0.8 diopters with tested frequencies of 1, 2, and 5 cpd, respectively.


Figures 10(a) and 10(b) further quantify the simulation above by plotting the intensity profiles of the retinal image across the field for targets placed 0.8 diopters from the viewing window, viewed at the edge (view #7) and the center (view #0) of the viewing window, respectively. Each figure consists of four sub-figures for targets of four frequencies, 1, 3, 5, and 7.5 cpd, and each sub-figure plots the intensity profiles of the retinal image for the original reference light field and for the reconstructed light field. The intensity profiles of the reconstructed light field for both views deviate noticeably from those of the original reference light field except for the low-frequency 1 cpd targets. For targets with frequencies of 3 cpd or higher, the profiles of the reconstructed light fields show not only reduced contrast but also noticeable shape deviations. For instance, the retinal image profiles of the 1 cpd reference light field still maintain a square-wave shape, but the profiles of the targets at 3 cpd and above degrade into sinusoid-like or irregular forms, or into nearly no signal at all. Overall, the results in Fig. 10 further validate the observation that a multilayer LF-3D engine offers limited capability for rendering high-resolution content at depths far from the display layers.
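The contrast reduction described here can be quantified directly from the plotted intensity profiles. The paper does not name its exact contrast metric, so the standard Michelson definition below is an assumption, though a conventional one:

```python
import numpy as np

def michelson_contrast(profile):
    """Michelson contrast of a retinal intensity profile, used here to
    quantify how much the reconstruction degrades relative to the
    reference light field."""
    i_max, i_min = profile.max(), profile.min()
    return (i_max - i_min) / (i_max + i_min)

# Example: a well-reconstructed 1 cpd target retains near-unity contrast,
# while a 5 cpd target far from the layers drops toward zero.
print(michelson_contrast(np.array([0.05, 0.95, 0.05, 0.95])))  # 0.9
```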


Fig. 10. Intensity profiles of the perceived retinal image for a reconstruction distance of 0.8 diopters and a square-wave reference light field with spatial frequencies of 1, 3, 5, and 7.5 cpd, respectively, for (a) the edge view (view #7) and (b) the center view (view #0).


A noteworthy feature of a light field display is its potential to render objects whose appearance changes subtly when viewed from slightly different positions. The following example examines how well a multilayer 3D display can render such view-dependent effects. Instead of the Lambertian textures used above, the view dependence is modeled by modulating the local luminance amplitude of the grating texture of the reference objects with a Gaussian function of the angle between the surface normal and the observation direction. The angular extent of the Gaussian modulation was varied through its standard deviation (σ), from 0.25 to 2 degrees. We repeated the same simulations as above for the new targets. As demonstrated earlier, the display engine is unable to correctly reconstruct objects located far from the display layers at frequencies above 1 cpd; therefore, Figs. 11(a) and 11(b) only show the perceived retinal images for targets located at the midpoint of the display engine, with modulation standard deviations of 0.5 and 0.25 degrees, respectively. Figures 11(a) and 11(b) use reference light fields with spatial frequencies of 2 and 7.5 cpd, respectively. Each figure contains four sub-figures for four viewing positions, view #7, #5, #3, and #0. Besides the simulated retinal images, each sub-figure includes a plot of the amplitude modulation projected onto the retina for the reference light field (red solid line) and for the perceived retinal image of the reconstructed target (blue dashed line). The display engine adequately reconstructs the amplitude modulation for the center view (view #0), where the modulation peaks of the reference and the reconstruction overlap. However, as the viewing position shifts toward the edge of the viewing window, the observed peaks of the reconstructed light field shift sideways. Furthermore, the ability to reconstruct the view-dependent effects is further compromised as the spatial frequency of the target increases; for instance, the reconstructed amplitude modulation in Fig. 11(b) is severely reduced for the 7.5 cpd targets.
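Under the stated model, the view-dependent amplitude is a Gaussian of the off-normal viewing angle. A minimal sketch (the function name is ours) follows:

```python
import numpy as np

def view_dependent_amplitude(angle_deg, sigma_deg):
    """Sketch of the view-dependence model described above: the local grating
    amplitude is modulated by a Gaussian of the angle between the surface
    normal and the observation direction (sigma swept from 0.25 to 2 deg)."""
    return np.exp(-0.5 * (angle_deg / sigma_deg) ** 2)

# A viewing direction 0.5 deg off the surface normal with sigma = 0.25 deg
# retains only ~14% of the on-axis amplitude, so the effect is strongly
# view-selective at the smallest sigma tested:
print(view_dependent_amplitude(0.5, 0.25))  # ~0.135
```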


Fig. 11. Perceived retinal images for view-dependent reference light fields. (a) Reconstruction for a modulation function with a standard deviation of 0.5 degrees and a target frequency of 2 cpd. (b) Reconstruction for a modulation function with a standard deviation of 0.25 degrees and a target frequency of 7.5 cpd.


6. Characterizing the accommodative response

The perceived retinal image quality largely depends on the eye accommodation, because both the images of the elemental views and their displacements vary with the accommodative state. Since retinal image blur is the sole true stimulus for eye accommodation, the blur rendered by a multilayer LF display may drive the eye to accommodate at a depth other than the target depth in order to balance the blur contributions from the elemental views and their displacements. To characterize the eye accommodative response to a multilayer LF display, the accommodative state of the schematic eye model is varied through its optical parameters to find the sharpest retinal image. The accommodation distance that yields the maximum retinal image contrast in the MTF and the maximum contrast gradient is taken as the actual accommodative response of the eye to a reconstructed light field. The difference between this actual accommodative response and the rendered depth is defined as the accommodation error.
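The search procedure described here amounts to sweeping the accommodative state and picking the distance of maximum contrast. A schematic sketch, assuming a wrapper `mtf_at(z_A, f)` around the accumulated-PSF MTF of Eqs. (2) and (6), is:

```python
import numpy as np

def accommodative_response(z_grid, mtf_at, freq_cpd):
    """Sweep the accommodative state of the eye model over z_grid (diopters)
    and return the distance that maximizes the retinal image contrast at the
    test frequency. `mtf_at(z_A, f)` is an assumed wrapper around the
    accumulated-PSF MTF, not a function defined in the paper."""
    contrast = np.array([mtf_at(z_A, freq_cpd) for z_A in z_grid])
    return z_grid[np.argmax(contrast)]

# accommodation_error = accommodative_response(grid, mtf_at, 4.0) - target_depth
```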

For a multilayer LF display, there is no unique mapping between a pixel and an elemental view of a reconstructed 3D point; the luminance value of each elemental view depends strongly on the rendered content. To evaluate the accommodative response of a multilayer LF display independently of the rendered content and the rendering algorithm, we use the accumulated PSF and its MTF, calculated by Eqs. (2) and (6), to predict the perceived retinal image performance and the eye accommodative responses that are mainly driven by the display configuration and the ocular parameters. As discussed in Sec. 2, the accumulated PSF and MTF formulation neglects the artifacts introduced by content rendering and assumes a perfect reconstruction of the target light field. Therefore, unlike for the InI-based LF display systems in [24,25], the accommodative responses derived from the accumulated PSF and MTF of a multilayer LF display provide an estimate for a given system; the actual response to rendered content may deviate slightly due to inaccurate rendering.

To demonstrate the accommodative response, we simulated the retinal image contrast in the MTF for the same display engine and eye model configuration as described in Sec. 3. Figures 12(a)-(e) plot the MTF of the perceived retinal image as a function of the eye accommodation shift $\Delta z$ for five target depths, 0.2, 0.6, 1, 1.4, and 1.8 diopters, respectively. Each figure includes plots for targets of four spatial frequencies, 2, 4, 6, and 8 cycles per degree. In each figure, the red arrow marks the rendered target depth, while the black arrow indicates the actual accommodation distance yielding the maximum image contrast. Across the simulated depth range from 0.2 to 1.8 diopters, the actual accommodative depth is clearly shifted negatively, toward farther distances, due to the parallel-beam nature of each elemental view. The difference between the target distance and the accommodative response is the accommodation error, which is summarized in Fig. 12(f) for targets at different depths. The red line shows the accommodative response to a natural 3D scene for comparison, and the fitted lines show the accommodative responses to the multilayer light field display for targets of different frequencies. The slopes are similar to that of natural viewing, but the intercepts are all shifted by 0.2 to 0.3 diopters.
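The fitted lines in Fig. 12(f) can be reproduced in form, not in data, with a simple linear fit; the response values below are synthetic placeholders consistent with the reported 0.2-0.3 diopter far shift, not the paper's results:

```python
import numpy as np

# Placeholder data: responses assumed to track target depth with slope ~1
# and a ~0.25 D shift toward far, mirroring the trend reported above.
target_depths = np.array([0.2, 0.6, 1.0, 1.4, 1.8])  # rendered depths (diopters)
responses = target_depths - 0.25                     # assumed far-shifted responses

slope, intercept = np.polyfit(target_depths, responses, 1)
print(slope, intercept)  # ~1.0 and ~-0.25 for this synthetic example
```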


Fig. 12. (a)-(e) MTF plots of the perceived retinal images as a function of the eye accommodation shift for five target distances, 0.2, 0.6, 1, 1.4, and 1.8 diopters, respectively, with target frequencies of 2, 4, 6, and 8 cpd. (f) Accommodation error as a function of target depth for targets of different frequencies in a multilayer LF display, compared with natural viewing.


To further illustrate the accommodative response, Figs. 13(a) through 13(c) show the perceived retinal images of the central view for three eye accommodation distances, 0.8, 1, and 2 diopters, respectively, for a 4 cpd target located at the depth of 1 diopter. The display configuration is the same as in Fig. 12, so the result visually illustrates the plot in Fig. 12(c). Among the three accommodation states, 0.8 diopters corresponds to the depth of maximum image contrast for the 4 cpd target, 1 diopter corresponds to zero accommodation shift (i.e. it matches the rendered depth marked by the red arrow), and 2 diopters is the maximum accommodation shift simulated. At 0.8 diopters the perceived retinal image reaches its maximum contrast; at 1 diopter the image has only slightly lower contrast and is visually similar to the 0.8-diopter case, matching the prediction in Fig. 12(c); at 2 diopters the retinal image contrast drops significantly.


Fig. 13. The perceived retinal image for a rendered object located at 1 diopter (1000 mm) while the eye accommodation distance is (a) 0.8 diopters, (b) 1 diopter, and (c) 2 diopters, respectively.


7. Conclusion

We described a generalized, systematic analysis method for multilayer light field displays. The method is based on a generalized model of a multilayer light field display that allows the display parameters to be varied as required, calculates the perceived retinal image, renders the view dependency of a reference light field, and predicts the accommodative response of an observer. The model accounts not only for the display factorization and reconstruction of the rendered content but also for the display factors, the ocular factors, and the diffraction effect. We explained the calculation of the elemental-view PSF and the perceived retinal image, and investigated the influence of the pixel pitch and the layer separation on display quality. We also examined the view dependency of a multilayer display and how the content, depth, and subtle view-dependent appearance of a target light field affect the quality of reconstruction. Finally, we characterized the accommodative response based on the MTF of the perceived retinal images and investigated the resulting accommodation error. In the future, we plan to develop an evaluation method for multilayer light field display systems based on this analysis, along with guidelines for optimizing the design of multilayer light field displays.

Disclosures

Dr. Hong Hua has a disclosed financial interest in Magic Leap Inc. The terms of this arrangement have been properly disclosed to The University of Arizona and reviewed by the Institutional Review Committee in accordance with its conflict of interest policies.

References

1. D. M. Hoffman, A. R. Girshick, K. Akeley, and M. S. Banks, “Vergence–accommodation conflicts hinder visual performance and cause visual fatigue,” J. Vis. 8(3), 33 (2008). [CrossRef]  

2. G. Westheimer, “The maxwellian view,” Vision Res. 6(11-12), 669–682 (1966). [CrossRef]  

3. R. Konrad, N. Padmanaban, K. Molner, E. A. Cooper, and G. Wetzstein, “Accommodation-invariant Computational Near-eye Displays,” ACM Trans. Graph. 36, 1–12 (2017). [CrossRef]  

4. S. Shinichi, O. Katsuyuki, and K. Fumio, “Proposal for a 3-D display with accommodative compensation: 3DDAC,” J. Soc. Inf. Disp. 4(4), 255–261 (1996). [CrossRef]  

5. S. Takashi, K. Takashi, O. Keiji, O. Masaki, M. Nobuyuki, Y. Yoshihiro, and I. Tsuneto, “Stereoscopic 3-D display with optical correction for the reduction of the discrepancy between accommodation and convergence,” J. Soc. Inf. Disp. 13(8), 665–671 (2005). [CrossRef]  

6. S. Liu, D. Cheng, and H. Hua, "An optical see-through head mounted display with addressable focal planes," 2008 7th IEEE/ACM International Symposium on Mixed and Augmented Reality, 33–42 (2008).

7. J. P. Rolland, M. W. Krueger, and A. Goon, “Multifocal planes head-mounted displays,” Appl. Opt. 39(19), 3209–3215 (2000). [CrossRef]  

8. S. Liu and H. Hua, “Time-multiplexed dual-focal plane head-mounted display with a liquid lens,” Opt. Lett. 34(11), 1642 (2009). [CrossRef]  

9. S. Liu and H. Hua, “A systematic method for designing depth-fused multi-focal plane three-dimensional displays,” Opt. Express 18(11), 11562 (2010). [CrossRef]  

10. Q. Gao, J. Liu, X. Duan, T. Zhao, X. Li, and P. Liu, “Compact see-through 3D head-mounted display based on wavefront modulation with holographic grating filter,” Opt. Express 25(7), 8412 (2017). [CrossRef]  

11. A. Maimone, A. Georgiou, and J. S. Kollin, “Holographic Near-eye Displays for Virtual and Augmented Reality,” ACM Trans. Graph. 36(4), 1–16 (2017). [CrossRef]  

12. H. Yeom, H. Kim, S. Kim, and J. Park, “Design of holographic Head Mounted Display using Holographic Optical Element,” in 2015 Conference on Lasers and Electro-Optics Pacific Rim, (Optical Society of America, 2015), paper 27P_105.

13. H. Hua and B. Javidi, “A 3D integral imaging optical see-through head-mounted display,” Opt. Express 22(11), 13484 (2014). [CrossRef]  

14. A. Maimone and H. Fuchs, "Computational augmented reality eyeglasses," 2013 IEEE International Symposium on Mixed and Augmented Reality (ISMAR), Adelaide, SA, 29–38 (2013).

15. F. Huang, K. Chen, and G. Wetzstein, “The Light Field Stereoscope: Immersive Computer Graphics via Factored Near-eye Light Field Displays with Focus Cues,” ACM Trans. Graph. 34(4), 60–61 (2015). [CrossRef]  

16. G. Wetzstein, D. Lanman, M. Hirsch, and R. Raskar, “Tensor Displays: Compressive Light Field Synthesis Using Multilayer Displays with Directional Backlighting,” ACM Trans. Graph. 31(4), 1–11 (2012). [CrossRef]  

17. Y. Takaki and N. Nago, “Multi-projection of lenticular displays to construct a 256-view super multi-view display,” Opt. Express 18(9), 8824 (2010). [CrossRef]  

18. D. Lanman and D. Luebke, “Near-eye Light Field Displays,” ACM Trans. Graph. 32(6), 1–10 (2013). [CrossRef]  

19. A. Maimone, D. Lanman, K. Rathinavel, K. Keller, D. Luebke, and H. Fuchs, “Pinlight Displays: Wide Field of View Augmented Reality Eyeglasses Using Defocused Point Light Sources,” ACM Trans. Graph. 33(4), 1–11 (2014). [CrossRef]  

20. H. Hua, “Advances in Head-Mounted Light-Field Displays for Virtual and Augmented Reality,” Inf. Disp. 32, 14–21 (2016). [CrossRef]  

21. D. Lanman, G. Wetzstein, M. Hirsch, and R. Raskar, “Depth of Field Analysis for Multilayer Automultiscopic Displays,” J. Phys.: Conf. Ser. 415, 012036 (2013). [CrossRef]  

22. G. Wetzstein, D. Lanman, W. Heidrich, and R. Raskar, “Layered 3D,” ACM Trans. Graph. 30(4), 1 (2011). [CrossRef]  

23. J. Schwiegerling, Field guide to visual and ophthalmic optics (SPIE Press, 2004).

24. H. Huang and H. Hua, “Systematic characterization and optimization of 3D light field displays,” Opt. Express 25(16), 18508–18525 (2017). [CrossRef]  

25. H. Huang and H. Hua, “Effects of ray position sampling on the visual responses of 3D light field displays,” Opt. Express 27(7), 9343–9360 (2019). [CrossRef]  

26. H. Hua, “Enabling Focus Cues in Head-Mounted Displays,” Proc. IEEE 105(5), 805–824 (2017). [CrossRef]  

27. A. Maimone and H. Fuchs, "Computational augmented reality eyeglasses," 2013 IEEE International Symposium on Mixed and Augmented Reality (ISMAR), 29–38 (2013).





Equations (10)

$$I_{retina}(x,y,z_A)=\frac{\sum\limits_{n=1}^{N}\sum\limits_{m=1}^{M}\sum\limits_{q=1}^{Q}L(m,n,\lambda_q)\,w(\lambda_q)\,s(d_{x_m},d_{y_n})\,\left|PSF_{mnq}(x,y,z_A)\right|^2}{\sum\limits_{n=1}^{N}\sum\limits_{m=1}^{M}\sum\limits_{q=1}^{Q}L(m,n,\lambda_q)\,w(\lambda_q)\,s(d_{x_m},d_{y_n})}\tag{1}$$

$$PSF_{Acc}(x,y,z_A)=\frac{\sum\limits_{n=1}^{N}\sum\limits_{m=1}^{M}\sum\limits_{q=1}^{Q}w(\lambda_q)\,s(d_{x_m},d_{y_n})\,\left|PSF_{mnq}(x,y,z_A)\right|^2}{\sum\limits_{n=1}^{N}\sum\limits_{m=1}^{M}\sum\limits_{q=1}^{Q}w(\lambda_q)\,s(d_{x_m},d_{y_n})}\tag{2}$$

$$PSF_{mnq}(x,y,z_A)=\frac{e^{j\frac{2\pi}{\lambda_q}z_{retina}}\,e^{j\frac{\pi}{\lambda_q z_{retina}}(x^2+y^2)}}{j\lambda_q z_{retina}}\iint P(x'-d_{x_m},\,y'-d_{y_n})\,e^{j\mathbf{k}_{mn}\cdot\mathbf{r}}\,\exp\!\left[j\frac{2\pi}{\lambda_q}W_{eye}(d_{x_m},d_{y_n},x',y',\lambda_q,z_A)\right]\exp\!\left[j\frac{\pi}{\lambda_q}\frac{1}{z_A}(x'^2+y'^2)\right]\exp\!\left[-j\frac{2\pi}{\lambda_q}\left(\frac{x}{z_{retina}}x'+\frac{y}{z_{retina}}y'\right)\right]dx'\,dy'\tag{3}$$

$$P(x'-d_{x_m},\,y'-d_{y_n})=\mathrm{rect}\!\left(\frac{\alpha(x'-d_{x_m})}{p},\,\frac{\alpha(y'-d_{y_n})}{p}\right)\tag{4}$$

$$d_{x_m}=z\,\frac{\mathbf{k}_{mn}\cdot\mathbf{e}_x}{\mathbf{k}_{mn}\cdot\mathbf{e}_z};\qquad d_{y_n}=z\,\frac{\mathbf{k}_{mn}\cdot\mathbf{e}_y}{\mathbf{k}_{mn}\cdot\mathbf{e}_z}\tag{5}$$

$$MTF_{Acc}(\xi,\eta)=\frac{\iint PSF_{Acc}(x,y)\,\exp[-j(\xi x+\eta y)]\,dx\,dy}{\iint PSF_{Acc}(x,y)\,dx\,dy}\tag{6}$$

$$L(m,n)=L_0\,t_1(m,n)\,t_2(m,n)\tag{7}$$

$$\log\!\left(\frac{L(m,n)}{L_0}\right)=\log\big(t_1(m,n)\big)+\log\big(t_2(m,n)\big)\tag{8}$$

$$I_{Rec}=\frac{1}{NM}\sum_{n=1}^{N}\sum_{m=1}^{M}L(m,n)\tag{9}$$

$$I_{retina}(x,y,z_A)=I_{Rec}(x,y)\otimes PSF_{Acc}(x,y,z_A)\tag{10}$$