Optimizing multiview video plus depth retargeting technique for stereoscopic 3D displays

Feng Shao; Wenchong Lin; Randi Fu; Mei Yu; Gangyi Jiang

doi:10.1364/OE.25.012478

1. Introduction

Due to the creation of immersive realistic experiences with depth perception, great variety of developments in 3D display technologies have been occurred in the last years. However, even immersive realistic experiences can be provided by 3D displays, the main challenge to hinder its extensive applications is how to provide satisfactory Quality of Experience (QoE) for users [1]. Therefore, simply extensions of the existing 2D techniques (e.g., image retargeting) to 3D applications usually fail to deliver a satisfactory QoE for users, because the QoE models in 2D and 3D semantics are different [2]. Although some promising holographic, integral imaging and Super MultiView displays [3] are being developed to offer a high QoE, the essence of QoE optimization considered in this work is to eliminate intrinsic content limitations (e.g., accommodation-vergence conflict [4,5]) by taking advantage of new means.

With the diversity of various stereoscopic 3D (S3D) display devices, such as phones, table PCs and TVs, similar to 2D media, 3D media also requires content adaptation to different display devices [6]. In addition to adapting to the device spatial resolution (retargeting along the x and y directions), for S3D displays, depth adaptation is especially critical to adapt the comfort zone of different displays [7], otherwise, undesirable visual discomfort may be occurred. Particularly, the depth ranges should be adjustable on different S3D displays to avoid accommodation-vergence conflicts of the scene. Without a specific explanation, S3D displays in this work refer to binocular 3D displays. According to the accessibility of depth information, sources for S3D displays mainly originate from two aspects, i.e., fixed-viewpoint stereoscopic images and unfixed-viewpoint multiview video plus depth (MVD) data [8]. Currently, many works have been focused on designing effective coding methods for MVD data [9–12]. To manipulate the depth according to different sources, in the first aspect, depth adaptation is achieved by changing the disparity/depth ranges of stereoscopic images [13,14]. In the second aspect, depth map is adjusted according to specific display requirements, and is aided to generate stereoscopic images by depth-image-based rendering (DIBR) technique [15]. However, these methods still rest on the original image resolution to adjust the depth information, but if the display device is changed, it is desirable to control the resolution and depth simultaneously for S3D displays. Since existing depth adaptation methods cannot retarget the size of stereoscopic image, the advantage of this work is to generate resolution and depth adjustable stereoscopic image pairs from MVD data by retargeting.

This work focuses on optimizing the resolution and depth of MVD data to adapt to the comfort zone of the display collecting camera parameters of virtual camera for view synthesis, which are not presented in fixed-viewpoint stereoscopic image system. This paper provides new insights on how the selected virtual camera affects the retargeting process and further affects the 3D perception of view synthesis. A novel method, the Multiview Video plus Depth ReTargeting (MVDRT), aiming at providing a solution to optimize the factors, such as image resolution, depth range, and the perceived visual comfort. Previous methods only consider fixed-viewpoint depth remapping or content retargeting, while the MVDRT utilizes the parameters of virtual camera and viewing conditions to optimize the 3D perception of view synthesis. This optimization provides a very valuable tool to guide the retargeting for MVD data. The major advantages of the MVDRT are as follows:

1) The optimization is performed in display space so that the viewing conditions (including display device and viewing distance) are propagated to the warping process. Thus, it aggregates the contributions of different parameters that influence the 3D perception of view synthesis. Also, the MVDRT takes camera parameters of virtual camera into account.
2) Besides considering the influence of various factors on view synthesis, the perceived visual comfort of the virtual view (influenced by depth range and consistent deformation of the reference views) are particularly designed to address the viewing experience in a more natural way. With these adjustable parameters, the MVDRT has the flexibility for S3D displays.
3) We simultaneously adjust the horizontal, vertical and depth coordinates in a uniform optimization framework. To obtain the retargeted depth map, we use a five-parameter fitting function to remap the original depth range to the target one. It has been validated that the depth remapping step has better performance in view synthesis than previous approaches.

The rest of the paper is organized as follows: we first review the related work in Section 2, detail our method in Section 3, and present results in Section 4 and discussion in Section 5.

2. Related work

3D media retargeting. The problem of 2D image/video retargeting has received much attention over the recent years, followed the classification of discrete and continuous methods [16–18]. Towards retargeting for 3D media, some approaches have been proposed. Seam carving technique is to determine the seams for one view of the stereoscopic image, and find the corresponding seams in another view based on the correspondence constraint as well as the spatial coherence [19–21]. However, seam caving for stereoscopic image will cause shape distortion in or between left and right images. Also, due to depth preservation essence of the seam carving methods, their depth adaptation on different display devices is comparatively poor.

For continuous methods, warping-based retargeting is performed to solve mesh deformation problem with several deformation constraints. The key of this type of methods is to uniformly distribute more distortion on those less important regions and less distortion on more important regions. Chang et al. [22] presented a content-aware method for editing stereoscopic image to different displays, which optimizes energy function based on sparse stereo correspondences and disparity consistency constraints. Lin et al. [23] proposed an object-coherence warping method for content-aware stereoscopic image retargeting, in which the object correspondences between the left and right images are utilized to generate an object-based significance map and preserve object consistency during warping. Li et al. [24] presented a warping-based stereo retargeting framework by imposing both shape preservation and depth preservation constraints, in which region-wise depth-preserving constraints are derived to control the local warping functions for depth preservation. Lee et al. [25] proposed a layer-based stereoscopic image resizing method, in which each layer is warped by its own mesh deformation and the warped layers are composited together to form the retargeted images. However, although these methods can handle MVD data, it is limited to altering the perceived depth without the constraint from user’s visual experience.

Depth adaptation. Another way for depth adaptation is to retarget depth/disparity range, or control depth perception or visual comfort for the 3D content. Disparity shifting methods are relatively simple that adjust the zero disparity plane (ZDP) of a scene but maintain its overall disparity range [26]. Various linear and nonlinear disparity/depth remapping methods are developed to adjust the comfortable viewing zone [27], or adjust the depth perception [28], or manipulate the perceived depth [29], with the aims to enhance the 3D visual experience. However, these methods do not take into account the distortion from display resolution, and their depth adaptation to this situation is poor.

To optimize the disparity/depth by warping, Lang et al. [13] presented a set of disparity mapping operators for stereoscopic images that uses disparity and saliency estimation to compute a deformation of the input views to meet the target disparities. Yan et al. [14] used image warping to simultaneously adjust the positions of image pixels in the left and right images in order to remap the depth range of stereoscopic image. Wang et al. [30] proposed stereoscopic video disparity adjustment framework that takes disparity range, motion, and stereoscopic window violation into account. However, these methods only focus on disparity adjustment for stereoscopic contents, and develop several disparity mapping operations with consistent size constraint. If considering the distortions from different spatial resolutions and different camera geometries, these methods may fail in these situations.

Although multiple virtual views can be conveniently generated via DIBR technique, depth adaptation for view synthesis is considered in some methods. Huang et al. [31] presented a warping-based method for synthesizing virtual views from a binocular stereoscopic image based on dense feature correspondences instead of using DIBR. Li et al. [32] investigated the influences of stereoscopic parameters on multiview synthesis in S3D display and presented a parameter adjustment method for reducing visual discomforts and geometric distortions. Lei et al. [33] proposed a projection-based disparity control method to generate multiple virtual views by shifting the ZDP. However, in the practical applications, it is expected to view the virtual views naturally and comfortably under different viewing conditions. Thus, the virtual camera parameters should be taken into account in synthesizing the multiple virtual views.

In order to illustrate the limitations of the existing 3D retargeting and depth adaptation methods, we list the main properties of some representative works in Table 1. Considering resolution and depth adaptation to generate virtual views from MVD data in this work, we combine image retargeting and depth adaptation techniques to present a possible solution. The motivation of this work mainly comes from two aspects:

Table 1. Main properties of the representative works

View Table | View all tables in this article

1) Although we can separately retarget color and depth videos, depth is not a simple spatial dimension to be viewed, but inversely to generate the virtual views. Therefore, the resolution and depth should be optimized in a uniform framework.
2) The factors considered in MVD retargeting will be not the same with those in stereoscopic image retargeting, because the factors affected the 3D perception of view synthesis should be addressed to present the viewing experience in a more natural way.

3. The proposed method

3.1 Framework overview

We propose a retargeting framework that aims to generate resolution and depth adjustable stereoscopic images from MVD data. As shown in Fig. 1, taking two-view MVD data as input (denoted as L-Reference and R-Reference respectively), we first transform the input MVD data to the display space based on the target displays and viewing conditions. Then, warping energy function is constructed based on the constraints of shape preservation, line bending, and visual comfort. The energy function is used to optimize the mesh warping procedure to generate the retargeted reference images and the corresponding retargeted depth maps. Finally, we generate the virtual view to construct the stereoscopic images by DIBR. Different with the existing stereoscopic image retargeting methods that preserve the original depth information after retargeting, our method simultaneously optimizes the resolution and depth to match the target displays and viewing conditions.

Fig. 1 Illustration of our proposed MVD retargeting framework.

Download Full Size | PDF

3.2 Relationship between the depth image and the display space

Figure 2 illustrates cameras geometry in different spaces. The scene is captured by a reference camera in 3D space shown in (a), while the viewer perceives the 3D scene in the display space shown in (b). The baseline between reference and virtual cameras is not equal to the interocular distance between two eyes, and the depth of the convergence plane is not equal to the viewing distance. Considering the effects of different displays and viewing conditions, the camera arrangements used for the creation of stereoscopic contents are not unique. Therefore, transformation from the 3D space to display space is the first step to fulfill the following optimization. Given the L-Reference data ( $I_{L} (x, y)$ and $D_{L} (x, y)$ ), the real depth z_x_,_y of a pixel point in $I_{L} (x, y)$ can be calculated from the quantized depth value D_x_,_y in $D_{L} (x, y)$ by

z_{x, y} = \frac{1}{\frac{D_{x, y}}{255} (\frac{1}{z_{n e a r}} - \frac{1}{z_{f a r}}) + \frac{1}{z_{f a r}}}

where z_near and z_far are the farthest and nearest depth values.

Fig. 2 Illustration of camera geometry in different spaces.

Download Full Size | PDF

Considering the relationship between the camera coordinate and 3D world coordinate, the imaging geometry of the reference camera is formulated by projecting a matrix $[u_{L}, v_{L}, 1]$ in the camera coordinate to the matrix $[X, Y, Z]$ in the world coordinate as follows:

z_{x, y} \cdot [\begin{matrix} u_{L} \\ v_{L} \\ 1 \end{matrix}] = [\begin{matrix} f_{u} & 0 & u_{0} \\ 0 & f_{v} & v_{0} \\ 0 & 0 & 1 \end{matrix}] \cdot [\begin{matrix} r_{11} & r_{12} & r_{13} & t_{x} \\ r_{21} & r_{22} & r_{23} & t_{y} \\ r_{31} & r_{32} & r_{33} & t_{z} \end{matrix}] \cdot [\begin{matrix} X \\ Y \\ Z \end{matrix}]

\begin{array}{l} [\begin{matrix} X \\ Y \\ Z \end{matrix}] = [\begin{matrix} r_{11} & r_{12} & r_{13} \\ r_{21} & r_{22} & r_{23} \\ r_{31} & r_{32} & r_{33} \end{matrix}] \cdot {[\begin{matrix} f_{u} & 0 & u_{0} \\ 0 & f_{v} & v_{0} \\ 0 & 0 & 1 \end{matrix}]}^{- 1} \cdot [\begin{matrix} u_{L} \\ v_{L} \\ 1 \end{matrix}] \cdot z_{x, y} + [\begin{matrix} t_{x} \\ t_{y} \\ t_{z} \end{matrix}] \\ = R_{1} A_{1}^{- 1} [\begin{matrix} u_{L} \\ v_{L} \\ 1 \end{matrix}] \cdot z_{x, y} + T_{1} \end{array}

where f_u and f_v are the focal lengths along the x and y axes, and (u₀,v₀) is the coordinate of the principal point, A₁ (3 × 3) and R₁ (3 × 3) are the intrinsic and rotation matrices of the reference camera, and T₁ (3 × 1) is the translation vector of the reference camera. Taking the virtual camera into consideration, the same world coordinate

[X, Y, Z]

is projected into the virtual image plane via

[\begin{matrix} u' \\ v' \\ w' \end{matrix}] = A_{2} R_{2}^{- 1} \cdot [\begin{matrix} X \\ Y \\ Z \end{matrix}] - A_{2} R_{2}^{- 1} T_{2}

where

(u', v', w')

is the homogeneous coordinates of the virtual image plane, and A₂, R₂ and T₂ are the intrinsic matrix, rotation matrix and translation vector of the virtual camera, respectively. The corresponding pixel location in the virtual image plane is

(u_{v}, v_{v}) = (u' / w', v' / w')

. The matrices A₁, R₁ and T₁, as well as A₂, R₂, T₂, are known for specific cameras. Since

(u_{L}, v_{L})

and

(u_{V}, v_{V})

are the same points in both cameras, the disparity d between the reference and virtual cameras is

d = u_{V} - u_{L}

From (5), it can be seen that the binocular disparity d incorporates the effects of camera distance (T₁ and T₂), focus length (A₁ and A₂) and camera rotation (R₁ and R₂), as well as the distance of objects to the camera (z). These parameters are common for a given camera arrangement. However, as demonstrated in Figs. 2(a) and 2(b), scene geometry from different camera sensors and screens will be not the same. Therefore, the distortions in image plane may be not the same with the perceived distortions on displays. Since the actual size of a real scene is generally too large or too small to be reproduced on different displays, the scaling on all dimensions besides horizontal and vertical dimensions is necessary. Taking the left and virtual images as a stereoscopic image pair, based on the principal of binocular vision, the point coordinates perceived by the viewer in the display space can be determined as follows [32]:

{\begin{matrix} X_{D} = \frac{d_{e} \cdot (u_{L} + \frac{d_{e}}{2})}{d_{e} - d \cdot (W / R)} - \frac{d_{e}}{2} \\ Y_{D} = \frac{d_{e} \cdot v_{L}}{d_{e} - d \cdot (W / R)} \\ Z_{D} = \frac{d_{e} \cdot L_{D}}{d_{e} - d \cdot (W / R)} \end{matrix}

where L_D be the viewing distance from the viewer to the screen, d_e be the interocular distance between the viewer’s two eyes, W and R denote the width and horizontal resolution of display. By optimizing the horizontal, vertical and depth coordinates in the display space, the factors of displays (W and R) and viewing conditions (L_D) are integrated to promote user’s experience.

3.3 Construction of warping energy function

For resolution and depth adaptation, warping energy optimization is a feasible way to describe the relationship among camera arrangement, display, and stereoscopic viewing conditions. These factors are usually utilized in the energy functions to characterize the overall QoE perceived by the viewers when watching 3D content. In the practical situations, the QoE will be affected by different viewing conditions. Given the inputs $I_{L} (x, y)$ and $D_{L} (x, y)$ , a set of 2D grid meshes with size of 40 × 40 in the reference image plane are denoted as $U_{L} = {U_{L, k}}$ , $U_{L, k} = {u_{L, k}^{1}, u_{L, k}^{2}, u_{L, k}^{3}, u_{L, k}^{4}}$ . For each vertex (taking $u_{L, k}^{1}$ as example), it has $u_{L, k}^{1} = {x_{L, k}^{1}, y_{L, k}^{1}}$ , where $x_{L, k}^{1}$ and $y_{L, k}^{1}$ denote the horizontal and vertical coordinates in the reference image plane, respectively. By transforming the coordinates to the display space by (6), a set of 3D surface meshes $V = {V_{k}}$ are created to cover the reference and virtual cameras. For each 3D surface mesh V_k, it consists of four vertices, $V_{k} = {v_{k}^{1}, v_{k}^{2}, v_{k}^{3}, v_{k}^{4}}$ , where $v_{k}^{1}$ , $v_{k}^{2}$ , $v_{k}^{3}$ and $v_{k}^{4}$ denote the left-top, left-bottom, right-top, and right-bottom vertices, respectively. The coordinate of each vertex (taking $v_{k}^{1}$ as example) is represented as $v_{k}^{1} = {X_{k}^{1}, Y_{k}^{1}, Z_{k}^{1}}$ , where $X_{k}^{1}$ , $Y_{k}^{1}$ and $Z_{k}^{1}$ denote the horizontal, vertical and depth coordinates in the display space, respectively.

1) Shape preservation energy: Since our method uses four vertices of a mesh to control the selected object, the shape and depth of the selected object may be deformed while implementing warping. This problem is solved by introducing a shape preservation energy term that preserves the shapes of the high-significance objects as much as possible, while allows to conduct large deformation for those low-significance objects. For each mesh V_k with four vertices, the distortion energy for the mesh is defined as

E_{S P} (V_{k}) = {‖ ρ_{k} V_{k} - {\tilde{V}}_{k} ‖}^{2}

where

ρ_{k}

is a scale parameter to be optimized. By a linear least-squares solution

A P = b

, we obtain the optimal

ρ_{k}

as

ρ_{k} = {(A_{k}^{T} A_{k})}^{- 1} A_{k}^{T} {\tilde{b}}_{k}

where

A_{k} = [\begin{matrix} X_{k}^{1} & - Y_{k}^{1} & 0 & 1 & 0 & 0 \\ Y_{k}^{1} & X_{k}^{1} & 0 & 0 & 1 & 0 \\ 0 & 0 & Z_{k}^{1} & 0 & 0 & 1 \\ ⋮ & ⋮ & ⋮ & ⋮ & ⋮ & ⋮ \\ X_{k}^{4} & - Y_{k}^{4} & 0 & 1 & 0 & 0 \\ Y_{k}^{4} & X_{k}^{4} & 0 & 0 & 1 & 0 \\ 0 & 0 & Z_{k}^{4} & 0 & 0 & 1 \end{matrix}], {\tilde{b}}_{k} = [\begin{array}{l} {\tilde{X}}_{k}^{1} \\ {\tilde{Y}}_{k}^{1} \\ {\tilde{Z}}_{k}^{1} \\ ⋮ \\ {\tilde{X}}_{k}^{4} \\ {\tilde{Y}}_{k}^{4} \\ {\tilde{Z}}_{k}^{4} \end{array}]

Considering visual importance of each mesh, the total shape preservation energy is the weighted sum of the distortions of all meshes, defined as

\begin{array}{l} E_{S D} = \sum_{V_{k} \in V} S (k) \cdot {‖ A_{k} {(A_{k}^{T} A_{k})}^{- 1} A_{k}^{T} {\tilde{b}}_{k} - {\tilde{b}}_{k} ‖}^{2} \\ = \sum_{V_{k} \in V} S (k) \cdot {‖ C_{k} {\tilde{b}}_{k} ‖}^{2} \end{array}

where

C_{k} = A_{k} {(A_{k}^{T} A_{k})}^{- 1} A_{k}^{T} - I

and I is the identity matrix. To acquire the visual importance of each mesh, 3D saliency and depth sensitivity of each vertex are combined by:

S_{i} (k) = ζ_{i} (k) + α \cdot ψ_{i} (k)

where α is weigh for depth sensitivity. Here, the 3D saliency value

ζ_{i} (k)

and depth sensitivity value

ψ_{i} (k)

are computed based on our previous works [34,35]. The importance of the depth sensitivity included in the significance map is to propagate the influence of depth deformation on view synthesis to the warping process, so that the geometry changes induced by the depth deformation can be relieved in the virtual views. The significance value of a mesh is calculated by averaging the significance values of all pixels within the mesh. Figure 3 shows an example of the 3D saliency map and the depth sensitivity map.

Fig. 3 Example of the 3D saliency map and the depth sensitivity map.

Download Full Size | PDF

2) Line bending energy: Refer to [22], to minimize the bending of the mesh edges, we define a line bending energy by measuring the angle between the deformed edge and its original edge. Let $e = v_{i} - v_{j}$ be original edge between two vertices v_i and v_j, and $\hat{e} = {\hat{v}}_{i} - {\hat{v}}_{j}$ be the deformed edge between two vertices ${\hat{v}}_{i}$ and ${\hat{v}}_{j}$ , the angle between e and $\hat{e}$ is approximated by

Δ (\hat{e}) = {‖ s_{e} e - \hat{e} ‖}^{2}

where s_e is a scale parameter. Refer to the solution in (8), minimization of (12) yields a linear least-squares solution as

Δ (\hat{e}) = {‖ e {(e^{T} e)}^{- 1} e^{T} \hat{e} - \hat{e} ‖}^{2}

Finally, the total line bending energy is defined as

E_{L B} = \sum_{v_{i}, v_{j} \in V} {‖ (e {(e^{T} e)}^{- 1} e^{T} - I) \cdot \hat{e} ‖}^{2}

3) Visual comfort energy: According to our warping framework, the depth of each mesh will be changed together with the resolution. To ensure the naturalness and visual comfort of the retargeted color-plus-depth data, we propose to adjust the depth range to adapt the new camera geometry. Considering different displays and viewing distances, the original depth range in the 3D space is mapped to the display space, obtaining a new depth range $[Z_{D}^{f a r}, Z_{D}^{n e a r}]$ . To ensure the depth range $[Z_{D}^{f a r}, Z_{D}^{n e a r}]$ fall into the comfort zone, we define a comfortable depth range based on the Percival’s Zone of Comfort (PZC) as follows:

Z_{\max} = \frac{d_{e} L_{D}}{d_{e} - η_{1} L_{D}}, Z_{\min} = \frac{d_{e} L_{D}}{d_{e} - η_{2} L_{D}}

where η₁ and η₂ denote the negative and positive retinal disparity limits (e.g., ± 1° disparity for the PZC).

The scaling factor is defined to scale the original depth range to a new comfortable depth range by:

K = \frac{Z_{\max} - Z_{\min}}{Z_{D}^{f a r} - Z_{D}^{n e a r}}

The depth scaling function is defined to remap the original depth Z_D to a new depth by

f (Z_{D}) = K (Z_{D} - Z_{D}^{n e a r}) + Z_{\min}

Since objects at the screen yield null disparity (d = 0), while objects in front of and behind the screen produce negative and positive disparity, respectively, it leaves large space for the objects at the screen to adjust their depth information. Also, since the depth scaling function in (17) linearly remaps the depth values, the scene layout will be still the same with the original one. Considering the impacts of different disparity planes on visual comfort, the visual comfort energy is defined as

E_{V C} = \sum_{V_{k} \in V} \sum_{i = 1}^{4} ω (Z_{k}^{i}) \cdot {‖ f (Z_{k}^{i}) - {\tilde{Z}}_{k}^{i} ‖}^{2}

ω (Z_{k}^{i}) = e x p (\frac{| Z_{k}^{i} - L_{D} |}{Z_{D}^{f a r} - Z_{D}^{n e a r}})

3.4 Optimization for virtual color-and-depth generation

By combining the above shape preservation, line bending and visual comfort energy terms, the final optimization for the mesh warping is formulated as

\underset{\tilde{V}}{\arg \min} (E_{S P} + E_{V C} + E_{L B})

Minimizing the energy function corresponds to solving a least-squares linear system $A {\tilde{V}}^{T} = b$ and finds a set of deformed meshes $\tilde{V}$ , which is satisfied the boundary conditions. However, in a typical DIBR system, typically two views of color and depth videos are used to synthesize a virtual view. That is, the quality of the virtual view will be affected by the detailed view blending in DIBR, that combines the synthesized virtual views from different reference views. To ensure the consistent deformation across different reference views, by re-projecting the deformed vertices $\tilde{V}$ to the left and right reference image planes, the deformed vertices ${\tilde{U}}_{L}$ and ${\tilde{U}}_{R}$ in the left and right reference images are obtained. Further, by establishing the warping relationships between the original vertices $U_{L}$ and the deformed vertices ${\tilde{U}}_{L}$ , and between the original vertices $U_{R}$ and the deformed vertices ${\tilde{U}}_{R}$ , we can warp the left and right images to the target resolution, denoted as ${\tilde{I}}_{L} (x', y) = I_{L} (x, y)$ and ${\tilde{I}}_{R} (x'', y) = I_{R} (x, y)$ , respectively.

Since the depth coordinate of each vertex in changed after the above mesh optimization, we cannot directly use the same warping relationship on the depth maps. To obtain the retargeted depth map, the retargeted depth values are inversely transformed from the display space to 3D space. Then, we establish a five-parameter fitting function by minimizing the reconstruction error between the original and retargeted depth values for all vertices as [36]

f (D) = β_{1} \cdot (\frac{1}{2} - \frac{1}{1 + \exp (β_{2} \cdot (D - β_{3}))}) + β_{4} \cdot D + β_{5}

where β₁, β₂, β₃, β₄ and β₅ are parameters determined by using the original and retargeted depth values. With the fitting function and the same warping relationship between the original and deformed meshes, we can obtain the retargeted depth maps as

{\tilde{D}}_{L} (x', y) = f (D_{L} (x, y))

and

{\tilde{D}}_{R} (x'', y) = f (D_{R} (x, y))

.

Finally, with the retargeted ${\tilde{I}}_{L} (x', y)$ and ${\tilde{D}}_{L} (x', y)$ , ${\tilde{I}}_{R} (x'', y)$ and ${\tilde{D}}_{R} (x'', y)$ , the virtual view is synthesized by DIBR technique. Different with the general procedure that synthesizes the virtual view first and retargets it to target resolution, the advantage of the framework is that the viewing condition and the location of virtual camera are propagated to the warping process, so that the virtual camera parameters are considered and stereoscopic vision is addressed in a more natural way. Especially, in multiview displays, multiple virtual views should be synthesized simultaneously from the reference cameras. In this situation, the same warping relationship derived from the two-view case can be propagated to other virtual views to add the efficiency of multi-view view synthesis. Since virtual views are synthesized from the adjacent two views (i.e., L-Reference and R-Reference in Fig. 1), we only consider two-view MVD format in this work, and it can be easily extended to multiple views. For example, for three-view MVD format, the virtual view can be synthesized from two adjacent views.

4. Experimental results

In the experiment, seven typical 3D video sequences, ‘Balloons’ (1024 × 768), ‘Kendo’ (1024 × 768), ‘Bookarrival’ (1024 × 768), ‘Newspaper’ (1024 × 768), ‘GT_Fly’ (1920 × 1088), ‘Undo_Dancer’ (1920 × 1088) and ‘Poznan_Street’ (1920 × 1088) are tested. The selections of left, right and virtual views for these sequences are shown in Table 2. For the view synthesis, view synthesis reference software (VSRS) 1D fast [37] is employed. Although different parameter setting will have certain influences on the retargeting results, based on the practical subjective tests, we set the interocular distance d_e = 65mm, the width of display W = 750mm, the horizontal resolution R = 1920 and the viewing distance L_D = 800mm.

Table 2. Selected views for experiment.

View Table | View all tables in this article

4.1 Analysis the influence of depth remapping

As an important step of the proposed method, the final retargeted depth maps are obtained by remapping the original depth values to its retargeted values. Figure 4 shows the examples of the retargeted depth maps and the corresponding synthesized virtual images with and without depth remapping step (for better display of the depth maps, the depth values are mapped to [0, 255]). The retargeted depth maps with depth remapping step have distinctly different appearances with those of the retargeted depth maps without depth remapping step, because the depth values are simultaneously optimized in our warping framework, while the retargeted depth maps without depth remapping step omit this property (still preserve the original depth values). Using the depth maps without depth remapping step, the synthesized images have significant ghost in the boundary area, while the proposed method can eliminate the phenomenon. Also, duo to the influence of incorrect depth maps in the existing 3D video sequences, which are estimated by depth estimation software, the synthesized images will introduce a certain geometric change especially at the edge areas, while optimizing the depth in the display space can somewhat reduce the influence. More results will be presented in the next sections.

Fig. 4 Examples of the retargeted depth maps and the corresponding synthesized images with and without depth remapping step.

Download Full Size | PDF

4.2 Comparison to other methods

Since the proposed method is very related to the stereoscopic image retargeting in the two-view case, we compare our method with state-of-the-art stereoscopic image retargeting methods, including linear scaling (SCL), stereoscopic seam carving (SSC) [20], and single-layer warping (SLW) [22]. For the SCL method, we independently apply the same scaling factor on each reference view. For the SSC and SLW methods, the left reference image and right reference image are taken as a stereoscopic image pair. The same retargeting relationships are directly performed on the depth maps for these methods. Figure 5, Fig. 6, and Fig. 7 illustrate the retargeting results on ‘Poznan_Street’, ‘Undo_Dancer’ and ‘Newspaper’ test sequences. In these figures, all images are shrunk by 40%. The first row in the figures shows the retargeted left reference images. The second row shows the retargeted left depth maps and the third row show the corresponding histograms of the depth maps. The synthesized images from the left and right reference views are shown in the fourth row. As shown in the retargeted left reference images, the SCL method may distort images due to the lack of object information, and the SSC method will suffer from more or less object shape distortion due to the essence of discrete seam removal. Although it is effective in shape preservation, the SLW method will lost the boundary contents, e.g., the car in Fig. 5 and girl in Fig. 7. As shown in the histograms of the depth maps, due to the nature of depth preservation, the SCL and SLW methods have the similar depth distributions with the original ones, while our method changes the depth distributions due to depth remapping. Using the above retargeted reference images and the retargeted depth maps to synthesize the virtual images, the synthesized images with the SCL and SLW methods have significant ghost in the boundary area. The reasons may be that, on one hand, the inconsistent deformation on the left and right reference views with the methods may change the inter-view geometry, while image retargeting in fact does not consider this factor. On the other hand, due to depth preservation essence, a reference point will be projected to the original instead of the retargeted location in the virtual view, while the matched reference points in the left and right reference views will be changed after retargeting. Even the ghost effect is not obvious for the SSC method, the synthesized images still suffer from significant shape deformation due to the poor quality of the retargeted reference images. In contrast, the proposed method preserves most of the structure information without the ghost effect in the synthesized images.

Fig. 5 Comparison between our method and the SCL, SSC and SLW on ‘Poznan_Street’ test sequence.

Download Full Size | PDF

Fig. 6 Comparison between our method and the SCL, SSC and SLW on ‘Undo_Dancer’ test sequence.

Download Full Size | PDF

Fig. 7 Comparison between our method and the SCL, SSC and SLW on ‘Newspaper’ test sequence.

Download Full Size | PDF

4. 3 Subjective evaluation results

A subjective assessment test is conducted to validate the proposed MVDRT model and investigate the influence of the above-mentioned factors (e.g., shape preservation energy, line bending energy and visual comfort energy) on the perceived QoE. The original motivation of this study is to optimize the MVD retargeting for multiview displays with multiple virtual views. The difference between the stereoscopic and multiview displays is that, stereoscopic displays delivers the same pair of information across the entire view zone, whereas multiview displays treats the same space as discrete pieces and provides a different view of the virtual world. However, due to the lack of multiview displays, this test is carried out on a stereoscopic display with a given stereoscopic image pair. The main characteristics of the test are detailed as follows:

Methodology: The Absolute Category Rating (ACR) [38] methodology was used to evaluate the test materials. During the test, the observers were asked to provide a score for the overall 3D QoE, consid ering shape preservation, depth perception, visual comfort, etc. They used the five-grade quality scale. After detecting and discarding outliers of all opinion scores, the final mean opinion score (MOS) for each stereoscopic image was calculated as the mean of the remaining opinion scores.

Environment: The subjective tests were conducted in the laboratory designed for subjective tests according to the recommendation BT.500-11 [39]. A 65” Samsung UA65F9000 stereoscopic display war used to carry out the subjective tests. It is an Ultra HD 3D-LED display with active shutter glasses. The viewing distance was 3 times the height of the screen.

Observers: A total of 25 undergraduate students (12 males and 13 females) in Ningbo University were participated in the subjective test, all of them having normal vision. The ages of the participants were between range 20 and 25. The participants were asked to rank the stereoscopic images based on their own judgment.

The user study results on seven MVD sequences are reported in Table 3. To construct a stereoscopic image pair, the reference left image and the synthesized virtual image are used. Considering the factors of shape preservation, depth perception and visual comfort in evaluating the QoE, we evaluate the user’s visual experience in viewing a stereoscopic image pair. For the SCL and SLW methods, due to the effect of ghost, visual discomfort is obvious in viewing the stereoscopic image pairs, leading to lower opinion scores. For the SSC method, shape deformation is particularly serious for ‘Balloons’, ‘Newspaper’ and ‘Undo_Dancer’ test sequences (also can be found in Figs. 5-7), leading to very poor visual experience, while for some sequences, e.g., ‘Balloons’ and ‘GT_Fly’, the object shape preservation performances are better than our method. Overall, our approach can achieve a better tradeoff among shape preservation, depth perception and visual comfort factors to provide a better visual experience for users.

Table 3. Quantitative subjective assessment results of different MVD sequences.

View Table | View all tables in this article

5. Conclusions

We have presented a retargeting framework for MVD data to optimize the visual experience for S3D displays. The most important technical innovation of our framework is that the parameters of virtual camera and viewing conditions are utilized to optimize the resolution and depth of MVD data for view synthesis. Our results demonstrate the effectiveness and power of our approach. Currently, we generate fixed-viewpoint S3D output from MVD data. For dense virtual view generation, other constraint conditions, such as displaying parameters, should be integrated into our MVDRT framework.

Funding

National Natural Science Foundation of China (NSFC) (grant 61622109, 61271021, 61471212, and U1301257); K.C. Wong Magna Fund in Ningbo University.

References and links

1. S. Baek and C. Lee, “Depth perception estimation of various stereoscopic displays,” Opt. Express 24(21), 23618–23634 (2016). [CrossRef] [PubMed]

2. A. K. Moorthy and A. C. Bovik, “A survey on 3D quality of experience and 3D quality assessment,” Proc. SPIE 8651, 86510M (2013). [CrossRef]

3. H. Urey, K. V. Chellappan, E. Erden, and P. Surman, “State of the art in stereoscopic and autostereoscopic displays,” Proc. IEEE 99(4), 540–555 (2011). [CrossRef]

4. D. M. Hoffman, A. R. Girshick, K. Akeley, and M. S. Banks, “Vergence-accommodation conflicts hinder visual performance and cause visual fatigue,” J. Vis. 8(3), 33 (2008). [CrossRef] [PubMed]

5. M. Lambooij, M. Fortuin, H. Heynderickx, and W. IJsselsteijn, “Visual discomfort and visual fatigue of stereoscopic displays: A review,” J. Imaging Sci. Technol. 53(3), 030201 (2009). [CrossRef]

6. W. J. Kim, S. D. Kim, J. Kim, and N. Hur, “Resizing of stereoscopic images for display adaptation,” Proc. SPIE 7237, 72371S (2009). [CrossRef]

7. M. Urvoy, M. Barkowsky, and P. Le Callet, “How visual fatigue and discomfort impact 3D-TV quality of experience: A comprehensive review of technological, psychophysical, and psychological factors,” Ann. Telecommun. 68(11–12), 641–655 (2013). [CrossRef]

8. P. Merkle, A. Smolic, K. Muller, and T. Wiegand, “Multi-view video plus depth representation and coding,” in Proc. IEEE International Conference on Image Processing (2007), pp. 201–204. [CrossRef]

9. Y. Liu, Q. Dai, Z. You, and W. Xu, “Rate-prediction structure complexity analysis for multi-view video coding using hybrid genetic algorithms,” Proc. SPIE 6508, 650804 (2007). [CrossRef]

10. A. De Abreu, P. Frossard, and F. Pereira, “Optimized MVC prediction structures for interactive multiview video streaming,” IEEE Signal Process. Lett. 20(6), 603–606 (2013). [CrossRef]

11. A. De Abreu, P. Frossard, and F. Pereira, “Fast MVC prediction structure selection for interactive multiview video streaming,” in Proc. Picture Coding Symp. (2013), pp. 169–172. [CrossRef]

12. A. Fiandrotti, J. Chakareski, and P. Frossard, “Popularity-aware rate allocation in multi-view video coding,” Proc. SPIE 7744, 77440Q (2010). [CrossRef]

13. M. Lang, A. Hornung, O. Wang, S. Poulakos, A. Smolic, and M. Gross, “Nonlinear disparity mapping for stereoscopic 3D,” ACM Trans. Graph. 29(4), 75 (2010). [CrossRef]

14. T. Yan, R. W. H. Lau, Y. Xu, and L. Huang, “Depth mapping for stereoscopic videos,” Int. J. Comput. Vis. 102(1–3), 293–307 (2013). [CrossRef]

15. M. S. Farid, M. Lucenteforte, and M. Grangetto, “Depth image based rendering with inverse mapping,” in Proc. IEEE International Workshop on Multimedia Signal Processing (2013), pp. 135–140. [CrossRef]

16. M. Rubinstein, D. Gutierrez, O. Sorkine, and A. Shamir, “A comparative study of image retargeting,” ACM Trans. Graph. 29(6), 160 (2010). [CrossRef]

17. S. Avidan and A. Shamir, “Seam carving for content-aware image resizing,” ACM Trans. Graph. 26(3), 118 (2007). [CrossRef]

18. G. X. Zhang, M. M. Cheng, S. M. Hu, and R. R. Martin, “A shape-preserving approach to image resizing,” Comput. Graph. Forum 28(7), 1897–1906 (2009). [CrossRef]

19. K. Utsugi, T. Shibahara, T. Koike, K. Takahashi, and T. Naemura, “Seam carving for stereo images,” in Proc. of 3DTV-Conference: The True Vision -Capture, Transmission and Display of 3D Video (2010), pp. 1–4. [CrossRef]

20. T. Dekel Basha, Y. Moses, and S. Avidan, “Stereo seam carving a geometrically consistent approach,” IEEE Trans. Pattern Anal. Mach. Intell. 35(10), 2513–2525 (2013). [CrossRef] [PubMed]

21. F. Shao, W. Lin, W. Lin, G. Jiang, M. Yu, and R. Fu, “Stereoscopic visual attention guided seam carving for stereoscopic image retargeting,” J. Disp. Technol. 12(1), 22–30 (2016). [CrossRef]

22. C. H. Chang, C. K. Liang, and Y. Y. Chuang, “Content-aware display adaptation and interactive editing for stereoscopic images,” IEEE Trans. Multimed. 13(4), 589–601 (2011). [CrossRef]

23. S. S. Lin, C. H. Lin, S. H. Chang, and T. Y. Lee, “Object-coherence warping for stereoscopic image retargeting,” IEEE Trans. Circ. Syst. Video Tech. 24(5), 759–768 (2014). [CrossRef]

24. B. Li, L. Y. Duan, C. W. Lin, T. Huang, and W. Gao, “Depth-preserving warping for stereo image retargeting,” IEEE Trans. Image Process. 24(9), 2811–2826 (2015). [CrossRef] [PubMed]

25. K. Y. Lee, C. D. Chung, and Y. Y. Chuang, “Scene warping: Layer-based stereoscopic image resizing,” in Proc. of IEEE Computer Vision and Pattern Recognition (2012), pp. 49–56.

26. F. Shao, Z. Li, Q. Jiang, G. Jiang, M. Yu, and Z. Peng, “Visual discomfort relaxation for stereoscopic 3D images by adjusting zero-disparity plane for projection,” Displays 39, 125–132 (2015). [CrossRef]

27. H. Sohn, Y. J. Jung, S. Lee, F. Speranza, and Y. M. Ro, “Visual comfort amelioration technique for stereoscopic images: Disparity remapping to mitigate global and local discomfort causes,” IEEE Trans. Circ. Syst. Video Tech. 24(5), 745–758 (2014). [CrossRef]

28. J. Lei, S. Li, B. Wang, K. Fang, and C. Hou, “Stereoscopic visual attention guided disparity control for multiview images,” J. Disp. Technol. 10(5), 373–379 (2014). [CrossRef]

29. F. Shao, Q. Jiang, R. Fu, M. Yu, and G. Jiang, “Optimizing visual comfort for stereoscopic 3D display based on color-plus-depth signals,” Opt. Express 24(11), 11640–11653 (2016). [CrossRef] [PubMed]

30. M. Wang, X. J. Zhang, J. B. Liang, S. H. Zhang, and R. R. Martin, “Comfort-driven disparity adjustment for stereoscopic video,” Comput. Visual Media 2(1), 3–17 (2016). [CrossRef]

31. Y. H. Huang, T. K. Huang, Y. H. Huang, W. C. Chen, and Y. Y. Chuang, “Warping-based novel view synthesis from a binocular image for autostereoscopic displays,” in Proc. of IEEE International Conference on Multimedia and Expo (2012), pp. 302–307. [CrossRef]

32. D. Li, X. Qiao, D. Zang, L. Wang, and M. Zhang, “On adjustment of stereo parameters in multiview synthesis for planar 3D displays,” J. Soc. Inf. Disp. 23(10), 491–502 (2015). [CrossRef]

33. J. Lei, M. Wang, B. Wang, K. Fan, and C. Hou, “Projection-based disparity control for toed-in multiview images,” Opt. Express 22(9), 11192–11204 (2014). [CrossRef] [PubMed]

34. Q. Jiang, F. Shao, G. Jiang, M. Yu, Z. Peng, and C. Yu, “A depth perception and visual comfort guided computational model for stereoscopic 3D visual saliency,” Signal Process. Image Commun. 38, 57–69 (2015). [CrossRef]

35. F. Shao, W. Lin, G. Jiang, M. Yu, and Q. Dai, “Depth map coding for view synthesis based on distortion analyses,” IEEE J. Em. Sel. Top. C. 4(1), 106–117 (2014).

36. P. G. Gottschalk and J. R. Dunn, “The five-parameter logistic: a characterization and comparison with the four-parameter logistic,” Anal. Biochem. 343(1), 54–65 (2005). [CrossRef] [PubMed]

37. VSRS-1D-Fast. [Online]. Available: https://hevc.hhi.fraunhofer.de/svn/ svn_3DVCSoftware, 2014.

38. ITU-T P.910, “Subjective video quality assessment methods for multimedia applications,” (1999).

39. ITU-R BT-500.11, “Methodology for the subjective assessment of the quality of television pictures,” (2012).

Method								Avg
SCL	2.00	2.24	1.92	2.08	2.88	2.04	2.20	2.19
SSC	2.16	3.88	2.88	2.28	4.04	1.88	2.12	2.75
SLW	2.12	2.16	2.08	2.12	3.04	2.08	2.52	2.30
Our	3.84	3.80	3.92	4.08	3.88	3.68	3.88	3.87

Method								Avg
SCL	2.00	2.24	1.92	2.08	2.88	2.04	2.20	2.19
SSC	2.16	3.88	2.88	2.28	4.04	1.88	2.12	2.75
SLW	2.12	2.16	2.08	2.12	3.04	2.08	2.52	2.30
Our	3.84	3.80	3.92	4.08	3.88	3.68	3.88	3.87

Optimizing multiview video plus depth retargeting technique for stereoscopic 3D displays

Abstract

1. Introduction

2. Related work

3. The proposed method

3.1 Framework overview

3.2 Relationship between the depth image and the display space

3.3 Construction of warping energy function

3.4 Optimization for virtual color-and-depth generation

4. Experimental results

4.1 Analysis the influence of depth remapping

4.2 Comparison to other methods

4. 3 Subjective evaluation results

5. Conclusions

Funding

References and links

Cited By

Figures (7)

Tables (3)

Equations (21)

Optics Express

Metric	Depth adjustment	Resolution adjustment
Lang et al. [13]	yes	no
Yan et al. [14]	yes	no
Basha et al. [20]	no	yes
Chang et al. [22]	no	yes
Li et al. [24]	no	yes
Li et al. [32]	yes	no

Sequences	Left	Right	Virtual
Balloons	1	5	3
Kendo	3	5	4
Bookarrival	6	10	8
Newspaper	2	6	4
GT_Fly	1	9	5
Undo_Dance	1	9	5
Poznan_Street	3	5	4