
Dense-view synthesis for three-dimensional light-field display based on unsupervised learning

Open Access

Abstract

The three-dimensional (3D) light-field display, a promising future display method, has attracted considerable attention. However, several issues remain to be addressed, especially the capture of dense views of real 3D scenes. Using sparse cameras combined with a view synthesis algorithm has become a practical approach. Supervised convolutional neural networks (CNNs) can be used to synthesize virtual views. However, the large number of training target views they require is sometimes difficult to obtain, and the training position is relatively fixed. Novel views can also be synthesized by the unsupervised network MPVN, but that method imposes strict requirements on capturing multiple uniform horizontal viewpoints, which is impractical. Here, a method of dense-view synthesis based on unsupervised learning is presented, which can synthesize arbitrary virtual views from multiple free-posed views captured in a real 3D scene. Multiple posed views are reprojected to the target position and input into the neural network. The network outputs a color tower and a selection tower indicating the scene distribution along the depth direction. A single image is yielded by the weighted summation of the two towers. The proposed network is trained end-to-end with unsupervised learning by minimizing the errors in reconstructing the posed views. A virtual view can be predicted in high quality by reprojecting the posed views to the desired position. Additionally, a sequence of dense virtual views can be generated for 3D light-field display by repeated predictions. Experimental results demonstrate the validity of the proposed network: the PSNR of synthesized views is around 30 dB and the SSIM is over 0.90. Since multiple cameras can be placed in free-posed positions, there are no strict physical requirements, and the proposed method can be used flexibly for real-scene capture. We believe this approach will contribute to the wide application of 3D light-field display in the future.

© 2019 Optical Society of America under the terms of the OSA Open Access Publishing Agreement

1. Introduction

In recent years, the three-dimensional (3D) light-field display, a promising future display method, has attracted considerable attention from researchers. With specifically designed optical equipment, panel pixels can be multiplexed, so a 3D light-field display can provide a large viewing angle and dense viewpoints. The realization of this 3D display technology will have a great influence on many areas of society [1,2]. However, some problems remain to be addressed, especially the capture of dense views of real 3D scenes.

There are many approaches to directly capture multiple views. A light field camera, such as the Lytro Illum, can capture dense viewpoints in one shot, but the total baseline between views is short and the viewing angle is rather narrow [3]. A dense camera array, such as the Stanford large camera array [4], can be used to capture multiple viewpoints. However, in many cases such a hardware system is too complex to build and maintain. Other researchers have therefore combined sparse cameras with view synthesis algorithms to obtain dense viewpoints [5,6].

View synthesis algorithms synthesize a novel viewpoint by warping nearby posed views without 3D reconstruction, an approach known as image-based rendering (IBR) [7]. They can represent the real scene with realistic images. Several IBR techniques can interpolate acceptable results within a small viewing angle. However, when the camera baseline becomes wider, there are significant errors and blurs in the synthesized views due to scene depth discontinuities [8,9].

Convolutional neural networks (CNNs) have become prevalent in many computer vision (CV) tasks. Supervised CNNs can be used to synthesize novel views [10–14], for example DVM, DeepStereo, and EPI reconstruction. DVM uses a CNN to find the correspondence between views and blends the warped images into a novel view [10]. DeepStereo uses a network to compute scene distributions at different depths by plane sweeping and blends them to synthesize a novel view [13]. EPI reconstruction applies a network to super-resolve the light field in the angular dimensions and thereby interpolate dense novel views [14]. However, these supervised methods still have limitations. They require a considerable number of training target views, which are sometimes difficult to obtain. For DVM and DeepStereo, the position of the training target is relatively fixed, mostly at the middle of the input views, so dense views must be synthesized iteratively, and the accumulation of view errors leads to unsatisfactory results. The limitation of EPI reconstruction is that its maximum disparity is very narrow. Unsupervised CNNs can also synthesize novel views, for example MPVN [15], which does not require dense training targets. Multiple views with uniform horizontal parallax are plane swept and input into the network to generate a novel view. However, this method has a major limitation in practice: it imposes very strict physical requirements on the hardware system to obtain multiple uniform horizontal viewpoints, which makes it hard to use widely for real-scene capture.

Here, a method of dense-view synthesis for 3D light-field display based on unsupervised learning is proposed. The method can synthesize arbitrary virtual views from multiple free-posed views, so that sufficiently dense viewpoints of a real 3D scene can be provided for 3D light-field display. Since the cameras can be placed in free-posed positions, the proposed method does not have the strict physical requirements of MPVN and can be used flexibly for real-scene capture. The postures of the multiple free-posed cameras are estimated by multi-view calibration. Multiple views are reprojected to the target position and input into the neural network. The network outputs a color tower and a selection tower indicating the scene distribution along the depth direction. A single image is yielded by the weighted summation of the two towers. The network is trained end-to-end with unsupervised learning by minimizing the errors in reconstructing the posed views. A virtual view is predicted by reprojecting the posed views to the desired position, and a dense-view sequence can be generated by repeated synthesis without any error accumulation. Experimental results demonstrate the validity of the proposed method: the PSNR of the virtual views is around 30 dB and the SSIM is over 0.90.

Compared with MPVN, the proposed method is a significant improvement, even though both adopt unsupervised learning to output novel views. Because MPVN was designed to handle uniform horizontal parallax, the capture hardware has to be precisely designed and the capture procedure carefully carried out, which means that MPVN is not suitable for widespread practical use. In contrast, the proposed method can solve non-uniform full-parallax problems, where multiple cameras can be placed freely in space, so there are no strict physical requirements and the method can be used flexibly for real-scene capture. The reason is that their reprojection methods are distinct. As shown in Fig. 1, MPVN sweeps uniform horizontal views within the maximum disparity to compute the distribution at different depths. Non-uniform full-parallax images cannot be processed in a similar way without knowing the camera postures. Unlike MPVN, the proposed method takes camera postures into account and reprojects free-posed views to the desired position at different depths to compute the scene distributions. This allows it to deal with different situations and flexibly use free-posed views to synthesize arbitrary virtual viewpoints.


Fig. 1 The schematic comparison of MPVN and our proposed network. MPVN can only utilize an image array with uniform horizontal parallax to synthesize a novel view by sweeping planes. The proposed method is able to handle the case of free-posed views by reprojection after taking camera postures into consideration.


2. The proposed method

The overall approach of the proposed method is shown in Fig. 2. Multiple free-posed views are captured with a sparse camera array. Camera postures are estimated by multi-view calibration. The posed views are reprojected to the target position based on homography transformations. The CNN is trained with unsupervised learning by reconstructing every posed view from these warped views. An arbitrary virtual view can then be predicted by reprojecting the posed views to the desired position, and a sequence of dense virtual views can be obtained by repeated predictions with the network.


Fig. 2 The overall approach of our proposed method. (a) Multiple views are captured by a sparse camera array. (b) Camera postures are estimated by multi-view calibration. (c) Posed views are reprojected to a target position at different depths. (d) These warped views are input into the network, which is trained by unsupervised learning. (e) By changing the reprojection pose, a virtual view can be predicted. (f) A sequence of dense virtual views can be obtained by repeated predictions.


2.1 Multi-view calibration

When capturing the real scene, multiple cameras are placed in free-posed positions to reduce the complexity of the capture hardware. Since the optical axes of multiple free-posed cameras are difficult to keep parallel, there is obvious non-uniform full parallax between different viewpoints. To address this, camera postures are taken into account here.

In the proposed method, the multiple camera postures are estimated by traditional multi-view calibration [16]. The intrinsic parameters of each camera are calibrated with a dozen chessboard images. The extrinsic parameters, including the rotation matrix and translation vector, are estimated by structure from motion (SFM) and bundle adjustment (BA). The SFM pipeline matches a sparse set of 2D feature points between the first two images and decomposes the related essential matrix to obtain a rotation matrix and a translation vector. The other views are added incrementally to the pipeline, and their poses are estimated by solving perspective-n-point (PnP) problems. BA is used to globally refine the camera poses and 3D point locations by minimizing reprojection errors.
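As an illustration of this pipeline, the following sketch recovers the initial pair pose from the essential matrix and registers a later view by PnP using OpenCV. It assumes matched 2D feature points and triangulated 3D points are already available from a standard feature-matching and triangulation step; it is a minimal sketch, not the calibration code used in the paper.

```python
import cv2
import numpy as np

# Initial pair: estimate the essential matrix from matched 2D points (N x 2 float
# arrays pts1, pts2) and decompose it into a rotation and a unit-scale translation.
def initial_pair_pose(pts1, pts2, K):
    E, inliers = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC, threshold=1.0)
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=inliers)
    return R, t

# Incremental registration: estimate the pose of a later view by solving a PnP
# problem against already triangulated 3D points (pts3d, N x 3) and their 2D
# observations in that view (pts2d, N x 2).
def register_view(pts3d, pts2d, K):
    ok, rvec, tvec, _ = cv2.solvePnPRansac(pts3d.astype(np.float64),
                                           pts2d.astype(np.float64), K, None)
    R, _ = cv2.Rodrigues(rvec)  # rotation vector -> rotation matrix
    return R, tvec
```

Bundle adjustment then jointly refines all poses and 3D points by minimizing the reprojection error, as described above.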

In general, if the postures of the multiple free-posed cameras are not precise, the quality of the synthesized virtual view is very poor and unacceptable. Multi-frame calibration is therefore employed to rectify the postures and improve their accuracy. The free-posed cameras capture a certain number of discrete frames, and the related posture arrays are estimated from these multi-frame view arrays, respectively. For generality, the first two images and the incremental image sequence are changed when calibrating the view arrays of different frames. Although the SFM results can be very different, the relative postures are very similar. All posture results are transformed to take the leftmost camera as the coordinate origin. The average of these posture arrays gives a better and more precise result. Note that the rotation matrices should be converted into a linear space, such as Euler angles, before averaging, and a trimmed mean is used as the averaging operation to remove the effect of outlier postures. As shown in Fig. 3, the refocus image with rectified postures is much clearer than the original refocus image, which means multi-frame calibration is very effective at reducing camera posture errors.
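A minimal sketch of this posture averaging is given below, assuming SciPy is available; the 20% trimming ratio and the xyz Euler convention are illustrative choices rather than values taken from the paper.

```python
import numpy as np
from scipy.spatial.transform import Rotation
from scipy.stats import trim_mean

# Average per-frame pose estimates of one camera (already expressed relative to
# the leftmost camera). Rotations are converted to Euler angles so they can be
# averaged in a linear space, and a trimmed mean rejects outlier postures.
def average_pose(rotations, translations, cut=0.2):
    eulers = np.stack([Rotation.from_matrix(R).as_euler("xyz") for R in rotations])
    mean_euler = trim_mean(eulers, proportiontocut=cut, axis=0)
    mean_R = Rotation.from_euler("xyz", mean_euler).as_matrix()
    mean_t = trim_mean(np.stack(translations), proportiontocut=cut, axis=0)
    return mean_R, mean_t
```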


Fig. 3 The comparison of posture rectification for refocus results. The power adapter (a) and the ceiling line (b) become much clearer after posture rectification.


2.2 Homography transformation of view reprojection

For view synthesis, it is necessary to reproject the free-posed views into one camera coordinate system to compute the scene distribution at different depths. In MPVN, different views are swept horizontally to compute the distributions without considering camera postures. DeepMVS takes camera poses into consideration to estimate the disparity range and sweeps image planes to produce a disparity map [17]. In the proposed method, image views are reprojected to the target view. A similar method was proposed in MVSNet for depth estimation [18], but there feature maps are warped in the network rather than the original images.

The reprojection works as follows: viewpoints are back-projected onto different depth planes from their positions and then projected onto the image plane of the target camera, which can be regarded as a series of homography transformations. Therefore, the results reprojected at different depth planes can be quickly obtained by warping the image view with the related homography matrices.

A simple way is used here to estimate the homography matrix H_m between camera C_i and the target camera C_t, as shown in Fig. 4. The current depth plane is z_m. The intrinsic matrix K, the rotation matrix R, and the translation vector T of the two cameras are already known from calibration. The original four corner points p_i(u_i, v_i, 1) are back-projected into 3D space. The 3D points P_i(x_i, y_i, z_i) and the position of camera C_i(x_c, y_c, z_c) are given by the following expressions,

$$P_i = R_i^{-1}\left(K_i^{-1}p_i^{T} - T_i\right), \qquad p_i \in \left\{(0,0,1),\,(w,0,1),\,(0,h,1),\,(w,h,1)\right\},$$
$$C_i = -R_i^{-1}T_i,$$

where K_i, R_i, and T_i are the intrinsic matrix, rotation matrix, and translation vector of camera C_i, and K_t, R_t, and T_t are those of the target camera C_t. The back-projected rays C_iP_i are extended to the depth plane z_m. The intersection points P_s(x_s, y_s, z_m) can be computed by the following expression,

$$\begin{cases} x_s = \dfrac{z_m - z_i}{z_c - z_i}\times(x_c - x_i) + x_i \\[1.5ex] y_s = \dfrac{z_m - z_i}{z_c - z_i}\times(y_c - y_i) + y_i. \end{cases}$$

The intersection points P_s(x_s, y_s, z_m) are projected onto the image plane of the target camera C_t(x_t, y_t, z_t), and the final four corner points p_t(u_t, v_t, 1) are obtained by the following projection formula,

$$p_t^{T} = K_t\left(R_t P_s + T_t\right).$$

The homography matrix H_m can be obtained by solving the following equation,

$$\left(p_t^{T}\right)_{3\times 4} = H_m^{3\times 3}\left(p_i^{T}\right)_{3\times 4},$$

where p_i(u_i, v_i, 1) are the original four corner points and p_t(u_t, v_t, 1) are the final projected four corner points. Applying the series of homography matrices H_m to the original view yields a series of homography-warped views, corresponding to the reprojections at different depth planes.
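The sketch below reimplements this four-corner construction with NumPy and OpenCV, assuming the calibrated poses map world points to camera coordinates as x_cam = R x_world + T; it is an illustrative version of the equations above, not the authors' code.

```python
import cv2
import numpy as np

# Homography between a source camera (Ki, Ri, Ti) and the target camera
# (Kt, Rt, Tt) for the depth plane z = z_m, following the four-corner
# back-projection described above. w, h are the image width and height.
def plane_homography(Ki, Ri, Ti, Kt, Rt, Tt, w, h, z_m):
    corners = np.array([[0, 0, 1], [w, 0, 1], [0, h, 1], [w, h, 1]], dtype=np.float64)
    C = -Ri.T @ Ti.reshape(3)                      # source camera centre in world space
    dst = []
    for p in corners:
        P = Ri.T @ (np.linalg.inv(Ki) @ p - Ti.reshape(3))  # back-projected corner
        s = (z_m - P[2]) / (C[2] - P[2])           # extend the ray C -> P to z = z_m
        P_s = P + s * (C - P)
        q = Kt @ (Rt @ P_s + Tt.reshape(3))        # project onto the target camera
        dst.append(q[:2] / q[2])
    H_m = cv2.getPerspectiveTransform(corners[:, :2].astype(np.float32),
                                      np.array(dst, dtype=np.float32))
    return H_m

# Warping the source view with H_m gives its reprojection at depth z_m:
#   warped = cv2.warpPerspective(view, H_m, (w, h))
```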


Fig. 4 The homography transformation of view reprojection. 2D points p_i(u_i, v_i, 1) are back-projected onto different depth planes as 3D points P_s(x_s, y_s, z_m). These 3D points are projected on the screen of the target camera as 2D points p_t(u_t, v_t, 1). The homography matrix H_m can be computed from the transformation between p_i(u_i, v_i, 1) and p_t(u_t, v_t, 1).


2.3 The unsupervised learning algorithm

Unsupervised learning does not require extra training labels for CV problems [19,20]. The unsupervised learning algorithm used in our method is very similar to the one in MPVN. In the training part, multiple viewpoints are reprojected to one of the posed views, regarded as the target, based on homography transformations. The network is trained to output the target view from these warped images, and every posed view is used as a training target in turn. Once training is complete, the network can predict a virtual view simply by changing the reprojection position. Thanks to this view-dependent scheme, arbitrary virtual viewpoints can be efficiently synthesized from free-posed views.

Figure 5 shows the schematic diagram of the proposed unsupervised learning algorithm. In the training part, N free-posed views, v_1, ..., v_N, are reprojected to one of them, regarded as the target view v_t, with the corresponding homography matrices H_{1→t,m}, ..., H_{N→t,m}. The warped images are stacked as N view towers and input into the CNN. The network is trained with unsupervised learning by minimizing the loss between the output view v_c and the target view v_t. In the predicting part, N test views, v_1', ..., v_N', are reprojected to the desired virtual view v_x' with the related homography matrices H_{1→x,m}', ..., H_{N→x,m}', and stacked as new view towers. The N new view towers are fed into the CNN, and the desired virtual view v_x' is generated in high quality.
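The training step can be summarized with the sketch below, written for TensorFlow 2.x. Here `model` and `warp` are hypothetical placeholders for the network of Section 2.5 and the homography warping of Section 2.2, and a plain L1 reconstruction loss stands in for the full multi-scale loss of Section 2.4; this is a sketch of the idea, not the authors' training code.

```python
import tensorflow as tf

# One unsupervised training step: pick a posed view as the target, reproject all
# posed views to its pose at every depth plane, and train the network to
# reproduce that posed view from the warped stacks.
def train_step(model, optimizer, posed_views, homographies, target_index, warp):
    target = posed_views[target_index]
    # homographies[i][target_index] holds the per-depth-plane matrices H_{i->t,m}.
    towers = [warp(v, homographies[i][target_index]) for i, v in enumerate(posed_views)]
    with tf.GradientTape() as tape:
        synthesized = model(tf.concat(towers, axis=-1), training=True)
        loss = tf.reduce_mean(tf.abs(synthesized - target))  # L1 reconstruction loss
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
```

At prediction time the same network is reused; only the homographies change, so that the views are warped to the desired virtual pose instead of a posed view.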


Fig. 5 The schematic diagram of the proposed unsupervised learning algorithm. In the training part, v_1, ..., v_N are reprojected to one of them, which is regarded as the target view v_t. The CNN is trained with unsupervised learning by minimizing the error between the output view v_c and the target view v_t. In the predicting part, v_1', ..., v_N' are reprojected to the desired position, and the CNN outputs the virtual view v_x' in acceptable quality.


2.4 View synthesis method

The schematic diagram of the view synthesis procedure with the network is shown in Fig. 6. Multiple views are warped to the target view at various depths based on homography transformations and stacked as N view towers. For each depth plane m, images from the different view towers are concatenated and input into the network. The 2D color network outputs a multi-scale color result, consisting of 3-channel RGB images that are very similar to the refocus result of all warped images at that depth. The 2D + 3D selection network outputs a multi-scale selection tower; each plane of the selection tower is a 1-channel probability map that indicates the likelihood that the corresponding color plane is in focus. The color results are stacked as a multi-scale color tower corresponding to the selection tower. The final multi-scale view is synthesized by a per-plane weighted summation of the two towers, as shown in Fig. 7.
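A minimal sketch of this per-plane blending is given below, assuming the color tower and the selection logits are stored as 5D tensors with the depth planes along axis 1; these shapes are an assumption for illustration, not taken from the paper.

```python
import tensorflow as tf

# color_tower:      [batch, depth, H, W, 3]  (one RGB plane per depth)
# selection_logits: [batch, depth, H, W, 1]  (one score map per depth)
def blend_towers(color_tower, selection_logits):
    # Pixel-wise softmax along the depth axis so the selection weights sum to 1.
    selection_tower = tf.nn.softmax(selection_logits, axis=1)
    # Per-plane weighted summation yields the synthesized view [batch, H, W, 3].
    return tf.reduce_sum(color_tower * selection_tower, axis=1)
```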


Fig. 6 The schematic diagram of the view synthesis procedure. Each plane of the view towers is concatenated and input into the network. The 2D color network outputs a multi-scale color result. The 2D + 3D selection network outputs a multi-scale selection tower. Each plane of the two towers is shown in the figure.



Fig. 7 The final multi-scale view is the weighted summation of the multi-scale color tower and the multi-scale selection tower.


For an image scale r, the resulting view v_c^r is yielded by the per-plane weighted summation of the color tower CT^r and the selection tower ST^r. v_c^r is given by the following expression,

$$v_c^{r} = \sum_{i=0}^{M\times r} CT_i^{r} \times ST_i^{r},$$

where CT_i^r is the color image of the color tower at plane i, and ST_i^r is the probability map of the selection tower at plane i. The selection tower ST^r has been pixel-wise normalized along the depth dimension using a softmax,

$$\sum_{i=0}^{M\times r} ST_i^{r} = 1.$$

The pixel-wise L1 loss l_c^r is introduced to compute the difference between the yielded view v_c^r and the target view v_t^r, and is given by the following expression,

$$l_c^{r} = \left\| v_t^{r} - v_c^{r} \right\|_1.$$

Two smoothness losses are also considered. The color smoothness loss s_c^r is computed on the gradients of the color result to suppress color pixel noise, and is given by

$$s_c^{r} = \left\| \nabla v_c^{r} \right\|_1.$$

The depth smoothness loss s_d^r is computed on the gradients of the depth map. Since depth discontinuities usually lie on the edges of the color image, the color gradient is employed as an edge-aware weight coefficient. s_d^r is given by the following expression,

$$s_d^{r} = \left\| \nabla v_d^{r} \times \exp\!\left(-\left|\nabla v_t^{r}\right|\right) \right\|_1.$$

The network is trained by minimizing the total loss L, which combines the three losses above, l_c^r, s_c^r, and s_d^r, over all image scales,

$$L = \sum_r \left( a_r l_c^{r} + b_r s_c^{r} + c_r s_d^{r} \right) \quad \text{and} \quad \sum_r \left( a_r + b_r + c_r \right) = 1,$$

where a_r is the weight of the color image loss l_c^r, b_r is the weight of the color smoothness loss s_c^r, and c_r is the weight of the depth smoothness loss s_d^r, under the image scale r. Note that, since no depth data sets are used as training labels, the depth smoothness loss s_d^r is crucial for the final image quality.
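The sketch below expresses these losses for a single image scale in TensorFlow, with the gradient operator approximated by finite differences; the exact gradient operator, reductions, and default weights (taken here from the scale-1 values reported in Section 3.1) are illustrative assumptions rather than the authors' exact implementation.

```python
import tensorflow as tf

def image_gradients(img):
    # Finite-difference gradients of a [batch, H, W, C] tensor.
    dx = img[:, :, 1:, :] - img[:, :, :-1, :]
    dy = img[:, 1:, :, :] - img[:, :-1, :, :]
    return dx, dy

def single_scale_loss(v_c, v_t, v_d, a=1.0, b=0.5, c=0.1):
    l_c = tf.reduce_mean(tf.abs(v_t - v_c))                        # pixel-wise L1 loss
    cx, cy = image_gradients(v_c)
    s_c = tf.reduce_mean(tf.abs(cx)) + tf.reduce_mean(tf.abs(cy))  # color smoothness
    dx, dy = image_gradients(v_d)                                  # depth-map gradients
    tx, ty = image_gradients(v_t)
    # Edge-aware weights: down-weight depth gradients where the target has color edges.
    wx = tf.exp(-tf.reduce_mean(tf.abs(tx), axis=-1, keepdims=True))
    wy = tf.exp(-tf.reduce_mean(tf.abs(ty), axis=-1, keepdims=True))
    s_d = tf.reduce_mean(tf.abs(dx) * wx) + tf.reduce_mean(tf.abs(dy) * wy)
    return a * l_c + b * s_c + c * s_d
```

The total loss L is then the weighted sum of these single-scale losses over all image scales.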

2.5 Network architecture

The detailed architecture of the proposed network is shown in Fig. 8; it is similar to an autoencoder [21]. This kind of structure is very effective for feature extraction, with multiple layers of convolution and pooling operations, and for feature reconstruction, with deconvolution layers. Multiple warped images are concatenated and input into the network, and a multi-scale CNN is used to compute features under different receptive fields. Two 2D networks are applied for the color results and the selection results, respectively. Their structures are the same except for the output. The 2D color network outputs multi-scale color images, each a 3-channel RGB image. The 2D selection network outputs an intermediate 16-channel multi-scale selection feature. These features are stacked as a multi-scale selection feature tower and input into the 3D selection network shown in Fig. 9. 3D convolution layers are used to refine the selection towers by improving the correlation of features between different planes. The 3D network directly outputs a multi-scale selection tower; each plane of the selection tower is a 1-channel probability map.
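The sketch below outlines this 2D + 3D structure in Keras with hypothetical layer counts and channel widths; the actual configuration is the one given in Figs. 8 and 9, so this is only a structural illustration.

```python
import tensorflow as tf
from tensorflow.keras import layers

# A small encoder-decoder standing in for each 2D branch (assumes even spatial
# dimensions). With 5 input views, each depth plane carries 5 x 3 = 15 channels.
def build_2d_branch(in_channels, out_channels):
    inp = layers.Input(shape=(None, None, in_channels))
    x = layers.Conv2D(32, 3, padding="same", activation="relu")(inp)
    skip = x
    x = layers.MaxPool2D(2)(x)
    x = layers.Conv2D(64, 3, padding="same", activation="relu")(x)
    x = layers.Conv2DTranspose(32, 3, strides=2, padding="same", activation="relu")(x)
    x = layers.Concatenate()([x, skip])
    out = layers.Conv2D(out_channels, 3, padding="same")(x)
    return tf.keras.Model(inp, out)

color_net = build_2d_branch(in_channels=15, out_channels=3)       # 3-channel RGB plane
selection_net = build_2d_branch(in_channels=15, out_channels=16)  # 16-channel selection feature

# 3D refinement over the stacked selection features [depth, H, W, 16],
# producing one probability logit per depth plane.
refine_3d = tf.keras.Sequential([
    layers.Input(shape=(None, None, None, 16)),
    layers.Conv3D(16, 3, padding="same", activation="relu"),
    layers.Conv3D(1, 3, padding="same"),
])
```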


Fig. 8 The architecture of the proposed 2D network. The structures of the 2D color network and the 2D selection network are the same, except for the output. The 2D color network outputs a 3-channel RGB image. The 2D selection network outputs a 16-channel selection feature, which is stacked into the selection feature tower.



Fig. 9 The architecture of 3D selection network. A multi-scale selection feature tower is input into the 3D selection network, which is used to refine the selection towers by improving the correlation of features between different planes. The 3D selection network outputs a multi-scale selection tower.


3. Implementation and simulation

3.1 Method implementation

In the training part, three types of image data sets are used to train the network: multi-view arrays from the Stanford light field data sets, multi-view arrays of a virtual scene rendered with 3ds Max, and multi-view arrays of real 3D scenes captured by ourselves with multiple video cameras. Each image array is a group of 5 views used for synthesizing virtual views. More than 4000 image arrays are prepared for training. The image resolution is resized to 960 × 540. Five 96 × 96 patches are clipped from the five warped images. 64 depth planes are used for tower construction. Three image scales (r = 1, 0.5, 0.25) are used. The loss weights are a1 = 1, b1 = 0.5, c1 = 0.1, a0.5 = 0.5, b0.5 = 0.25, c0.5 = 0.05, a0.25 = 0.25, b0.25 = 0.125, and c0.25 = 0.025. Note that these weights should be normalized before computing the loss.

The CNN is implemented in TensorFlow. RMSProp is employed as the training update rule, with an initial learning rate of 2 × 10−5. The network is trained with a batch size of 2 for 50,000 iterations on two NVIDIA Quadro P6000 GPUs. Each training iteration takes about 3.7 s. In the predicting part, synthesizing a single virtual view takes about 2.2 s at a resolution of 960 × 540 and about 1.5 s at 640 × 480. If the number of dense views is 50, the views are predicted repeatedly, taking nearly 110 s at 960 × 540 or nearly 75 s at 640 × 480. As the number of views increases, more time is consumed.
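For reference, the reported optimizer settings correspond to the following TensorFlow 2.x configuration; this is only a sketch of the hyper-parameters, not the authors' training script.

```python
import tensorflow as tf

optimizer = tf.keras.optimizers.RMSprop(learning_rate=2e-5)  # initial learning rate 2e-5
BATCH_SIZE = 2           # two samples per training iteration
NUM_ITERATIONS = 50_000  # total training iterations
```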

To generate a dense-view sequence for 3D display, an innovative 15.6-inch 3D light-field display is used here. A micro-lens array is mounted on a 4K LCD panel. The size of each micro-lens is 1 cm and the focal length is 1.01 cm. The spacing between lenses is 12 mm. The off-screen distance is set to 20 cm, and the gap between the display panel and the lens array is 1.07 cm. The 3D light-field display supports a 50° viewing angle with 50 viewpoints, where the optimal horizontal parallax is about 16% of the image width.

Peak signal-to-noise ratio (PSNR) and the structural similarity index (SSIM) are calculated to evaluate the synthesized results. In general, when SSIM is higher than 0.95 or PSNR is higher than 30 dB, the image quality is satisfactory; when SSIM is lower than 0.9 or PSNR is lower than 20 dB, the image quality is unacceptable. The residual error map, which describes the absolute error between the original image and the synthesized image, is also calculated.
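A sketch of this evaluation, assuming two same-sized single images in the [0, 255] range, is given below.

```python
import tensorflow as tf

def evaluate(reference, synthesized):
    ref = tf.cast(tf.convert_to_tensor(reference), tf.float32)
    syn = tf.cast(tf.convert_to_tensor(synthesized), tf.float32)
    psnr = tf.image.psnr(ref, syn, max_val=255.0)   # peak signal-to-noise ratio (dB)
    ssim = tf.image.ssim(ref, syn, max_val=255.0)   # structural similarity index
    residual = tf.abs(ref - syn)                    # absolute residual error map
    return float(psnr), float(ssim), residual
```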

3.2 Simulation assessment

The proposed method can synthesize novel views from free-posed views based on unsupervised learning, so it can handle non-uniform parallax problems. To verify its effectiveness, an image data set of a virtual scene is used. 5 × 26 views are rendered with uniform horizontal and vertical parallaxes. The maximum horizontal parallax is about 20% of the image width, and the maximum vertical parallax is about 12% of the image height. Different sparse views are picked out and input into the trained network. The poses of the input views are changed only in the horizontal or vertical direction. For each type of view pose, there are 3 to 4 horizontal parallax changes and 5 vertical changes. Since the output views are mainly for 3D light-field display, the output dense views are synthesized horizontally and compared with the ground truth using PSNR and SSIM; the third row of views is chosen as the ground truth. The simulation results are shown in Fig. 10. It can be clearly seen that the quality of the synthetic views around the poses of the input views is much higher than elsewhere, and that changes of the input view poses directly affect the quality distribution around them. However, the influence between the two sides of the image array is not apparent, because the generated views on either side can obtain enough information from the middle input view while using little from each other.


Fig. 10 Simulation assessment of horizontally synthesized dense views under different parallax situations. 26 views are generated by the proposed method with sparse input views. (a) Input views are posed at different horizontal poses. (b) Input views are posed at different vertical poses. Every row of the view array with different parallax is input into the network. Circles represent different viewpoints. The empty circles ○ represent views at fixed poses. The solid black circles ● represent views at changed poses. The arrows show the direction of pose changes. The crosses × mean there are no input views at the current poses.


In Fig. 10(a), the distributions of PSNR and SSIM of the synthesized dense views are computed for different numbers of views and different horizontal input poses. The top three rows are the results with three input views. The PSNR of the images between the posed views is close to 30 dB and the SSIM is around 0.88. When the two side posed views move inward, the outer synthesized views get dramatically worse, showing that the proposed network is unable to generate views outside the range of the input poses. The middle synthesized views are obviously better than the surrounding views, with PSNR higher than 33 dB and SSIM higher than 0.94, because features from the other two views can be used to generate the middle one. The fourth and fifth rows are the results with four input views. After one more view is added, the synthetic quality rises above 30 dB and 0.90 SSIM, but the other side is still very low. For the sixth and following rows, five views are input into the network, and relatively good results are mostly obtained. Views between view #6 and view #21 are mostly better than the others, with PSNR over 31 dB and SSIM around 0.92. When input view #2 moves from left to right, the quality of the output views between #6 and #11 increases from 30 dB to 31 dB in PSNR and from 0.90 to 0.92 in SSIM, but the views between #1 and #6 decrease from 31 dB to 29 dB in PSNR and from 0.91 to 0.88 in SSIM. Within 7% parallax, the PSNR of the dense views is always higher than 30 dB and the SSIM over 0.9. When the leftmost or rightmost views move inward, views outside the range of the posed views are still hard to predict, at around 26 dB and 0.85 SSIM. From Fig. 10(a), five views are preferred for generating virtual views, and for better results the maximum horizontal parallax should be less than 20% and the horizontal changes of input views #2 and #4 should be less than 7%.

In Fig. 10(b), the distributions of PSNR and SSIM of the synthesized dense views are computed for different vertical input poses. The quality of the output views around the input views is higher than elsewhere. The vertical simulation shows that vertical changes of the leftmost and rightmost views greatly affect the outputs: the quality decreases from 31 dB to 28 dB in PSNR and from 0.91 to 0.87 in SSIM. Within 3% parallax, however, PSNR stays around 30 dB and SSIM around 0.90. Vertical changes of the second left and second right views have only a slight effect on the output views, with PSNR decreasing from 32 dB to 31 dB and SSIM from 0.92 to 0.90 within 10%. When the poses of the two left or two right views are changed at the same time, the quality of the views on that side gradually worsens to about 27 dB PSNR and around 0.86 SSIM. Moreover, when the vertical parallax changes in opposite directions, the result quality is still very similar. The main reason is that, since only the horizontally synthesized views are considered here, input views with the same absolute vertical parallax have similar effects on the output, although the image quality would differ at other poses. From Fig. 10(b), five views with vertical parallaxes can be used to generate virtual views. For better results, the vertical parallax changes of input views #1 and #5 should be no more than 3%, and those of input views #2 and #4 should be less than 10%.

4. Experiment

The experiments are carried out on three image arrays to demonstrate the effectiveness of the proposed method, as shown in Fig. 11: an image array of Stanford images (CD Cases) [22,23], an image array of a virtual scene (Lotus Pool), and an image array of a real 3D scene (Indoor Scene). The scenes are captured from different directions by the cameras, so the image arrays contain both horizontal and vertical parallax. The maximum parallaxes of the Stanford images, the virtual scene, and the real 3D scene are 7%, 15%, and 13% of the image width horizontally, and 2%, 8%, and 7% of the image height vertically, respectively.


Fig. 11 Three types of image data sets. (a) CD Cases is the image array of the Stanford light field. (b) Lotus Pool is the image array of the virtual scene. (c) Indoor Scene is the image array of the real scene. The red lines indicate the vertical parallax in the image arrays.


Detailed configurations of the experiments are shown in Table 1. Comparative experiments are carried out between the proposed network and networks with other structures, including the 2D, 2D + S, 2D + 3D, and 2D + 3D + S networks. A non-learning view interpolation method is also used for comparison. This method is based on optical flow, which is used to compute the correspondence map between views. 3D points are estimated by triangulating matched 2D points, and the novel view is obtained by projecting these 3D points onto the target camera.


Table 1. Configurations of the different methods

4.1 Posed view synthesis

The network can synthesize full-size posed views after being trained on view patches. The results of the synthesized posed views are shown in Fig. 12. PSNR and SSIM are calculated on the red rectangle areas, except for the CD Cases scene, in case the overlapped area is incomplete after view reprojection. It can be seen that view #3 is always the best of the synthesized views, owing to the abundant features extracted from the views on both sides.


Fig. 12 The results of synthesized posed views. The leftmost column is the result of view #3 synthesized by the 2D + 3D + S network. PSNR and SSIM are calculated on the red rectangle areas, except for the CD Cases scene. The middle column shows the PSNR of the different synthesized views under different network structures. The rightmost column shows the SSIM of the different synthesized views under different network structures.


For the first scene, which has no background (CD Cases), the synthetic posed view of the 2D network is better than those of the 2D + 3D and 2D + S networks, with a PSNR around 30 dB (about 2 dB higher) and an SSIM above 0.95 (about 0.01 higher). This is because it is easier for the 2D network to converge when the interference of the background depth is eliminated, whereas the 2D + 3D and 2D + S networks blend the synthesized result with the black background color, which causes poor results. For the second scene (Lotus Pool) and the third scene (Indoor Scene), the performance of the 2D network is the worst and unacceptable, with SSIM below 0.90, due to errors between and within planes. The 2D + 3D network performs a little better than the 2D + S network, with an average SSIM about 0.02 higher. This is because the 2D + 3D network considers the correlation between planes and can adjust the global 3D depth distribution, whereas the 2D + S network focuses on the smoothness of local 2D depth planes, which blurs the synthesized result.

The proposed 2D + 3D + S network is the best of all the compared methods. It can recover the posed views with very good quality: PSNR is always higher than 30 dB, and SSIM is always higher than 0.90, in some cases higher than 0.95. This is because both the 3D depth correlation between planes and the 2D + 3D smoothness within and between planes are considered in the network. With a balance of depth adjustment and plane smoothness, it synthesizes acceptable results with clear details. As a result, the proposed network can deal with a variety of scenes.

4.2 Virtual views synthesis

The network can synthesize desired virtual views by setting the reprojection positions. One extra viewpoint, excluded from the training part, is prepared in advance to evaluate the capability of the network. The virtual views are placed between viewpoint #3 and viewpoint #4. The results synthesized by the different methods are shown in Fig. 13. Image details are shown in red rectangles. PSNR and SSIM of the results are calculated on the blue rectangle areas, except for the CD Cases scene. The residual error map of the blue rectangle area is also calculated, indicating the absolute error between the original view and the synthesized view.


Fig. 13 The virtual view results synthesized by different networks. The leftmost column is the original view prepared in advance. From the second column to the rightmost column are the virtual views synthesized by the 2D network, 2D + S network, 2D + 3D network, 2D + 3D + S network, and optical flow, respectively. Image details are shown in red rectangles. PSNR and SSIM are calculated on the blue rectangle areas, except for the CD Cases scene. The row below the virtual views is the residual error map of the blue rectangle area, indicating the absolute errors between the original view and the synthesized view.


From Fig. 13, it can be seen that the quality of the non-learning optical-flow method is not very high: PSNR is around 24 dB and SSIM around 0.8. This is mainly because there are many black holes in the virtual view and mistakes in the image details and scene edges, caused by the inaccurate correspondence map of traditional optical flow. Compared with the non-learning method, the results of the CNN-based methods are much better.

For the first scene, which has no background (CD Cases), the synthetic virtual view of the 2D network is better than those of the 2D + S and 2D + 3D networks, with PSNR over 30 dB and SSIM above 0.95. This is because the distribution of a scene without background is easy for the 2D network to fit, whereas the result of the 2D + S network is wrongly blurred with the black background due to the local 2D smoothness term, and the result of the 2D + 3D network is wrongly fused with the black background due to the 3D convolution operation. For the second scene (Lotus Pool) and the third scene (Indoor Scene), the results of the 2D network are unacceptable, with SSIM lower than 0.90. In contrast to the posed view synthesis above, the 2D + S network performs a little better than the 2D + 3D network in generating virtual views, with PSNR about 1 dB higher and SSIM about 0.01 higher. The reason is that, for the 2D network, the complex scene is too difficult to converge on and the results are full of errors. The 2D + 3D network adjusts the 3D depth distribution for the posed views during training but produces a wrong reprojection result for the virtual view during prediction. The 2D + S network, on the contrary, blurs some details but reduces the convergence errors and the serious reprojection occlusion errors on the virtual views by smoothing the local 2D depth planes.

The proposed 2D + 3D + S network is still the best of all the compared methods: PSNR is around 30 dB and SSIM is always higher than 0.90. Owing to its ability to adjust the 3D depth and smooth the 2D + 3D planes, the network produces a relatively correct depth distribution for the virtual view while keeping the image details clear. However, there are still some mistakes around scene edges and small occluded objects, because these regions are textureless or hard to distinguish from the surrounding scene, making it difficult for the network to locate the refocus plane from the extracted features and reconstruct the scene correctly. In addition, due to the large parallax between views, some refocus areas overlap each other, which results in errors in the depth distribution and blurring of scene edges.

4.3 Dense virtual view synthesis

A sequence of dense virtual views can be synthesized by the network. As shown in Fig. 14, 50 virtual views are generated repeatedly from 5 calibrated input views. The center of the desired 50-virtual-camera array is set in the middle of the 5 posed cameras. All the virtual cameras are parallel, and the array extends horizontally from the leftmost posed camera to the rightmost one. Three of the views are picked out as the left, middle, and right views to estimate PSNR and SSIM. Since there are no extra parallel video cameras for the real scene, the quality of the synthesized views of the Indoor Scene cannot be estimated. From Fig. 14, we can see that the quality at the two sides is also very high, with PSNR around 30 dB and SSIM around 0.90. Epipolar plane images (EPIs) of the synthesized images are computed, and the parallax lines are very smooth and clear. The dense-view sequence of each scene is presented on our innovative 15.6-inch 3D light-field display with 50 views over a 50° viewing angle. The displayed results show that the dense synthetic views with smooth parallax are well suited to 3D display, with little quality deterioration at any viewpoint. These experimental results demonstrate that the proposed method can provide dense enough views for the 3D light-field display.


Fig. 14 The input views and the synthesized output views of the network. Sequences of dense views can be synthesized. Dense synthesized views of (a) CD Cases, (b) Lotus Pool, and (c) Indoor Scene. The top-right is the EPI of the 50 synthesized views. The bottom shows the dense-view sequences presented on a 3D light-field display (see Visualization 1).


5. Conclusion

In summary, a method is presented to synthesize dense virtual views for the 3D light-field display based on unsupervised learning. Multiple posed views are reprojected and input into the neural network. The network outputs a color tower and a selection tower at the target position, and the final image view is computed by a weighted summation of the two towers. By reprojecting the posed views to the desired position, arbitrary virtual views can be predicted, and a dense-view sequence is generated by repeated predictions for 3D light-field display. Experimental results validate the performance of the proposed network: the PSNR of synthesized views is around 30 dB and the SSIM is over 0.90. Because the proposed method can solve non-uniform full-parallax problems, multiple cameras can be placed in free-posed positions, so there are no strict physical requirements and the method can be used flexibly for real-scene capture. We believe this work will be helpful for the wide application of 3D light-field display in the future.

Funding

973 Program (2017YFB1002900); Fundamental Research Funds for the Central Universities (2018PTB-00-01); The Fund of State Key Laboratory of Information Photonics and Optical Communications (IPOC2017ZZ02).

References

1. X. Sang, X. Gao, X. Yu, S. Xing, Y. Li, and Y. Wu, “Interactive floating full-parallax digital three-dimensional light-field display based on wavefront recomposing,” Opt. Express 26(7), 8883–8889 (2018).

2. X. Yu, X. Sang, X. Gao, Z. Chen, D. Chen, W. Duan, B. Yan, C. Yu, and D. Xu, “Large viewing angle three-dimensional display with smooth motion parallax and accurate depth cues,” Opt. Express 23(20), 25950–25958 (2015).

3. R. Ng, M. Levoy, and M. Brédif, “Light field photography with a hand-held plenoptic camera,” Stanford Tech. Report 2(11), 1–11 (2005).

4. B. Wilburn, N. Joshi, V. Vaish, E.-V. Talvala, E. Antunez, A. Barth, A. Adams, M. Horowitz, and M. Levoy, “High performance imaging using large camera arrays,” ACM T. Graphic 24(3), 765–776 (2005).

5. H. Deng, Q.-H. Wang, and D. Li, “Method of generating orthoscopic elemental image array from sparse camera array,” Chin. Opt. Lett. 10(6), 31–33 (2012).

6. K. Oh, S. Yea, A. Vetro, and Y.-S. Ho, “Virtual view synthesis method and self-evaluation metrics for free viewpoint television and 3D video,” Int. J. Imaging Syst. Technol. 20(4), 378–390 (2010).

7. S. Chan, H. Shum, and K. Ng, “Image-Based Rendering and Synthesis,” IEEE Signal Process. Mag. 24(6), 22–33 (2007).

8. J. Xiao and M. Shah, “Tri-view morphing,” Comput. Vis. Image Underst. 96(3), 345–366 (2004).

9. S. Chan, Z. Gan, and H. Shum, “An object-based approach to plenoptic video processing,” in Proceedings of IEEE International Symposium on Circuits and Systems, (IEEE, 2007), 985–988.

10. D. Ji, J. Kwon, and M. Mcfarland, “Deep view morphing,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, (IEEE, 2017), 7092–7100.

11. T. Zhou, S. Tulsiani, and W. Sun, “View synthesis by appearance flow,” in Proceedings of European Conference on Computer Vision, (Springer, 2016), 286–301.

12. N. Kalantari, T. Wang, and R. Ramamoorthi, “Learning-based view synthesis for light field cameras,” ACM T. Graphic 35(6), 193 (2016).

13. J. Flynn, I. Neulander, and J. Philbin, “Deep stereo: learning to predict new views from the world’s imagery,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, (IEEE, 2016), 5515–5524.

14. G. Wu, M. Zhao, and L. Wang, “Light field reconstruction using deep convolutional network on EPI,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, (IEEE, 2017), 6319–6327.

15. D. Chen, X. Sang, W. Peng, X. Yu, and H. C. Wang, “Multi-parallax views synthesis for three-dimensional light-field display using unsupervised CNN,” Opt. Express 26(21), 27585–27598 (2018).

16. R. Szeliski, Computer Vision: Algorithms and Applications (Springer, 2011).

17. P. Huang, K. Matzen, and J. Kopf, “Deepmvs: Learning multi-view stereopsis,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, (IEEE, 2018), 2821–2830.

18. Y. Yao, Z. Luo, and S. Li, “Mvsnet: Depth inference for unstructured multi-view stereo,” in Proceedings of European Conference on Computer Vision, (Springer, 2018), 767–783.

19. R. Garg, V. BG, and G. Carneiro, “Unsupervised CNN for single view depth estimation: Geometry to the rescue,” in Proceedings of European Conference on Computer Vision, (Springer, 2016), 740–756.

20. C. Godard, O. Aodha, and G. Brostow, “Unsupervised monocular depth estimation with left-right consistency,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, (IEEE, 2017), 6602–6611.

21. G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,” Science 313(5786), 504–507 (2006).

22. V. Vaish, M. Levoy, and R. Szeliski, “Reconstructing Occluded Surfaces Using Synthetic Apertures: Stereo, Focus and Robust Measures,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, (IEEE, 2006), 2331–2338.

23. Stanford Computer Graphics Lab, “The (New) Stanford Light Field Archive,” http://lightfield.stanford.edu/index.html.

Supplementary Material (1)

Visualization 1: Results of synthesized virtual views, and experiments on 3D light-field display.
