
Real-time optical reconstruction for a three-dimensional light-field display based on path-tracing and CNN super-resolution

Open Access

Abstract

Three-dimensional (3D) light-field display plays a vital role in realizing 3D display. However, real-time, high-quality 3D light-field display is difficult because super-high-resolution 3D light-field images are hard to generate in real time. Although extensive research has been carried out on fast 3D light-field image generation, no existing study satisfies both real-time generation and display at super high resolutions such as 7680×4320. To fulfill real-time 3D light-field display with super high resolution, a two-stage 3D image generation method based on path tracing and image super-resolution (SR) is proposed, which renders 3D images in less time than previous methods. In the first stage, path tracing is used to generate low-resolution 3D images with sparse views based on Monte-Carlo integration. In the second stage, a lightweight SR algorithm based on a generative adversarial network (GAN) is presented to up-sample the low-resolution 3D images to high-resolution 3D images with dense views and photo-realistic image quality. To implement the second stage efficiently and effectively, the elemental images (EIs) are super-resolved individually for better image quality and geometric accuracy, and a foreground selection scheme based on ray casting is developed to improve the rendering performance. Finally, the output EIs from the CNN are used to recompose the high-resolution 3D images. Experimental results demonstrate that real-time 3D light-field display at over 30 fps and 8K resolution can be realized, while the structural similarity (SSIM) exceeds 0.90. It is hoped that the proposed method will contribute to the field of real-time 3D light-field display.

© 2021 Optical Society of America under the terms of the OSA Open Access Publishing Agreement

1. Introduction

Three-dimensional (3D) light-field display technology is a popular way to realize 3D display, offering advantages such as all depth cues, full parallax, correct geometric occlusion, and freedom from cue conflicts [1,2]. Integral imaging technology, proposed in 1908, is widely used in 3D light-field reconstruction with full parallax and continuous views [3–5]. The 3D image, called the elemental image array (EIA), is generated by reallocating the pixels of multi-view images and is used for the 3D light-field display.

To generate realistic and high-quality 3D images, many efforts have been devoted to computer-generated integral imaging (CGII) in recent years. Each-camera-viewpoint-independent rendering (ECVIR) increased the number of views and the resolution of individual views concurrently [6]. The multiple viewpoint rendering (MVR) algorithm rendered multi-view images based on the perspective coherence among epipolar plane images [7]. The backward ray-tracing (BRT) algorithm reduced rendering time with a view-independent ray-tracing method [8]. Moreover, to reduce Monte-Carlo noise at low sampling rates, directional path tracing (DPT) combined path-traced Monte-Carlo integration with a recurrent convolutional denoising neural network [9]. Together, these studies provide important insights into 3D image rendering and reconstruction. However, it remains challenging to realize real-time 3D image reconstruction at super high resolution, which is essential for interactive 3D light-field display.

In recent years, deep learning has been widely used in computer vision and has made remarkable progress in tasks such as image denoising, novel view synthesis, and image super-resolution (SR) [10–12]. Image SR can restore a high-resolution image with more details from a low-resolution image. In particular, single image super-resolution (SISR) has been studied extensively over the past decades and is the basis of multi-image super-resolution [13]. The super-resolution convolutional neural network (SRCNN) was the first deep learning-based method to restore high-resolution images with end-to-end learning [14]. To better understand the mechanism of deep convolutional neural networks, very deep super-resolution (VDSR) and the residual dense network (RDN) addressed the problem with 20 convolutional layers and with residual skip connections plus dense connections, respectively [15,16]. To explore the effect of generative adversarial learning, the super-resolution generative adversarial network (SRGAN) first used a GAN to generate high-resolution photo-realistic images [17]. Similarly, the enhanced super-resolution generative adversarial network (ESRGAN) used a GAN to improve the restored image quality [18].

In contrast to the above methods, some image SR methods focus mainly on computational performance. For example, the fast super-resolution convolutional neural network (FSRCNN) realized real-time computation on a generic CPU with a simple network architecture [19], and it was used to generate super multi-view integral images faster than ECVIR [20]. Likewise, the efficient sub-pixel convolutional neural network (ESPCN) first proposed an efficient sub-pixel convolution operation to super-resolve images and videos in real time [21]. Considering all of this evidence, high-quality real-time optical 3D reconstruction appears feasible.

Here, a two-stage method based on path tracing and image SR is proposed to achieve real-time 3D light-field display. In the first stage, a low-resolution 3D image is generated by path tracing, which provides a high-quality image in less time; in the second stage, a lightweight GAN-based CNN super-resolves the low-resolution 3D image into a high-resolution 3D image. To prevent crosstalk among different EIs, each EI is segmented and super-resolved individually with a well-trained CNN. To further accelerate the second stage, a foreground selection scheme based on ray casting is proposed to reduce the computation of SR. Finally, the output EIs are collected to synthesize the high-resolution 3D image. Experimental results demonstrate the validity of the proposed method: real-time 3D light-field display at 8K resolution is realized at over 30 fps, while the SSIM value of the 3D images exceeds 0.90. It is therefore believed that our method can contribute to real-time 3D light-field display in the future.

The schematic comparison is shown in Fig. 1. In the traditional path tracing method, the high-resolution 3D image is generated directly by path tracing, as shown in Fig. 1(a). Since many rays are required for rendering, it is hard to achieve real-time 3D image generation at 8K resolution. Although the super-multiview integral imaging scheme proposed in [20] uses FSRCNN to up-sample the low-resolution 3D images, as shown in Fig. 1(b), it is not appropriate for real-time 3D image generation for three reasons. Firstly, an ECVIR-based algorithm is used to synthesize low-resolution 3D images, which spends much time on sub-aperture image generation with redundant spatial information. Secondly, although FSRCNN is a real-time SR algorithm, its image quality is not satisfactory. Thirdly, this method directly up-samples the low-resolution EIA into a high-resolution EIA, which not only causes crosstalk among different EIs but also incurs memory explosion at 8K resolution. Our method first utilizes path tracing for fast low-resolution 3D image generation, and a lightweight GAN-based SR network then performs real-time, high-quality image restoration. In particular, each EI is up-sampled independently to avoid perspective crosstalk and memory explosion, and a foreground EI selection strategy based on ray casting further improves the rendering performance, as shown in Fig. 1(c).

Fig. 1. The process of (a) traditional method path tracing, (b) super-multiview integral imaging scheme [20], and (c) our proposed method.

2. Method

The overall approach of our proposed method is illustrated in Fig. 2. Firstly, the positions and orientations of the sparse virtual camera array are established, and multi-view images are generated and used to synthesize a low-resolution encoded image based on path tracing. After foreground area selection, the chosen EIs are batched into a lightweight generator CNN for super-resolution. Finally, the outputs of the generator CNN are used to recompose the high-resolution 3D image; during training, the outputs are also fed into an adversarial CNN to improve image quality.

Fig. 2. The overall approach of our proposed method.

2.1 Path tracing and pixel encoding

Path tracing is a fundamental process to render multi-view images from 3D models. With a virtual camera array, multi-view images can be generated by path tracing. To obtain the encoded 3D image, the pixels of the multi-view images should be rearranged, which can be fused into path tracing by setting the origins and directions of the rays launched from the virtual camera array. To reduce the processing time of path tracing, this stage only generates low-resolution 3D images.

Path tracing is a probabilistic method to generate images with general bidirectional reflectance distribution functions (BRDFs), which launches rays from the camera and traces them back to the luminaire [22,23], as shown in Fig. 3(a). The rendering equation of path tracing [24] is expressed as

$${L_s}({{\boldsymbol k}_o}) = \int_{\textrm{all }{{\boldsymbol k}_i}} {\rho ({{\boldsymbol k}_i},{{\boldsymbol k}_o}){L_f}({{\boldsymbol k}_i})\cos {\theta _i}\,d{\sigma _i}}, $$
where ki is the direction of the incident light, ko is the outgoing direction, and ρ(ki, ko) is the BRDF. Ls(ko) is the surface radiance in the direction ko, and Lf(ki) is the field radiance in the direction ki. θi is the angle between ki and the surface normal, and dσi is the differential solid angle associated with Lf(ki). To solve this integral, Monte-Carlo integration approximates it with random samples
$$\int {f(x)d\mu \approx \frac{1}{N}\sum\limits_{i = 1}^N {\frac{{f({x_i})}}{{p({x_i})}}} }, $$
where xi is the random point, and p(xi) is the probability density function of xi. N is the number of samples. To make a more precise approximation, N should be as large as possible. Meanwhile, f(xi)/p(xi) is expected to have a low variance. Based on Monte-Carlo approximation, Eq. (1) can be rewritten as
$${L_s}({{\boldsymbol k}_o}) \approx \frac{1}{N}\sum\nolimits_{i = 1}^N {\frac{{\rho ({{\boldsymbol k}_i},{{\boldsymbol k}_o}){L_f}({{\boldsymbol k}_i})\cos {\theta _i}}}{{p({{\boldsymbol k}_i})}}}. $$
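As a concrete illustration of the Monte-Carlo estimator in Eqs. (2)–(3), the following minimal sketch (Python with NumPy) estimates the outgoing radiance at one surface point, assuming a Lambertian BRDF and uniform hemisphere sampling; the function names and sampling strategy are illustrative assumptions, not the paper's implementation.

import numpy as np

def mc_radiance_estimate(rho, L_f, sample_dir, pdf, normal, n_samples=64):
    """Monte-Carlo estimate of Eq. (3): L_s ~ (1/N) sum of rho(k_i) L_f(k_i) cos(theta_i) / p(k_i)."""
    total = 0.0
    for _ in range(n_samples):
        k_i = sample_dir()                               # random incident direction
        cos_theta = max(float(np.dot(k_i, normal)), 0.0)
        total += rho(k_i) * L_f(k_i) * cos_theta / pdf(k_i)
    return total / n_samples

# Example: diffuse BRDF, uniform hemisphere sampling about the normal (0, 0, 1).
rng = np.random.default_rng(0)
normal = np.array([0.0, 0.0, 1.0])

def sample_dir():
    u1, u2 = rng.random(2)                               # uniform over the upper hemisphere
    r, phi = np.sqrt(max(0.0, 1.0 - u1 * u1)), 2.0 * np.pi * u2
    return np.array([r * np.cos(phi), r * np.sin(phi), u1])

estimate = mc_radiance_estimate(rho=lambda k: 0.8 / np.pi,    # Lambertian BRDF, albedo 0.8
                                L_f=lambda k: 1.0,            # constant field radiance
                                sample_dir=sample_dir,
                                pdf=lambda k: 1.0 / (2.0 * np.pi),
                                normal=normal,
                                n_samples=4096)
print(estimate)   # converges to 0.8 as the number of samples grows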

Fig. 3. Schematic diagram of path tracing. (a) The process of path tracing using virtual camera arrays. (b) Multi-view images.

In order to capture multi-view images, path tracing is used to generate images from different perspectives with a virtual camera array, as shown in Fig. 3(a). Furthermore, the multi-view images are illustrated in Fig. 3(b).

To obtain EIA images for the 3D light-field display, the pixel encoding process should be applied to the multi-view images. In practice, path tracing can perform the pixel encoding directly by setting the origins and directions of the rays launched from the virtual camera array.

Suppose that the resolution of the EIA image is W×H and that the numbers of views along the horizontal and vertical axes are px and py, respectively. The size of each sub-aperture image is then sx=W/px and sy=H/py. In general, the origin and direction of a ray in path tracing can be written as

$$\left\{ {\begin{array}{{c}} {{\boldsymbol{Ori}} = {\boldsymbol{eye}} + k(CamDis,0,0) + m(0,CamDis,0)}\\ {{\boldsymbol{Dir}} = norm({dx \times {\boldsymbol U} + dy \times {\boldsymbol V} + {\boldsymbol W} - k(CamDis,0,0) - m(0,CamDis,0)} )} \end{array}} \right., $$
where CamDis indicates the distance of adjacent virtual cameras. eye is the origin of the central camera. U, V, and W are unit vectors in x, y, and z axes, respectively, and norm is the normalization operation. k and m are the indices of virtual cameras in horizontal and vertical axes, respectively. dx and dy are the normalized indices of pixels in horizontal and vertical axes, respectively.

When the multi-view images are not encoded, the values of k, m, dx, and dy can be expressed as

$$\left\{ {\begin{array}{{c}} {k = \lfloor{{i / {{p_x}}}} \rfloor \textrm{ - }\lfloor{{{{p_x}} / 2}} \rfloor }\\ {m = \lfloor{{j / {{p_y}}}} \rfloor \textrm{ - }\lfloor{{{{p_y}} / 2}} \rfloor } \end{array}} \right., $$
$$\left\{ {\begin{array}{{c}} {dx = ({{{\bmod ({i,{s_x}} )} / {{s_x}}}} )\times 2 - 1}\\ {dy = ({{{\bmod ({j,{s_y}} )} / {{s_y}}}} )\times 2 - 1} \end{array}} \right., $$
where i and j are indices of pixels in horizontal and vertical axes, respectively, and the diagrammatic sketch of path tracing without pixel encoding is shown in Fig. 4(a). When the pixel encoding is used, the values of k, m, dx, and dy are rewritten as Eqs. (7) and (8), and the positions of pixels are rearranged, as shown in Fig. 4(b). Moreover, the result of the encoded EIA image is shown in Fig. 4(c).
$$\left\{ {\begin{array}{{c}} {k = \bmod ({i,{p_x}} )- \lfloor{{{{p_x}} / 2}} \rfloor }\\ {m = \bmod ({j,{p_y}} )- \lfloor{{{{p_y}} / 2}} \rfloor } \end{array}} \right., $$
$$\left\{ {\begin{array}{{c}} {dx = ({{i / W}} )\times 2 - 1}\\ {dy = ({{j / H}} )\times 2 - 1} \end{array}} \right.. $$

Fig. 4. The diagram of pixel encoding in path tracing. (a) Path tracing without pixel encoding. (b) Path tracing with encoding. (c) The encoded EIA image.

2.2 High resolution EIA image generation based on SR with CNN

Following the approach introduced in Sec. 2.1, a low-resolution 3D image can be generated. To obtain the high-resolution 3D image quickly, a lightweight CNN is used to super-resolve the low-resolution 3D image.

The schematic diagram of the proposed SR algorithm is shown in Fig. 5. Since each EI is viewed independently, it should be super-resolved individually for better image quality. In addition, the background areas of the EIA do not need to be processed in this stage. Therefore, only foreground EIs are selected for super-resolution. Specifically, a ray casting scheme is proposed to perform foreground selection, which identifies foreground areas by collision detection between the cast rays and the foreground 3D model. Notably, ray casting is integrated into path tracing and thus takes no extra time. An sx×sy mask EI buffer is then employed to record the state of each EI, as shown in Fig. 5(a). After division, the foreground EIs are extracted and batched into a lightweight CNN, as shown in Figs. 5(b)–5(c). The SR operation is performed by the CNN, and (s×px)×(s×py) foreground EIs are output, as shown in Fig. 5(d), where s is the scaling factor of SR. Finally, to synthesize the high-resolution EIA, a high-resolution blank buffer is created and pre-filled with the background color. The output EIs from the CNN are then written back to the corresponding positions in this buffer according to the records of the mask buffer. With this filling process, the high-resolution EIA image is reconstructed, as shown in Figs. 5(e)–5(f).
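A minimal sketch of the EI extraction and recomposition described above is given below (CPU-side NumPy for readability; in the actual pipeline these steps run on the GPU, and the array layout is an assumption).

import numpy as np

def extract_foreground_eis(lr_eia, mask, px, py):
    """Split the low-resolution EIA into EIs and keep only the foreground ones.

    lr_eia : (sy*py, sx*px, 3) low-resolution EIA, where each EI covers py x px pixels.
    mask   : (sy, sx) boolean foreground map produced by ray casting.
    """
    sy, sx = mask.shape
    eis, positions = [], []
    for r in range(sy):
        for c in range(sx):
            if mask[r, c]:
                eis.append(lr_eia[r * py:(r + 1) * py, c * px:(c + 1) * px])
                positions.append((r, c))
    return np.stack(eis), positions

def recompose_eia(sr_eis, positions, mask_shape, px, py, s, bg_color=(0, 0, 0)):
    """Write the super-resolved EIs back into a background-filled high-resolution buffer."""
    sy, sx = mask_shape
    hr = np.empty((sy * py * s, sx * px * s, 3), dtype=sr_eis.dtype)
    hr[...] = bg_color                          # pre-fill with the background color
    for ei, (r, c) in zip(sr_eis, positions):
        hr[r * py * s:(r + 1) * py * s, c * px * s:(c + 1) * px * s] = ei
    return hr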

Fig. 5. Schematic diagram of the proposed SR algorithm.

To improve computation speed and image quality simultaneously, the CNN consists of a lightweight generator network and a deep adversarial network, as shown in Fig. 6. On the one hand, the GAN-based algorithm can generate photo-realistic results and thus improve the image quality as far as possible [17]. On the other hand, the role of the generator network is to synthesize the high-resolution 3D image, as shown in Fig. 6(a), while the role of the adversarial network is to discriminate the generated SR result from the ground truth through the adversarial loss function, without regenerating the 3D image, as shown in Fig. 6(b). Owing to this inherent property of GANs, the adversarial network only optimizes the generator network for better image quality and does not generate any image. Therefore, the adversarial network is not required in the test phase for high-resolution 3D image generation, and its complexity does not influence the computation speed. In practice, the generator network consists of two convolution layers and one deconvolution layer for fast generation, as shown in Fig. 6(a), and the adversarial network comprises eight convolutional layers and seven batch normalization layers along with two fully connected layers, similar to SRGAN, as shown in Fig. 6(b).
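Because the paper builds the networks in Caffe, the following PyTorch-style sketch only illustrates the stated generator topology (two convolution layers followed by one deconvolution layer for up-sampling); the channel counts, kernel sizes, and activation functions are assumptions.

import torch
import torch.nn as nn

class LiteGenerator(nn.Module):
    """Two convolutions followed by one transposed convolution that up-samples by `scale`."""

    def __init__(self, scale=4, channels=3, features=32):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, features, kernel_size=5, padding=2),
            nn.PReLU(),
            nn.Conv2d(features, features, kernel_size=3, padding=1),
            nn.PReLU(),
            # a single deconvolution layer performs the up-sampling by `scale`
            nn.ConvTranspose2d(features, channels, kernel_size=2 * scale,
                               stride=scale, padding=scale // 2),
        )

    def forward(self, x):
        return self.body(x)

# Example: a mini-batch of 96 low-resolution EIs of size py x px = 32 x 32
generator = LiteGenerator(scale=4)
print(generator(torch.randn(96, 3, 32, 32)).shape)   # torch.Size([96, 3, 128, 128])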

Fig. 6. The architecture of the proposed network. (a) The generative network, and (b) the adversarial network.

To train the proposed network, the ImageNet dataset is used as the training dataset, since it contains high-quality images suitable for SR training. In addition, renderings of several 3D models at different perspectives and scales are used to augment the dataset. To generate low-resolution and high-resolution image pairs for training, all images in the datasets are first cropped into 256×256 patches, which are then down-sampled to the corresponding low-resolution patches.
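A possible patch-preparation routine is sketched below (Python with Pillow; the bicubic down-sampling kernel and non-overlapping crops are assumptions, since the paper only states that 256×256 patches are cropped and then down-sampled).

import numpy as np
from PIL import Image

def make_training_pairs(img_path, patch=256, scale=4):
    """Crop an image into non-overlapping HR patches and down-sample each to its LR pair."""
    img = Image.open(img_path).convert("RGB")
    w, h = img.size
    pairs = []
    for top in range(0, h - patch + 1, patch):
        for left in range(0, w - patch + 1, patch):
            hr = img.crop((left, top, left + patch, top + patch))
            lr = hr.resize((patch // scale, patch // scale), Image.BICUBIC)
            pairs.append((np.asarray(lr), np.asarray(hr)))
    return pairs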

In the training stage, the loss function of the network is formulated as Eq. (9), which is the sum of a content loss and an adversarial loss, where λ=1×10−3 is the weight of the adversarial loss. The content loss is the mean squared error between the generated result and the ground truth, as formulated in Eq. (10). The adversarial loss is a logarithmic term that evaluates how well the distribution of the generated images matches that of the ground truth, as defined in Eq. (11). In addition, the optimizer is Adam, the learning rate is 1×10−4, and the network is trained for 100 epochs with a batch size of 8.

$$l = {l_{content}} + \lambda {l_{adversarial}}, $$
$${l_{content}} = \frac{1}{{{s^2}WH}}\sum\nolimits_{x = 1}^{sW} {\sum\nolimits_{y = 1}^{sH} {{{\left( {I_{x,y}^{HR} - G{{({I^{LR}})}_{x,y}}} \right)}^2}} }, $$
$${l_{adversarial}} = \sum\nolimits_{n = 1}^N {\log ({1 - D({G({{I^{LR}}} )} )} )} . $$
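The combined loss of Eqs. (9)–(11) can be written compactly as follows (a PyTorch-style sketch for readability; the paper's implementation uses Caffe, and the small epsilon added for numerical stability is not part of the original formulation).

import torch
import torch.nn.functional as F

def generator_loss(hr, sr, disc_out, lam=1e-3):
    """hr, sr: ground-truth and generated HR batches; disc_out: D(G(I_LR)) in [0, 1]."""
    l_content = F.mse_loss(sr, hr)                            # Eq. (10), per-pixel MSE
    l_adversarial = torch.log(1.0 - disc_out + 1e-8).sum()    # Eq. (11)
    return l_content + lam * l_adversarial                    # Eq. (9), with lambda = 1e-3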

Furthermore, to realize real-time 3D image generation, parallel computing on the GPU is essential for both path tracing and SR, as shown in Fig. 7. For path tracing, the virtual rays are launched simultaneously by parallel GPU threads. For SR, the computation is accelerated by running the CNN on the GPU. Note that although path tracing and SR are executed in different pipelines, the data communication between these pipelines is performed directly on the GPU, as shown in Fig. 7, so the calculation speed is not affected by data transfer.

Fig. 7. Flowchart of the proposed method with GPU.

3. Experimental configuration

The 3D light-field display used in the experiments is shown in Fig. 8, and its numerical parameters are summarized in Table 1. The PC used in the experiments consists of an Intel Core i7-4790 CPU @ 3.6 GHz with 8 GB RAM and an NVIDIA Quadro P5000 GPU. To launch rays in parallel, the ray-tracing engine NVIDIA OptiX 5.1.1 is used to implement path tracing. For SR, the Caffe framework is used to build the CNN [25]. In the implementation, both OptiX and Caffe are driven from C++, so the data communication between them can be performed directly through the CUDA toolkit, which saves considerable time and helps realize real-time 3D reconstruction.

Fig. 8. The photograph of the 3D light-field display.

Table 1. Experimental configuration for 3D light-field display.

4. Experimental results

4.1 Image quality evaluation

The quality of the high-resolution EIA images generated by our method is evaluated with different models, as shown in Fig. 9. SSIM is used as the quantitative metric for image quality, measuring the similarity between the generated image and the reference image. To obtain the reference images for comparison, path tracing is used to generate high-resolution EIA images with more than 1000 Monte-Carlo samples, and the reference images are shown in the first column of Fig. 9. The residual errors of the generated images and the SSIM values of our method are shown in the fourth column of Fig. 9. To verify our method, the bicubic method is also used to generate 3D images, with the results shown in the second column of Fig. 9. It can be seen that the SSIM values of our method are over 0.90, significantly higher than those of the bicubic method, and the residual errors of our method are smaller as well, which indicates that the image quality of our method is acceptable.
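The SSIM evaluation can be reproduced with a standard implementation, for example the one in scikit-image (the file names below are hypothetical).

from skimage.metrics import structural_similarity as ssim
import imageio.v2 as imageio

ref = imageio.imread("reference_eia.png")   # rendered with >1000 Monte-Carlo samples
gen = imageio.imread("generated_eia.png")   # output of the proposed two-stage method

# channel_axis requires scikit-image >= 0.19; older versions use multichannel=True
score = ssim(ref, gen, channel_axis=-1, data_range=255)
print(f"SSIM = {score:.3f}")                # the paper reports values above 0.90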

Fig. 9. Five virtual 3D models are evaluated in the experiments. The first column is the original models, the second, third, and fourth columns are the residual errors of EIA images using bicubic method, generator CNN, and our method, respectively. The last column is the displayed results on the 3D light-field display using our method.

Furthermore, to demonstrate the effect of the adversarial network, the generator network is also trained without it, and the quantitative and qualitative results of this network are shown in the third column of Fig. 9. It is apparent that without the adversarial network, the residual errors of the generated images are larger, the edges are more blurred, and the colors have lower fidelity. In addition, the SSIM values are lower than those of the CNN trained with the adversarial network, which demonstrates the significance of the adversarial network. On the other hand, to evaluate the influence of the number of convolutional layers, the generator network is also trained with different numbers of convolutional layers. However, the SSIM results are somewhat counterintuitive, as shown in Fig. 10(c): increasing the number of convolutional layers does not improve the SSIM values. A possible explanation is the adversarial network, which already improves the quality of the generated images significantly, so slightly increasing the number of convolutional layers in the generator does not yield a noticeable quality gain.

Fig. 10. Performance evaluation of the proposed method. (a) The relation between frame rate and the number of EIs per mini-batch. (b) The frame rate with different percentages of the foreground EIs. (c) The frame rate and SSIM values with different numbers of convolutional layers in the generator network. (d) The frame rate with different CNN architectures.

To validate the effect of the GAN in terms of geometric structure, the sub-aperture images are synthesized and epipolar plane images (EPIs) are computed, as shown in Fig. 11. Our proposed network is compared with two super-resolution schemes using FSRCNN, which super-resolve the 3D image based on the EIA and on individual EIs, respectively. With EIA super-resolution, the result is corrupted by crosstalk among different EIs, and hence the output image is blurry with low-fidelity colors, as shown in Fig. 11(a). On the other hand, although FSRCNN with EI super-resolution prevents background color confusion, its EPI result still shows low geometric accuracy, as shown in Fig. 11(b). In contrast, our method makes full use of the advantages of the GAN and generates high-quality multi-view images with high geometric accuracy.
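Assuming the sub-aperture images have been decoded from the EIA into a five-dimensional array, an EPI slice can be extracted as in the sketch below (illustrative NumPy; the array layout is an assumption).

import numpy as np

def horizontal_epi(sub_aperture, pixel_row, view_row):
    """Extract a horizontal EPI from a sub-aperture stack of shape (py, px, sy, sx, 3).

    The returned (px, sx, 3) image stacks one pixel row over all horizontal views;
    scene points appear as slanted lines whose slope encodes their depth, so geometric
    errors in the super-resolved views show up as broken or wavy lines.
    """
    return sub_aperture[view_row, :, pixel_row, :, :]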

Fig. 11. Horizontal and vertical EPI for the “Head” model using different super-resolution schemes. For better visualization, the vertical EPI is rotated by 90 degrees.

4.2 Performance evaluation

In this section, the rendering speed of our proposed method is evaluated. To compare the performance of the traditional path tracing method and our proposed method, four experimental schemes with different SR scaling factors are evaluated on the "Unicorn" model, as shown in Fig. 12(a). From the data in Fig. 12(a), it is apparent that our method with different scaling factors renders faster than path tracing at all tested resolutions. To further verify the effectiveness of our method, the models in Fig. 9 are used to evaluate the rendering speed at 8K resolution, as shown in Fig. 12(b). For a fair comparison, the distance between the 3D models and the virtual camera array is fixed at 70 cm, which is an appropriate viewing distance. Our method without foreground selection is stable across different models, and the most striking result in Fig. 12(b) is that when the scaling factor s=4, the rendering speed at 7680×4320 resolution exceeds 30 fps, which demonstrates that our method realizes real-time 3D image generation for the 3D light-field display. Furthermore, with the foreground area selection scheme, the frame rate is further increased by over 5 fps at s=4, which verifies that this scheme reduces the time spent on background EIs and improves the rendering speed.

Fig. 12. Rendering speed using different schemes and different models. (a) Comparison of rendering speed between the proposed method and traditional path tracing method. (b) Comparison of rendering speed using different 3D models at 8 K resolution. (PT represents path tracing, and SR represents super-resolution).

To compare the processing time of our method with existing accelerated methods for 3D light-field display, the super multi-view integral image generation method [20] and the DPT method [9] are used for comparison. The first method used FSRCNN to accelerate image generation relative to ECVIR, and the second realized real-time 3D display at 3840×2160 resolution using path tracing. Note that due to the memory explosion problem of the first method, its low-resolution 3D images are also divided into EIs for SR. Table 2 presents the summary statistics for these methods at 7680×4320 resolution. The data in Table 2 show that the rendering speed of our method is significantly higher than that of the other methods. The discrepancy between the super multi-view integral image generation method and our method arises because the former uses 3ds Max to generate the low-resolution multi-view images, which is substantially slower than the path tracing in our method. On the other hand, the discrepancy between DPT and our method mainly arises from the different usages of path tracing: DPT synthesizes high-resolution EIA images directly, whereas our method only generates low-resolution EIA images, which saves time.

Table 2. Rendering speed using different methods.

The next part of the experimental results concerns four factors that influence the processing speed. The first factor is the number of low-resolution EIs per batch in the SR stage, called the step. For a given step, the CNN runs N = FEI/step times to process all of the foreground EIs, where FEI is the number of foreground EIs. To assess the impact of the step, its value is sampled from 8 to 208 at a fixed interval of 8, and the rendering time is evaluated using five models at 7680×4320 resolution with the scaling factor s=4. Figure 10(a) provides the experimental data on processing speed. Interestingly, when the step is lower than 125, the frame rate is near or above 30 fps, but it drops rapidly once the step exceeds 125. A possible explanation is that when the step is too large, the SR speed is limited by the GPU, which increases the total processing time.
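The batching scheme described above can be sketched as follows (illustrative Python; rounding up the number of batches and the callable interface of the network are assumptions).

import math

def run_sr_in_batches(foreground_eis, sr_net, step=96):
    """Process the F_EI foreground EIs in N = ceil(F_EI / step) mini-batches."""
    outputs = []
    for b in range(math.ceil(len(foreground_eis) / step)):
        outputs.extend(sr_net(foreground_eis[b * step:(b + 1) * step]))
    return outputs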

The second factor is the proportion of foreground EIs, p = FEI/NEI, where NEI is the total number of EIs. To analyze the impact of p, different values of p are obtained by changing the distance between the virtual camera array and the virtual 3D model. The results, shown in Fig. 10(b), are obtained at 7680×4320 resolution with the "Red dragon" model and the scaling factor s=4. From this data, it can be seen that the frame rate declines significantly as p increases, and when p exceeds 0.85, the frame rate drops below 20 fps, which can hardly satisfy real-time 3D display. This can be explained by the fact that as p increases, the time for both path tracing and SR increases. Furthermore, the traditional path tracing method is also evaluated with different p, illustrated as the purple curve in Fig. 10(b). Comparing the two results, although the performance of our method degrades as p increases, it is still better than the traditional path tracing method, and it can be further improved in future work.

The third factor is the number of convolutional layers in the generator network. To evaluate the influence of the convolutional layers on rendering speed, five generator networks with different numbers of convolutional layers are used, and the resulting frame rates are shown in Fig. 10(c). The frame rate is strongly affected by the number of convolutional layers: as the number of convolutional layers increases, the frame rate decreases rapidly. When the number of convolutional layers is larger than 5, the frame rate falls below 10 fps, which can hardly satisfy real-time 3D image generation.

The last factor is the architecture of the CNN. To validate the efficiency of our proposed generator network, two real-time networks, FSRCNN and ESPCN, are used to replace our generator network for performance evaluation at 8K resolution, and the resulting frame rates for different models are shown in Fig. 10(d). To further analyze the computation of the different networks, the floating point operations (FLOPs) of the three networks are listed in Table 3. The FLOPs of FSRCNN are the largest; thus, its frame rate is the lowest, and it can hardly achieve real-time 3D display at 8K resolution. Although ESPCN uses the pixel shuffling operation to decrease computation, its speed is lower than that of our proposed network. In contrast, our network uses the fewest FLOPs and achieves the highest frame rate, while the image quality remains acceptable, as discussed in Sec. 4.1.
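The FLOPs comparison can be approximated with the standard per-layer estimate of roughly 2·H_out·W_out·C_in·C_out·k² multiply-add operations per convolution, as in the sketch below (biases, activations, and the deconvolution layer are ignored, and the layer sizes shown are hypothetical).

def conv_flops(h_out, w_out, c_in, c_out, k):
    """Approximate FLOPs of one k x k convolution layer (each multiply-add counted as 2)."""
    return 2 * h_out * w_out * c_in * c_out * k * k

# Hypothetical example: a 3x3 convolution, 32 -> 32 channels, on a 32 x 32 EI
print(conv_flops(32, 32, 32, 32, 3) / 1e6, "MFLOPs")   # about 18.9 MFLOPs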

Table 3. FLOPs of different real-time super-resolution networks.

4.3 Presenting on 3D light-field display

To visualize the synthesized EIA image, the 3D mesh model "Red Dragon" is presented on our 32-inch 3D light-field display, as shown in Fig. 13. A glass material with varying extinction is rendered to demonstrate global illumination. High-quality 3D images with little noise are presented on the 3D light-field display. Overall, these results suggest that our proposed method can realize real-time and high-quality optical reconstruction.

Fig. 13. Reconstructed results on 3D light-field display (see Visualization 1).

5. Conclusion

In summary, a two-stage method based on path tracing and image SR is presented to realize real-time 3D light-field display with super high resolution. In the first stage, low-resolution 3D images are generated by path tracing. In the second stage, a lightweight CNN quickly generates high-resolution 3D images from the low-resolution ones. In addition, to implement SR efficiently, foreground area selection and EI splitting are developed. The experimental results show that the proposed method can synthesize 3D images at 7680×4320 resolution at over 30 fps, which is faster than existing 3D image generation methods, and the SSIM value exceeds 0.90. These results demonstrate that the proposed method realizes real-time 3D image generation with acceptable image quality. It is believed that the proposed method can be helpful for real-time 3D light-field display at even higher resolutions in the future.

Funding

National Natural Science Foundation of China (62075016, 61905017, 61905020); Fundamental Research Funds for the Central Universities (2021RC09, 2021RC14).

Disclosures

The authors declare no conflicts of interest. This work is original and has not been published elsewhere.

Data availability

Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request.

References

1. N. Balram and I. Tošić, “Light-field imaging and display systems,” Information Disp. 32(4), 6–13 (2016). [CrossRef]  

2. X. Sang, X. Gao, X. Yu, S. Xing, Y. Li, and Y. Wu, “Interactive floating full-parallax digital three-dimensional light-field display based on wavefront recomposing,” Opt. Express 26(7), 8883–8889 (2018). [CrossRef]  

3. H. E. Ives, “Optical properties of a Lippmann lenticulated sheet,” J. Opt. Soc. Am. 21(3), 171–176 (1931). [CrossRef]  

4. M. Guo, Y. Si, Y. Lyu, S. Wang, and F. Jin, “Elemental image array generation based on discrete viewpoint pickup and window interception in integral imaging,” Appl. Opt. 54(4), 876–884 (2015). [CrossRef]  

5. Z. Yan, X. Yan, X. Jiang, and L. Ai, “Computational integral imaging reconstruction of perspective and orthographic view images by common patches analysis,” Opt. Express 25(18), 21887–21900 (2017). [CrossRef]  

6. K. Yanaka, “Integral photography using hexagonal fly's eye lens and fractional view,” Proc. SPIE 6803, 68031K (2008). [CrossRef]  

7. M. Halle, “Multiple viewpoint rendering,” in Proceedings of the 25th Annual Conference on Computer Graphics and Interactive Techniques (ACM, 1998), pp. 243–254.

8. S. Xing, X. Sang, X. Yu, C. Duo, B. Pang, X. Gao, S. Yang, Y. Guan, B. Yan, J. Yuan, and K. Wang, “High-efficient computer-generated integral imaging based on the backward ray-tracing technique and optical reconstruction,” Opt. Express 25(1), 330–338 (2017). [CrossRef]  

9. Y. Li, X. Sang, S. Xing, Y. Guan, S. Yang, D. Chen, L. Yang, and B. Yan, “Real-time optical 3D reconstruction based on Monte Carlo integration and recurrent CNNs denoising with the 3D light field display,” Opt. Express 27(16), 22198–22208 (2019). [CrossRef]  

10. Y. Wang, X. Song, and K. Chen, “Channel and Space Attention Neural Network for Image Denoising,” IEEE Signal Process. Lett. 28, 424–428 (2021). [CrossRef]  

11. C. L. Liu, K. T. Shih, J. W. Huang, and H. H. Chen, “Light Field Synthesis by Training Deep Network in the Refocused Image Domain,” IEEE Trans. on Image Process. 29, 6630–6640 (2020). [CrossRef]  

12. Y. Wang, X. Ying, L. Wang, J. Yang, W. An, and Y. Guo, “Symmetric parallax attention for stereo image super-resolution,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (IEEE, 2021), pp. 766–775.

13. S. Anwar, S. Khan, and N. Barnes, “A deep journey into super-resolution: A survey,” ACM Comput. Surv. 53(3), 1–34 (2020). [CrossRef]  

14. C. Dong, C. Loy, K. He, and X. Tang, “Image super-resolution using deep convolutional networks,” IEEE Trans. Pattern Anal. Mach. Intell. 38(2), 295–307 (2016). [CrossRef]  

15. J. Kim, J. K. Lee, and K. M. Lee, “Accurate image super-resolution using very deep convolutional networks,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (IEEE, 2016), pp. 1646–1654.

16. Y. Zhang, Y. Tian, Y. Kong, B. Zhong, and Y. Fu, “Residual dense network for image super-resolution,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (IEEE, 2018), pp. 2472–2481.

17. C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, and W. Shi, “Photo-realistic single image super-resolution using a generative adversarial network,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (IEEE, 2017), pp. 4681–4690.

18. N. C. Rakotonirina and A. Rasoanaivo, “ESRGAN+: Further improving enhanced super-resolution generative adversarial network,” in IEEE International Conference on Acoustics, Speech and Signal Processing (IEEE, 2020), pp. 3637–3641.

19. C. Dong, C. Loy, and X. Tang, “Accelerating the super-resolution convolutional neural network,” in European Conference On Computer Vision (Springer, 2016), pp. 391–407.

20. H. Ren, Q. H. Wang, Y. Xing, M. Zhao, L. Luo, and H. Deng, “Super-multiview integral imaging scheme based on sparse camera array and CNN super-resolution,” Appl. Opt. 58(5), A190–A196 (2019). [CrossRef]  

21. W. Shi, J. Caballero, F. Huszár, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang, “Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (IEEE, 2016), pp. 1874–1883.

22. S. Marschner and P. Shirley, Fundamentals of Computer Graphics, (CRC, 2018), Chap. 20.

23. J. T. Kajiya, “The rendering equation,” in Proceedings of the 13th Annual Conference on Computer Graphics and Interactive Techniques (ACM, 1986), pp. 143–150.

24. D. S. Immel, M. F. Cohen, and D. P. Greenberg, “A radiosity method for non-diffuse environments,” SIGGRAPH Comput. Graph. 20(4), 133–142 (1986). [CrossRef]  

25. Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” in Proceedings of the 22nd ACM International Conference on Multimedia (ACM, 2014), pp. 675–678.

Supplementary Material (1)

Visualization 1: This video shows two real-time 3D light-field display results, the "Red dragon" and "Unicorn" models. With the proposed method, the frame rate of the 8K-resolution 3D light-field display exceeds 30 fps.

