## Abstract

Synthetic aperture integral imaging (SAII) using monocular video with an arbitrary camera trajectory enables casual acquisition of three-dimensional information of scenes at any scale. This paper presents a novel algorithm for computational reconstruction and imaging of scenes in this SAII system. Since dense geometry recovery and virtual view rendering are required to handle such unstructured input, to reduce computational costs and artifacts in both stages we assume flat surfaces in homogeneous areas and take full advantage of the per-frame edges, which are accurately reconstructed beforehand. A dense depth map of each real view is first estimated by successively generating two complete depth maps, named the smoothest-surface and densest-surface depth maps, each respecting a local cue, and then merging them via Markov random field global optimization. This way, high-quality perspective images of any virtual camera array can be synthesized simply by back-projecting the obtained closest surfaces into the new views. The pixel-level operations throughout most of our pipeline allow high parallelism. Simulation results show that the proposed approach is robust to view-dependent occlusions and lack of textures in the original frames, and can produce recognizable slice images at different depths.

© 2018 Optical Society of America under the terms of the OSA Open Access Publishing Agreement

## 1. Introduction

Integral imaging, first proposed in 1908 [1], is the most promising naked-eye three-dimensional (3D) display technique so far. It can produce full-color, full-parallax, and quasi-continuous 3D effects based on an elemental image (EI) array. By using a micro-lens array and a single camera, each EI records the scene information in a particular viewing direction. Normally, the EIs have low resolution, which is limited by the camera field of view, and suffer from aberrations due to the small aperture of the micro-lenses. These issues can be avoided by synthetic aperture integral imaging (SAII) [2], which obtains multiple perspective images with a camera array or a translated camera (Fig. 1). However, traditional acquisition methods (*e.g*. [3,4]) have high hardware or calibration costs, the inconvenience of dismounting and moving, or limitations on the captured areas. In view of the above drawbacks, this paper is devoted to an SAII system which uses an unstructured (*i.e*. casually shot) monocular video (Fig. 6). This manner is cheaper, simpler, and faster, and is also capable of acquiring the complete information of scenes at any scale, from small objects to enormous buildings.

Since the camera trajectory is arbitrary and the scene geometry is unknown, we introduce a novel approach for computational reconstruction and imaging of 3D scenes in this SAII system. Our pipeline goes through two main stages to generate the perspective images for a given virtual camera array. The first stage, *dense geometry recovery*, reconstructs a dense 3D scene model from the calibrated video frames, which is then used in the second stage, *virtual view rendering*, to synthesize new views. Note that it is in fact also practical to produce the slice images directly from the calculated camera extrinsic parameters and depth information, but warping the massive number of frames many times is computationally expensive. Instead, slice image generation can be simulated more simply by shifting and accumulating the virtual perspective images.

One major challenge of this pipeline is that directly adopting classic schemes on the highly redundant video data would incur high computational costs in both stages. Another problem is caused by homogeneous surfaces (with insufficient texture for multi-view matching) in the scene, which are difficult for most algorithms to recover. The resulting geometry errors or holes might lead to artifacts in the synthesized views and displayed 3D images.

The work of Kim *et al*. [5] has shown that accurate 3D edges can be calculated by adeptly exploiting dense input view samples, and subsequently used to recover the bounded homogeneous areas. However, their pipeline focuses on handling regular light fields and is not suitable for unstructured videos. Inspired by this sparse-to-dense strategy, our SAII approach (Fig. 2) is based on high-quality edge depth maps of densely sampled frames, which are reconstructed beforehand using the ray-wise algorithm of [6] (our previous research).

In the proposed approach, the dense geometry recovery stage depends on two complete depth maps. The *smoothest-surface depth map* assumes flat surfaces in homogeneous areas and uses the per-frame edge depths to perspectively fit the smoothest possible surface in between. This depth map may produce interpolants that connect fore- and background edges but occlude the real surfaces (Fig. 5(d)), particularly when the real and virtual views have distinct viewing directions. The *densest-surface depth map* is calculated by propagating the estimates with the highest surface density among the former depth maps. It can effectively remove the wrong interpolants, because essentially all these errors have large slopes (thus producing low-density surfaces) and are replaced by correct depths from other views; however, it possibly induces depth discontinuities which are not aligned with object edges (Fig. 5(f)). Therefore, we borrow the idea of Tao *et al*. [7] to combine the local cues of smoothness and density, by merging the above depth maps of each frame via global minimization of Markov random fields (MRFs). The optimized depth maps of all real views are satisfactory in terms of both cues. This allows us to synthesize new perspective images simply by back-projecting the closest surfaces in the virtual view rendering stage.

Most of the techniques we employ operate per pixel, thus supporting high parallelism on graphics processing units (GPUs). Simulation results show that the presented approach is robust to view-dependent occlusions and lack of textures in the original frames, which enables recognizable slice images at different depths to be produced in the display part of our SAII system.

## 2. Related work

### 3D scene acquisition

For more freedom than typical camera-array-based acquisition [8], some studies try to free the cameras from a grid arrangement. In [9] and [10], the cameras are arbitrarily placed, but their optical axes must remain parallel. Moreover, the strategy of [10] has to bound all cameras to the same plane. These methods need a great amount of work for camera calibration. Combining sparse view capturing and virtual view synthesis [11] is a good solution to simplify the acquisition equipment. Wang *et al*. [12] use camera pose estimation so that the cameras can be distributed on any curved surface. Their more flexible strategy requires no calibration even if the camera arrangement changes. For lower hardware costs, [13,14], [15], and [16] separately propose axially, laterally, and obliquely distributed acquisition with a single camera, so only one calibration is sufficient. All the above methods have difficulties in acquisition around large scenes. The Kinect sensor was first used for integral imaging in [17]. Such depth cameras can move freely and directly output the scene geometry in real time. Zhang *et al*. [18] utilize the KinectFusion technique for higher accuracy and completeness of the model. However, depth cameras are only suitable for close scenes owing to their short sensing distances. The methods based on (micro-)lens arrays have a similar drawback.

### Scene reconstruction from video

Early approaches to structure-from-motion (SfM) [19] and simultaneous localization and mapping (SLAM) [20] generate sparse detectable 3D points from monocular video, conveying little semantics of the scene. Advanced SLAM algorithms obtain dense surfaces by replacing [21] or combining [22] point features with planes. Some video-oriented multi-view stereo (MVS) methods [23,24] realize fast or interactive reconstruction, but they are only usable for object modeling, yield low coverage for homogeneous surfaces, or rely on optical flow. To resolve the matching ambiguities caused by lack of textures, some MVS works employ energy minimization constrained by a smoothness term [24], a domain transform filter [25], or a multi-resolution scheme [26], but might still produce depth discontinuities in large regions. Especially when occlusion occurs, the results of these techniques are not completely visibility-consistent. Consistent models can be obtained by simply removing the estimates which are inconsistent among small-baseline neighboring views or violate the occlusion consistency across large-baseline non-neighbors [24]. Bundle optimization [27], tetrahedra carving [28], cross-view filtering [25,26], and edge-based inter-view invalidation [6,29] have also been adopted.

### Virtual view synthesis

Computer generated integral imaging (CGII) and image-based rendering (IBR) are two mainstream methods of virtual view synthesis. CGII uses a known 3D scene model and renders novel views through a virtual lens array. It includes parallel group rendering [30], viewpoint vector rendering [31], viewpoint-independent rendering [32], and multi-viewpoint rendering [33]. Xing *et al*. [34] accelerate the CGII process via backward ray-tracing. Differently, IBR synthesizes views from a set of images. It can be loosely classified into three categories according to how much geometry is used [35]. Most light field rendering methods [36] fall into the first category; the number of their input images is large and the viewpoints are strictly constrained to a grid. The second category relies on implicit geometry representations, *e.g*. point correspondences [37] and optical flow [38]. The methods requiring dense and explicit geometry estimates obtained from a few unordered images [39] belong to the third category. One main drawback is that depth errors normally lead to tearing and ghosting in the virtual view, especially in occluded and textureless regions. These artifacts can be avoided with a soft geometry model [40] or a Bayesian formulation [39] that robustly respects the depth uncertainty. The majority of the latter two categories only apply to wide-baseline images. Very few IBR approaches have concentrated on handling monocular videos. Some works [41] depend on small-baseline image warping without serious occlusions. Chaurasia *et al*. synthesize distinct views using silhouette-aware [42] and local shape-preserving [43] warps to address missing or unreliable geometry. In contrast, our method is much simpler and can be implemented more efficiently, while yielding comparable rendering quality.

## 3. The proposed approach

The pipeline of the proposed SAII method is shown in Fig. 2. In this paper, object boundaries and textured regions are collectively called edges. The edges are typically aligned with color discontinuities, and are thus easy to detect and to reconstruct from images by imposing the photometric constraint. Considering this, our approach aims to output a set of slice images for a virtual camera array, using the frames of an unstructured video and their edge depth maps.

In the pre-processing stage, the SfM algorithm of Resch *et al*. [19] is first executed to compute the camera extrinsics of all frames, the 3D positions of features, and their associated visibilities. According to this information, we select a dense subset of views from the small-baseline frames to ensure sufficient triangulation angles (letting each pair of neighboring view samples have a viewing angle of about 1°). For convenience, hereafter the mentioned frames and real views all correspond to these view samples. For each frame, the edge pixels are extracted if the magnitude of the color gradient calculated with a 3 × 3 Sobel operator is larger than 2.5. We use the scheme of [6] to estimate edge depths because it considers individual visual rays to achieve high precision and robustness under unconstrained camera movements.
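As an illustrative sketch (not the paper's exact selection rule), the roughly 1° neighboring-view criterion can be approximated greedily from the SfM camera centers; the function name `select_views` and the use of a single reference scene point are our assumptions:

```python
import numpy as np

def select_views(centers, scene_point, angle_deg=1.0):
    """Greedy frame subsampling sketch: keep a frame when its viewing
    direction toward a reference scene point differs from that of the
    last kept frame by at least ~angle_deg degrees."""
    kept = [0]
    cos_thr = np.cos(np.radians(angle_deg))
    for i in range(1, len(centers)):
        a = scene_point - centers[kept[-1]]   # direction from last kept view
        b = scene_point - centers[i]          # direction from candidate view
        cosang = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
        if cosang <= cos_thr:                 # angle exceeds the threshold
            kept.append(i)
    return kept
```

For a camera sliding 1 m past a point 10 m away, this keeps a view roughly every 17 cm, thinning out the redundant small-baseline frames.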

Dense geometry recovery and virtual view rendering are the next two stages of our pipeline, and also the most important. We describe them in detail below.

### 3.1. Dense geometry recovery

This stage calculates the corresponding dense depth map for each edge depth map obtained in the pre-processing stage. Most 3D reconstruction methods struggle to recover large areas of homogeneous surfaces, particularly in the presence of heavy occlusions. We do this by first generating two complete depth maps for each real view, each respecting a different local cue, and then merging them in an MRF global optimization process.

#### 3.1.1. Smoothest-surface depth map

Noting that planar geometry is very common in the real world, we use intra-view diffusion of edge depths to fit the smoothest possible surface for the textureless areas in each real view, even if the actual geometry is curved. We name the interpolation result the *smoothest-surface depth map*.

Figure 3 illustrates the problem of the classic isotropic diffusion scheme on depth data, and the idea of perspective diffusion developed in our paper. The target is to interpolate a depth value for the mid-pixel *p* = (*x*, *y*) of two pixels *p*_{1} = (*x* − *w*, *y*) and *p*_{2} = (*x* + *w*, *y*) with depths *d*_{1} and *d*_{2}, respectively. By using isotropic diffusion, *p* would be assigned the average depth ${d}_{\text{w}}=\frac{1}{2}({d}_{1}+{d}_{2})$. However, due to perspective distortion, the corresponding 3D interpolant *P*_{w} lies behind the point *P*_{r}, which is on a flat surface connecting the 3D points of *p*_{1} and *p*_{2}, *i.e. P*_{1} and *P*_{2}. This is because the 2D pixel grid is not exactly projected to a uniform grid on the slanted surface. One solution to this problem is to adopt an anisotropic strategy, *e.g*. by defining a weighting function. We instead employ a simpler, perspective diffusion method based on the visual ray. Note in Fig. 3 that, since *P*_{r} is the intersection between the visual ray ${\overrightarrow{\mathit{OP}}}_{r}$ and the 3D line $\overline{{P}_{1}{P}_{2}}$, both of which can be calculated, *p* can obtain its depth *d*_{r} by back-projecting *P*_{r} onto the image plane. Therefore, we first formulate our diffusion strategy to account for scattered initial depths, and then incorporate the interpolants separately obtained in the horizontal and vertical directions into a 2D diffusion task.
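The ray–segment intersection that yields *d*_{r} can be sketched with a pinhole camera model as follows; the helper names `unproject` and `perspective_depth` and the least-squares solve are illustrative choices, not the paper's implementation:

```python
import numpy as np

def unproject(p, d, K_inv):
    """3D point of pixel p = (x, y) at depth d under a pinhole model."""
    return d * (K_inv @ np.array([p[0], p[1], 1.0]))

def perspective_depth(p, p1, d1, p2, d2, K):
    """Depth d_r at p obtained by intersecting p's visual ray with the
    3D segment joining the points of p1 and p2 (perspective diffusion)."""
    K_inv = np.linalg.inv(K)
    P1 = unproject(p1, d1, K_inv)
    P2 = unproject(p2, d2, K_inv)
    r = K_inv @ np.array([p[0], p[1], 1.0])  # direction of p's visual ray
    # Solve t*r + s*(P1 - P2) = P1, i.e. the ray meets the segment at
    # (1 - s)*P1 + s*P2; least squares absorbs tiny numerical residue.
    A = np.stack([r, P1 - P2], axis=1)
    t, s = np.linalg.lstsq(A, P1, rcond=None)[0]
    return (t * r)[2]                        # z-component is the depth d_r
```

For two pixels 100 px left and right of the principal point with depths 2 and 4, the flat-surface depth at the mid-pixel is 8/3 ≈ 2.67, in front of the isotropic average 3, matching the geometry of Fig. 3.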

The diffusion process performs iterative convolution with a 4-point stencil, and only on the pixels without an edge depth, so as to preserve depth discontinuities. Let the diffused result of the *i*th edge depth map ${D}_{\text{edge}}^{i}$ in the *t*th iteration be ${D}_{t}^{i}$, with ${D}_{0}^{i}\leftarrow {D}_{\text{edge}}^{i}$ for initialization. In order to assign *p* the only available depth if one of *p*_{1} and *p*_{2} has no estimate, we include the indicator function *δ*(·) (evaluated to 1 for a known depth and 0 otherwise) to denote the horizontally diffused depth as

$$d_{x}=\begin{cases}d_{r}, & \delta(d_{1})\,\delta(d_{2})=1,\\ \delta(d_{1})\,d_{1}+\delta(d_{2})\,d_{2}, & \text{otherwise.}\end{cases}\tag{1}$$

*d*_{y} is calculated similarly using *p*_{3} = (*x*, *y* − *w*) and *p*_{4} = (*x*, *y* + *w*). We interpolate these two depths for each non-edge pixel *p*, using *δ*^{Σ} = *δ*(*d*_{x}) + *δ*(*d*_{y}) for normalization:

$$D_{t}^{i}(p)=\frac{\delta(d_{x})\,d_{x}+\delta(d_{y})\,d_{y}}{\delta^{\Sigma}}.\tag{2}$$

If no available depth is found for diffusion, *i.e*. *δ*^{Σ} = 0, the result at *p* remains unchanged temporarily. The smoothest-surface depth map ${D}_{\text{smoothest}}^{i}$ is determined as soon as the sum of per-pixel depth differences between two consecutive iterations is smaller than 0.001. A small stencil size *w* might thereby slow down the depth transportation, so we use the stencil-shrinking solver [44], initializing *w* for each interpolated pixel with the Euclidean distance to the nearest edge pixel.
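A minimal sketch of the iterative masked diffusion could look like the following; for brevity it averages neighbor depths isotropically with a fixed stencil of *w* = 1 instead of the ray-based perspective update and shrinking stencil described above, and `np.roll` wraps at image borders, which is acceptable only for this illustration:

```python
import numpy as np

def diffuse_depths(edge_depth, max_iter=500, tol=1e-3):
    """Fill non-edge pixels by iterative 4-neighbor averaging of the
    available depths; known (edge) estimates stay fixed, and pixels with
    no available neighbor remain unchanged for the current iteration."""
    D = edge_depth.astype(float).copy()   # 0 marks "no estimate"
    fixed = edge_depth > 0                # edge pixels are never overwritten
    for _ in range(max_iter):
        known = (D > 0).astype(float)
        nsum = np.zeros_like(D)           # sum of available neighbor depths
        ncnt = np.zeros_like(D)           # count of available neighbors
        for shift, axis in ((1, 0), (-1, 0), (1, 1), (-1, 1)):
            nsum += np.roll(D * known, shift, axis=axis)
            ncnt += np.roll(known, shift, axis=axis)
        new = np.where(ncnt > 0, nsum / np.maximum(ncnt, 1), D)
        new[fixed] = D[fixed]
        # convergence: total depth change between consecutive iterations
        if np.abs(new - D).sum() < tol:
            D = new
            break
        D = new
    return D
```

Two fixed depth columns of 2 and 4 diffuse into a linear ramp in between, which is the discrete analogue of the smoothest surface bounded by edges.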

Figure 4 shows an example of ${D}_{\text{edge}}^{i}$ and our obtained ${D}_{\text{smoothest}}^{i}$. It also compares the resulting 3D points of the isotropic and the proposed perspective diffusion approaches. The distortion from isotropic diffusion produces uneven surfaces on the floor and an oscillation on the large-area homogeneous ceiling (Fig. 4(d)). In contrast, ours generates flat surfaces in both cases (Fig. 4(e)). See Figs. 5(b) and 5(c) for another example.

${D}_{\text{edge}}^{i}$ probably contains outliers due to insufficient contrast, aliasing artifacts, or reflections in the original scene, resulting in inaccurate estimates in ${D}_{\text{smoothest}}^{i}$. For the later global optimization stage (Sec. 3.1.3), we calculate a confidence map for ${D}_{\text{smoothest}}^{i}$, in which *s*(*p*) denotes the depth score of ${D}_{\text{smoothest}}^{i}(p)$, while *s*_{left}(*p*) and *s*_{right}(*p*) are the depth scores computed by shifting the projections of *p* in all secondary images to the left and right along the epipolar lines by one pixel, respectively. See [6] for details of the secondary image selection and the definition of the depth score. In principle, *s*(*p*) should represent the highest local maximum of all depth scores at *p*. Accordingly, Eq. (3) uses the Gaussian function of the difference between *s*(*p*) and *s*_{left}(*p*) or *s*_{right}(*p*) to assign *p* a higher confidence if *s*(*p*) is markedly higher. Here, *σ*_{s} = 0.3 is used; a larger value makes the confidence map smoother, which would induce more blurriness in the finally obtained depth map.

#### 3.1.2. Densest-surface depth map

As shown in Figs. 5(c) and 5(d), intra-view depth diffusion possibly creates wrong surfaces in front of the real geometry, between fore- and background edges. These surfaces are generally heavily slanted with respect to the currently processed view, and thus contain low-density 3D points. Based on this observation, we remove the interpolation errors by propagating the depths which have the highest surface density across all {${D}_{\text{smoothest}}^{i}$}. In this paper, the propagation result is named the *densest-surface depth map*, and denoted by ${D}_{\text{densest}}^{i}$ for the *i*th frame.

To measure the density of the per-view reconstructed points, we calculate a scale map *S*^{i} for each ${D}_{\text{smoothest}}^{i}$, where *S*^{i}(*p*) is defined as the maximum distance between the 3D points produced from *p* and its 4-neighbors. A smaller scale value means that a locally denser surface is recovered. Let $\underset{j\to i}{\mathbf{P}}({D}^{j}(q))$ and $\underset{j\to i}{\mathbf{T}}({D}^{j}(q))$ represent the pixel and depth obtained by projecting the pixel *q* from the *j*th depth map *D*^{j} to the *i*th real view, respectively. Then the calculation of ${D}_{\text{densest}}^{i}$ can be formulated as in Eq. (4).

Figures 5(e) and 5(f) present an example of our generated ${D}_{\text{densest}}^{i}$ and its reconstructed 3D points. It can be seen that the wrong depth interpolants of ${D}_{\text{smoothest}}^{i}$ in Fig. 5(c) are substantially corrected in ${D}_{\text{densest}}^{i}$, producing sharp depth discontinuities at object boundaries, and the corresponding surfaces connecting fore- and background edges in Fig. 5(d) are eliminated from the scene geometry. However, the above density-based cross-view propagation leads to layered surfaces in some homogeneous areas, where the geometry is expected to be flat. This problem will be addressed in Sec. 3.1.3.
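The scale map that measures per-view surface density can be sketched as below, assuming a pinhole camera with intrinsics `K`; the wraparound border handling of `np.roll` is a simplification for this illustration:

```python
import numpy as np

def scale_map(depth, K):
    """S(p): maximum 3D distance between the point of pixel p and the
    points of its 4-neighbors. Large values indicate heavily slanted,
    low-density surfaces; small values indicate dense local geometry."""
    K_inv = np.linalg.inv(K)
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]
    homog = np.stack([u, v, np.ones((h, w))]).astype(float)
    rays = np.einsum('ij,jhw->ihw', K_inv, homog)   # per-pixel ray directions
    pts = depth[None] * rays                        # per-pixel 3D points
    S = np.zeros((h, w))
    for shift, axis in ((1, 1), (-1, 1), (1, 2), (-1, 2)):
        d = np.linalg.norm(pts - np.roll(pts, shift, axis=axis), axis=0)
        S = np.maximum(S, d)                        # keep the largest gap
    return S
```

On a fronto-parallel plane at depth *z*, neighboring points are spaced *z*/*f* apart, so the scale map is constant; slanted interpolants produce much larger values and are filtered out by the density criterion.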

For the following global optimization stage (Sec. 3.1.3), we also calculate a confidence map for ${D}_{\text{densest}}^{i}$, by using the confidences of the depths that comprise ${D}_{\text{densest}}^{i}$. The Gaussian function *c*_{color} (with standard deviation *σ*_{c}) of the color difference between *p* and the pixel *q*^{*} projected to *p* is incorporated as a penalty factor, for the cases where *p* and *q*^{*} belong to different surfaces. We set *σ*_{c} = 8 for synthetic scenes and 15 for real-world scenes; a larger value can enhance the noise-robustness of the depth estimation. In our implementation of Eq. (4), we improve the accuracy of ${D}_{\text{densest}}^{i}$ by only selecting the pixels *q*^{*} satisfying *c*_{color}(*p*, *q*^{*}) ≥ 0.5.

#### 3.1.3. MRF global optimization

Since ${D}_{\text{smoothest}}^{i}$ and ${D}_{\text{densest}}^{i}$ are both estimated relying on local cues, and their characteristics are mutually complementary, we merge them, considering the image structure, to resolve the depth ambiguities. To this end, we adopt the method of [7] to globally minimize the MRFs defined by Eq. (5), in which *Z*_{source} and *W*_{source} denote the initial data term given by Eq. (6). In Eq. (6), *λ*_{source} is used to adjust the proportion of the two input depth maps, *λ*_{flat} controls the Laplacian constraint to enforce the flatness of ${D}_{\text{optimal}}^{i}$, and *λ*_{smooth} controls the second-derivative kernel to guarantee that ${D}_{\text{optimal}}^{i}$ is overall smooth. To implement this MRF formulation, we set *λ*_{source} = 1 and *λ*_{flat} = *λ*_{smooth} = 2, and use the solving approach proposed in [7]. Afterward, we further refine ${D}_{\text{optimal}}^{i}$ by removing the depths whose recomputed scale values are larger than 0.001. This post-processing step eliminates the small number of remaining wrong interpolants, as well as the outliers induced by the rank-deficiency problem during the optimization.
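The merging step can be illustrated by a small quadratic sketch in the spirit of [7]: a data term weighted by the two confidence maps plus a Laplacian flatness penalty, solved as one dense linear system. This is only practical for tiny images and omits the separate second-derivative smoothness term; the actual MRF solver of [7] differs:

```python
import numpy as np

def merge_depths(Z1, W1, Z2, W2, lam_flat=2.0, lam_source=1.0):
    """Minimize sum_k lam_source*W_k*(Z - Z_k)^2 + lam_flat*|Laplacian(Z)|^2
    over the merged depth map Z (dense least-squares sketch)."""
    h, w = Z1.shape
    n = h * w
    idx = np.arange(n).reshape(h, w)
    # Graph Laplacian over the 4-connected pixel grid
    L = np.zeros((n, n))
    for y in range(h):
        for x in range(w):
            i = idx[y, x]
            for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                yy, xx = y + dy, x + dx
                if 0 <= yy < h and 0 <= xx < w:
                    L[i, i] += 1.0
                    L[i, idx[yy, xx]] -= 1.0
    Wd = lam_source * (W1 + W2).ravel()               # data-term weights
    A = np.diag(Wd) + lam_flat * (L.T @ L)            # normal equations
    b = lam_source * (W1 * Z1 + W2 * Z2).ravel()
    return np.linalg.solve(A, b).reshape(h, w)
```

A pixel where one source has zero confidence simply inherits the other source's depth (regularized by flatness), which is the behavior the confidence maps of Secs. 3.1.1 and 3.1.2 are designed to produce.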

The merging results of Figs. 5(c) and 5(e), without and with refinement, are shown in Figs. 5(g) and 5(h). It is obvious that, after global optimization, the incorrect connections between objects on distinct depth planes in ${D}_{\text{smoothest}}^{i}$ are almost cleaned up, while the layered geometry of homogeneous surfaces in ${D}_{\text{densest}}^{i}$ becomes smooth again. The same conclusion can be drawn by comparing the point clouds shown in Figs. 5(d), 5(f), and 5(i).

### 3.2. Virtual view rendering

With the dense and accurate depth maps {${D}_{\text{optimal}}^{i}$} of the individual real views, we can efficiently synthesize a perspective image for any virtual view. Assume that a virtual camera array has been built in the world coordinate system of our reconstructed 3D model, *i.e*., the intrinsic and extrinsic (rotation and translation) parameters of each virtual camera, as well as the distance between neighboring virtual cameras, are all known. We render the perspective image *I*^{v} of the *v*th virtual camera by back-projecting the 3D-point estimates of all {${D}_{\text{optimal}}^{i}$} to each pixel *p* in this view, and then picking the color of the point projected to *p* which has the least distant depth. This simple view synthesis process can be described by Eq. (7), in which the color is picked from the *j*^{*}th frame sample. See Fig. 9 for our rendered results.
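The closest-surface back-projection can be sketched as z-buffer splatting; the point-cloud input and the simple nearest-pixel rounding are simplifications of the per-pixel formulation above:

```python
import numpy as np

def render_view(points, colors, K, R, t, h, w):
    """Project 3D points into the virtual camera (K, R, t) and keep, per
    pixel, the color of the point with the smallest depth (z-buffering)."""
    cam = (R @ points.T).T + t              # world -> camera coordinates
    z = cam[:, 2]
    valid = z > 1e-6                        # discard points behind the camera
    proj = (K @ cam[valid].T).T
    u = np.round(proj[:, 0] / proj[:, 2]).astype(int)
    v = np.round(proj[:, 1] / proj[:, 2]).astype(int)
    cols, zs = colors[valid], z[valid]
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    zbuf = np.full((h, w), np.inf)          # depth buffer
    img = np.zeros((h, w, 3))
    for ui, vi, zi, ci in zip(u[inside], v[inside], zs[inside], cols[inside]):
        if zi < zbuf[vi, ui]:               # keep the closest surface
            zbuf[vi, ui] = zi
            img[vi, ui] = ci
    return img, zbuf
```

When two points from different real views land on the same virtual pixel, the nearer one wins, which is exactly the least-distant-depth rule used in our rendering.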

## 4. Results

### 4.1. Datasets and compared methods

Three monocular video datasets with large homogeneous surfaces and unconstrained camera trajectories (Fig. 6) have been used to test our SAII system. *Bathroom* is a synthetic scene with ground-truth depth maps created in the Blender software. *Building* and *Boxes* are real-world scenes without known geometry. Building was captured by a camera mounted on a flying drone, and Boxes by a hand-held camera. For each dataset, a subset of 100 frames was pre-selected as the real views, and all acquired images have a resolution of 1920 × 1080.

Since dense geometry recovery is the key stage of our proposal, we compare our results with three existing depth map estimation schemes: [45] (*BAI*), [5] (*KIM*), and [6] (*WEI*). BAI is a classic MVS approach for sparsely sampled views, which uses the PatchMatch technique on a few secondary images, region growing, and post-checking to remove the outliers inconsistent across views. The light field reconstruction method KIM performs ray-oriented depth sweeping at edges (with 1024 hypotheses) and recovers textureless surfaces by detecting edges in gradually downscaled images. WEI deals with video input and diffuses edge depths like ours, but invalidates wrong interpolants via inter-view checking. Experiments were run on multithreaded CPUs using OpenMP, and on NVIDIA GeForce GTX 680 (KIM) and 1080 Ti (others) GPUs.

### 4.2. Evaluation

Figure 7 uses the ground truth for depth map comparisons between ours and the other techniques, and also shows the completeness curves for varying relative error thresholds. Figures 7(a) and 7(b) demonstrate that the homogeneous surfaces reconstructed by BAI and KIM both suffer from discontinuities, where WEI and ours perform better. However, according to Fig. 7(c), the depth estimates of WEI are the most heavily distributed over large relative errors (> 0.002), and the number of its high-accuracy estimates is smaller than that of our algorithm. We computed the mean relative errors of these four approaches, obtaining 0.0053, 0.0075, 0.0043, and 0.0040, respectively. Therefore, ours yields the highest overall depth accuracy. Although the refinement step of the proposed method produces several holes (accounting for only about 1.3% of the image resolution) in the depth map, our experiments confirm that they have negligible influence on the perspective image synthesis stage (see Fig. 9). Figure 8 shows our estimated depth maps for the real-world scenes.

To evaluate the effect of global depth optimization (Sec. 3.1.3) in our pipeline, Fig. 9 compares the synthesized images of an arbitrarily selected real view for peak signal-to-noise ratio (PSNR) calculation, using the smoothest-surface (Fig. 9(a)), densest-surface (Fig. 9(b)), and optimal (Fig. 9(c)) depth maps. Incorrect connections between objects can be found in Fig. 9(a), and heavy noise degrades the image quality in Fig. 9(b). Comparatively, Fig. 9(c) achieves PSNR improvements of 1.67 dB and 1.09 dB over the former two for Building (see the enlarged railings and stripes). For Boxes, the rendering quality is further enhanced by up to 4.08 dB and 1.81 dB, respectively (see the enlarged letters).
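For reference, the PSNR values above follow the standard definition; a minimal implementation is:

```python
import numpy as np

def psnr(img, ref, peak=255.0):
    """Peak signal-to-noise ratio (dB) between a synthesized view and a
    reference image; higher is better, identical images give infinity."""
    mse = np.mean((img.astype(float) - ref.astype(float)) ** 2)
    return float('inf') if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)
```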

Figure 10 shows the perspective images synthesized by our method for the virtual camera arrays. For all tested scenes, we used the 15 × 15 virtual camera array illustrated in Fig. 6. In the enlarged images, the geometry and textures of each scene can be clearly recognized. Figure 11 presents three of the slice images for each scene, which were generated from Fig. 10. By observing the regions focused on different depth planes, we can easily identify the objects and textures in the scenes. In particular, the toothpaste in Bathroom, the railings in Building, and the numbers and colored pens in Boxes are all recognizable.
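Slice image generation by shifting and accumulating the virtual perspective images (as mentioned in Sec. 1) can be sketched as follows; the disparity formula assumes a pinhole model with focal length `f` and camera spacing `pitch`, which are illustrative parameters, and `np.roll` wraps at image borders:

```python
import numpy as np

def slice_image(persp, pitch, f, depth):
    """Shift-and-sum reconstruction of one slice image: each perspective
    image of a (rows x cols) virtual camera array is shifted by the
    disparity of the focused depth plane, then all are averaged, so that
    objects at 'depth' align (sharp) and others blur out."""
    rows, cols, h, w, _ = persp.shape
    out = np.zeros((h, w, 3))
    for r in range(rows):
        for c in range(cols):
            # disparity of the plane at 'depth' for camera (r, c),
            # measured from the center of the array
            dx = int(round(f * pitch * (c - (cols - 1) / 2) / depth))
            dy = int(round(f * pitch * (r - (rows - 1) / 2) / depth))
            out += np.roll(persp[r, c], (dy, dx), axis=(0, 1))
    return out / (rows * cols)
```

Sweeping `depth` over a range of values produces the stack of slice images shown in Fig. 11, with each slice focusing on a different depth plane.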

On average, our SAII system took approximately 23 minutes for pre-processing, 1.6 minutes for calculating the smoothest-surface and densest-surface depth maps (for 100 real views), 49 minutes for global optimization, and 1.2 minutes for synthesizing new perspective images (for 125 virtual views), *i.e*., 74.8 minutes in total for each scene. Note that the global optimization required around 29.4 seconds per view because this step was implemented using the public Matlab code of [7]; converting it to GPU processing is expected to significantly accelerate our whole pipeline. We also implemented the strategy of directly applying the video frames to produce one slice image. There was little difference in either quality or speed between its results and ours. However, this method was about 1.4 minutes slower when we shifted the depth of the slice image, as explained in Sec. 1.

## 5. Conclusions

SAII based on unstructured videos provides more freedom in scene information acquisition than other existing SAII techniques. This paper has presented a new algorithm for computational 3D scene reconstruction and imaging in such a system. Considering that edges already capture the primary characteristic structures of many scenes and can be precisely reconstructed by exploiting the abundant input data, we proposed to recover dense scene geometry and render virtual views using the edge depth maps of densely sampled frames. For each frame sample, two complete depth maps are calculated from the sparse depth initializers, separately relying on the smoothness and density of local 3D surfaces, and are afterward globally merged. This strategy makes our method more robust to occlusions and texture absence in the real views. The accurate depth estimates allow efficient and high-quality synthesis of new perspective images, which finally generate clearly distinguishable slice images.

### Discussion and future work

The reported approach is particularly suitable for large-scale indoor and urban environments which are mainly composed of prominent edges and planar homogeneous surfaces, even if the surfaces are occluded in some captured images. It should also be noted that, since our method fits the smoothest possible surfaces from edges, it can handle arbitrary scene structures. Our proposal is limited to scenes that only exhibit Lambertian reflection and are captured under ideal conditions; *e.g.*, non-Lambertian properties, varying illumination, or rolling shutter would lead to erroneous depth values and 3D images. However, these are common problems that most 3D reconstruction and view synthesis techniques still struggle to overcome. In such cases, it may be helpful to respect depth uncertainty during virtual view rendering for stronger tolerance to unreliable scene geometry; this extension is an interesting direction for future research. Moreover, due to the limitations of our experimental conditions, we can currently provide only computational results. Our next work will be to verify the 3D image quality achieved by the proposed method on an optical SAII platform.

## Funding

National Key Research and Development Program of China (2017YFB0404800); National Natural Science Foundation of China (NSFC) (61631009); Fundamental Research Funds for the Central Universities (2017TD-19).

## References

**1. **G. Lippmann, “La photographie integrale,” C. R. Acad. Sci. **146**, 446–451 (1908).

**2. **J. S. Jang and B. Javidi, “Three-dimensional synthetic aperture integral imaging,” Opt. Lett. **27**, 1144–1146 (2002). [CrossRef]

**3. **X. Li, M. Zhao, Y. Xing, H. L. Zhang, L. Li, S. T. Kim, X. Zhou, and Q. H. Wang, “Designing optical 3d images encryption and reconstruction using monospectral synthetic aperture integral imaging,” Opt. Express **26**, 11084–11099 (2018). [CrossRef] [PubMed]

**4. **X. Li, Y. Wang, Q. H. Wang, Y. Liu, and X. Zhou, “Modified integral imaging reconstruction and encryption using an improved sr reconstruction algorithm,” Opt. Lasers Eng. **112**, 162–169 (2019). [CrossRef]

**5. **C. Kim, H. Zimmer, Y. Pritch, A. Sorkine-Hornung, and M. Gross, “Scene reconstruction from high spatio-angular resolution light fields,” ACM Trans. Graph. **32**, 73 (2013). [CrossRef]

**6. **J. Wei, B. Resch, and H. P. A. Lensch, “Dense and occlusion-robust multi-view stereo for unstructured videos,” in Conf. Comput. and Robot Vis., (2016).

**7. **M. W. Tao, S. Hadap, J. Malik, and R. Ramamoorthi, “Depth from combining defocus and correspondence using light-field cameras,” in IEEE Intern. Conf. Comput. Vis., (2013).

**8. **N. Sabater, G. Boisson, B. Vandame, P. Kerbiriou, F. Babon, M. Hog, R. Gendrot, T. Langlois, O. Bureller, A. Schubert, and V. Allié, “Dataset and pipeline for multi-view light-field video,” in IEEE Conf. Comput. Vis. Pattern Recognit. Workshops, (2017).

**9. **M. DaneshPanah, B. Javidi, and E. A. Watson, “Three dimensional imaging with randomly distributed sensors,” Opt. Express **16**, 6368–6377 (2008). [CrossRef] [PubMed]

**10. **X. Xiao, M. DaneshPanah, M. Cho, and B. Javidi, “3d integral imaging using sparse sensors with unknown positions,” J. Disp. Technol. **6**, 614–619 (2010). [CrossRef]

**11. **D. C. Schedl, C. Birklbauer, and O. Bimber, “Optimized sampling for view interpolation in light fields using local dictionaries,” Comput. Vis. Image Und. **168**, 93–103 (2018). [CrossRef]

**12. **J. Wang, X. Xiao, and B. Javidi, “Three-dimensional integral imaging with flexible sensing,” Opt. Lett. **39**, 6855–6858 (2014). [CrossRef] [PubMed]

**13. **R. Schulein, M. DaneshPanah, and B. Javidi, “3D imaging with axially distributed sensing,” Opt. Lett. **34**, 2012–2014 (2009). [CrossRef] [PubMed]

**14. **D. Shin and M. Cho, “3D integral imaging display using axially recorded multiple images,” J. Opt. Soc. Korea **17**, 410–414 (2013). [CrossRef]

**15. **M. Guo, Y. Si, Y. Lyu, S. Wang, and F. Jin, “Elemental image array generation based on discrete viewpoint pickup and window interception in integral imaging,” Appl. Opt. **54**, 876–884 (2015). [CrossRef] [PubMed]

**16. **Y. Piao, H. Qu, M. Zhang, and M. Cho, “Three-dimensional integral imaging display system via off-axially distributed image sensing,” Opt. Lasers Eng. **85**, 18–23 (2016). [CrossRef]

**17. **S. Hong, A. Dorado, G. Saavedra, J. C. Barreiro, and M. Martinez-Corral, “Three-dimensional integral-imaging display from calibrated and depth-hole filtered Kinect information,” J. Disp. Technol. **12**, 1301–1308 (2016). [CrossRef]

**18. **J. Zhang, X. Wang, Q. Zhang, Y. Chen, J. Du, and Y. Liu, “Integral imaging display for natural scene based on KinectFusion,” Optik-Int. J. Light. Electron Opt. **127**, 791–794 (2016). [CrossRef]

**19. **B. Resch, H. P. A. Lensch, O. Wang, M. Pollefeys, and A. Sorkine-Hornung, “Scalable structure from motion for densely sampled videos,” in IEEE Conf. Comput. Vis. Pattern Recognit., (2015).

**20. **J. Engel, T. Schops, and D. Cremers, “LSD-SLAM: large-scale direct monocular SLAM,” in European Conf. Comput. Vis., (2014).

**21. **A. P. Gee, D. Chekhlov, W. Mayol-Cuevas, and A. Calway, “Discovering planes and collapsing the state space in visual SLAM,” in British Machine Vis. Conf., (2007).

**22. **A. Kundu, Y. Li, F. Dellaert, F. Li, and J. M. Rehg, “Joint semantic segmentation and 3D reconstruction from monocular video,” in European Conf. Comput. Vis., (2014).

**23. **R. A. Newcombe and A. J. Davison, “Live dense reconstruction with a single moving camera,” in IEEE Conf. Comput. Vis. Pattern Recognit., (2010).

**24. **Z. Kang and G. Medioni, “Progressive 3D model acquisition with a commodity hand-held camera,” in IEEE Winter Conf. Appl. Comput. Vis., (2015).

**25. **R. Rzeszutek and D. Androutsos, “A framework for estimating relative depth in video,” Comput. Vis. Image Und. **133**, 15–29 (2015). [CrossRef]

**26. **J. Wei, B. Resch, and H. P. A. Lensch, “Multi-view depth map estimation with cross-view consistency,” in British Machine Vis. Conf., (2014).

**27. **G. Zhang, J. Jia, T. T. Wong, and H. Bao, “Consistent depth maps recovery from a video sequence,” IEEE Trans. Pattern Anal. Mach. Intell. **31**, 974–988 (2009). [CrossRef] [PubMed]

**28. **C. Hoppe, M. Klopschitz, M. Donoser, and H. Bischof, “Incremental surface extraction from sparse structure-from-motion point clouds,” in British Machine Vis. Conf., (2013).

**29. **J. Wei, B. Resch, and H. P. A. Lensch, “Dense and scalable reconstruction from unstructured videos with occlusions,” in Intern. Symp. on Vis., Modeling and Visual., (2017).

**30. **S. W. Min, J. Kim, and B. Lee, “New characteristic equation of three-dimensional integral imaging system and its applications,” Jpn. J. Appl. Phys. Lett. **44**, L71–L74 (2005). [CrossRef]

**31. **K. S. Park, S. W. Min, and Y. Cho, “Viewpoint vector rendering for efficient elemental image generation,” IEICE Trans. Inf. Syst. **E90-D**, 233–241 (2007). [CrossRef]

**32. **K. Yanaka, “Integral photography using hexagonal fly’s eye lens and fractional view,” Proc. SPIE **6803**, 68031K (2008).

**33. **M. Halle, “Multiple viewpoint rendering,” in Proc. Comput. Graph. Interactive Technol., (1998).

**34. **S. Xing, X. Sang, X. Yu, C. Duo, B. Pang, X. Gao, S. Yang, Y. Guan, B. Yan, J. Yuan, and K. Wang, “High-efficient computer-generated integral imaging based on the backward ray-tracing technique and optical reconstruction,” Opt. Express **25**, 330–338 (2017). [CrossRef] [PubMed]

**35. **H. Y. Shum, S. C. Chan, and S. B. Kang, *Image-Based Rendering* (Springer-Verlag, 2006).

**36. **G. Wu, M. Zhao, L. Wang, Q. Dai, T. Chai, and Y. Liu, “Light field reconstruction using deep convolutional network on EPI,” in IEEE Conf. Comput. Vis. Pattern Recognit., (2017).

**37. **S. M. Seitz and C. R. Dyer, “View morphing,” in Proc. Comput. Graph. Interactive Technol., (1996).

**38. **S. Vedula, S. Baker, and T. Kanade, “Image-based spatio-temporal modeling and view interpolation of dynamic events,” ACM Trans. Graph. **24**, 240–261 (2005). [CrossRef]

**39. **S. Pujades, F. Devernay, and B. Goldluecke, “Bayesian view synthesis and image-based rendering principles,” in IEEE Conf. Comput. Vis. Pattern Recognit., (2014).

**40. **E. Penner and L. Zhang, “Soft 3D reconstruction for view synthesis,” ACM Trans. Graph. **36**, 235 (2017). [CrossRef]

**41. **F. Liu, M. Gleicher, H. Jin, and A. Agarwala, “Content preserving warps for 3d video stabilization,” ACM Trans. Graph. **28**, 1–9 (2009).

**42. **G. Chaurasia, O. Sorkine, and G. Drettakis, “Silhouette aware warping for image based rendering,” Comput. Graph. Forum **30**, 1223–1232 (2011). [CrossRef]

**43. **G. Chaurasia, S. Duchene, O. Sorkine-Hornung, and G. Drettakis, “Depth synthesis and local warps for plausible image-based navigation,” ACM Trans. Graph. **32**, 1–12 (2013). [CrossRef]

**44. **S. Jeschke, D. Cline, and P. Wonka, “A GPU Laplacian solver for diffusion curves and Poisson image editing,” ACM Trans. Graph. **28**, 116 (2009). [CrossRef]

**45. **C. Bailer, M. Finckh, and H. Lensch, “Scale robust multi view stereo,” in European Conf. Comput. Vis., (2012).