Cascade light field disparity estimation network based on unsupervised deep learning

Bo Liu; Jing Chen; Zhen Leng; Yanfeng Tong; Yongtian Wang

doi:10.1364/OE.453020

1. Introduction

The light field cameras have attracted a lot of attention in recent years because of their superior property of recording 4D spatio-angular information of light rays(e.g. Lytro [1], Raytrix [2]). The 4D information can provide helpful depth cues, which are crucially needed in many practical application areas such as 3D reconstruction [3], image refocusing [4], view synthesis [5] and autonomous driving [6]. While promising, the performance of depth estimation is limited by the trade-off between spatial and angular resolution which makes the depth estimation still a challenging issue.

In the past several years, researchers have explored various algorithms to estimate depth (or disparity) maps from light field images using depth cues like correspondence cue [7–13], defocus cue [14,15] and linear structure [16–19], etc. These methods can be roughly divided into traditional light field depth estimation methods and learning based methods. Traditional light field depth estimation algorithms generally utilize the properties of light field geometry and data consistency measurement to obtain depth, such as epipolar plane images(EPI) [16,17], matching correspondence among sub-aperture images [7–11] and defocus information [14]. EPI-based methods compute the slopes of lines in the epipolar plane images to recover depth information. Multi-view stereo matching based method use the correspondences among sub-aperture images for initial disparity prediction and then refine the disparity map with global regularization. Defocus based methods estimate depth by detecting the sharpness and contrast of images in focal stacks. Although these traditional methods have provided promising results, further development has hit a bottleneck. In addition, the narrow baseline between adjacent sub-aperture images also limits the accuracy of those algorithms.

Recently, deep convolutional neural networks have achieved great success in many computer vision tasks such as image detection [20], classification [21] and semantic segmentation [22]. Convolutional neural network has very powerful feature learning ability, which makes it a natural choice for light field depth estimation task. Many related studies have been done. The early work can be traced back to the research of Heber et al. [23]. They proposed an end-to-end network to predict the depth of the 4D light field and then refined initial depth in a post processing step by applying a higher-order regularization. After that, Heber et al. [18] employed a U-shaped network architecture by using 3D convolutional layers to utilize both spatial and directional information. Shin et al. [19] proposed a fully convolutional network to estimate disparity map by combining characteristics of four directional epipolar geometry. Tsai et al. [12] proposed a multi view stereo matching network to estimate disparity and used an attention based view selection module to reduce data redundancy by assigning each viewpoint different weights. All of above deep learning methods require ground truth disparity as supervision signal for training, which is difficult to acquire in real world scenes. Therefore, most existing CNN based light field depth estimation methods are trained on synthetic datasets, which have to face the challenge of domain shift. In order to address this issue, Peng et al. [24] proposed an unsupervised CNN-based method which train an end-to-end network by imposing compliance and divergence constraints on warped sub-aperture images to the central view. This algorithm doesn’t need ground truth disparity as supervision signal and thus expands the application scenarios of learning based algorithm. However, this method does not take account the impact of the occlusion and textureless regions and therefore cannot capture sharp edge at depth discontinuous areas.

Fig. 1. The parametrization of light field.

Download Full Size | PDF

In addition, current CNN-based networks assume that each sub-aperture image contributes equally for inferring the depth without considering the depth uncertainty brought by the length of baselines. For Lytro camera, the baseline of sub-aperture images near the central view is comparatively narrow, about 0.1cm. Correspondingly the approximate disparity range is between −4 and 4 pixels. Such narrow baseline will inevitably lead to lower depth accuracy. Compared with sub-aperture images near the central view, those images lying on the border view have more capability to improve depth estimation accuracy due to its large baseline. Because with large baseline, the epipolar matching range are expanded, the uncertainty of depth estimation would become small.

In order to address the aforementioned problems, we propose a unsupervised learning based light field disparity estimation algorithm by taking knowledge from multi view stereo matching algorithm. Fig. 2 illustrates our network architecture. First, features of each sub-aperture image are extracted using feature extraction module and divided according to their baseline. After that, we build a cascade cost volume to estimate disparity map in a coarse to fine manner. Specifically, sub-aperture images near the central view are used to construct initial cost volume and estimate initial disparity. Then subsequent cost volume is built to refine initial disparity. Finally, we implement a combined unsupervised loss to handle depth ambiguous in occlusion and textureless regions.

Compared with the current unsupervised learning based methods, the main contributions of this paper are summarized as follows:

(1) We propose a cascade network structure for disparity prediction in a coarse to fine manner by fully exploring the geometry characteristics of sub-aperture images. It firstly takes the sub-aperture images near central view as inputs to estimate an initial disparity map, then progressively refine the initial disparity map by utilizing border views with larger baseline.
(2) We design a combined unsupervised loss which enables us to train the network in the absence of ground truth disparity and improve performance in occlusion and textureless regions by using occlusion-aware photometric loss and edge-aware smoothness loss, respectively.
(3) Extensive experimental results on both synthetic and real datasets demonstrate that our method outperforms existing unsupervised disparity estimation method and shows better generalizability compared to supervised methods.

Fig. 2. The overall architecture of our proposed algorithm. Light field sub-aperture images are divided into three groups according to their baseline. We first feed them into feature extractor and SPP module to generate multi-scale features. Next, we construct a three-stage cascade cost volume by utilizing three group features from inner to outer, where the subsequent cost volume is built based on disparity map predicted from last stage. Each disparity map is produced through cost aggregation and disparity regression module.

Download Full Size | PDF

2. Architecture overview

Given a 4D light field, as shown in Fig. 1, a light ray emitted from a 3D point in a scene that intersects two plane can be represented as a two-plane parametrization $L(u,v,s,t)$, where $(s,t)$ are called the camera plane and $(u,v)$ image plane. We can obtain sub-aperture images of different viewpoints from the 4D light field by gathering light rays with fixed camera coordinates or angular patches by gathering light with fixed image coordinates. A certain sub-aperture image can be represented as $I_i=L(u,v,s^\ast,t^\ast )$, therefore the central image of the light field is defined as $I_c=L(u,v,0,0)$. The relationship between central view image and its neighboring sub-aperture views can be written as:

(1)$$L(u,v,0,0)= L(u+s^\ast d(u,v) , v+t^\ast d(u,v),s^\ast,t^\ast)$$

where $d(u,v)$ is the disparity of pixel $(u,v)$ in the central view image. From Eq. (1) we can clearly see that the offset between the central and its adjacent views are quite different, which is related to view coordinate, the further away from central view, the more offset it is.

Motivated by this observation, we propose an end-to-end unsupervised light field disparity estimation network with a cascade structure, which takes advantage of the disparity characteristics of sub-aperture images to improve the disparity accuracy and utilizes combined loss to handle depth ambiguous in occlusion and textureless regions. The architecture of the proposed network is demonstrated in Fig. 2. First, the sub-aperture images of light field are fed into the feature extraction module separately to obtain effective feature representations. Then, we construct a cascade cost volume by warping the feature maps of other views to the central perspective. Next, the cascade cost volume is fed into cost aggregation module and disparity regression module to output a refined disparity map. Finally, the unsupervised loss is computed to train our network. In the following subsections, we elaborate on each part of our approach.

2.1 Feature extraction

To obtain effective features of the input light field image, we apply a feature extraction network which consists of four 2D residual convolutional blocks [21]. In the third and fourth residual blocks, we apply the dilation convolution to have a larger receptive field. Following [12,25], we apply a spatial pyramid pooling (SPP) module after feature extraction to incorporate hierarchical context information. As shown in Fig. 2, our SPP module first apply four adaptive average pooling layers with different size kernels for compressing the features. After that, pooling layers are followed by $1\times 1$ convolution to reduce the channel dimension of the features and upsampling to recover original spatial size. Finally, the feature maps of all scales are concatenated as the final output feature map. It is worth noting that, to reduce the number of parameters of our network, we share the weights of feature extraction modules described above. After the input images in different angular directions are passed through the above feature extraction module, the obtained features are passed to the subsequent modules for multi-view matching.

2.2 Cascade cost volume

After extracting features of sub-aperture images, multi view stereo based disparity estimation methods construct a 4D ($channel\times disparity\times height\times width$) cost volume by concatenating all shifted feature maps to predict disparity map. However, these methods commonly suppose that each view makes the same contribution to build cost volume without considering the special symmetrical geometry structure of light field sub-aperture images. Actually, light field sub-aperture images can be considered as image pairs with different baseline from the central view. Hence, estimating disparity equal to searching matching points between sub-aperture images. As shown in Fig. 3, the three cameras from left to right correspond to the central view, the view with small baseline and the view with large baseline, respectively. For a pixel in the central view, when we search for matching pixels along epipolar line of adjacent view, we can see that the cost-disparity curve of adjacent view with small baseline has only one global minimum. Its matching results are accompanied by large uncertainty therefore leading to inprecise estimation. However, for view with a large baseline, the curve has three local minima. There is a strong possibility of causing mismatching problem that will find a wrong pixel location corresponding to a local minimum. Consequently, the matching uncertainty decreases while the probability of mismatching increases. This geometrical characteristic of light field inspires us to estimate disparity in a coarse to fine manner by constructing a cascade cost volume.

Fig. 3. Illustration of multi view stereo matching. We search matching pixel along epipolar line of views. From left to right: central view, view with small baseline and view with large baseline. Matching cost profile of views with different baseline show that small baseline gives imprecise result while large baseline gives precise result, but may cause wrong matching. Here costs are computed with the absolute difference.

Download Full Size | PDF

As shown in Fig. 2, we iteratively predict and refine disparity to achieve high accuracy disparity estimation. More precisely, we first take the sub-aperture images near central view as inputs to estimate an initial coarse disparity. Then, we construct subsequent cost volume based on initial disparity to progressively refine the initial disparity prediction by utilizing views of wider baseline.

In practice, we divide the entire light field sub-aperture views into three groups(inner views $I_{inner}$ , middle views $I_{middle}$ and outer views $I_{outer}$) according to the baseline relative to the central view. Note that, in order to acquire a stable initial disparity map, we use inner 24 sub-aperture images as inner view and each group is fed into network along with the central view image. Correspondingly, our pipeline is divided into three stages. In the first stage, we use the feature maps of the inner region $F_{inner}$ to construct initial cost volume $C_1$. The features $F_{inner}$ are shifted using a series of predefined disparity levels $d_n = d_{min} + n(d_{max} - d_{min}), n \in \{0,1,\ldots,N-1\}$. where $d_{min}$ and $d_{max}$ denote the minimum and maximum disparity respectively, $N$ denotes the number of disparity level. After that, we concatenate all warped features at channel dimension to get the cost volume. Then we pass it into cost aggregation and disparity regression module described in subsection $2.3$ to predict a coarse disparity map of center view image. In this stage, we adopt a relatively large disparity interval. In second stage, we use coarse disparity prediction $d_{init}$ as initial disparity value and generate new disparity levels $d_n = d_{init} + (n- N/2)\times delta, n\in \{0,1,\ldots,N-1\}$, where $delta$ denotes disparity interval. Note that, here we take small sub-pixel disparity interval to recover more detailed prediction. Then cost volume $C_2$ is constructed by shifting the middle group feature maps $F_{middle}$ using new disparity levels. Then we pass it through the corresponding aggregation module to predict a refined disparity map. Similarly, in third stage, we further shrink the disparity interval meanwhile use disparity map from second stage and the feature maps of the outer group to build cost volume $C_3$. Through the third aggregation module, we obtain the final disparity prediction. Compared to directly concatenating all features of sub-aperture images, our cascade cost volume reduces the number of parameters while improve accuracy. Additionally, since we predict disparity maps at each stage, our method allows for light field input with flexible angular resolution.

2.3 Cost aggregation and disparity regression

After building a cost volume, we use cost aggregation module to aggregate neighboring information along disparity and spatial dimensions. In our approach, we apply eight 3D convolutional layers and use skip connection to fuse shallow and deep features. The last layer of our aggregation module reduces the number of channel dimension of the cost volume to 1-channel and the channel dimension will be squeezed to obtain a 3D ($disparity\times height\times width$) cost volume. Then we pass the cost volume through softmax operation along disparity dimension to obtain the normalized probability $P(d)$ of each disparity level. Finally, the probability volume will be passed to disparity regression module to inference disparity maps.

The traditional winner-takes-all operation is not differentiable and can only obtain discrete disparity values. To produce continuous disparity prediction $\hat {d}$, we use soft argmin operation which is introduced by [26]. Here, the estimated disparity value $\hat {d}$ is the probability weighted sum of all disparity levels $d$.

(2)$$\hat{d}=\sum_{d=Dmin}^{Dmax} d\times P(d)$$

3. Unsupervised learning

Previous learning based light field disparity estimation algorithms are limited to lack of ground truth disparity and exhibit domain shift when directly transfer model trained on synthetic light field data to real world scene. Real world light field captured by light field cameras often contain complex geometry structure, different lighting conditions and noise. In addition, existing unsupervised method is not yet able to achieve satisfactory results. In this section, we design a combined unsupervised loss to train our network. Our unsupervised loss consists of two terms: occlusion-aware photometric loss and edge-aware smoothness loss. The photometric losses measures how similar the pixel from center view and others are in angular patch $A_d(u,v)$ generated with disparity $\hat {d}$. An angular patch can be formed by gathering pixels at same pixel coordinate in warped sub-aperture images. The edge-aware smoothness loss constrains the predicted disparity map to be constant on color constant regions. The final combined loss we use to train our network are defined as follows:

(3)$$L = L_{occlusion}+\alpha L_{smooth}$$

where $\alpha$ is the weight factor of edge aware smoothness loss.

Fig. 4. The cost-disparity curves of points with different cases. The angular patches at ground truth disparities are also given here. (a) Central image (b) Cost curve of occlusion point (c) Cost curve of textureless point (d) Cost curve of non-occlusion point. Here blue lines denote basic photometric loss, green line in (b) denote photometric loss using our occlusion processing strategy.

Download Full Size | PDF

3.1 Occlusion-aware photometric loss

Our unsupervised learning is implemented using warping-based view synthesis loss. Given a central view image $I_c$ and other views $\{I_i\}$ of light field, we are able to synthesize a series of central view images $\{\hat I_c\}$ by backward warping operation using the predicted central view disparity map $D$. For a non-occlusion point, as shown in Fig. 4(d), pixels in angular patch generated using ground truth disparity should appear photometric consistency constraint. Hence our basic unsupervised loss adopts photometric errors of the central image and synthesis views which imposes synthetic sub-aperture images should be as similar as possible to the central view. To alleviate the negative influence on the photometric loss brought by pixels projected outside the range of image coordinates, those pixels usually appear as black edges in the synthetic images, we simultaneously generate a binary validity mask while synthesizing the central view image. The basic photometric loss function is expressed as follows:

(4)$$L_p=\dfrac{1}{N}\dfrac{1}{V} \sum_{i\in N} \sum_{p\in V} \Vert (I_c-\hat{I}_c)\odot{M} \Vert_1,$$

where $N$ denotes the number of light field sub-aperture images, $V$ is the valid pixels projected from other viewpoints, $M$ denotes the binary valid mask and $\odot$ denotes dot product.

In ideal conditions, the pixels of the angular patch obtained using ground truth disparity should obey photometric consistency. However, due to the existence of occlusion in the scene, as shown in Fig. 4(a), part of angular patch pixels will occur color changes. At these locations, the ground truth disparity not correspond to global minimum of cost-disparity curve. Motivated by the occlusion processing strategy proposed in [7,27,28], we use a modified photometric loss to handle occlusion problem. When we compute loss described in Eq. (5), we assume that $\Omega$ pixels in angular patch is not occluded. Instead of considering all pixels in the angular patch, we filter out pixels belonging to occlusion set. Specifically, we sort all pixels using photometric error and take smaller $\Omega$ into account. Here $\Omega$ denotes the set of non-occluded views. The final occlusion-aware photometric loss function as:

(5)$$L_{occlusion}=\dfrac{1}{\Omega}\dfrac{1}{V} \sum_{i\in \Omega} \sum_{p\in V} \Vert (I_c-\hat{I}_c)\odot{M} \Vert_1,$$

In addition, to better measure the similarity between the central viewpoint image and the synthetic image, follows [27,29,30], we add a common used image structural similarity assessment function to improve the robustness to illumination changes. The SSIM term is shown in Eq. (6), where $\mu$ and $\sigma$ represent the mean and variance, respectively.

(6)$$L_{SSIM}=\dfrac{1}{V}\sum_{p\in V} \big[ 1-\dfrac{(2 \mu_{I_c} \mu_{\hat{I_c}} +C_1) (2 \mu_{I_c} \mu_{\hat{I_c}} +C_2)}{(\mu_{I_c}^2 + \mu_{\hat{I_c}}^2 + C_1)(\sigma_{I_c}^2 + \sigma_{\hat{I_c}}^2 + C_2)} \big] \odot{M},$$

3.2 Edge-aware smoothness loss

Another case that may lead to incorrect disparity estimation is the textureless region, as shown in Fig. 4(b), where pixels often exhibit depth ambiguity in stereo matching. In order to solve this problem, we use an edge-aware smoothness loss to add the smoothness prior information to regularize the estimated disparity map by considering that disparity discontinuities often occur at color discontinuities and plane surface usually in color constant region. The loss is expressed as:

(7)$$L_{smooth}=\dfrac{1}{N} \sum_{p\in N} \vert \partial_x D\vert e^{\vert \partial_x I \vert} + \vert \partial_y D\vert e^{\vert \partial_y I \vert}$$

where $\partial D$ and $\partial I$ are gradient of disparity map and color intensity, respectively.

4. Implementation and datasets

In this section, we introduce the implementation details of our experimental setup and the datasets used in both training and testing.

4.1 Implementation

In our experiments, the patch-wise training strategy is adopted. It means that we randomly crop image patches of size $64\times 64$ from the original sub-aperture images as input of network for training. Following [19], we perform data augmentation of scaling, rotation and transpose on the input patches to prevent our network from overfitting and improve generalization ability. During test phase, we feed full resolution images into our network. Note that, all inputs are converted to gray scale. In terms of our cascade cost volume, we construct a three-stage cascade architecture. The disparity searching range is set to (−4,4). For the first to third stage, the disparity interval is set to 1, 0.3 and 0.1 pixel respectively and the corresponding number of disparity levels is all set to 9. With respect to computing losses, we calculate losses at each stage using all available 81 views. The edge-aware smoothness term weight $\alpha$ and the SSIM term weight are set as 0.01 and 0.15, respectively. Our network is implemented using the PyTorch [31] deep learning framework. We use Adam network optimizer [32] to optimize our models, where $\beta _{1}=0.9$, $\beta _{2}=0.999$. The learning rate is set to 0.001 and batch size is 6. We perform all our experiments on an NVIDIA GTX 1080Ti GPU.

4.2 Datasets

We train and test our algorithm on the public available 4D Light Field Dataset [33] published by University of Konstanz and Heidelberg University. It is a synthesis dataset produced by 3d rendering engine Blender. This dataset consists of 28 scenes that are divided into four classes: "Stratified" ,"Training","Test" and "Additional". Every scene include 81 images which are recorded with $9\times 9$ camera array grid setting and the resolution of each image are $512\times 512$. We use 12 scenes under ’additional’ category of the 4D light field dataset as training data and 4 ’training’ scenes as testing data. For further generality testing, we select four scenes from the HCI dataset created by Wanner et al. [34]: buddha, papillon, mona and stillLife. These scenes are also synthetic dataset similar to [33] but contain different contents with spatial resolution of $768\times 768$.

We also train our algorithm using real world dataset from The Stanford Lytro Light Field Archive [35] and light field view synthesis datasets [36]. This dataset was captured by Lytro Illum light field camera in diverse environments and include 130 scenes in which we take 100 scenes from them as the training set and the rest as the test set. The spatial resolution and angular resolution of the captured light field is $376\times 541$ and $14\times 14$ respectively. We only use central $9\times 9$ sub-aperture images in our experiments.

5. Experiments and analysis

In this section, we perform sufficient experiments to verify the effectiveness of our method. First, we validate the performance of our method on synthetic and real datasets, respectively. Then, ablation studies are conducted to verify the effectiveness of each component of the proposed algorithm.

5.1 Evaluation on synthetic datasets

To quantitatively evaluate our approach, we select some test scenes from synthetic datasets which have disparity ground truth: the ’training’ subset of the HCI 4D light field dataset [33] and four synthetic scenes from the HCI Wanner dataset [34]. The BadPix and MSE metric are used to measure the accuracy of disparity results. The MSE represents the mean square error between the predicted disparity maps and the ground truth. The BadPix represents the percentage of pixels with absolute distance between the predicted disparity and the ground truth exceeding a certain threshold $\delta$. They are defined as follows:

(8)$$MSE=\dfrac{1}{K} \sum_{p\in K} \lvert (D_{gt}(p)-\hat{D}(p)) \rvert^2$$

(9)$$BadPix=\dfrac{1}{K} \sum_{p\in K} \lvert (D_{gt}(p)-\hat{D}(p)) \rvert > \delta$$

where $D_{gt}$ and $\hat {D}$ denote the ground truth disparity map and predicted disparity map, respectively. $p$ denotes the pixel coordinate and K is the number of pixels in disparity map. We set $\delta$ to 0.07 in our experiments.

Table 1. Quantitative comparisons of our algorithm and state of the art algorithms on HCI synthetic datasets

View Table | View all tables in this article

Table 2. Quantitative comparisons of our algorithm and state of the art algorithms on HCI Wanner synthetic datasets

View Table | View all tables in this article

We compare our method with the state of the art disparity estimation algorithms on 4D Light Field Benchmark, i.e. two traditional methods lf_occ [8], cae [11], two supervised deep learning based algorithms epinet [19], lfattnet [12] as well as an unsupervised learning based algorithm lf_unsup [24]. Table 1 and Table 2 show the detailed quantitative results on HCI and HCI Wanner synthetic datasets, where lower scores indicate better performance for both metrics and best scores are shown in bold. It can be seen from the Table 1 and Table 2 that our method achieves a significant improvement on all scenes in comparison with lf_unsup [24]. Specifically, our method outperforms lf_unsup [24] by 28.41 and 4.83 in terms of Badpix0.07 and MSE in the HCI dataset, respectively. For traditional methods, our results are numerically better than lf_occ [8]. In addition, the performance of our method is comparable to cae [11] where our results are better in MSE and worse in Badpix0.07 compared to cae [11] in HCI. However, our method is much faster than traditional methods. Specifically, the traditional method usually takes several minutes to estimate a 512x512 disparity map while our method takes only about 3 seconds. When comparing the results of epinet [19] and lfattnet [12] with our method, the results of these supervised learning methods are better than our algorithm on the HCI [33] dataset. Whereas, due to supervised methods usually suffering from the problem of generalization, our algorithm achieves a comprehensive advantage except in Buddha scene when the test set switches to HCI Wanner [34] dataset.

In Fig. 5, we show the qualitative visualization comparisons between our method and compared methods for the synthetic dataset. The central view image and corresponding ground truth disparity map are shown in the first column and the disparity predictions of each method and corresponding error maps for Badpix0.07 are shown in the remaining columns. As shown in Fig. 5(f), since the impact of occlusion, the predicted disparity maps of lf_unsup [24] have more bad pixels at the edge of objects in all scenes. Moreover, large areas of errors appear at textureless background region in the cotton scene. On the contrary, shown in Fig. 5(g), our method can handle these problems better. For supervised method epinet [19] and lfattnet [12], we can see obvious precision degradation from Fig. 5(d) and (e). Specifically, there are significantly more red pixels in the bottom two rows than in the top two rows.

Fig. 5. The qualitative comparisons of results on synthetic datasets. The first column, upper row shows central view and lower row shows ground truth disparity. For the rest column, the upper row shows the error map for Badpix0.07 and the lower row shows disparity prediction. Left to right: (a)gt (b) lf_occ [8] (c) cae [11] (d) epinet [19] (e) lfattnet [12] (f) lf_unsup [24] (g) ours.

Download Full Size | PDF

5.2 Evaluation on real world datasets

We further evaluate our approach on real world datasets, where our model is retrained using real world light field. Since the real world light field datasets captured by Lytor Illum camera do not provide ground truth disparity maps, our experiments only conduct qualitative comparison shown in Fig. 6. Compared with other algorithms, our method can provide more accurate results which give smooth disparity prediction and preserve details on object boundaries. Results are shown in (i) and (j) illustrate that traditional methods like lf_occ [8] and cae [11] fail to produce complete shape of the window (purple rectangle) due to the lack of context information. Moreover, the result of cae [11] shown in (c) gives inaccurate predictions at boundaries of the fence. As the real scene data usually contain noise and unseen scenarios in training set, supervised learning methods epinet [19] and lfattnet [12] inevitably give clear artifacts shown in orange rectangle of (k) and (l). When comparing the results of lf_unsup [24], which is shown in (f) and (m), it is clear that the results are blurry and noisy.

Fig. 6. Visualization comparisons of results on real world light field datasets. The central views and corresponding disparity maps are shown.

Download Full Size | PDF

5.3 Ablation study

In this subsection, we perform ablation studies to validate the improving effect of each component of our method. The ablation experiments are conducted on the ’training’ subset of HCI [33] synthetic dataset.

First, we analyze the effort of our cascade cost volume architecture. To this end, we implement Concatenation type of cost volume which concatenating all shifted feature maps at channel dimension. For more extensive validation, we implement another Variance type cost volume which build cost volume by calculating variance of all shifted feature maps using the following formula,

(10)$$C = \dfrac{1}{N} \sum_{i=0}^N (\hat{F}-\bar F)^2$$

where $\hat {F}$ denotes the shifted feature map of sub-aperture images and $\bar F$ is the average of feature maps across all sub-aperture images. Then, we embed our Cascade structure into them, respectively. Next, we compare the produced results and parameters of those networks. Note that, networks equipped with four types of cost volume are trained with basic photometric unsupervised loss.

Table 3. Comparisons of the quantitative results with different type of cost volume structures.

View Table | View all tables in this article

From Table 3 we can see that, Concatenation type cost volume can obtain high accuracy but require large number of network parameters while Variance type cost volume having the advantage of fewer network parameters while leading to lower accuracy. Our cascade structure improves the performance of Concatenation type cost volume on both MSE and Badpix metrics while reducing the parameters of the network by half. Compared to Variance type cost volume, our cascade structure also improves the disparity estimation accuracy but brings more parameters. But small increase in parameters is acceptable. In addition, our cascade cost volume maintains the flexibility in terms of angular resolution of light field input.

Because our cascade network outputs disparity map at each stage for further refining at next stage, our method allows for relatively flexible light field inputs. Specifically, in our experiments, our method can adapt three types of light field inputs with angular resolution $5\times 5$, $7\times 7$ and $9\times 9$, respectively. Figure 7 shows that with the increase of angular resolution, the disparity estimation of our algorithm can achieve better performance on each scene. This proves that our coarse to fine strategy is feasible. Figure 8 illustrates the visualization results of our Cascade network with different angular inputs.

Fig. 7. Result comparisons of inputs with different angular resolution.

Download Full Size | PDF

Fig. 8. Visualization comparisons of inputs with different angular resolution.

Download Full Size | PDF

Then, we use our cascade network trained with the basic photometric loss as the baseline, and progressively add the edge-aware smoothness loss term and our occlusion modeling processing. The experimental results in Table 4 verify that our occlusion-aware photometric loss and edge-aware smoothness loss both can improve the accuracy of results. Through the visualization results shown in Fig. 9, we can see that the occlusion-aware photometric loss can significantly improve the disparity results at the edge of occlusion region. The edge-aware smoothness loss can effectively reduce outliers on the plane of bookcase and wall meanwhile sharpening the edges of objects. Lastly, We explore the best choice of the number of views used to calculate photometric loss in our occlusion-aware photometric loss. As shown in Table 5, When the number of used views $\Omega$ is set to 30, our algorithm can achieve best performance on both MSE and Badpix0.07 metric.

Fig. 9. The visualization comparisons of results with various loss configuration. (a)image patches and corresponding disparity ground truth (b)the disparity and error map with basic photometric loss (c) the disparity and error map with occlusion-aware photometric loss (d) the disparity and error map with occlusion-aware loss and edge-aware smoothness loss.

Download Full Size | PDF

Table 4. Comparisons of the quantitative results with various loss function combinations

View Table | View all tables in this article

Table 5. Comparisons of results as the views used to compute photometric loss varies.

View Table | View all tables in this article

6. Conclusion

In this paper, an unsupervised deep learning algorithm is proposed to estimate disparity from light field data. In our approach, we design a cascade cost volume network to produce accurate disparity maps by exploring geometry characteristics between sub-aperture images. In addition, we train our network with combined unsupervised loss which consists of occlusion-aware photometric loss and edge-aware smoothness loss to improve the performance on occlusion and textureless regions, respectively. Experimental results demonstrate the effectiveness of our method on both synthetic and real light-field images. Lastly, there are still challenging cases limiting the disparity results such as glass objects leading to incorrect correspondences and sky regions with infinite distance. We will continue to address these problems in the future.

Disclosures

The authors declare no conflicts of interest.

Data availability

Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request.

References

1. R. Ng, M. Levoy, M. Brédif, G. Duval, M. Horowitz, and P. Hanrahan, “Light Field Photography with a Hand-held Plenoptic Camera,” Research Report CSTR 2005-02, Stanford university (2005).

2. A. Lumsdaine and T. Georgiev, “The focused plenoptic camera,” in 2009 IEEE International Conference on Computational Photography (ICCP), (2009), pp. 1–8.

3. M. Feng, S. Z. Gilani, Y. Wang, and A. Mian, “3d face reconstruction from light field images: A model-free approach,” in Proceedings of the European Conference on Computer Vision (ECCV), (2018), pp. 501–518.

4. D. G. Dansereau, O. Pizarro, and S. B. Williams, “Linear volumetric focus for light field cameras,” ACM Trans. Graph. 34(2), 1–20 (2015). [CrossRef]

5. A. Levin and F. Durand, “Linear view synthesis using a dimensionality gap light field prior,” in 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, (2010), pp. 1831–1838.

6. A. Bajpayee, A. H. Techet, and H. Singh, “Real-time light field processing for autonomous robotics,” in 2018 IEEE/RSJ international conference on intelligent robots and systems (IROS), (IEEE, 2018), pp. 4218–4225.

7. C. Chen, H. Lin, Z. Yu, S. B. Kang, and J. Yu, “Light field stereo matching using bilateral statistics of surface cameras,” in 2014 IEEE Conference on Computer Vision and Pattern Recognition, (2014), pp. 1518–1525.

8. T.-C. Wang, A. A. Efros, and R. Ramamoorthi, “Occlusion-aware depth estimation using light-field cameras,” in 2015 IEEE International Conference on Computer Vision (ICCV), (2015), pp. 3487–3495.

9. T.-C. Wang, A. Efros, and R. Ramamoorthi, “Depth estimation with occlusion modeling using light-field cameras,” IEEE Trans. Pattern Anal. Mach. Intell. 38(11), 2170–2181 (2016). [CrossRef]

10. W. Williem and I. K. Park, “Robust light field depth estimation for noisy scene with occlusion,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (2016), pp. 4396–4404.

11. Williem, I. K. Park, and K. M. Lee, “Robust light field depth estimation using occlusion-noise aware data costs,” IEEE Trans. Pattern Anal. Mach. Intell. 40(10), 2484–2497 (2018). [CrossRef]

12. Y.-J. Tsai, Y.-L. Liu, M. Ouhyoung, and Y.-Y. Chuang, “Attention-based view selection networks for light-field disparity estimation,” in Proceedings of the 34th Conference on Artificial Intelligence (AAAI), (2020).

13. Y. Li, Q. Wang, L. Zhang, and G. Lafruit, “A lightweight depth estimation network for wide-baseline light fields,” IEEE Trans. on Image Process. 30, 2288–2300 (2021). [CrossRef]

14. H. Lin, C. Chen, S. B. Kang, and J. Yu, “Depth recovery from light field using focal stack symmetry,” in 2015 IEEE International Conference on Computer Vision (ICCV), (2015), pp. 3451–3459.

15. W. Zhou, E. Zhou, Y. Yan, L. Lin, and A. Lumsdaine, “Learning depth cues from focal stack for light field depth estimation,” in 2019 IEEE International Conference on Image Processing (ICIP), (2019), pp. 1074–1078.

16. S. Wanner and B. Goldlücke, “Globally consistent depth labeling of 4d lightfields,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012 : 16 - 21 June 2012, Providence, RI, USA, (IEEE, Piscataway, 2012), pp. 41–48.

17. S. Zhang, H. Sheng, C. Li, J. Zhang, and Z. Xiong, “Robust depth estimation for light field via spinning parallelogram operator,” Comput. Vis. Image Underst. 145, 148–159 (2016). [CrossRef]

18. S. Heber, W. Yu, and T. Pock, “Neural epi-volume networks for shape from light field,” in 2017 IEEE International Conference on Computer Vision (ICCV), (2017), pp. 2271–2279.

19. C. Shin, H. Jeon, Y. Yoon, I. S. Kweon, and S. J. Kim, “Epinet: A fully-convolutional neural network using epipolar geometry for depth from light field images,” in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, (2018), pp. 4748–4757.

20. R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, (2014), pp. 580–587.

21. K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (2016), pp. 770–778.

22. J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (2015), pp. 3431–3440.

23. S. Heber and T. Pock, “Convolutional networks for shape from light field,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (2016), pp. 3746–3754.

24. J. Peng, Z. Xiong, D. Liu, and X. Chen, “Unsupervised depth estimation from light field using a convolutional neural network,” in 2018 International Conference on 3D Vision (3DV), (2018), pp. 295–303.

25. J.-R. Chang and Y.-S. Chen, “Pyramid stereo matching network,” in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, (2018), pp. 5410–5418.

26. A. Kendall, H. Martirosyan, S. Dasgupta, P. Henry, R. Kennedy, A. Bachrach, and A. Bry, “End-to-end learning of geometry and context for deep stereo regression,” in 2017 IEEE International Conference on Computer Vision (ICCV), (2017), pp. 66–75.

27. C. Godard, O. M. Aodha, M. Firman, and G. Brostow, “Digging into self-supervised monocular depth estimation,” in 2019 IEEE/CVF International Conference on Computer Vision (ICCV), (2019), pp. 3827–3837.

28. T. Khot, S. Agrawal, S. Tulsiani, C. Mertz, S. Lucey, and M. Hebert, “Learning unsupervised multi-view stereopsis via robust photometric consistency,” ArXiv abs/1905.02706 (2019).

29. W. Zhou, E. Zhou, G. Liu, L. Lin, and A. Lumsdaine, “Unsupervised monocular depth estimation from light field image,” IEEE Trans. on Image Process. 29, 1606–1617 (2020). [CrossRef]

30. H. Zhao, O. Gallo, I. Frosio, and J. Kautz, “Loss functions for image restoration with neural networks,” IEEE Trans. Comput. Imaging 3(1), 47–57 (2017). [CrossRef]

31. A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, “Automatic differentiation in pytorch,” in NIPS-W (2017).

32. D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” International Conference on Learning Representations (2014).

33. K. Honauer, O. Johannsen, D. Kondermann, and B. Goldluecke, “A dataset and evaluation methodology for depth estimation on 4d light fields,” in Asian Conference on Computer Vision, (Springer, 2016).

34. S. Wanner, S. Meister, and B. Goldlücke, “Datasets and benchmarks for densely sampled 4d light fields,” in VMV 2013 : Vision Modeling and Visualization, D. Fellner, ed. (Eurographics Association, Goslar, 2013), pp. 225–226.

35. A. S. Raj, M. Lowney, and R. Shah, “Light-field database creation and depth estimation,” Stanford University (2016).

36. N. K. Kalantari, T.-C. Wang, and R. Ramamoorthi, “Learning-based view synthesis for light field cameras,” ACM Transactions on Graphics (Proceedings of SIGGRAPH Asia 2016) (2016).

HCI		Boxes		Cotton		Dino		Sideboard		Avg
		BadPix0.07	MSE	BadPix0.07	MSE	BadPix0.07	MSE	BadPix0.07	MSE	BadPix0.07	MSE
traditional	lf_occ [8]	26.52	9.85	6.21	1.06	14.91	1.14	18.49	2.30	16.53	3.59
traditional	cae [11]	17.88	8.42	3.37	1.51	4.97	0.38	9.84	0.88	9.02	2.79
supervised	epinet [19]	12.34	5.97	0.45	0.19	1.21	0.16	4.46	0.80	4.62	1.78
supervised	lfattnet [12]	11.04	3.99	0.27	0.21	0.85	0.09	2.87	0.53	3.76	1.21
unsupervised	lf_unsup [24]	48.44	12.75	40.93	8.59	30.71	2.74	31.24	5.01	37.83	7.27
unsupervised	ours	22.05	7.55	2.59	0.57	4.40	0.49	8.63	1.16	9.42	2.44

HCI Wanner		Buddha		Mona		Papillon		StillLife		Avg
		BadPix0.07	MSE	BadPix0.07	MSE	BadPix0.07	MSE	BadPix0.07	MSE	BadPix0.07	MSE
traditional	lf_occ [8]	11.68	1.94	12.42	0.58	29.97	0.83	59.45	84.61	28.38	21.99
traditional	cae [11]	3.21	0.64	7.33	0.49	7.33	0.65	14.40	1.26	8.07	0.76
supervised	epinet [19]	1.55	0.36	10.78	1.33	35.56	6.12	11.37	2.43	14.82	2.56
supervised	lfattnet [12]	2.02	0.33	10.77	0.79	34.96	5.07	11.78	14.01	14.88	5.05
unsupervised	lf_unsup [24]	11.55	1.14	12.90	1.98	30.30	5.32	42.05	17.28	28.83	6.43
unsupervised	ours	4.54	0.33	9.58	0.61	27.10	1.06	8.97	5.52	12.55	1.88

	$C o n c a t e n a t i o n$	$C o n c a t e_C a s c a d e$	$V a r i a n c e$	$V a r i a n c e_C a s c a d e$
MSE	5.29	5.15	5.60	5.31
BadPix0.07	15.09	11.04	18.95	12.18
Parameters	5.7M	3.2M	0.4M	0.7M

	$L_{p}$	$L_{o c c l u s i o n}$	$L_{o c c l u s i o n} + L_{s m o o t h}$
MSE	5.15	3.67	2.44
Badpix0.07	11.04	10.16	9.42

	$Ω = 20$	$Ω = 30$	$Ω = 40$	$Ω = 50$
MSE	4.54	3.67	4.08	4.39
Badpix0.07	11.78	10.16	10.99	11.25

Cascade light field disparity estimation network based on unsupervised deep learning

Abstract

1. Introduction

2. Architecture overview

2.1 Feature extraction

2.2 Cascade cost volume

2.3 Cost aggregation and disparity regression

3. Unsupervised learning

3.1 Occlusion-aware photometric loss

3.2 Edge-aware smoothness loss

4. Implementation and datasets

4.1 Implementation

4.2 Datasets

5. Experiments and analysis

5.1 Evaluation on synthetic datasets

5.2 Evaluation on real world datasets

5.3 Ablation study

6. Conclusion

Disclosures

Data availability

References

Data availability

Cited By

Figures (9)

Tables (5)

Equations (10)

Optics Express