
Binocular stereo matching of real scenes based on a convolutional neural network and computer graphics

Open Access

Abstract

Binocular stereo matching methods based on deep learning have limited cross-domain generalization ability, and obtaining large amounts of training data from real scenes is difficult, so even the most advanced stereo matching networks are hard to apply to new real scenes. In this paper, we propose a real-scene stereo matching method based on a convolutional neural network and computer graphics. A virtual binocular imaging system is constructed with graphics software, and a high-quality semi-synthetic dataset whose texture characteristics are close to those of real scenes is built to train the network. A feature standardization layer is embedded in the feature extraction module of the proposed network to further reduce the feature-space difference between the semi-synthetic data and real-scene data. Three small 4D cost volumes are constructed to replace one large 4D cost volume, which reduces GPU memory consumption and improves the matching performance of the network. The experimental results show that, compared with traditional stereo matching methods, the matching accuracy of the proposed method is significantly improved, by about 60%. Compared with other learning-based methods, the matching accuracy is increased by about 30% and the matching speed by 38%, and the method is robust to the interference of defocus blur and Gaussian noise.

© 2021 Optical Society of America under the terms of the OSA Open Access Publishing Agreement

1. Introduction

3D reconstruction based on binocular stereo matching has become a popular non-contact 3D shape measurement technology. It has the advantages of simple hardware configuration, low cost, and high measurement accuracy, and it is widely used in fields such as autonomous driving, robot guidance, industrial inspection, and scientific research [1–3]. In essence, binocular stereo matching is a form of stereo vision that recovers depth information from planar images: a single pair of images taken by the left and right cameras is enough to reconstruct a real scene in 3D [4]. Compared with methods based on monocular vision, the binocular approach achieves more reliable measurement accuracy [5]. However, how to apply binocular stereo matching to real scenes is still a problem worth studying.

Traditional binocular stereo matching methods can be divided into local and global algorithms according to the optimization strategy [6]. Global stereo matching algorithms estimate disparity with global optimization theory: a global energy function consisting of a data term and a smoothing term is established, and the optimal disparity is obtained by minimizing it [7], as in the graph cut algorithm [8], the belief propagation algorithm [9], the dynamic programming algorithm [10], and the genetic algorithm [11]. However, global energy minimization is computationally expensive (the optimization problem has polynomial complexity only along a one-dimensional path), so the running time is relatively long and unsuitable for real-time operation [12]. The semi-global algorithm [13] therefore reduces the two-dimensional problem to 8 to 16 one-dimensional problems; after calculating the accumulated cost along each direction, the costs of all directions are added to obtain the total cost, thereby approximating the two-dimensional optimization problem. This makes semi-global matching a key technology for generating disparity maps as an alternative to lidar. Local matching algorithms, also called window-based or support-based methods, compute a window of appropriate size, shape, and weight for each pixel in the reference image and then take a weighted average of the disparity values inside the window [14–16]. They also estimate disparity by energy minimization, but the energy function contains only a data term, not a smoothing term. Because local matching uses only the grayscale, color, and gradient information of a single point to compute the matching cost, it has low computational complexity and good real-time performance; examples include the sum of absolute differences (SAD) [17] and sum of squared differences (SSD) [18] algorithms.

Binocular stereo matching networks based on deep learning can be divided, according to how the cost is computed, into networks that use traditional cost calculation methods and networks that construct a 4D cost volume [19]. Networks using traditional cost calculation compute a 3D cost through full connection; this is usually effective, but a lot of information is lost when the feature channels are collapsed [20]. Many previous works, including DispNet [21], iResNet [22], SegStereo [23], FADNet [24], and AANet [25], belong to this category. Networks that compute the cost by constructing a 4D cost volume concatenate the left and right features, which improves matching performance but requires higher computational complexity and GPU memory consumption [26]. Kendall et al. [27] first concatenated the left and right features along the disparity dimension to generate a 4D cost volume of size C×D×H×W and proposed a differentiable soft argmin to achieve more robust disparity regression from the cost. Chang et al. [28] proposed the pyramid stereo matching network, which uses spatial pyramid pooling and multiple hourglass networks to further improve matching accuracy. Guo et al. [29] combined the two cost-volume calculation methods and proposed group-wise correlation to construct the cost volume, which is more effective for weakly textured, reflective, and occluded regions. Zhang et al. [30] replaced 3D convolutions in the cost aggregation module with an SGM-based semi-global guidance layer and a local guidance aggregation layer, achieving higher accuracy while reducing computation. Zhang et al. [31] focused on supervising the matching cost volume. Cheng et al. [32] introduced neural architecture search into stereo vision, enabling cell-level and network-level architecture search while balancing the demand for computing resources. Yin et al. [33] built a dataset with an FPP system and proposed a well-performing end-to-end stereo matching network that can match a single object with high precision, but its generalization ability needs further improvement; this is also an important factor limiting the application of stereo matching networks to complex real scenes [34]. Shen et al. [35] and Li et al. [36] studied the generalization performance of stereo matching networks and enhanced joint generalization by supervising the confidence of the predicted disparity, but their cross-domain generalization performance is still not ideal.

In supervised learning, high-quality datasets, including input data and ground truth, are very important for deep-learning-based methods. Existing public evaluation benchmarks include Middlebury [37] and KITTI [38]. Middlebury is a high-resolution indoor scene dataset containing 15 image pairs; it provides ground truth for evaluating network performance, but it is not large enough to train a network. The KITTI dataset is a prominent urban street-view stereo dataset that has promoted the development of deep learning for stereo vision [39]. It provides 200 image pairs with ground truth for network training, but because the cross-domain generalization of stereo matching networks is insufficient, it is difficult for a network trained on KITTI to achieve good results in real scenes it has never seen.

In this paper, we propose a real-scene stereo matching method based on convolutional neural networks and computer graphics that has good cross-domain generalization performance. First, the graphics software Blender is used to build a virtual binocular imaging system, and real-scene images are used to texture the simulated objects. This yields left and right images whose texture features are close to those of real scenes, together with complete, high-precision depth maps that are converted to dense disparity maps as ground truth. In the proposed network, a feature pyramid network [40] first extracts multi-scale feature tensors from the left and right images, from which multi-scale 4D cost volumes are constructed. Considering that a certain feature-space difference remains between our semi-synthetic data and real-scene data, a simple feature standardization layer is embedded in the feature extraction module to reduce this difference. Since 3D convolutions in cost aggregation are computationally expensive, three small 4D cost volumes are used instead of one large 4D cost volume to obtain higher matching performance.

The rest of this article is organized as follows. Section 2 introduces the proposed method, including dataset construction and the end-to-end stereo matching network structure. Section 3 presents the comparison between our experimental results and those of other methods. Finally, the conclusion is given in Section 4.

2. Method

This section introduces a method that achieves high-precision matching of real scenes by combining a stereo matching network with computer graphics. In this method, two stereo cameras capture the measured scene at the same time. Epipolar correction is first performed on the collected images, which are then fed into the proposed stereo matching network to obtain the corresponding disparity map. Converting the disparity map into a depth map realizes the 3D reconstruction.

Our approach has two main aspects. The first is to generate dense and accurate depth maps with the computer graphics software Blender and to convert them into disparity maps that serve as the ground truth of a semi-synthetic dataset. By making a semi-synthetic dataset whose style and texture are close to the real scene, we address the difficulty of obtaining trainable real-scene data. The second is a binocular stereo matching network with good cross-domain generalization performance, which addresses the limited cross-domain generalization of existing stereo matching networks and their difficulty in being applied to actual scenes. Section 2.1 describes how the virtual binocular imaging system in Blender is used to construct a semi-synthetic dataset close to the texture of real scenes, and Section 2.2 describes the specific structure of the proposed network.

2.1 Dataset rendering and preprocessing

Recently, computer graphics has been successfully introduced into dataset generation [41–43]. To establish a high-quality semi-synthetic dataset, the 3D modeling software Blender is used to quickly obtain high-precision, dense depth maps, which are then converted into disparity maps as ground truth. Compared with an FPP system, which is limited by the effective working distance under visible light [44], the proposed method can also obtain dense and accurate disparity for more distant targets. This section describes the construction of the virtual binocular imaging system and the production of the semi-synthetic dataset.

2.1.1 Selection of the 3D model

There are many 3D models that can be used as objects in virtual binocular imaging systems, such as ModelNet [45], ShapeNet [46], and Thingi10K [47]. Considering the rich features of objects in real scenes, this article chooses ModelNet, which contains 3D models of various common objects, such as cars, stairs, sculptures, etc. In addition, simple 3D models of cubes and cylinders were added to increase the variety of models. The diversity and scale of these models help generate large-scale and diverse data samples based on actual scenarios.

2.1.2 Virtual binocular imaging system

Blender is a powerful open-source 3D modeling package. It can generate images in batches through its Python interface, which meets our need to produce large amounts of data quickly. In Blender, virtual cameras and objects are placed in the "layout", as shown in Fig. 1. The virtual system works in the same way as a real binocular vision system: the left and right cameras capture the objects separately to obtain a stereo image pair. After the virtual binocular vision system is built, Blender's particle system is used to let all models move randomly in space, which avoids manually setting the trajectory of each object. During the movement, the cameras capture each frame as an image of the final dataset.
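The batch-generation scripts themselves are not published; a minimal Blender Python sketch of such a two-camera rig might look like the following, where the baseline, focal length, and resolution values are illustrative assumptions rather than the authors' settings.

```python
import bpy

# Illustrative, assumed parameters (not the authors' exact values).
BASELINE = 0.13   # left/right camera separation along the x axis, scene units
FOCAL_MM = 35.0   # focal length of both virtual cameras, millimeters

def add_camera(name, x_offset):
    """Create a camera at (x_offset, 0, 0) with a fixed, shared orientation."""
    bpy.ops.object.camera_add(location=(x_offset, 0.0, 0.0),
                              rotation=(1.5708, 0.0, 0.0))
    cam = bpy.context.object
    cam.name = name
    cam.data.lens = FOCAL_MM
    return cam

# Identical orientation plus an x-only offset mimics an already rectified
# stereo pair, consistent with the epipolar-corrected geometry of Section 2.1.4.
cam_left = add_camera("cam_left", 0.0)
cam_right = add_camera("cam_right", BASELINE)

scene = bpy.context.scene
scene.render.engine = 'CYCLES'
scene.render.resolution_x = 960
scene.render.resolution_y = 540
```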

Fig. 1. Side view of the scene set in Blender.

2.1.3 Scene rendering

Finally, the "Cycles" engine is used for rendering. The left and right cameras render the left and right image pairs by setting the composite node "rendering layer" to "image", and the depth image is rendered by setting "depth", the depth maps are converted into disparity maps as ground truth. Because only manual settings are used in the construction of the virtual scene, using an RTX 2080Ti, we can generate 10 pairs of images and real disparity maps with resolution of 960*540 within 30 seconds.

2.1.4 Factors enhancing authenticity

Binocular imaging system parameters. The spatial geometric relationship between the left camera coordinates [$X_l$, $Y_l$, $Z_l$] and the right camera coordinates [$X_r$, $Y_r$, $Z_r$] can be described as:

$${ \left[ \begin{array}{ccc} X_l \\ Y_l \\ Z_l \end{array} \right ]}= R\times{ \left[ \begin{array}{ccc} X_r \\ Y_r \\ Z_r \end{array} \right ]} + T.$$

Where $R$ is the rotation matrix and $T$ is the translation vector. $R$ and $T$ differ between binocular vision systems, but to improve matching efficiency, the left and right images are rectified to the same horizontal line before stereo matching, eliminating the influence of the rotation matrix $R$ [48]. To simulate the relationship between the left and right camera coordinates after epipolar correction, the rotation angles of the left and right cameras are set to be identical, and the right camera is only translated along the x-axis relative to the left camera, which can be described as:

$${ \left[ \begin{array}{ccc} X_l \\ Y_l \\ Z_l \end{array} \right ]} = \left[ \begin{array}{ccc} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{array} \right ] \times { \left[ \begin{array}{ccc} X_r \\ Y_r \\ Z_r \end{array} \right ]} + \left[ \begin{array}{ccc} b \\ 0 \\ 0 \end{array} \right ].$$

Image texture characteristics. To make the captured images closer to the real scene in terms of global and texture features, the collected real-scene images are cropped into a series of 256×256 patches so that they are not stretched when used as textures for the 3D objects. Then, "Edit Mode" and "Face Select" are selected in "UV Edit" to perform UV mapping on the models in the scene one by one. After the UV is unwrapped, the "Surface" in the "Material Properties" of the simulated object is set to "Emission", the "Color" is set to "Image Texture", and a cropped real-scene image is then opened at random to achieve texture mapping. It is worth noting that each object opens a randomly chosen image to ensure the diversity of the data. The background is rendered by setting the composite node "Image" to "Color", and its disparity value is zero.

Imaging system effect. When applied in a real scene, the quality of the collected images also differs depending on the optical imaging system. To simulate this imaging process, this paper uses the point spread function ($PSF$), which describes the response of an imaging system to a point light source [49], to blur the original images to different degrees. The imaging process can be described as:

$$g = f \ast{PSF}.$$

Among them, $f$ is the initial clear image, $g$ is the blurred image, and $*$ represents the convolution operation. When there is optical aberration, the system point spread function is rewritten as:

$$PSF = |\int_{-\infty}^{+\infty}P(u)exp\{j[\phi_D(u)]\}exp[j2{\pi}ux]du|^2.$$

Where $\phi _D(u)$ is the phase of the optical aberration wavefront.

$$\phi_D(u) = \frac{2\pi}{\lambda}[D({\xi^2}+n^2)].$$

Where $u=(\xi ,n)$ are the coordinates on the entrance pupil plane of the optical system and $D$ is the defocus coefficient. After the point spread function is obtained, a Fourier transform yields the system transfer function with optical aberration.
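The paper does not give an implementation, and Table 1 parameterizes the blur by a defocus radius, so the exact kernel used is not specified; the following numpy sketch simply follows Eqs. (3)–(5), with the defocus expressed in wavelengths and an illustrative aperture so that the numbers stay well behaved.

```python
import numpy as np
from scipy.signal import fftconvolve

def defocus_psf(n_samples=64, aperture_radius=0.8, defocus_waves=3.0):
    """Numerical PSF from Eqs. (4)-(5): |FT{P(u) exp(j*phi_D(u))}|^2.
    The defocus is expressed in wavelengths, phi_D = 2*pi*D*(xi^2 + eta^2),
    a simplification of Eq. (5); sampling and aperture values are illustrative."""
    xi = np.linspace(-1.0, 1.0, n_samples)
    XI, ETA = np.meshgrid(xi, xi)
    pupil = ((XI**2 + ETA**2) <= aperture_radius**2).astype(float)   # P(u)
    phi_d = 2.0 * np.pi * defocus_waves * (XI**2 + ETA**2)           # defocus phase
    psf = np.abs(np.fft.fftshift(np.fft.fft2(pupil * np.exp(1j * phi_d))))**2
    return psf / psf.sum()                                           # unit energy

def blur(image, psf):
    """Eq. (3): g = f * PSF (2D convolution; apply per channel for RGB images)."""
    return fftconvolve(image, psf, mode='same')
```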

Environmental noise. Real images are usually affected by sensor material properties, the working environment, and electronic components during acquisition, which contaminates the image with noise and disturbs the observable information. To adapt the stereo matching network to both indoor and outdoor scenes, images are collected under two light sources, "Point light" and "Sun light", and the position of the light source is moved randomly. To simulate the influence of noise, we apply different degrees of noise to the blurred images, which can be described as:

$$g = f \ast{PSF} + n.$$

Among them, $n$ represents noise. By adjusting the above factors, the virtual binocular imaging system can be set to generate data that is very close to what an actual system would capture, which improves the performance of the trained network on real scenes.
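The noise step of Eq. (6) can be sketched as follows; the default mean and standard deviation are illustrative values within the ranges later listed in Table 1, and image values are assumed to be scaled to [0, 1].

```python
import numpy as np

def add_gaussian_noise(image, mean=0.0, std=0.001, seed=None):
    """Eq. (6): g = f * PSF + n, with n drawn independently per pixel.
    `image` is the (already blurred) image with values in [0, 1]."""
    rng = np.random.default_rng(seed)
    noisy = image + rng.normal(mean, std, size=image.shape)
    return np.clip(noisy, 0.0, 1.0)
```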

2.2 Network architecture

This section proposes an effective cross-domain generalized stereo matching network to address the stereo matching of complex real scenes. Compared with state-of-the-art stereo matching methods, both matching accuracy and speed are greatly improved. First, the proposed network embeds a simple and effective feature standardization layer in the feature extraction stage to reduce the impact of the feature difference between training data and real images on network performance. In addition, multi-scale cost volumes are used in the cost aggregation stage to avoid excessive demand for GPU memory and computing resources. The structure of the proposed binocular stereo matching network is shown in Fig. 2.

Fig. 2. Schematic diagram of the proposed stereo matching network. The binocular stereo matching network is composed of a feature pyramid module (feature extraction module), 4D cost volume, cost aggregation module and disparity regression module.

In Fig. 2, the entire binocular stereo matching network is composed of a feature extraction module, 4D cost volumes, a cost aggregation module, and a disparity regression module. It is worth noting that epipolar correction is performed before stereo matching, which simplifies the two-dimensional search into a one-dimensional matching problem; our semi-synthetic data is already rectified by construction in the virtual binocular imaging system. The stereo matching network then extracts features from the two input images simultaneously to obtain rich spatial feature information for the subsequent matching stages.

The feature extraction module follows the feature pyramid structure, since the abundant multi-scale spatial feature information it provides helps improve the matching of small objects. Unlike the SPP module in PSMNet [28], we extract multi-scale features gradually with stride-2 convolutions instead of directly applying several large-scale pooling operations to the feature map, which makes the fused multi-scale features more robust.

At the same time, there are still differences in the image texture characteristics of the semi-synthetic data and the real scene data. Therefore, a feature standardization layer is proposed and applied to the feature extraction module.

In the feature extraction module, the size of the feature map after each convolutional layer is N×C×H×W (N: number of samples, C: number of channels, H: spatial height, W: spatial width). First, batch normalization is applied to all samples, which is defined as:

$$F_{c,h,w} = \gamma\frac{F_{c,h,w}-M_{c,h,w}}{\sqrt{(F_{c,h,w}-M_{c,h,w})^2+\varepsilon}}+\beta, M_{c,h,w} = \frac{\sum_{c=1}^{C}\sum_{h=1}^{H}\sum_{w=1}^{W}F_{c,h,w}}{C\times{H\times{W}}}.$$

Among them, $F$ is the extracted feature, $M$ is the mean of the feature, $h$ and $w$ denote the spatial position, $\gamma$ and $\beta$ are the trainable weight and bias, and $c$ is the channel index. Batch normalization exploits the correlation between all samples in a mini-batch. Then, instance normalization is performed on each channel of each sample, which is defined as:

$$F_{h,w} = \gamma\frac{F_{h,w}-M_{h,w}}{\sqrt{(F_{h,w}-M_{h,w})^2+\varepsilon}}+\beta, M_{h,w} = \frac{\sum_{h=1}^{H}\sum_{w=1}^{W}F_{h,w}}{H\times{W}}.$$

Since instance normalization is applied to a single channel of a single sample, it is not affected by other channels or by the batch size, so each channel of each sample after instance normalization can be regarded as an independent small "domain". However, the correlation between channels, which is critical for extracting spatial feature information, is ignored. Therefore, value standardization is applied at each spatial position of each sample to enhance the connection between channels, which is defined as:

$$F_{C} = \gamma\frac{F_{C}-M_{C}}{\sqrt{(F_{C}-M_{C})^2+\varepsilon}}+\beta, M_{C} = \frac{\sum_{c=1}^{C}F_{C}}{C}.$$

After value standardization, each sample in a mini-batch can be regarded as an independent small "domain", and all samples in a mini-batch can be regarded as an independent large "domain", because the correlation between samples is enhanced by batch normalization while their independent characteristics are magnified by instance normalization and value standardization. The feature difference between the large "domains" is therefore relatively large, while the feature difference between the small "domains" remains small due to the enhanced correlation. By training on these different large and small "domains", the network adapts better to different degrees of inter-domain feature difference, improving its cross-domain generalization performance. The ablation experiment in Section 3.4 proves the effectiveness of the proposed feature standardization.
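The paper describes the feature standardization (FS) layer only through Eqs. (7)–(9); the PyTorch sketch below is one possible realization, where the placement of the affine parameters and the use of the built-in normalization layers are our assumptions rather than the authors' exact design.

```python
import torch
import torch.nn as nn

class FeatureStandardization(nn.Module):
    """Sketch of the FS layer: batch normalization (Eq. 7), then instance
    normalization (Eq. 8), then a per-position normalization over the channel
    dimension ("value standardization", Eq. 9)."""

    def __init__(self, channels, eps=1e-5):
        super().__init__()
        self.bn = nn.BatchNorm2d(channels, eps=eps)                       # all samples in the mini-batch
        self.inorm = nn.InstanceNorm2d(channels, eps=eps, affine=True)    # per sample, per channel
        self.eps = eps
        self.gamma = nn.Parameter(torch.ones(1, channels, 1, 1))          # assumed affine terms for Eq. (9)
        self.beta = nn.Parameter(torch.zeros(1, channels, 1, 1))

    def forward(self, x):                                                  # x: [N, C, H, W]
        x = self.bn(x)
        x = self.inorm(x)
        # Value standardization: normalize across channels at each (n, h, w).
        mean = x.mean(dim=1, keepdim=True)
        var = x.var(dim=1, unbiased=False, keepdim=True)
        x = (x - mean) / torch.sqrt(var + self.eps)
        return self.gamma * x + self.beta
```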

By concatenating the left feature map with its corresponding right feature map at each disparity level, a 4D cost volume (H×W×D×C, D: maximum disparity) is formed to compute the cost, which retains the feature dimension and preserves the geometric knowledge of stereo vision very well [29]. Given the left and right features $f_l$ and $f_r$, the 4D feature volume is obtained as:

$$C_{concat}(d,x,y) = Concat\{f_l(x,y),f_r(x-d,y)\}.$$

Among them, $Concat\{\bullet \}$ denotes concatenation along the feature channel dimension, and $d$ denotes each disparity level.
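A sketch of the concatenation-based construction of Eq. (10) in the common PyTorch layout [B, 2C, D, H, W] is given below; the feature maps are assumed to be at the network's reduced resolution, so `max_disp` here is the correspondingly reduced maximum disparity.

```python
import torch

def build_concat_cost_volume(feat_l, feat_r, max_disp):
    """Eq. (10): concatenate left features with right features shifted by each
    disparity level d. Positions with no valid right-image counterpart are left zero."""
    b, c, h, w = feat_l.shape
    volume = feat_l.new_zeros(b, 2 * c, max_disp, h, w)
    for d in range(max_disp):
        if d == 0:
            volume[:, :c, d, :, :] = feat_l
            volume[:, c:, d, :, :] = feat_r
        else:
            volume[:, :c, d, :, d:] = feat_l[:, :, :, d:]
            volume[:, c:, d, :, d:] = feat_r[:, :, :, :-d]
    return volume
```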

To better combine stereo geometric information, enhance detail, and improve real-time performance, the multi-scale features of the feature extraction module are combined. Using a multi-scale cost volume, three smaller 4D cost volumes are constructed to replace a single 1/4-resolution 4D cost volume. These three 4D cost volumes pass through the subsequent cost aggregation stage and disparity regression to form three branches, and the disparity information of each branch is fed into the cost aggregation stage of the next branch to refine the disparity. During training, the disparity predictions of all three branches are output and used for supervised training, as reflected in the loss function; only the prediction of the third branch is output at test time. Since only the small 4D cost volume of the third branch needs to be convolved during testing, a large amount of computation is saved compared with a 1/4-resolution 4D cost volume, and the real-time performance of the network is better. During training, the three small 4D cost volumes are used for supervised training of the model, which ensures that the accuracy of the model is not reduced by the small volume size at test time, and which supervises the cost aggregation module to better refine the disparity and produce more accurate matching results.

The fundamental purpose of cost aggregation is to make the cost values accurately reflect the correlation between pixels by aggregating feature information along the disparity and spatial dimensions. The cost aggregation module in this article uses a stacked hourglass module, i.e., an hourglass module with three encoder-decoder structures, to refine the disparity map and improve the accuracy of disparity estimation. The experimental results in PSMNet [28] demonstrate the effectiveness of the stacked hourglass module for disparity refinement. During training, all prediction results are output for supervised training of the network; during testing, only the last of the three predictions is output.

In disparity regression, the disparity is not taken as the value with the maximum softmax probability; instead, it is predicted as the probability-weighted sum over all disparity levels:

$$\hat{d} = \sum_{d=0}^{D}d\times{\sigma({-}C_d)}$$

Among them, $\hat {d}$ is the disparity predicted by the network, $d$ is each disparity level, $D$ is the maximum disparity, $\sigma (\bullet )$ is the softmax operation, and $C_d$ is the predicted cost. The experimental results in GC-Net [27] show that this disparity regression is more stable than classification-based stereo matching.
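Equation (11) corresponds to the following soft-argmin regression over an aggregated cost tensor; a [B, D, H, W] layout is assumed.

```python
import torch

def disparity_regression(cost, max_disp):
    """Eq. (11): soft-argmin disparity regression.
    cost: [B, D, H, W] aggregated matching cost, with D == max_disp."""
    prob = torch.softmax(-cost, dim=1)                           # sigma(-C_d)
    disp_levels = torch.arange(max_disp, device=cost.device,
                               dtype=cost.dtype).view(1, max_disp, 1, 1)
    return torch.sum(prob * disp_levels, dim=1)                  # [B, H, W]
```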

2.3 Loss function

During training, each branch outputs three prediction results for supervised training, and there are three branches, so our loss function is:

$$Loss = \sum_{k=1}^{3}\beta^k(\sum_{i=1}^{3}\alpha^iL^i)^k.$$
$$L^i(d,\hat{d}) = \frac{1}{N}\sum_{i=1}^{N}smooth_{L_1}(d_i,\hat{d_i}).$$
$$\begin{aligned}smooth_{L_1}(x) = \begin{cases} 0.5x^2 & \mid{x}\mid{<} 1 \\ \mid{x}\mid{-}0.5 & otherwise. \end{cases} \end{aligned}$$

Where $L^i$ is the loss of each part, $\alpha ^i$ is the loss weight corresponding to each part, $\beta ^k$ is the loss weight corresponding to each branch, $N$ is the number of labeled pixels, $d$ is the true disparity value, and $\hat {d}$ is the predicted disparity value.

For disparity regression, the smooth L1 loss is widely used in the supervised training of stereo matching networks. Compared with the L2 loss, it is less sensitive to outliers and more robust.
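Reading the superscripts in Eq. (12) as branch and output indices rather than exponents, the loss of Eqs. (12)–(14) can be sketched as follows; the weight values `alphas` and `betas` and the valid-pixel mask are illustrative assumptions, since the paper does not list them.

```python
import torch
import torch.nn.functional as F

def stereo_loss(branch_preds, gt_disp, alphas=(0.5, 0.7, 1.0),
                betas=(0.5, 0.7, 1.0), max_disp=192):
    """Eqs. (12)-(14): doubly weighted smooth-L1 loss over 3 branches x 3 outputs.
    branch_preds: nested list [branch][output] of predicted disparity maps [B, H, W]."""
    mask = (gt_disp > 0) & (gt_disp < max_disp)          # supervise labeled pixels only
    total = 0.0
    for beta_k, outputs in zip(betas, branch_preds):
        branch_loss = 0.0
        for alpha_i, pred in zip(alphas, outputs):
            branch_loss = branch_loss + alpha_i * F.smooth_l1_loss(pred[mask], gt_disp[mask])
        total = total + beta_k * branch_loss
    return total
```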

3. Experiments

PyTorch 1.6.1 is used to train and test the stereo matching network. All experiments are carried out under Ubuntu 18.04 with Python 3.6 on a desktop computer equipped with an Intel i9 9900K CPU, 64 GB of RAM, and two NVIDIA TITAN RTX GPUs. The Adam optimizer ($\beta _1$ = 0.9, $\beta _2$ = 0.99) is used to train the model for 20 epochs. The initial learning rate is set to 0.001, the batch size to 8, and the maximum disparity to 192, and all images are randomly cropped to a size of 256×512 before training. It is worth noting that all deep learning methods in this article are trained on the semi-synthetic dataset under the same experimental conditions.

3.1 Dataset

During the production of the semi-synthetic dataset, we set up 500 virtual scenes that cover very rich content, and the depth range of each group is different. To ensure the cross-domain generalization ability of the model, the collected real-scene images are used to randomly assign textures to the objects in the virtual scenes, so that the texture characteristics of the rendered data are closer to those of real scenes. The rendered left and right images are shown in Fig. 3.

Fig. 3. Image rendering result.

To enrich the dataset and simulate the randomness of real scenes, Blender's particle system is used to move the objects in the scene randomly; the result of this random motion is shown in Fig. 4. For each scene, 10 frames are collected, generating 5000 pairs of clear stereo images, each with a corresponding ground-truth depth image.

Fig. 4. Random motion results.

To simulate the effects of defocus blur and noise in real scenes and improve the generalization ability of the network, we blurred the clear images and added noise to them to different degrees. The variation range of each parameter is shown in Table 1, where the Gaussian noise standard deviation of 0.001 is given relative to an image dynamic range of [0, 1]. Different combinations of the parameters (defocus radius, Gaussian noise mean, Gaussian noise standard deviation) produce the image changes shown in Fig. 5.

Fig. 5. The stereo image obtained by adjusting the parameters in Table 1.

Table 1. Variation range of each parameter

Therefore, each pair of clear images has a corresponding randomly defocus-blurred pair and a randomly noise-corrupted pair, and every pair has a corresponding ground-truth depth image. In total, 15,000 stereo image pairs with ground-truth depth are generated to create the semi-synthetic dataset. Before training, all depth images are converted to disparity images.
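For the rectified virtual rig, the conversion from a rendered depth map to a ground-truth disparity map follows the standard relation d = f·b/Z; a sketch is given below, where the focal length in pixels and the handling of background pixels are our assumptions.

```python
import numpy as np

def depth_to_disparity(depth, focal_px, baseline):
    """Convert a rendered depth map (same units as `baseline`) into a disparity
    map in pixels for a rectified stereo rig: d = f * b / Z.
    `focal_px` is the focal length expressed in pixels; background pixels
    (here flagged by non-positive depth) get zero disparity, as in Section 2.1.4."""
    disparity = np.zeros_like(depth, dtype=np.float32)
    valid = depth > 0
    disparity[valid] = focal_px * baseline / depth[valid]
    return disparity
```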

The data used in this article comprise two types. Real-scene data are data collected in real scenes, such as the public datasets Middlebury and KITTI and images acquired with our own binocular imaging system. Semi-synthetic data are generated with the virtual binocular imaging system built in the computer graphics software Blender: cropped real-scene images are texture-mapped onto the virtual scene so that the rendered data have a style and texture close to the real-scene data.

3.2 Cross-domain comparison on Middlebury

We first conducted a comparative experiment on Middlebury, comparing the proposed method with two traditional methods (ZNCC [50] and AD_Census [12]) and two deep-learning-based methods (PSMNet [28] and LEAStereo [32]) to verify its effectiveness. The experiments use 3PE, D1-loss, and EPE as evaluation indicators on real-scene images. EPE is the average pixel error between the predicted disparity and the true disparity; 3PE is the percentage of pixels whose error is greater than three pixels; D1-loss is the percentage of pixels whose error is greater than three pixels and also exceeds 5% of the true value [20].
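From these definitions, the three metrics can be computed as in the following numpy sketch (percentages for 3PE and D1; restricting the evaluation to pixels with ground truth is our assumption).

```python
import numpy as np

def stereo_metrics(pred, gt, valid=None):
    """EPE, 3PE and D1-loss as defined above; `valid` masks labeled pixels."""
    if valid is None:
        valid = gt > 0
    err = np.abs(pred[valid] - gt[valid])
    epe = err.mean()                                               # average end-point error (px)
    three_pe = (err > 3.0).mean() * 100.0                          # % of pixels with error > 3 px
    d1 = ((err > 3.0) & (err > 0.05 * gt[valid])).mean() * 100.0   # error > 3 px and > 5% of GT
    return epe, three_pe, d1
```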

Figure 6 shows the corresponding 3D reconstruction results obtained by ZNCC, AD_Census, PSMNet, LEAStereo, and the proposed method. The test samples are randomly selected from the Middlebury dataset, and the network has never been trained on them. For a clearer display, Fig. 7 shows enlargements of all the red boxes in Fig. 6.

Fig. 6. Comparative experimental results of different methods.

Fig. 7. An enlarged view of a partial part in Fig. 6.

ZNCC calculates the matching cost through local block matching to obtain an integer-pixel disparity map, which is then refined by a five-point quadratic curve fitting model to obtain a sub-pixel disparity map [51]. However, block matching assumes that all pixels in the matching window have similar disparities; where the disparity is discontinuous, a large number of mismatches occur, as shown in Fig. 6 and Fig. 7.

In AD_Census, the Census transform computes a preliminary matching cost by encoding the local image area and is combined with the absolute difference (AD), which is more sensitive to intensity differences, reducing mismatches in repetitive and similarly textured regions. Four iterations over a dynamically constructed cross-based support region then reduce the matching error in weakly textured regions and at disparity discontinuities. Compared with ZNCC, AD_Census provides more accurate disparity results.

Different from traditional methods, two deep learning-based methods (PSMNet and LEAStereo) use networks to achieve matching cost calculations. In PSMNet, a feature extraction network based on the Siamese structure is used to extract features, and a 4D cost volume is constructed within a predetermined disparity range to generate a better initial matching cost and achieve high-performance stereo matching. LEAStereo uses a network structure search to avoid manual design of the network while achieving high-performance stereo matching. From Fig. 6 and Fig. 7, it can be found that the method based on deep learning can provide more accurate disparity results.

Obviously, in Fig. 6, for each test sample, the proposed method provides finer disparity results. Compared with other methods, the matching results of small targets and detailed regions are more accurate. This is because the feature extraction module of the proposed network adopts a more robust multi-scale feature extraction method, which can better integrate the extracted large-scale features and small-scale features. In addition, the feature standardization layer can make the network better adapt to data that has never been seen before, and improve the overall matching accuracy.

To quantitatively evaluate the different methods, the average EPE and 3PE over all test samples are further calculated, as shown in Table 2, where Time is the time required to match a pair of 750×900 images.

Table 2. Quantitative evaluation of different methods.

Overall, the proposed method improves the matching accuracy by about 60% compared with the traditional methods on real indoor scenes, and the matching speed is greatly improved. Compared with the state-of-the-art LEAStereo, EPE is reduced by 37.1%, 3PE by 12.7%, and the matching speed is increased by 37.8%.

3.3 Cross-domain comparison on KITTI

To further evaluate the versatility of the proposed method in real outdoor scenes with more complex environments, we conducted experiments on the urban street-view datasets KITTI 2012 and KITTI 2015. Figure 8 shows the disparity maps predicted by the proposed method and all comparison methods; the test samples are randomly selected and do not appear in the training set. For a clearer display, Fig. 9 compares the error maps of the results in Fig. 8. Table 3 shows the quantitative comparison between this method and other methods on the KITTI dataset, where all samples in the KITTI 2012 and 2015 datasets are tested to calculate the average D1-loss and EPE.

Fig. 8. KITTI dataset disparity prediction results.

Fig. 9. Error map of the result in Fig. 8.

Table 3. Measurement results of different methods on the KITTI dataset.

As shown in Fig. 8 and Fig. 9, ZNCC produces large matching errors, whereas AD_Census removes many of them and generates smoother results. Compared with the traditional methods, the learning-based methods achieve smoother matching. Since the proposed method uses multi-scale cost volumes, the outputs of the three branches supervise the cost aggregation module during training, which enables it to refine the disparity better and reduce matching errors.

At the same time, to verify that our method does not sacrifice peak performance on the target domain in order to improve cross-domain generalization, the three compared deep learning methods were fine-tuned on the KITTI dataset. All methods were fine-tuned for 300 epochs on KITTI with a learning rate of 0.001, and all data were randomly cropped to a size of 256×512 before training. Of the 200 image pairs in KITTI 2015, 180 are used for training and 20 for validation; of the 194 pairs in KITTI 2012, 160 are used for training and 34 for validation. Table 4 shows the quantitative evaluation of all compared deep learning methods after fine-tuning on KITTI. Since the more robust multi-scale feature extraction and the multi-scale cost volumes refine the disparity better, the proposed method also performs well when fine-tuned on the target dataset.

Table 4. Comparison of the results of fine-tuning on the KITTI dataset.

3.4 Ablation experiment

To evaluate separately the impact of the proposed feature standardization on the cross-domain generalization of the network, we conducted experiments with the proposed network using different normalization methods. All experiments were trained on the produced semi-synthetic dataset and tested on the Middlebury and KITTI datasets. Table 5 compares the quantitative evaluation results of the proposed network when using BN [52], DN [34], and FS. BN only normalizes all samples in a mini-batch and ignores the independent characteristics of individual samples, which is not conducive to cross-domain generalization. DN normalizes each sample separately, so that a single sample is not affected by the others, which improves cross-domain generalization but ignores the relationship between samples. In contrast, FS first normalizes all samples in a mini-batch to enhance the relationship between samples, and then normalizes each sample to amplify its independent features, which improves the cross-domain generalization of the network more stably.

Table 5. Comparison of the results of the proposed method using different standardization methods.

3.5 Experimental results in real scenarios

In this section, we first test images degraded by defocus and noise to evaluate the robustness of the proposed method to these factors. The test results for different degrees of degradation are shown in Fig. 10, where (a), (b), (c), and (d) correspond to a defocus radius of 3.0 and Gaussian noise with mean and variance of (0, 0.00025), (0, 0.0005), (0, 0.00075), and (0, 0.001), respectively, together with the disparity predictions of the various methods. It can be seen from Fig. 10 that defocus blur and Gaussian noise have a significant impact on the matching results, and that the proposed method is more robust to noise and blur; the quantitative evaluation in Table 2 also supports this.

Fig. 10. Comparison of test results of various methods on noise-contaminated images.

To quantify the noise robustness of this method, we calculated the EPE of the proposed method and the comparison methods on images with different degrees of noise, as shown in Table 6. As the noise increases, the matching errors of all methods grow; however, the growth in matching error of our method across the different noise levels is less than 0.3 pixels, and better matching accuracy is obtained in all cases.

Table 6. Comparison of the accuracy of various methods on noisy images.

We then compared the reconstruction results of the proposed method and the comparison methods on images captured in real scenes, as shown in Fig. 11, where (a) contains objects with high reflectivity and (b) contains small targets and weakly textured areas. Our imaging system consists of two cameras (Basler acA1920-40gc, resolution 1920×1200) with a baseline of about 130 mm.

Fig. 11. Comparison of actual shooting image test results.

Similar to the robustness experiments, the learning-based methods achieve more accurate matching of real scenes. The proposed method matches object edges in real scenes better and handles the regions that are difficult for stereo matching, such as highly reflective regions, small targets, and weakly textured regions.

4. Conclusion

We propose a real-scene stereo matching method based on a convolutional neural network and computer graphics. To effectively improve the network's generalization to real scenes, a virtual binocular imaging system was first constructed with the graphics software Blender and, combined with the principle of binocular imaging, a high-quality semi-synthetic dataset was established. The high-precision depth maps obtained from Blender are converted into dense and accurate disparity maps as the ground truth of the dataset. In the network, a multi-scale feature extraction module first extracts feature tensors from the left and right images to construct the 4D cost volume. Although the image features of the semi-synthetic dataset can be very close to the real scene, feature differences remain between real-scene data and semi-synthetic data; a feature standardization layer is therefore proposed to reduce the feature-space difference between real-scene data and training data and is embedded in the feature extraction module. Since computing on a 4D cost volume requires a large amount of GPU memory, a multi-scale cost is proposed, and three small 4D cost volumes replace one large 4D cost volume to achieve efficient cost aggregation and better matching performance. The quantitative analysis shows that, compared with traditional methods, the matching accuracy of this method improves significantly, by about 60%. Compared with other learning-based methods, the matching accuracy increases by about 30% and the matching speed by 38%; a pair of 900×750 images takes only 0.46 s. The experimental results on real scenes verify the method: it effectively resists the interference of blur and noise and realizes fast and accurate stereo matching in real scenes.

Several aspects of this method still need improvement. Since the 4D convolutions require a large amount of GPU memory, the images are randomly cropped to 256×512 during data input to increase the training speed, which undoubtedly reduces the accuracy of stereo matching; how to achieve more efficient stereo matching therefore remains an urgent problem. Second, the matching ability in weakly textured and highly reflective areas needs further research to achieve more reliable and accurate 3D reconstruction. Based on the above analysis, we will explore other approaches to binocular stereo matching with better cross-domain generalization performance.

Funding

Sichuan Province Science and Technology Support Program (2021YJ0080); Natural Foundation International Cooperation Project (61960206010).

Acknowledgments

The authors thank the developers and maintainers of all open-source software, languages, and systems used in this article for their contributions. We also thank the reviewers for their valuable comments, which greatly improved the content and readability of this article.

Disclosures

The authors declare no conflicts of interest.

Data availability

Data underlying the results presented in this article can be found in Middlebury [37] and KITTI [38]. The semi-synthetic dataset is not publicly available at this time but may be obtained from the authors upon reasonable request.

References

1. M. Ren, R. Liu, H. Hong, J. Ren, and G. Xiao, “Fast object detection in light field imaging by integrating deep learning with defocusing,” Appl. Sci. 7(12), 1309 (2017). [CrossRef]  

2. D. Feng, L. Rosenbaum, and K. Dietmayer, “Towards safe autonomous driving: Capture uncertainty in the deep neural network for lidar 3d vehicle detection,” in 2018 21st International Conference on Intelligent Transportation Systems (ITSC), (2018), pp. 3266–3273.

3. Z. Zhu, M. He, Y. Dai, Z. Rao, and B. Li, “Multi-scale cross-form pyramid network for stereo matching,” arXiv:1904.11309 [cs.CV] (2019).

4. M. Yang, Y. Liu, Y. Cai, and Z. You, “Stereo matching based on classification of materials,” Neurocomputing 194, 308–316 (2016). [CrossRef]  

5. Y.-C. Leung and L. Cai, “3d reconstruction of specular surface by combined binocular vision and zonal wavefront reconstruction,” Appl. Opt. 59(28), 8526–8539 (2020). [CrossRef]  

6. D. Scharstein, R. Szeliski, and R. Zabih, “A taxonomy and evaluation of dense two-frame stereo correspondence algorithms,” in Proceedings of the IEEE Workshop on Stereo and Multi-Baseline Vision (SMBV’01), (IEEE Computer Society, USA, 2001), SMBV ’01, p. 131.

7. R. Szeliski, R. Zabih, D. Scharstein, O. Veksler, V. Kolmogorov, A. Agarwala, M. Tappen, and C. Rother, “A comparative study of energy minimization methods for markov random fields with smoothness-based priors,” IEEE Trans. on Pattern Analysis Mach. Intell. 30(6), 1068–1080 (2008). [CrossRef]  

8. Y. Boykov and M.-P. Jolly, “Interactive graph cuts for optimal boundary region segmentation of objects in n-d images,” in Proceedings Eighth IEEE International Conference on Computer Vision. ICCV 2001, vol. 1 (2001), pp. 105–112 vol.1.

9. F. He and F. Da, “Belief propagation with local edge detection-based cost aggregation for stereo matching,” in 2011 18th IEEE International Conference on Image Processing, (2011), pp. 2373–2376.

10. P. Wang, C. Chen, and F. Wei, “A stereo matching algorithm based on outline-assisted dynamic programming,” in Foundations and Practical Applications of Cognitive Systems and Information Processing, F. Sun, D. Hu, and H. Liu, eds. (Springer Berlin Heidelberg, Berlin, Heidelberg, 2014), pp. 297–306.

11. Z. Zhang, C. Hou, and J. Yang, “A stereo matching algorithm based on genetic algorithm with propagation stratagem,” in 2009 International Workshop on Intelligent Systems and Applications, (2009), pp. 1–4.

12. X. Mei, X. Sun, M. Zhou, S. Jiao, H. Wang, and X. Zhang, “On building an accurate stereo matching system on graphics hardware,” in 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops), (2011), pp. 467–474.

13. H. Hirschmuller, “Stereo processing by semiglobal matching and mutual information,” IEEE Trans. on Pattern Analysis Mach. Intell. 30(2), 328–341 (2008). [CrossRef]  

14. A. Hosni, M. Bleyer, M. Gelautz, and C. Rhemann, “Local stereo matching using geodesic support weights,” in 2009 16th IEEE International Conference on Image Processing (ICIP), (2009), pp. 2093–2096.

15. F. Tombari, S. Mattoccia, and L. Di Stefano, “Segmentation-based adaptive support for accurate stereo correspondence,” in Advances in Image and Video Technology, D. Mery and L. Rueda, eds. (Springer Berlin Heidelberg, Berlin, Heidelberg, 2007), pp. 427–438.

16. K.-J. Yoon and I. S. Kweon, “Adaptive support-weight approach for correspondence search,” IEEE Trans. on Pattern Analysis Mach. Intell. 28(4), 650–656 (2006). [CrossRef]  

17. J. Vanne, E. Aho, T. Hamalainen, and K. Kuusilinna, “A high-performance sum of absolute difference implementation for motion estimation,” IEEE Trans. on Circuits Syst. for Video Technol. 16(7), 876–883 (2006). [CrossRef]  

18. W.-P. Dong, Y.-S. Lee, and C.-S. Jeong, “A stereo matching using variable windows and dynamic programming,” in AI 2005: Advances in Artificial Intelligence, S. Zhang and R. Jarvis, eds. (Springer Berlin Heidelberg, Berlin, Heidelberg, 2005), pp. 1277–1280.

19. Z. Liang, Y. Guo, Y. Feng, W. Chen, L. Qiao, L. Zhou, J. Zhang, and H. Liu, “Stereo matching using multi-level cost volume and multi-scale feature constancy,” IEEE Transactions on Pattern Analysis and Machine Intelligence pp. 300–315 (2019).

20. J. Wang, S. Zhang, Y. Wang, and Z. Zhu, “Learning efficient multi-task stereo matching network with richer feature information,” Neurocomputing 421, 151–160 (2021). [CrossRef]  

21. N. Mayer, E. Ilg, P. Häusser, P. Fischer, D. Cremers, A. Dosovitskiy, and T. Brox, “A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (2016), pp. 4040–4048.

22. Z. Liang, Y. Feng, Y. Guo, H. Liu, W. Chen, L. Qiao, L. Zhou, and J. Zhang, “Learning for disparity estimation through feature constancy,” in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, (2018), pp. 2811–2820.

23. G. Yang, H. Zhao, J. Shi, Z. Deng, and J. Jia, “Segstereo: Exploiting semantic information for disparity estimation,” in Computer Vision – ECCV 2018, V. Ferrari, M. Hebert, C. Sminchisescu, and Y. Weiss, eds. (Springer International Publishing, Cham, 2018), pp. 660–676.

24. Q. Wang, S. Shi, S. Zheng, K. Zhao, and X. Chu, “Fadnet: A fast and accurate network for disparity estimation,” arXiv:2003.10758 [cs.CV] (2020).

25. H. Xu and J. Zhang, “Aanet: Adaptive aggregation network for efficient stereo matching,” in 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), (2020), pp. 1956–1965.

26. X. Gu, Z. Fan, S. Zhu, Z. Dai, F. Tan, and P. Tan, “Cascade cost volume for high-resolution multi-view stereo and stereo matching,” in 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), (2020), pp. 2492–2501.

27. A. Kendall, H. Martirosyan, S. Dasgupta, P. Henry, R. Kennedy, A. Bachrach, and A. Bry, “End-to-end learning of geometry and context for deep stereo regression,” in 2017 IEEE International Conference on Computer Vision (ICCV), (2017), pp. 66–75.

28. J. Chang and Y. Chen, “Pyramid stereo matching network,” in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), (IEEE Computer Society, Los Alamitos, CA, USA, 2018), pp. 5410–5418.

29. X. Guo, K. Yang, W. Yang, X. Wang, and H. Li, “Group-wise correlation stereo network,” in 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), (2019), pp. 3268–3277.

30. F. Zhang, V. Prisacariu, R. Yang, and P. H. Torr, “Ga-net: Guided aggregation net for end-to-end stereo matching,” in 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), (2019), pp. 185–194.

31. Y. Zhang, Y. Chen, X. Bai, S. Yu, K. Yu, Z. Li, and K. Yang, “Adaptive unimodal cost volume filtering for deep stereo matching,” arXiv:1909.03751 [cs.CV] (2019).

32. X. Cheng, Y. Zhong, M. Harandi, Y. Dai, X. Chang, T. Drummond, H. Li, and Z. Ge, “Hierarchical neural architecture search for deep stereo matching,” arXiv:2010.13501 [cs.CV] (2020).

33. W. Yin, Y. Hu, S. Feng, L. Huang, Q. Kemao, Q. Chen, and C. Zuo, “Single-shot 3d shape measurement using an end-to-end stereo matching network for speckle projection profilometry,” Opt. Express 29(9), 13388–13407 (2021). [CrossRef]  

34. F. Zhang, X. Qi, R. Yang, V. Prisacariu, B. Wah, and P. Torr, “Domain-invariant stereo matching networks,” arXiv:1911.13287 [cs.CV] (2019).

35. Z. Shen, Y. Dai, and Z. Rao, “Cfnet: Cascade and fused cost volume for robust stereo matching,” arXiv:2104.04314 [cs.CV] (2021).

36. Z. Li, X. Liu, N. Drenkow, A. Ding, F. X. Creighton, R. H. Taylor, and M. Unberath, “Revisiting stereo depth estimation from a sequence-to-sequence perspective with transformers,” arXiv:2011.02910 [cs.CV] (2021).

37. D. Scharstein and R. Szeliski, “A taxonomy and evaluation of dense two-frame stereo correspondence algorithms,” Int. J. Comput. Vision 47(1/3), 7–42 (2002). [CrossRef]  

38. A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision meets robotics: The KITTI dataset,” Int. J. Rob. Res. 32(11), 1231–1237 (2013). [CrossRef]  

39. A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous driving? the kitti vision benchmark suite,” in 2012 IEEE Conference on Computer Vision and Pattern Recognition, (2012), pp. 3354–3361.

40. T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, “Feature pyramid networks for object detection,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (2017), pp. 936–944.

41. F. Gomez-Donoso, A. Garcia-Garcia, J. Garcia-Rodriguez, S. Orts-Escolano, and M. Cazorla, “Lonchanet: A sliced-based cnn architecture for real-time 3d object recognition,” in 2017 International Joint Conference on Neural Networks (IJCNN), (2017), pp. 412–418.

42. Y. Li, A. Dai, L. Guibas, and M. Nießner, “Database-Assisted Object Retrieval for Real-Time 3D Reconstruction,” Comput. Graph. Forum 34, 1 (2015).

43. P. Stavroulakis, S. Chen, C. Delorme, P. Bointon, G. Tzimiropoulos, and R. Leach, “Rapid tracking of extrinsic projector parameters in fringe projection using machine learning,” Opt. Lasers Eng. 114, 7–14 (2019). [CrossRef]  

44. F. Wang, C. Wang, and Q. Guan, “Single-shot fringe projection profilometry based on deep learning and computer graphics,” Opt. Express 29(6), 8024–8040 (2021). [CrossRef]  

45. Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao, “3d shapenets: A deep representation for volumetric shapes,” in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (2015), pp. 1912–1920.

46. A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, J. Xiao, L. Yi, and F. Yu, “Shapenet: An information-rich 3d model repository,” arXiv:1512.03012 [cs.GR] (2015).

47. S. Koch, A. Matveev, Z. Jiang, F. Williams, A. Artemov, E. Burnaev, M. Alexa, D. Zorin, and D. Panozzo, “Abc: A big cad model dataset for geometric deep learning,” arXiv:1812.06216 [cs.GR] (2019).

48. R. I. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision (Cambridge University Press, ISBN: 0521540518, 2004), 2nd ed.

49. W. Li, G. Liu, Y. He, J. Wang, W. Kong, and G. Shi, “Quality improvement of adaptive optics retinal images using conditional adversarial networks,” Biomed. Opt. Express 11(2), 831–849 (2020). [CrossRef]  

50. B. Pan, H. Xie, and Z. Wang, “Equivalence of digital image correlation criteria for pattern matching,” Appl. Opt. 49(28), 5501–5509 (2010). [CrossRef]  

51. P. Zhou, J. Zhu, and H. Jing, “Optical 3-d surface reconstruction with color binary speckle pattern encoding,” Opt. Express 26(3), 3452–3465 (2018). [CrossRef]  

52. S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” arXiv:1502.03167 [cs.LG] (2015).
