## Abstract

In this paper we present a comparative study of different aggregation functions for combining the *RGB* color channels in the stereo matching problem. We introduce color information into the stereo matching algorithm by aggregating the similarities of the *RGB* channels, which are calculated independently. We compare the accuracy of different stereo matching algorithms and aggregation functions, and show experimentally that the best function depends on the stereo matching algorithm considered, but that the dual of the geometric mean stands out as the most robust aggregation.

© 2013 OSA

## 1. Introduction

The stereo matching problem consists in obtaining three-dimensional information from two two-dimensional images of the same scene taken from different viewpoints. When an image is taken, the depth of each point in the scene is lost; the objective of a stereo matching algorithm is therefore to recover this information.

The basis of stereo vision is that a single physical point in the scene is uniquely projected to a pair of image locations. Hence, to obtain the depth from both images, we first have to estimate the correspondence between the pixels of each image. This step consists in identifying the same physical point in both projections to determine the difference between its position in each image. This difference is called the *disparity*. The disparity, together with the parameters of the camera, allows us to recover the depth.
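For an ideal horizontally displaced camera pair, the depth follows from the disparity by triangulation, *z* = *f* · *B* / *d*, with focal length *f* (in pixels) and baseline *B*. A minimal sketch in Python; the function name and the numeric values are illustrative, not taken from the paper:

```python
def depth_from_disparity(disparity, focal_length, baseline):
    """Triangulate depth for an ideal horizontally displaced camera
    pair: z = f * B / d (depth in the same units as the baseline).

    disparity    -- pixel offset between the two projections
    focal_length -- focal length expressed in pixels
    baseline     -- distance between the two camera centers
    """
    if disparity <= 0:
        raise ValueError("disparity must be positive")
    return focal_length * baseline / disparity

# A point with a 16-pixel disparity, f = 800 px and a 0.1 m baseline:
print(depth_from_disparity(16, 800, 0.1))   # 5.0 (meters)
```

Note how the depth grows as the disparity shrinks, which is why distant points (tiny disparities) are the hardest to estimate accurately.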

Hence, the main difficulty of stereo matching is finding the correspondence correctly. The images are taken by different cameras with different viewing angles, which can produce occlusions, perspective distortion, different lighting intensities, reflections, shadows, repetitive patterns, sensor noise, etc. All these factors turn a seemingly simple correspondence task into a very difficult one.

For this reason, an ideal configuration of the cameras is usually assumed: they are only horizontally displaced and their focal axes are parallel. This ideal configuration, together with epipolar geometry, allows us to restrict the search for a point of the first image to the epipolar line in the second image, reducing the search space and hence greatly decreasing the computational cost.

An exhaustive overview of stereo matching can be found in [1], while a complete introduction to stereo vision can be found in [2]. Stereo matching algorithms can be classified into local and global methods. The local approaches compare the intensity levels of a finite window to determine the disparity of each pixel. These methods use different metrics or similarities to compare intensity levels, such as SAD [3], SSD [4] or NCC [5], which, despite their simplicity, are widely applied due to their low computational complexity [6, 7]. Global approaches apply global assumptions about smoothness and try to determine all the disparities at the same time by using different optimization techniques such as graph cuts [8–10], belief propagation [11], etc. These methods usually start from a local disparity estimation.

The utilization of color information, specifically the use of the *RGB* color space, improves the stereo matching results achieved with gray scale images [12–14]. The extra information provided by the color channels removes some of the ambiguities produced when the information is reduced to gray scale. Therefore, matching can be improved by avoiding false correspondences, but we will show that this improvement is directly related to the aggregation function used to combine the color similarities. Having more information may lead to worse results if it is not handled properly.

In general, there exist several techniques to calculate the disparity map using color information, but there is no agreement about which is the best color space to work with. In this work *RGB* representation is used following previous approaches [12, 13].

Local search matching algorithms assign to every pixel of the right image a correspondence with a pixel of the left image. Using a local search algorithm with color images, for each pixel in the right image we obtain three correspondence scores (one per channel) for each candidate pixel in the left image. Usually these similarity degrees will not coincide, so we will have different matching scores. A simple and efficient solution is to aggregate the similarity information of the channels and then choose as corresponding pixel the one with the largest aggregated similarity value.

In this work we study the performance and influence of different aggregation operators, such as the arithmetic mean, the median, the minimum, etc. To do so, we use several test images from [1], which were taken using the ideal camera configuration. Our aim is to study the behavior of the aggregations across different measures used in stereo matching and image comparison [6, 14]. We empirically show that using the proper aggregation function can produce significantly better results, whereas using an inappropriate one can decrease the performance. Our objective is to find an aggregation of color similarities that works well whichever method (similarity measure) is used; in this sense, we want to study which aggregation is the most robust. While some aggregations excel for particular methods or images, our interest resides in finding an appropriate set of aggregations that can be safely used across different metrics and images.

This work is organized as follows: in Section 2 we recall the classical stereo matching algorithm and present the metrics that we have considered. In Section 3 we present the aggregation operators that we compare in Section 4, where the experimental study is carried out. Section 5 concludes this work.

## 2. Stereo matching for color images

In this section, we recall the typical steps of the classical stereo matching algorithm. Afterward, we present the different metrics and similarity measures used in the comparison.

#### 2.1. Stereo matching algorithm

Minor changes have to be applied to transform the original stereo matching algorithm for gray scale images into one for color images. In the first case, the algorithm computes the similarity of the window surrounding each pixel in the right image with several windows in the left image (considering the epipolar constraint and the maximum disparity). The pixel whose surrounding window attains the largest similarity degree is chosen and used to compute the disparity.

Regarding color images, we simply compute the correspondence between color channels independently, and then we aggregate these correspondence scores (similarity degrees). We can summarize the algorithm for color images as follows (in Fig. 1 we depict an overall view of the method):

```
Algorithm Stereo Matching
const
    Window size := n × m
begin
    For each pixel of the right image
        For each pixel in the epipolar line of the left image
            For each color channel
                Calculate the similarity between the window centered at
                the pixel of the right image and the window centered at
                the pixel of the left image
            end For
            Aggregate similarities
        end For
        Set correspondence := arg max{aggregated similarity values}
        Disparity := difference between the x-positions of the two pixels
    end For
    Create a disparity map from all the disparities obtained
end.
```
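The steps above can be sketched as a toy Python implementation. This is a hypothetical helper, not the authors' code: it assumes images stored as H × W × 3 nested lists, uses 1/(1 + SAD) as the per-channel similarity, clamps windows at the borders, and takes the aggregation function as a parameter (the minimum by default):

```python
def match_color_stereo(right, left, max_disp, win=1, aggregate=min):
    """Toy local stereo matcher for 3-channel images stored as
    H x W x 3 nested lists. Per-channel similarity is 1/(1 + SAD);
    `aggregate` combines the three channel similarities."""
    h, w = len(right), len(right[0])
    disparity = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            best_score, best_k = -1.0, 0
            # Epipolar constraint: scan displacements along the same row.
            for k in range(min(max_disp + 1, w - x)):
                sims = []
                for c in range(3):
                    sad = 0.0
                    for dy in range(-win, win + 1):
                        for dx in range(-win, win + 1):
                            yy = min(max(y + dy, 0), h - 1)   # clamp at borders
                            xr = min(max(x + dx, 0), w - 1)
                            xl = min(max(x + k + dx, 0), w - 1)
                            sad += abs(right[yy][xr][c] - left[yy][xl][c])
                    sims.append(1.0 / (1.0 + sad))
                score = aggregate(sims)   # combine the RGB similarities
                if score > best_score:
                    best_score, best_k = score, k
            disparity[y][x] = best_k
    return disparity
```

Swapping `aggregate` for any of the functions of Section 3 changes only one line, which is precisely why the choice of aggregation can be studied independently of the similarity measure.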

There are many different versions of the classical stereo matching algorithm; however, most of them use the scheme we have presented. The metric or similarity measure used is usually the biggest difference between algorithms. Another key factor of the algorithm is the aggregation function used to combine the color similarities. In the following subsection we present seven common similarity measures for the stereo matching problem. Then, in Section 3, we present the aggregation functions considered for the empirical study.

#### 2.2. Correspondence and similarity measures between windows

There exist several methods to compute the similarity between windows. The results (the obtained disparity maps) directly depend on these measures. In this paper, we study several metrics to show the behavior of the different aggregations within each method.

### 2.2.1. Sum of Square Differences (SSD)

SSD [4] computes the matching score as the sum of the squared differences between all pixel intensities of the left window with respect to the right window. The disparity is then computed from the window with the lowest value (largest correspondence), which indicates the most similar window. SSD can be expressed as follows:

$$\mathrm{SSD}\left(x,y,k\right)={\sum}_{(i,j)\in W}{\left({I}_{r}\left(x+i,y+j\right)-{I}_{l}\left(x+i+k,y+j\right)\right)}^{2},$$

where (*x*, *y*) is the position of the pixel, *k* the displacement of the left window with respect to the right window, *W* the window (of size *n* × *m*) considered, and *I*_{r}, *I*_{l} the right and left images respectively.

### 2.2.2. Sum of Absolute Differences (SAD)

SAD [3] computes the disparity in the same way as SSD, but using the absolute differences between pixel intensities instead of the squared differences:

$$\mathrm{SAD}\left(x,y,k\right)={\sum}_{(i,j)\in W}\left|{I}_{r}\left(x+i,y+j\right)-{I}_{l}\left(x+i+k,y+j\right)\right|.$$
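Both window costs can be sketched in a few lines of Python. These are illustrative helpers over flattened windows, not the authors' code:

```python
def ssd(right_win, left_win):
    """Sum of squared differences between two flattened windows."""
    return sum((r - l) ** 2 for r, l in zip(right_win, left_win))

def sad(right_win, left_win):
    """Sum of absolute differences between two flattened windows."""
    return sum(abs(r - l) for r, l in zip(right_win, left_win))

# Lower cost means a more similar window:
print(ssd([1, 2, 3], [1, 2, 5]))   # 4
print(sad([1, 2, 3], [1, 2, 5]))   # 2
```

SSD penalizes large intensity differences quadratically, so it is more sensitive to outliers (e.g. sensor noise) than SAD.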

### 2.2.3. Normalized Cross-Correlation (NCC)

NCC [5] is expressed by the following formula:

$$\mathrm{NCC}\left(x,y,k\right)=\frac{{\sum}_{(i,j)\in W}{I}_{r}\left(x+i,y+j\right)\cdot {I}_{l}\left(x+i+k,y+j\right)}{\sqrt{{\sum}_{(i,j)\in W}{I}_{r}^{2}\left(x+i,y+j\right)\cdot {\sum}_{(i,j)\in W}{I}_{l}^{2}\left(x+i+k,y+j\right)}}.$$

The disparity is obtained from the *k* reaching the maximum value.
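The measure can be sketched over flattened windows; this snippet assumes the non-centered variant of the formula (illustrative code, not the authors'):

```python
import math

def ncc(right_win, left_win):
    """Non-centered normalized cross-correlation of two flattened
    windows; the score is 1.0 when one window is a positive scaling
    of the other, and lower otherwise."""
    num = sum(r * l for r, l in zip(right_win, left_win))
    den = math.sqrt(sum(r * r for r in right_win)
                    * sum(l * l for l in left_win))
    return num / den if den else 0.0

print(ncc([1, 2, 3], [2, 4, 6]))   # 1.0 (same window up to a gain factor)
```

The normalization by the window energies makes the score invariant to a multiplicative gain, which helps when the two cameras capture different lighting intensities.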

### 2.2.4. Fuzzy similarity (SM_{FS})

The fuzzy similarity introduces fuzzy set theory to compute the correspondence between two windows. It is computed with the expression given in [12], where *α* = 16 is generally used. The disparity is computed from the pixel whose similarity measure attains its maximum value.

### 2.2.5. Distance-based similarities (SM_{M} and SM_{K})

Distance-based similarities are widely used in image processing for image comparison [15]; hence, they are appropriate to compute the correspondence between windows. The smaller the distance, the greater the similarity obtained. In our experiments we consider two different cases. The first one is based on the Minkowski distance *d*_{r} with *r* = 1, which is equivalent to the Manhattan distance, but here the measure is normalized by the sum of the intensities within the windows (note the difference with respect to Eq. (2)). We denote this measure as SM_{M} [16, 17].

The second one is based on the Kullback distance [18] between fuzzy sets. We denote this similarity as SM_{K}.

In both cases, we normalize the pixel intensities to the unit interval so that we can apply these similarities, and then compute the disparity from the largest value.

### 2.2.6. Similarity Measure based on Union and Intersection (SM_{UI})

The concept of similarity based on the union and intersection operations also comes from fuzzy set theory [19], where the similarity between two fuzzy sets can be computed as the quotient of the cardinality of the intersection and the cardinality of the union. As in the distance-based methods, we normalize the intensities before applying this similarity; the disparity is then obtained from the largest output. Taking the pointwise minimum and maximum as fuzzy intersection and union, the expression is as follows:

$$\mathrm{SM}_{UI}\left(x,y,k\right)=\frac{{\sum}_{(i,j)\in W}\mathrm{min}\left({I}_{r}\left(x+i,y+j\right),{I}_{l}\left(x+i+k,y+j\right)\right)}{{\sum}_{(i,j)\in W}\mathrm{max}\left({I}_{r}\left(x+i,y+j\right),{I}_{l}\left(x+i+k,y+j\right)\right)}.$$
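Under the pointwise min/max reading of fuzzy intersection and union, the measure can be sketched as follows (illustrative helper, intensities assumed already normalized to [0, 1]):

```python
def sm_ui(right_win, left_win):
    """Union/intersection similarity over flattened windows whose
    intensities are already normalized to [0, 1]: fuzzy intersection
    is the pointwise minimum, fuzzy union the pointwise maximum."""
    inter = sum(min(r, l) for r, l in zip(right_win, left_win))
    union = sum(max(r, l) for r, l in zip(right_win, left_win))
    return inter / union if union else 1.0

print(sm_ui([0.2, 0.4], [0.4, 0.2]))   # 0.5 -> identical windows would give 1.0
```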

## 3. Aggregation functions

In our experiments we use *RGB* representation following previous works in this field [12, 13]. Similarly to [12] we treat each color channel separately until we aggregate their similarity values. By using color information in the stereo matching algorithm we can avoid some false correspondence produced by color ambiguities.

In [12], the minimum is proposed as the aggregation function for this task, but it can produce some inconsistencies. For example, a pixel with low similarity values in all channels will have a larger matching score than another pixel with high similarity values in two channels and a third value slightly below the similarities of the first pixel. Hence, the minimum can cause some undesirable matches (mismatches). Therefore, it is necessary to study several aggregation functions to find which one is the most suitable (most robust) for aggregating color similarities in the stereo matching problem. Next, we present the different aggregation functions that we analyze in the experimental study of Section 4.
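The inconsistency of the minimum can be seen with two hypothetical channel-score vectors:

```python
# Channel similarity scores for two candidate pixels (hypothetical values):
pixel_a = [0.20, 0.20, 0.20]   # uniformly poor match in every channel
pixel_b = [0.90, 0.90, 0.19]   # excellent in two channels, slightly worse in one

# The minimum prefers the uniformly poor candidate...
print(min(pixel_a), min(pixel_b))            # 0.2 0.19
# ...while an averaging aggregation prefers the overall better one.
print(sum(pixel_a) / 3, sum(pixel_b) / 3)
```

A single slightly lower channel is enough for the minimum to discard an otherwise excellent match, which is the kind of mismatch the study aims to avoid.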

**Note:** We denote a vector of *n* elements with **x** = {*x*_{1}, *x*_{2},..., *x _{n}*}.

**Weighted Mean (W-Mean)**

$$M\left(\mathbf{x}\right)={\sum}_{i=1}^{n}{w}_{i}\cdot {x}_{i},$$

where **w** = {*w*_{1}, *w*_{2},..., *w*_{n}} is the weight vector, which satisfies ${\sum}_{i=1}^{n}{w}_{i}=1$. In our comparison we consider different weight vectors to compute the final similarity:

$$\mu \left(x\right)={w}_{R}\cdot {\mu}_{R}\left(x\right)+{w}_{G}\cdot {\mu}_{G}\left(x\right)+{w}_{B}\cdot {\mu}_{B}\left(x\right)$$

For example, if *w*_{R} = 0.1, *w*_{G} = 0.8 and *w*_{B} = 0.1, we obtain

$$\mu \left(x\right)=0.1\cdot {\mu}_{R}\left(x\right)+0.8\cdot {\mu}_{G}\left(x\right)+0.1\cdot {\mu}_{B}\left(x\right)$$

If *w*_{R} = 0.299, *w*_{G} = 0.587 and *w*_{B} = 0.114, we obtain

$$\mu \left(x\right)=0.299\cdot {\mu}_{R}\left(x\right)+0.587\cdot {\mu}_{G}\left(x\right)+0.114\cdot {\mu}_{B}\left(x\right)$$

The weight values of this last expression come from the computation of the luminance of an *RGB* image [20]. The luminance expression is used to transform *RGB* color images into gray scale; its purpose is to represent the brightness of colors as humans perceive them. In this manner, it reflects that humans perceive green as brighter than blue.

**Median**

$$M\left(\mathbf{x}\right)=\{\begin{array}{ll}\frac{1}{2}\left({x}_{\left(k\right)}+{x}_{\left(k+1\right)}\right),\hfill & \text{if}\hspace{0.17em}n=2k\hspace{0.17em}\text{is}\hspace{0.17em}\text{even}\hfill \\ {x}_{\left(k+1\right)},\hfill & \text{if}\hspace{0.17em}n=2k+1\hspace{0.17em}\text{is}\hspace{0.17em}\text{odd},\hfill \end{array}$$

where *x*_{(k)} is the *k*-th smallest component of **x**.

**Mode**

The mode is the most frequent value in **x**.

We also consider in our experiments the dual aggregation functions of the geometric mean $G\left(\mathbf{x}\right)={\left({\prod}_{i=1}^{n}{x}_{i}\right)}^{1/n}$ and the harmonic mean $H\left(\mathbf{x}\right)=n/{\sum}_{i=1}^{n}\left(1/{x}_{i}\right)$, constructed as *M*^{d}(**x**) = 1 − *M*(1 − **x**), where 1 − **x** denotes {1 − *x*_{1},..., 1 − *x*_{n}}.
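The duality construction, together with the luminance-weighted mean, can be sketched as follows (illustrative code, not from the paper):

```python
def geometric_mean(xs):
    """Geometric mean of values in [0, 1]."""
    prod = 1.0
    for x in xs:
        prod *= x
    return prod ** (1.0 / len(xs))

def w_mean_luminance(xs):
    """Weighted mean with the luminance weights for (R, G, B)."""
    weights = (0.299, 0.587, 0.114)
    return sum(w * x for w, x in zip(weights, xs))

def dual(agg):
    """Dual of an aggregation on [0, 1]: M^d(x) = 1 - M(1 - x)."""
    return lambda xs: 1.0 - agg([1.0 - x for x in xs])

g_mean_dual = dual(geometric_mean)

sims = [0.9, 0.8, 0.6]          # hypothetical RGB similarity scores
print(g_mean_dual(sims))        # 1 - geomean([0.1, 0.2, 0.4]) ~= 0.8
```

The dual inverts the aggregation's bias: where the geometric mean is dragged down by any low score, its dual is pulled up by any high one, which changes how a single weak channel affects the final matching score.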

## 4. Experimental results

In our experimental study, we compare the behavior of different aggregation functions to aggregate color similarities. We have three main objectives:

- To check whether using color information by aggregating the similarities improves the results of using gray scale images.
- To study which aggregation fits each correlation method or similarity measure in order to carry out a global analysis.
- To study which is the best (most robust) aggregation function in order to combine color similarities among all methods.

In order to evaluate the performance we use the Middlebury test bed proposed by Scharstein and Szeliski [1] (http://cat.middlebury.edu/stereo), which is established as the common benchmark for stereo matching methods and allows one to reproduce the obtained results easily. In this test, the disparity maps obtained by each algorithm are compared with the ideal disparity maps. The test images are shown in Fig. 2 with their corresponding ideal disparity maps. We refer to each image pair by the name given in [1]: “Tsukuba”, “Teddy” and “Cones”. We recall that the stereo matching algorithms considered do not use any preprocessing or post-processing steps (such as optimization techniques, occlusion detection or image filtering).

As a particular case, we start by presenting in Table 1 the quantitative results for the stereo matching algorithm using the fuzzy similarity measure (SM_{FS}, Eq. (4)) and different aggregation functions to merge the color channel similarities. The leftmost column of Table 1 indicates the aggregation function used. Then, we present three columns for each image pair, representing the percentage of pixels whose absolute disparity error is greater than one in three different regions of the image:

- **no-oc.**: only non-occluded pixels are considered.
- **all**: the whole image is considered.
- **disc.**: only pixels near discontinuities are considered.

The rightmost column is the overall performance of the algorithm, computed as the arithmetic mean of all the other columns. The results are listed in descending order of total error, and the row corresponding to the stereo algorithm applied to gray scale images is shaded to ease the comparison with the performance using color images.

From Table 1 we can conclude that using color information is beneficial. Notice that the aggregations using color information that perform worse than gray scale images mainly consider only one of the channels (in the case of the weighted means, the weights give most of the importance to a single channel). However, these statements are based on a single similarity measure; hence, we should study whether these conclusions hold across all metrics.

Table 2 summarizes the total error obtained for the considered metrics and aggregations. The last two columns present the mean error of each aggregation and its average rank (and rank position), respectively. The average rank is obtained by assigning a position to each aggregation depending on its performance with each metric: the aggregation achieving the best accuracy for a specific method is ranked first (value 1), the aggregation with the second best accuracy is assigned rank 2, and so forth. This is done for all metrics, and finally the average rank is computed as the mean of all the rankings. It provides a quick view of the overall behavior of each aggregation with respect to the others. Bold numbers indicate the best aggregation (rank 1) for each method. In addition, Fig. 3 shows a comparison of the best disparity maps obtained using color aggregation (those stressed in bold-face in Table 2) and the ones obtained with gray scale images.
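The average-rank computation can be sketched as follows. The error values are hypothetical, and ties here simply keep the sort order, a simplification of the usual fractional tie rank:

```python
def average_ranks(errors):
    """Average rank per aggregation from a table of total errors.

    errors: dict mapping aggregation name -> list of total errors,
    one per metric. Rank 1 goes to the lowest error on each metric;
    ties keep sort order (a simplification of fractional tie ranks).
    """
    names = list(errors)
    n_metrics = len(next(iter(errors.values())))
    totals = {name: 0 for name in names}
    for m in range(n_metrics):
        ordering = sorted(names, key=lambda name: errors[name][m])
        for rank, name in enumerate(ordering, start=1):
            totals[name] += rank
    return {name: totals[name] / n_metrics for name in names}

# Hypothetical total errors on three metrics:
table = {"G-Mean Dual": [10.1, 9.8, 11.0],
         "Minimum":     [12.4, 10.2, 13.1],
         "Median":      [11.0, 10.0, 10.5]}
print(average_ranks(table))   # Minimum is last (rank 3.0) on every metric
```

Averaging ranks instead of raw errors makes the comparison insensitive to the different error scales of the metrics, which is what allows a single robustness score across methods.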

This experiment shows that the optimal aggregation function depends on the stereo matching algorithm considered. However, the previously stated facts are confirmed: the usage of color information can improve the matching as long as all color channels are taken into account. Using color information may require greater computational effort to compute the disparity map, but this overhead can be kept low, since the process can easily be carried out in parallel, whereas the obtained improvements can make a difference. It is remarkable that the total error obtained using the weighted arithmetic mean based on the luminance formula is always lower than that obtained using gray scale images. Gray scale images are obtained by the *RGB* luminance formula; hence, it is advisable to compute the matching on the three color channels independently and aggregate the matching scores, instead of aggregating the color information first and then applying the algorithm to a single channel.

Moreover, among the tested aggregations, the dual of the geometric mean (G-Mean Dual), the weighted mean with the weights from the luminance formula (W-Mean Luminance) and the dual of the harmonic mean (H-Mean Dual) can be recommended, since they stand out as the most robust ones. They behave well across different images and metrics, which speaks for their appropriateness in different frameworks. We should also note the good behavior of the product and the geometric mean with SAD and SSD, which are commonly used [6, 7]. Both aggregations obtain equivalent results even though their similarity values are different, since the order between the scores is the same (the geometric mean is the *n*-th root of the product, which preserves the ordering).

## 5. Conclusions

We have carried out a comparative study of the performance of different aggregation functions in the stereo matching algorithm for aggregating the similarities of the different channels of the *RGB* color space. We can conclude that it is better to aggregate the color similarities after they are computed, avoiding ambiguities produced by color, than to aggregate the channels into a gray scale image and then compute the similarities. That is, color information is useful for the matching process and should not be overlooked.

The experiments have shown that, although the optimal aggregation function depends on the metric used, there are robust aggregations, such as the duals of the geometric and harmonic means, the weighted arithmetic mean based on the luminance formula, the geometric mean or the product, which perform properly for all metrics across all images and whose usage can hence be recommended.

## Acknowledgments

This paper has been partially supported by the National Science Foundation of Spain, Reference TIN2010-15055, TIN2011-29520 and the Research Services of the Universidad Publica de Navarra.

## References and links

**1. **D. Scharstein and R. Szeliski, “A taxonomy and evaluation of dense two-frame stereo correspondence algorithms,” Int. J. Comput. Vision **47**, 7–42 (2002). [CrossRef]

**2. **B. Cyganek and J.P. Siebert, *An Introduction to 3D computer vision techniques and algorithms* (Wiley, 2009). [CrossRef]

**3. **R. Zabih and J. Woodfill, “Non-parametric local transforms for computing visual correspondence,” presented at the Third European Conference on Computer Vision, Stockholm, Sweden, 2–6 May 1994.

**4. **E. Trucco, V. Roberto, S. Tinonin, and M. Corbatto, “SSD disparity estimation for dynamic stereo,” presented at The British Machine Vision Conference, Edinburgh, England, 1996.

**5. **O. Faugeras, T. Vieville, E. Theron, J. Vuillemin, B. Hotz, Z. Zhang, L. Moll, P. Bertin, H. Mathieu, P. Fua, G. Berry, and C. Proy, “Real-time correlation-based stereo: algorithm, implementations and applications,” INRIA Technical Report 2013, 1993.

**6. **X. Hu and P. Mordohai, “Quantitative evaluation of confidence measures for stereo vision,” IEEE T. Pattern Anal. **34**, 2121–2133 (2012). [CrossRef]

**7. **X. Xiang, M. Zhang, G. Li, H. Yuyong, and Z. Pan, “Real-time stereo matching based on fast belief propagation,” Mach. Vision Appl. **23**, 1219–1227 (2012). [CrossRef]

**8. **M. Bleyer and M. Gelautz, “Graph-based surface reconstruction from stereo pairs using image segmentation,” Proc. SPIE **5665**, 288–299 (2005). [CrossRef]

**9. **L. Hong and G. Chen, “Segment–based stereo matching using graph cuts,” *Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition* (IEEE, 2004), pp. 74–81.

**10. **V. Kolmogorov and R. Zabih, “Computing visual correspondence with occlusions via graph cuts,” *Proceedings of IEEE International Conference on Computer Vision* (IEEE, 2001), pp. 508–515.

**11. **P. F. Felzenszwalb and D. P. Huttenlocher, “Efficient belief propagation for early vision,” *Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition* (IEEE, 2004), pp. 261–268.

**12. **G. Tolt and I. Kalaykov, “Measures based on fuzzy similarity for stereo matching of colour images,” Soft Comput. **10**, 1117–1126 (2006). [CrossRef]

**13. **A. Klaus, M. Sormann, and K. Karner, “Segment-based stereo matching using belief propagation and a self-adapting dissimilarity measure,” *Proceedings of IEEE International Conference on Pattern Recognition* (IEEE, 2006), pp. 15–18.

**14. **M. Galar, J. Fernandez, G. Beliakov, and H. Bustince, “Interval-valued fuzzy sets applied to stereo matching of color images,” IEEE T. Image Process. **20**, 1949–1961 (2011). [CrossRef]

**15. **D. Van der Weken, M. Nachtegael, and E. E. Kerre, “Using similarity measures and homogeneity for the comparison of images,” Image Vision Comput. **22**, 695–702 (2004). [CrossRef]

**16. **W.J. Wang, “New similarity measure on fuzzy sets and on elements,” Fuzzy Set. Syst. **85**, 305–309 (1997). [CrossRef]

**17. **S.M. Chen, M.S. Yeh, and P.Y. Hsiao, “A comparison of similarity measures of fuzzy values,” Fuzzy Set. Syst. **72**, 78–89 (1995).

**18. **S. Kullback, *Information theory and statistics* (Wiley, 1959).

**19. **C.P. Pappis and N.I. Karacipilidis, “A comparative assessment of measures of similarity of fuzzy values,” Fuzzy Set. Syst. **56**, 171–174 (1993). [CrossRef]

**20. **E. Deza and M.M. Deza, “Image and audio distances,” in *Dictionary of distances*, E. Deza and M.M. Deza, (Elsevier, 2006), pp. 262–278. [CrossRef]