## Abstract

Depth maps captured by Kinect or time-of-flight (ToF) cameras play an active role in many visual applications. However, these depth maps are often contaminated with compound noise, which includes intrinsic noise and missing pixels. In addition, depth maps captured with ToF-based cameras are low in resolution. As these depth maps carry rich and critical information about 3D space, high-quality post-processing is crucial for supporting subsequent visual applications. Previous works rely on the guidance of the registered color image and use bicubic interpolation as an initialization for the up-sampling task, where challenges arise from texture copying and blurry depth discontinuities. Motivated by these challenges, we propose a new optimization model that depends on the relative structures of both depth and color images for both depth map filtering and up-sampling tasks. In our general model, two self-structure priors for depth and color images are constructed individually and used for the two tasks. To overcome the texture copying problem, the color-based and depth-based priors are used near the depth edges and in the homogeneous regions, respectively. To this end, we further propose a confidence map for each task that controls where each prior is used. Experimental results on both simulated and real datasets for Kinect and ToF cameras demonstrate that the proposed method outperforms the benchmark methods.

© 2021 Optical Society of America under the terms of the OSA Open Access Publishing Agreement

## 1. Introduction

Depth maps play a crucial role in many visual computation and communication applications such as 3D surgery operations [1–3], augmented reality [4], UAV navigation [5], human-device interaction [6], 3D modeling [7], and 3DTV/free viewpoint visual applications [8]. These depth maps are obtained from current depth sensors such as Kinect and time-of-flight (ToF) cameras, which are affordable and popular. Unfortunately, these depth maps are far from perfect: they suffer from different types of degradation such as intrinsic noise and missing pixels. In addition, the depth maps captured with current devices have low resolution. These degradation types reduce the performance of range sensors when they are used in different applications. To benefit from the depth information, many researchers have devoted their efforts to depth restoration, either by using multiple depth maps or by exploiting auxiliary information such as a registered color image. Current research on depth map restoration, including depth map filtering and super-resolution, can be divided into three categories according to the baseline [9]:

**(1) Filtering methods:** Filtering methods depend on local or non-local information and can be divided into two categories: *self-guided* and *color-guided*. Self-guided filters depend only on the depth map and are either local, such as the bilateral filter (BF) [10], non-local, such as the non-local means (NLM) filter [11], or any traditional filtering method used for color images [12]. These self-guided methods are not applicable for up-sampling and filtering ToF-based depth maps because these under-sampled depth maps have very low quality. On the other hand, color-guided filters exploit the content of the color image to detect the depth discontinuities in the noisy environment. These filters rely on the assumption that edges in the depth and color images are correlated. For example, the joint bilateral filter (JBF) [13] is one of the popular color-guided local methods used for enhancing depth maps. Most color-guided methods depend explicitly on the co-occurrence property between the depth map and the corresponding color image. Figure 1 shows the edges of both depth and color images. From Fig. 1, it is observed that these edges are not fully consistent: only some of the color edges correspond to depth edges. When the color image is used as a guide for depth map filtering, textures that do not correspond to any depth edge are transferred to the homogeneous regions of the depth map; this is called the texture copying problem. It is a challenge for all color-guided filtering methods.

**(2) Optimization methods:** Optimization methods are global methods that recover the depth map by using regularization terms or priors that model the depth map characteristics. Most previous optimization models used for up-sampling and filtering depth maps generated with ToF and Kinect devices are color-guided. Some of these color-guided models are based on a graphical model such as the Markov random field (MRF) [14] or on a low rank-based method (LRM) [15]. Besides that, two optimization models depending on auto-regressive (AR) predictors have been proposed. The first model is based on AR without any further priors [16], while the second adds further priors such as low rank (LR) and total variation (TV) regularization terms [17]. All of these AR-based models result in blurry depth edges. Apart from the aforementioned optimization methods, some methods depend on the prior of mutual structures between the color and depth images for depth map enhancement. The mutual structure for joint filtering (MSJF) method [18] is an example of guiding depth map restoration by the structures that exist in both depth and color images. In addition, some optimization models based on weighted least squares have been proposed for depth map restoration [8,19,20].

**(3) Learning methods:** Learning-based methods use a learning methodology for depth map processing. They can be divided into two categories: *deep neural networks* and *sparse coding* methods. Deep learning has drawn the attention of many researchers for depth map restoration. For instance, a convolutional neural network (CNN) with a pre-processing step is used for depth map enhancement in [21]. Zhu *et al.* [22] combine a CNN and a linear regularization into a learning network for filtering and up-sampling depth maps. Recently, He *et al.* proposed a graph neural network to handle the compression artifacts of multi-view depth maps [23]. For sparse coding, Wang *et al.* [24] used sparse coding constrained with a trilateral prior for filling hole pixels in Kinect depth maps. This constraint handles the blurred depth discontinuities that appear if sparse coding is used alone. Wang *et al.* [25] proposed to use deep intensity features for compression artifact reduction of depth maps.

In this paper, we propose a color-guided optimization model based on the reliable self structures of depth and color images for depth map restoration, which includes two tasks: noise filtering with missing-pixel filling, and super-resolution (i.e. up-sampling). Our contributions can be outlined as follows:

- Reliable self structures-based depth map filtering: In this task, motivated by the mutually guided image filtering (muGIF) method [26], we construct an optimization model that also depends on mutual or relative structures; however, our model depends only on the self relative structures of the depth map, guided by the self relative structures of the color image. A further contribution is the use of this relative structure-based model for filling missing pixels in the depth map, for which the original muGIF model [26] is not applicable. To this end, we also propose a confidence map that restricts the color-based prior to the missing regions in order to overcome the texture transfer problem.
- Reliable self structures-based depth map up-sampling: In this task, although the original muGIF model can be used for depth map up-sampling, we propose a modified model similar to the one used for depth map filtering but with a different confidence map. The confidence map in this task is designed to deal with the problems that the original model faces. With this modification, the performance of depth map up-sampling is improved.

## 2. Proposed color-guided optimization model

#### 2.1 Problem statement and degradation model

Note that the depth map is not only polluted with intrinsic noise (e.g. Gaussian noise with constant or depth value-related variance) but also corrupted by missing regions, especially near the depth edges. In addition, depth maps captured by recent sensors and depth cameras such as ToF are of low resolution compared with the high resolution (HR) RGB color image of the same scene. In summary, the main types of degradation that contaminate the captured depth maps are **intrinsic noise**, **hole pixels** (i.e. missing pixels), either random or structural, and **under-sampling**. The observation model can be formulated as $\textbf {T}_{0}= \textbf {P}\textbf {T}+\textbf {n}$, where $\textbf {P}$ is the observation matrix, $\textbf {T}$ is the desired depth map, $\textbf {T}_{0}$ is the degraded depth map, and $\textbf {n}$ is the intrinsic noise. For super-resolution with ToF and Kinect version 2 (i.e. ToF-based) cameras, we denote the observation matrix $\textbf {P}$ as $\textbf {P}_{s}$, which represents a sampling matrix. $\textbf {P}_{s}$ is constructed from an identity matrix by removing the rows associated with pixels that do not exist in the low resolution map and should be estimated in the high resolution map. On the other hand, for hole pixel filling with the Kinect version 1 (i.e. structured light-based) camera, we denote $\textbf {P}$ as $\textbf {P}_{h}$, an identity matrix whose rows associated with hole pixels are removed. To understand the shortcomings that affect depth map restoration, we analyze the filtering of missing pixels and intrinsic noise separately from the super-resolution task.
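The observation model $\textbf {T}_{0}= \textbf {P}\textbf {T}+\textbf {n}$ can be illustrated with a minimal sketch. It assumes a depth map vectorised row by row and stored as a plain Python list; the helper names (`observe`, `sampling_indices`, `hole_indices`), the regular sampling grid, and the toy values are illustrative assumptions rather than details taken from the paper. Selecting entries by index plays the role of multiplying by an identity matrix with rows removed.

```python
import random

def observe(depth, keep, noise_sigma=0.0, seed=0):
    """Apply T0 = P T + n to a vectorised depth map.

    depth: flat list of noise-free depth values (T).
    keep:  sorted list of pixel indices whose identity rows survive in P.
    noise_sigma: standard deviation of the additive Gaussian noise n.
    """
    rng = random.Random(seed)
    # Picking entries by index is equivalent to multiplying by an identity
    # matrix from which the non-kept rows have been removed.
    return [depth[i] + rng.gauss(0.0, noise_sigma) for i in keep]

def sampling_indices(height, width, rate):
    """Rows kept by P_s (assumed regular grid): every `rate`-th pixel."""
    return [r * width + c
            for r in range(0, height, rate)
            for c in range(0, width, rate)]

def hole_indices(hole_mask):
    """Rows kept by P_h: every pixel that is not marked as a hole."""
    return [i for i, is_hole in enumerate(hole_mask) if not is_hole]

# Toy 4x4 depth map stored row by row (hypothetical values).
T = [float(v) for v in range(16)]
low_res = observe(T, sampling_indices(4, 4, 2))   # 2x under-sampling, no noise
holes = [i in (5, 6) for i in range(16)]          # two missing pixels
with_holes = observe(T, hole_indices(holes))      # hole-corrupted observation
```

With `noise_sigma` left at zero the sketch isolates the effect of $\textbf {P}$; a positive value adds constant-variance Gaussian noise to the kept pixels.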

Given a degraded depth map with spatial noise (e.g. Gaussian noise with constant variance, or with a variance at each pixel that is proportional to the square of the noise-free depth value), the depth map can be restored through many methods. As the optimization-based method muGIF [26] is closely related to our work, it is used as an example in the analysis. muGIF is a recent method that defines a new measurement of mutual response to manage structural similarity between two inputs for image smoothing. In the color-guided case, let us denote the depth map, as the target image, by $\textbf {T}_{0}$ and the color image, as the reference or guidance image, by $\textbf {R}_{0}$. $\textbf {R}$ and $\textbf {T}$ indicate the filtering outputs of the color and depth images, respectively, through the optimization iterations. The muGIF is formulated as:

#### 2.2 Analysis of different mutually filtering types

### 2.2.1 Filling hole pixels and filtering

For the hole filling and filtering task, the self-guided muGIF is robust in dealing with intrinsic noise and removing texture; however, it cannot handle the hole pixels. For the remaining types, the hole pixels are not completely filled, especially the large black regions, although the mutuality between the color and depth edges is activated. Figure 2(b-d) shows the performance of muGIF when hole pixels exist in the depth map. As the original muGIF is not built for filling hole pixels, we first modify the data fidelity term of the depth map in the muGIF model according to the degradation model, and we denote the resulting model as Extended muGIF (EmuGIF) in this paper. After the modification, the optimization model becomes as follows:

Figure 2(f-h) demonstrates the performance of EmuGIF. From Fig. 2(f), it is clear that the self-guided EmuGIF fills the hole pixels with wrong predictions, especially in the large black regions. On the other hand, S/D and D/D EmuGIF perform well in filtering and filling the hole pixels; however, these color-guided types of EmuGIF are not sufficient for handling the case where color edges are inconsistent with depth discontinuities. The inconsistency between the compound-noise-free depth map and the color image appears in two cases. The first is homogeneous depth regions that correspond to highly textured color regions, and the second is depth discontinuities that correspond to weak color edges. Therefore, when the color image is used as a guiding image, some textures transfer from the color image to the homogeneous regions of the depth map in the first case, while the restored depth edges are blurred in the second case. The mutuality concept depends on the mutual structures between the two images (i.e. depth and color images). The mutual filter considers the structure consistent if common edges exist in the two images, and inconsistent if the edges appear in only one image but not in the other. As the mutual concept is robust to inconsistency, the texture copying problem that results from transferring texture to the corresponding homogeneous depth regions is highly mitigated, especially if the depth map is not contaminated with heavy intrinsic noise. Unfortunately, this mutuality is not powerful when the color edges that correspond to the depth edges are weak, because the mutual filter also treats this case as inconsistent, which in turn smooths or blurs these depth edges.

### 2.2.2 Super-resolution

For the super-resolution task, we initialize the HR depth map by bicubic interpolation, and then apply all types of muGIF to this interpolated version. Figure 3 shows the results of these types on the simulated ToF depth map *Art* for 8$\times$ up-sampling. From Fig. 3, we can observe that the muGIF algorithm is not robust for noisy depth map up-sampling, as some noise remains in the recovered HR depth map. There is a conflict between removing the intrinsic noise and keeping the sharpness of depth edges. When the noise is removed and the homogeneous regions are smoothed, some depth edges are blurred in the case of the self-guided type. In addition to the blurred discontinuities, some fake edges transfer to the homogeneous depth regions in the case of the other types.

#### 2.3 Proposed method for compound noise filtering

In addition to the issues discussed above, if the depth map has missing pixels, filling these pixels becomes more challenging for the mutual filter even with the degradation model. To handle the missing pixels and overcome the blurring at depth discontinuities that correspond to weak color edges, we propose a new optimization model, which also depends on the relative structure concept. This new optimization model can be expressed as:

On the other hand, $\mathcal {R}_{masked}(\textbf {R},\epsilon _{r})$ is the relative structure of depth map to self-guided color image mutual response masked by the inverted confidence map $\textbf {C}_{inv}$. This relative structure related to color image gradients is expressed as:

*i*-th diagonal entries being the denominators of Eq. (6) and Eq. (7) respectively. $\textbf {D}_{d}$ is the discrete gradient operator in horizontal, vertical and diagonal directions, which points out the 8 surrounding nearby pixels. As the final decomposition equation has quadratic terms, which are convex, the closed form solution can be described as:

#### 2.4 Proposed method for super-resolution

In this subsection, we describe how the depth map degradation process related to low resolution is modeled. The optimization model used for super-resolution is similar to the filtering model but with some modifications in the initialization of depth map and the confidence map. The proposed model for super-resolution is formulated as follows:

To calculate $E$, we first distinguish between the smooth depth regions and regions around depth discontinuities. In the beginning, we denoise the interpolated map by $L_{0}$ smoothing filter [28] to facilitate the distinguishing process. Then, similar to [20], we also use the local depth relative smoothness and denote it as $\rho$ to detect the depth edges as shown in the following decomposition model equation:

## 3. Experiments and discussions

In this section, the performance of our model is verified via various experiments using different types of datasets. These experiments cover two degradation types: super-resolution with intrinsic noise filtering (i.e. ToF-like experiments) and filling of missing black pixels with intrinsic noise filtering (i.e. Kinect-like experiments). Our proposed method is tested on the Middlebury datasets [30], which are modified to simulate ToF-like and Kinect-like degradation models. Moreover, our method is tested on real datasets. We also qualitatively and quantitatively compare our filtering method with state-of-the-art methods: a low rank based method (LRM) [15], mutual structure for joint filtering (MSJF) [18], the color-guided AR model [16], RCG [20], muGIF [27], the adaptive color-guided non-local means method (ACGMNLM) [31], and ACGMNLM with shock filter (ACGMNLM+SF) [31]. For the super-resolution comparison, the fast guided global interpolation method (FGI) [32] and the learning dynamic guidance method (DG) [33] are used in addition to the mentioned methods. The parameters of our method are set as follows: $\alpha$ in Eq. (12) is set to 0.0002 for the filtering task and to 0.0005 divided by the up-sampling rate for the super-resolution task. $\epsilon _{t}$ and $\epsilon _{r}$ are set to 0.005 for both tasks.
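The parameter settings stated above can be collected in a small helper, shown here as a sketch. The function name `model_parameters`, the task labels, and the dictionary keys are hypothetical; only the numeric values come from the text.

```python
def model_parameters(task, upsampling_rate=None):
    """Return the parameter values reported in the text for a given task.

    task: "filtering" or "super_resolution" (labels chosen for this sketch).
    upsampling_rate: required for super-resolution, e.g. 2, 4, 8 or 16.
    """
    if task == "filtering":
        alpha = 0.0002
    elif task == "super_resolution":
        if not upsampling_rate or upsampling_rate <= 0:
            raise ValueError("super-resolution needs a positive upsampling_rate")
        # alpha shrinks as the up-sampling rate grows.
        alpha = 0.0005 / upsampling_rate
    else:
        raise ValueError(f"unknown task: {task!r}")
    # epsilon_t and epsilon_r share the same value for both tasks.
    return {"alpha": alpha, "eps_t": 0.005, "eps_r": 0.005}

filtering_params = model_parameters("filtering")
sr_params_8x = model_parameters("super_resolution", upsampling_rate=8)
```

For 8$\times$ up-sampling this gives $\alpha = 0.0005 / 8 = 6.25 \times 10^{-5}$.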

#### 3.1 Experiments on simulated and real Kinect depth maps

In this part of the experiments, our method is tested on both simulated and real Kinect datasets. For the simulated data, we reuse the two simulated Kinect datasets used in [20], where the first dataset (D1) and the second dataset (D2) are prepared by [15] and [20], respectively. We then apply our method and the compared filters that are applicable for restoring Kinect-like depth maps to these corrupted depth maps. Table 1 reports the comparison between our proposed method and the other filtering methods on the simulated Kinect datasets in terms of mean absolute error (MAE). From Table 1, it is observed that our proposed method ranks first, with the smallest MAE for most depth maps in the two datasets. In addition, our average score is better than the average scores of the other filtering methods.
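For reference, the MAE used in the comparison is the mean absolute per-pixel difference between a restored depth map and its ground truth. The sketch below assumes both maps are equally sized 2D lists and that every pixel contributes to the average; whether the paper excludes any pixels from the evaluation is not stated, so that choice is an assumption.

```python
def mean_absolute_error(restored, reference):
    """Average absolute per-pixel difference between two equally sized maps."""
    if len(restored) != len(reference):
        raise ValueError("maps must have the same number of rows")
    total, count = 0.0, 0
    for row_a, row_b in zip(restored, reference):
        if len(row_a) != len(row_b):
            raise ValueError("rows must have the same length")
        for a, b in zip(row_a, row_b):
            total += abs(a - b)
            count += 1
    if count == 0:
        raise ValueError("maps are empty")
    return total / count

# Toy 2x2 example with hypothetical depth values.
restored = [[10, 12], [20, 19]]
reference = [[10, 10], [20, 20]]
mae = mean_absolute_error(restored, reference)  # (0 + 2 + 0 + 1) / 4
```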

For visual comparison, Figs. 4 and 5 illustrate the comparison between different algorithms on two simulated Kinect depth maps from D1 (*Art* and *Teddy*). In Figs. 4 and 5, two specific regions from each depth map are picked and enlarged for further clarification; one region is chosen in the homogeneous regions and the other illustrates the problem of blurred and distorted depth discontinuities. From Figs. 4(c) and 5(c), it is observed that LRM blurs the depth discontinuities because it is based on a patch-based low rank optimization method. Furthermore, some textures are also transferred to the homogeneous regions. For the AR results, some intrinsic noise still occupies the homogeneous regions of the depth maps, as shown in Figs. 4(d) and 5(d). For the RCG results, the depth edges are over-sharpened. In addition, fake edges are transferred from the color images, as shown in Figs. 4(e) and 5(e). Although the results of the EmuGIF methods have little texture in their smoothed regions, they distort the depth edges, as shown in Fig. 4(f-g). ACGMNLM with and without SF is robust against texture copying artifacts; however, the resulting depth edges are not as sharp as those of our proposed optimization-based filtering method. Among all of these filtering methods, our proposed method achieves the best results in overcoming color texture transfer, preserving sharper edges, and handling the intrinsic noise.

We also evaluate our proposed method on a real Kinect dataset. Part of the NYU dataset [34] is used in our verification. Figure 6 illustrates the performance of our method against other filtering methods on two depth maps from this dataset. From Fig. 6(c), it is observed that the AR method [16] suffers from blurred depth edges, as can be seen in the marked regions. The RCG method [20] always over-sharpens the depth discontinuities and transfers more texture to the corresponding homogeneous depth regions, as shown in Fig. 6(d). The results of EmuGIF are comparable with our results; however, our results are still the best for preserving the depth discontinuities and overcoming the texture copying problem.

In addition to the objective and subjective evaluation of simulated and real Kinect depth maps, we also construct 3D point clouds, where flying pixels and texture copying artifacts are visualized better in 3D space than in two dimensions. Figure 7 shows the point clouds obtained from the depth maps produced by different compound noise filtering methods. From Fig. 7, we can see that the point cloud obtained from the depth map produced by AR is very distorted both in the homogeneous regions and at the depth boundaries. For the RCG method, the point cloud boundaries are also distorted, especially at locations corresponding to richly textured color image regions, although almost all depth edges are very sharp. This problem appears because of the over-sharpening of the RCG method. From Fig. 7(c), it is observed that the point cloud obtained from the EmuGIF(S/D) method has many defects due to texture transfer and flying pixels near the edges. Although the point cloud of EmuGIF(D/D) is better than that of EmuGIF(S/D) in reducing flying pixels, it still has some distortions in the geometry corresponding to the homogeneous regions due to texture copying. This texture copying problem is tackled in ACGMNLM and ACGMNLM+SF because these methods are robust against texture transfer, as the color image is not used in non-hole regions. However, flying pixels still appear at the depth discontinuities. On the other hand, our proposed optimization model considerably reduces the flying pixels, and the depth discontinuities are sharp, as shown in Fig. 7(g). In addition, our method is robust against the texture copying problem.
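The text does not describe how the point clouds are constructed. A common approach, sketched below under that caveat, is to back-project each valid depth pixel through a pinhole camera model. The function name and the intrinsic parameters `fx`, `fy`, `cx`, `cy` are assumptions for illustration; real values depend on the sensor calibration.

```python
def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project a depth map into 3D points with a pinhole camera model.

    depth: 2D list of depth values; non-positive values are treated as holes.
    fx, fy: focal lengths in pixels (assumed known from calibration).
    cx, cy: principal point in pixels (assumed known from calibration).
    """
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:
                continue  # hole pixels carry no geometry
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points

# Toy 2x2 depth map with one hole pixel (hypothetical values, in mm).
depth = [[1000.0, 0.0],
         [1000.0, 2000.0]]
cloud = depth_to_point_cloud(depth, fx=500.0, fy=500.0, cx=0.5, cy=0.5)
```

Under this projection, a blurred depth edge produces points with intermediate depths between foreground and background, which is why flying pixels are easier to see in the point cloud than in the depth image.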

#### 3.2 Experiments on simulated ToF depth maps

In this part of the experiments, our up-sampling method is tested on a simulated ToF dataset. The simulated dataset is provided by Yang *et al.* [16], who took six depth maps from the Middlebury datasets and made them noisy and under-sampled with factors of 2, 4, 8 and 16 to mimic real ToF depth maps. In this experiment, our method is compared with several edge-aware up-sampling optimization models. Table 2 shows the comparison between our proposed method and the other up-sampling methods on the simulated ToF dataset in terms of MAE for all up-sampling rates. From Table 2, we can see that our up-sampling results are much better than the results of all types of the muGIF model because of the effect of the edge confidence map. In addition, our proposed method has lower errors than the learning-based method DG and the optimization-based method FGI on most of the simulated depth maps. Although our objective results are slightly worse than the RCG up-sampling results, as shown in Table 2, our method still outperforms RCG on some of the simulated depth maps (e.g. *Art* and *Book*). Our proposed method ranks first or second among the up-sampling methods.

For subjective evaluation, our proposed method is also compared with the up-sampling methods, since subjective evaluation is sometimes more informative than objective evaluation, especially for visualizing the depth discontinuities. Figure 8 presents the subjective evaluation of the 8$\times$ up-sampled depth map *Art* for all methods. In addition, the difference maps for all methods are shown. The depth maps up-sampled by the dynamic/dynamic type of muGIF and by DG still contain observable intrinsic noise. In addition, their depth discontinuities are blurred, as shown in Fig. 8(a-b). For FGI, the depth map has very little noise; however, the depth edges are still blurred and distorted. For RCG, the textures transferred to the homogeneous regions of the depth map are still noticeable, especially in regions corresponding to high-contrast textures in the registered color image. Our proposed method outperforms RCG in overcoming the texture copying problem and preserving the depth discontinuities, as shown in the corresponding difference maps of Fig. 8(d-e).

#### 3.3 Experiments on real ToF depth maps

In addition to testing our proposed super-resolution model on the simulated ToF dataset, we further test it on a real ToF dataset. The real ToF dataset used for verification is provided by [35] and includes three depth maps, namely *Shark*, *Devil* and *Books*. The depth maps in this dataset have a low spatial resolution (i.e. 120 $\times$ 160) with values in millimeters (mm), while the spatial resolution of the registered intensity images is 610 $\times$ 810. Table 3 reports the quantitative performance of our method compared with other up-sampling methods on the real ToF dataset. From Table 3, it is seen that our method ranks first for most of the depth maps of the real dataset (*Books* and *Shark*).

For subjective evaluation, our method is validated by the resulting depth maps and the point clouds obtained from them. The visual comparison for one real ToF depth map, *Books*, is presented in Fig. 9. AR and all muGIF types still have some noise and blurry edges. The learning-based method DG has little noise in the homogeneous regions and little distortion at the depth discontinuities; however, its depth edges are also blurred, as with the AR and muGIF methods. The depth edges produced by the RCG method are distorted and quite jagged, as shown in Fig. 9(d), although RCG over-sharpens the depth edges. From Fig. 9, it can be seen that our up-sampling method performs best in recovering and up-sampling the real ToF depth map compared with the other methods, especially RCG.

The other validation is the point cloud: the benefit of warping the depth maps into point clouds is that blurred depth edges and flying pixels appear clearly. Figure 10 presents the point clouds obtained from various up-sampling methods including ours. This figure confirms the observations drawn from Fig. 9. From Fig. 10, it is clear that our method preserves the boundaries of the depth map with few flying pixels compared with the other approaches. Although the RCG method performs robustly, with sharp edges and very few flying pixels compared with the other methods, and is comparable with our method, there are noticeable distortions at the boundaries of the point cloud produced by RCG.

#### 3.4 Visualization of confidence maps

In this subsection, we discuss how the confidence maps change through the optimization iterations, as shown in Fig. 11. The first column of the first row represents the initialization of the confidence map $C$, which is the eroded mask of the Kinect-like depth map. After the first iteration, the hole pixels are filled and the confidence map becomes a white matrix, at which point the proposed method is equivalent to the self-guided muGIF, which removes any texture copied from the color image. For the edge confidence map $E$, the black regions in the map correspond to the homogeneous regions of the depth map, while the white pixels correspond to the expected edges.
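The initialization of $C$ as an eroded validity mask can be sketched as follows. The text does not give the structuring element or the hole encoding, so the 3$\times$3 square window (`radius=1`) and the convention that non-positive depth marks a hole are assumptions, as are the function names.

```python
def valid_mask(depth):
    """1 where a depth value was measured, 0 where the pixel is a hole."""
    return [[1 if z > 0 else 0 for z in row] for row in depth]

def erode(mask, radius=1):
    """Binary erosion with a (2*radius+1) x (2*radius+1) square window.

    A pixel stays 1 only if every in-image pixel of its window is 1, so the
    image border alone does not erode the mask.
    """
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            out[r][c] = int(all(
                mask[rr][cc]
                for rr in range(max(0, r - radius), min(h, r + radius + 1))
                for cc in range(max(0, c - radius), min(w, c + radius + 1))
            ))
    return out

# Toy 4x4 Kinect-like depth map with a single hole pixel at row 2, column 2.
depth = [[5, 5, 5, 5],
         [5, 5, 5, 5],
         [5, 5, 0, 5],
         [5, 5, 5, 5]]
confidence = erode(valid_mask(depth))
# Inverted map: 1 in and around holes, where the color-based prior would act.
confidence_inv = [[1 - v for v in row] for row in confidence]
```

Erosion enlarges the zero region around each hole, so pixels bordering a hole are also treated as unreliable in this sketch.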

## 4. Conclusion

In this paper, a new optimization model depending on the relative structures of both depth and color images is proposed for depth filtering and up-sampling tasks. In addition, a confidence map suited to each task is proposed for distinguishing between the depth discontinuities and the smooth regions, where the color-based and depth-based priors are used, respectively. Our proposed model is effective at overcoming the texture copying problem in both the hole filling and super-resolution tasks. Moreover, the depth discontinuities in our results are sharp without being blurred or over-sharpened, as shown in experiments on both simulated and real Kinect and ToF data.

## Funding

National Natural Science Foundation of China (61971203); China Southern Power Grid (YNKJXM20180015).

## Disclosures

None of the authors is employed by government, government-related, or commercial entities. The research was not conducted under any commercial relationship with any commercial entity.

This manuscript, or any part of it, has not been published and is not under consideration by other journals.

All authors declare no conflicts of interest.

## Data availability

Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request.

## References

**1.** G. Badiali, L. Cercenelli, S. Battaglia, E. Marcelli, C. Marchetti, V. Ferrari, and F. Cutolo, “Review on augmented reality in oral and cranio-maxillofacial surgery: Toward surgery-specific head-up displays,” IEEE Access **8**, 59015–59028 (2020).

**2.** M. H. Lee, J. Kim, K. Lee, C. Choi, and J. Y. Hwang, “Wide-field 3d ultrasound imaging platform with a semi-automatic 3d segmentation algorithm for quantitative analysis of rotator cuff tears,” IEEE Access **8**, 65472–65487 (2020).

**3.** Z. Dai, R. Yang, F. Hang, J. Zhuang, Q. Lin, Z. Wang, and Y. Lao, “Neurosurgical craniotomy localization using interactive 3d lesion mapping for image-guided neurosurgery,” IEEE Access **7**, 10606–10616 (2019).

**4.** B. J. Boom, S. Orts-Escolano, X. X. Ning, S. McDonagh, P. Sandilands, and R. B. Fisher, “Interactive light source position estimation for augmented reality with an rgb-d camera,” Comp. Anim. Virtual Worlds **28**(1), e1686 (2017).

**5.** Y. Lu, Z. Xue, G.-S. Xia, and L. Zhang, “A survey on vision-based uav navigation,” Geo-spatial Information Science **21**(1), 21–32 (2018).

**6.** J. Palacios, C. Sagüés, E. Montijano, and S. Llorente, “Human-computer interaction based on hand gestures using rgb-d sensors,” Sensors **13**(9), 11842–11860 (2013).

**7.** Y. Wang, Y. Yang, and Q. Liu, “Feature-aware trilateral filter with energy minimization for 3d mesh denoising,” IEEE Access **8**, 52232–52244 (2020).

**8.** Y. Yang, Q. Liu, X. He, and Z. Liu, “Cross-view multi-lateral filter for compressed multi-view depth video,” IEEE Trans. Image Process. **28**(1), 302–315 (2019).

**9.** M. M. Ibrahim, Q. Liu, R. Khan, J. Yang, E. Adeli, and Y. Yang, “Depth map artifacts reduction: A review,” IET Image Processing **14**(12), 2630–2644 (2020).

**10.** C. Tomasi and R. Manduchi, “Bilateral filtering for gray and color images,” in IEEE Int. Conf. Comput. Vis. (ICCV), (IEEE, 1998), pp. 839–846.

**11.** A. Buades, B. Coll, and J.-M. Morel, “Image denoising methods. a new nonlocal principle,” SIAM Rev. **52**(1), 113–147 (2010).

**12.** E. S. Gastal and M. M. Oliveira, “Adaptive manifolds for real-time high-dimensional filtering,” ACM Trans. Graph. **31**(4), 1–13 (2012).

**13.** J. Kopf, M. F. Cohen, D. Lischinski, and M. Uyttendaele, “Joint bilateral upsampling,” ACM Trans. Graph. **26**(3), 96–100 (2007).

**14.** J. Diebel and S. Thrun, “An application of markov random fields to range sensing,” in Conf. Neural Information Processing Systems (NIPS), (2005), pp. 291–298.

**15.** S. Lu, X. Ren, and F. Liu, “Depth enhancement via low-rank matrix completion,” in IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), (IEEE, 2014), pp. 3390–3397.

**16.** J. Yang, X. Ye, K. Li, C. Hou, and Y. Wang, “Color-guided depth recovery from rgb-d data using an adaptive autoregressive model,” IEEE Trans. Image Process. **23**(8), 3443–3458 (2014).

**17.** W. Dong, G. Shi, X. Li, K. Peng, J. Wu, and Z. Guo, “Color-guided depth recovery via joint local structural and nonlocal low-rank regularization,” IEEE Trans. Multimedia **19**(2), 293–301 (2017).

**18.** X. Shen, C. Zhou, L. Xu, and J. Jia, “Mutual-structure for joint filtering,” in IEEE Int. Conf. Comput. Vis. (ICCV), (IEEE, 2015), pp. 3406–3414.

**19.** D. Min, S. Choi, J. Lu, B. Ham, K. Sohn, and M. N. Do, “Fast global image smoothing based on weighted least squares,” IEEE Trans. Image Process. **23**(12), 5638–5653 (2014).

**20.** W. Liu, X. Chen, J. Yang, and Q. Wu, “Robust color guided depth map restoration,” IEEE Trans. Image Process. **26**(1), 315–327 (2017).

**21.** X. Zhang and R. Wu, “Fast depth image denoising and enhancement using a deep convolutional network,” in Int. Conf. Acoustics, Speech and Signal Process. (ICASSP), (IEEE, 2016), pp. 2499–2503.

**22.** J. Zhu, J. Zhang, Y. Cao, and Z. Wang, “Image guided depth enhancement via deep fusion and local linear regularization,” in Int. Conf. Image Process. (ICIP), (IEEE, 2017), pp. 4068–4072.

**23.** X. He, Q. Liu, and Y. Yang, “MV-GNN: Multi-view graph neural network for compression artifacts reduction,” IEEE Trans. Image Process. **29**, 6829–6840 (2020).

**24.** Z. Wang, J. Hu, S. Wang, and T. Lu, “Trilateral constrained sparse representation for kinect depth hole filling,” Pattern Recognit. Lett. **65**, 95–102 (2015).

**25.** X. Wang, P. Zhang, Y. Zhang, L. Ma, S. Kwong, and J. Jiang, “Deep intensity guidance based compression artifacts reduction for depth map,” J. Vis. Commun. Image Represent. **57**, 234–242 (2018).

**26.** X. Guo, Y. Li, and J. Ma, “Mutually guided image filtering,” in 2017 ACM Multimedia Conf., (ACM, 2017), pp. 1283–1290.

**27.** X. Guo, Y. Li, J. Ma, and H. Ling, “Mutually guided image filtering,” IEEE Trans. Pattern Anal. Mach. Intell. **42**(3), 694–707 (2020).

**28.** L. Xu, C. Lu, Y. Xu, and J. Jia, “Image smoothing via $L_{0}$ gradient minimization,” ACM Trans. Graph. **30**(6), 1–12 (2011).

**29.** D. Krishnan and R. Szeliski, “Multigrid and multilevel preconditioners for computational photography,” ACM Trans. Graph. **30**(6), 1–10 (2011).

**30.** H. Hirschmuller and D. Scharstein, “Evaluation of cost functions for stereo matching,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), (IEEE, 2007), pp. 1–8.

**31.** M. M. Ibrahim, Q. Liu, and Y. Yang, “An adaptive colour-guided non-local means algorithm for compound noise reduction of depth maps,” IET Image Processing **14**(12), 2768–2779 (2020).

**32.** Y. Li, D. Min, M. N. Do, and J. Lu, “Fast guided global interpolation for depth and motion,” in European Conference on Computer Vision, (Springer, 2016), pp. 717–733.

**33.** S. Gu, W. Zuo, S. Guo, Y. Chen, C. Chen, and L. Zhang, “Learning dynamic guidance for depth image enhancement,” in IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), (IEEE, 2017), pp. 712–721.

**34.** N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, “Indoor segmentation and support inference from rgbd images,” in European Conf. on Comput. Vis. (ECCV), (Springer, 2012), pp. 746–760.

**35.** D. Ferstl, C. Reinbacher, R. Ranftl, M. Rüther, and H. Bischof, “Image guided depth upsampling using anisotropic total generalized variation,” in IEEE Int. Conf. Comput. Vis. (ICCV), (IEEE, 2013), pp. 993–1000.