Translation-invariant context-retentive wavelet reflection removal network

Wei-Yen Hsu; Wei-Yen Hsu; Wei-Yen Hsu; Wan-Jia Wu

doi:10.1364/OE.461552

1. Introduction

It can reduce the visibility of the scene behind the glass for reflections when shooting through glass surfaces in the real world. Reflections not only degrade image quality, but also degrade computer vision performance. This situation can be addressed by a single image reflection removal (SIRR) process designed to improve scene visibility of the background. Reflection removal is a challenging problem, and the challenges come from at least two aspects: (1) It is difficult to obtain real reflection datasets in the real world, and many studies will conduct research in a synthetic way [1–3]; (2) reflections have diverse variations, e.g., glass thickness [4], image misalignment [5], etc., that affects the rendering of reflections, so reflection removal needs to be given high priority in computer vision.

To address this problem, many non-learning-based methods [6, 7] employ a two-stage approach that first locates the reflective regions, for instance, is solved by classifying the background and reflective edges [6, 7] and the problem is then solved using [8] a method proposed to recover the background layer based on edge information. However, locating the reflective area itself is a very challenging task, so these methods mainly rely on some heuristic observations [7], or must involve user interaction [8], which are not applicable in many scenarios. In order to solve the reflection removal problem more efficiently, recent methods start to use deep learning techniques to solve this problem, such as CEINet [9] proposed by Fan et al., which is a pioneering technique. It is the first method to use deep learning and propose an end-to-end framework to solve reflection removal problems. Benefiting from deep learning frameworks, these methods exhibit good modeling ability to capture the features of various reflection images.

Nevertheless, gradient inference is still used like many non-learning-based methods [7, 8] and do not fully exploit the multi-scale information of background restoration. Furthermore, these methods [9] are mainly trained with synthetic images. However, synthetic images cannot capture comprehensive information during real-world image formation, so the removal of reflections from real-world images is still limited. In order to solve the problems encountered [9], Wan et al. [10] proposed a network with a feature sharing strategy (Cooperative Reflection Removal Network, CoRRN), by integrating image background information and multi-scale gradient information to solve the problem of reflection removal in a network, in addition to introducing a new loss function, based on the statistical loss of gradient level for the removal of local strong reflections. Wei et al. [5] use a context encoding module to reduce the uncertainty of local strong reflections, and in the loss function part, an alignment-invariant loss function is introduced to target the data set with image misalignment. To sum up, the current methods assume the removal of reflections in specific cases and provides its own dataset for study.

When we decomposed the reflection image into high and low frequency sub-images at different levels through wavelet transform, and observed that the reflections are almost present in high frequency sub-images. More specifically, we randomly selected 30 pairs of reflection and non-reflection images from the SIR² dataset [11], decomposed them into high and low frequency sub-images at different levels through SWT, and calculated the difference of mean square error (MSE). Most of the reflections can be observed at high frequencies, as shown in Fig. 1. In addition, current deep learning methods have not combined wavelet transform to remove reflections, but this method has been widely used in the fields of super-resolution [12, 13], haze removal [14] and object detection [15]. Hence, we proposed to use wavelet transform as prior knowledge to process high-frequency reflection and low-frequency background context separately to obtain the better reflection-removal results in this study.

Fig. 1. Percentage of MSE difference between reflected and non-reflected images.

Download Full Size | PDF

In this study, we propose a novel Translation-invariant Context-retentive Wavelet Reflection Removal Network (TiCrWRRNet), which can not only remove reflections but also retain the background context. The reflection image is decomposed into high-frequency and low-frequency sub-images at different levels several times with SWT. The decomposed high-frequency and low-frequency sub-images are then processed with different sub-networks to learn the end-to-end mapping to obtain the reflection maps at different levels, and finally to use to remove the reflections. Detail-enhanced reflection layer Removal Subnetwork (DRSn) and Detail-enhanced Reflection Information Transmission (DRIT) are both proposed to transmit the reflection layer information from DRSn to Context Retention Subnetwork (CRSn) through DRIT, so that CRSn can effectively separate the transmission layer and the reflection layer to further improve the results of reflection removal. To retain the background context and remove a small number of reflections that appear at low frequencies, we propose CRSn and establish a recursive context retention method to gradually remove low-frequency reflections at each level and retain the background context. We reconstructed the nth-level high and low-frequency de-reflection results by inverse wavelet transform (ISWT), performed Context Level Blending (CLB) with the low-frequency sub-images in the (n-1)th-level de-reflection results, and used for ISWT reconstruction with the high-frequency de-reflection results in the (n-1)th-level. It is repeated for the output results at each level with DRSn and CRSn until the final clean image is reconstructed. The contributions of this study are summarized as follows:

• A novel Translation-invariant Context-retentive Wavelet Reflection Removal Network (TiCrWRRNet) is proposed to effectively remove reflections while retaining background context of high- and low-frequency at each level.
• A novel Context Level Blending (CLB) combined with inverse wavelet Transformation (ISWT) is proposed to remove reflections and retaining background context to restore clean images and improve image quality.
• A novel Detail-enhanced Reflection Information Transmission (DRIT) is also designed, through which the reflection features of the high-frequency sub-images are extracted and then transmitted to CRSn, which greatly helps to separate the transmission layer and the reflection layer to achieve reflection removal.
• The experimental results indicate that the proposed method has better reflection removal performance compared with the state-of-the-art approaches in public datasets, and demonstrate that our method is not limited by the assumption that reflection removal is performed in a specific situation.

2. Related work

2.1. Single image reflection removal

It has been widely investigated for reflection removal in recent years. Current studies can be divided into multi-image reflection removal and single-image reflection removal. The former usually utilizes conditional constraints to alleviate the uncertainty of resolving reflections, e.g., polarization angles [16–18], or reflection differences [19]. These conditional priors are rarely considered in the context of a single image. However, this paper mainly focuses on the latter. At present, the reflection removal method for single image is mainly based on the content of the image, and usually makes assumptions about the image, as shown below:

(1) $$I = T + R$$

where I is the captured image, T is the transmitted layer, and R is the reflection layer. Single image reflection removal can be roughly divided into non-learning-based methods and deep-learning-based methods. In the former, due to the uncertainty of this reflection problem, different priors are employed to exploit the special properties of the background and reflection layer. For example, Levin and Weiss [8] used an image prior based on derivative filter sparsity that is optimized with iterative reweighted least squares (IRLS), but their method relies on the user's understanding of the background and reflective edge markings, which is very time consuming. Li and Brown [6] utilized a smooth image gradient prior, since reflected images are likely to be out of focus and blurry. Farid and Adelson [20] separated reflections from images using optical techniques and the assumption of statistical independence and used higher-order statistical moments to resolve reflection images taken through a linear polarizer at two different angles. However, while these methods work well when the assumptions hold, when those assumptions are violated, they cannot be widely used for various types of reflections in the real world.

In the latter case, i.e. single image reflection removal, since deep learning has achieved good performance on both high-level and low-level computer vision problems, its comprehensive modeling capabilities are also beneficial for reflection removal. Fan et al. [21] exploited edge information to better preserve image details. Zhang and Chen [1] utilized perceptual information during training to improve the fidelity of predictions. However, they mainly used on synthetic images, which led them to limitations in removing real reflections. Wan et al. [10] integrated image background information and multi-scale gradient information in a cooperative and unified framework to solve the reflection problem. In addition, in order to remove the strong reflections in some local areas, a statistical loss considering the gradient level statistics between background and reflections is proposed and a reflection image dataset is provided. Ding et al. [22] used the polarization properties of reflections to detect and remove reflections in LWIR images and invented a joint reflection detection method, they showed that this detection method can effectively identify the real reflection areas in the image, so it can effectively remove the reflection in LWIR images. Li et al [23] used the Stokes algebra and Mueller calculus formulae with a novel edge-based technique to separate the reflections of visible light, near infrared, and Long Wave Infrared Wavelengths (LWIR), in addition to exploiting spectral information and patch-wise operation to improve robustness, and results show that it can be applied to optically smooth reflecting and partially transmitting surfaces, such as those caused by glass surfaces. Zheng et al. [4] thought that the absorption effect needed to be considered, so they re-examined the formation model of the reflection image, and theoretically provided a single image reflection removal method. From the above, it can be seen that most of the current methods assume reflection removal under specific conditions. Although these methods have good results, real-world reflections are not specific to a certain hypothesis. Therefore, these methods are still limited.

2.2. Image restoration with wavelet prior

Wavelet transform is often used as image preprocessing or postprocessing [24,25], and down-sampling or up-sampling operations are also used to design deep networks [26,27] in today's deep learning process. However, wavelet transform is widely used in various fields [28–31] except reflection removal. For example, in the field of super resolution, Guo et al. [13] proposed a deep wavelet super resolution (DWSR) technique to obtain good SR results by using wavelet sub-bands as input to recover the missing details. In image restoration, Liu et al. [26] proposed a multi-level wavelet network (MWCNN) for image restoration and that can be effectively applied in image denoising, SISR and removal of artifacts generated by JPEG compression. In the field of rain removal, Yang et al. [14] proposed a recurrent wavelet learning and constructed dilated residual dense networks to alleviate the problem of different sizes of rain streaks in the training and testing phases, thereby effectively removing rain streaks.

3. Proposed method

3.1. Translation-invariant wavelet transform

In Eq. (1), we decompose the reflection image (${R^0}$) with Haar stationary wavelet transform to obtain wavelet bands of different scales. It is expressed as follows:

(2)$$\begin{array}{c}SWT({{R^0}} )= [{I_{LL}^1,I_{LH}^1,I_{HL}^1,I_{HH}^1} ],\\SWT({I_{LL}^1} )= [{I_{LL}^2,I_{LH}^2,I_{HL}^2,I_{HH}^2} ],\\ \ldots\\SWT({I_{LL}^{N - 1}} )= [{I_{LL}^N,I_{LH}^N,I_{HL}^N,I_{HH}^N} ]\end{array}$$

Among them, “SWT” represents the wavelet decomposition process, “I” represents the decomposed image, while LL (approximation sub-band LL), LH (horizontal detail LH), HL (vertical detail HL) and HH (diagonal detail HH) correspond to four sub-images. “LL” represents the context of the image, while “LH”, “HL” and “HH” represent the details of the image, and “N” denote the level number of wavelet decomposition.

3.2. Multi-level subnetwork models

In Fig. 2, The reflection image is decomposed into multiple levels of high and low frequency images with SWT and then processed with the Context Retention Subnetwork (CRSn) and Detail-enhanced reflection layer Removal Subnetwork (DRSn) respectively to predict the corresponding images with reflection removal and context retention. In order for the network to learn higher-level complex mappings, lower-level features are made shareable with higher-level ones. For each network structure, local residual learning and whole residual learning are used, which can help when learning highly complex features and gradient backpropagation.

(3)$$\begin{array}{c}G_{LL}^N = CRSn({I_{LL}^N} )\\G_{Detail}^N = DRSn({I_{LH}^N,I_{HL}^NI_{HH}^N} )\\ \ldots\\G_{LL}^2 = CRSn({I_{LL}^2} )\\G_{Detail}^2 = DRSn({I_{LH}^2,I_{HL}^2I_{HH}^2} )\\G_{LL}^1 = CRSn({I_{LL}^1} )\\G_{Detail}^1 = DRSn({I_{LH}^1,I_{HL}^1I_{HH}^1} )\end{array}$$

where $G_{LL}^n$ is the image retained by n levels low-frequency context, and $G_{Detail}^n$ is the image of high-frequency de-reflection for those n levels.

Fig. 2. Network architecture of proposed TiCrWRRNet.

Download Full Size | PDF

3.2.1 Detail-enhanced reflection layer removal subnetwork

In Fig. 3, the DRSn is modified from [32] by removing the upscale part. In addition, in order to obtain the high-frequency image $\textrm{I}_{\textrm{Detail}}^\textrm{n}$ of the nth-level SWT wavelet decomposition, we use a convolutional layer with a kernel size of 3*3 to get the shallow feature $C_{H{C_1}}^n$, as follows:

(4)$$\; I_{Detail}^n = Concat[{I_{LH}^n,I_{HL}^n,I_{HH}^n} ]$$

(5)$$C_{H{C_1}}^n = Conv({I_{Detail}^n} )$$

where Concat stands for concatenation, and Conv stands for convolution operation. The shallow feature $C_{H{C_1}}^n$ can obtain deep features through H ${f_{RDB}}$ (Residual Dense Blocks) of DRSn, as shown below:

(6)$$C_{D{D_i}}^n = {f_{RDB{,_i}}}({C_{D{D_{i - 1}}}^n} )+ {f_{RL{B_i}}},\; \; 1 \le i \le H$$

$C_{D{D_0}}^n = C_{H{C_1}}^n\; \textrm{, }{f_{RDB{,_i}}}$ represents the i-th Residual Dense Blocks operation, ${f_{RDB{,_i}}}$ is a combined convolution operation with ReLU. $C_{D{D_i}}^n$ represents the operation result of the i-th Residual Dense Blocks. In order to obtain clear features, we perform feature fusion on the outputs of H Residual Dense Blocks of DRSn and perform channel compression using 1*1 convolution ($Con{v_{1 \times 1}}$), and finally adopt the whole residual connection, as follows:

(7)$$C_{H{C_2}}^n = Conv(Con{v_{1 \times 1}}({Concat[{{C^n}_{D{D_1}}, \ldots ,{C^n}_{D{D_H}}} ]} )) + C_{H{C_1}}^n$$

Fig. 3. Architecture of proposed CRSn and DRSn.

Download Full Size | PDF

Finally, after putting $C_{H{C_2}}^n$ into the convolution layer output, the high-frequency de-reflection image $G_{Detail}^n$ can be obtained, as shown below:

(8)$$G_{Detail}^n = Conv({C_{H{C_2}}^n} )$$

3.2.2 Context retention subnetwork

In Fig. 1, although most of the background context exists at low frequencies, there are still a few reflections. Therefore, we strengthen the Context Retention Subnetwork to achieve the effect of retaining the background context and removing reflections. As shown in Fig. 3, our CRSn refers to the network structure [33] and modifies it to remove the batch-normalization layers and sub-pixel layers in the network structure respectively to obtain a better recursive effect. First, we extract the shallow features of the low-frequency image ${I_{LL}}$ decomposed by the SWT operation through the convolution layer, as shown below:

(9)$$C_{LC_{1}}^{n} = PReLU\left( {Conv\left( {I_{LL}^n} \right)} \right),n = 1,2 \ldots .N$$

where $C_{{LC}_{1}}^{n}$ is the first-layer feature obtained by decomposing the nth-order low-frequency image $I_{LL}^n$ from the reflectance map SWT. Conv is the convolution operation, and PReLU is the excitation function. The depth feature can be obtained by passing the shallow feature $C_{L{C_1}}^n$ through the H-th ${f_{RLB}}$ (Residual Blocks) of CRSn, as shown below:

(10)$$\; C_{L{L_i}}^n = {f_{RLB{,_i}}}({C_{L{L_{i - 1}}}^n} ),{\kern 1cm} 1 \le i \le H$$

where $C_{L{L_0}}^n = C_{L{C_1}}^n$, $C_{L{L_i}}^n$ represents the operation result of the ith Residual Blocks, and ${f_{RLB{,_i}}}$ represents the i-th Residual Blocks operations, ${f_{RLB{,_i}}}$ is an operation combining convolution and PReLU. $C_{L{C_1}}^n$ can obtain feature map ${C_{L{L_H}}}$ after performs deep feature acquisition and residual learning is performed for H ${f_{RLB}}$, and ${C_{L{L_H}}}$ is subjected to convolution operation, plus shallow feature ${C_{L{C_1}}}$, and finally the result of the residual feature fusion $C_{L{C_2}}^n$ can be obtained, as shown below:

(11)$$C_{L{C_2}}^n = Conv({C_{L{L_H}}^n} )+ C_{L{C_1}}^n$$

Finally, after putting $C_{L{C_2}}^n$ into the convolution layer output, the low-frequency preserved structural image $G_{LL}^n$ can be obtained, as shown below:

(12)$$G_{LL}^n = Conv({C_{L{C_2}}^n} )$$

3.2.3 Context level blending

Context-level blending (CLB) can effectively remove reflections from low-frequency images and retain background context. To enhance our method's result of retaining low-frequency background context at each level, we use CLB at each level recursively. As shown in Fig. 2, we first connect the high and low frequency images of the n-th level through DRSn and CRSn to do the ISWT inverse wavelet transformation. Then, the result of ISWT of the n-th level high and low frequency image and the (n-1)-th level low frequency context retention result $G_{LL}^{n - 1}$ are used for CLB operation to help strengthen the low frequency reflection removal and retain the background context information. Finally, an ISWT inverse wavelet transformation is performed on the weighted image and the (n-1)-th order high-frequency de-reflection result $G_{Detail}^{n - 1}$, and it is repeated recursively until the last-level de-reflection image is reconstructed, that is, the final de-reflection image is restored. The reflection result, ${\textrm{r}^0}$, looks like this:

(13)$$SLF = image1\mathrm{\;\ \ast \;\ }({1.0 - \alpha } )\, +\,image2\mathrm{\;\ \ast \;\ }\alpha $$

(14)$$\begin{array}{c} {r^{n - 1}} = ISWT\left( \begin{array}{c} {G_{Detail}^{n - 1}}\\ ,CLB\left( \begin{array}{c} {G_{LL}^{n - 1},ISWT}\\ ({\,{{Concat}\;({G_{LL}^n,G_{Detail}^n} )} )} \end{array} \right) \end{array} \right)\\ {{C^{n - 2}} = ISWT({G_{Detail}^{n - 2},CLB({G_{LL}^{n - 2},{r^{n - 1}}} )} )}\\ \ldots \\ {{r^1} = ISWT({G_{Detail}^1,CLB({G_{LL}^1,{r^2}} )} )} \end{array}$$

where ISWT stands for inverse wavelet transformation operation, Concat stands for concatenation, α is the weighting coefficient, $G_{LL}^n$ and $G_{\textrm{Detail}}^n$ represent the result of the low-frequency image decomposed by the n-th level SWT wavelet entering CRSn and the result of the high frequency image decomposed by the nth order SWT wavelet entering the DRSn respectively. CLB means that two images are subjected to the Context Level Blending (CLB) operation, where different alpha values will affect the result of the final low-frequency image.

3.2.4 Detail-enhanced reflection information transmission

In the low-frequency DRSn and high-frequency CRSn networks, the terms ${\textrm{f}_{RLB}}$ and ${\textrm{f}_{\textrm{RDB}}}$ were mentioned. RLB and RDB are operations for extracting deep features to remove reflections. The enrich ${\textrm{f}_{\textrm{RLB}}}$ and ${\textrm{f}_{\textrm{RDB}}}$ are enriched through Detail-enhanced Reflection Information Transmission (DRIT) to enrich features and increase robustness [34]. In addition, in order for the network to learn how to separate the Transmission layer and the Reflection layer and to get a clear image, our network only learns the residual between the input and output and applies a whole residual connection [35].

The architecture of RLB and RDB is shown in Fig. 4 respectively. RLB contains H residual blocks with the same structure. We introduce the residual block [33], but remove the batch regularization layer and use PReLU [36] as the activation function instead as well as two convolutional layers with kernel size 3*3 and 64 feature maps.

(15)$${f_{RLB,h}} = PreLU(Conv({PreLU({Conv({{C_{L{L_{h - 1}}}}} )} )} )+ {C_{L{L_{h - 1}}}}$$

where ${f_{RLB,h}}\; i$s the h-th RLB operation, Conv represents the convolution operation, and PReLU is the activation function. We also introduce the residual dense block (RDB) [32], which contains densely connected layers, local residual learning and local feature fusion. RDB is often used to establish tight connections between layers and has been widely used to solve single-image super-resolution (SISR) tasks [32]. However, we use the application of RDB in the field of single-image reflection removal. As shown in Fig. 4, the high-frequency reflection removal network is divided into H RDB, assuming that the h-th RDB is concatenated in the middle and the output feature map after the convolution layer is as follows:

(16)$${C_{D{D_{h,m}}}} = CR({Concat[{{C_{D{D_{h - 1}}}},{C_{D{D_{h,1}}}},{C_{D{D_{h,2}}}}, \ldots ,{C_{D{D_{h,m - 1}}}}} ]} )$$

where ${C_{D{D_{h,m}}}}$ represents the feature map output by the m-th convolutional layer in the h-th RDB, Concat represents concatenation, $[{{C_{D{D_{h - 1}}}},{C_{D{D_{h,1}}}},{C_{D{D_{h,2}}}}, \ldots ,{C_{D{D_{h,m - 1}}}}} ]$ is the string of previous (h-1) convolution output feature maps. Next, CR represents the combined operation of the convolution Conv and the ReLU excitation function in the m-th convolutional layer. The outputs of the previous RDB and the other layers are directly connected to all subsequent layers, which not only preserves the feed-forward properties, but also extracts local dense features.

Fig. 4. Architecture of RLB and RDB.

Download Full Size | PDF

${C_{D{D_{h - 1}}}}$ is defined as the input of the extracted depth feature of the h-th residual dense block. For the h-th residual dense block, the input is the concatenation of all the outputs of the previous layer and the original input. It can be expressed as $[{{C_{D{D_{h - 1}}}},{C_{D{D_{h,1}}}},{C_{D{D_{h,2}}}}, \ldots ,{C_{D{D_{h,m - 1}}}}} ]$. So, the output looks like this:

(17)$${C_{D{D_{h,\,\,fusion}}}} = {f_{fusion,h}}({[{{C_{D{D_{h - 1}}}},{C_{D{D_{h,1}}}},{C_{D{D_{h,2}}}}, \ldots ,{C_{D{D_{h,m - 1}}}}} ]} )$$

where ${f_{fusion,h}}$ represents the local feature fusion operation. In this study, a convolution with a kernel size of 1*1 is used to control the output depth and concatenation of the h-th residual dense block to tightly connect the deep local features. Local residual learning is used between different residual dense blocks to solve the problem of gradient disappearance when training residual dense networks. Here is how it works:

(18)$${C_{re{s_h}}} = {C_{D{D_{h,\,\,fusion}}}} + {C_{D{D_{h - 1}}}}$$

where h is the h-th local residual feature map. In order to separate the Transmission layer and the Reflection layer, we use Detail-enhanced Reflection Information Transmission (DRIT) in addition to the local residual connection to help the output of the residual block (RLB) of CRSn, as follows:

(19)$$\; \; {f_{RLB,h}} = {C_{re{s_h}}} + {f_{RDB,h}}\,\,$$

where h is the h-th residual dense block operation.

3.3. Loss function

There are p pairs of reflective and non-reflective images ${\{{{y_i},{x_i}} \}_{i = 1, \ldots ,p}}$. The nth-level images ${y_{i,n}}$ and ${x_{i,n}}$ are obtained by SWT in the n-th level (n = 1, 2, …, N):

(20)$$\begin{array}{c}{y_{i,n}}=\{y_{i,n}^{LL},y_{i,n}^{Detail}\}\\{x_{i,n}}=\{x_{i,n}^{LL},x_{i,n}^{Detail}\}\end{array}$$

(21)$$Loss(B) = \frac{1}{p}\mathop \sum \limits_{i = 1}^p \mathop \sum \limits_{n = 1}^N \{\left\|(C({y_{i,n}^{LL};B} )- x{_{i,n}^{LL}}\right\|^2 + \left\|D({y_{i,n}^{Detail};B} )- x_{i,n}^{Detail}\right\|)\} $$

$C\left(\cdot\right)$ represents CRSn, and $D\left(\cdot\right)$ represents DRSn. Let B represent the parameters of the entire model, which are optimized by backpropagation. $y_{i,n}^{Detail}$ and $x_{i,n}^{Detail}$ represent the high-frequency reflection image and high-frequency non-reflection image of the n-th level SWT, $y_{i,n}^{LL}$ and $x_{i,n}^{LL}$ represent the low-frequency reflection image and low-frequency non-reflection image of the n-th level SWT, respectively, and n represents the level number of SWT. In this study, the joint error is used to train the parameter B, so that it makes it possible to jointly estimate the high frequency and low frequency images of ${y_i}$ and ${x_i}$:

(22)$$\begin{array}{c} Loss\left( B \right)| = \frac{1}{P}\mathop \sum \limits_{i = 1}^P \mathop \sum \limits_{n = 1}^N \left\|D\left( {y_{i,n}^{Detail};B} \right) - x_{i,n}^{Detail}\right\|\\ { + \frac{1}{P}\mathop \sum \limits_{i = 1}^P \mathop \sum \limits_{n = 1}^{N - 1} \left\|C\left( {y_{i,n}^{LL};B} \right) - x{{_{i,n}^{LL}}}\right\|^2 + \frac{1}{P}\mathop \sum \limits_{i = 1}^P \left\|C\left( {y_{i,N}^{LL};B} \right) - x{{_{i,N}^{LL}}}\right\|^2} \end{array}$$

Finally, connect the high and low frequency images of each level through and , and use ISWT to obtain the final de-reflection result.

4. Experiments

4.1. Dataset description

It is difficult to obtain the reflection datasets in the real world and the size of datasets is usually too small, and it would lead to the deep learning model over-fitted. Therefore, we perform data augmentation to overcome the issues. In order to train the deep learning model with high accuracy and good generalization, in addition to the structure of the model itself, one of the most important things is how much data we have. The larger the amount of data, the more complete the data, the better. Hence, the SIR² dataset [11] is used by changing the color and flipping the image to augment the data. More specifically, in data augmentation of the experiments, we perform image flipping (including horizontal flipping, vertical flipping, horizontal and vertical flipping) and image enhancement (including adjusting brightness and contrast).

Furthermore, data balancing is used to avoid over-learning the features of a certain category, causing images of other categories to be unrecognizable. Therefore, when there are multiple categories of data sets, it is necessary to ensure that the data of each category does not fall too far to improve the accuracy of the model. The SIR² dataset [11] is divided into three categories with a total of 454 images, namely 199 Postcard images, 200 Object images, and 55 Wild images. However, the number of images in each category has a gap, which may affect the training of the model. Therefore, we balance the number of images for each category. After balance, the numbers of images are 1990 Postcards, 2000 Objects, and 2200 Wilds in the experiments respectively.

4.2. Implementation details and hyper-parameter setting

The proposed TiCrWRRNet is trained on an NVIDIA RTX2080 TI GPU with the Tensorflow framework. For the training set, the size of each mini batch is 16. We use Haar SWT to perform wavelet decomposition on the training set, and the level of wavelet transform is set to 2, and the decomposed high and low frequency sub-images of each level are input into DRSn and CRSn respectively for training. Adam [37] is used as the network optimization, and the learning rate for training CRSn and DRSn is set to 10⁻⁴. The numbers of RLB and RDB in the network are both set to a kernel size of 16.

4.3. Comparisons with the state-of-the-art

To evaluate the performance of our method, we compare our method with six state-of-the-art reflection removal approaches, namely Chang [38], Kim [3], Peng [39], RmNet [2], ERRnet [5] and IBCLN [40], on the SIR² dataset [11], which contains images from three categories: Postcard, Object and Wild. In addition, the Nature [40] dataset is also used to demonstrate that our method is not limited to specific assumptions for reflection removal.

4.3.1 Quantitative evaluation

In the experiments, for a fair comparison, the codes provided by the authors were used and the training parameters were directly used as suggested in their original paper. Table 1 lists the quantitative comparison of the state-of-the-art approaches and ours on four benchmark datasets in terms of SSIM and PSNR. The results indicate that our method outperforms the state-of-the-art approaches on both SSIM and PSNR, and it means that our method recovers images with better visual effect and closer to the reflection-free image.

Table 1. Quantitative comparison of state-of-the-art and our methods on four benchmark datasets, including three sub-datasets of SIR²[11] dataset.^a

View Table | View all tables in this article

4.3.2. Visual comparison

Visual comparisons among the six state-of-the-art approaches and our methods in SIR² dataset [11] are shown in Fig. 5. In first example, most approaches cannot effectively remove the chromatic aberration caused by reflection. However, our method can fade this part and make the color closer to the original image. In the second example, on the background pattern and in the third, on the right half, our results clearly outperform the results of other approaches. In these examples, although there are some areas where reflections from small details could not be removed completely, our method can remove reflections more effectively and restore the details of the background more clearly.

Fig. 5. Visual comparison among the state-of-the-art approaches and our method in SIR2 dataset [11].

Download Full Size | PDF

4.4. Ablation studies

4.4.1 Detail-enhanced reflection information transmission

DRIT is proposed to extract reflections at high frequencies, which is passed on to CRSn to help achieve better performance. Experimental analysis is used to find how the proposed DRIT helps to improve the results of reflection removal. To demonstrate its advantages, ablation experiments are conducted by comparing w/o DRIT and DRIT. Both models are evaluated on the SIR² dataset [11]. The results are shown in Table 2. The quantitative results show that the effect of adding DRIT is better in terms of both PSNR and SSIM, thus verifying that the use of DRIT can effectively improve reflection removal. In the visual comparison, as shown in Fig. 6, in the first example, it can be seen that the result of w/o DRIT cannot completely remove the reflection although the reflection is diluted, so there is still an afterimage of residual reflection. However, the result of adding DRIT shows that it can clearly remove reflections relatively cleanly. In the second example, the same result of adding DRIT is better and closer to the ground truth. From the quantitative and visual comparison, it can be observed that DRIT performs better than “w/o DRIT” in both detail and reflection removal.

Fig. 6. Visual comparison of w/o DRIT and DRIT.

Download Full Size | PDF

Table 2. Quantitative evaluation without and with DRIT

View Table | View all tables in this article

5. Conclusion

A novel TiCrWRRNet is proposed to effectively achieve the reflection removal. Deep learning sub-networks can converge efficiently and quickly by using high and low frequency sub-images obtained from the decomposition of SWT. In addition, a multi-level wavelet transform is used to further enhance the removal of reflections. The sub-images at different levels are trained with DRSn and CRSn to extract richer image background context and reflections respectively. Since a small number of reflections still exists in low frequencies, a DRIT is proposed to extract the reflection layer features of high-frequency sub-images to aid CRSn to separate the transmission layer and the reflection layer more effectively to achieve the performance of reflection removal. Furthermore, CLB and ISWT are employed on the low-frequency results at each level recursively to recover the final clean image. Finally, comparative and ablation studies are conducted to demonstrate that the proposed method both shows better performance than state-of-the-art ones, and it is validated to generalize effectively to other complex datasets.

Funding

Ministry of Science and Technology, Taiwan (MOST108-2410-H-194-088-MY3, MOST110-2221-E-194-027-MY3, MOST111-2410-H-194-038-MY3).

Acknowledgment

The authors acknowledge the financial funding of this work. We also thank the anonymous reviewers for their critical comments on the manuscript.

Disclosures

The authors declare no conflicts of interest.

Data availability

Data underlying the results presented in this paper are available in Refs. [11,40].

References

1. X. Zhang, R. Ng, and Q. Chen, “Single image reflection separation with perceptual losses,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (IEEE, 2018), pp. 4786–4794.

2. Q. Wen, Y. Tan, J. Qin, W. Liu, G. Han, and S. He, “Single image reflection removal beyond linearity,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (IEEE, 2019), pp. 3771–3779.

3. S. Kim, Y. Huo, and S.-E. Yoon, “Single Image Reflection Removal With Physically-Based Training Images,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (IEEE, 2020), pp. 5164–5173.

4. Q. Zheng, B. Shi, J. Chen, X. Jiang, L.-Y. Duan, and A.C. Kot, “Single Image Reflection Removal With Absorption Effect,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (IEEE, 2021), pp. 13395–13404.

5. K. Wei, J. Yang, Y. Fu, D. Wipf, and H. Huang, “Single image reflection removal exploiting misaligned training data and network enhancements,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (IEEE, 2019), pp. 8178–8187.

6. Y. Li and M.S. Brown, “Exploiting reflection change for automatic reflection removal,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (IEEE, 2013), pp. 2432–2439.

7. R. Wan, B. Shi, T.A. Hwee, and A.C. Kot, “Depth of field guided reflection removal,” in 2016 IEEE International Conference on Image Processing (ICIP) (IEEE, 2016), pp. 21–25.

8. A. Levin and Y. Weiss, “User assisted separation of reflections from a single image using a sparsity prior,” IEEE Trans. Pattern Anal. Machine Intell. 29(9), 1647–1654 (2007). [CrossRef]

9. Q. Fan, J. Yang, G. Hua, B. Chen, and D. Wipf, “A generic deep architecture for single image reflection removal and image smoothing,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (IEEE, 2017), pp.3238–3247.

10. R. Wan, B. Shi, H. Li, L.-Y. Duan, A.-H. Tan, and A. K. Chichung, “CoRRN: Cooperative reflection removal network,” IEEE Trans. Pattern Anal. Mach. Intell. 42(12), 2969–2982 (2020). [CrossRef]

11. R. Wan, B. Shi, L.-Y. Duan, A.-H. Tan, and A. C. Kot, “Benchmarking single-image reflection removal algorithms,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (IEEE, 2017), pp. 3922–3930.

12. W. Y. Hsu and P. C. Chen, “Pedestrian Detection Using Stationary Wavelet Dilated Residual Super-Resolution,” IEEE Trans. Instrum. Meas. 71, 5001411 (2022). [CrossRef]

13. W. Y. Hsu and P. W. Jian, “Detail-Enhanced Wavelet Residual Network for Single Image Super-Resolution,” IEEE Trans. Instrum. Meas. 71, 5016913 (2022). [CrossRef]

14. W. Y. Hsu and Y. S. Chen, “Single Image Dehazing Using Wavelet-based Haze-Lines and Denoising,” IEEE Access 9, 104547–104559 (2021). [CrossRef]

15. W. Y. Hsu and W. Y. Lin, “Adaptive Fusion of Multi-Scale YOLO for Pedestrian Detection,” IEEE Access 9, 110063–110073 (2021). [CrossRef]

16. N. Kong, Y.-W. Tai, and J.S. Shin, “A physically-based approach to reflection separation: from physical modeling to constrained optimization,” IEEE Trans. Pattern Anal. Mach. Intell. 36(2), 209–221 (2014). [CrossRef]

17. Y. Lyu, Z. Cui, S. Li, M. Pollefeys, and B. Shi, “Reflection separation using a pair of unpolarized and polarized images,” in Advances in Neural Information Processing Systems32, 1 (2019).

18. Y. Pang, M. Yuan, Q. Fu, P. Ren, and D.-M. Yan, “Progressive polarization based reflection removal via realistic training data generation,” Pattern Recognition 124, 108497 (2022). [CrossRef]

19. A. Punnappurath and M.S. Brown, “Reflection removal using a dual-pixel sensor,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (IEEE, 2019), pp. 1556–1565.

20. H. Farid and E. H. Adelson, “Separating reflections from images by use of independent component analysis,” J. Opt. Soc. Am. A 16(9), 2136–2145 (1999). [CrossRef]

21. Q. Fan, D. P. Wipf, G. Hua, and B. Chen, “Revisiting deep image smoothing and intrinsic image decomposition,” arXiv:1701.02965, 4 (2017).

22. Y. Ding, A. Ashok, and S. Pau, “Real-time robust direct and indirect photon separation with polarization imaging,” Opt. Express 25(23), 29432–29453 (2017). [CrossRef]

23. N. Li, Y. Zhao, Q. Pan, and S. G. Kong, “Removal of reflections in LWIR image with polarization characteristics,” Opt. Express 26(13), 16488–16504 (2018). [CrossRef]

24. W. Y. Hsu and W. Y. Lin, “Ratio-and-Scale-Aware YOLO for Pedestrian Detection,” IEEE Trans. on Image Process. 30, 934–947 (2021). [CrossRef]

25. W. Y. Hsu, “Automatic Compensation for Defects of Laser Reflective Patterns in Optics-Based Auto-Focusing Microscopes,” IEEE Sens. J. 20(4), 2034–2044 (2020). [CrossRef]

26. W. Y. Hsu and C. J. Chung, “A Novel Eye Center Localization Method for Head Poses with Large Rotations,” IEEE Trans. on Image Process. 30, 1369–1381 (2021). [CrossRef]

27. W. Y. Hsu, “Automatic pedestrian detection in partially occluded single image,” Integrated Computer-Aided Eng. 25(4), 369–379 (2018). [CrossRef]

28. M. Xi, H. Chen, Y. Yuan, G. Wang, Y. He, Y. Liang, J. Liu, H. Zheng, and Z. Xu, “Bi-frequency 3D ghost imaging with Haar wavelet transform,” Opt. Express 27(22), 32349–32359 (2019). [CrossRef]

29. J.-K. Kim, K.-J. Kim, J.-W. Kang, K.-J. Oh, J.-W. Kim, D.-W. Kim, and Y.-H. Seo, “New compression method for full-complex holograms using the modified zerotree algorithm with the adaptive discrete wavelet transform,” Opt. Express 28(24), 36327–36345 (2020). [CrossRef]

30. Z. Qiao, X. Shi, R. Celestre, and L. Assoufid, “Wavelet-transform-based speckle vector tracking method for X-ray phase imaging,” Opt. Express 28(22), 33053–33067 (2020). [CrossRef]

31. W. Y. Hsu and C. J. Chung, “A Novel Eye Center Localization Method for Multiview Faces,” Pattern Recognition 119, 108078 (2021). [CrossRef]

32. Y. Zhang, Y. Tian, Y. Kong, B. Zhong, and Y. Fu, “Residual dense network for image super-resolution,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (IEEE, 2018), pp. 2472–2481.

33. C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, and Z. Wang, “Photo-realistic single image super-resolution using a generative adversarial network,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (IEEE, 2017), pp. 4681–4690.

34. L. Bao, Z. Yang, S. Wang, D. Bai, and J. Lee, “Real image denoising based on multi-scale residual dense block and cascaded U-Net with block-connection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (IEEE, 2020), pp. 448–449.

35. X. Fu, J. Huang, D. Zeng, Y. Huang, X. Ding, and J. Paisley, “Removing rain from single images via a deep detail network,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (IEEE, 2017), pp. 3855–3863.

36. K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on imagenet classification,” in Proceedings of the IEEE International Conference on Computer Vision (IEEE, 2015), pp. 1026–1034.

37. D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv:1412.6980 (2014).

38. Y.-C. Chang, C.-N. Lu, C.-C. Cheng, and W.-C. Chiu, “Single Image Reflection Removal with Edge Guidance, Reflection Classifier, and Recurrent Decomposition,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (IEEE, 2021), pp. 2033–2042.

39. Y.-T. Peng, K.-H. Cheng, I.-S. Fang, W.-Y. Peng, and S. Wu, “Single Image Reflection Removal based on Knowledge-distilling Content Disentanglement,” IEEE Signal Process. Lett. 29, 568–572 (2022). [CrossRef]

40. C. Li, Y. Yang, K. He, S. Lin, and J. E. Hopcroft, “Single Image Reflection Removal through Cascaded Refinement,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (IEEE, 2020), pp. 3565–3574.

Dataset			Chang [38]	Kim [3]	Peng [39]	Wen [2]	Wei [5]	Li [40]	Ours
SIR² [11]	Postcard	PSNR	22.73	23.64	23.29	19.71	22.04	23.39	25.92
	Postcard	SSIM	0.860	0.901	0.890	0.808	0.876	0.875	0.929
	Object	PSNR	25.61	23.11	23.92	20.33	24.87	24.87	27.24
	Object	SSIM	0.905	0.894	0.860	0.793	0.896	0.893	0.933
	Wild	PSNR	25.41	25.55	25.12	21.98	24.25	24.71	25.41
	Wild	SSIM	0.892	0.905	0.910	0.821	0.853	0.886	0.922
Nature [40]		PSNR	20.94	21.20	22.35	19.36	22.18	23.57	23.89
		SSIM	0.861	0.788	0.882	0.725	0.756	0.783	0.813
Average		PSNR	23.67	23.38	23.67	20.35	23.34	24.14	25.62
		SSIM	0.880	0.872	0.886	0.787	0.845	0.859	0.900

Dataset			Chang [38]	Kim [3]	Peng [39]	Wen [2]	Wei [5]	Li [40]	Ours
SIR² [11]	Postcard	PSNR	22.73	23.64	23.29	19.71	22.04	23.39	25.92
	Postcard	SSIM	0.860	0.901	0.890	0.808	0.876	0.875	0.929
	Object	PSNR	25.61	23.11	23.92	20.33	24.87	24.87	27.24
	Object	SSIM	0.905	0.894	0.860	0.793	0.896	0.893	0.933
	Wild	PSNR	25.41	25.55	25.12	21.98	24.25	24.71	25.41
	Wild	SSIM	0.892	0.905	0.910	0.821	0.853	0.886	0.922
Nature [40]		PSNR	20.94	21.20	22.35	19.36	22.18	23.57	23.89
		SSIM	0.861	0.788	0.882	0.725	0.756	0.783	0.813
Average		PSNR	23.67	23.38	23.67	20.35	23.34	24.14	25.62
		SSIM	0.880	0.872	0.886	0.787	0.845	0.859	0.900

Translation-invariant context-retentive wavelet reflection removal network

Abstract

1. Introduction

2. Related work

2.1. Single image reflection removal

2.2. Image restoration with wavelet prior

3. Proposed method

3.1. Translation-invariant wavelet transform

3.2. Multi-level subnetwork models

3.2.1 Detail-enhanced reflection layer removal subnetwork

3.2.2 Context retention subnetwork

3.2.3 Context level blending

3.2.4 Detail-enhanced reflection information transmission

3.3. Loss function

4. Experiments

4.1. Dataset description

4.2. Implementation details and hyper-parameter setting

4.3. Comparisons with the state-of-the-art

4.3.1 Quantitative evaluation

4.3.2. Visual comparison

4.4. Ablation studies

4.4.1 Detail-enhanced reflection information transmission

5. Conclusion

Funding

Acknowledgment

Disclosures

Data availability

References

Data availability

Cited By

Figures (6)

Tables (2)

Equations (22)

Optics Express