Dual stream fusion network for underwater image enhancement of multi-scale turbidity restoration and multi-path color correction

Kai Ji; Weimin Lei; Wei Zhang; Xinyi Chen

doi:10.1364/OE.509344

1. Introduction

In recent years, there has been a significant focus on marine resources, leading to advancements in fields such as marine geological exploration and deep-sea biological investigation. Underwater operations often require high-quality visualization to improve efficiency. However, underwater imaging presents more challenges than land imaging due to its complex and harsh environment. When light propagates through water, it is often subject to scattering caused by particles in the medium and absorption by the water body. This can result in underwater images that are blurry, noisy, and distorted in color [1]. Backward scattering can make the captured image appear foggy, while forward scattering can cause edge overlap and potential artifacts. Additionally, absorption can give the image a blue-green hue. Image processing technology can effectively enhance image quality, making research on underwater image enhancement of great significance.

In response to the prominent problems of haze and color cast in underwater images, enhancement methods typically involve defogging and color correction. Color histogram [2], gamma correction [3], white balance [4] and other methods are common digital image processing methods to remove color deviation. Since image dehazing is an ill-posed problem, existing methods for image dehazing often rely on strong priors or assumptions, such as the dark channel prior (DCP) [5] or Jaffe-McGlamery model [6], to restore the transmission map, global atmospheric light, and scene radiance. Deep learning has recently been effective in enhancing underwater images [7–12]. Existing underwater enhancement methods can achieve reasonable results in most cases. However in turbid water bodies, water disturbances can kick up sediment or bubbles that can be millimeters in diameter. Suspended solids on larger scales not only interfere with the imaging of optical devices through scattering, but also bring a certain degree of obscuration phenomenon, leading to dark or bright spots in underwater images. Therefore, common enhancement methods are inadequate for dealing with foggy underwater images well, and also do not dig distinctive features of underwater images. From an image processing perspective, the impact of large underwater particles on imaging can be considered as speckle noise or salt-pepper noise. This can be effectively suppressed by median filtering. Additionally, it is important to recover the true colors of haze-free images while avoiding issues such as abnormal brightness and over-enhancement.

To address these issues, we propose a dual-stream underwater enhancement method for turbidity recovery and color correction, which is composed of Multi-scale Turbidity Restoration Module (MTRM), Multi-path Color Correction Module (MCCM) and Collaborative Attention Fusion Module (CAFM). In the turbidity restoration stream, the noise caused by large particles is first eliminated by adaptive median filtering with a learned window. Then an ameliorative U-Net [13] extracts non-adjacent layer information from frequency-based dense feature fusion for further fog removal. In the color correction stream, the presence of cross-connections and multi-path residual blocks allows MCCM to expend the receptive field while taking full advantage of internal features learned from different color channels. Finally, CAFM draws on the non-local attention strategy to learn and refine correlated features in the spatial domain and channel domains, respectively, to achieve the rational fusion of the haze-free and color-corrected images.

Our main contributions in this paper are as follows:

(1) We propose a dual-stream fusion network for underwater image enhancement, which effectively restores the images with clear appearance and rich content.
(2) We design an adaptive median filter to suppress the noise caused by large particles, and a modified U-Net for further defogging in the turbidity restoration stream. The frequency-based dense feature fusion breaks through the barriers between non-adjacent layers of the U-Net. In the color correction stream, multi-path residual blocks extend the receptive field, and attention-based cross connection can share features from the color channel. Non-local attention is introduced to capture global dependencies in the spatial and channel dimensions, respectively, at the final fusion stage.
(3) Extensive experiments are conducted to evaluate the performance of our method, which demonstrate its effectiveness in color correction and turbidity restoration. The ablation experiments prove the importance of the method components.

This paper proceeds as follows: Section 2 reviews existing methods for enhancing underwater images in the literature. In Section 3, we propose our model. Section 4 provides the experiments and analysis. Finally, we conclude our work and present the future work in Section 5.

2. Related works

2.1 Non-physical model methods

Non-physical model methods do not consider the physical imaging process. Instead, they focus on adjusting the pixel values of the image to produce subjectively and visually superior results, including histogram, Retinex, and transform domain methods. Ghani et al. [14] combined global and local histogram stretching to enhance image contrast, and converted the image to the HSV (Hue, Saturation, Value) space for color correction. Fu et al. [15] proposed a variational Retinex model, and adopted histogram specification to process the illumination component to prevent overexposure. They also applied CLAHE (Contrast Limited Adaptive Histogram Equalization) [16] to dynamically stretch the reflectance component. Finally, they multiplied the processed reflectance and illumination components to output a clear underwater image. To eliminate non-uniform illumination in underwater scenes, Zhou et al. [17] applied the multi-scale Retinex algorithm based on the human visual system to the Nonsubsampled Contourlet (NSCT) domain. Shahrizan et al. [18] presented a deep-sea image enhancement algorithm that integrates enhanced background filtering and wavelet fusion to reduce the blue and green bias of underwater images. These methods partially improve the contrast and image quality in underwater scenes, without the need for complex physical parameters. However, they are prone to produce over-enhanced or under-enhanced images due to their lack of robustness.

2.2 Physical model methods

Model-based methods typically utilize prior statistical rules to estimate relevant parameters, and then apply the inverse operation of the underwater imaging model to obtain clear underwater images. The most well-known is the dark channel prior (DCP) algorithm. Drews et al. [19] believed that the brightness of the red channel in underwater images was similar to the dark channel. Therefore, the dark channel prior is only applied to the green and blue channels with less attenuation, achieving underwater image restoration. Galdran et al. [20] inverted the red channel and combined it with the green and blue channels to create a new input image, and calculated the transmittance map through DCP. In addition to the dark channel prior, other prior knowledge is also employed for the restoration of underwater images. Carlevaris-Bianco et al. [21] proposed a novel prior method that leverages the difference in attenuation between color channels to restore clean images. Wang et al. [22] introduced the adaptive attenuation-curve prior through cluster analysis of image pixel values and exploited the saturation constraints to reduce the oversaturation and noise in the restored images. These methods rely on mathematical models of underwater imaging and are inherently constrained by certain prior assumptions. Moreover, the parameter estimation for these models is complex and may not produce satisfactory results in tasks such as artifact removal or color correction.

2.3 Deep learning-based methods

Deep learning has achieved great success in the field of computer vision and opened up new ideas for image processing. Nevertheless, obtaining corresponding ground truth for underwater images is difficult in practical situations. Researchers proposed weakly supervised and unsupervised enhancement methods based on Generative Adversarial Network (GAN) [23]. Islam et al. [7] used conditional generative adversarial networks to synthesize a paired dataset named Enhancing Underwater Visual Perception (EUVP). UWCNN [8] developed an underwater image synthesis algorithm based on underwater scene priors, which can simulate various degraded underwater images. However, the proposed physical model has limited applicability. Li et al. [9] built the Underwater Image Enhancement Benchmark (UIEB) and proposed a fusion-based enhancement algorithm where the input image requires white balance, histogram equalization, and gamma correction preprocessing. Wu et al. [10] proposed a two-stage underwater image CNN (Convolutional Neural Network) based on structure decomposition (UWCNN-SD) for underwater image enhancement from a frequency perspective. However, it has a tendency to excessively amplify the high-frequency component, leading to oversaturation in the enhanced results and loss of details in the darker areas. Li et al. [11] presented a parallel multi-color space encoder to learn the feature representations from RGB, HSV and Lab. Besides, the domain knowledge is incorporated into the unified structure. Wang et al. [12] adopted the serial RGB and HSV enhancement channel to improve the utilization efficiency of image features, and relied on an attention-based fusion module to learn the weights of features in different color spaces for final fusion. Liu et al. [24] proposed an adaptive learning attention network (LANet) that combines parallel attention, multiscale fusion and adaptive learning modules to fuse different spatial information to solve the problems of low illumination and color cast in underwater images. However, the local attention employed in the network struggles to capture long-range dependencies in images. Huo et al. [25] developed a multi-stage network to progressively refine the hybrid degradations, and used a wavelet boost learning strategy to restore fine details in the frequency domain. Peng et al. [26] built a large scale underwater image (LSUI) dataset, and presented an U-shape Transformer network where the transformer model is for the first time introduced to the underwater image enhancement task. While this approach presents innovative solutions to address underwater degradation issues, it has limitations in handling images of varying scales. Current data-driven methods often suffer from poor robustness. Due to the heavy reliance of deep learning on the dataset quality, it is crucial to establish larger and more diverse underwater enhancement datasets. Furthermore, a deeper exploration on the characteristics of underwater images is necessary and closely integrate them with real-world applications.

3. Method

In underwater imaging, color distortion mainly occurs at low frequencies, and the degree of energy attenuation has a nonlinear relationship with the color channel. Scattering has a detrimental effect on details and textures at high frequencies. Underwater degradation is a hybrid distortion involving many uncertainties. As shown in Fig. 1, two parallel enhancement streams are designed to target haze and color cast respectively. The image details are restored progressively from coarse to fine through the adaptive median filtering and wavelet boost U-Net in the de-turbidity stream. The color correction stream separates the input image into to R, G, and B channels and enhances them respectively. This conforms to the natural law that different color channels have different attenuation rates. Finally, global attention is introduced to capture long-distance context, which effectively fuses the feature maps enhanced by dual streams into clear and natural underwater images.

Fig. 1. The overall structure of our proposed method.

Download Full Size | PDF

3.1 Multi-scale turbidity restoration module

The multi-scale turbidity restoration module consists of two stages: the removal of noise caused by large particles in the image (rough enhancement), and the elimination of haze caused by scattering (refined enhancement). As mentioned earlier, large underwater particles, bubbles and other substances can cause spots of varying sizes due to the reflection, refraction, or obstruction of light. These spots may appear brighter or darker than nearby areas. We consider them to be noise points that median filtering can effectively handle. The median filter sorts pixel values within a neighborhood. Isolated noise points are often distributed on both sides of the sequence, so the median point is used as the new value for the center pixel of the neighborhood. It effectively preserves pixel information while removing the influence of noise points. The filter’s performance is greatly affected by the filtering window. A small window can better preserve fine details but may compromise the noise filtering effect, while a large window contributes to the better noise reduction but results in image blur. Fig. 2 displays our adaptive median filter design, which adjusts the window size dynamically to balance noise removal and detail preservation. The residual block consists of $3\times 3$ convolution, instance normalization and activation function ReLU. The filtering window is square, and we use two residual blocks to learn the window maps where each value represents the window size of the corresponding pixel coordinate. The process can be expressed as follows:

(1)$$Win= Rb\left ( Rb\left ( I \right ) \right ) .$$

(2)$$P_{x,y,c} = Mdf\left ( Win_{x,y,c} \right ) .$$

Where $Rb\left ( \cdot \right )$ represents the processing of a residual block, $Win$ are the learned window maps, and $Mdf\left ( \cdot \right )$ represents median filtering. $Win_{x,y,c}$ denotes the filter window size at the $c$ channel, $\left ( x,y \right )$ position. $P_{x,y,c}$ is the filtered pixel value at the $\left ( x,y,c \right )$ position.

Fig. 2. The schematic illustration of the adaptive median filter.

Download Full Size | PDF

The Strengthen-Operate-Subtract (SOS) Boosting strategy can reduce the noise level and improve the quality and clarity of images, so it is widely used in the field of image denoising [27]. Its main objective is to progressively refine and enhance the image using previously estimated results. This can be expressed as the following equation:

(3)$$x^{n+1} =G\left ( y+x^{n} \right ) -x^{n}.$$

Where $G\left ( \cdot \right )$ represents a denoising model, $x^{n}$ is the denoised image, and $x^{n+1}$ is the "strengthened" image. Dong et al. [28] introduced SOS into the dehazing task and proved its effectiveness. As shown in Fig. 3, the decoder of U-Net is considered as a dehazing module in our refined enhancement stage, where the output of the encoding block $x_{e}^{n}$ represents the potential feature. The previous level feature $x_{d}^{n}$ is upsampled to maintain the same scale. Then the skip connection transmits $x_{e}^{n}$ to the decoder. $x_{e}^{n}$ and $\left ( x_{d}^{n} \right ) \uparrow$ serve as input signals that are strengthened by the haze-free module $G_{d}^{n+1}$. The boosted feature can be expressed as:

(4)$$x_{d}^{n+1} = G_{d}^{n+1} \left ( x_{e}^{n} +S_{\uparrow 2}\left ( x_{d}^{n} \right ) \right ) - S_{\uparrow 2}\left ( x_{d}^{n} \right ).$$

Where $x_{e}^{n}$ is the latent feature from the encoder at layer $n$, $x_{d}^{n}$ is the feature from the decoder at layer $n$, $G_{d}^{n+1}$ represents the trainable unit at the $\left ( n+1 \right )$th layer, $S_{\uparrow 2}$ denotes 2$\times$ upsampling, and $\left ( x_{e}^{n} +S_{\uparrow 2}\left ( x_{d}^{n} \right ) \right )$ represents the strengthened feature.

Fig. 3. The schematic illustration of the multi-scale turbidity restoration module.

Download Full Size | PDF

The conventional U-Net architecture repeatedly employs shallow features through long connections, but this approach fails to facilitate communication between non-adjacent layers. To optimize the utilization of features across layers, we follow [28] to use the dense connections in U-Net. The related feature map are sampled to the same scale and fused together. However, simple concatenation only achieves the limited expected effect. Images contain shared statistical information (such as background) and independent characteristics (such as textures and details), which exhibit obvious gaps in both scales and data combination patterns. Therefore, they cannot be effectively integrated during the fusion process, resulting in information loss or increased uncertainty. We merge branches based on frequency to ensure consistency of feature attributes, thereby reducing the problems caused by the gaps.

As shown in Fig. 4, each feature branch undergoes a $1\times 1$ convolution to reduce the channel dimension, similar to the bottleneck concept. Then, the feature map is transformed to the same scale through sampling operations. After the 2D discrete wavelet transform (DWT) in rows and columns, the feature map is decomposed into four components (diagonal $D$, vertical $V$, horizontal $H$, and low-frequency $A$). The Haar wavelet filter used is $L_{high} = \left [ \frac {1}{\sqrt {2} }, -\frac {1}{\sqrt {2} } \right ]$, $L_{low} = \left [ \frac {1}{\sqrt {2} }, \frac {1}{\sqrt {2} } \right ]$. The transformed results are reorganized into four groups based on component types. Next, we use a $3\times 3$ convolutional layer to learn the weights for each group separately and apply them to the corresponding frequency components. Finally, the fused dense features are recovered through the inverse discrete wavelet transform. The process can be expressed as:

(5)$$\left ( D,V,H,A \right )_{e} =R\left ( DWT\left ( S_{\downarrow 2^{3-n}} \left ( F_{eds}^{3} \right ) \right ),\ldots , DWT\left ( S_{\downarrow 2} \left ( F_{eds}^{n+1} \right ) \right ) , DWT\left ( F_{e}^{n} \right ) \right ) .$$

(6)$$\left ( D,V,H,A \right )_{d} =R\left ( DWT\left ( S_{\uparrow 2^{m}} \left ( F_{dds}^{0} \right ) \right ),\ldots , DWT\left ( S_{\uparrow 2} \left ( F_{dds}^{m-1} \right ) \right ) , DWT\left ( F_{d}^{m} \right ) \right ) .$$

(7)$$\left ( W_{D},W_{V},W_{H},W_{A} \right ) =Sigmod \left ( Conv_{3\times 3}\left ( D \odot V\odot H\odot A \right ) \right ) .$$

(8)$$F_{eds}^{n} / F_{dds}^{m}=Conv_{1\times 1} \left ( IDWT \left ( \left ( \left ( W_{D}\cdot D \right ) \odot \left ( W_{V}\cdot V \right ) \odot \left ( W_{H}\cdot H \right ) \odot \left ( W_{A}\cdot A \right ) \right ) \right ) \right ) .$$

Where $F_{eds}^{n}$ represents the output of the $n$th layer dense connection on the encoder side, $F_{e}^{n}$ represents the feature map of the ordinary path before the $n$th layer dense connection. $F_{dds}^{m}$ represents the output of the $m$th layer dense connection on the decoder side. $DWT\left ( \cdot \right )$ refers to the wavelet transformation, and $IDWT\left ( \cdot \right )$ is the inverse wavelet transform. $R\left ( \cdot \right )$ indicates regrouping the feature maps based on frequency, and $\odot$ denotes the concatenating operation. $W_{D}$, $W_{V}$, $W_{H}$, $W_{A}$ are the weights of the feature maps with different frequencies.

Fig. 4. The schematic illustration of dense feature fusion.

Download Full Size | PDF

3.2 Multi-path color correction module

Light of varying wavelengths attenuates at different rates when traveling through water, resulting in the blue or green tint in underwater images. The contribution of the R, G, and B channels to color recovery differs, so processing each channel separately can reduce interference with each other. The research [29] suggests that receptive fields for the RGB channels should be allocated according to the wavelength, with the blue channel having the largest receptive field. However, in reality, the presence of other colored substances in water can make the image display different colors. For example, water bodies with abundant algae are likely green, while high sediment content can cause the underwater image to appear yellow. These situations do not align with the expected wavelength attenuation results. As shown in Fig. 5, our proposed multi-path color correction module uses similar receptive fields for different color channels.

Fig. 5. The schematic illustration of multi-path color correction module.

Download Full Size | PDF

In underwater scenes, the targets can vary greatly in terms of size, shape, and location, and other factors. The perception is limited by the information observed through a single receptive field. To enhance multi-contextual awareness, the multi-path residual block (MRB) processes feature maps in parallel using multiple convolutional layers with different kernel sizes. This formulation addresses both local and global coherence spatially, and integrates feature information flows from various stages, enhancing the diversity of feature contents. Furthermore, the channel shuffling [30] in MRB disrupts the original order of feature map channels, which addresses the limited information exchange between convolutional paths. The process can be expressed by the following formula:

(9)$$F_{out}^{m}= F_{in}^{m}+Shuffle_{ch} \left ( f_{1\times 1}\left ( F_{in}^{m} \right ) \odot f_{3\times 3}\left ( F_{in}^{m} \right ) \odot f_{5\times 5}\left ( F_{in}^{m} \right ) \right ).$$

Where $Shuffle_{ch}\left ( \cdot \right )$ is the channel shuffling operation, and $\odot$ denotes the concatenating operation. $f_{1\times 1}\left ( \cdot \right )$, $f_{3\times 3}\left ( \cdot \right )$, $f_{5\times 5}\left ( \cdot \right )$ respectively represent the processing (including IN and ReLU) on the $1\times 1$, $3\times 3$, $5\times 5$ convolution path.

While the RGB color space is a universal image format in computers, it does not conform to the intuitive perception of the human eyes. The information of brightness and color requires the combination of three channels to be properly interpreted. Therefore, RGB is actually a composite mode, where the three color channels are not isolated. Considering this issue, the cross-connections are introduced into parallel pathways to leverage information from other color channels. However, direct connections may inadvertently introduce irrelevant information from other channels, such as noise, which interferes with or even disrupts the characteristics of the current branch. We incorporate the Large Kernel Attention (LKA) [31] into the cross-connections. LKA employs a large convolution kernel to capture the long-range dependence, select the discriminative features in other color channels and automatically ignore noisy responses based on the spatial contexts of the weighted input features. As shown in Fig. 6, the $k\times k$ large kernel convolution is decomposed into depthwise convolution, depthwise dilation convolution, and pointwise convolution, which helps to reduce the number of parameters and computational complexity. LKA can be expressed as follows:

(10)$$F_{out}^{l}= F_{in}^{l}\cdot Conv_{1\times 1} \left ( DwdConv\left ( DwConv\left ( F_{in}^{l} \right ) \right ) \right ).$$

Where $DwdConv$ represents $5\times 5$ depth dilation convolution (dilation rate is 3), $DwConv$ is $5\times 5$ depth convolution, and $Conv_{1\times 1}$ is point convolution.

Fig. 6. The schematic illustration of the components of multi-path color correction module.

Download Full Size | PDF

In the end, channel attention is utilized to fuse the enhanced results from the R, G, and B channels, instead of simply concatenating them. This is because the color channel-wise information may provide different gains in the underwater image enhancement task. The relevant equations are as follows:

(11)$$\left ( W_{R}, W_{G}, W_{B} \right ) = Sigmod \left ( GAP\left ( Conv_{3\times 3}\left ( F_{R}^{\prime}\odot F_{G}^{\prime}\odot F_{B}^{\prime} \right ) \right ) \right ).$$

(12)$$F_{correct}= Conv_{1\times 1}\left ( \left ( F_{R}^{\prime }\cdot W_{R} \right ) \odot \left ( F_{G}^{\prime }\cdot W_{G} \right ) \odot \left ( F_{B}^{\prime }\cdot W_{B} \right ) \right )$$

Where $F_{R}^{\prime }$, $F_{G}^{\prime }$, $F_{B}^{\prime }$ are the enhancement results of red, green and blue channels respectively, and $GAP$ is the global mean pooling. $W_{R}$, $W_{G}$, and $W_{B}$ represent the weights of the color channels, and $\odot$ denotes the concatenating operation.

3.3 Collaborative attention fusion module

While MTRM and MCCM have color-corrected and defogged the raw image, respectively, we need a mechanism to complement their weaknesses with their beneficial information. Motivated by recent image enhancement methods [32,33], we propose the collaborative attention fusion module (CAFM) to integrate the turbidity recovery component and the color correction component. The structure of CAFM is shown in Fig. 7.

Fig. 7. The schematic illustration of collaborative attention fusion module.

Download Full Size | PDF

The relationship between pixels and the surrounding pixels prevents their isolation within an image. Consequently, a large number of interconnected pixels produce various objects in the image. The correlation among adjacent pixels is typically strong, while that among distant ones is weak. The image enhancement task is more inclined to utilize local context information. The CNN structure limits its ability to capture global semantic information within a limited receptive field. Non-local attention can directly constructs long-range dependencies by computing the interaction between any two locations, rather than being limited to neighboring points. Thus, we utilize non-local space and channel attention mechanisms to extract long-range contextual information in different dimensions. Specifically, the input feature $F\in \mathbb {R} ^{rC\times H\times W}$ passes through three $1\times 1$ convolutional layers to reduce the channels to $\frac {1}{r}$ and generate three new feature maps $Q$, $K$ and $V$. The $H$ and $W$ dimensions are then flattened to obtain three $C\times HW$ feature maps. We perform a matrix multiplication between the transpose of $Q$ and $V$, and apply a Softmax layer to calculate the $C\times C$ channel attention map:

(13)$$M_{c} = Softmax\left ( Rs\left ( K \right )_{C\times HW} \otimes \left ( Ts \left ( Rs \left ( Q \right ) \right ) _{HW\times C} \right ) _{C\times C} \right )$$

The principle of non local channel attention can be expressed by the following equations:

(14)$$F_{C}= F + Conv_{1\times 1} \left ( Rs \left ( M_{C}\otimes Rs\left ( V \right )_{C\times HW} \right ) \right )$$

Where $Rs\left ( \cdot \right )$ is the reshaping operation, $Ts\left ( \cdot \right )$ is the transpose operation, and $\otimes$ denotes the matrix multiplication. Nonlocal space attention utilizes the inter space-wise interaction information to generate the attention map:

(15)$$M_{S}= Softmax\left ( Ts \left ( Rs\left ( A \right ) \right )_{HW\times C} \otimes Rs\left ( B \right )_{C\times HW} \right )_{HW\times HW}$$

(16)$$F_{S}= F + Conv_{1\times 1}\left ( Rs\left ( Rs\left ( C \right )_{C\times HW} \otimes M_{S} \right ) \right )$$

The network can capture the long-distance dependence on the whole image by computing the interaction relevance matrix and encodes a wider range of contextual information into local features, which can effectively extract the global information of the current stage.

The non-local attention can discover long-distance contextual relationships by virtue of the global receptive field, but it is difficult to deal with the images that lack enough repetitive details. We introduce the local attention to focus on complex textures, which makes up for the shortcomings of non-local attention. This module is originally designed to adaptively adjust the receptive field size based on input features. Specifically, $F_{C}$ and $F_{S}$ are concatenated along the channel dimension, then channel shuffling is used to redistribute feature maps. They split into two sets of feature maps with halved channels ($F_{1}$ and $F_{2}$):

(17)$$\left ( F_{1}, F_{2} \right ) = Split\left ( Shuffle_{ch}\left ( F_{C}\odot F_{s} \right ) \right )$$

We use $3\times 3$ and $5\times 5$ convolution to obtain different performance states of features. The resulting feature maps $F_{3\times 3}$ and $F_{5\times 5}$ are summed pixel by pixel and fed into the channel attention module to generate two sets of attention maps:

(18)$$\left ( M_{3\times 3}, M_{5\times 5} \right ) = Sigmod\left ( Conv_{1\times 1} \left ( GMP\left ( F_{3\times 3}+F_{5\times 5} \right ) \right ) \right )$$

Where $M_{3\times 3}$ and $M_{5\times 5}$ are the attention weights generated in different convolution paths, $GMP$ represents global max pooling operation. Finally, the two branches are fused into the final enhanced result:

(19)$$\hat{I} = Conv_{1\times 1} \left ( F_{3\times 3}\cdot M_{3\times 3} + F_{5\times 5}\cdot M_{5\times 5} \right )$$

3.4 Loss function

Smooth L1 Loss The L1 loss is not differentiable at $0$, and the gradient of the L2 loss can easily explode when the predicted value is significantly different from the target value. The smooth L1 loss uses a piecewise function to combine L1 loss and L2 loss, overcoming the shortcomings of both. The smooth L1 loss is defined as follow:

(20)$$L_{s1}\left ( y \right ) = \begin{cases} 0.5y ^{2} & \text{ if } \lvert y \rvert < 1 \\ \lvert y \rvert - 0.5 & \text{ otherwise } \end{cases}$$

Where $y$ stands for the difference between the predicted image $\hat {I}$ and the ground truth $I_{gt}$.

Laplacian Loss The image pyramid is a multi-scale representation of images, which is widely used in various fields, including image compression, fusion, and enhancement. In the pyramid structure, the underlying image mainly contains details such as light and shade, edges, and so on, while the high-level image can extract richer overall structural information. The Laplacian pyramid can predict image residuals at different resolutions and provide multi-scale detailed information in particular. It is applied to video frame interpolation to generate fine intermediate frames [34]. The Laplacian loss is a suitable method for high-precision image reconstruction tasks as it can capture both global and local feature differences. The loss is defined as follows:

(21)$$L_{lap} = \sum_{i=1}^{5} 2^{i-1} \left \| L^{i}\left ( \hat{I} \right ) - L^{i} \left ( I_{gt} \right ) \right \| _{1}$$

Where $L^{i}\left ( \cdot \right )$ denotes the $i$-th layer of a Laplacian pyramid of an image.

Perceptual Loss The per-pixel loss is too rigid to evaluate the image the way the human eye does. The perceptual loss utilizes the ability of convolutional layers to extract high-level features and perceives images from a higher dimension, which is more similar to the human thinking. The Visual Geometry Group 16 (VGG-16) [35] a commonly used model for extracting image features in computer vision tasks due to its simplicity and effectiveness. So we utilized the pre-trained VGG-16 to capture perceptual features. The perceptual loss is defined as follows:

(22)$$L_{perce}= \left \| \phi_{j}\left ( \hat{I} \right ) - \phi_{j}\left ( I_{gt} \right ) \right \| _{1}$$

Where $\phi _{j}\left ( \cdot \right )$ denotes the feature maps of the $j$-th layer of VGG16. We compute the perceptual loss at layer $relu2\_2$ .

The joint loss of the proposed enhancement network is defined as the weighted sum of the smooth L1 loss, Laplacian loss and the perceptual loss. This can be represented as:

(23)$$L = \lambda _{s1} L_{s1} + \lambda _{lap} L_{lap} + \lambda _{perce} L_{perce}$$

Where $\lambda _{s1}$, $\lambda _{lap}$, and $\lambda _{perce}$ are hyper-parameters and empirically set to 1, 0.1, 0.2.

4. Experiments and results

4.1 Datasets and implement details

Our network is trained using 800 pairs of images from the UIEB dataset [9] and 1000 pairs of images from the EUVP dataset [7]. The UIEB dataset includes 950 real-world underwater images, 890 of which have corresponding reference images, and 60 challenging images without ground truth. The EUVP dataset contains separate three sets of paired images (Underwater Dark, Underwater ImageNet, and Underwater Scenes) to facilitate supervised training of underwater image enhancement models. We optimize the network using the Adaptive Moment Estimation (ADAM) optimizer with a learning rate of 0.0001. Each image is randomly taken $64\times 64$ crops using data augmentation such as rotating, flipping, and mirroring. The batch size is 32, and the epoch is set to 50. Our model is implemented in Pytorch on an Nvidia GeForce GTX 1080 GPU.

4.2 Comparison experiments

4.2.1 Evaluation metrics

To evaluate the effectiveness of our method in enhancing underwater images, we compare it with 10 algorithms, including underwater dark channel prior (UDCP) [19], image blurriness and light absorption (IBLA) [36], Retinex-based [15], underwater convolutional neural network (UWCNN) [8], Water-Net [9], underwater image convolutional neural network based on structure decomposition (UWCNN-SD) [10], Ucolor [11], LANet [24], U-Trans [26], and UIEC$^{2}$-Net [12]. The first three are traditional algorithms, while the remaining seven are deep learning algorithms.

We select 90 images from the UIEB dataset that have reference images in the non-training set as the test set UIEB-T90. Similarly, 100 pairs of images are randomly selected from the EUVP dataset, called EUVP-T100. The above two datasets are used for Full-Reference Image Quality Assessment (FRIQA). In addition, we conduct Non-Reference Image Quality Assessment (NRIQA) on UIEB-C60 and U45 [37]. UIEB-C60 contains 60 challenging underwater images without reference images provided in the UIEB dataset. U45 uses underwater clear-degraded image pairs to train CycleGAN and generate 45 underwater images, simulating color casts, low contrast and haze-like effects of underwater degradation.

To evaluate the performance of algorithms comprehensively and rigorously, we employ peak signal-to-noise ratio (PSNR), structure similarity index measure (SSIM), and learned perceptual image patch similarity (LPIPS) [38] for FRIQA. Additionally, we use underwater image quality measure (UIQM) [39], underwater color image quality evaluation (UCIQE) [40], and Natural Image Quality Evaluator (NIQE) [41] to assess underwater image quality without corresponding ground truth. The PSNR describes image quality from the energy perspective. The SSIM mainly considers three key features of images, luminance, contrast, and structure to measure the similarity between images. The LPIPS utilizes deep learning to simulate human visual perception, and the generated perceptual distance is more accurately assess image similarity. The UIQM comprises the underwater image colorfulness measure (UICM), the underwater image sharpness measure (UISM), and the underwater image contrast measure (UIConM). The UCIQE aims to quantify non-uniform color projection, blurring, and low contrast of underwater images. The NIQE evaluates natural image quality based on a statistical feature model of the image.

4.2.2 Qualitative comparisons

Water absorbs long wavelengths of light more than short wavelengths, resulting in blue-green underwater images in natural environments. In order to evaluate the color bias correction effect of our method, we choose several greenish, bluish, and yellowish images of underwater scenes for visual comparison.

Figure 8 displays the enhancement results of blue underwater images with different algorithms. The color histograms indicate that raw images have a concentration of R channel values around 0, while the B channel occupies most of the high value areas. UDCP [19] exacerbates the blue color cast, and make the overall image darker. IBLA [36] does not get better in terms of color, but there seems to be a slight improvement in exposure. Retinex-based [15] causes obvious over-enhancement problems in both color and edges, as seen in the second image of Fig. 8(d). This problem is also present in UWCNN-SD [10]. UWCNN [8] brings haze to the predicted images, and leads to more blurred image details. LANet [24] still retains a strong blue color in the third image of Fig. 8(i). In contrast, UIEC$^{2}$-Net [12] and ours exhibit satisfactory performance in color correction and dehazing.

Fig. 8. Visual comparisons on blue underwater images and their color histograms. (a) Raw. (b) UDCP [19]. (c) IBLA [36]. (d) Retinex-based [15]. (e) UWCNN [8]. (f) Water-Net [9]. (g) UWCNN-SD [10]. (h) Ucolor [11]. (i) LANet [24]. (j) UIEC$^{2}$-Net [12]. (k) Ours.

Download Full Size | PDF

As depicted in Fig. 9, UDCP [19] and IBLA [36] fail to recover color information so that the enhanced results still maintain green tones. UWCNN [8] reduces the green cast of the image to a certain extent, as seen by the partial left shift of the G-channel curve in the histogram. However, the R channel is stretched excessively, causing the images to appear yellow, particularly in the second image of Fig. 9(e). To deal with fuzz and under-exposure, Reinex-based [15] applies simple histogram equalization and stretching on the reflection and brightness components. While this improves the image contrast, it also results in darker shadows and overly bright highlights. Water-Net [9] and Ucolor [11] struggle to effectively remove the haze-like effect. On the other hand, our method produces images with natural colors and rich details.

Fig. 9. Visual comparisons on green underwater images and their color histograms. (a) Raw. (b) UDCP [19]. (c) IBLA [36]. (d) Retinex-based [15]. (e) UWCNN [8]. (f) Water-Net [9]. (g) UWCNN-SD [10]. (h) Ucolor [11]. (i) LANet [24]. (j) UIEC$^{2}$-Net [12]. (k) Ours.

Download Full Size | PDF

To visually analyze and compare the enhancement results of different algorithms in yellow underwater scenes, we conducted a statistical analysis of the pixel values in the RGB channels of the images. The results are presented the results in a pie chart, as shown in Fig. 10. Although UDCP [19] can slightly increase the pixel values in the blue channel, there is no visible improvement in the enhanced images from a visual perspective. IBLA [36] fails to restore the three color channels of the image to a reasonable proportion. Instead, it excessively boosts the green channel, which results in a deviation towards the greenish hue. Retinex-based [15] effectively balances the pixel values of the RGB channels, but the processed images significantly reduce the saturation and lose the original color information. UWCNN [8] and Ucolor [11] exhibit the noticeable red coloration due to the increased dominance of the R channel. UWCNN-SD [10] introduces unpleasant artificial colors and produces halos around the edges of objects. Our method has obvious advantages in color fidelity and clearness.

Fig. 10. Visual comparisons on yellow underwater images and their pixel value ratio pie chart of RGB channels. (a) Raw. (b) UDCP [19]. (c) IBLA [36]. (d) Retinex-based [15]. (e) UWCNN [8]. (f) Water-Net [9]. (g) UWCNN-SD [10]. (h) Ucolor [11]. (i) LANet [24]. (j) UIEC$^{2}$-Net [12]. (k) Ours.

Download Full Size | PDF

Figure 11 shows the enhancement results for underwater images with haze. We utilize the transmission map to visually compare the dehazing effects. The transmission indicates the percentage of the scene radiance reaching the camera, which is a crucial prior knowledge in the underwater image formation model. It can be observed that the transmission maps of raw images tend to be dim as a whole, and there is no clear distinction between regions of background light at different distances. Obviously, our transmission maps have higher radiation efficiency and are more responsive to light intensity. Therefore, our algorithm can effectively remove haze while restoring the true colors of objects. UDCP [19] and IBLA [36] rely on the dark channel prior to remove haze from images. However, accurately estimating fog concentration is difficult when dealing with heavily foggy images, resulting in darker results. The simple underwater model established by UWCNN [8] is inadequate for the complex underwater environment. Although Retinex-based [15] and UWCNN-SD [10] can meet the requirements of haze removal, they tend to excessively enhance high-frequency information, leading to color distortion. Water-Net [9] falls short in terms of color correction and overall brightness. The transmission maps of LANet [24] are darker than ours, indicating that the results of LANet have significant residual haze. Our method retains more detail than Ucolor [11] and UIEC$^{2}$-Net [12] due to the finer transmission maps.

Fig. 11. Visual comparisons on turbid underwater images and their transmission maps. (a) Raw. (b) UDCP [19]. (c) IBLA [36]. (d) Retinex-based [15]. (e) UWCNN [8]. (f) Water-Net [9]. (g) UWCNN-SD [10]. (h) Ucolor [11]. (i) LANet [24]. (j) UIEC$^{2}$-Net [12]. (k) Ours.

Download Full Size | PDF

4.2.3 Quantitative comparisons

Table 1 demonstrates the quantitative results of different methods on two benchmarks. The UIEB-T90 and EUVP-T100 serve as full-reference test sets, with each raw image paired with a corresponding reference image. We choose PSNR, SSIM, and LPIPS as the quantitative metrics to assess the quality of the images. Our approach outperforms the runner-up method on the UIEB-T90 test dataset by approximately 1dB in terms of PSNR, while achieving optimal values for SSIM and LPIPS. In addition, our performance also maintains a leading position on the EUVP-T100 dataset. It is evident that both UDCP [19] and IBLA [36] fail to deliver satisfactory results. The Retinex-based [15] shows promising performance in defogging and can even compete with deep learning algorithms. However, its performance suffers greatly when there are severe color cast issues. From a holistic perspective, deep learning algorithms outperform traditional algorithms.

Table 1. Image quality evaluation of different methods on UIEB-T90 and EUVP-T100 datasets.

View Table | View all tables in this article

We also conduct the quantitative study on the non-reference datasets UIEB-C60 and U45. Table 2 reports the average scores of the results obtained by different methods. UIQM and UCIQE are metrics specifically designed for underwater image evaluation, and NIQE is typically used to measure the quality of natural images. Due to the algorithm design of U-Trans, the input image only supports $256 \times 256$ resolution, so only the corresponding results for the U45 dataset are shown in Table 2. It can be observed that our approach obtain the best metric values on both test sets, which indicates the superior performance in comprehensive aspects such as color, contrast, and clarity. UWCNN [8] designs underwater degradation model that is too simple to effectively adapt to the changing underwater conditions. Ucolor [11] and UIEC$^{2}$-net [12] both utilize multiple color spaces, but Ucolor [11] falls behind in various metrics. Because there exists a certain gap between feature maps in different color spaces, and Ucolor [11] directly concatenates them without any processing, which may introduce negative interference. On the other hand, Water-Net [9] applies traditional algorithms to preprocess the input images, which plays a positive role to some extent. However, the fusion network is merely a stack of ordinary convolutional layers and does not fully leverage the advantages of deep learning.

Table 2. Image quality evaluation of different methods on UIEB-C60 and U45 datasets.

View Table | View all tables in this article

4.3 Ablation study

To demonstrate the necessity of critical modules, we conduct ablation studies on the key components and loss functions on the UIEB-T90 dataset respectively:

(1) Our method without multi-scale turbidity restoration module (w/o MTRM).
(2) Our method without multi-path color correction module (w/o MCCM).
(3) Our method removes the non-local channel and space attention (w/o NLA).
(4) Our method removes Laplacian Loss (w/o $L_{lap}$).
(5) Our method removes Perceptual Loss (w/o $L_{perce}$).

As depicted in Fig. 12(b), the images are still obscured by haze, and the fine textures on the foreground objects, such as the coral rocks, become blurred. This indicates that MTRM significantly contributes to haze removal performance. Figure 12(c) presents enhancement results without MCCM, leading to a limited ability to remove color casts. Figure 12(d) illustrates that the non-local attention focuses on the internal contextual information of images. It helps the network to simultaneously consider both inter-image and intra-image weights in the fusion stage, thereby improving image quality. Table 3 data clearly indicates that PSNR and SSIM values decrease significantly when MCCM is not used. It can be attributed to the color bias, which leads to a substantial increase in Mean Squared Error (MSE).

Fig. 12. Ablation study of the key modules. (a) Raw. (b) w/o MTRM. (c) w/o MCCM. (d) w/o NLA. (e) Ours.

Download Full Size | PDF

Table 3. Qualitative results of the ablation study on the proposed modules.

View Table | View all tables in this article

Figure 13 and Table 4 present the results of ablation studies on different loss terms. The third image in Fig. 13(b) shows that the text on the sign is too blurred to be discerned, and the contours of the stones on the riverbed are not clear. The Laplacian loss, by utilizing the image residual pyramid, narrows the gap between the predicted image and the ground truth at multiple scales, which is effective for image detail enhancement. Table 4 also indicates that the absence of Laplacian loss leads to a decrease in PSNR and SSIM values by 3.8dB and 5${\%}$, respectively. Figure 13(c) demonstrates that relying solely on per-pixel loss can result in global perceptual biases in the image, such as color distortion and abnormal brightness. The perceptual loss constrains the network at high-level semantic features to produce pleasing images, which is indispensable for enhancing the visual quality of the final results.

Fig. 13. Ablation study of the loss functions. (a) Raw. (b) w/o $L_{lap}$. (c) w/o $L_{perce}$. (d) Ours.

Download Full Size | PDF

Table 4. Qualitative results of the ablation study on the loss functions.

View Table | View all tables in this article

5. Conclusion

In this paper, we presented a dual-stream fusion network for underwater image enhancement. The multi-scale turbidity restoration module (MTRM) removes the haze from underwater images and reduces scattering damage to clarity. The multi-path color correction module (MCCM) is mainly responsible for restoring the true colors as much as possible and avoiding visual interference from color cast. The collaborative attention fusion module extracts the self-correlation characteristics through non-local attention, providing important information for the fusion process. The quantitative and qualitative evaluations show that our algorithm provides competitive results on UIEB, EUVP and U45 datasets. The diverse testing images visually demonstrate our algorithm’s excellent ability to restore blue, green, yellow biases and eliminate fog. In addition, the effectiveness of key components has been validated in ablation studies. In the future, we will further explore the field of underwater image enhancement.

Funding

Fundamental Research Funds for the Central Universities (No. N2216010); ’Jie Bang Gua Shuai’ Science and Technology Major Project of Liaoning Province in 2022 (No. 2022JH1/10400025); National Key Research and Development Program of China (No. 2018YFB1702000).

Acknowledgments

The authors acknowledge the financial funding of this work. We also thank the anonymous reviewers for their critical comments on the manuscript.

Disclosures

The authors declare no conflicts of interest.

Data availability

Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request.

References

1. S. Anwar and C. Li, “Diving deeper into underwater image enhancement: A survey,” Signal Process. Image Commun. 89, 115978 (2020). [CrossRef]

2. D. Huang, Y. Wang, W. Song, et al., “Shallow-water image enhancement using relative global histogram stretching based on adaptive parameter acquisition,” in MultiMedia Modeling: 24th International Conference, MMM 2018, Bangkok, Thailand, February 5-7, 2018, Proceedings, Part I 24, (Springer, 2018), pp. 453–465.

3. W. Xiang, P. Yang, and S. Wang, “Underwater image enhancement based on red channel weighted compensation and gamma correction model,” Opto-Electron. Adv. 1(10), 18002401 (2018). [CrossRef]

4. J. Zhou, D. Zhang, W. Ren, et al., “Auto color correction of underwater images utilizing depth information,” IEEE Geosci. Remote Sensing Lett. 19, 1–5 (2022). [CrossRef]

5. K. He, J. Sun, and X. Tang, “Single image haze removal using dark channel prior,” IEEE Trans. Pattern Anal. Mach. Intell. 33(12), 2341–2353 (2011). [CrossRef]

6. J. S. Jaffe, “Computer modeling and the design of optimal underwater imaging systems,” IEEE J. Oceanic Eng. 15(2), 101–111 (1990). [CrossRef]

7. M. J. Islam, Y. Xia, and J. Sattar, “Fast underwater image enhancement for improved visual perception,” IEEE Robot. Autom. Lett. 5(2), 3227–3234 (2020). [CrossRef]

8. C. Li, S. Anwar, and F. Porikli, “Underwater scene prior inspired deep underwater image and video enhancement,” Pattern Recognit. 98, 107038 (2020). [CrossRef]

9. C. Li, C. Guo, W. Ren, et al., “An underwater image enhancement benchmark dataset and beyond,” IEEE Trans. on Image Process. 29, 4376–4389 (2020). [CrossRef]

10. S. Wu, T. Luo, and G. Jiang, “A two-stage underwater enhancement network based on structure decomposition and characteristics of underwater imaging,” IEEE J. Oceanic Eng. 46(4), 1213–1227 (2021). [CrossRef]

11. C. Li, S. Anwar, J. Hou, et al., “Underwater image enhancement via medium transmission-guided multi-color space embedding,” IEEE Trans. on Image Process. 30, 4985–5000 (2021). [CrossRef]

12. Y. Wang, J. Guo, H. Gao, et al., “Uiecˆ 2-net: Cnn-based underwater image enhancement using two color space,” Signal Process. Image Commun. 96, 116250 (2021).

13. O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18, (Springer, 2015), pp. 234–241.

14. A. S. A. Ghani and N. A. M. Isa, “Enhancement of low quality underwater image through integrated global and local contrast correction,” Appl. Soft Comput. 37, 332–344 (2015). [CrossRef]

15. X. Fu, P. Zhuang, Y. Huang, et al., “A retinex-based enhancing approach for single underwater image,” in 2014 IEEE international conference on image processing (ICIP), (IEEE, 2014), pp. 4572–4576.

16. K. Zuiderveld, “Contrast limited adaptive histogram equalization,” Graphics gems pp. 474–485 (1994).

17. Y. Zhou, Q. Li, and G. Huo, “Human visual system based automatic underwater image enhancement in nsct domain,” KSII Trans. on Internet & Inf. Syst. 10(2), 1 (2016). [CrossRef]

18. A. S. A. Ghani, A. F. A. Nasir, and W. F. W. Tarmizi, “Integration of enhanced background filtering and wavelet fusion for high visibility and detection rate of deep sea underwater image of underwater vehicle,” in 2017 5th International Conference on Information and Communication Technology (ICoIC7), (IEEE, 2017), pp. 1–6.

19. P. Drews, E. Nascimento, F. Moraes, et al., “Transmission estimation in underwater single images,” in Proceedings of the IEEE international conference on computer vision workshops, (2013), pp. 825–830.

20. A. Galdran, D. Pardo, A. Picón, et al., “Automatic red-channel underwater image restoration,” J. Vis. Commun. Image Represent. 26, 132–145 (2015). [CrossRef]

21. N. Carlevaris-Bianco, A. Mohan, and R. M. Eustice, “Initial results in underwater single image dehazing,” in Oceans 2010 Mts/IEEE Seattle, (IEEE, 2010), pp. 1–8.

22. Y. Wang, H. Liu, and L.-P. Chau, “Single underwater image restoration using adaptive attenuation-curve prior,” IEEE Trans. Circuits Syst. I 65(3), 992–1002 (2018). [CrossRef]

23. I. Goodfellow, J. Pouget-Abadie, M. Mirza, et al., “Generative adversarial nets,” Advances in neural information processing systems 27, 1 (2014).

24. S. Liu, H. Fan, and S. Lin, “Adaptive learning attention network for underwater image enhancement,” IEEE Robot. Autom. Lett. 7(2), 5326–5333 (2022). [CrossRef]

25. F. Huo, B. Li, and X. Zhu, “Efficient wavelet boost learning-based multi-stage progressive refinement network for underwater image enhancement,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, (2021), pp. 1944–1952.

26. L. Peng, C. Zhu, and L. Bian, “U-shape transformer for underwater image enhancement,” IEEE Trans. on Image Process. (2023).

27. C. Chen, Z. Xiong, and X. Tian, “Real-world image denoising with deep boosting,” IEEE Trans. Pattern Anal. Mach. Intell. 42(12), 3071–3087 (2020). [CrossRef]

28. H. Dong, J. Pan, L. Xiang, et al., “Multi-scale boosted dehazing network with dense feature fusion,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, (2020), pp. 2157–2167.

29. P. Sharma, I. Bisht, and A. Sur, “Wavelength-based attributed deep neural network for underwater image restoration,” ACM Trans. Multimedia Comput. Commun. Appl. 19(1), 1–23 (2023). [CrossRef]

30. X. Zhang, X. Zhou, M. Lin, et al., “Shufflenet: An extremely efficient convolutional neural network for mobile devices,” in Proceedings of the IEEE conference on computer vision and pattern recognition, (2018), pp. 6848–6856.

31. M.-H. Guo, C.-Z. Lu, and Z.-N. Liu, “Visual attention network,” Comp. Visual Media 9(4), 733–752 (2023). [CrossRef]

32. J. Fu, J. Liu, H. Tian, et al., “Dual attention network for scene segmentation,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, (2019), pp. 3146–3154.

33. X. Li, W. Wang, X. Hu, et al., “Selective kernel networks,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, (2019), pp. 510–519.

34. S. Niklaus and F. Liu, “Context-aware synthesis for video frame interpolation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, (2018), pp. 1701–1710.

35. K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv, arXiv:1409.1556 (2014). [CrossRef]

36. Y.-T. Peng and P. C. Cosman, “Underwater image restoration based on image blurriness and light absorption,” IEEE Trans. on Image Process. 26(4), 1579–1594 (2017). [CrossRef]

37. H. Li, J. Li, and W. Wang, “A fusion adversarial underwater image enhancement network with a public test dataset,” arXivarXiv:1906.06819 (2019).10.48550/arXiv.1906.06819

38. R. Zhang, P. Isola, A. A. Efros, et al., “The unreasonable effectiveness of deep features as a perceptual metric,” in Proceedings of the IEEE conference on computer vision and pattern recognition, (2018), pp. 586–595.

39. K. Panetta, C. Gao, and S. Agaian, “Human-visual-system-inspired underwater image quality measures,” IEEE J. Oceanic Eng. 41(3), 541–551 (2016). [CrossRef]

40. M. Yang and A. Sowmya, “An underwater color image quality evaluation metric,” IEEE Trans. on Image Process. 24(12), 6062–6071 (2015). [CrossRef]

41. A. Mittal, R. Soundararajan, and A. C. Bovik, “Making a “completely blind” image quality analyzer,” IEEE Signal Process. Lett. 20(3), 209–212 (2013). [CrossRef]

	UIEB-T90			EUVP-T100
Methods	PSNR↑	SSIM↑	LPIPS↓	PSNR↑	SSIM↑	LPIPS↓
UDCP [19]	12.9076	0.5927	0.3833	15.2813	0.5860	0.4858
IBLA [36]	16.0519	0.5823	0.3977	19.6960	0.7140	0.4405
Reinex-based [15]	20.1815	0.8563	0.2809	16.5975	0.6539	0.5564
UWCNN [8]	15.8116	0.7348	0.4000	19.9705	0.7299	0.4543
Water-Net [9]	19.9586	0.8661	0.1780	21.1987	0.7674	0.4309
UWCNN-SD [10]	21.3661	0.8267	0.3750	18.8932	0.7094	0.6141
Ucolor [11]	22.7324	0.8925	0.1513	21.3184	0.7740	0.4174
LANet [24]	22.3055	0.9067	0.1501	20.4101	0.7738	0.4097
UIEC $^{2}$ -Net [12]	25.0005	0.9095	0.1369	20.3936	0.7721	0.4374
Ours	25.9446	0.9178	0.1276	21.1307	0.7933	0.4059

	UIEB-C60			U45
Methods	UIQM↑	UCIQE↑	NIQE↓	UIQM↑	UCIQE↑	NIQE↓
UDCP [19]	0.9254	0.5204	6.4374	2.0864	0.5908	4.6058
IBLA [36]	1.4668	0.5834	6.3667	1.6725	0.5922	5.4875
Reinex-based [15]	2.1883	0.6030	5.2210	3.2114	0.6324	5.1904
UWCNN [8]	1.9798	0.4938	6.8763	2.9848	0.5199	4.5414
Water-Net [9]	2.2636	0.5724	5.9351	3.2706	0.5850	5.5619
UWCNN-SD [10]	2.3047	0.6168	5.6706	3.1344	0.6110	5.5181
Ucolor [11]	2.2684	0.5423	6.1515	3.3510	0.5796	4.7789
LANet [24]	2.2168	0.5622	5.5164	3.2809	0.6013	3.9782
U-Trans [26]	$-$	$-$	$-$	3.2479	0.5673	4.6295
UIEC $^{2}$ -Net [12]	2.3485	0.5897	5.3437	3.4041	0.6055	4.0682
Ours	2.4213	0.6122	5.1205	3.4945	0.6427	4.0567

Models	PSNR↑	SSIM↑	LPIPS↓
w/o MTRM	23.5415	0.8417	0.1628
w/o MCCM	21.7388	0.7835	0.1643
w/o NLA	23.0048	0.8091	0.1313
Ours	25.9446	0.9178	0.1276

Models	PSNR↑	SSIM↑	LPIPS↓
w/o $L_{l a p}$	22.0764	0.8699	0.1472
w/o $L_{p e r c e}$	20.0163	0.8104	0.1723
Ours	25.9446	0.9178	0.1276

	UIEB-T90			EUVP-T100
Methods	PSNR↑	SSIM↑	LPIPS↓	PSNR↑	SSIM↑	LPIPS↓
UDCP [19]	12.9076	0.5927	0.3833	15.2813	0.5860	0.4858
IBLA [36]	16.0519	0.5823	0.3977	19.6960	0.7140	0.4405
Reinex-based [15]	20.1815	0.8563	0.2809	16.5975	0.6539	0.5564
UWCNN [8]	15.8116	0.7348	0.4000	19.9705	0.7299	0.4543
Water-Net [9]	19.9586	0.8661	0.1780	21.1987	0.7674	0.4309
UWCNN-SD [10]	21.3661	0.8267	0.3750	18.8932	0.7094	0.6141
Ucolor [11]	22.7324	0.8925	0.1513	21.3184	0.7740	0.4174
LANet [24]	22.3055	0.9067	0.1501	20.4101	0.7738	0.4097
UIEC $^{2}$ -Net [12]	25.0005	0.9095	0.1369	20.3936	0.7721	0.4374
Ours	25.9446	0.9178	0.1276	21.1307	0.7933	0.4059

Dual stream fusion network for underwater image enhancement of multi-scale turbidity restoration and multi-path color correction

Abstract

1. Introduction

2. Related works

2.1 Non-physical model methods

2.2 Physical model methods

2.3 Deep learning-based methods

3. Method

3.1 Multi-scale turbidity restoration module

3.2 Multi-path color correction module

3.3 Collaborative attention fusion module

3.4 Loss function

4. Experiments and results

4.1 Datasets and implement details

4.2 Comparison experiments

4.2.1 Evaluation metrics

4.2.2 Qualitative comparisons

4.2.3 Quantitative comparisons

4.3 Ablation study

5. Conclusion

Funding

Acknowledgments

Disclosures

Data availability

References

Data availability

Cited By

Figures (13)

Tables (4)

Equations (23)

Optics Express