Classification guided thick fog removal network for drone imaging: ClassifyCycle

Abstract

Foggy images captured by drones are nonuniform because fog is inhomogeneously distributed at higher altitudes, leading to obvious differences in fog thickness across an image. This paper proposes a classification guided thick fog removal network for drone imaging, termed ClassifyCycle. The drone images are first passed through the proposed classification module (ICLFn) to enhance the reliability of the subsequent learning network. A style migration module (ISMn) is then introduced to reduce image distortion, such as hue artifacts and texture distortion. The proposed ClassifyCycle network does not require paired foggy and fog-free datasets and avoids overexposure, distortion, color deviation and fog residue after defogging. Extensive experimental results show that the proposed ClassifyCycle network surpasses state-of-the-art algorithms on synthetic and real drone images captured in thick fog weather.

© 2023 Optica Publishing Group under the terms of the Optica Open Access Publishing Agreement

1. Introduction

Water vapor saturated in the air condenses into small droplets (fog), while suspended dust forms haze. Optical absorption and scattering by these droplets and dust particles seriously degrade the quality of images captured by a drone camera in foggy weather. The drone imaging model in a foggy environment is shown in Fig. 1. Drone images are usually captured at higher altitudes, where the fog is thicker and the image degradation is more severe. The success of computer vision tasks such as object detection, target tracking and region segmentation depends on high-quality images with minimal degradation. Hence, a thick fog removal algorithm embedded in the drone can enhance its visual imaging ability in foggy weather.

Fig. 1. The imaging model of drone in foggy weather.

Early methods [1–4] based on image priors perform defogging by estimating the fog concentration; their results are often inconsistent with the real scene because these algorithms are unstable. Deep learning methods [5–10] learn fog-free image information directly, reducing the impact of inaccurate parameters, and their defogging effect is superior to that of image prior algorithms. However, most deep learning research relies on a large number of paired foggy and fog-free images, and it is impractical to capture a foggy image and its fog-free counterpart at the same time in real life. Most existing work also focuses on removing thin and uniform fog, whereas foggy images captured by drones are nonuniform because fog is inhomogeneously distributed at higher altitudes, leading to obvious differences in fog thickness across an image. Hence, the performance of previous defogging algorithms is poor in thick fog imaging environments. To address these problems, this paper presents Aerial-Fog, a thick fog dataset based on drone images, and designs a classification guided thick fog removal network based on style migration to improve drone imaging quality.

Overall, our contributions are listed as follows:

  • The drone foggy imaging dataset, Aerial-Fog, is proposed by applying random fog concentration parameters to multi-angle drone images; it contains 1635 foggy images with different fog thicknesses.
  • The classification guided thick fog removal network based on style migration, termed ClassifyCycle, is designed to address drone imaging in thick fog weather while avoiding overexposure, distortion, color deviation and fog residue.
  • The image classification module (ICLFn) is incorporated to enhance the reliability of the subsequent learning network. The image style migration module (ISMn) is introduced to effectively preserve texture information during defogging by adding skip connections.
  • Extensive experimental results show that the proposed defogging network has outstanding learning performance and strong generalization ability on synthetic and real drone images, especially in thick fog weather, and its defogging effect is better than that of previous algorithms.

2. Related works

Image visibility affects the identification accuracy of computer vision tasks, so high-quality images with minimal degradation are useful for vision tasks in foggy environments. This section reviews previous defogging work in terms of both image prior-based and learning-based methods.

2.1 Image prior-based defogging methods

Image prior-based defogging methods (hand-crafted priors) adopt the atmospheric scattering model to defog. Mathematically, the fogging process based on the atmospheric scattering model [11–13] can be expressed as:

$${I^{fog}}(x )= {J^{free}}(x )t(x )+ A({1 - t(x )} )$$
where $x$ represents the spatial position in the image, ${I^{fog}}(x )$ and ${J^{free}}(x )$ represent the pixel values at position $x$ in the foggy and fog-free images respectively, $t(x )$ denotes the atmospheric light transmittance, and $A$ indicates the luminance of the light intensity from infinity. To better express the image generation principle, Eq. (1) can be rewritten as:
$${J^{free}}(x) = A + {{({I^{fog}}(x) - A)} / {t(x)}}$$
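As a reference, the sketch below applies Eqs. (1) and (2) with NumPy, assuming a given transmission map $t(x)$ and atmospheric light $A$; the lower clamp on $t$ is a common safeguard against division by near-zero values and is not part of the original formulation.

```python
import numpy as np

def synthesize_fog(j_free: np.ndarray, t: np.ndarray, A: float) -> np.ndarray:
    """Eq. (1): I_fog(x) = J_free(x) * t(x) + A * (1 - t(x)); j_free is H x W x 3, t is H x W."""
    t = t[..., None]                      # broadcast transmission over the RGB channels
    return j_free * t + A * (1.0 - t)

def recover_fog_free(i_fog: np.ndarray, t: np.ndarray, A: float, t_min: float = 0.1) -> np.ndarray:
    """Eq. (2): J_free(x) = A + (I_fog(x) - A) / t(x), with t clamped to avoid division by ~0."""
    t = np.clip(t, t_min, 1.0)[..., None]
    return A + (i_fog - A) / t
```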

Many defogging methods based on image priors were proposed in the early stage. Fattal [1] proposed a color-line prior method that eliminates scattered light to obtain a depth map and a fog-free map, but this method is not suitable for grayscale scenes. Addressing this limitation, Tan [14] introduced an automated defogging method that exploits the fact that fog-free images have higher contrast than foggy images, but it restores image colors poorly. The DCP model developed by He et al. [15] eliminates fog by inverting the fog formation process; however, it performs poorly on gray and white objects that resemble the atmospheric light. The above methods are limited by the assumption of statistical independence, which easily leads to inaccurate results. Zhu et al. [16] designed a linear color attenuation prior defogging method that builds a linear model and uses the depth of field (DOF) of the foggy map to recover depth information. Berman et al. [17] proposed representing the fog-free image with tight clusters formed by haze-lines, but only pixels that fit the model assumptions contribute to the result.

Although the above image prior methods are effective for defogging, their effect is limited when the foggy image deviates from the prior assumption, which causes changes of color hue and loss of texture information.

2.2 Learning-based defogging methods

Unlike the image prior defogging methods described above, learning-based defogging methods apply deep networks to synthetic foggy images to defog faster. The DehazeNet model designed by Cai et al. [18] used a BReLU function for medium transmission estimation to recover fog-free images. The MSCNN model proposed by Ren et al. [19] learns the mapping between transmission maps and foggy maps to predict the defogging result quickly, but performs poorly in nighttime foggy environments. The AIPNet model proposed by Wang et al. [20] showed that in foggy weather illumination mainly affects the luminance channel in YCrCb space and has little effect on the chrominance channels, and exploited this to enhance visual contrast and recover degraded texture information. The DCPDN model designed by Zhang and Patel [38] adopted a joint discriminator to determine the authenticity of images, at the cost of long computation time.

The unsupervised learning approach aims to transfer information from a source domain to a target domain. Recently, the most common unsupervised method has been the Generative Adversarial Network (GAN) [21]. Qu et al. [22] transformed the defogging problem into an image-to-image translation problem, but this method still requires paired images. Zhu et al. [23] proposed an unpaired network called CycleGAN that transforms horses in images into zebras; this method was later applied to defogging, and most unsupervised defogging methods are designed based on it. Yang et al. [24] designed a disentangled defogging network that relieves the training constraints by using unpaired images. The CDNet model [25] incorporates an optical model on top of CycleGAN. The DD-CycleGAN model [26] introduced double discriminators to stabilize cyclic training, which learns information from the target domain effectively but adds extra time consumption. Although unsupervised learning methods [27–29] reduce the difficulty of obtaining paired images, the randomness of GANs makes the results unpredictable. For example, due to the randomness of unsupervised learning parameters, defogging algorithms may learn different hue and texture information from fog-free images, which can result in unrealistic image generation. To address these problems, the proposed network focuses on eliminating inauthenticity, distortion and color deviation.

3. Method

The proposed network architecture includes: (i) an image classification module (ICLFn); (ii) a style migration module (ISMn). The framework is shown in Fig. 2. Broadly, the ICLFn module classifies each input image into a certain category. The classified drone thick fog images are then input into the ISMn module, which produces fog-free images by learning from target domain images.

Fig. 2. The main architecture of the proposed method. The ICLFn module (see Fig. 3) classifies the drone thick fog images into different categories and then the foggy images are defogged by the ISMn module (see Fig. 4). The ICLFn module is shown on the upper left. The classified foggy images are input to the ISMn module (see the right panel above). The generator ${G_{AB}}$ converts images from foggy to fog-free and the generator ${G_{BA}}$ runs the other way around. The discriminators ${D_A}$ and ${D_B}$ will determine the true or false values of images.

3.1 Method overview

As illustrated in Fig. 2, to address the problems of domain shift, distortion, color deviation and fog residue caused by differences in image hue and texture information, a classification guided thick fog removal network based on style migration is proposed. A synthetic drone thick fog dataset ${O_i} = {\{ {o_i}\} _{i = 1,2\ldots N}}$, where $N$ represents the total number of images, is input to the ICLFn classification module, bridging the gap between the distortion and color deviation caused by the defogging process of the ISMn module. The ISMn module then removes fog while effectively preserving texture information. The generator ${G_{BA}}$ creates the corresponding reconstructed map by learning from the pseudo-real image produced by the generator ${G_{AB}}$, which avoids the additional burden of introducing new generators. Images with different category information are fed into different ISMn instances (this paper uses five categories) to defog, as sketched after Eq. (4). From the ICLFn module,

$${D_j}/{W_j}/{S_j}/{T_j}/{R_j} = {\{{{d_j}/{w_j}/{s_j}/{t_j}/{r_j}} \}_{j = 1,2\ldots N}}$$
and
$${F_m}/{G_m}/{H_m}/{K_m}/{L_m} = {\{{{f_m}/{g_m}/{h_m}/{k_m}/{l_m}} \}_{m = 1,2\ldots N}}$$
are input as the source domain and target domain, respectively, to the ISMn module to remove the fog. In Eq. (3), ${D_j}/{W_j}/{S_j}/{T_j}/{R_j}$ denote the five categories of foggy images, $\{{{d_j}/{w_j}/{s_j}/{t_j}/{r_j}} \}$ represents the set of images in each category, and $N$ represents the total number of samples in that category (the value of $N$ differs between categories). Equation (4) is defined analogously to Eq. (3), but for fog-free images.
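A minimal routing sketch of this idea is given below (hypothetical names); each category is mapped to its own ISMn generator, with `nn.Identity()` standing in for the trained generators.

```python
import torch
import torch.nn as nn

CATEGORIES = ["dayway", "house", "night", "skyscrapers", "twilight"]

# Placeholder generators; in the full model each entry would be an ISMn generator
# trained on its own category (nn.Identity() is only a stand-in here).
ism_generators = {name: nn.Identity() for name in CATEGORIES}

def route_and_defog(foggy: torch.Tensor, class_logits: torch.Tensor) -> torch.Tensor:
    """Pick the category predicted by ICLFn and run the matching ISMn generator."""
    label = CATEGORIES[int(class_logits.argmax())]
    return ism_generators[label](foggy)
```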

3.2 Image classification module (ICLFn module)

ICLFn solves the distortion problem caused by defogging effectively and simultaneously enhances the reliability of the subsequent learning network. The thick fog drone image is classified by ICLFn, a module that uses ResNet-50 [30] as the initial model and is fine-tuned with convolution layers, normalization layers, linear layers and an ELU activation function, as shown in Fig. 3. The ICLFn module divides the input images into five categories and outputs the corresponding feature labels. The input images and the output labels of the ICLFn module are then passed together to the ISMn module for subsequent processing. The images in this study range from a maximum size of 1920 × 1080 to a minimum size of 960 × 540, and any input image is resized to 540 × 540. This choice ensures that smaller images are not stretched to the point of distortion, while still allowing important feature information to be captured for category classification. The convolutional block contains four 3 × 3 convolutional layers. During up-sampling and down-sampling, feature maps of different sizes use different convolutional layers to adjust the size and number of feature maps and thus extract their information.
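A possible PyTorch sketch of such a classifier is shown below; the exact head layout (channel widths, pooling) is an assumption, since the paper only specifies a fine-tuned ResNet-50 backbone with convolution, normalization, linear and ELU layers, five output categories, and 540 × 540 inputs.

```python
import torch
import torch.nn as nn
from torchvision import models, transforms

class ICLFn(nn.Module):
    """Sketch of the classification module: ResNet-50 backbone plus a custom
    head built from convolution, normalization, ELU and linear layers."""
    def __init__(self, num_classes: int = 5):
        super().__init__()
        backbone = models.resnet50(weights=None)                 # pretrained weights optional
        self.features = nn.Sequential(*list(backbone.children())[:-2])  # drop avgpool + fc
        self.head = nn.Sequential(
            nn.Conv2d(2048, 512, kernel_size=3, padding=1),
            nn.BatchNorm2d(512),
            nn.ELU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(512, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.features(x))

# Inputs of any resolution are resized to 540 x 540 before classification.
preprocess = transforms.Compose([transforms.Resize((540, 540)), transforms.ToTensor()])
```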

Fig. 3. The schematic illustration about the ICLFn module.

3.3 Image style migration module (ISMn module)

Compared with previous defogging algorithms [31–36], the ISMn module uses a generator that learns continuously to defog while retaining texture information. As shown in Fig. 4, category information from the ICLFn module is passed to the corresponding ISMn instance, which consists of convolution, up-sampling, down-sampling and residual blocks. Larger input sizes would make the ISMn module computationally intensive and memory-hungry, making model training difficult and time-consuming. Hence, the ISMn inputs are resized to 256 × 256 to strike a balance between model performance and computational cost. The generator network has six convolutional layers and nine residual blocks. The generator first adopts reflection padding (ReflectionPad2d), then one convolution layer and two down-sampling layers to increase the effective receptive field; the feature map is then fed into the residual blocks and two up-sampling layers. Finally, the generator uses a reflection-padded output layer to restore the image to its original resolution.
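The following sketch illustrates a generator of this form, assuming a standard CycleGAN-style layout (instance normalization, Tanh output); the skip connection described later in this subsection is shown in a separate sketch.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, ch: int):
        super().__init__()
        self.block = nn.Sequential(
            nn.ReflectionPad2d(1), nn.Conv2d(ch, ch, 3), nn.InstanceNorm2d(ch), nn.ReLU(True),
            nn.ReflectionPad2d(1), nn.Conv2d(ch, ch, 3), nn.InstanceNorm2d(ch),
        )

    def forward(self, x):
        return x + self.block(x)

class ISMnGenerator(nn.Module):
    """Sketch of the ISMn generator: reflection padding, a stem convolution,
    two down-sampling convolutions, nine residual blocks, two up-sampling
    convolutions, and a reflection-padded output convolution."""
    def __init__(self, in_ch: int = 3, base: int = 64, n_res: int = 9):
        super().__init__()
        layers = [nn.ReflectionPad2d(3), nn.Conv2d(in_ch, base, 7),
                  nn.InstanceNorm2d(base), nn.ReLU(True)]
        ch = base
        for _ in range(2):                                  # down-sampling stage
            layers += [nn.Conv2d(ch, ch * 2, 3, stride=2, padding=1),
                       nn.InstanceNorm2d(ch * 2), nn.ReLU(True)]
            ch *= 2
        layers += [ResidualBlock(ch) for _ in range(n_res)]  # nine residual blocks
        for _ in range(2):                                  # up-sampling stage
            layers += [nn.ConvTranspose2d(ch, ch // 2, 3, stride=2,
                                          padding=1, output_padding=1),
                       nn.InstanceNorm2d(ch // 2), nn.ReLU(True)]
            ch //= 2
        layers += [nn.ReflectionPad2d(3), nn.Conv2d(ch, in_ch, 7), nn.Tanh()]
        self.model = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.model(x)   # 256 x 256 inputs in this paper
```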

Fig. 4. The schematic illustration about the ISMn module.

During this process, skip connections are added to the generator (indicated by the red arrow in Fig. 4). As the depth of the architecture increases, deeper information about the image is gradually acquired and more feature information of the overlapping regions is captured. However, shallow information is lost and finer image details become unavailable, so the model tends to overfit and its accuracy decreases. To alleviate this, a skip connection is embedded before the down-sampling stage: the result of the first convolution block is kept as shallow information and is combined with the output after all convolution blocks. This design collects both deep and shallow information, helping the model capture the global image structure and context earlier. Embedding the skip connection before down-sampling also helps avoid information loss.
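A minimal sketch of this skip connection is given below; the paper states only that the shallow and deep results are combined, so concatenation followed by a 1 × 1 fusion convolution is an assumption.

```python
import torch
import torch.nn as nn

class SkipGenerator(nn.Module):
    """Sketch of the skip connection: the shallow feature map produced before
    down-sampling is kept and merged with the deep feature map produced after
    the up-sampling stage."""
    def __init__(self, stem: nn.Module, body: nn.Module, ch: int = 64):
        super().__init__()
        self.stem, self.body = stem, body      # stem: first conv block; body: remaining blocks
        self.fuse = nn.Conv2d(2 * ch, ch, kernel_size=1)   # assumed fusion layer

    def forward(self, x):
        shallow = self.stem(x)                 # features kept before down-sampling
        deep = self.body(shallow)              # down-sample -> residual blocks -> up-sample
        return self.fuse(torch.cat([shallow, deep], dim=1))
```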

The discriminator network judges the authenticity of the generated images. The discriminator uses three convolutional blocks, each containing one convolutional layer followed by a LeakyReLU activation function.
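A sketch of such a discriminator is shown below; the kernel sizes, strides and the final patch-level output layer are assumptions consistent with common GAN discriminators.

```python
import torch.nn as nn

class ISMnDiscriminator(nn.Module):
    """Sketch of the discriminator: three convolutional blocks, each a strided
    convolution followed by LeakyReLU, ending in a 1-channel real/fake map."""
    def __init__(self, in_ch: int = 3, base: int = 64):
        super().__init__()
        self.model = nn.Sequential(
            nn.Conv2d(in_ch, base, 4, stride=2, padding=1), nn.LeakyReLU(0.2, True),
            nn.Conv2d(base, base * 2, 4, stride=2, padding=1), nn.LeakyReLU(0.2, True),
            nn.Conv2d(base * 2, base * 4, 4, stride=2, padding=1), nn.LeakyReLU(0.2, True),
            nn.Conv2d(base * 4, 1, 4, padding=1),   # patch-wise real/fake score map
        )

    def forward(self, x):
        return self.model(x)
```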

3.4 Training losses

The classification module (ICLFn) uses a cross-entropy loss, and the defogging module (ISMn) uses an adversarial loss and a cycle consistency loss. Together these constitute the network loss of the training model.

The cross-entropy loss describes the uncertainty of the category probability distribution, i.e., whether the judged category is true or false. This loss does not require complex mathematical operations, which simplifies model training. Moreover, the algorithm is designed for multi-class classification in which the number of samples per class differs; the loss can assign different weights to samples from different classes, thereby addressing the issue of imbalanced data to some extent. The formula is as follows:

$$Loss ={-} \sum\limits_{i = 1}^n {g({m_i})} \log (f({m_i}))$$
where $n$ represents the total number of images, ${m_i}$ represents the ${i^{th}}$ sample, $g({{m_i}} )$ its actual label, and $f({{m_i}} )$ the predicted label probability for the ${i^{th}}$ sample.

Similar to the binary classification method, multi-classification can be expressed as:

$${L_{ICLFn}} ={-} \frac{1}{N}\sum\limits_{i = 1}^N {\sum\limits_{j = 1}^K {{g_{i,j}}} \ln {f_{i,j}}}$$
where ${g_{i,j}}$ indicates whether the true label of the ${i^{th}}$ sample is $j$ (with $K$ label values and $N$ samples in total), and ${f_{i,j}}$ represents the predicted probability that the ${i^{th}}$ sample belongs to category $j$.
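In PyTorch this corresponds to the standard cross-entropy loss; the sketch below also shows the per-class weighting mentioned above, with illustrative weight values (roughly the inverse class frequencies of the five training categories).

```python
import torch
import torch.nn as nn

# Illustrative class weights to counter the imbalance between the five categories.
class_weights = torch.tensor([1.0, 1.1, 0.6, 0.7, 2.1])
criterion = nn.CrossEntropyLoss(weight=class_weights)

logits = torch.randn(8, 5)                # ICLFn outputs for a batch of 8 images
targets = torch.randint(0, 5, (8,))       # ground-truth category indices
loss_iclfn = criterion(logits, targets)   # Eq. (6) applied to softmax probabilities
```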

The adversarial loss is used to optimize the performance of both the generator and discriminator. It helps the generator produce realistic target images and aids the discriminator in distinguishing between generated and real images. The cycle consistency loss ensures that the generator maintains consistency between images during the image transformation process. This loss functions by comparing whether source domain images, after undergoing transformations by both generators, remain consistent with the original source domain images. This measure is vital for preventing information loss and maintaining image stability. Furthermore, cycle consistency loss also guarantees the feasibility of the generator’s inverse mapping process. It is described by the formula as follows:

$$\begin{array}{r} {L_{ISMn}} = {E_{x\sim {P_{data(x)}}}}[\log {D_A}({G_{AB}}(x)) + \log {D_A}(x) + ||{G_{BA}}({G_{AB}}(x)) - x|{|_1}]\\ + {E_{y\sim {P_{data(y)}}}}[\log {D_B}(y) + \log {D_B}({G_{BA}}(y)) + ||{G_{AB}}({G_{BA}}(y)) - y|{|_1}] \end{array}$$
where ${G_{AB}}$ continuously learns from images of domain $Y$ and generates images ${G_{AB}}(x )$, ${G_{BA}}$ continuously learns from images of domain $X$ and generates images ${G_{BA}}(y )$, and ${D_A}$ and ${D_B}$ discriminate the authenticity of the images generated by ${G_{AB}}$ and ${G_{BA}}$ respectively.
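A sketch of the generator-side ISMn loss is given below, assuming a binary cross-entropy adversarial term and an L1 cycle term with an assumed cycle weight `lam_cyc`; the paper does not state these implementation choices explicitly.

```python
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()   # adversarial term (least-squares is another common choice)
l1 = nn.L1Loss()               # cycle-consistency term

def ismn_generator_loss(G_AB, G_BA, D_A, D_B, x, y, lam_cyc=10.0):
    """Adversarial terms push generated images towards the target domain; the L1
    cycle terms keep G_BA(G_AB(x)) close to x and G_AB(G_BA(y)) close to y."""
    fake_free = G_AB(x)                    # foggy -> fog-free
    fake_fog = G_BA(y)                     # fog-free -> foggy
    pred_free, pred_fog = D_A(fake_free), D_B(fake_fog)
    adv = bce(pred_free, torch.ones_like(pred_free)) + bce(pred_fog, torch.ones_like(pred_fog))
    cyc = l1(G_BA(fake_free), x) + l1(G_AB(fake_fog), y)
    return adv + lam_cyc * cyc             # lam_cyc is an assumed cycle weight
```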

The full objective loss can be described as follows:

$$L = {L_{ICLFn}} + \lambda {L_{ISMn}}$$
where $\lambda$ regulates the relative weight of the two loss terms during training. The value of $\lambda$ ranges from 0 to 1. In this experiment, $\lambda$ starts at 0.5 and is gradually increased while observing the resulting maps to balance the two tasks.

3.5 Incorporation of ClassifyCycle model in drone imaging pipeline

The raw data captured by the image sensor must be processed by an image processing pipeline (IPP) to obtain final images that match human visual habits. The proposed ClassifyCycle network can be incorporated into the traditional color image IPP, as shown in Fig. 5. Although the model can be embedded either before or after image compression, we suggest incorporating it before image compression for better output image quality.

Fig. 5. The image signal process of our algorithm, where DCP indicates defect point correction, BLD indicates black level correction, LSC indicates lens shading correction.

Compressed images and uncompressed images are compared on our model to demonstrate the effectiveness of embedding the model before compression, as shown in Fig. 6.

Fig. 6. Comparison of using uncompressed foggy image and compressed foggy image.

As can be seen from Fig. 6, the content information retained by uncompressed images after defogging is more realistic. The blue boxes in the lower right corner of each image are magnified views of the red boxes. The fogging parameters applied to the uncompressed BMP image and the compressed JPEG image are the same. The first row shows that the uncompressed image has clearer details and more realistic colors than the compressed image, and the second and third rows show that the uncompressed image retains more detail. Hence, we suggest embedding the module before compression.

4. Experiments

4.1 Dataset

Existing defogging methods mainly study thin fog, whereas foggy images captured by drones are nonuniform because fog is inhomogeneously distributed at higher altitudes, leading to obvious differences in fog thickness across an image. Because of the lack of a thick fog image dataset captured by drones, a new dataset, Aerial-Fog, is proposed, covering a variety of image styles from the perspective of drone photography.

The dataset is generated by a center-point synthetic fog method: fog diffuses from a central point, and the further a pixel is from the center, the weaker the generated fog. The original images come from the VisDrone2019 dataset [37], which consists of 288 video clips (261,908 frames) and 10,209 static images captured by drone-mounted cameras, covering scenes such as city streets, villages, highways and various landforms across 14 different cities (both sparse and dense scenes), under different scenarios, weather and lighting conditions. The dataset contains several image resolutions: 1360 × 765, 960 × 540, 1920 × 1080, 1916 × 1078, 1400 × 788 and 1400 × 1050. We randomly select a portion of the images as the clean dataset for subsequent operations. The center-point synthetic fog method proposed in this paper is based on the atmospheric scattering model. By controlling atomization parameters such as brightness, concentration, atomization size and atomization center, each pixel of the image is adjusted and fog is added to the three RGB channels. We either specify the brightness and fog concentration or randomly initialize the fog concentration to obtain foggy images with different atomization effects. An atomization center point is randomly generated within one-fourth to one-third of the distance from the image center, and the atomization size is adjusted to control the atomization range. After many attempts, the adjustment ranges of the input images were determined: the brightness value is set in the range 0.8-1, the concentration value in the range 0.06-0.16, and the size in the range 48-60. During fog generation, values are randomly selected from these ranges, the available foggy images are generated and selected, and inappropriate images are regenerated until no distortion remains. The generated foggy images can be divided into thin fog, thick fog and agglomerate fog images, and the area occupied by fog in an image varies with the size range. The process is described by the following formulas:

$${H_{fog}} = im{g_{f[u ][v ][o ]}}zn + A({1 - zn} )$$
$$n ={-} 0.04\sqrt {{{({u - ce[0 ]} )}^2} + {{({v - ce[1 ]} )}^2}} + Q$$
where $zn$ represents the transmittivity, $n$ represents the distance from the current pixel to the center pixel, $A$ represents the luminance, $ce[]$ is the center point position, $u$ and $v$ denote the pixel coordinates in the image, $Q$ is the set atomization size, $im{g_{f[u ][v ][o ]}}$ indicates the normalized image indexed by position and channel, and $o$ ranges from 1 to 3 to access the three color channels of the pixel.
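A minimal sketch of the center-point fog synthesis is given below, following Eqs. (9) and (10); the exponential mapping from the distance term $n$ to the transmittance $zn$ is an assumption consistent with the atmospheric scattering model, and the default parameter values lie within the ranges reported above.

```python
import numpy as np

def add_center_fog(img: np.ndarray, center: tuple, Q: float = 55.0,
                   A: float = 0.9, beta: float = 0.1) -> np.ndarray:
    """Center-point fog synthesis. img is a normalized H x W x 3 array, A is the
    brightness, beta the fog concentration, Q the atomization size."""
    h, w = img.shape[:2]
    u, v = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    n = -0.04 * np.sqrt((u - center[0]) ** 2 + (v - center[1]) ** 2) + Q  # Eq. (10)
    zn = np.clip(np.exp(-beta * n), 0.0, 1.0)   # assumed transmittance model, clipped to [0, 1]
    zn = zn[..., None]                           # broadcast over the RGB channels
    return img * zn + A * (1.0 - zn)             # Eq. (9)
```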

The proposed Aerial-Fog dataset contains 1559 pairs of foggy and corresponding fog-free images, plus 76 unpaired foggy images. The 1559 pairs are used for training: 278 in the 'dayway' category, 262 in 'house', 455 in 'night', 432 in 'skyscrapers' and 132 in 'twilight'. The foggy images learn from randomly chosen fog-free image information, so this process does not require a one-to-one correspondence between images. The Aerial-Fog dataset proposed in this paper refers to the dataset used for training and testing. We randomly selected 76 foggy images from the generated foggy images as the test set; the fog-free counterparts of these 76 images appear in neither the training set nor the test set, so these 76 foggy images are considered unpaired in the context of our model. The 76 unpaired foggy images and real foggy images captured by ourselves are used for testing. By setting random concentration parameters, the images are classified into clean, thin fog, thick fog and agglomerate fog images during fog map generation. Our dataset covers a wide range of fog concentrations and presents more complete results than a dataset with only one concentration class.

Some paired image instances randomly selected from the Aerial-Fog dataset are shown in Fig. 7.

Fig. 7. Some images of Aerial-Fog dataset. (a): ground-truth of different categories. (b): foggy images with different concentrations corresponding to images of (a).

4.2 Implementation details and competitors

The experimental results in this paper are obtained by training on a Tesla V100 GPU in PyTorch. The batch size is set to 1 and the Adam optimizer is used during training. The total number of training iterations is 300: the first 150 use a learning rate of 0.0002, and in the last 150 the learning rate is decreased according to a linear rule every 50 iterations.
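A sketch of this training setup is shown below; the Adam betas and the exact per-50-iteration decay factors are assumptions, since the paper only states a linear decay rule for the second half of training.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 3, 3, padding=1)   # stands in for the ClassifyCycle networks
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4, betas=(0.5, 0.999))

def lr_lambda(epoch: int) -> float:
    """lr multiplier: 1.0 for the first 150 iterations, then stepped down linearly
    every 50 iterations (150-199 -> 0.75, 200-249 -> 0.5, 250-299 -> 0.25)."""
    if epoch < 150:
        return 1.0
    return 1.0 - 0.25 * ((epoch - 100) // 50)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```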

The proposed method is compared with several previous defogging algorithms on the Aerial-Fog and HSTS datasets, including DMPHN [32], DCPDN [38], DehazeFormer [39], D4 [40], GCANet [41] and PFFNet [42]. Among them, D4, PFFNet, DMPHN, DehazeFormer and DCPDN are retrained on the Aerial-Fog dataset before testing, whereas GCANet is tested with the pretrained model provided by the original authors because no training code is available.

4.3 Metrics

To better demonstrate the superiority of our method, peak signal-to-noise ratio (PSNR), structural similarity (SSIM), learned perceptual image patch similarity (LPIPS) and inference time are calculated to quantitatively evaluate performance. Inference time refers to the time taken to test the trained models, i.e., the time taken for defogging. Higher PSNR and SSIM values (SSIM lies in the range 0 to 1) and lower LPIPS values indicate higher image quality. PSNR reflects the quality of image reconstruction to some extent by comparing the peak signal-to-noise ratio between the original image and the defogged image. SSIM considers the brightness, contrast and structure of the image and simulates how the human eye perceives it, providing a more comprehensive assessment that includes the retention of image structure. LPIPS is a perceptual image quality metric based on deep learning; it simulates the characteristics of the human visual perception system and therefore provides a higher-level evaluation that aligns more closely with human subjective perception.

The defogged image Imagea and the ground-truth image Imageb form the image pair used for evaluation.

The calculation formula of PSNR is as follows:

$$PSNR({Imag{e_a},Imag{e_b}} )= 10 \ast {\log _{10}}({MA{X^2}} )/MSE$$
$$MSE = \frac{1}{N} \ast \sum\limits_{i = 1}^N {{{({Imag{e_a}[i ]- Imag{e_b}[i ]} )}^2}}$$
where $MAX$ is 255, $N$ is the number of pixels, $MSE$ is the mean squared error, representing the square average of the pixel difference between two images.
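A direct NumPy implementation of Eqs. (11) and (12) might look as follows.

```python
import numpy as np

def psnr(image_a: np.ndarray, image_b: np.ndarray, max_val: float = 255.0) -> float:
    """PSNR following Eqs. (11)-(12); images are arrays of equal shape."""
    mse = np.mean((image_a.astype(np.float64) - image_b.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")          # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)
```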

The calculation formula of SSIM is as follows:

$$\begin{array}{c} SSIM({Imag{e_a},Imag{e_b}} )= ({2 \ast {\alpha_a} \ast {\alpha_b} + C1} )\ast ({2 \ast {\delta_{ab}} + C2} )\\ /({({{\alpha_a}^2 + {\alpha_b}^2 + C1} )\ast ({{\delta_a}^2 + {\delta_b}^2 + C2} )} )\end{array}$$
where ${\alpha _a}$ and ${\alpha _b}$ are the mean brightness values of Imagea and Imageb respectively, ${\delta _a}$ and ${\delta _b}$ are the standard deviations of the brightness of Imagea and Imageb, ${\delta _{ab}}$ is the covariance of the brightness of Imagea and Imageb, and $C1$ and $C2$ are stability constants that avoid a zero denominator.
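In practice SSIM is usually computed with an existing implementation rather than directly from Eq. (13); the snippet below uses scikit-image (the `channel_axis` argument requires scikit-image 0.19 or newer).

```python
import numpy as np
from skimage.metrics import structural_similarity

# SSIM of two RGB uint8 images; scikit-image computes the local statistics and the
# stability constants C1, C2 internally.
img_a = np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8)
img_b = np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8)
score = structural_similarity(img_a, img_b, channel_axis=2, data_range=255)
```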

The calculation formula of LPIPS is as follows:

$$LPIPS({Imag{e_a},Imag{e_b}} )= \sum\limits_l {\frac{1}{{{H_l}{W_l}}}\sum\limits_{h,w} {||{{\omega_l} \odot ({y_{ahw}^l - y_{bhw}^l} )} ||_2^2} }$$
where $l$ indexes the feature layers, ${H_l}$ and ${W_l}$ are the height and width of feature layer $l$, ${\omega _l}$ represents the weights of the different feature layers, ${\odot}$ denotes element-wise multiplication, and $y_{ahw}^l$ and $y_{bhw}^l$ represent the feature activations of Imagea and Imageb at position (h,w) in layer $l$, respectively.
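LPIPS is likewise usually computed with the reference `lpips` package; the backbone choice below (AlexNet) is an assumption, since the paper does not state which feature extractor was used.

```python
import torch
import lpips  # pip install lpips

# Inputs are expected to be N x 3 x H x W tensors scaled to [-1, 1].
loss_fn = lpips.LPIPS(net="alex")
image_a = torch.rand(1, 3, 256, 256) * 2 - 1
image_b = torch.rand(1, 3, 256, 256) * 2 - 1
distance = loss_fn(image_a, image_b)   # lower means perceptually closer
```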

4.4 Simulated paired foggy images performance evaluation

A comparison of our ClassifyCycle method with previous methods on the proposed Aerial-Fog and HSTS datasets is listed in Tables 1 and 2. The ClassifyCycle method achieves 25.4265 dB PSNR, 0.8596 SSIM and 0.3465 LPIPS on the Aerial-Fog dataset, and 27.0480 dB PSNR, 0.9133 SSIM and 0.0699 LPIPS on the HSTS dataset.

Table 1. Quantitative comparisons with previous methods on Aerial-Fog dataset. Bolded text indicates the best performance.

Table 2. Quantitative comparisons with previous methods on HSTS dataset. Bolded text indicates the best performance.

From Table 1, although our time metric is worse than that of other algorithms, it is sufficient for fluent operation, and our approach yields the best performance on the other metrics. From Table 2, the SSIM metric of DehazeFormer is better than that of ClassifyCycle, but in the fourth row the restoration of texture information in the exposed sky area shows distortion.

In Fig. 8, the area shown in the blue box in the lower right corner of each figure is a magnified version of the area in the red box. The visual results show that, on this thick fog drone image dataset, the PFFNet algorithm cannot effectively remove the fog and significantly changes the color hue of the image; the brightness is greatly reduced, the image information is unclear, and it is difficult to judge whether the image was captured during the day or at night. The D4 algorithm does not change the color information, but its defogging effect is poor and it overexposes the overly bright areas of the image. The DCPDN algorithm only has an effect on the thin fog at the edge of the fog region. The DMPHN algorithm does not perfectly retain the hue information, and the overall color of the generated defogged image becomes whitish. Unlike these four methods, the DehazeFormer and GCANet algorithms achieve relatively good defogging on this dataset, but show obvious color differences compared with the GT images. In the seventh row, DehazeFormer darkens the overall color and introduces blocky distortion, while GCANet does not remove the fog in the middle of the image. Our method removes fog cleanly while restoring the color hue well and preserving the various details of the image.

Fig. 8. Qualitative results for paired datasets in Aerial-Fog.

The results and performance on the HSTS dataset are shown in Fig. 9 and Table 2. In the first and third rows, the PFFNet and DMPHN methods do not work well: the color of the sky becomes completely different, the results of D4 become darker and the lake surfaces fail to show reflections. DCPDN has a poor defogging effect, while GCANet performs relatively well but brightens the color of the sky. The colors produced by our method are closest to the real fog-free image. In the fourth row, the colors of DehazeFormer and GCANet become exaggerated; D4 is relatively better but the house surfaces appear lighter than in the real image. Our method achieves the best defogging effect, and the color of the whole image is closest to the GT.

Fig. 9. Qualitative results for paired datasets in HSTS.

4.5 Real fog images and simulated unpaired foggy images performance evaluation

In Fig. 10, the area shown in the blue box in the lower right corner of each figure is a magnified version of the area in the red box. From the first three rows of images, the DehazeFormer, DMPHN and PFFNet methods change the overall color of the image, and the DCPDN, D4 and GCANet methods have a poor defogging effect on the sky, whereas our method restores the color hue realistically. In the sixth row, PFFNet produces a dark overall image, and DCPDN and D4 have little defogging effect. DMPHN and GCANet remove the fog but darken the color of the houses. DehazeFormer removes fog relatively well; the houses at the edge appear properly defogged, but the houses in the middle are hardly changed. Our method generates the most natural-looking houses and removes the fog cleanly. In the night images, the foggy areas of DCPDN are overexposed, DMPHN and PFFNet change the overall color, and D4 and GCANet cannot remove the fog completely, defogging only the edge areas appropriately; DehazeFormer is relatively good, but its restoration of dark hues is imperfect, while our method removes the fog and restores the color hue authentically.

Fig. 10. Qualitative results for unpaired datasets in Aerial-Fog dataset and real foggy images.

Compared with the algorithms evaluated above, our method generates the most complete and most authentic images. This suggests that the proposed method can, to some extent, alleviate the problem of insufficient realism when applied to real datasets. Both the metrics and the resulting images show that the fog removal performance of the ClassifyCycle network outperforms the other methods on both the proposed Aerial-Fog dataset and other datasets.

4.6 Anomaly detection

Because the ICLFn module can make classification errors, misclassified images may be fed into the wrong ISMn branch, which can cause image distortion and artifacts. The ICLFn module is therefore tested separately and its accuracy is reported in Table 3, which lists the classification accuracy of each category as well as the total accuracy.

Table 3. The classification accuracy of ICLFn module

In addition, the impact of classification errors on the defogging performance of the ISMn module is evaluated, and the corresponding results are shown in Fig. 11.

Fig. 11. Defogging maps after classification error.

As can be seen from Fig. 11, category classification errors lead to distortion and artifacts in the defogged images. The 'dayway' images in the first and second rows are categorized as 'house', and the 'dayway' image in the third row is categorized as 'skyscraper'. They all learn feature information from the misclassified category, causing the defogged images to show colors and objects that do not belong to them. Of course, classification errors do not always produce distortion and artifacts: the 'skyscraper' image in the fourth row is categorized as 'house', and although there is some error in color recovery, the texture information is recovered well.

4.7 Ablation study

Our ablation experiments with the proposed ClassifyCycle model are carried out to analyze the feasibility of different modules step by step, as described below.

The CycleGAN network serves as the baseline defogging algorithm. Each module is then tested: (1) base + ICLFn: the ICLFn module is added to the baseline; (2) base + Skip: skip connections are added to the baseline generator; (3) ours: the full ClassifyCycle model. The corresponding results are shown in Table 4 and Fig. 12.

Fig. 12. The images in (A)-(D) are the defogged results of input images by different network configurations.

Table 4. Ablation study on ClassifyCycle. Bolded text indicates the best performance

These ablation experiments show that both components contribute to defogging. Adding the skip connection (solution B) passes information directly from earlier layers to later layers without traversing a series of intermediate operations, which reduces the computation of the entire network, improves its performance and consequently reduces inference time. The ICLFn module (solution C) shares the feature extractor portion of the ISMn module, so useful representations of the input images are learned during training; at test time these representations can be reused for the classification task without recomputing the feature extractor, again reducing inference time.

By observing the results in Fig. 12, both for the high-illumination images (the day images of Fig. 12) and the low-illumination images (the night images of Fig. 12), the resultant maps generated by the ClassifyCycle network are closest to the real fog-free images in terms of color and detail restoration. For example, the lighting of the low-illumination images (second row), the color restoration of the buildings and trees (third row), and the color of the roads and the surrounding flowers and plants (first row) all demonstrate that the results of the ClassifyCycle network are the most realistic.

Figure 12 shows the defogging effect on three categories of foggy images in our Aerial-Fog dataset. The blue box in each image is an enlarged view of the red box. In the results with high visibility, the ClassifyCycle network defogs clearly. In the results with low visibility, some vehicles and road signs are recovered well in image D, but the trees in image D are restored poorly, which may be related to factors such as image tone and the limited number of samples in this data category. In the agglomerate fog results, the color of the trees shows that the ClassifyCycle network generates the most natural images while effectively defogging. In summary, our method achieves a good defogging effect for all three categories while effectively recovering the image information.

5. Conclusion

A classification guided thick fog removal network based on style migration, named ClassifyCycle, is proposed to address the difficulty of defogging in foggy weather. The algorithm focuses on the defogging problems of distortion, color deviation and fog residue. The ICLFn module enhances the reliability of the subsequent learning network, whereas the ISMn module effectively preserves texture information by adding skip connections while defogging. Extensive experimental results show that the proposed ClassifyCycle network surpasses state-of-the-art algorithms, producing better visual effects on synthetic and real drone images in thick fog weather. The ClassifyCycle model has outstanding learning performance and strong generalization ability. In future work, we will generalize the proposed ClassifyCycle model to problems in other domains, including deraining and desnowing.

Funding

Henan Province Science and Technology Research and Development Program Young Scientist Program (22520080098); Henan Province Higher Education Key Research Project Plan Basic Research Project (23ZX013); Henan Provincial Science and Technology Research Project (222102210015); National Natural Science Foundation of China (61605175).

Acknowledgment

The authors thank the National Natural Science Foundation of China for help identifying collaborators for this work.

Disclosures

The authors declare no potential conflicts of interest.

Data availability

Data underlying the results presented in this paper are available in Dataset 1 found in Ref. [43].

References

1. R. Fattal, “Single image dehazing,” ACM Trans. Graph. 27(3), 1–9 (2008). [CrossRef]  

2. A. Galdran, “Image dehazing by artificial multiple-exposure image fusion,” Sig. Process. 149, 135–147 (2018). [CrossRef]  

3. T. M. Bui and W. Kim, “Single image dehazing using color ellipsoid prior,” IEEE Trans. on Image Process. 27(2), 999–1009 (2018). [CrossRef]  

4. W. Y. Hsu and Y. S. Chen, “Single image dehazing using wavelet-based haze-lines and denoising,” IEEE Access. 9, 104547 (2021). [CrossRef]  

5. J. Zhang and D. Tao, “FAMED-Net: a fast and accurate multi-scale end-to-end dehazing network,” IEEE Trans. on Image Process. 29, 72–84 (2020). [CrossRef]  

6. G. Tang, L. Zhao, R. Jiang, et al., “Single image dehazing via lightweight multi-scale networks,” in 2019 IEEE International Conference on Big Data (IEEE) (2019), pp. 5062–5069.

7. X. Liu, Y. Ma, Z. Shi, et al., “GridDehazeNet: attention-based multi-scale network for image dehazing,” in 2019 IEEE/CVF International Conference on Computer Vision (ICCV) (2019), pp. 7313–7322.

8. N. A. Husain and M. S. M. Rahim, “The dynamic scattering coefficient on image dehazing method with different haze conditions,” Lect. Notes Inst. for Comput. Sci. Soc. Informatics Telecommun. Eng. 429, 223–241 (2022). [CrossRef]  

9. R. Li, J. Pan, Z. Li, et al., “Single image dehazing via conditional generative adversarial network,” in 2018 IEEE Conference on Computer Vision and Pattern Recognition (IEEE) (2018), pp. 8202–8211.

10. W. Ren, L. Ma, J. Zhang, et al., “Gated fusion network for single image dehazing,” in 2018 IEEE Conference on Computer Vision and Pattern Recognition (IEEE) (2018), pp. 3253–3261.

11. D. Yang and J. Sun, “Proximal dehaze-net: a prior learning-based deep network for single image dehazing,” in European Conference on Computer Vision (Springer) (2018), pp. 729–746.

12. S. Gautam, T. K. Gandhi, and B. K. Panigrahi, “An improved air-light estimation scheme for single haze images using color constancy prior,” IEEE Signal Process. Lett. 27, 1695–1699 (2020). [CrossRef]  

13. X. Zhang, T. Wang, G. Tang, et al., “Single image haze removal based on a simple additive model with haze smoothness prior,” IEEE Trans. Circuits Syst. Video Technol. 32(6), 3490–3499 (2022). [CrossRef]  

14. R. T. Tan, “Visibility in bad weather from a single image,” in 2008 IEEE Conference on Computer Vision and Pattern Recognition (IEEE) (2008), pp. 1.

15. K. He, J. Sun, and X. Tang, “Single image haze removal using dark channel prior,” IEEE Trans. Pattern Anal. Mach. Intell. 33(12), 2341–2353 (2011). [CrossRef]  

16. Q. Zhu, J. Mai, and L. Shao, “A fast single image haze removal algorithm using color attenuation prior,” IEEE Trans. on Image Process. 24(11), 3522–3533 (2015). [CrossRef]  

17. D. Berman and S. Avidan, “Non-local image dehazing,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (IEEE) (2016), pp. 1674–1682.

18. B. Cai, X. Xu, K. Jia, et al., “DehazeNet: an end-to-end system for single image haze removal,” IEEE Trans. on Image Process. 25(11), 5187–5198 (2016). [CrossRef]  

19. W. Ren, J. Pan, H. Zhang, et al., “Single image dehazing via multi-scale convolutional neural networks with holistic edges,” Int. J. Comput. Vision. 128(1), 240–259 (2020). [CrossRef]  

20. A. Wang, W. Wang, J. Liu, et al., “AIPNET: image-to-image single image dehazing with atmospheric illumination prior,” IEEE Trans. on Image Process. 28(1), 381–393 (2019). [CrossRef]  

21. I. Goodfellow, J. Pouget-Abadie, M. Mirza, et al., “Generative adversarial nets,” in 27th International Conference on Neural Information Processing Systems (NIPS) (2014), pp. 2672–2680.

22. P. Isola, J. -Y. Zhu, T. Zhou, et al., “Image-to-image translation with conditional adversarial networks,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (IEEE) (2017), pp. 5967–5976.

23. J. -Y. Zhu, T. Park, P. Isola, et al., “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in 2017 IEEE International Conference on Computer Vision (ICCV) (2017), pp. 2242–2251.

24. X. Yang, Z. Xu, and J. Luo, “Towards perceptual image dehazing by physics-based disentanglement and adversarial training,” in Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI) (2018), pp. 7485–7492.

25. A. Dudhane and S. Murala, “CDNet: single image de-hazing using unpaired adversarial training,” in 2019 IEEE Winter Conference on Applications of Computer Vision (WACV) (2019), pp. 1147–1155.

26. J. Zhao, J. Zhang, and Z. Li, “DD-CycleGAN: unpaired image dehazing via double-discriminator cycle-consistent generative adversarial network,” Eng. Appl. Artif. Intell. 82, 263–271 (2019). [CrossRef]  

27. H. Wu, Y. Qu, S. Lin, et al., “Contrastive learning for compact single image dehazing,” in 2021 IEEE Conference on Computer Vision and Pattern Recognition (IEEE) (2021), pp. 10546–10555.

28. Z. Chen, Y. Wang, Y. Yang, et al., “PSD: principled synthetic-to-real dehazing guided by physical priors,” in 2021 IEEE Conference on Computer Vision and Pattern Recognition (IEEE) (2021), pp. 7176–7185.

29. Y. Shao, L. Li, W. Ren, et al., “Domain adaptation for image dehazing,” in 2020 IEEE Conference on Computer Vision and Pattern Recognition (IEEE) (2020), pp. 2805–2814.

30. K. He, X. Zhang, S. Ren, et al., “Deep residual learning for image recognition,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (IEEE) (2016), pp. 770–778.

31. P. Ling, H. Chen, X. Tan, et al., “Single image dehazing using saturation line prior,” IEEE Trans. on Image Process. 32, 3238–3253 (2023). [CrossRef]  

32. S. D. Das and S. Dutta, “Fast deep multi-patch hierarchical network for nonhomogeneous image dehazing,” in 2020 IEEE Conference on Computer Vision and Pattern Recognition Workshops (IEEE) (2020), pp. 1994–2001.

33. Y. Liu, J. Pan, J. Ren, et al., “Learning Deep Priors for Image Dehazing,” in 2019 IEEE International Conference on Computer Vision (ICCV) (2019), pp. 2492–2500.

34. M. Song, R. Li, R. Guo, et al., “Single image dehazing algorithm based on optical diffraction deep neural networks,” Opt. Express 30(14), 24394 (2022). [CrossRef]  

35. X. Zhang, H. Dong, J. Pan, et al., “Learning to restore hazy video: a new real-world dataset and a new method,” in 2021 IEEE Conference on Computer Vision and Pattern Recognition (IEEE) (2021), pp. 9235–9244.

36. X. Song, D. Zhou, W. Li, et al., “TUSR-Net: triple unfolding single image dehazing with self-regularization and dual feature to pixel attention,” IEEE Trans. on Image Process. 32, 1231–1244 (2023). [CrossRef]  

37. P. Zhu, L. Wen, D. Du, et al., “Detection and tracking meet drones challenge,” IEEE Trans. Pattern Anal. Mach. Intell. 44(11), 7380–7399 (2022). [CrossRef]  

38. H. Zhang and V. M. Patel, “Densely connected pyramid dehazing network,” in 2018 IEEE Conference on Computer Vision and Pattern Recognition (IEEE) (2018), pp. 3194–3203.

39. Y. Song, Z. He, H. Qian, et al., “Vision transformers for single image dehazing,” IEEE Trans. on Image Process. 32, 1927–1941 (2023). [CrossRef]  

40. Y. Yang, C. Wang, R. Liu, et al., “Self-augmented unpaired image dehazing via density and depth decomposition,” in 2022 IEEE Conference on Computer Vision and Pattern Recognition (IEEE) (2022), pp. 2027–2036.

41. D. Chen, M. He, Q. Fan, et al., “Gated context aggregation network for image dehazing and deraining,” in 2019 IEEE Winter Conference on Applications of Computer Vision (WACV) (2019), pp. 1375–1383.

42. K. Mei, A. Jiang, and M. Wang, “Progressive feature fusion network for realistic image dehazing,” Lect. Notes Comput. Sci. 11361, 203–215 (2018). [CrossRef]  

43. Y. Liu, W. Qi, G. Huang, et al., “Aerial-Fog,” GitHub , (2023). https://github.com/qiwenting123/ClassifyCycle/

Supplementary Material (1)

Dataset 1: training dataset

