
Cross-domain colorization of unpaired infrared images through contrastive learning guided by color feature selection attention

Open Access

Abstract

In challenging lighting conditions, infrared detectors have become vital tools for enhancing visual perception, overcoming the limitations of visible cameras. However, inherent imaging principles and manufacturing constraints confine infrared imaging systems to grayscale, significantly limiting their utility. Compared with visible imagery, infrared images lack detailed semantic information and color representation and suffer from reduced contrast. While existing infrared image colorization techniques have made significant progress in improving color quality, challenges such as erroneous semantic color prediction and blurred depiction of fine details persist. Acquiring paired color images corresponding to real-world infrared scenarios is also substantially difficult, which further complicates cross-domain colorization of infrared images. To address these critical issues, this paper introduces an approach that uses contrastive learning for unsupervised cross-domain mapping between unpaired infrared and visible color images. Additionally, we introduce a color feature selection attention module that guides plausible colorization of infrared images. The proposed method employs the Residual Fusion Attention Network (RFANet) as the generator, enhancing the encoder’s ability to represent color and structural features. Furthermore, to ensure structural content consistency and improve overall color style matching accuracy, we design a joint global loss function integrating both detailed content and color style. Experimental evaluations on publicly available datasets demonstrate the superior performance of the proposed unsupervised cross-domain colorization method for infrared images compared to previous approaches.

© 2024 Optica Publishing Group under the terms of the Optica Open Access Publishing Agreement

1. Introduction

In adverse conditions such as low-light nighttime scenarios and dense fog, human visual perception of complex scenes is compromised [1]. Infrared detectors offer a solution by capturing the infrared radiation emitted by objects with temperatures above absolute zero, enabling imaging even in complete darkness. This capability addresses the limitations of visible light devices that are highly sensitive to environmental conditions and illumination. The advent of infrared detectors has greatly enhanced human capabilities in detecting and perceiving external targets and their surrounding environments. However, the grayscale images produced by infrared imaging systems lack color diversity and exhibit low contrast, which significantly hampers subsequent tasks, including infrared target detection. Improving the quality of infrared image colorization can enhance visual comfort, accelerate infrared target detection, and reduce the error rate in infrared target identification. Consequently, it promotes the widespread adoption of infrared imaging technology in domains such as military security, power monitoring, and medical diagnostics. Therefore, it is of great significance to effectively harness the deep semantic color information between infrared images and visible color images, facilitate cross-domain mapping of semantic color and structural content details, and achieve rational and natural colorization of infrared images for the advancement of infrared imaging technology.

Infrared image colorization is a challenging and ill-posed problem: there is no unique solution for predicting the true semantic colors of objects in a scene from infrared information alone. This difficulty adds to the complexity of achieving cross-domain colorization for infrared images. Moreover, grayscale infrared images, which represent temperature variations through intensity values, not only lack color information compared to their visible counterparts but also lack fine structural detail. These limitations pose significant challenges for the task of cross-domain colorization in the infrared domain.

Traditional approaches to infrared image colorization often involve multispectral fusion or reference-based techniques. Multispectral fusion methods leverage information from multiple spectral bands, such as microwaves, and apply predefined color mapping strategies to generate colorized infrared images. Reference-based methods, on the other hand, rely on reference images to transfer color information to regions with similar semantic features in the infrared domain. However, these conventional methods rely heavily on prior knowledge, such as predefined color mapping strategies, and are further constrained by factors such as system complexity, cost, and the availability of reference images with the desired color and semantic features. Consequently, they are unable to adaptively assign colors that align with semantic information to different regions within the infrared images.

Recently, the advancement of deep learning in the field of computer vision has introduced new possibilities for fully automatic infrared image colorization without relying on prior information. Existing deep learning-based methods for infrared image colorization predominantly employ supervised generative adversarial network architectures. These data-driven approaches utilize large-scale paired datasets comprising infrared and visible images for training. The objective is to learn the mapping relationship between grayscale values in infrared images and color information in visible images by minimizing the discrepancy between generated images and corresponding labeled images. However, the acquisition of accurately registered infrared-visible image pairs poses significant challenges, influenced by factors such as camera imaging principles, spatial positioning, and camera optical axes.

To address the aforementioned challenges, we propose a Color Feature Selection Attention-guided Unpaired Infrared Cross-domain Colorization Generative Adversarial Network (CFSA-ICGAN). CFSA-ICGAN is built upon a generative adversarial network architecture and leverages the adversarial loss to penalize the discrepancy between the source domain infrared images and the target domain visible images, enabling fully automatic infrared image colorization without relying on prior information. To avoid feature loss or detail blurring during the propagation of information from low-level to high-level layers, we introduce a Residual Fusion Attention Network as the generator. This network consists of multiple residual fusion attention modules, which synergistically integrate residual connections, channel attention, and spatial attention from the residual network. By adaptively learning the correlations between different channels and spatial positions, the residual fusion attention network enhances the model’s feature extraction and semantic perception capabilities, while preserving important spatial structures in key target regions.

The proposed infrared cross-domain colorization method in this study maximizes the mutual information between unpaired infrared-color image sample data in a deep feature space through contrastive learning. The color feature selection attention module selectively computes the contrastive loss to guide the learning of similar target object colors, maximizing the consistency of the feature information between input infrared image patches and output colorized image patches. This achieves unsupervised infrared cross-domain colorization, eliminating the need for large-scale collection of paired infrared-visible data. Furthermore, we introduce a joint global composite loss function that combines detail content and color style. This loss function ensures the consistency of color styles between the output colorized images and the target domain visible images while restoring the lost texture details in the source domain infrared images, thereby improving the overall effectiveness of infrared image colorization.

In summary, this paper makes the following contributions:

  • 1) We propose a novel non-paired infrared cross-domain colorization generative adversarial network, guided by color feature selection attention. Our approach combines contrastive learning and generative adversarial networks to enable the unsupervised generation of high-quality colorized infrared images. Unlike existing methods that rely on paired colorization data, our approach is well-suited for scenarios where target domain color images are unavailable as label information.
  • 2) We introduce a meticulously designed generator called the residual fusion attention network, which incorporates residual connections, channel attention modules, and spatial attention modules. This sophisticated architecture empowers the generator to selectively focus on crucial feature regions, capturing contextual information and fine details within infrared images while effectively propagating feature information. The integration of residual connections and attention mechanisms significantly enhances the network’s ability to preserve context and capture intricate details.
  • 3) To address the challenge of semantic color prediction, we develop a color selection attention module that computes the contrastive loss based on the color attention matrix. This module guides the selection of essential region features containing the same infrared targets, thereby enabling accurate color learning without modifying structural detail features. By automatically selecting the most relevant input features, the color selection attention module boosts the accuracy and generalization capability of the model, while mitigating semantic encoding errors during the colorization process.

Through comprehensive experiments and rigorous evaluations, we demonstrate the effectiveness of our proposed method in generating high-quality colorized infrared images in a non-paired setting.

2. Related work

2.1 Deep learning-based methods for automatic cross-domain infrared image colorization

In recent years, significant progress has been made in the field of deep learning-based methods for automatic cross-domain infrared image colorization. These methods have shown remarkable robustness and generalization compared to traditional approaches. Limmer et al. [2] pioneered the use of convolutional neural networks for cross-domain translation of near-infrared images. Berg et al. [3] proposed a convolutional neural network-based approach for the automatic colorization of thermal infrared images, producing visible color images with realistic brightness and chromaticity.

Existing deep learning-based single-frame infrared image colorization methods can be broadly classified into two categories: fully supervised and unsupervised approaches. Fully supervised methods require paired infrared-color image data. For instance, Kuang et al. [4] and Bhat et al. [5] employed the paired KAIST-MS [6] dataset to train conditional generative networks, leveraging content loss [7] and adversarial loss to map infrared images to visible color images. However, the availability of paired infrared colorization datasets is severely limited due to various factors, such as scene and camera constraints.

Unsupervised infrared image colorization methods learn the similarities and differences between infrared and color image data without paired labels, aligning unpaired infrared-visible image data and thereby mitigating the adverse effects of the lack of paired data. CycleGAN [8–11] is widely used to solve the unpaired image-to-image translation problem. Nyberg et al. [12] adopted the idea of CycleGAN to train on unpaired infrared images using a cycle-consistent generative adversarial network. However, this method cannot directly control the texture and shape of the output colorized infrared images, which leads to distortion of targets such as vehicle bodies and crosswalks, as well as semantic color distortion.

2.2 Image-to-image translation

Image-to-image translation tasks [13–16] involve transforming images from a source domain to a target domain, with the objective of learning the mapping relationship between input and output images. In this paper, we focus on the translation task from infrared images to visible color images. Image-to-image translation encompasses various application scenarios, such as image style transfer (converting the style of one image to another) [17], image semantic segmentation (segmenting input images into different regions based on semantic information) [18], and image super-resolution reconstruction (converting low-resolution images to high-resolution images) [19], among others. In recent years, several methods have been employed for image-to-image translation tasks, including variational autoencoders, generative adversarial networks [20,21], Transformers [22–24], and diffusion models [25]. For the specific task of converting infrared images to color images, methods based on generative adversarial network architectures are commonly used. For example, Pix2pix [26] is a classic supervised image-to-image translation model that employs a CNN-based GAN architecture and requires paired images from the source and target domains for training. UNIT [27] is an unsupervised image-to-image translation model developed under the VAE-GAN framework, which learns deep shared features between the source and target domains to accomplish the translation task.

2.3 Contrastive learning

Unsupervised image-to-image translation refers to the task of translating images from a source domain to a target domain without paired training data. CUT [28] introduced the concept of contrastive learning for unsupervised image-to-image translation, aiming to maximize the shared feature information between the source and target domains. Contrastive learning involves categorizing samples into positive and negative pairs, where patches at corresponding positions in the dataset serve as positive pairs, while other elements act as negative pairs. By learning a mapping function in the feature space, CUT encourages positive pairs to be closer together in the representation space. CUT adopts a patch-based approach, dividing the input source domain image into small patches and randomly selecting positive and negative pairs from these patches. By computing the contrastive loss function to measure the distance between positive and negative pairs, CUT generates high-quality image representations.

While CUT has shown great potential in unpaired image translation tasks, its random selection of positive and negative pairs limits its ability to effectively narrow the distance between positive samples and query samples. This limitation hinders its capacity to extract both color and structural features accurately, particularly in the infrared image colorization task. To address this, our study introduces a color selection attention module that guides the selection of the most relevant samples for learning: it determines the samples most pertinent to color learning, facilitating the extraction of accurate, context-aware color information while preserving the consistency of content and structural features. By leveraging the color selection attention module, our method achieves an effective balance between color fidelity and structural integrity in the generated images.

3. Proposed method

Figure 1 illustrates the overall architecture of CFSA-ICGAN proposed in this paper. It comprises three main components: the Colorized Infrared Images Generation module, the Color Selection Attention module, and the Global Feature Matching module. The Colorized Infrared Images Generation module consists of a generator and a discriminator. The generator synthesizes colorized images from input infrared images and aims to deceive the discriminator, while the discriminator distinguishes between real and generated images, attempting to minimize the output probability of colorized infrared images.

Fig. 1. The overall architecture of CFSA-ICGAN.

The Color Selection Attention module extracts features through an encoder and computes a color attention matrix. Unlike CUT, we no longer randomly select anchor points and positive/negative samples. Instead, based on the entropy of the color feature attention matrix, we select blocks of interest that require focused attention for computing the contrastive loss. By incorporating the Color Feature Attention module, we ensure semantic consistency of color features in the infrared image colorization task and avoid unnecessary content structural deformations during the colorization process.

The Global Feature Matching module consists of two parts: global content feature matching and global style feature consistency. It aims to bridge the gap between the global features of infrared images and color images to the maximum extent possible.

In summary, CFSA-ICGAN’s architecture integrates the Colorization of Infrared Images Generation module, Color Selection Attention module, and Global Feature Matching module. The colorization module generates colorized images by employing a generator and discriminator, while the attention module focuses on important regions guided by the color feature attention matrix. The global feature matching module facilitates alignment between global content and style features of infrared and color images. The proposed CFSA-ICGAN framework ensures semantic consistency of color features in infrared image colorization tasks and mitigates unnecessary content structural deformations.

3.1 Residual fusion attention generator

The UNet architecture, which exhibits a U-shaped structure, is commonly employed as a generator for image generation tasks. These generators primarily rely on a symmetric encoder-decoder structure to propagate feature information. However, the UNet generator’s ability to capture high-level semantic information and achieve global coloring is limited when it comes to complex scenarios in infrared image colorization tasks. Furthermore, the UNet generator has a high demand for training data, often requiring large paired datasets for effective training. In contrast, the unsupervised learning approach adopted in this study requires only a small amount of training data. Consequently, the UNet generator struggles to fully learn accurate colorization mappings in scenarios where there is a lack of sufficient data information.

Based on these limitations, we propose a Residual Fusion Attention Network (RFAN) as a generator for the colorization task of infrared images, the structure of which is shown in Fig. 2. The generator employs ResNet as the backbone network for the encoder-decoder structure, where the encoder part consists of stacked residual fusion attention modules. The residual connection structure of the Residual Fusion Attention Network allows for the handling of deeper networks and the capture of more complex semantic information. This enables better preservation of detailed information in infrared images, resulting in the generation of clearer and more natural colorized infrared images. The ResNet-based generator, relative to the UNet generator, may exhibit stronger generalization capabilities with limited training data. In situations where there is a lack of sufficient annotated data for the infrared image colorization task, the proposed Residual Fusion Attention generator is more likely to learn accurate color mappings that align with semantic information.

Fig. 2. Structure of the residual fusion attention generator.

The residual fusion attention generator consists of 9 residual fusion attention modules, and the structure of the residual fusion attention modules is shown in Fig. 3. Channel attention and spatial attention have been incorporated into the residual fusion attention module. Channel attention, spatial attention, and the combined CBAM are frequently employed in computer vision tasks. The channel attention module mainly focuses on the correlation between different channels of the input feature map, making the network more focused on important channel information in the colorization task, thereby better capturing abstract features in the image. In contrast, the spatial attention module mainly focuses on the importance of different spatial locations in the input feature map, helping to extract information such as local details and texture in the image. The CBAM module combines channel attention with spatial attention, enabling it to simultaneously focus on channel information and spatial locations within the input feature map. To comprehensively extract image features in the infrared image colorization task, we constructed a residual fusion attention module based on the CBAM module. This enhancement enables the colorized infrared image generator network to extract color and structural features more effectively.
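As a concrete illustration of this design, the sketch below shows one way a CBAM-style residual fusion attention block could be written in PyTorch. It is a minimal sketch under our own assumptions: the class names, channel widths, and the choice of instance normalization are hypothetical and are not taken from the authors' released code.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Weights feature channels using globally pooled descriptors (CBAM-style)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))   # global average pooling branch
        mx = self.mlp(x.amax(dim=(2, 3)))    # global max pooling branch
        return torch.sigmoid(avg + mx).view(b, c, 1, 1) * x

class SpatialAttention(nn.Module):
    """Weights spatial positions using channel-pooled maps (CBAM-style)."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        pooled = torch.cat([x.mean(dim=1, keepdim=True),
                            x.amax(dim=1, keepdim=True)], dim=1)
        return torch.sigmoid(self.conv(pooled)) * x

class ResidualFusionAttentionBlock(nn.Module):
    """Residual block that applies channel then spatial attention before the skip addition."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.InstanceNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.InstanceNorm2d(channels),
        )
        self.ca = ChannelAttention(channels)
        self.sa = SpatialAttention()

    def forward(self, x):
        out = self.sa(self.ca(self.body(x)))
        return x + out  # residual connection preserves structural detail
```

Stacking nine such blocks between a downsampling encoder and an upsampling decoder would yield a ResNet-style generator of the kind shown in Fig. 2.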

Fig. 3. Residual fusion attention module.

3.2 Discriminator

Figure 4 shows the structure of the discriminator integrated into CFSA-ICGAN. While a conventional global discriminator produces a single evaluation value (real or fake) for the entire generated image, the PatchGAN discriminator employed in this paper subdivides the input image into multiple patches and assesses each individually. As a result, the PatchGAN discriminator captures the local structural information of the generated image and produces the final real/fake decision by averaging or aggregating the per-patch results. This enables PatchGAN to more effectively guide the generator in learning the details and textures of real images in image generation tasks.
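For reference, a minimal PatchGAN-style discriminator could be assembled in PyTorch as below; the receptive-field configuration (base width 64, three strided layers) is the common default from the pix2pix/CycleGAN literature and is assumed here rather than taken from the paper.

```python
import torch.nn as nn

def build_patchgan_discriminator(in_channels=3, base=64, n_layers=3):
    """Markovian (PatchGAN) discriminator: outputs a grid of real/fake scores,
    one per receptive-field patch, instead of a single scalar per image."""
    layers = [nn.Conv2d(in_channels, base, 4, stride=2, padding=1),
              nn.LeakyReLU(0.2, inplace=True)]
    ch = base
    for _ in range(1, n_layers):
        layers += [nn.Conv2d(ch, ch * 2, 4, stride=2, padding=1),
                   nn.InstanceNorm2d(ch * 2),
                   nn.LeakyReLU(0.2, inplace=True)]
        ch *= 2
    layers += [nn.Conv2d(ch, ch * 2, 4, stride=1, padding=1),
               nn.InstanceNorm2d(ch * 2),
               nn.LeakyReLU(0.2, inplace=True),
               nn.Conv2d(ch * 2, 1, 4, stride=1, padding=1)]  # per-patch score map
    return nn.Sequential(*layers)
```

Averaging the resulting score map during loss computation gives the per-image real/fake decision described above.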

Fig. 4. Structure of the PatchGAN discriminator.

3.3 Color selection attention module (CSAM)

Attention mechanisms allocate different weights to different vectors based on their similarity, allowing them to extract valuable information relevant to the current task while suppressing irrelevant information. Therefore, in the cross-domain colorization task of infrared images, this study introduces attention mechanisms to assign semantic colors to important objects without altering their underlying structures. We utilize three vectors, namely Value ($V$), Query ($Q$), and Key ($K$), to accomplish the cross-domain colorization task for infrared images. $V_I$ and $V_C$ store the color information of the infrared image and colorized infrared image, respectively. $Q$ stores the feature information of the infrared image, while $K$ stores essential information such as edges, textures, and color distributions of the input infrared image.

The framework of the color attention selection module proposed in this paper is shown in Fig. 5. First, the infrared image and colorized infrared image are fed into the encoder to obtain feature matrices $F_I$ and $F_C$, respectively. The feature matrix $F$ is multiplied by trainable parameter matrices $W_V$, $W_Q$, and $W_K$ to obtain $V$, $Q$, and $K$, respectively. The similarity between $Q$ and $K$ is computed by calculating $QK^T$. After applying softmax normalization, the attention matrix $A_I$ of the infrared image in the source domain is obtained. Each weight value in the attention matrix $A_I$ indicates the importance of different patches in the entire infrared image. In the color selection region, each patch of the input image is filtered according to the magnitude of each weight value in matrix $A_I$. The weight distributions obtained from the color selection region are used to weigh $V_I$ and $V_C$, and the contrastive loss between them is calculated. The inclusion of the color selection attention module ensures the consistency between the source and target domains in the task of infrared image cross-domain colorization, making the computation of the contrastive loss more efficient.

$$\mathrm{ColorAttention}=\mathrm{Softmax}\left(\frac{QK^T}{\sqrt{d_K}}\right)V$$
In the formula, $d_K$ represents the dimension of $K$.
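One plausible reading of this computation, including the entropy-based selection of patches of interest described earlier in this section, is sketched below in PyTorch. The tensor shapes, the `top_k` budget, and the choice to keep the most sharply focused (lowest-entropy) rows of the attention matrix are all our assumptions.

```python
import torch
import torch.nn.functional as F

def color_attention_select(feat, w_q, w_k, w_v, top_k=256):
    """feat: encoder features flattened to (B, N, C), one row per image patch.
    w_q, w_k, w_v: trainable projection matrices of shape (C, d).
    Returns the attended values and the indices of patches selected for the
    contrastive loss, ranked by the entropy of their attention distributions."""
    q, k, v = feat @ w_q, feat @ w_k, feat @ w_v                          # (B, N, d)
    attn = F.softmax(q @ k.transpose(1, 2) / k.shape[-1] ** 0.5, dim=-1)  # (B, N, N)
    # Row-wise entropy of the attention matrix: a low value means the patch
    # attends to a few specific locations, i.e. it carries focused target content.
    entropy = -(attn * (attn + 1e-8).log()).sum(dim=-1)                   # (B, N)
    idx = entropy.topk(top_k, dim=-1, largest=False).indices              # patches of interest
    return attn @ v, idx
```

The selected indices can then be used to gather the corresponding entries of $V_I$ and $V_C$ when computing the contrastive loss in Section 3.4.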

Fig. 5. Color selection attention module.

3.4 Loss function

In this study, the loss function of CFSA-ICGAN consists of three components: the generative adversarial loss $L_{GAN}$, the contrastive loss $L_{Contrastive}$, and the global feature loss $L_{Globalfeature}$. In the task of cross-domain colorization of infrared images, we aim to restore both the chromatic and luminance information of the infrared images. We employ a method based on the generative adversarial network (GAN) architecture to achieve the transformation from the source domain infrared images $X$ to the target domain visible color images $Y$ through the interplay between the generator $G$ and the discriminator $D$. In the module for generating colorized infrared images, for an infrared image $I_x\in \mathbb {R}^{H\times W\times 3}$ from the source domain $X$, we utilize the adversarial loss to encourage the generator to output colorized infrared images $G(I_x)$ that resemble the real visible color images $I_y\in \mathbb {R}^{H\times W\times 3}$ from the target domain $Y$. The adversarial loss function $L_{GAN}$ can be expressed as follows:

$$L_{GAN}=\mathbb{E}_{I_y\sim Y}\log D(I_y)+\mathbb{E}_{I_x\sim X}\log(1-D(G(I_x)))$$
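A minimal sketch of this objective in PyTorch is given below; it assumes a PatchGAN discriminator that returns a map of logits and uses the binary cross-entropy formulation, which mirrors Eq. (2) but is not necessarily the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def adversarial_losses(D, G, ir, vis):
    """ir: batch of source-domain infrared images, vis: batch of target-domain visible images."""
    fake = G(ir)
    real_logits = D(vis)
    fake_logits = D(fake.detach())
    # Discriminator: push real visible images toward 1 and generated colorizations toward 0.
    d_loss = (F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits))
              + F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits)))
    # Generator: try to make the discriminator label its colorized outputs as real.
    gen_logits = D(fake)
    g_loss = F.binary_cross_entropy_with_logits(gen_logits, torch.ones_like(gen_logits))
    return d_loss, g_loss, fake
```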

Due to the existence of certain inherent feature information shared between infrared images and visible color images, exploring and learning deep-level feature information between the two domains can effectively constrain the restoration of color and detail information in infrared images. To achieve unsupervised cross-domain colorization of infrared images without paired data, we employ an encoder $E$ in the color feature attention module to extract the feature information from the infrared images $I_x$ and the colorized infrared images $G(I_x)$. Guided by the color feature attention, we select an anchor $q$, positive samples $k^{+}$, and $N-1$ negative samples $k^{-}$ that contain important target information. By calculating the contrastive loss $L_{Contrastive}$ between the infrared image domain and the visible domain, we compare the similarity features in the cross-domain colorization task. The contrastive loss provides a constraint for the generator $G$ of colorized infrared images, establishing self-supervision between the infrared images $I_x$ and the colorized infrared images $G(I_x)$, eliminating the need for paired visible images as label information. The representation of the contrastive loss $L_{Contrastive}$ is as follows:

$$L_{Contrastive}={-}\log[\frac{\exp(q\cdot k^+{/}\tau)}{\exp(q\cdot k^+{/}\tau)+\sum_{i=1}^{N-1}\exp(q\cdot k^-{/}\tau)}]$$
where $\tau$ is a fixed parameter with a value of 0.07.
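In practice, this InfoNCE-style loss can be computed as a cross-entropy over similarity logits, as in the sketch below. The value $\tau = 0.07$ follows the text; the batch-wise tensor shapes and the $\ell_2$ normalization of the features (standard in CUT) are assumptions on our part.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(q, k_pos, k_neg, tau=0.07):
    """q: (B, d) anchor features, k_pos: (B, d) positives, k_neg: (B, N-1, d) negatives."""
    q = F.normalize(q, dim=-1)
    k_pos = F.normalize(k_pos, dim=-1)
    k_neg = F.normalize(k_neg, dim=-1)
    l_pos = (q * k_pos).sum(dim=-1, keepdim=True)            # (B, 1) positive similarities
    l_neg = torch.bmm(k_neg, q.unsqueeze(-1)).squeeze(-1)    # (B, N-1) negative similarities
    logits = torch.cat([l_pos, l_neg], dim=1) / tau
    # Cross-entropy with the positive at index 0 reproduces the -log softmax form of Eq. (3).
    labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)
    return F.cross_entropy(logits, labels)
```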

To compensate for the limited ability of the adversarial loss to recover global features during the colorization process, this paper introduces a composite global feature loss $L_{Globalfeature}$:

$$L_{content}=\sum_k\frac1{C_kH_kW_k}\sum_{i=1}^{H_k}\sum_{j=1}^{W_k}\left\|\phi_k\left(I_x\right)_{i,j}-\phi_k\left(G\left(I_x\right)\right)_{i,j}\right\|_1$$
$$L_{style}=\sum_k\frac1{C_kH_kW_k}\sum_{i=1}^{H_k}\sum_{j=1}^{W_k}\left\|\phi_k\left(I_y\right)_{i,j}-\phi_k\left(G(I_x)\right)_{i,j}\right\|_1$$
$$L_{Globalfeature}=L_{content}+L_{style}$$
where $\phi _k\left (\bullet \right )$ represents the feature map of the $k$th selected layer of the VGG-16 network, and $C_k$, $H_k$, and $W_k$ denote the channel, height, and width of that feature map.
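The sketch below shows how Eqs. (4)–(6) could be evaluated with torchvision's pretrained VGG-16; the layer indices correspond to the relu1_2, relu2_2, relu3_3, and relu4_3 activations used in Section 4.1.2, and the code assumes torchvision ≥ 0.13 for the weights API.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models

class GlobalFeatureLoss(nn.Module):
    """L1 distance between VGG-16 feature maps: content term against the infrared input,
    style term against a target-domain visible image, summed as in Eq. (6)."""
    def __init__(self):
        super().__init__()
        vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).features.eval()
        for p in vgg.parameters():
            p.requires_grad_(False)
        self.vgg = vgg
        self.layers = {3, 8, 15, 22}  # relu1_2, relu2_2, relu3_3, relu4_3

    def _features(self, x):
        feats = []
        for i, layer in enumerate(self.vgg):
            x = layer(x)
            if i in self.layers:
                feats.append(x)
        return feats

    def forward(self, fake, ir, vis):
        f_fake, f_ir, f_vis = self._features(fake), self._features(ir), self._features(vis)
        # l1_loss averages over C_k, H_k, W_k (and the batch), matching the per-layer normalization.
        content = sum(F.l1_loss(a, b) for a, b in zip(f_fake, f_ir))
        style = sum(F.l1_loss(a, b) for a, b in zip(f_fake, f_vis))
        return content + style
```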

The final objective function is as follows:

$$L=L_{GAN}+\lambda_1L_{Contrastive}+\lambda_2L_{Globalfeature}$$
where $\lambda _1$ and $\lambda _2$ represent the weights of the contrastive loss and the global feature loss, respectively.

4. Experiments

4.1 Experimental settings

4.1.1 Datasets

The proposed network model in this paper is trained and evaluated on multiple datasets. In previous works on infrared image colorization, most studies utilized the KAIST multispectral pedestrian detection dataset [29]. The KAIST dataset consists of approximately 95,000 calibrated infrared-visible color image pairs captured in typical traffic scenarios during both daytime and nighttime. Considering the lower quality of images captured at night in this dataset, only the daytime infrared-visible color image pairs are used for training and evaluation in this paper, aiming to generate colorized infrared images that better align with human visual perception.

For comparative evaluation with other state-of-the-art unsupervised image translation models, we selected 1,000 image pairs from the KAIST dataset as the training set. Furthermore, to demonstrate the effectiveness of our model in improving the colorization of unpaired infrared images across multiple datasets, we also conducted training and evaluation on the FLIR dataset [30]. The FLIR dataset consists of uncalibrated infrared-visible color image pairs captured with the thermal and visible cameras mounted at different positions on vehicles. Therefore, in this paper, the FLIR dataset is considered as an auxiliary dataset. In our experiments, we not only used the long-wave infrared dataset but also incorporated the Near Infrared NIRScene dataset [31] to ensure that our model can restore color information and preserve the structural details of infrared images captured in different spectral bands. The NIRScene dataset contains 477 near-infrared-visible color image pairs, and a feature-based alignment algorithm is applied to correct the positional discrepancy between the near-infrared and visible color images during data collection.

4.1.2 Implementation details

The CFSA-ICGAN model utilizes a PatchGAN discriminator, which is trained jointly with the proposed residual fusion attention generator. During training, the images of the datasets mentioned in Section 4.1.1 are resized to 256×256. We use the pre-trained VGG16 network and take the output features of the relu1_2, relu2_2, relu3_3, and relu4_3 layers to calculate the global feature loss. In the generator loss, we set $\lambda _1$ and $\lambda _2$ to 1 and 0.5, respectively. The model is trained on a single NVIDIA 2080Ti GPU with a batch size of 1. The Adam optimizer is employed with $\beta _1=0.5$ and $\beta _2=0.999$. For the KAIST, FLIR, and NIRScene datasets, the initial learning rate is 0.0001. After 100 epochs, the learning rate decays linearly to zero with the number of iterations. We train the model for a total of 200 epochs.
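These settings map directly onto standard PyTorch components, e.g. as in the sketch below; the two network objects are placeholders standing in for the residual fusion attention generator and the PatchGAN discriminator.

```python
import torch
from torch.optim.lr_scheduler import LambdaLR

generator = torch.nn.Conv2d(3, 3, 3, padding=1)       # placeholder for the RFAN generator
discriminator = torch.nn.Conv2d(3, 1, 4, stride=2)    # placeholder for the PatchGAN discriminator

# Adam with beta1 = 0.5, beta2 = 0.999 and an initial learning rate of 1e-4.
opt_G = torch.optim.Adam(generator.parameters(), lr=1e-4, betas=(0.5, 0.999))
opt_D = torch.optim.Adam(discriminator.parameters(), lr=1e-4, betas=(0.5, 0.999))

def linear_decay(epoch, constant_epochs=100, total_epochs=200):
    """Hold the learning rate for 100 epochs, then decay it linearly to zero by epoch 200."""
    if epoch < constant_epochs:
        return 1.0
    return max(0.0, (total_epochs - epoch) / (total_epochs - constant_epochs))

sched_G = LambdaLR(opt_G, lr_lambda=linear_decay)
sched_D = LambdaLR(opt_D, lr_lambda=linear_decay)
# Call sched_G.step() and sched_D.step() once per epoch after the training loop body.
```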

4.1.3 Quantitative evaluation indicators

The Mean Squared Error (MSE) measures the average of the squared pixel-value differences between the original image and the generated image; a smaller MSE indicates a smaller difference between the two images. The MSE is calculated using the following formula:

$$\mathrm{MSE}=\frac{1}{MN}\sum_{i=1}^{M}\sum_{j=1}^{N}\left(I(i,j)-\hat{I}(i,j)\right)^{2}$$
where $M$ and $N$ denote the number of rows and columns of the image, respectively, $I(i,j)$ denotes the pixel value at coordinates $(i,j)$ in the original image, and $\hat {I}(i,j)$ denotes the pixel value at coordinates $(i,j)$ in the generated image.

In tasks such as image translation and image reconstruction, the Peak Signal-to-Noise Ratio (PSNR) is usually used to measure the difference between the target domain image and the reconstructed image; in general, the higher the PSNR, the closer the image quality is to that of the target domain image. The PSNR is computed from the MSE as follows:

$$\mathrm{PSNR}=10\log_{10}\frac{L^2}{\mathrm{MSE}}$$
where $L$ denotes the maximum possible dynamic range of the pixel values in the image and PSNR is expressed in decibels (dB).

The structural similarity index (SSIM) is a commonly used metric for evaluating image quality in the infrared domain; it accounts not only for differences in luminance but also for contrast and structure. For infrared images, SSIM is usually more appropriate than traditional metrics such as MSE because infrared images involve special attributes such as heat distribution and temperature differences. SSIM ranges from −1 to 1, and the closer it is to 1, the more similar the two images and the better the quality of the generated image. The specific formula of SSIM is as follows:

$$\mathrm{SSIM}(x,y)=\frac{(2\mu_x\mu_y+C_1)(2\sigma_{xy}+C_2)}{(\mu_x^2+\mu_y^2+C_1)(\sigma_x^2+\sigma_y^2+C_2)}$$
where $\mu _{x}$ and $\mu _{y}$ are the luminance means of the generated image and the original image, respectively, $\sigma _{x}^{2}$ and $\sigma _{y}^{2}$ are their variances, $\sigma _{xy}$ is their covariance, and $C_{1}$ and $C_{2}$ are constants used to stabilize the computation.

In order to comprehensively evaluate the brightness, contrast, structure and human eye perception comfort of colorized infrared images, this paper adopts the PSNR, SSIM and MSE as the quantitative evaluation indexes to measure the quality of colorized infrared images. The quantitative experimental results in this paper are the average scores of all images in the test set.
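For completeness, these scores can be computed per image with scikit-image and then averaged over the test set, as in the sketch below; it assumes images are float arrays scaled to [0, 1] (so data_range = 1) and scikit-image ≥ 0.19 for the channel_axis argument.

```python
import numpy as np
from skimage.metrics import mean_squared_error, peak_signal_noise_ratio, structural_similarity

def evaluate_test_set(pairs):
    """pairs: iterable of (reference, generated) H x W x 3 float arrays scaled to [0, 1].
    Returns the average MSE, PSNR (in dB), and SSIM over the test set."""
    mse_scores, psnr_scores, ssim_scores = [], [], []
    for ref, gen in pairs:
        mse_scores.append(mean_squared_error(ref, gen))
        psnr_scores.append(peak_signal_noise_ratio(ref, gen, data_range=1.0))
        ssim_scores.append(structural_similarity(ref, gen, data_range=1.0, channel_axis=-1))
    return float(np.mean(mse_scores)), float(np.mean(psnr_scores)), float(np.mean(ssim_scores))
```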

4.2 Results

In this subsection, we present a comparative analysis between the outcomes achieved by our proposed method and those of leading unsupervised techniques, including CycleGAN, CUT, and FastCUT. For the reference experiments, we meticulously trained and tested the methods using publicly available code.

CycleGAN is a classic unpaired image translation network. Like the method proposed in this paper, it is based on the generative adversarial framework and can achieve unsupervised conversion of infrared images to color images. Unlike our method, CycleGAN has two generators, $G$ and $F$, each translating images from one domain into the other, and correspondingly two discriminators, $D_X$ and $D_Y$, which judge whether the images generated in the two domains are real. In the image translation task, CycleGAN converts the input image $X$ from the source domain to the target domain and then back from the target domain to the source domain; a cycle consistency loss constrains the input image $X$ and the cycle-reconstructed image $\hat {X}$, thereby achieving unsupervised image translation. In contrast, this paper uses a contrastive loss to achieve unsupervised cross-domain mapping of infrared images.

CUT, FastCUT, and the method proposed in this paper all build on the shared information between source domain and target domain images, introducing a contrastive loss to maximize the mutual information of corresponding positions between the two images. By encouraging the generated image to stay close to the corresponding positions of its input in the feature space, the contrastive loss helps the generator better capture the characteristics of the input image, thereby improving the generative model. To ensure that the generator does not make unnecessary changes to the generated image, an identity loss is also added to the objective function of CUT, as in CycleGAN. FastCUT, a special case of CUT that computes the consistency loss only on the input domain, is regarded as a faster and lighter alternative to CycleGAN. Different from CUT and FastCUT, the proposed CFSA-ICGAN no longer randomly selects anchor points and positive and negative samples, but instead selects the blocks that require focused attention for computing the contrastive loss based on the entropy of the color feature attention matrix.

4.2.1 Quantitative tests

Table 1 presents the quantitative test results for CycleGAN, CUT, FastCUT, and the method proposed in this paper across different datasets. The results unequivocally demonstrate that our method attains optimal metrics on each dataset. As shown in Fig. 6, we normalized the average measurement results of different colorization methods on the test set.

Fig. 6. Normalized performance comparison of colorization methods on test sets.

Table 1. Average metric results of different colorization methods on each test set

For the KAIST dataset, our method excels by achieving the highest PSNR (16.160), SSIM (0.541), and the lowest MSE (0.041). The CFSA-ICGAN proposed in this paper demonstrates substantial improvements compared to CycleGAN, enhancing PSNR by 23%, SSIM by 28%, and reducing MSE by 35.9%. In comparison to CUT, CFSA-ICGAN exhibits improvements of 18.8% in PSNR, 25.5% in SSIM, and 29.3% in MSE. Furthermore, when contrasted with FastCUT, CFSA-ICGAN achieves increases of 9.7% in PSNR, 22.7% in SSIM, and reductions of 30.5% in MSE.

Our proposed method also yields superior results on the FLIR dataset, with PSNR and SSIM reaching their highest values, and MSE reaching its lowest. CFSA-ICGAN outperforms CycleGAN by 2.4% in PSNR, 1.5% in SSIM, and reduces MSE by 14.3%. In comparison to CUT, CFSA-ICGAN exhibits improvements of 19.3% in PSNR, 13.1% in SSIM, and a remarkable reduction of 47.3% in MSE. When compared to FastCUT, CFSA-ICGAN achieves enhancements of 19.4% in PSNR, 12.9% in SSIM, and reduces MSE by 45.4%. These findings underscore the superiority of the proposed method in thermal infrared image colorization.

For the NIR dataset, the proposed method again emerges as the top performer in terms of PSNR and SSIM. Despite a slightly higher MSE, the overall metrics underscore the method’s efficacy in infrared image colorization.

In conclusion, the results unequivocally establish that the proposed method is proficient in generating higher quality color images that closely resemble real images for both NIR and thermal infrared band images.

4.2.2 Qualitative tests

Figure 7 illustrates the test results of different unsupervised infrared image colorization methods on the KAIST dataset. Our proposed CFSA-ICGAN demonstrates significant advantages in terms of semantic colorization and structural details. Although CUT, CycleGAN, and FastCUT can restore most of the background colors, they suffer from issues such as blurred details and loss of objects. As shown in Fig. 7, target objects such as “vehicles,” “pedestrian crossings,” and “buildings” exhibit problems such as edge distortion, color distortion, and lack of structural details.

Fig. 7. Qualitative test results of different colorization methods on the KAIST test set.

Figure 8 showcases the test results of different unsupervised infrared image colorization methods on the FLIR dataset. CUT exhibits the most severe issue of missing objects, such as “vehicles” and “trees.” CycleGAN and FastCUT can restore most of the image colors and structural features, but they fail to reconstruct small targets and local details effectively. For instance, the “small car” is missing, and the rear end of vehicles appears blurry.

Fig. 8. Qualitative test results of different colorization methods on the FLIR test set.

Figure 9 demonstrates the test results of different unsupervised infrared image colorization methods on the NIR dataset. CUT, CycleGAN, and FastCUT generate reasonably natural-colored near-infrared images. However, they suffer from semantic colorization errors in objects such as “roads” and “snowy mountains.” Additionally, our proposed CFSA-ICGAN exhibits superior restoration of structural details for target objects like “chandeliers” and “stairs” compared to the other three methods.

Fig. 9. Qualitative test results of different colorization methods on the NIR test set.

4.3 Ablation study

In addition to comparing with other unsupervised methods, we conducted comprehensive ablation experiments to showcase the utility of each key module in CFSA-ICGAN for the task of infrared image colorization. Table 2 presents the PSNR, SSIM, and MSE values with different modules removed. After removing the CSAM module, the PSNR is 13.975, the SSIM is 0.534, and the MSE is 0.049, which still represents a noteworthy improvement in SSIM and MSE compared to the baseline method. After removing the Att-block, the resulting PSNR, SSIM, and MSE values are poorer. With all modules combined, CFSA-ICGAN attains superior performance across all metrics, achieving the highest PSNR (16.160), SSIM (0.541), and the lowest MSE (0.041), thereby validating the effectiveness of each individual module within the method.

Table 2. Quantitative results for ablations

Figure 10 illustrates the qualitative test results on the KAIST dataset after removing different modules. From Fig. 10, it can be observed that removing the color selection attention module and residual attention module only allows for the partial restoration of background colors, resulting in the loss of detailed features and color information of most infrared targets. After removing the global feature loss, although most of the infrared image colors can be restored, issues such as vertical stripe noise and color distortion arise in regions like the sky, vehicles, and buildings. In comparison, our proposed method achieves the most suitable infrared image colorization for human visual perception while preserving detailed features.

Fig. 10. Qualitative comparison of different ablation studies of CFSA-ICGAN on the KAIST dataset.

5. Conclusion

In this paper, we propose a cross-domain infrared image colorization network, CFSA-ICGAN, guided by color feature selection attention. By leveraging unsupervised learning, our approach achieves semantically meaningful cross-domain infrared image colorization in the absence of a large-scale paired infrared-color image dataset. The residual fusion attention module is employed to reduce information loss in the feature extraction process of the generator and enhance the model’s focus on important features, thus preserving and restoring image details and textures, resulting in more natural and realistic colorized images. The color selection attention module maximizes the mutual information between the source and target domains, assigning semantically consistent colors to infrared images without altering their underlying structures. Furthermore, the introduction of the global feature loss mitigates the limitation of adversarial loss in restoring global features during the colorization process, enhancing the performance of the infrared image colorization network in terms of structural content and color style. Extensive experimental results demonstrate the effectiveness of our proposed method in both objective and subjective evaluations.

Although our unsupervised infrared image colorization method has achieved excellent results in experimental evaluations, there are still some limitations that require further research. On the one hand, existing thermal infrared image colorization datasets primarily concentrate on specific scenes, particularly “roads”. Consequently, the model’s adaptability to diverse scenes may be constrained by the limited scene variety in the training data. On the other hand, the multitude of potential mapping relationships between inputs and outputs in the infrared image colorization task leads to uncertainty in the colorization outcomes. The method proposed in this paper is primarily designed for colorizing single infrared images. However, when applied to infrared video data, the uncertainty inherent in the colorization process may introduce differences between consecutive frames, potentially resulting in distortions of features such as “lane lines” in the colorized infrared video. Consequently, our subsequent research will focus on two main aspects: first, constructing a more comprehensive infrared image colorization dataset covering various scenes to enhance the applicability of the model; and second, integrating temporal constraints into the network architecture to minimize deviations between consecutive frames in infrared video data and consequently reduce structural distortion.

Funding

Youth Innovation Promotion Association of the Chinese Academy of Sciences (No. Y2021071, No. Y202058); Leading Technology of Jiangsu Basic Research Plan (BK20192003); Funds of the Key Laboratory of National Defense Science and Technology (No. 6142113210205); Fundamental Research Funds for the Central Universities (JSGP202202, No. 30919011401, No. 30922010204, No. 30922010718); National Natural Science Foundation of China (No. 62105152, No. 62301253, No. 62305163).

Disclosures

The authors declare no conflicts of interest.

Data availability

Data underlying the results presented in this paper are available in Refs. [29,31].

References

1. Y. Zhou, Y. Liu, L. Yuan, et al., “Real-infraredSR: Real-World Infrared Image Super-Resolution via Thermal Imager,” Opt. Express 31(22), 36171–36187 (2023). [CrossRef]  

2. M. Limmer and H. P. Lensch, “Infrared Colorization Using Deep Convolutional Neural Networks,” in 2016 15th IEEE International Conference on Machine Learning and Applications (ICMLA), (IEEE, 2016), pp. 61–68.

3. A. Berg, J. Ahlberg, and M. Felsberg, “Generating Visible Spectrum Images From Thermal Infrared,” (2018), pp. 1143–1152.

4. X. Kuang, J. Zhu, X. Sui, et al., “Thermal Infrared Colorization via Conditional Generative Adversarial Network,” Infrared Phys. Technol. 107, 103338 (2020). [CrossRef]  

5. N. Bhat, N. Saggu, and S. Kumar, “Generating Visible Spectrum Images from Thermal Infrared Using Conditional Generative Adversarial Networks,” in 2020 5th International Conference on Communication and Electronics Systems (ICCES), (IEEE, 2020), pp. 1390–1394.

6. S. Hwang, J. Park, N. Kim, et al., “Multispectral Pedestrian Detection: Benchmark Dataset and Baseline,” in Proceedings of the IEEE conference on computer vision and pattern recognition, (2015), pp. 1037–1045.

7. C. Ledig, L. Theis, F. Huszár, et al., “Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network,” in Proceedings of the IEEE conference on computer vision and pattern recognition, (2017), pp. 4681–4690.

8. J.-Y. Zhu, T. Park, P. Isola, et al., “Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks,” in Proceedings of the IEEE international conference on computer vision, (2017), pp. 2223–2232.

9. D. Engin, A. Genç, and H. Kemal Ekenel, “Cycle-Dehaze: Enhanced Cyclegan for Single Image Dehazing,” in Proceedings of the IEEE conference on computer vision and pattern recognition workshops, (2018), pp. 825–833.

10. M. Yang and J. He, “Image Style Transfer Based on DPN-CycleGAN,” in 2021 4th International Conference on Pattern Recognition and Artificial Intelligence (PRAI), (IEEE, 2021), pp. 141–145.

11. Y. Li, H. Shi, and Q. Wang, “Generative Adversarial Networks Combined with Deep Feature Interpolation for Image Style Transfer,” in 2022 Global Conference on Robotics, Artificial Intelligence and Information Technology (GCRAIT), (IEEE, 2022), pp. 271–275.

12. A. Nyberg, A. Eldesokey, D. Bergstrom, et al., “Unpaired Thermal to Visible Spectrum Transfer Using Adversarial Training,” in Proceedings of the European conference on computer vision (ECCV) Workshops, (2018), pp. 0–0.

13. X. Cao, Y. Yao, and N. Yu, “Nearly Reversible Image-to-Image Translation Using Joint Inter-Frame Coding and Embedding,” in 2021 International Conference on Visual Communications and Image Processing (VCIP), (IEEE, 2021), pp. 1–5.

14. C.-T. Lin, “Cross Domain Adaptation for On-Road Object Detection Using Multimodal Structure-Consistent Image-to-Image Translation,” in 2019 IEEE international conference on image processing (ICIP), (IEEE, 2019), pp. 3029–3030.

15. X. Peng, Q. Li, T. Wu, et al., “Cross-GAN: Unsupervised Image-to-Image Translation,” in 2022 IEEE 6th Information Technology and Mechatronics Engineering Conference (ITOEC), (IEEE, 2022), Vol. 6, pp. 1755–1759.

16. B. Aubert, T. Cresson, J. A. de Guise, et al., “X-Ray to DRR Images Translation for Efficient Multiple Objects Similarity Measures in Deformable Model 3D/2D Registration,” IEEE Trans. Med. Imaging 42(4), 897–909 (2023). [CrossRef]  

17. H. Li and Y. Chen, “Unpaired Night-to-Day Translation: Image Restoration and Style Transfer under Low Illumination,” in 2021 IEEE International Conference on Image Processing (ICIP), (IEEE, 2021), pp. 1699–1703.

18. M. Eslami, S. Tabarestani, S. Albarqouni, et al., “Image-to-Images Translation for Multi-Task Organ Segmentation and Bone Suppression in Chest x-Ray Radiography,” IEEE Trans. Med. Imaging 39(7), 2553–2565 (2020). [CrossRef]  

19. X. Yang, Z. Yu, L. Xu, et al., “Underwater Ghost Imaging Based on Generative Adversarial Networks with High Imaging Quality,” Opt. Express 29(18), 28388–28405 (2021). [CrossRef]  

20. I. Goodfellow, J. Pouget-Abadie, M. Mirza, et al., “Generative Adversarial Nets,” in Advances in Neural Information Processing Systems (2014), Vol. 27.

21. X. Li, X. Guo, and J. Zhang, “N2d-Gan: A Night-to-Day Image-to-Image Translator,” in 2022 IEEE International Conference on Multimedia and Expo (ICME), (IEEE, 2022), pp. 1–6.

22. A. Dosovitskiy, L. Beyer, A. Kolesnikov, et al., “An Image Is Worth 16×16 Words: Transformers for Image Recognition at Scale,” arXiv, arXiv:2010.11929v2 (2021). [CrossRef]  

23. Z. Liu, Y. Lin, Y. Cao, et al., “Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows,” in Proceedings of the IEEE/CVF international conference on computer vision, (2021), pp. 10012–10022.

24. D. Torbunov, Y. Huang, H. Yu, et al., “Uvcgan: Unet Vision Transformer Cycle-Consistent Gan for Unpaired Image-to-Image Translation,” in Proceedings of the IEEE/CVF winter conference on applications of computer vision, (2023), pp. 702–712.

25. P. Dhariwal and A. Nichol, “Diffusion Models Beat Gans on Image Synthesis,” Advances in neural information processing systems 34, 8780–8794 (2021).

26. P. Isola, J.-Y. Zhu, T. Zhou, et al., “Image-to-Image Translation with Conditional Adversarial Networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, (2017), pp. 1125–1134.

27. M.-Y. Liu, T. Breuel, and J. Kautz, “Unsupervised Image-to-Image Translation Networks,” in Advances in Neural Information Processing Systems (2017), Vol. 30.

28. T. Park, A. A. Efros, R. Zhang, et al., “Contrastive Learning for Unpaired Image-to-Image Translation,” in Computer Vision – ECCV 2020, A. Vedaldi, H. Bischof, T. Brox, J.-M. Frahm, eds., Lecture Notes in Computer Science (Springer International Publishing, 2020), Vol. 12354, pp. 319–345.

29. S. Hwang, J. Park, N. Kim, et al., “Multispectral Pedestrian Detection: Benchmark Dataset and Baseline,” in Conference on computer vision and pattern recognition (IEEE, 2015), pp. 1037–1045.

30. L. Chen, Y. Liu, Y. He, et al., “Colorization of Infrared Images Based on Feature Fusion and Contrastive Learning,” Opt. Lasers Eng. 162, 107395 (2023). [CrossRef]  

31. M. Brown and S. Süsstrunk, “Multi-Spectral SIFT for Scene Category Recognition,” in CVPR (IEEE, 2011), pp. 177–184.
