
PSCAT: a lightweight transformer for simultaneous denoising and super-resolution of OCT images

Open Access

Abstract

Optical coherence tomography (OCT), owing to its non-invasive nature, has demonstrated tremendous potential in clinical practice and has become a prevalent diagnostic method. Nevertheless, the inherent speckle noise and low sampling rate in OCT imaging often limit the quality of OCT images. In this paper, we propose a lightweight Transformer to efficiently reconstruct high-quality images from noisy and low-resolution OCT images acquired by short scans. Our method, PSCAT, employs spatial window self-attention and channel attention in parallel within the Transformer block to aggregate features from both spatial and channel dimensions. It explores the potential of the Transformer in denoising and super-resolution for OCT, reducing computational costs and enhancing the speed of image processing. To effectively assist in restoring high-frequency details, we introduce a hybrid loss function in both the spatial and frequency domains. Extensive experiments demonstrate that our PSCAT has fewer network parameters and lower computational costs than state-of-the-art methods while delivering competitive performance both qualitatively and quantitatively.

© 2024 Optica Publishing Group under the terms of the Optica Open Access Publishing Agreement

1. Introduction

Optical coherence tomography (OCT) utilizes the interference properties of light waves to non-invasively measure optical path differences, thus obtaining high-resolution images of biological tissues [1,2]. It is widely used in medicine for observing tissue microstructure and diagnosing abnormalities in fields including ophthalmology, cardiovascular medicine, and dermatology. Owing to its non-invasive, high-resolution, and real-time imaging nature, OCT plays a crucial role in medical diagnostics and research. In ophthalmology in particular, it serves as one of the safest and most effective tools for diagnosing various eye diseases such as retinal diseases, macular diseases, optic nerve diseases, and glaucoma. Due to the limitations of the interferometric imaging principle of OCT, its images are often affected by inherent speckle noise [3], which decreases the signal-to-noise ratio (SNR) of OCT images. On the other hand, to achieve a large field of view and reduce the impact of involuntary microsaccades, clinical practitioners often use downsampling to accelerate acquisition while maintaining the same scanning frequency of the light source. However, this also reduces the acquired information, thus lowering the resolution of the OCT image. Developing appropriate methods to improve the SNR and resolution of OCT images is therefore crucial, offering clinicians clearer images for observing retinal structure and disease characteristics.

Over the past decades, considerable efforts have been made to find a reliable method for reconstructing low signal-to-noise ratio and low-resolution (LSLR) OCT images into high signal-to-noise ratio and high-resolution (HSHR) images. Fang et al. [4] introduced an efficient sparse representation-based image reconstruction framework called SBSDI, which simultaneously performs interpolation and denoising of retinal OCT images. Trinh et al. [5] proposed a competitive example-based super-resolution (SR) method for medical images capable of enhancing resolution while being robust to heavy noise. Seelamantula et al. [6] introduced a super-resolution reconstruction method based on a parametric representation that leverages an iterated singular-value decomposition algorithm, which is named Cadzow denoiser. Abbasi et al. [7] presented a nonlocal weighted sparse representation (NWSR) method for reconstructing HSHR retinal OCT images. Most of these traditional methods require complex regularizers, resulting in high computational complexity and inflexibility. The image restoration quality is not ideal and is difficult to apply in clinical practice.

In recent years, deep learning-based algorithms have shown their overwhelming advantages in image processing, ranging from low-level tasks such as image denoising, deblurring, and SR to high-level tasks such as segmentation, detection, and recognition. A large number of deep learning-based methods have been employed for speckle noise reduction and resolution enhancement in OCT images [8-12]. Huang et al. [13] proposed a generative adversarial network-based approach, SDSR-OCT, to simultaneously denoise and super-resolve OCT images. Qiu et al. [14] proposed a semi-supervised learning approach named N2NSR-OCT to generate denoised and super-resolved OCT images simultaneously using up- and down-sampling networks. Cao et al. [15] modified the existing super-resolution generative adversarial network (SR-GAN) for OCT image reconstruction to address the problem of generating a high-resolution OCT image from a low optical and low digital resolution image. Das et al. [16] proposed an unsupervised framework to perform fast and reliable SR without the requirement of aligned LR-HR pairs, using adversarial learning with cycle consistency and identity mapping priors to preserve the spatial correlation, color, and texture details. These CNN-based methods have promoted the development of OCT image denoising and super-resolution. As more extensive and deeper CNN models are developed to improve learning ability, image quality has also been greatly improved. CNN models are based on the idea of local receptive fields, which extract features by sliding convolutional kernels over the image. However, this local receptive field mechanism limits the model's ability to perceive global information.

Recently, the Transformer [17], proposed in natural language processing (NLP), has shown outstanding performance in multiple high-level vision tasks. The core of the Transformer is the self-attention mechanism, which enables the establishment of global dependencies and alleviates the limitations of CNN-based algorithms. Considering the potential of the Transformer, some researchers have attempted to apply it to low-level tasks such as image denoising and super-resolution [18-21]. Despite its success and great promise, the Transformer has been little investigated for OCT denoising and super-resolution. We aim to fully explore the potential of the Transformer in simultaneous denoising and super-resolution of OCT. Specifically, we propose a lightweight parallel spatial and channel attention Transformer (PSCAT) to simultaneously denoise and super-resolve LSLR OCT images. The window-based multi-head self-attention and channel attention modules in the Transformer block aggregate features from both spatial and channel dimensions. The two attention mechanisms complement each other: spatial attention enriches each feature map's spatial representation, helping model channel dependencies, while channel attention provides global information between features for spatial attention, expanding its receptive field. Compared with state-of-the-art methods, PSCAT has fewer network parameters and lower computational costs, making it more suitable for rapidly processing large clinical scan samples.

In summary, the key contributions of this paper are as follows:

  • We propose a parallel spatial and channel attention Transformer module that combines window-based multi-head self-attention and channel attention to capture spatial and channel features simultaneously, achieving inter-block feature aggregation across different dimensions.
  • We develop an effective lightweight Transformer network that utilizes a hybrid loss function in both the spatial and frequency domains for simultaneously denoising and super-resolving OCT images in an end-to-end manner.
  • Extensive experimental results demonstrate that our PSCAT achieves state-of-the-art results for the OCT image enhancement task compared to traditional, CNN-based, and other Transformer-based methods.

2. Method

2.1 Problem statement

The goal of the simultaneous denoising and super-resolution task for OCT images is to restore HSHR images from LSLR images. A typical OCT denoising and super-resolution model can be expressed as

$$\mathbf{\hat{I}}_{HSHR}=G(\mathbf{I}_{LSLR})$$
where $\mathbf {I}_{LSLR}$ is an input OCT image with low SNR and resolution, $G$ is the operator for noise reduction and resolution enhancement, and $\mathbf {\hat {I}}_{HSHR}$ represents the denoised and super-resolved OCT image generated by $G$.

Given a set of paired LSLR and HSHR images $\left \{ (\mathbf {I}_{LSLR}, \mathbf {I}_{HSHR}) \mid \mathbb {R}^{n} \right \}$, the model $G$ can be represented by the parameterized function $G_{\Theta }$, where $\Theta$ is the vector of parameters. The parameterized vector can be computed as:

$$\Theta=\arg\min_{\Theta}\frac{1}{N}\sum_{i=1}^{N}L\left(G_\Theta (\mathbf{I}_{LSLR};\Theta)-\mathbf{I}_{HSHR}\right)$$
where $G_\Theta (\mathbf {I}_{LSLR};\Theta ): \mathbb {R}^{n} \to \mathbb {R}^{n}$ is the deep learning network model represented by a parameterized vector $\Theta$, $N$ is the number of input images, and $L$ is the loss function used by the network.

2.2 Network architecture

2.2.1 Overall structure

As illustrated in Fig. 1, the overall network of the proposed PSCAT comprises three parts: shallow feature extraction, deep feature extraction, and image reconstruction. This architecture design is widely used in natural image super-resolution networks [19,21-23]. Initially, given an LSLR input image $\mathbf {I}_{LSLR} \in \mathbb {R}^{H\times W \times C_{in}}$, we first exploit one $3\times 3$ convolutional layer to extract the shallow feature $F_S \in \mathbb {R}^{H\times W \times C}$, where $H$ and $W$ denote the height and width of the input image, while $C_{in}$ and $C$ represent the channel numbers of the input image and the intermediate feature, respectively.

Fig. 1. The overall architecture of the proposed lightweight parallel spatial and channel attention Transformer (PSCAT) and the structure of channel attention module (CAM). $\oplus /\otimes$: element-wise addition / multiplication.

Subsequently, the shallow feature $F_S$ enters the deep feature extraction module to obtain the deep feature $F_D \in \mathbb {R}^{H\times W \times C}$. The deep feature extraction module is stacked by $N_G$ parallel attention Transformer groups (PATGs). The residual strategy is introduced here to ensure the stability of training. Each PATG contains $N_B$ parallel attention Transformer blocks (PATBs). Each PATB contains a spatial attention module (SAM) and a channel attention module (CAM), arranged in parallel. At the end of each PATG and deep feature extraction module, there is a $3\times 3$ convolutional layer for refining features.

Finally, the shallow feature $F_S$ and the deep feature $F_D$ are fused through a global residual connection and fed into the image reconstruction module. In this module, the HSHR output image $\mathbf {\hat {I}}_{HSHR} \in \mathbb {R}^{H\times W \times C_{in}}$ is reconstructed from the fused feature through the horizontal direction PixelShuffle (HDPS) upsampling operation, and $3\times 3$ convolutional layers are adopted to aggregate features before and after the upsampling operation.

2.2.2 Spatial attention module (SAM)

The attention mechanism has become one of the most widely used components in deep learning, especially in NLP and computer vision. Its core idea is to imitate human attention, focusing on the most relevant or important parts when processing a large amount of information. The self-attention mechanism reduces the dependence on external information and is better at capturing the internal correlations of data or features. ViT [24] was the first to introduce multi-head self-attention (MSA) [17] into computer vision. Swin Transformer [25] introduces the shifted windowing scheme, which increases efficiency by limiting self-attention computation to non-overlapping local windows while allowing cross-window connections. It represents a significant advancement in applying Transformer models to computer vision, combining the strengths of Transformers and CNNs to process and understand visual data efficiently.

Our SAM follows Swin Transformer’s window-based multi-head self-attention (W-MSA), reduces receptive fields, and limits self-attention computation to local windows. Given an input $X \in \mathbb {R}^{H\times W \times C}$, we first reshape $X$ into $\frac {HW}{M^2}$ non-overlapping local windows of the size $M\times M$. Then, we calculate the standard Softmax attention within each window. For a local window feature $X_{W} \in \mathbb {R}^{N \times C}$, where $N = M \times M$, the query, key and value matrices $Q$, $K$, and $V$ are computed as follows in each head:

$$Q = X_{W}P_{Q}, K=X_{W}P_{K}, V=X_{W}P_{V}$$
where $P_{Q}$, $P_{K}$ and $P_{V}$ are projection matrices that are shared across different windows. The Softmax attention is computed as:
$$Attention(Q,K,V)=SoftMax(\frac{QK^{T}}{\sqrt{d}}+B)V$$
where $d$ represents the dimension of $Q/K$, and $B$ denotes the relative position encoding.
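
For illustration, the following is a minimal PyTorch sketch of the window-partitioned self-attention described above, with a single head and without the relative position bias $B$ or the shifted-window variant; the helper names are ours and this is not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def window_partition(x, m):
    # (B, H, W, C) -> (B * H/m * W/m, m*m, C); H and W are assumed divisible by m
    b, h, w, c = x.shape
    x = x.view(b, h // m, m, w // m, m, c).permute(0, 1, 3, 2, 4, 5)
    return x.reshape(-1, m * m, c)

def window_reverse(xw, m, b, h, w):
    # inverse of window_partition: (num_windows*B, m*m, C) -> (B, H, W, C)
    c = xw.shape[-1]
    x = xw.view(b, h // m, w // m, m, m, c).permute(0, 1, 3, 2, 4, 5)
    return x.reshape(b, h, w, c)

class WindowAttention(nn.Module):
    """Single-head window self-attention sketch; the paper uses 6 heads and a
    learned relative position bias B, both omitted here for brevity."""
    def __init__(self, c, m=16):
        super().__init__()
        self.m, self.scale = m, c ** -0.5
        self.qkv = nn.Linear(c, 3 * c, bias=False)   # P_Q, P_K, P_V, shared across windows
        self.proj = nn.Linear(c, c)

    def forward(self, x):                             # x: (B, H, W, C)
        b, h, w, _ = x.shape
        xw = window_partition(x, self.m)              # local windows of size m x m
        q, k, v = self.qkv(xw).chunk(3, dim=-1)
        attn = F.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return window_reverse(self.proj(attn @ v), self.m, b, h, w)
```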

2.2.3 Channel attention module (CAM)

Channel attention aims to model the correlation between different channels, automatically obtain the importance of each feature channel through network learning, and finally assign different weight coefficients to each channel to enhance important features and suppress non-important features. The representative model of the channel attention mechanism is Squeeze and Excitation Networks (SENet) [26].

As shown in Fig. 1, our CAM consists of two $3\times 3$ convolutional layers with a GELU activation and a standard channel attention calculation following CBAM [27]. The channel attention in our CAM is computed as:

$$\begin{aligned} F_{Avg}(X) & =Conv(ReLU(Conv(AvgPool(X)))),\\ F_{Max}(X) & =Conv(ReLU(Conv(MaxPool(X)))),\\ CAM(X) & = X\ast \delta (F_{Avg}(X)+F_{Max}(X)) \end{aligned}$$
where $X$ represents the input feature map, and $AvgPool$ and $MaxPool$ represent adaptive average pooling and max pooling operations that aggregate spatial information into the channel dimension. $Conv$ is a $1\times 1$ convolutional layer, $ReLU$ is adopted between the two convolutional layers, $\delta$ is the Sigmoid nonlinear activation function, and $\ast$ is an element-wise multiplication operation. $F_{Avg}(X)$ and $F_{Max}(X)$ denote the intermediate features, and $CAM(X)$ is the output of CAM.
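
A minimal PyTorch sketch of the CAM is given below. The reduction ratio of the shared $1\times1$ bottleneck and the exact placement of the two $3\times3$ convolutions relative to the attention weighting are our assumptions, since they are not fully specified in the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CAM(nn.Module):
    """Channel attention module sketch: two 3x3 convs with a GELU, followed by
    CBAM-style channel attention (Eq. (5)). reduction=16 is an assumed value."""
    def __init__(self, c, reduction=16):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(c, c, 3, padding=1), nn.GELU(), nn.Conv2d(c, c, 3, padding=1))
        self.mlp = nn.Sequential(                      # shared Conv-ReLU-Conv (1x1) bottleneck
            nn.Conv2d(c, c // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(c // reduction, c, 1))

    def forward(self, x):                              # x: (B, C, H, W)
        x = self.body(x)
        f_avg = self.mlp(F.adaptive_avg_pool2d(x, 1))  # aggregate spatial info per channel
        f_max = self.mlp(F.adaptive_max_pool2d(x, 1))
        return x * torch.sigmoid(f_avg + f_max)        # reweight channels
```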

2.2.4 Parallel attention transformer block (PATB)

The hybrid attention mechanism can more comprehensively capture and represent the complexity of the input data, thereby improving the model's ability to understand the data and effectively boosting the modeling ability of the Transformer. Some studies [21,23] explore introducing channel attention into the Transformer to aggregate spatial and channel information. The spatial window self-attention models the fine spatial relationship between pixels, and the channel attention models the relationship between feature maps, thereby utilizing global image information.

Our PATB adopts a hybrid attention mechanism and arranges spatial and channel attention in parallel. It comprises three parts: SAM, CAM, and multilayer perceptron (MLP). The three parts are interspersed with LayerNorm (LN) and residual connections, as shown in Fig. 1. For a given input feature $X$, the entire calculation process of PATB is as follows:

$$\begin{aligned} F_{SAM}(X) & = (S)W \text{-} MSA(LN(X)),\\ F_{CAM}(X) & = CAM(LN(X)),\\ F_{Att}(X) & = F_{SAM}(X) + \gamma F_{CAM}(X) + X,\\ PATB(X) & = MLP(LN(F_{Att}(X)))+F_{Att}(X) \end{aligned}$$

where $W \text {-} MSA$ and $SW \text {-} MSA$ represent window-based multi-head self-attention and shifted window-based multi-head self-attention, respectively. In consecutive PATBs, $W \text {-} MSA$ and $SW \text {-} MSA$ are used alternately. $F_{SAM}(X)$, $F_{CAM}(X)$, and $F_{Att}(X)$ denote the intermediate features, $PATB(X)$ represents the output of PATB, and $\gamma$ is a constant parameter utilized to balance SAM and CAM.
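
Putting the two branches together, a sketch of Eq. (6) is shown below, reusing the WindowAttention and CAM sketches above. The MLP expansion ratio of 2 and the tensor layout are our assumptions, and the shifted-window variant is again omitted.

```python
import torch
import torch.nn as nn

class PATB(nn.Module):
    """Parallel attention Transformer block sketch (Eq. (6)); gamma = 0.01 follows Sec. 3.2."""
    def __init__(self, c, m=16, gamma=0.01, mlp_ratio=2):
        super().__init__()
        self.ln1, self.ln2 = nn.LayerNorm(c), nn.LayerNorm(c)
        self.sam = WindowAttention(c, m)               # spatial branch (W-MSA)
        self.cam = CAM(c)                              # channel branch
        self.mlp = nn.Sequential(
            nn.Linear(c, mlp_ratio * c), nn.GELU(), nn.Linear(mlp_ratio * c, c))
        self.gamma = gamma

    def forward(self, x):                              # x: (B, H, W, C)
        y = self.ln1(x)
        f_sam = self.sam(y)                            # spatial window self-attention
        f_cam = self.cam(y.permute(0, 3, 1, 2)).permute(0, 2, 3, 1)  # channel attention
        f_att = f_sam + self.gamma * f_cam + x         # parallel fusion with residual
        return self.mlp(self.ln2(f_att)) + f_att
```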

2.2.5 Horizontal direction PixelShuffle (HDPS)

In the OCT acquisition process, a line scanning mode is typically employed, where the scanning frequency in the A-scan direction directly influences the resolution in the horizontal direction. While high-frequency A-scans can yield high-resolution images, their acquisition speed may be constrained. Unlike for natural images, we use a modified PixelShuffle [28] for upsampling, namely the horizontal direction PixelShuffle (HDPS). The use of HDPS can improve the acquisition speed in clinical applications. As shown in Fig. 2, the input feature $F_{in} \in \mathbb {R}^{H\times W \times C}$ is first processed through a $3\times 3$ convolutional layer to obtain the channel-expanded feature $F_{amp} \in \mathbb {R}^{H\times W \times (r \ast C)}$.

$$F_{amp} = Conv(F_{in})$$
where $H$, $W$, and $C$ represent the height, width, and number of channels of the feature map, and $r$ is the upsampling factor. After that, $F_{amp}$ is reshaped to obtain the output feature $F_{out} \in \mathbb {R}^{H\times (r \ast W) \times C}$.
$$F_{out} = Reshape(F_{amp})$$

Fig. 2. Horizontal direction PixelShuffle (HDPS) for OCT images.

Finally, the feature map is upsampled by a factor of $r$ in the horizontal direction.
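
A minimal implementation of the reshape in Eqs. (7)-(8) might look as follows; the channel grouping convention mirrors torch.nn.PixelShuffle and is our assumption.

```python
import torch

def hdps(x, r):
    """Horizontal direction PixelShuffle sketch: (B, r*C, H, W) -> (B, C, H, r*W)."""
    b, rc, h, w = x.shape
    c = rc // r
    x = x.view(b, c, r, h, w)                  # split the expanded channels into (C, r)
    x = x.permute(0, 1, 3, 4, 2)               # (B, C, H, W, r)
    return x.reshape(b, c, h, w * r)           # interleave the r sub-pixels along width

# e.g. hdps(torch.randn(1, 2 * 96, 64, 64), r=2).shape -> torch.Size([1, 96, 64, 128])
```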

2.3 Loss function

To obtain the HSHR OCT image from an LSLR input, we introduce a hybrid loss function in both the spatial and frequency domains. Our loss function includes two terms: the MAE loss $\mathcal {L}_{MAE}$ and the FFT loss $\mathcal {L}_{FFT}$. The MAE loss $\mathcal {L}_{MAE}$ is defined as follows:

$$\mathcal{L}_{MAE} = \left \| \hat{\mathbf{I}}_{HSHR} - \mathbf{I}_{HSHR} \right \|_{1}$$
where $\hat {\mathbf {I}}_{HSHR}$ and ${\mathbf {I}}_{HSHR}$ represent the HSHR image output by the network and the real HSHR image, respectively. $\left \| \cdot \right \|_{1}$ represents the L1 distance, which is commonly used in learning-based OCT denoising and resolution enhancement. The MAE loss ensures that the network's output is close to the ground truth, but a pixel-level loss alone cannot effectively help restore high-frequency details. Therefore, we add a frequency constraint to regularize network training:
$$\mathcal{L}_{FFT} = \left \| F(\hat{\mathbf{I}}_{HSHR}) - F(\mathbf{I}_{HSHR}) \right \|_{1}$$

where $F$ represents the fast Fourier transform.

The overall loss function of our proposed PSCAT is as follows:

$$\mathcal{L} = \mathcal{L}_{MAE} + \lambda \mathcal{L}_{FFT}$$
where $\lambda$ is a constant parameter utilized to balance the two terms. Typically, $\lambda$ is set to a small value close to 0. For image denoising and enhancement tasks, the MAE loss is the main component for ensuring model convergence, while frequency domain constraints are solely employed for the further protection of structural details.
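
A sketch of the hybrid loss in Eqs. (9)-(11) is given below; treating $\mathcal{L}_{FFT}$ as the mean absolute value of the complex FFT residual is one common realization and may differ in detail from the authors' code.

```python
import torch
import torch.nn.functional as F

def hybrid_loss(pred, target, lam=0.05):
    """L = L_MAE + lambda * L_FFT; lambda = 0.05 follows Sec. 3.2."""
    l_mae = F.l1_loss(pred, target)                    # spatial-domain constraint (Eq. (9))
    l_fft = (torch.fft.fft2(pred) - torch.fft.fft2(target)).abs().mean()  # frequency-domain constraint (Eq. (10))
    return l_mae + lam * l_fft
```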

3. Experiments and results

3.1 Data preparation

The dataset used for training and validation in this work is PKU37 [12], which was collected from 37 healthy eyes of 37 subjects using a customized spectral domain OCT (SDOCT) system. The center wavelength and full-width-at-half-maximum bandwidth of the light source are 845 nm and 45 nm, respectively. The lateral and axial resolutions are 16 µm and 6 µm, respectively. More details about the acquisition of PKU37 can be found in [9].

Two publicly available datasets were used as test sets, namely DUKE17 [29] and DUKE28 [4]. DUKE17 was acquired from 17 eyes of 17 subjects (10 normal and 7 with non-neovascular age-related macular degeneration (AMD)) in the A2A SDOCT study; volumetric scans were acquired using SDOCT imaging systems from Bioptigen, Inc. (Research Triangle Park, NC). DUKE28 was obtained from 28 eyes of 28 subjects, with and without non-neovascular AMD, enrolled in the Age-Related Eye Disease Study 2 (AREDS2) Ancillary SDOCT study (A2A SDOCT), using 840 nm SDOCT imaging systems from Bioptigen, Inc. (Durham, NC, USA).

The division of the training, validation, and test set is shown in Table 1. Referring to the original article of PKU37 dataset [12], we selected the images from 20 subjects in PKU37 for training (namely PKU37-train) and used the remaining 17 subjects for validation (namely PKU37-val). DUKE17 and DUKE28 were used for cross-domain tests to verify the generalization ability of the deep learning-based methods. PKU37-val was used for hyperparameter adjustment of all deep learning-based methods.


Table 1. Training, validation, and test datasets used in this work.

3.2 Implementation details

We implemented the proposed PSCAT using the PyTorch [30] toolbox, and all the experiments were conducted on an Ubuntu 20.04 operating system with an NVIDIA GeForce RTX 3090 GPU. In the training stage, the Adam [31] optimizer was adopted with $\beta_1 = 0.9$ and $\beta_2 = 0.99$, the batch size was set to 4, and the learning rate was set to $2\times10^{-4}$ for all $2\times10^{5}$ iterations. We keep the depth and width of the PSCAT structure the same as SwinIR [19]: the numbers of PATGs and PATBs are both set to 6. The channel number of the hidden layers in PATB is set to 96. The attention head number and window size are set to 6 and 16 for (S)W-MSA. The weights $\gamma$ and $\lambda$ were set to 0.01 and 0.05, respectively, after an extensive hyperparameter search.
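
For reference, an optimizer and training-step setup matching the reported hyperparameters could look as sketched below; `model` and the LSLR/HSHR batches are placeholders for the PSCAT network and the PKU37-train data pipeline, and `hybrid_loss` refers to the sketch in Sec. 2.3.

```python
import torch

# Hypothetical training setup matching the reported settings
# (Adam with beta1 = 0.9, beta2 = 0.99, lr = 2e-4, batch size 4, 2e5 iterations).
def make_optimizer(model):
    return torch.optim.Adam(model.parameters(), lr=2e-4, betas=(0.9, 0.99))

def train_step(model, optimizer, lslr, hshr):
    optimizer.zero_grad()
    loss = hybrid_loss(model(lslr), hshr)   # hybrid_loss from the Sec. 2.3 sketch
    loss.backward()
    optimizer.step()
    return loss.item()
```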

To better evaluate the performance of the proposed PSCAT, twelve methods were considered for comparison. These methods can be roughly divided into three categories: three typical traditional methods commonly used for OCT denoising, Wavelet [32], NLM [33], and BM3D [34]; five excellent CNN-based methods, EDSR [35], RCAN [22], HAN [36], IMDN [37], and SAFMN [38]; and four innovative Transformer-based methods, SwinIR [19], HAT [21], DAT [23], and DLGSANet [39]. Among them, IMDN, SAFMN, and DLGSANet are lightweight models. For the traditional methods, we combined them with bicubic interpolation and optimized their parameters. For the deep learning-based methods, we modified the upsampling part of the published code provided by their authors and used the same dataset partitioning as for our PSCAT.

For quantitative comparison, we used three metrics, including peak signal-to-noise ratio (PSNR), structural similarity index (SSIM), and root mean square error (RMSE). PSNR represents the ratio of the maximum possible power of a signal to the destructive noise power that affects its representation accuracy. SSIM measures image similarity from three aspects: brightness, contrast, and structure. RMSE reflects the pixel-by-pixel difference between the generated image and the ground truth.
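
As an illustration, the three metrics can be computed for a pair of grayscale images with NumPy and scikit-image as sketched below; the data range and any ROI cropping used in the paper's evaluation are assumptions, and scikit-image is our choice rather than the authors' stated tooling.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate(pred, gt, data_range=255.0):
    """Return (PSNR, SSIM, RMSE) for two 2-D grayscale images."""
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    psnr = peak_signal_noise_ratio(gt, pred, data_range=data_range)
    ssim = structural_similarity(gt, pred, data_range=data_range)
    rmse = float(np.sqrt(np.mean((pred - gt) ** 2)))
    return psnr, ssim, rmse
```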

3.3 Performance comparison

To quantitatively evaluate the simultaneous denoising and resolution enhancement performance of the proposed method, Table 2 summarizes two model evaluation metrics (Params, FLOPs) and three image quality evaluation metrics (PSNR, SSIM, RMSE) (mean$\pm$standard deviation) for all the methods on PKU37-val. Our PSCAT is significantly superior to the other methods in all image quality evaluation metrics, regardless of whether the scale factor is $\times$2 or $\times$4. In particular, when the scale factor is $\times$4, compared to the conventional CNN-based model EDSR and the Transformer-based model SwinIR, the proposed PSCAT achieves gains of 0.08 dB and 0.12 dB in PSNR, while the network parameters and FLOPs of EDSR and SwinIR are roughly 10 times and 3 times those of PSCAT, respectively. Among the Transformer-based models, whether the scale factor is $\times$2 or $\times$4, our PSCAT has the fastest inference time; when the scale factor is $\times$4, the inference time of PSCAT is about 1/3 of that of SwinIR and HAT, and 1/5 of that of DAT. In the comparison among lightweight models, DLGSANet has slightly more parameters than PSCAT, whereas IMDN and SAFMN have fewer; however, all three models significantly underperform relative to PSCAT. All comparisons presented in Table 2 show that PSCAT is lightweight and much more efficient than the state-of-the-art methods.


Table 2. Quantitative evaluation of the proposed PSCAT against Non-learning/CNN-based/Transformer-based methods on PKU37-val. #Params means the number of network parameters. #FLOPs denotes the number of FLOPs, calculated on images with a resolution of 128 $\times$ 128 pixels. The best and second best results are marked in bold and underlined, respectively.

To validate the visual effects of the proposed method, two representative OCT images were selected from PKU37-val and presented in Figs. 3 and 4. Two ROIs were chosen and magnified for better visualization. It is easy to notice that the results of the deep learning methods are obviously better than traditional methods in both speckle reduction and detail preservation. The results of traditional methods (Wavelet+Bicubic, NLM+Bicubic, BM3D+Bicubic) contain a large amount of noise, and NLM+Bicubic even introduces some streak artifacts. All deep learning methods appear to remove noise while enhancing resolution. Our lightweight model PSCAT matches the visual quality of leading methods like SwinIR, HAT, and DAT with only about 1/5 to 1/3 of their parameters, while surpassing all in PSNR, SSIM, and RMSE metrics.

Fig. 3. Performance comparison of different methods on PKU37-val of $\times$2 SR.

Fig. 4. Performance comparison of different methods on PKU37-val of $\times$4 SR.

3.4 Generalization comparison

Due to the inconsistencies in OCT acquisition devices, objects, protocols, and other factors used in clinical practice, there is a domain shift problem between different datasets. It is necessary to study the generalization performance of well-trained OCT simultaneous denoising and super-resolution networks. Therefore, we conducted cross-domain testing on two datasets, DUKE17 and DUKE28, using various deep learning networks trained on PKU37-train.

Table 3 presents the quantitative results of all deep learning methods on DUKE17 and DUKE28. The proposed PSCAT achieves the optimal performance on the DUKE17 dataset with a scale factor of $\times$4 and on the DUKE28 dataset with a scale factor of either $\times$2 or $\times$4. When the scale factor is $\times$2 on the DUKE17 dataset, the SSIM of PSCAT is the best, with PSNR and RMSE being second best and closely matching the optimal results. It is evident that, regardless of whether the scale factor is $\times$2 or $\times$4, the SSIM of PSCAT consistently surpasses that of HAT, confirming our method's superiority in retaining structural details. Furthermore, considering the Params and FLOPs data presented in Table 2, our model achieves slightly better generalization performance than HAT while consuming only about a fifth of the computational cost.


Table 3. Quantitative comparison of cross-domain test with different deep learning-based methods on DUKE17 and DUKE28. The best and second best results are marked in bold and underlined, respectively.

To compare the simultaneous denoising and super-resolution results of different methods on the cross-domain test datasets, we selected one representative image from each dataset for display in Figs. 5 and 6. It can be seen that all deep learning methods exhibit varying degrees of denoising effects while improving resolution. Consistent with the denoising performance evaluation results on PKU37-val, the proposed PSCAT is significantly superior to other methods. The qualitative and quantitative evaluation results indicate that the proposed lightweight model PSCAT has superior generalization ability compared to all reference methods.

Fig. 5. Performance comparison of different methods on DUKE17 of $\times$4 SR.

Fig. 6. Performance comparison of different methods on DUKE28 of $\times$4 SR.

4. Discussion

4.1 Ablation studies

4.1.1 Effectiveness of CAM

We conduct experiments to inspect the effectiveness of the proposed CAM. The quantitative performance on the PKU37-val dataset for $\times$2 SR is reported in Table 4, where dim is the channel number of the hidden layers in PATB. When dim=96 or 144, the PSNR of the parallel CAM is higher; when dim=180, the PSNR of the serial CAM is higher. Therefore, the parallel CAM is more suitable for our lightweight network. From Table 4, it can also be observed that, contrary to intuition, the baseline model without CAM experiences a decrease in PSNR as dim increases. We believe this is due to overfitting caused by the increase in the number of parameters.


Table 4. Ablation study on the proposed CAM.

4.1.2 Effects of different dim values

We further investigated the performance impact of different dim values on the baseline model without CAM, and the results are shown in Fig. 7. It is evident that, regardless of the scale factor being $\times$2 or $\times$4, once the dim value exceeds 96, there is a downward trend in PSNR as the dim value and the number of model parameters increase. This suggests that in our specific OCT image denoising and resolution enhancement tasks, overfitting does occur as the number of parameters increases, leading to a decline in model performance. This is also why our lightweight Transformer can outperform models with large parameter counts.

Fig. 7. Effects of different dim values.

4.1.3 Effects of different designs of CAM

We conduct experiments to explore the effects of different CAM designs. Three implementations of channel attention are shown in Fig. 8. CBAM [27] aggregates the channel information of a feature map by using two pooling operations, generating two 2D maps, while SENet [26] only uses one pooling operation. NAFNet [40] proposes simplified channel attention (SCA), preserving the two most crucial roles of channel attention: aggregating global information and channel information interaction. Based on Table 5, the channel attention implementation of CBAM is better suited to our specific task.

Fig. 8. Illustration of (a) Simplified Channel Attention in NAFNet [40], (b) Channel Attention in SENet [26], and (c) Channel Attention in CBAM [27]. $\oplus /\otimes$: element-wise addition / multiplication.


Table 5. Effects of different channel attention (CA) in CAM.

4.1.4 Effectiveness of hybrid loss function

We conduct experiments to demonstrate the effectiveness of the proposed hybrid loss function. The quantitative performance on the PKU37-val dataset is reported in Table 6. It can be seen that after using the hybrid loss function, the PSNR, SSIM, and RMSE metrics for both $\times$2 and $\times$4 SR are all improved to varying degrees. To explore the effects of different hybrid ratios, we evaluated a group of $\lambda$ values from 0.01 to 0.1 and examined the performance change, as shown in Fig. 9. It can be found that when $\lambda =0.05$, the model achieves the highest PSNR regardless of whether the scale factor is $\times$2 or $\times$4.

Fig. 9. Effects of the constant parameter $\lambda$ in hybrid loss function.


Table 6. Ablation study on the proposed hybrid loss function.

4.2 Enhancement in retinal layer segmentation

For retinal OCT images, segmenting layers containing various anatomical and pathological structures is crucial for diagnosing and researching eye diseases. Preprocessing with denoising and super-resolution preserves important clinical structures, making segmentation results more accurate. To further demonstrate the effectiveness of our PSCAT, we compared the impact of different methods on the downstream retinal layer segmentation task. We took the images from DUKE17 that were processed by the various methods and fed them into the public segmentation tool OCTSEG [41] to segment seven layers automatically. Figure 10 shows the results of a typical case, and it is evident that PSCAT and HAT achieve the best performance in layer segmentation, as the segmentation lines are relatively flat, with no abnormal burrs, protrusions, or offsets. Moreover, PSCAT restores more choroidal details than HAT.

Fig. 10. Visual comparison of layer segmentation performance on DUKE17 of $\times$4 SR.

To better evaluate the enhancement effect, we employed various methods to process OCT images from a public retinal layer segmentation dataset [42]. We selected 772 image pairs, comprising OCT retinal images and corresponding retinal layer segmentation masks, from 20 subjects. U-Net [43] and Y-Net [44] were trained for segmentation, and the mean Dice score and mean intersection over union (mIoU) were used to evaluate all the methods. Figure 11 shows the visual segmentation results with the help of denoising and super-resolution by various methods. We can observe that the proposed PSCAT achieves the best segmentation results. Table 7 presents the quantitative results, which also demonstrate the superiority of our PSCAT in serving the downstream segmentation task.

Fig. 11. Visual comparison of retinal layer segmentation after preprocessing with different methods of $\times$2 SR.


Table 7. Quantitative results of retinal layer segmentation after preprocessing with different methods of $\times$2 SR.

4.3 Enhancement in pronounced retinal pathologies

As the training set PKU37-train used in this study was collected from healthy eyes, to further validate the effectiveness of our PSCAT in processing retinal pathological images, we employed the trained model to analyze OCT images with drusen and retinal edema from the retinal layer segmentation dataset [42] and retinal edema segmentation challenge dataset [45]. Figure 12 presents the visual comparison before and after PSCAT denoising and $\times$2 super-resolution. PSCAT significantly eliminates noise and enhances visual quality. Notably, although the PKU37-train does not include pathological images, the trained PSCAT can effectively enhance the quality of various OCT images with different pathologies, shapes, and structures. This is because the characteristics of speckle noise in OCT images are largely determined by the imaging system.

Fig. 12. Visual comparison of OCT images with drusen and retinal edema before and after processing by PSCAT of $\times$2 SR.

4.4 Limitations

The ablation studies demonstrate that the performance of our PSCAT is heavily dependent on the settings of hyperparameters. The process of identifying the optimal hyperparameters is complex and time-consuming, as it typically involves conducting numerous experiments to assess the impact of various hyperparameter combinations on model performance. Looking ahead, we can consider the implementation of automated hyperparameter tuning techniques, which aim to diminish the burden associated with manual hyperparameter adjustments. In addition, although our PSCAT is a lightweight Transformer model, the inherently complex architecture of the Transformer and the computations involved in its self-attention mechanism result in longer inference times. This can be a limiting factor for clinical applications that require rapid processing. Future improvements could focus on employing more efficient attention mechanisms, such as sparse attention, and utilizing techniques like model pruning and quantization to reduce the computational load.

5. Conclusion

In this paper, we propose an effective lightweight Transformer that parallelizes spatial and channel attention for simultaneous denoising and resolution enhancement of OCT images in an end-to-end manner. Our method uses spatial window self-attention and channel attention in the Transformer block to aggregate features from both spatial and channel dimensions, exploring the potential of the Transformer for OCT image quality improvement at low computational cost. Extensive experiments show that our proposed method delivers competitive performance, both qualitatively and quantitatively, compared to traditional, CNN-based, and Transformer-based methods. Thanks to its lightweight design, our method has fewer network parameters, lower computational costs, and faster processing speed, making it more suitable for clinical application needs.

Funding

National Natural Science Foundation of China (82371112, 62394311); Beijing Municipal Natural Science Foundation (Z210008); Shenzhen Science and Technology Program (KQTD20180412181221912).

Disclosures

The authors declare no conflicts of interest.

Data availability

Data underlying the results presented in this paper are available in Ref. [4,12,29,42,45].

References

1. D. Huang, E. A. Swanson, C. P. Lin, et al., “Optical coherence tomography,” Science 254(5035), 1178–1181 (1991). [CrossRef]  

2. J. G. Fujimoto, “Optical coherence tomography for ultrahigh resolution in vivo imaging,” Nat. Biotechnol. 21(11), 1361–1367 (2003). [CrossRef]  

3. J. M. Schmitt, S. Xiang, and K. M. Yung, “Speckle in optical coherence tomography,” J. Biomed. Opt. 4(1), 95–105 (1999). [CrossRef]  

4. L. Fang, S. Li, R. P. McNabb, et al., “Fast acquisition and reconstruction of optical coherence tomography images via sparse representation,” IEEE Trans. Med. Imaging 32(11), 2034–2049 (2013). [CrossRef]  

5. D.-H. Trinh, M. Luong, F. Dibos, et al., “Novel example-based method for super-resolution and denoising of medical images,” IEEE Trans. on Image Process. 23(4), 1882–1895 (2014). [CrossRef]  

6. C. S. Seelamantula and S. Mulleti, “Super-resolution reconstruction in frequency-domain optical-coherence tomography using the finite-rate-of-innovation principle,” IEEE Trans. Signal Process. 62(19), 5020–5029 (2014). [CrossRef]  

7. A. Abbasi, A. Monadjemi, L. Fang, et al., “Optical coherence tomography retinal image reconstruction via nonlocal weighted sparse representation,” J. Biomed. Opt. 23(03), 1–036011 (2018). [CrossRef]  

8. Z. Jiang, Z. Huang, B. Qiu, et al., “Weakly supervised deep learning-based optical coherence tomography angiography,” IEEE Trans. Med. Imaging 40(2), 688–698 (2020). [CrossRef]  

9. B. Qiu, Z. Huang, X. Liu, et al., “Noise reduction in optical coherence tomography images using a deep neural network with perceptually-sensitive loss function,” Biomed. Opt. Express 11(2), 817–830 (2020). [CrossRef]  

10. B. Qiu, S. Zeng, X. Meng, et al., “Comparative study of deep neural networks with unsupervised noise2noise strategy for noise reduction of optical coherence tomography images,” J. Biophotonics 14(11), e202100151 (2021). [CrossRef]  

11. M. Geng, X. Meng, J. Yu, et al., “Content-noise complementary learning for medical image denoising,” IEEE Trans. Med. Imaging 41(2), 407–419 (2021). [CrossRef]  

12. M. Geng, X. Meng, L. Zhu, et al., “Triplet cross-fusion learning for unpaired image denoising in optical coherence tomography,” IEEE Trans. Med. Imaging 41(11), 3357–3372 (2022). [CrossRef]  

13. Y. Huang, Z. Lu, Z. Shao, et al., “Simultaneous denoising and super-resolution of optical coherence tomography images based on generative adversarial network,” Opt. Express 27(9), 12289–12307 (2019). [CrossRef]  

14. B. Qiu, Y. You, Z. Huang, et al., “N2nsr-oct: Simultaneous denoising and super-resolution in optical coherence tomography images using semisupervised deep learning,” J. Biophotonics 14(1), e202000282 (2021). [CrossRef]  

15. S. Cao, X. Yao, N. Koirala, et al., “Super-resolution technology to simultaneously improve optical & digital resolution of optical coherence tomography via deep learning,” in 2020 42nd Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC), (IEEE, 2020), pp. 1879–1882.

16. V. Das, S. Dandapat, and P. K. Bora, “Unsupervised super-resolution of oct images using generative adversarial network for improved age-related macular degeneration diagnosis,” IEEE Sens. J. 20(15), 8746–8756 (2020). [CrossRef]  

17. A. Vaswani, N. Shazeer, N. Parmar, et al., “Attention is all you need,” in Advances in Neural Information Processing Systems, vol. 30 I. Guyon, U. V. Luxburg, S. Bengio, et al., eds. (Curran Associates, Inc., 2017).

18. Z. Wang, X. Cun, J. Bao, et al., “Uformer: A general u-shaped transformer for image restoration,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (2022), pp. 17683–17693.

19. J. Liang, J. Cao, G. Sun, et al., “Swinir: Image restoration using swin transformer,” in Proceedings of the IEEE/CVF international conference on computer vision, (2021), pp. 1833–1844.

20. S. W. Zamir, A. Arora, S. Khan, et al., “Restormer: Efficient transformer for high-resolution image restoration,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (2022), pp. 5728–5739.

21. X. Chen, X. Wang, J. Zhou, et al., “Activating more pixels in image super-resolution transformer,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (2023), pp. 22367–22377.

22. Y. Zhang, K. Li, K. Li, et al., “Image super-resolution using very deep residual channel attention networks,” in Proceedings of the European Conference on Computer Vision (ECCV), (2018).

23. Z. Chen, Y. Zhang, J. Gu, et al., “Dual aggregation transformer for image super-resolution,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), (2023), pp. 12312–12321.

24. A. Dosovitskiy, L. Beyer, A. Kolesnikov, et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” in International Conference on Learning Representations, (2021).

25. Z. Liu, Y. Lin, Y. Cao, et al., “Swin transformer: Hierarchical vision transformer using shifted windows,” in Proceedings of the IEEE/CVF international conference on computer vision, (2021), pp. 10012–10022.

26. J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (2018).

27. S. Woo, J. Park, J.-Y. Lee, et al., “Cbam: Convolutional block attention module,” in Proceedings of the European conference on computer vision (ECCV), (2018), pp. 3–19.

28. W. Shi, J. Caballero, F. Huszár, et al., “Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network,” in Proceedings of the IEEE conference on computer vision and pattern recognition, (2016), pp. 1874–1883.

29. L. Fang, S. Li, Q. Nie, et al., “Sparsity based denoising of spectral domain optical coherence tomography images,” Biomed. Opt. Express 3(5), 927–942 (2012). [CrossRef]  

30. A. Paszke, S. Gross, F. Massa, et al., “Pytorch: An imperative style, high-performance deep learning library,” Advances in neural information processing systems 32, 8206 (2019). [CrossRef]  

31. D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in International Conference on Learning Representations (ICLR), (San Diego, CA, USA, 2015).

32. D. C. Adler, T. H. Ko, and J. G. Fujimoto, “Speckle reduction in optical coherence tomography images by use of a spatially adaptive wavelet filter,” Opt. Lett. 29(24), 2878–2880 (2004). [CrossRef]  

33. X. Zhang, L. Li, F. Zhu, et al., “Spiking cortical model–based nonlocal means method for speckle reduction in optical coherence tomography images,” J. Biomed. Opt. 19(6), 066005 (2014). [CrossRef]  

34. L. Wang, Z. Meng, X. S. Yao, et al., “Adaptive speckle reduction in oct volume data based on block-matching and 3-d filtering,” IEEE Photonics Technol. Lett. 24(20), 1802–1804 (2012). [CrossRef]  

35. B. Lim, S. Son, H. Kim, et al., “Enhanced deep residual networks for single image super-resolution,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, (2017).

36. B. Niu, W. Wen, W. Ren, et al., “Single image super-resolution via a holistic attention network,” in Computer Vision – ECCV 2020, (Springer International Publishing, Cham, 2020), pp. 191–207.

37. Z. Hui, X. Gao, Y. Yang, et al., “Lightweight image super-resolution with information multi-distillation network,” in Proceedings of the 27th acm international conference on multimedia, (2019), pp. 2024–2032.

38. L. Sun, J. Dong, J. Tang, et al., “Spatially-adaptive feature modulation for efficient image super-resolution,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, (2023), pp. 13190–13199.

39. X. Li, J. Dong, J. Tang, et al., “Dlgsanet: lightweight dynamic local and global self-attention networks for image super-resolution,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, (2023), pp. 12792–12801.

40. L. Chen, X. Chu, X. Zhang, et al., “Simple baselines for image restoration,” in Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part VII, (Springer, 2022), pp. 17–33.

41. M. A. Mayer, J. Hornegger, C. Y. Mardin, et al., “Retinal nerve fiber layer segmentation on fd-oct scans of normal subjects and glaucoma patients,” Biomed. Opt. Express 1(5), 1358–1383 (2010). [CrossRef]  

42. S. Farsiu, S. J. Chiu, R. V. O’Connell, et al., “Quantitative classification of eyes with and without intermediate age-related macular degeneration using optical coherence tomography,” Ophthalmology 121(1), 162–172 (2014). [CrossRef]  

43. O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in Medical image computing and computer-assisted intervention–MICCAI 2015: 18th international conference, Munich, Germany, October 5-9, 2015, proceedings, part III 18, (Springer, 2015), pp. 234–241.

44. A. Farshad, Y. Yeganeh, P. Gehlbach, et al., “Y-net: A spatiospectral network for retinal oct segmentation,” in International Conference on Medical Image Computing and Computer-Assisted Intervention, (Springer, 2022).

45. J. Hu, Y. Chen, and Z. Yi, “Automated segmentation of macular edema in OCT using deep neural networks,” Med. Image Anal. 55, 216–227 (2019). [CrossRef]  

