Deep learning empowered highly compressive SS-OCT via learnable spectral&#x2013;spatial sub-sampling

Yuye Ling; Zhenxing Dong; Xueshen Li; Yu Gan; Yikai Su

doi:10.1364/OL.484500

Swept-source optical coherence tomography (SS-OCT) is a non-invasive volumetric imaging modality, which is widely used in biomedical settings [1,2]. Thanks to the advance of laser technology, the imaging rate of SS-OCT, which is often quantified by the A-line rate of the swept light source, has been improved all the way from several hertz to several megahertz [3]. While a faster imaging rate is always desirable for OCT to reject motion artifacts [4] and enable exciting applications such as dynamic-OCT [5], optical coherence elastography [6], and lidar [7], it has also imposed hefty levies on the following data acquisition, data transfer, and data storage. For a typical 200 kHz SS-OCT, the data bandwidth could be over 800 MB/s, if 2048 spectral sampling points are measured at 12-bit [8].

Various strategies have been proposed to alleviate the data bandwidth. One approach is to sub-sample the spectral interferogram during the acquisition. Signal processing techniques including compressed sensing (CS) and deep learning (DL) were used to recover the original images [9,10]. Zhang et al. proposed a DL-based method to reconstruct the OCT images. A peak signal-to-noise ratio (PSNR) of 24.66 dB and 23.40 dB were achieved at a data compression ratio (DCR) of 2 and 3, respectively [10]. Another related method is to exploit the redundancy in the spatial scanning patterns [11,12]. Considering that the conventional OCT relies on faster scanning to acquire images, Lebed et al. proposed to randomly decimate the A-lines in both fast and slow scanning directions to reduce the data bandwidth. Subsequently, they employed a CS-based algorithm to reconstruct the three-dimensional (3D) OCT volumes by promoting the sparsity in the wavelet domain [11]. A DCR of 1.89 was achieved to render a visually satisfactory reconstruction result. Recently, Li et al. proposed a DL-based multi-scale reconstruction algorithm for spectrally truncated and spatially digitally down-sampled OCT data [13]. An improved reconstruction quality are showcased with a higher DCR (maximally at 4) in conjunction with a super-resolution factor of 4.

In summary, the existing methods mostly achieve a reduced data bandwidth by adopting a “sub-sampling and reconstruction” paradigm: the object interferograms are first spectrally or spatially sub-sampled, and the OCT images are later reconstructed by advanced signal processing techniques. However, most of current proposals are focused on improving the performance of the reconstruction algorithms, while minimal efforts are invested in optimizing the sub-sampling patterns. Instead, empirical patterns such as regular down-sampling, random down-sampling, and truncation are mostly used, which might lead to a sub-optimal performance.

To address this issue, we proposed a highly compressive SS-OCT with learnable spectral–spatial sub-sampling. Unlike previous works, the spectral–spatial sub-sampling pattern is obtained by being jointly optimized along with the reconstruction algorithms in an end-to-end manner. To validate the effectiveness of the proposal, we retrospectively applied the proposed method on an ex vivo human coronary dataset. The proposed method achieved a PSNR of 23.8 dB by admitting merely 1.6% of the original data, which represents an order of magnitude higher compression than that of the state-of-the-art (SOTA).

The schematic diagram of the proposed system is illustrated in Fig. 1(a). The sample under inspection is first imaged by an SS-OCT system to form the analog interreference fringe $I(k,n]$, where $k$ represents for the continuous wavenumber and $n$ denotes the index of the lateral scanning point. We then used a data acquisition (DAQ) board to digitize and sub-sample $I(k, n]$ in both the spectral and spatial domain,

(1)$$\begin{aligned} \tilde{I}[l_i,n_j] & = I[l, n] \cdot M[l, n]\\ & = \underbrace{\underbrace{\left[I(k, n] \cdot \sum_{l}{\delta(k-l\Delta k)}\right]}_{\textrm{digitization}}\cdot \sum_{(i, j) \in T}{\delta[l - i, n -j]}}_{\textrm{spectral-{-}spatial sub-sampling}}, \end{aligned}$$

where $l$ is the index of the spectral sampling point, $\Delta k$ is the spectral sampling interval, $M$ is the binary spectral–spatial sub-sampling pattern, $T$ is an unknown subset of $\mathbb {N}^2$, and $i$, $j$ indicates the sub-sampling location in spectral and spatial domain, respectively. Here, $\tilde {I}[l_i, n_j]$, which is at a much smaller size than the original $I[l,n]$, is later transferred to a host computer. The OCT image $i[l,n]$ could then be fully recovered by using the proposed algorithm, which consists of a fringe inpainting network followed by an image enhancing network.

Fig. 1. (a) Schematic diagram of the proposed compressive SS-OCT system. (b) The framework of the proposed network. Three neural networks (mask generator, fringe inpainter, and image enhancer) are trained in an end-to-end manner.

Download Full Size | PDF

To simultaneously obtain the optimal sub-sampling pattern $M[l,n]$ and the reconstruction algorithm, we proposed an end-to-end training strategy as illustrated in Fig. 1(b). The entire framework consists of three trainable modules: a pattern generator (neural network 1, NN1), a fringe inpainting module (neural network 2, NN2), and an image enhancement module (neural network 3, NN3). The network input is the fully sampled fringe $I[l,n]$, and the network output is the estimated OCT image $\tilde {i}[l,n]$. The corresponding ground truth is the OCT image $i[l,n]$, which is obtained through conventional inverse discrete Fourier transform (IDFT)-based processing procedures. During the network training, a two-dimensional binary pattern $M[l,n]$ is first generated by NN1 from a pre-generated random noise input. Then $M[l,n]$ is used to guide the subsequent spectral–spatial sub-sampling of the OCT fringes: as explained in Eq. (1), only the samples $I[l_i,n_j]$ whose corresponding pattern entries $M[i,j]=1$ will be preserved, while all the others will be discarded. The sub-sampled data $\tilde {I}[l_i, n_j]$ is then inpainted by NN2 to obtain an estimation $\tilde {I}[l, n]$, after which IDFT is performed to convert the fringes to the image domain. Lastly, NN3 is used to produce a refined image $\tilde {i}[l, n]$.

It should be noted that all three networks of the proposed system are developed on the basis of similar U-Nets [14]. However, special modifications have been made on NN1. Since the intended output of NN1 is a binary matrix $M[l,n]$, it is non-differentiable and could cause problems for backpropagation in the training stage. Therefore, we apply the Gumbel Softmax [15] trick to obtain a differentiable normalized distribution to approximate its behavior,

(2)$$\mathcal{R}_c^p=\frac{\exp \left(\left(\log \left(D_c^p\right)+g_c^p\right) / \tau\right)}{\sum_{k=0}^1 \exp \left(\left(\log \left(D_k^p\right)+g_k^p\right) / \tau\right)},$$

where $D$ and $\mathcal {R}$ are the output of NN1 and Gumbel Softmax, respectively. We chose the first channel of $\mathcal {R}$ as $M[l, n]$. Here $g$ is independent and identically sampled from a Gumbel (0, 1) distribution, $\tau$ is a temperature value controlling the distribution density, $p$ and $c$ are the pixel index and the channel index, respectively. During the training phase, parameter $g$ is a random variable similar to noise to ensure the convergence. It is worth noting that the parameter is fixed during the testing so that a predetermined pattern could be loaded onto the DAQ board.

The loss function used in training is given by

(3)$$\mathcal{L} = \mathcal{L}_{\mathrm{image}}+\lambda_1 \cdot \mathcal{L}_{\mathrm{fringe}} + \lambda_2 \cdot \mathcal{L}_{\mathrm{mask}},$$

where $\mathcal {L}_{\textrm {image}}$ computes the $l_1$ norm of the difference between the refined image and ground truth image, $\mathcal {L}_{\textrm {fringe}}$ computes the $l_2$ norm of discrepancy between the estimated fringes and the original fringes, and the data compression ratio ($\mathcal {L}_{\textrm {mask}}= \sum _{l,n}{M[l, n]}$), are combined to regularize the behavior of the algorithm. By backpropagating the loss functions, we could update all three NNs to strike the best balance between the image reconstruction quality and the data compression ratio.

The proposed framework was implemented by using Python 3.8.0, and PyTorch 1.8.0. The same de-identified human coronary OCT dataset as described previously in Li et al.’s work [13] was used to train the entire framework. Specifically, the specimens were imaged via a commercial OCT system (Thorlabs Ganymede, Newton, NJ) and the dataset consists of 3784 OCT images (B-scans) from 17 specimens. We selected 3682 images, 102 images as training set and validation set. Moreover, OCT volumes taken by the same device of two human coronaries, two human fingers, and one onion are added to the testing set to further illustrate the generalizability of the proposed technique. All networks have four downsampling (and corresponding upsampling) blocks, with the initial feature count after the input layer set to 32 features. The down blocks have a leaky rectified linear unit (ReLU) (0.2 negative slope) nonlinearity, and the up blocks have a ReLU nonlinearity. All blocks use batch normalization and have two convolutional layers each. The NN1 has one channel for the input and two for the output (${\mathcal {R}_1}[l,n] + {\mathcal {R}_2}[l,n] = 1$). The NN2 has one channel for the input and one for the output. The NN3 has two channels for the input and one for the output. We employed a batch size of one, and the initialized learning rate was $1 \times {10^{{\rm {\ }\hbox{-}{\rm \ }}3}}$. We trained our network for 30 epochs, using the AdamW optimizer with the momentum of (0.9, 0.999), and then used the cosine decay strategy to decrease the learning rate. The weight of fringe loss $\lambda _1$ was $1 \times {10^{{\rm {\ }\hbox{-}{\rm \ }}6}}$. The testing configuration is similar to Fig. 1(a) except that the sub-sampling procedure is performed retrospectively and is not implemented on hardware. It is worth noting that NN1 is not used during the inference: the only purpose of NN1 is to find the optimal sub-sampling pattern during the training stage. Once the optimal pattern is found, it will be fixed and used repeatedly during the testing regardless of the input.

We first evaluated the performance of the proposed framework. Specifically, PSNR and SSIM are used to quantitatively measure the system’s performance at different DCRs by tuning the weighting coefficient $\lambda _2$. The results are listed in Table 1. It is clear that the image quality experiences steady degradation with more aggressive compression ratios; the same trends are observed when the testing set deviates from the coronary samples to the finger and the onion. However, the original images could still be recovered even when a large amount of raw data is missing: a PSNR of 24.2 dB could be achieved when merely 1.6% of the data is used. Three exemplary reconstructed OCT images (one coronary, one finger, and one onion) obtained by using 1.6%, 3.6%, 10%, 25%, and 50% of the original data along with the corresponding ground-truth images are plotted in Fig. 2(a). It is clear that the overall structure information is distinguishable even for the highest DCR (62.5) except for the onion, while most textural information is well preserved for high DCRs such as 27.8. Three exemplary sub-sampling patterns at different DCRs are also provided in Fig. 2(b) for reference.

Fig. 2. (a) Reconstruction results obtained by the proposed framework at variable sub-sampling rates along with the ground-truth images (GT). From top to bottom: human coronary, human finger, and onion. The regions enclosed by the colored boxes are magnified for better visualization. (b) Three exemplary binary sub-sampling patterns (1.6%, 10%, and 50%) are illustrated.

Download Full Size | PDF

Table 1. Performance of the Proposed Framework on Different Samples at Variable Sub-Sampling Rates (SSRs) Evaluated by PSNR and Structural Similarity (SSIM)^a

View Table | View all tables in this article

To further show the effectiveness of the proposed method, four ablation studies have been conducted and the corresponding results are listed in Table 2. We first carried out a comparative study by replacing the learnable mask by other empirical fixed masking patterns such as (1) random spectral sub-sampling (referred as “random masking”), and (2) spectral truncation from the center (referred as “central masking”). It should be noted that we retrained and fine tuned the entire network for each different mask pattern. The corresponding results for different DCRs are given in Fig. 3. The proposed learned pattern performs the best among all mask patterns at all different DCRs. Especially for the very high DCRs (1%), only the proposed method managed to reconstruct the image with a reasonable PSNR over 21 dB. In contrast, the central mask performs relatively well at low DCR (50%), in which the corresponding PSNR is almost 1.8 dB higher than that of the random masking and 2.2 dB lower than the proposed one. However, its performance starts to underperform when DCR increases (such as 1% and 5%); streak-like artifacts are presented as previously reported by Mousavi et al. [16]. On the other hand, the reconstruction results obtained by using random masking simply degrades with the increase of DCR.

Fig. 3. Comparison of the reconstructed images by using different sub-sampling patterns at different sub-sampling rates including 1%, 5%, 25%, and 50%.

Download Full Size | PDF

Table 2. Ablation Study of Selected Modules in the Proposed Framework by Comparing the PSNR and SSIM^a

View Table | View all tables in this article

Secondly, we verified the utility of the IDFT module. We remove the IDFT module and retrain the network. However, the network fails to generate correct output and the PSNR is not changing with DCR as shown in Fig. 4. We suspect that the direct learning of the cross-domain mapping between spectral interferograms and spatial images is relatively difficult, while an explicit Fourier transform is necessary [17].

Fig. 4. Quantitative evaluation of the effect of removing the IDFT module (blue), the fringe inpainting network (orange), and the image enhancement network (green), respectively.

Download Full Size | PDF

Thirdly, we examine the effect of the fringe inpainting module. Specifically, we repeat the experiment by removing NN2 along with fringe loss and the corresponding results are plotted in Fig. 4. The reconstruction quality of the model becomes instable at high DCR without NN2, while an increase of 0.5 dB in PSNR is observed at low DCR when NN2 is presented.

Finally, we tested the usefulness of the image enhancement module. The corresponding result is also given in Fig. 4. The network can neither converge nor produce stable OCT reconstructions without image enhancement module. Moreover, due to the removal of NN3, when we adjust the weight $\lambda _2$, the final sub-sampling rate is always less than 10%. We suspect the reason might because of the image loss is too large in this case and even becomes irrelevant with the DCR.

While the current study is performed retrospectively on an existing dataset, the proposed method could be experimentally implemented. The two-dimensional learned mask could be loaded onto certain data acquisition boards (e.g., ATS 9373, AlazarTech, Canada) to facilitate the sub-sampling procedure. Specifically, a “skip bitmap” function is provided by the software API (ATS-SDK V7.1.4 (C++), AlazarTech, Canada) to precisely control the sub-sampling procedure [8].

In summary, we demonstrated a highly compressive SS-OCT system which could achieve SOTA performance in terms of DCR. A maximum DCR of 62.5 was achieved with an acceptable image quality, while a typical DCR of 20 would lead to a satisfactory image quality: both are an order of magnitude higher than current SOTA. The major innovation of the proposed method is the introduction of the learnable spectral–spatial sub-sampling pattern within the framework, which for the first time makes the sub-sampling pattern an optimizable variable. Moreover, we proposed a novel end-to-end training paradigm to jointly optimize the learnable pattern and the reconstruction algorithm. The utility of the learnable pattern, inpainting network, and the fringe loss are all verified via ablation studies. We believe the proposed system might be a possible solution to address the data issue in SS-OCT, and would be of interest to the OCT community.

Funding

National Natural Science Foundation of China (61905141); Open Research Fund Program of the State Key Laboratory of Low-Dimensional Quantum Physics (KF202107).

Disclosures

The authors declare no conflicts of interest.

Data availability

Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request.

REFERENCES

1. S. H. Yun, G. J. Tearney, B. J. Vakoc, M. Shishkov, W. Y. Oh, A. E. Desjardins, M. J. Suter, R. C. Chan, J. A. Evans, I.-K. Jang, N. S. Nishioka, J. F. de Boer, and B. E. Bouma, Nat. Med. 12, 1429 (2006). [CrossRef]

2. Y. Ling, X. Yao, and C. P. Hendon, Biomed. Opt. Express 8, 3687 (2017). [CrossRef]

3. T. Klein and R. Huber, Biomed. Opt. Express 8, 828 (2017). [CrossRef]

4. I. Laíns, J. C. Wang, Y. Cui, R. Katz, F. Vingopoulos, G. Staurenghi, D. G. Vavvas, J. W. Miller, and J. B. Miller, Prog. Retinal Eye Res. 84, 100951 (2021). [CrossRef]

5. I. Abd El-Sadek, A. Miyazawa, L. Tzu-Wei Shen, S. Makita, S. Fukuda, T. Yamashita, Y. Oka, P. Mukherjee, S. Matsusaka, T. Oshika, H. Kano, and Y. Yasuno, Biomed. Opt. Express 11, 6231 (2020). [CrossRef]

6. Y. Li, S. Moon, J. J. Chen, Z. Zhu, and Z. Chen, Light: Sci. Appl. 9, 58 (2020). [CrossRef]

7. M. Okano and C. Chong, Opt. Express 28, 23898 (2020). [CrossRef]

8. Y. Ling, W. Meiniel, R. Singh-Moon, E. Angelini, J.-C. Olivo-Marin, and C. P. Hendon, Opt. Express 27, 855 (2019). [CrossRef]

9. X. A. Liu and J. U. Kang, Opt. Express 18, 22010 (2010). [CrossRef]

10. Y. Zhang, T. Liu, M. Singh, E. Çetintas, Y. Luo, Y. Rivenson, K. V. Larin, and A. Ozcan, Light: Sci. Appl. 10, 155 (2021). [CrossRef]

11. E. Lebed, P. J. Mackenzie, M. V. Sarunic, and M. F. Beg, Opt. Express 18, 21003 (2010). [CrossRef]

12. J. Wang, E. J. Chaney, E. Aksamitiene, M. Marjanovic, and S. A. Boppart, J. Phys. D: Appl. Phys. 54, 294005 (2021). [CrossRef]

13. X. Li, S. Cao, H. Liu, X. Yao, B. C. Brott, S. H. Litovsky, X. Song, Y. Ling, and Y. Gan, IEEE Trans. Biomed. Eng. 69, 3667 (2022). [CrossRef]

14. O. Ronneberger, P. Fischer, and T. Brox, ‘U-net: Convolutional networks for biomedical image segmentation,’ (Springer International Publishing, 2015), Medical Image Computing and Computer-Assisted Intervention - MICCAI 2015, pp. 234–241.

15. E. Jang, S. Gu, and B. Poole, “Categorical Reparameterization with Gumbel-Softmax,” arXiv, arXiv:1611.01144 (2016). [CrossRef]

16. M. Mousavi, L. Duan, T. Javidi, and A. K. Ellerbee Bowden, Opt. Express 24, 1781 (2016). [CrossRef]

17. Z. Dong, C. Xu, Y. Ling, Y. Li, and Y. Su, Opt. Lett. 48, 759 (2023). [CrossRef]

SSR	Weight	Coronary		Finger		Onion
(DCR)	$λ_{2}$	PSNR	SSIM	PSNR	SSIM	PSNR	SSIM
1.6% (62.5)	100	24.2	0.27	24.0	0.25	22.9	0.30
3.6% (27.8)	50	24.6	0.29	24.3	0.28	23.8	0.32
10% (10)	12	24.9	0.31	24.6	0.29	24.7	0.33
25% (4)	1.5	25.7	0.41	24.8	0.32	25.8	0.40
50% (2)	0.5	28.1	0.66	26.1	0.53	28.0	0.61

	Learnable Mask	Fringe Inpainting	IDFT	Image Enhancement	PSNR (dB)	SSIM
Proposed	✓	✓	✓	✓	24.9	0.31
w/ random masking		✓	✓	✓	24.5	0.29
w/ central masking		✓	✓	✓	23.4	0.17
w/o fringe inpainting	✓		✓	✓	23.2	0.20
w/o IDFT	✓	✓		✓	21.0	0.22
w/o image enhancement	✓	✓	✓		22.7	0.23

SSR	Weight	Coronary		Finger		Onion
(DCR)	$λ_{2}$	PSNR	SSIM	PSNR	SSIM	PSNR	SSIM
1.6% (62.5)	100	24.2	0.27	24.0	0.25	22.9	0.30
3.6% (27.8)	50	24.6	0.29	24.3	0.28	23.8	0.32
10% (10)	12	24.9	0.31	24.6	0.29	24.7	0.33
25% (4)	1.5	25.7	0.41	24.8	0.32	25.8	0.40
50% (2)	0.5	28.1	0.66	26.1	0.53	28.0	0.61

	Learnable Mask	Fringe Inpainting	IDFT	Image Enhancement	PSNR (dB)	SSIM
Proposed	✓	✓	✓	✓	24.9	0.31
w/ random masking		✓	✓	✓	24.5	0.29
w/ central masking		✓	✓	✓	23.4	0.17
w/o fringe inpainting	✓		✓	✓	23.2	0.20
w/o IDFT	✓	✓		✓	21.0	0.22
w/o image enhancement	✓	✓	✓		22.7	0.23

Deep learning empowered highly compressive SS-OCT via learnable spectral–spatial sub-sampling

Abstract

Funding

Disclosures

Data availability

REFERENCES

Data availability

Cited By

Figures (4)

Tables (2)

Equations (3)

Optics Letters