
Feature decoupled knowledge distillation enabled lightweight image transmission through multimode fibers


Abstract

Multimode fibers (MMF) show tremendous potential for transmitting high-capacity spatial information. However, the quality of multimode transmission is quite sensitive to the inherent scattering characteristics of MMF and to almost inevitable external perturbations. Previous research has shown that deep learning can break through this limitation, but the deep neural networks involved are intricately designed and carry huge computational complexity. In this study, we propose a novel feature decoupled knowledge distillation (KD) framework for lightweight image transmission through MMF. In this framework, the frequency-principle-inspired feature decoupled module significantly improves image transmission quality, and the lightweight student model can reach the performance of the sophisticated teacher model through KD. This work represents the first effort, to the best of our knowledge, that successfully applies a KD-based framework to image transmission through scattering media. Experimental results demonstrate that even with up to a 93.4% reduction in model computational complexity, we can still achieve an average Structural Similarity Index Measure (SSIM) of 0.76, 0.85, and 0.90 on Fashion-MNIST, EMNIST, and MNIST images, respectively, very close to the performance of the cumbersome teacher models. This work dramatically reduces the complexity of high-fidelity image transmission through MMF and holds broad prospects for applications in resource-constrained environments and hardware implementations.

© 2024 Optica Publishing Group under the terms of the Optica Open Access Publishing Agreement

1. Introduction

Optical multimode fibers (MMF) have gained significant attention in optical communications and spatial information transmission due to their remarkable transmission capacity [1,2]. Compared with single-mode fibers, MMF offer a noteworthy advantage in their abundant spatial modes, demonstrating great potential for transmitting high-resolution images. However, mode dispersion and mode coupling effects inside MMF degrade the original clear image into disordered speckle patterns at the distal end of the fiber, severely impeding the transmission quality of images through MMF [3].

Various traditional imaging methods have been proposed to deal with the distortion caused by MMF and realize image transmission or light focusing, including phase conjugation methods that compensate for image distortion [4–6] and calibration of the transmission matrix (TM) of the scattering media to control the light fields at the distal end [7–9]. Besides, wavefront shaping methods using iterative optimization have also been proposed to achieve specific modulation of the incident wave, thereby realizing imaging tasks at the distal end of MMF [10–12]. However, these imaging methods lack generalization and robustness: once the transmission characteristics of the scattering medium change, their performance degrades significantly. Unfortunately, the transmission characteristics of MMF are sensitive to external disturbances and environmental noise, which limits the imaging performance of traditional methods.

Deep learning methods have been widely used in imaging tasks through scattering media due to their powerful nonlinear fitting capabilities and excellent generalization performance. In [13–15], the successful transmission of MNIST digits, letters, or natural scenes through MMF via supervised deep learning is demonstrated. In [16], a semi-supervised learning approach is proposed to overcome the time-varying nature of the MMF. In [17], unsupervised full-color cellular image reconstruction through disordered optical fiber is demonstrated. Through years of research efforts, deep learning methods have enabled high-fidelity transmission of complex spatial information through MMF, making MMF widely used in disease diagnosis and medical endoscopy [18].

However, to learn the high variability and randomness inside MMF, the above neural networks are intricately designed with huge spatial and computational complexity. How to dramatically compress the deep networks without degrading the performance of image transmission through MMF, while meeting the demands of resource-constrained deployments such as edge computing, remains a formidable challenge.

Knowledge distillation (KD) is an efficient method of compressing large deep learning models. Unlike network compression methods such as pruning [19], quantization [20], or low-rank decomposition [21], KD transfers the knowledge of a cumbersome model to a lightweight model, thereby dramatically improving the performance of the lightweight model, which can even reach the performance of the cumbersome model [22–24]. Compared with the cumbersome teacher model, the lightweight student model has smaller spatial and computational complexity, making it more suitable for inference tasks in resource-constrained environments and compatible with hardware implementations [25,26].

In this paper, a novel feature decoupled knowledge distillation framework is proposed for lightweight image transmission through MMF. KD makes it possible to dramatically reduce the computational complexity of the model without degrading the quality of image transmission. Besides, a frequency-principle-inspired feature decoupled module, designed to highlight the detailed information of the transmitted image, is proposed to further improve the quality of image transmission. This work represents the first effort, to the best of our knowledge, that successfully applies a KD-based framework to image transmission through scattering media. Experimental results demonstrate that even with the model computational complexity reduced by up to 93.4%, the lightweight student model, namely a simple FC network, can still achieve an average Structural Similarity Index Measure (SSIM) of 0.76, 0.85, and 0.90 on Fashion-MNIST, EMNIST, and MNIST images, respectively, very close to the performance of the cumbersome teacher model, namely the CNN-based model. Moreover, the improvement in the convergence speed of the student model brought by KD, and the value of the image reconstruction algorithm for downstream information post-processing tasks, are also discussed in depth, providing an insightful perspective on the application of KD to image transmission through scattering media such as MMF.

2. Principles

2.1 Frequency-principle-inspired feature decoupled module

Empirically speaking, when we enter a new environment, we initially retain mostly rudimentary outlines; only with prolonged exposure do we begin to remember finer details. Research indicates that artificial neural networks exhibit a similar phenomenon known as the frequency principle: during training, neural networks tend to first fit the low-frequency components of the target signal before gradually incorporating higher-frequency details [27,28]. This phenomenon also explains why, when neural networks are used for image transmission, the target patterns reconstructed from speckle patterns that have passed through scattering media often appear blurry, losing edges and fine details (the high-frequency components of the image). This severely constrains the quality of image transmission.

To enhance the transmission quality of edge and fine-detail information in images, we propose a frequency-principle-inspired feature decoupled module (Fig. 1). This module decomposes the image into two components: the main body and the edge information. Specifically, for an original image $Im{g_{ori}}$, a blur operation is performed with a Gaussian filter to obtain the main body $Im{g_{body}}$:

$$Im{g_{body}} = Im{g_{ori}} \ast \frac{1}{{2\pi {\sigma ^2}}}\textrm{exp} \left[ { - \frac{{{x^2} + {y^2}}}{{2{\sigma^2}}}} \right]$$
where $\sigma $ is the standard deviation of the Gaussian distribution and ${\ast} $ represents the convolution operation. The original image size is $28 \times 28$. A kernel size of $3 \times 3$ and $\sigma = 1$ were chosen to maintain relatively clear image information after the feature decoupled module. The edge information $Im{g_{edge}}$ is then obtained as the difference between $Im{g_{ori}}$ and $Im{g_{body}}$. To quantitatively assess the impact of removing the edge information on image quality, the correlation coefficient and the SSIM between the main body and the original images are calculated:
$$Correlation(x,y) = \frac{{{\sigma _{xy}}}}{{{\sigma _x}{\sigma _y}}}$$
$$SSIM(x,y) = \frac{{(2{\mu _x}{\mu _y} + {c_1})(2{\sigma _{xy}} + {c_2})}}{{(\mu _x^2 + \mu _y^2 + {c_1})(\sigma _x^2 + \sigma _y^2 + {c_2})}}$$
where ${\mu _x}$ and ${\mu _y}$ represent the mean value of the image, ${\sigma _x}$ and ${\sigma _y}$ represent the standard deviation of the image, ${\sigma _{xy}}$ represents the covariance. ${c_1}$ and ${c_2}$ are two constant terms that are used to prevent the denominator from being zero.
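
For concreteness, the following is a minimal NumPy/SciPy sketch of the decoupling step and the two metrics above, assuming 28×28 grayscale images normalized to [0, 1]; the function names and the truncate setting are illustrative, not taken from the paper.

```python
import numpy as np
from scipy.ndimage import gaussian_filter
from skimage.metrics import structural_similarity

def decouple(img_ori, sigma=1.0, truncate=1.0):
    # Gaussian blur extracts the low-frequency main body; with sigma=1,
    # truncate=1.0 gives a kernel radius of 1, i.e., a 3x3 window.
    img_body = gaussian_filter(img_ori, sigma=sigma, truncate=truncate)
    # The residual carries the high-frequency edge information (Eq. 1).
    img_edge = img_ori - img_body
    return img_body, img_edge

def correlation(x, y):
    # Pearson correlation coefficient between two images (Eq. 2).
    return np.corrcoef(x.ravel(), y.ravel())[0, 1]

img_ori = np.random.rand(28, 28)  # placeholder for an original image
body, edge = decouple(img_ori)
print(correlation(body, img_ori))
print(structural_similarity(body, img_ori, data_range=1.0))  # SSIM (Eq. 3)
```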

Fig. 1. Schematic diagram of the proposed feature decoupled knowledge distillation framework for lightweight image transmission through MMF.

Figure 2 illustrates the results of the feature decoupling operation on different original images. As expected, the main body retains the principal information of the original images, while the edge information highlights the detailed information, which is crucial for enhancing image transmission quality. Furthermore, calculating the correlation coefficient and SSIM between the main body and the original images shows that, even though the edge information is lost, the main body and the original images still maintain a high correlation coefficient, albeit with a significant decrease in SSIM. The SSIM comprehensively considers the luminance, contrast, and structure of images. Therefore, compared to the correlation coefficient, SSIM better reflects the quality of images and will be used as the primary evaluation metric for image reconstruction in our experiments.

Fig. 2. Results of the frequency-principle-inspired feature decoupled module. The correlation coefficient and SSIM between the main body and the original images are indicated in the bottom row.

2.2 Novel feature decoupled knowledge distillation framework

In this section, we will describe the network architecture of teacher and student models within the KD framework in detail. Furthermore, we will investigate training strategies for both teacher and student models to elucidate how knowledge distillation achieves efficient and accurate knowledge transfer.

2.2.1 Network architecture of teacher and student models

Various neural networks have been proposed to achieve high-fidelity image transmission through MMF [13–16]. In this paper, we design a CNN-based image reconstruction algorithm as the teacher model to approximate the mapping from the disordered speckle patterns at the distal end of the fiber to the original clear images. The network architectures of both T_CNN and T_CNN_Decoupled (without and with the feature decoupled module) are shown in Table 1. They share common convolutional feature extraction layers, differing only in the final fully connected decoding layer. Because the output content of these two models differs, their training strategies also differ. For T_CNN, the output pattern (denoted as $T\_Pre{d_{ori}}$) should closely approximate the original clear images (denoted as $G{T_{ori}}$). Therefore, the MSE between $T\_Pre{d_{ori}}$ and $G{T_{ori}}$ is used as the loss function of T_CNN:

$${\mathrm{{\cal L}}_{T\_CNN}} = \mathrm{{\cal L}}(T\_Pre{d_{ori}},\textrm{ }G{T_{ori}})$$

For T_CNN_Decoupled, the two output patterns (denoted as $T\_Pre{d_{body}}$ and $T\_Pre{d_{edge}}$) should respectively approximate the main body and the edge information of the original target patterns (denoted as $G{T_{body}}$ and $G{T_{edge}}$). Additionally, to ensure that the two output patterns of T_CNN_Decoupled can be combined to form the expected reconstructed image (also denoted as $T\_Pre{d_{ori}}$), we further constrain $T\_Pre{d_{ori}}$ to closely approximate the original clear images ($G{T_{ori}}$). Consequently, the loss function consists of three components:

$$\begin{aligned} {\mathrm{{\cal L}}_{T\_CNN\_Decoupled}} &= \alpha \cdot \mathrm{{\cal L}}(T\_Pre{d_{body}},\textrm{ }G{T_{body}}) + \beta \cdot \mathrm{{\cal L}}(T\_Pre{d_{edge}},\textrm{ }G{T_{edge}})\\ &\textrm{ } + (1 - \alpha - \beta ) \cdot \mathrm{{\cal L}}(T\_Pre{d_{body}} + T\_Pre{d_{edge}},\textrm{ }G{T_{ori}}) \end{aligned}$$
where $\mathrm{{\cal L}}$ denotes the MSE metric, and $\alpha$ and $\beta$ are the weight coefficients that balance the three components of the loss function.
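
As a reference implementation of Eq. (5), a minimal PyTorch sketch of this composite loss is given below; the weight values are placeholders rather than the settings used in the paper.

```python
import torch.nn.functional as F

def teacher_decoupled_loss(pred_body, pred_edge, gt_body, gt_edge, gt_ori,
                           alpha=0.4, beta=0.4):
    loss_body = F.mse_loss(pred_body, gt_body)  # main-body term
    loss_edge = F.mse_loss(pred_edge, gt_edge)  # edge term
    # The recombined output should also match the original image.
    loss_ori = F.mse_loss(pred_body + pred_edge, gt_ori)
    return alpha * loss_body + beta * loss_edge + (1 - alpha - beta) * loss_ori
```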

Table 1. Architecture of the teacher models (w/ and w/o feature decoupled module)

In the case of the student models, to highlight the model compression effect brought about by KD and to minimize model complexity, a simple fully connected (FC) network is chosen as the student model. The input speckle images are flattened and then passed through a single fully connected layer followed by a sigmoid function to obtain the output images. The training strategies for S_FC and S_FC_Decoupled (without and with the feature decoupled module) are identical to those of the aforementioned teacher models. A comprehensive comparative analysis of the spatial and computational complexities of the teacher and student models is given in the Experimental Results section.
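
A minimal PyTorch sketch of such a student model is shown below, assuming the 64×64 speckle input and 28×28 output used in this work; the class and argument names are illustrative.

```python
import torch
import torch.nn as nn

class StudentFC(nn.Module):
    def __init__(self, in_hw=64, out_hw=28):
        super().__init__()
        self.out_hw = out_hw
        # a single fully connected decoding layer
        self.fc = nn.Linear(in_hw * in_hw, out_hw * out_hw)

    def forward(self, x):
        # x: (N, 1, 64, 64) down-sampled speckle patterns
        y = torch.sigmoid(self.fc(x.flatten(1)))
        return y.view(-1, 1, self.out_hw, self.out_hw)
```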

2.2.2 Training strategy of KD framework

In recent years, deep learning has flourished due to the powerful fitting ability of neural networks and the massive data available in the era of big data. However, accurate feature extraction from a large amount of training data often requires a sophisticated and large-scale neural network, which severely restricts model lightness and real-time performance. Knowledge distillation (KD), as an efficient model compression algorithm, aims to transfer knowledge from a well-trained but cumbersome large model (the teacher model) to a lightweight model (the student model) [22–24]. By employing appropriate training strategies, the student model can achieve satisfactory performance with fewer computing resources.

KD was initially proposed for classification tasks and then found widespread applications in regression tasks as well. In classification tasks, the student model not only learns from labels (called hard targets) but also benefits from the relative probabilities provided by the teacher model for each class (called soft targets), which greatly enhances the model’s generalization capability. The training strategy of the KD framework in classification tasks is given by:

$${\mathrm{{\cal L}}_{KD\_Cls}} = \alpha \cdot {\mathrm{{\cal L}}_{KL}}({q_T}||{q_S}) + (1 - \alpha ) \cdot {\mathrm{{\cal L}}_{CE}}({q_S},label)$$
where $\alpha$ represents the weight coefficient, and ${q_T}$ and ${q_S}$ represent the probability distributions of the teacher model and the student model, respectively. ${\mathrm{{\cal L}}_{KL}}$ and ${\mathrm{{\cal L}}_{CE}}$ are the Kullback-Leibler (KL) divergence and cross-entropy (CE) loss, respectively.
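
For reference, a minimal PyTorch sketch of Eq. (6) is given below; temperature softening of the logits, common in practice, is omitted here to match the equation as written. Note that the image transmission task in this paper uses the regression form described next.

```python
import torch.nn.functional as F

def kd_cls_loss(student_logits, teacher_logits, labels, alpha=0.5):
    q_s_log = F.log_softmax(student_logits, dim=1)
    q_t = F.softmax(teacher_logits, dim=1)
    kl = F.kl_div(q_s_log, q_t, reduction="batchmean")  # L_KL(q_T || q_S)
    ce = F.cross_entropy(student_logits, labels)        # hard-target term
    return alpha * kl + (1 - alpha) * ce
```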

In regression tasks, by adopting appropriate training strategies, the student model can also effectively learn valuable information from the teacher model, thereby enhancing algorithmic performance. Common training strategies in regression tasks are expressed as follows [29]:

$${\mathrm{{\cal L}}_{KD\_Reg\_min}} = min\{{\mathrm{{\cal L}}({p_S},{p_T}),\textrm{ }\mathrm{{\cal L}}({p_S},{p_{GT}})} \}$$
$${\mathrm{{\cal L}}_{KD\_Reg\_weight}} = \alpha \cdot \mathrm{{\cal L}}({p_S},{p_{GT}}) + (1 - \alpha ) \cdot \mathrm{{\cal L}}({p_S},{p_T})$$
where $\alpha$ represents the weight coefficient, and ${p_T}$, ${p_S}$, and ${p_{GT}}$ represent the teacher outputs, student outputs, and ground truth, respectively. Equation (7) minimizes whichever is smaller: the loss between ${p_S}$ and ${p_T}$ or the loss between ${p_S}$ and ${p_{GT}}$. Equation (8) combines the two by weighted summation, similar to the strategy in Eq. (6). Through these strategies, the teacher model provides additional guidance that helps the student model avoid local optima, and the loss between ${p_S}$ and ${p_T}$ can serve as a regularization term to mitigate overfitting.

However, these strategies are effective only when the teacher model performs exceptionally well. If the teacher model performs poorly, it can mislead the student model during the learning process. Therefore, we must carefully limit the guidance provided by the teacher model [29]. Formally, the distillation loss adopted in our work is defined as follows:

$${\mathrm{{\cal L}}_{KD\_Reg}} = \alpha \cdot \mathrm{{\cal L}}({p_S},{p_{GT}}) + (1 - \alpha ) \cdot {\mathrm{{\cal L}}_{Imit}}$$
$${\mathrm{{\cal L}}_{Imit}} = \left\{ {\begin{array}{cc} {\mathrm{{\cal L}}({p_S},{p_T}),}&{\mathrm{{\cal L}}({p_T},{p_{GT}}) < \mathrm{{\cal L}}({p_S},{p_{GT}})}\\ {0,}&{otherwise} \end{array}} \right.$$

Compared with Eq. (8), ${\mathrm{{\cal L}}_{Imit}}$ only comes into effect when the teacher model outperforms the student model, thus avoiding potential misguidance from the teacher model under certain circumstances. It is worth noting that weight coefficients can be manually assigned or treated as hyperparameters for learning [30]. In our experiments, for the sake of convenience, we manually set the weight coefficient to 0.5. In practice, we found that reasonable weight coefficients (e.g., within the range of 0.3 to 0.7) exhibited minimal impact on the final algorithmic performance.
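
A minimal PyTorch sketch of Eqs. (9) and (10), with MSE as $\mathrm{{\cal L}}$, is given below. Gating the imitation term per batch is an implementation assumption, since the granularity of the teacher-student comparison is not specified in the text.

```python
import torch
import torch.nn.functional as F

def kd_reg_loss(p_s, p_t, p_gt, alpha=0.5):
    loss_s_gt = F.mse_loss(p_s, p_gt)  # student vs. ground truth
    loss_t_gt = F.mse_loss(p_t, p_gt)  # teacher vs. ground truth
    # L_Imit is active only when the teacher currently outperforms the student.
    if loss_t_gt < loss_s_gt:
        loss_imit = F.mse_loss(p_s, p_t.detach())  # teacher output is frozen
    else:
        loss_imit = torch.zeros((), device=p_s.device)
    return alpha * loss_s_gt + (1 - alpha) * loss_imit
```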

3. Experimental setup

The experimental setup of the image transmission system through MMF is shown in Fig. 3. The laser beam (488 nm, OBIS) is collimated, expanded, and projected onto a spatial light modulator (SLM, PLUTO-2.1-VIS-014). The SLM dynamically modulates the wavefront of the incident laser beam, enabling the display of arbitrary patterns. The modulated laser beam is then coupled into an MMF (Si, 0.22 NA, $\phi = 50\mu m$, 0.5 m) by a pair of microscope objectives (40x, 0.65 NA, Olympus). At the distal end of the fiber, the output speckle patterns are imaged on a CCD camera (MER2-630-60U3C).

Fig. 3. Experimental setup of the MMF image transmission system. HWP: half wave plate; PL: polarizer; M1, M2: mirrors; L1: f = 25 mm lens; L2, L3, L4: f = 250 mm lenses; OBJ1, OBJ2: 40x microscope objectives; SLM: spatial light modulator; MMF: multimode fiber.

In subsequent experiments, target patterns from the MNIST, EMNIST, and Fashion-MNIST datasets were individually transmitted and combined with their corresponding speckle patterns to assemble datasets for the feature-decoupled knowledge distillation framework. Each dataset variant comprised 20,000 samples, partitioned into training and validation sets at a ratio of 4:1. The implementation utilized Python version 3.8.17, executed on a laptop equipped with an Intel Core i9-13900HX CPU @ 2.20 GHz, 64GB of RAM, and running the Microsoft Windows 11 operating system. Comprehensive experimental outcomes will be delineated in the subsequent section.
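
For illustration, a minimal PyTorch sketch of assembling the paired data and performing the 4:1 split follows; the random tensors are placeholders for the captured speckle patterns and their target images.

```python
import torch
from torch.utils.data import TensorDataset, random_split

speckles = torch.rand(20000, 1, 64, 64)  # placeholder speckle patterns
targets = torch.rand(20000, 1, 28, 28)   # placeholder target images
dataset = TensorDataset(speckles, targets)
train_set, val_set = random_split(dataset, [16000, 4000])  # 4:1 ratio
```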

4. Experimental results

In this section, we first analyze the performance improvement brought by the feature decoupled module. Subsequently, we demonstrate the performance of the feature decoupled KD on the Fashion-MNIST, EMNIST, and MNIST datasets. The dramatic reduction in spatial and computational complexity brought by KD is also emphasized. Next, the influence of down-sampling the speckle images at the receiver and of the training data size for the student model is investigated to further demonstrate the rationality and superiority of the proposed KD framework. Finally, the value of image reconstruction for downstream visual tasks is discussed.

4.1 Performance of teacher models with feature decoupled module

In our experiments, we carefully designed a CNN-based architecture as the teacher model, as shown in Table 1. Additionally, for comparison with the student FC model (discussed in Section 2.2.1) in the KD framework, we conducted experiments on an FC network trained from scratch (which can also be regarded as a teacher model). The performance of the different teacher models with and without the feature decoupled module on the Fashion-MNIST dataset is illustrated in Fig. 4.

Fig. 4. Performance of different teacher models with and without the feature decoupled module on the validation dataset of Fashion-MNIST. 'T' represents the teacher model, 'FC' and 'CNN' represent FC-based and CNN-based methods, respectively, and 'Decoupled' indicates the model incorporating the feature decoupled module.

Compared to the simple T_FC, the sophisticated T_CNN exhibits stronger feature extraction capabilities due to its more intricate network design. Consequently, under the same training conditions, T_CNN achieves better performance than T_FC. Furthermore, when the feature decoupled module is incorporated into either T_CNN or T_FC, significant performance improvements are observed. This enhancement is attributed to the feature decoupled module emphasizing the edge information of images, enabling the networks to learn fine details more effectively and thus enhance the quality of image transmission through MMF. The teacher models with the feature decoupled module achieve similar results on the EMNIST and MNIST datasets, as shown in Table 2, which demonstrates the broad applicability of the proposed feature decoupled module.

Table 2. Performance comparison of different teacher models and student models for various datasets

Although cumbersome CNN-based methods can achieve excellent image transmission performance, their network complexity and computational overhead are substantial. These limitations hinder their application under resource-constrained conditions. If KD can significantly enhance the performance of a simpler student model, it would enable lightweight, high-fidelity image transmission, as demonstrated in the next section.

4.2 Performance of feature decoupled knowledge distillation

The experimental results in the previous section validate the effectiveness of the feature decoupled module. However, to minimize model complexity, we employ the FC-based method without the feature decoupled module as the student model. This choice further highlights the efficiency of the KD framework. The performance of the teacher models (CNN-based and FC-based) with the feature decoupled module and of the student model (FC-based) trained on T_CNN_Decoupled within the KD framework is illustrated in Fig. 5.

Fig. 5. Performance of different teacher and student models on the validation dataset of Fashion-MNIST. 'T' and 'S' represent the teacher and student models, respectively; 'FC', 'CNN', and 'Decoupled' have the same meanings as in Fig. 4.

Figure 5 demonstrates that through KD, even a simple student model can acquire knowledge from the cumbersome teacher model, resulting in a dramatic enhancement of the student model's performance. During the initial stage of KD training, the student model S_FC simultaneously receives guidance from both the ground truths and the well-trained teacher model. This poses a greater challenge for learning, and S_FC does not yet outperform the teacher model T_FC. However, as training progresses, the advantages of KD become evident, and the performance of S_FC without the feature decoupled module gradually surpasses that of the teacher model T_FC with the feature decoupled module, ultimately achieving a remarkable improvement of nearly 8% in SSIM. Similar performance improvements of S_FC are achieved on the EMNIST and MNIST datasets, as shown in Table 2, which verifies the broad applicability of the proposed KD framework.

Figure 6 illustrates examples of images reconstructed by different teacher and student models, and Table 2 provides a comprehensive comparison of the different teacher and student models on the Fashion-MNIST, EMNIST, and MNIST datasets in terms of the MSE, correlation coefficient, and SSIM metrics. Overall, the sophisticated T_CNN achieves excellent reconstruction performance, while the simple T_FC performs relatively poorly. However, through the KD framework, the student model S_FC can learn from the teacher model T_CNN_Decoupled, resulting in a dramatic performance improvement. It is noteworthy that while the overall performance of S_FC may slightly lag behind that of T_CNN_Decoupled, S_FC can still outperform the teacher model on specific subsets of the data. This is attributed to the inherent uncertainties in neural network training and the efficiency of knowledge distillation working in tandem. The effectiveness of the feature decoupled module is also verified.

Fig. 6. Examples of images reconstructed by different teacher and student models. The SSIM between the reconstructed and target images is indicated in the upper right corners.

The comparison of the spatial and computational complexities of the teacher and student models is shown in Table 3 (input dimension: (1, 64, 64)), which demonstrates that the parameters and computational complexity (FLOPs) of the student model S_FC are only 16.2% and 6.6% of those of the teacher model T_CNN_Decoupled, respectively. In general, simpler student models perform significantly worse than sophisticated teacher models. However, the proposed KD framework dramatically improves the performance of the student model, thereby greatly reducing algorithm complexity while preserving excellent performance.
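
Such a comparison can be reproduced along the following lines; the thop package used here for FLOPs counting is an assumption, not necessarily the tool used in the paper, and the one-layer model stands in for S_FC.

```python
import torch
import torch.nn as nn
from thop import profile  # third-party FLOPs counter: pip install thop

student = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64, 28 * 28), nn.Sigmoid())
n_params = sum(p.numel() for p in student.parameters())
flops, _ = profile(student, inputs=(torch.rand(1, 1, 64, 64),))
print(f"params: {n_params:,}, FLOPs: {flops:,.0f}")
```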

Table 3. Comparison of the complexity between the teacher models and student models

4.3 Influence of the down-sampled speckle images at the receiver

In image transmission tasks, speckle images are typically down-sampled before being fed into the image reconstruction algorithm to reduce the algorithmic complexity. However, down-sampling inevitably leads to the loss of effective information, which restricts the quality of image transmission. We utilized the EMNIST dataset for data transmission and performed different degrees of down-sampling on the captured speckle images to achieve the balance between algorithm complexity and image transmission quality. The teacher model (T_CNN_Decoupled) is used for image reconstruction and the image transmission performance corresponding to different speckle image sizes is illustrated in Fig. 7.

Fig. 7. Influence of the down-sampled speckle images on image transmission performance. The upper rows provide examples of the reconstructed images corresponding to different down-samplings of the same original speckle image. The speckle images are scaled to the same size for display only; their actual sizes are indicated in the second row.

Figure 7 shows that when the speckle images are down-sampled to a small size, most of the useful information is lost, resulting in poor image transmission performance. As the size of the speckle images increases, the transmission performance improves with the increase in effective information. However, once the information provided by the speckle images is sufficient for image reconstruction, additional information no longer brings performance improvement. In fact, too much information poses a greater challenge to accurate learning by the neural networks, leading to a slight decrease in performance. In our experiments, the optimal size of the down-sampled speckle images is $64 \times 64$, which explains the rationale for the input dimension designed in Table 1.
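
As a small illustration of the preprocessing step studied here, a captured speckle frame might be down-sampled as follows; OpenCV and area interpolation are assumptions rather than the paper's stated pipeline.

```python
import cv2

def downsample(speckle, size=64):
    # INTER_AREA is generally preferred for shrinking images.
    return cv2.resize(speckle, (size, size), interpolation=cv2.INTER_AREA)
```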

4.4 Influence of training data size for student model in KD

In the KD framework, sophisticated teacher models are usually trained on large amounts of data to achieve satisfactory performance. However, for student models, whether the entire dataset is necessary for training, or whether satisfactory performance can be achieved with less training data, is a question worth exploring. In our experiments, two identical FC-based neural networks are trained for EMNIST dataset transmission: one from scratch, and one with KD based on the T_CNN_Decoupled model. The performance comparison between these two models as the amount of training data varies is illustrated in Fig. 8.

Fig. 8. Performance of the same FC-based network trained from scratch and with KD using different amounts of training data.

Figure 8 shows that, with the guidance of the teacher model, the student model requires only a small amount of training data to achieve satisfactory performance, approximately 8k to 10k training images (only about half of the entire training dataset). At this point, the student model trained from scratch has not fully converged, while the performance of the student model with KD tends to saturate and shows significant improvement over the model trained from scratch. This suggests that once the teacher model is well pre-trained, the training cost for subsequent student models is significantly reduced, further enhancing the practicality and convenience of the proposed KD framework.

4.5 Value of image reconstruction for downstream visual tasks

In practical applications, high-quality image transmission through scattering media is often not the ultimate goal but rather a means to achieve excellent performance in downstream information processing tasks, such as medical image segmentation and tumor cell classification. In our experiments, we conducted an image classification task based on the reconstructed images to demonstrate the value of high-fidelity image transmission for downstream visual tasks. The classic LeNet network [31] is used to perform classification on the original target images, the reconstructed images, and the speckle images of the MNIST dataset, respectively. The reconstructed images are obtained through our proposed feature decoupled KD framework. The classification results based on the different datasets are illustrated in Fig. 9.

Fig. 9. Performance of classification tasks based on target images, reconstructed images, and speckle images of the MNIST dataset. Examples of these three kinds of images are provided in (c).

The detailed confusion matrices of the classification tasks based on these three kinds of images are shown in Fig. 10. Figures 9 and 10 show that the classification accuracy based on the reconstructed images is very close to that based on the clear target images, with an accuracy difference of less than 2%, and is 10% higher than that based on the speckle images, which indicates the superiority of reconstructed images over speckle images for downstream visual tasks.

Fig. 10. Confusion matrices of classification tasks based on target images, reconstructed images, and speckle images of the MNIST dataset.

To intuitively explain why reconstructed images are more suitable than speckle images for downstream classification tasks, the t-SNE algorithm [32] was utilized to embed target images, reconstructed images, and speckle images into two dimensions respectively and cluster them according to their corresponding digit categories, as shown in Fig. 11.
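
A minimal scikit-learn sketch of this embedding step follows; the sample count and perplexity are illustrative.

```python
import numpy as np
from sklearn.manifold import TSNE

images = np.random.rand(2000, 28 * 28)  # flattened images (placeholder)
embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(images)
```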

Fig. 11. Visualization of different kinds of images with t-SNE and clustering results according to their corresponding digit categories (0–9). (a) Target images. (b) Reconstructed images. (c) Speckle images.

Compared to the clear target images, the separability of the speckle images is poorer, with overlap between the clusters of different categories. This is because the original information becomes blurred and disordered after passing through the scattering medium. The separability of the reconstructed images obtained through our proposed KD-based algorithm is close to that of the target images. To quantitatively measure this separability, we calculate the intra-cluster and inter-cluster distances of these three clustering results, as shown in Table 4.

Table 4. Comparison of intra-cluster and inter-cluster distances of the clustering results in Fig. 11

The clustering results of the reconstructed images have smaller intra-cluster distances and larger inter-cluster distances, similar to those of the original images. This greater separability explains why their corresponding classification accuracy is higher. These results reflect the efficiency of our proposed feature decoupled KD algorithm and illustrate the value of image reconstruction for downstream visual tasks.
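
Since the exact distance definitions behind Table 4 are not specified, the sketch below assumes the mean distance of samples to their cluster centroid (intra-cluster) and the mean pairwise centroid distance (inter-cluster).

```python
import numpy as np
from scipy.spatial.distance import pdist

def cluster_distances(emb, labels):
    classes = np.unique(labels)
    centroids = np.stack([emb[labels == k].mean(axis=0) for k in classes])
    intra = np.mean([np.linalg.norm(emb[labels == k] - c, axis=1).mean()
                     for k, c in zip(classes, centroids)])
    inter = pdist(centroids).mean()  # mean centroid-to-centroid distance
    return intra, inter
```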

5. Conclusion

In summary, to dramatically reduce the computational complexity of the model without degrading the quality of image transmission through scattering media, a novel feature decoupled knowledge distillation framework is proposed for lightweight, high-fidelity image transmission. The frequency-principle-inspired feature decoupled module is designed to highlight the detailed information of the transmitted image and further improve the quality of image transmission. Experimental results demonstrate that even with the model computational complexity reduced by up to 93.4%, the lightweight student model can still achieve an average SSIM of 0.76, 0.85, and 0.90 on Fashion-MNIST, EMNIST, and MNIST images, respectively, very close to the performance of the sophisticated teacher model. This work represents the first effort, to the best of our knowledge, that successfully applies a KD-based framework to MMF image transmission. The proposed algorithm is also applicable to information transmission through other scattering media and has broad prospects for applications in resource-constrained inference environments and hardware implementations.

Funding

National Key Research and Development Program of China (2022YFB2802803); National Natural Science Foundation of China (61925104, 62031011, 62201157).

Disclosures

The authors declare no conflicts of interest.

Data availability

Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request.

References

1. D. J. Richardson, J. M. Fini, and L. E. Nelson, "Space-division multiplexing in optical fibres," Nat. Photon. 7(5), 354–362 (2013).

2. T. Fukui, Y. Kohno, R. Tang, et al., "Single-pixel imaging using multimode fiber and silicon photonic phased array," J. Lightwave Technol. 39(3), 839–844 (2021).

3. S. Mumtaz, R.-J. Essiambre, and G. P. Agrawal, "Nonlinear propagation in multimode and multicore fibers: generalization of the Manakov equations," J. Lightwave Technol. 31(3), 398–406 (2013).

4. A. Yariv, "On transmission and recovery of three-dimensional image information in optical waveguides," J. Opt. Soc. Am. 66(4), 301 (1976).

5. I. N. Papadopoulos, S. Farahi, C. Moser, et al., "Focusing and scanning light through a multimode optical fiber using digital phase conjugation," Opt. Express 20(10), 10583–10590 (2012).

6. M. Azimipour, F. Atry, and R. Pashaie, "Calibration of digital optical phase conjugation setups based on orthonormal rectangular polynomials," Appl. Opt. 55(11), 2873–2880 (2016).

7. S. M. Popoff, G. Lerosey, R. Carminati, et al., "Measuring the transmission matrix in optics: an approach to the study and control of light propagation in disordered media," Phys. Rev. Lett. 104(10), 100601 (2010).

8. S. M. Popoff, G. Lerosey, M. Fink, et al., "Controlling light through optical disordered media: transmission matrix approach," New J. Phys. 13(12), 123021 (2011).

9. S. Popoff, G. Lerosey, M. Fink, et al., "Image transmission through an opaque material," Nat. Commun. 1(1), 81 (2010).

10. T. Zhao, L. Deng, W. Wang, et al., "Bayes' theorem-based binary algorithm for fast reference-less calibration of a multimode fiber," Opt. Express 26(16), 20368–20378 (2018).

11. R. N. Mahalati, D. Askarov, J. P. Wilde, et al., "Adaptive control of input field to achieve desired output intensity profile in multimode fiber with random mode coupling," Opt. Express 20(13), 14321–14337 (2012).

12. X. Zhou, J. Shi, N. Chi, et al., "Wavefront shaping for multi-user line-of-sight and non-line-of-sight visible light communication," Opt. Express 31(16), 25359 (2023).

13. N. Borhani, E. Kakkava, C. Moser, et al., "Learning to see through multimode fibers," Optica 5(8), 960 (2018).

14. B. Rahmani, D. Loterie, G. Konstantinou, et al., "Multimode optical fiber transmission with a deep learning network," Light Sci. Appl. 7(1), 69 (2018).

15. P. Caramazza, O. Moran, R. Murray-Smith, et al., "Transmission of natural scene images through a multimode fibre," Nat. Commun. 10(1), 2029 (2019).

16. P. Fan, M. Ruddlesden, Y. Wang, et al., "Learning enabled continuous transmission of spatially distributed information through multimode fibers," Laser Photonics Rev. 15(4), 2000348 (2021).

17. X. Hu, J. Zhao, J. E. Antonio-Lopez, et al., "Unsupervised full-color cellular image reconstruction through disordered optical fiber," Light Sci. Appl. 12(1), 125 (2023).

18. Z. Wen, Z. Dong, Q. Deng, et al., "Single multimode fibre for in vivo light-field-encoded endoscopic imaging," Nat. Photon. 17(8), 679–687 (2023).

19. Y. Guo, A. Yao, and Y. Chen, "Dynamic network surgery for efficient DNNs," in 30th International Conference on Neural Information Processing Systems (Curran Associates Inc., 2016), pp. 1387–1395.

20. M. Courbariaux, Y. Bengio, and J.-P. David, "BinaryConnect: training deep neural networks with binary weights during propagations," in 28th International Conference on Neural Information Processing Systems - Volume 2 (MIT Press, 2015), pp. 3123–3131.

21. C. Tai, T. Xiao, Y. Zhang, et al., "Convolutional neural networks with low-rank regularization" (2016).

22. J. Gou, B. Yu, S. J. Maybank, et al., "Knowledge distillation: a survey," Int. J. Comput. Vis. 129(6), 1789–1819 (2021).

23. G. Hinton, O. Vinyals, and J. Dean, "Distilling the knowledge in a neural network" (2015).

24. J. Yim, D. Joo, J. Bae, et al., "A gift from knowledge distillation: fast optimization, network minimization and transfer learning," in Conference on Computer Vision and Pattern Recognition (IEEE, 2017), pp. 7130–7138.

25. T. Fang, J. Li, X. Zhang, et al., "Classification accuracy improvement of the optical diffractive deep neural network by employing a knowledge distillation and stochastic gradient descent β-Lasso joint training framework," Opt. Express 29(26), 44264–44274 (2021).

26. J. Xiang, Y. Cheng, S. Chen, et al., "Knowledge distillation technique enabled hardware efficient OSNR monitoring from directly detected PDM-QAM signals," J. Opt. Commun. Netw. 14(11), 916 (2022).

27. Z.-Q. J. Xu, Y. Zhang, T. Luo, et al., "Frequency principle: Fourier analysis sheds light on deep neural networks," Commun. Comput. Phys. 28(5), 1746–1767 (2020).

28. Z. J. Xu and H. Zhou, "Deep frequency principle towards understanding why deeper learning is faster," in AAAI Conference on Artificial Intelligence (2021), 35, pp. 10541–10550.

29. M. R. U. Saputra, P. Gusmao, Y. Almalioglu, et al., "Distilling knowledge from a deep pose regressor network," in IEEE/CVF International Conference on Computer Vision (IEEE, 2019), pp. 263–272.

30. R. Cipolla, Y. Gal, and A. Kendall, "Multi-task learning using uncertainty to weigh losses for scene geometry and semantics," in IEEE/CVF Conference on Computer Vision and Pattern Recognition (IEEE, 2018), pp. 7482–7491.

31. Y. Lecun, L. Bottou, Y. Bengio, et al., "Gradient-based learning applied to document recognition," Proc. IEEE 86(11), 2278–2324 (1998).

32. L. van der Maaten and G. Hinton, "Visualizing data using t-SNE," J. Mach. Learn. Res. 9(86), 2579–2605 (2008).
