
Modulation format recognition in a UVLC system based on an ultra-lightweight model with communication-informed knowledge distillation


Abstract

Modulation format recognition (MFR) is a key technology for adaptive optical systems, but it faces significant challenges in underwater visible light communication (UVLC) due to the complex channel environment. Recent advances in deep learning have enabled remarkable achievements in image recognition, owing to the powerful feature extraction of neural networks (NN). However, the high computational complexity of NN limits their practicality in UVLC systems. This paper proposes a communication-informed knowledge distillation (CIKD) method that achieves high-precision and low-latency MFR with an ultra-lightweight student model. The student model consists of only one linear dense layer under a communication-informed auxiliary system and is trained under the guidance of a high-complexity, high-precision teacher model. The MFR task involves eight modulation formats: PAM4, QPSK, 8QAM-CIR, 8QAM-DIA, 16QAM, 16APSK, 32QAM, and 32APSK. Experimental results show that the student model based on CIKD can achieve accuracy comparable to the teacher model. After knowledge transfer, the prediction accuracy of the student model increases by up to 87%. Notably, CIKD's inference accuracy can reach 100%. Moreover, the student model in CIKD has merely 18% of the parameters of the teacher model, which facilitates the hardware deployment and online data processing of MFR algorithms in UVLC systems.

© 2024 Optica Publishing Group under the terms of the Optica Open Access Publishing Agreement

1. Introduction

As an emerging technology, Visible Light Communication (VLC) primarily operates within the visible light frequency band, specifically at wavelengths between 380 and 780 nm [1,2]. Given its vast spectral resources, VLC offers numerous advantages including low cost, environmental friendliness, high confidentiality, and resistance to electromagnetic interference, positioning it as a pivotal technology for future communications [3]. In underwater communication, blue and green light [4,5] exhibit the least attenuation compared to other optical signals, making UVLC systems that employ blue/green LEDs as transmitters the optimal choice for reliable, high-speed, long-distance underwater wireless optical communications.

Modulation format recognition (MFR) is a crucial task in underwater visible light communication (UVLC) [6]. It enables the receiver to identify the modulation format of the transmitted signal and demodulate it correctly; otherwise, the receiver may use an inappropriate demodulation method and fail to recover the data [7]. MFR also enables the UVLC system to dynamically change the modulation scheme, optimize the data transmission rate, and improve communication reliability under changing underwater conditions, which is crucial to maintaining the efficiency and stability of underwater communication links. Unlike optical fiber communication systems, UVLC systems face greater challenges in MFR because of huge differences in propagation media, signal attenuation, environmental interference, and transmission distance. While fiber-optic systems transmit signals through transparent optical fibers [8–10], UVLC must deal with signal distortion caused by absorption, scattering, and turbulence in the water. Underwater systems therefore require modulation technology adapted to their unique environment, which makes it difficult for the receiver to accurately identify the modulation format. Various algorithms have been proposed to address MFR tasks. They can be classified into two main categories: likelihood-based and feature-based recognition methods [11,12]. Likelihood-based methods generally have high computational complexity and are not suitable for real-time recognition. Feature-based methods train a model beforehand and do not need prior information during communication. However, designing a model that is efficient, accurate, and lightweight remains an important research topic in MFR [13].

Deep learning is a key technology of artificial intelligence that can effectively extract hidden patterns from raw data [14–17]. Various neural network (NN) algorithms have been applied in UVLC. For example, Ref. [18] proposed a convolutional neural network (CNN)-based model that uses short-time Fourier transformed spectral image data as input and achieves high-precision MFR with a small-scale dataset. Ref. [19] developed a model that combines a bidirectional recurrent neural network (BiRNN) and a recurrently connected CNN to extract spatial features of the received signal and perform high-precision MFR. However, high-accuracy NN often entail deep architectures and large numbers of trainable parameters, which increase training and inference time and may not be compatible with the computational resource constraints of communication systems.

In this paper, we propose an ultra-lightweight model based on communication-informed knowledge distillation (CIKD) to address the above problems. Our algorithm consists of two modules: intermediate layer feature similarity preservation [20] and representation vector decoupling [21]. These modules ensure the effective transfer of knowledge from the teacher model to the student model. We also introduce an auxiliary system based on prior communication-system knowledge, which filters out fuzzy signal areas with strong nonlinearity and thereby enhances data clarity [22]; in effect, the auxiliary system performs data compression. In the UVLC system, we use eight modulation formats: QPSK, PAM4, 8QAM-CIR, 8QAM-DIA, 16QAM, 16APSK, 32QAM, and 32APSK. It is worth noting that, at the same modulation order, the constellations of APSK and QAM signals are similar in shape; to make recognition more difficult, we mix QAM and APSK signals together. At the receiver, we map the collected I/Q branch signals into the rectangular coordinate system and obtain a constellation image dataset. To explore the relationship between MFR and LED driving voltage for different modulation formats, we collect data at twelve operating points from 100 mV to 1200 mV (sampling step of 100 mV). We experimentally evaluate the performance of our algorithm in the UVLC system. We propose a student model with a single linear output layer under the auxiliary system, which has only 18% of the teacher model's parameters. Using knowledge distillation, the student model achieves 100% accuracy at an LED driving voltage of 800 mV, and compared with the student model without distillation, the accuracy of the distilled student model improves by up to 87%. The ultra-lightweight structure, short inference time, and high accuracy demonstrate the strong advantages of CIKD for hardware deployment and real-time inference in MFR tasks.

2. Principles

2.1 Preparation of modulated format datasets

During a communication process, the system uses a fixed modulation format and LED driving voltage value. Denote the complex symbols of the m-th modulation format's codebook as ${s_m} = \{ {s_{m,i}}\} _{i = 1}^n$, where n is the order of the modulation format. The average power value ${\bar{W}_m}$ is then given by:

$${\bar{W}_m} = \sqrt {\sum\limits_{i = 1}^n {\frac{1}{n}({{({\mathop{\rm Im}\nolimits} ({s_{m,i}}))}^2} + {{(\textrm{Re} ({s_{m,i}}))}^2})} }$$

Let ${x_m} = \{ {x_{m,j}}\} _{j = 1}^N$ be the data received in a real system, where N is the length of the received signal. Average power normalization is a necessary step; the normalized result ${\bar{x}_{m,j}}$ is given by:

$${\mathop{\rm Im}\nolimits} ({\bar{x}_{m,j}}) = \frac{{{\mathop{\rm Im}\nolimits} ({x_{m,j}})}}{{{{\bar{W}}_m}}},\textrm{Re} ({\bar{x}_{m,j}}) = \frac{{\textrm{Re} ({x_{m,j}})}}{{{{\bar{W}}_m}}}$$

Denote the set of received signals normalized by average power as ${\bar{X}_m}$. We use images of constellation points as individual entries in our datasets. Each ${\bar{x}_{m,j}}$ is mapped to the plane rectangular coordinate system, where the real part of ${\bar{x}_{m,j}}$ corresponds to the horizontal axis and the imaginary part to the vertical axis. To obtain more image data quickly, we randomly select a fixed number of complex symbols from ${\bar{X}_m}$ to form each image, yielding the set of image samples ${p_m} = \{ {p_{m,k}}\} _{k = 1}^H$, where H is the number of images for modulation format m. Then, for the eight modulation formats under a specific operating voltage, the complete set of image data $\bar{X}$ is:

$$\bar{X} = \bigcup\limits_{m = 1}^{8} {\bigcup\limits_{k = 1}^{H} {p_{m,k}} }$$

In the experiment, we adjust the driving voltage from 100 mV to 1200 mV with a step size of 100 mV, resulting in twelve operating points. Figure 1 displays images at three operating points: 100 mV, 600 mV, and 1200 mV.

Fig. 1. The constellation diagrams at the receiving end in the (a) noise working area, (b) optimal working area, and (c) nonlinear working area for different modulation formats.

In Fig. 1(a), 100 mV is in the noise working area, where the signal-to-noise ratio (SNR) is low and communication performance is mainly affected by random noise. In Fig. 1(b), 600 mV is in the optimal working area, where the SNR is high, the nonlinear effect is weak, and signal distortion is minimal. In Fig. 1(c), 1200 mV is in the nonlinear working area, where nonlinearity is the main cause of signal distortion. Depending on the operating point, the modulation-format datasets vary, denoted by $D = \{ {D_V}\} _{V = 1}^{12}$. It is worth noting that we randomly select a fixed number of symbols from the collected symbol set to map into each image. The final input images are therefore blurred and more difficult to classify accurately than those presented in Fig. 1 (these image data are not shown in this paper because the few symbol points per image are difficult to see visually). In dataset ${D_V}$, we use the image data as features and the modulation formats corresponding to the images as labels. The model is trained and tested on each dataset.
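The dataset-construction steps above can be summarized in a short Python sketch (a minimal example for illustration; the per-image symbol count and the plotting style are our assumptions, while the 150 × 150 RGB image size follows Section 4.5):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

def constellation_image(codebook, received, n_points, out_path):
    """Build one constellation-image sample following Eqs. (1)-(3).

    codebook : ideal complex symbols s_m of the modulation format
    received : complex symbols x_m collected at the receiver
    n_points : number of randomly drawn symbols per image (assumed)
    """
    # Eq. (1): average power value of the ideal codebook.
    w_bar = np.sqrt(np.mean(codebook.real ** 2 + codebook.imag ** 2))
    # Eq. (2): normalize the received I/Q components by w_bar.
    x_bar = received / w_bar
    # Randomly select a fixed number of symbols for this sample.
    sel = x_bar[np.random.choice(len(x_bar), n_points, replace=False)]
    # Map Re -> horizontal axis, Im -> vertical axis, rasterize to
    # a 150x150-pixel image (the input size quoted in Section 4.5).
    fig = plt.figure(figsize=(1.5, 1.5), dpi=100)
    plt.scatter(sel.real, sel.imag, s=1)
    plt.axis("off")
    fig.savefig(out_path)
    plt.close(fig)
```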

2.2 Deep learning models

2.2.1 Teacher model and student model

Neural networks (NN) establish a theoretical mapping between image data and modulation format labels in MFR tasks. Convolutional neural networks (CNN) excel in image processing tasks due to their ability to extract high-level semantic information and their invariance to image transformations [23]. The teacher model uses a CNN layer as the backbone to extract deep features, while the Neck and Head consist of a Flatten layer and a Dense layer, respectively. The backbone network of the student model is the communication-informed auxiliary system; the student model shares the same Neck and Head structure as the teacher model. After knowledge is transferred from the teacher model to the student model via distillation, the distilled model uses the communication-informed module as the feature extractor and a linear fully connected layer as the classifier to predict modulation format categories during the inference stage. Accordingly, this algorithm belongs to the feature-based class of modulation format recognition methods. Table 1 shows all details.

Table 1. Parameters of the proposed teacher model and student model
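The architectures of Table 1 can be sketched in PyTorch from the layer sizes quoted in Section 4.5 (our reconstruction, not the authors' code; the bias-free student head is an assumption chosen to match the 540,000-parameter count reported there, and `CircularMask` is the auxiliary block sketched in Section 2.2.2):

```python
import torch
import torch.nn as nn

class TeacherModel(nn.Module):
    """Backbone: 64 3x3 conv kernels, stride 2, 'same' padding;
    Neck: Flatten; Head: linear layer over 8 classes (Section 4.5)."""
    def __init__(self, num_classes=8):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1),  # -> (64, 75, 75)
            nn.ReLU(),
        )
        self.neck = nn.Flatten()                      # -> 360,000 features
        self.head = nn.Linear(64 * 75 * 75, num_classes)

    def forward(self, x):
        feats = self.backbone(x)
        return self.head(self.neck(feats)), feats    # logits, feature map

class StudentModel(nn.Module):
    """Backbone: the communication-informed auxiliary system (one
    trainable parameter r); Head: a single linear layer, no activation."""
    def __init__(self, mask_module, num_classes=8):
        super().__init__()
        self.backbone = mask_module
        self.neck = nn.Flatten()                      # -> 67,500 features
        self.head = nn.Linear(150 * 150 * 3, num_classes, bias=False)

    def forward(self, x):
        feats = self.backbone(x)
        return self.head(self.neck(feats)), feats
```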

2.2.2 Communication-informed auxiliary system

Optical signals are severely attenuated underwater, so LED-based UVLC systems need a high drive bias to improve the SNR, which introduces nonlinear distortion into the received signals. The higher the LED driving voltage, the stronger the nonlinear effect on the received signals; furthermore, under the same driving voltage, the nonlinearity increases with the signal amplitude [22]. Based on this communication prior knowledge, we use the auxiliary system to mask areas with large nonlinear distortion and retain areas with small amplitude. This is realized through the interaction between the original image and a generated mask: pixels in the masked area become zero. The original input images are dense matrices; after masking, they contain many zero-pixel areas and are thus transformed into sparse data. In other words, our auxiliary system performs data compression. The working principle of this auxiliary system is described below.

For the input image $p \in {R^{(h,w,3)}}$, let $({x_{center}},{y_{center}})$ be the coordinates of the image center point, calculated as:

$${x_{center}} = \left\lfloor {\frac{h}{2}} \right\rfloor ,{y_{center}} = \left\lfloor {\frac{w}{2}} \right\rfloor$$

To preserve the shape of the original image data and more effectively mask constellation points affected by strong nonlinear effects, we use a circular mask $p_{mask}^{gray}$ with radius r. The mask keeps the original value of pixels within the circle and sets pixels outside the circle to black. First, we generate a single-channel circular mask: denote the value of an arbitrary pixel in the grayscale mask as a, and let ${d_a}$ be the distance from that pixel $({x_a},{y_a})$ to the center point:

$${d_a} = \sqrt {{{({x_a} - {x_{center}})}^2} + {{({y_a} - {y_{center}})}^2}} $$

We draw a circle centered at the center point with r as the coverage range. The final pixel value depends on the relationship between r and ${d_a}$:

$$a = \begin{cases} 1, & d_a \le r\\ 0, & d_a > r \end{cases}$$

We copy the single-channel mask three times and stack the copies along the channel dimension to form the three-channel mask ${p_{mask}}$:

$${p_{mask}} = \begin{pmatrix} ({a_{11}},{a_{11}},{a_{11}}) & \cdots & ({a_{1w}},{a_{1w}},{a_{1w}})\\ \vdots & \ddots & \vdots \\ ({a_{h1}},{a_{h1}},{a_{h1}}) & \cdots & ({a_{hw}},{a_{hw}},{a_{hw}}) \end{pmatrix}$$

The next step is the element-wise (pixel-level) multiplication of the mask with the original image. Let $(b_{ij}^R,b_{ij}^G,b_{ij}^B)$ be the three color components in row i and column j of the original image data matrix. After element-wise multiplication and normalization, the pixel vector becomes $(c_{ij}^R,c_{ij}^G,c_{ij}^B)$:

$$c_{ij}^{R/G/B} = \frac{1}{{255}}{a_{ij}} \times b_{ij}^{R/G/B}$$

We denote the image obtained by multiplying the original image with the three-channel mask as $p^{\prime}$. To make the masked area better fit the data characteristics, we update r through NN backpropagation. However, since r does not directly participate in the loss function, the variable r has no gradient. To enable r to update automatically, we add Gaussian noise to $p^{\prime}$, which gives r a gradient. The added Gaussian noise has a mean of 0 and a variance of $\alpha r$ ($\alpha$ is the attenuation coefficient of the variance term, used mainly to limit the interference of the Gaussian noise). In the actual implementation, we generate a set of Gaussian noise samples $g = \{ {g_k}\} _{k = 1}^{h\mathrm{\cdot}w\mathrm{\cdot3}}$ with the Box-Muller algorithm [23]:

$${g_k} = \alpha r\cos (2\pi {u_k})\sqrt { - 2\ln u_k^\prime }$$

Here, $u = \{ {u_i}\} _{i = 1}^{h\mathrm{\cdot}w\mathrm{\cdot3}}$ and $u^{\prime} = \{ u_i^\prime \} _{i = 1}^{h\mathrm{\cdot}w\mathrm{\cdot3}}$ are two sequences drawn from the uniform distribution $U(0,1)$. The Gaussian noise above is a one-dimensional vector, which is reshaped into a noise image g of shape $(h,w,3)$. The pixel vector in row i and column j of the reshaped noise image is denoted $(g_{ij}^R,g_{ij}^G,g_{ij}^B)$.

Then, the final image $p^{\prime\prime}$ (whose pixel vector is $(e_{ij}^R,e_{ij}^G,e_{ij}^B)$) generated by the auxiliary system is obtained by adding $p^{\prime}$ and $g$:

$$e_{ij}^{R/G/B} = c_{ij}^{R/G/B} + g_{ij}^{R/G/B}$$

Each pixel vector of $p^{\prime\prime}$ participates in NN forward propagation and in the loss function calculation, which means that r has a gradient and can be updated automatically during backpropagation. Moreover, adding Gaussian noise can improve the performance of the MFR image classification model on the testing dataset by acting as data augmentation. Specifically, Gaussian noise can simulate stains or slight deletions on the image, forcing the model to extract more robust features during training and yielding a more resilient model [24]. MFR tasks typically use symbol sequences as input features, but this paper employs constellation images instead. With images, the Euclidean distance of each symbol to the center point does not need to be computed repeatedly: once the mask is generated, the data only require pixel-level multiplication, without further distance computation. This saves computing resources and accelerates the training process.
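The whole auxiliary block of Eqs. (4)–(10) can be collected into one PyTorch module (a sketch under our assumptions: pixel distances are normalized so that an r on the order of ${10^{-1}}$ is meaningful, and the noise is injected only in training mode):

```python
import torch
import torch.nn as nn

class CircularMask(nn.Module):
    """Keep pixels within a trainable radius r of the image center,
    zero the rest, then add Gaussian noise scaled by alpha*r so that
    r receives a gradient during backpropagation (Eqs. (4)-(10))."""
    def __init__(self, h=150, w=150, r0=0.4, alpha=1e-3):
        super().__init__()
        self.r = nn.Parameter(torch.tensor(r0))  # the only trainable parameter
        self.alpha = alpha
        # Eqs. (4)-(5): distance of every pixel to the center point,
        # normalized by the image half-width (assumed convention).
        ys = torch.arange(h, dtype=torch.float32).view(h, 1) - h // 2
        xs = torch.arange(w, dtype=torch.float32).view(1, w) - w // 2
        self.register_buffer("dist", torch.sqrt(xs ** 2 + ys ** 2) / (min(h, w) / 2))

    def forward(self, x):                      # x: (B, 3, H, W) in [0, 255]
        mask = (self.dist <= self.r).float()   # Eq. (6); hard mask, no gradient
        out = x * mask / 255.0                 # Eqs. (7)-(8): mask and normalize
        if self.training:
            # Eqs. (9)-(10): noise proportional to r couples r to the loss.
            out = out + self.alpha * self.r * torch.randn_like(out)
        return out
```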

2.2.3 Communication-informed knowledge distillation

In classical knowledge distillation (KD), knowledge is transferred in the form of softened class probability distributions [25]. In the training stage, we first feed the image dataset into the teacher model. After training the teacher model, we freeze all its parameters. When training the student model, we use the KD joint loss function $Los{s_{KD}}$:

$$Los{s_{KD}} = \alpha {L_{KL}}(\sigma (\frac{{{z_S}}}{T}),\sigma (\frac{{{z_T}}}{T})) + (1 - \alpha ){L_{CE}}(y,\sigma ({z_S}))$$
where ${L_{KL}}$ and ${L_{CE}}$ are the Kullback-Leibler (KL) divergence and the cross-entropy (CE) loss, respectively, y is the one-hot vector of the ground-truth class, ${z_S}$ and ${z_T}$ are the logits of the student and teacher networks, T is the distillation temperature, and $\alpha$ is a balance hyperparameter.
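Eq. (11) translates directly into code (a sketch; the values of T and $\alpha$ are placeholders since the paper does not report them, and common implementations additionally scale the KL term by T²):

```python
import torch.nn.functional as F

def loss_kd(z_s, z_t, y, T=4.0, alpha=0.5):
    """Classical KD joint loss, Eq. (11)."""
    soft = F.kl_div(F.log_softmax(z_s / T, dim=1),
                    F.softmax(z_t / T, dim=1),
                    reduction="batchmean")  # L_KL(sigma(z_S/T), sigma(z_T/T))
    hard = F.cross_entropy(z_s, y)          # L_CE(y, sigma(z_S))
    return alpha * soft + (1 - alpha) * hard
```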

The above distillation method struggles to close the performance gap between two models with a large difference in parameter count, so we use two KD modules: intermediate layer feature similarity preservation (ILFSP) and representation vector decoupling (RVD). These modules ensure the effective transfer of knowledge from the teacher model to the student model.

ILFSP is a KD module that utilizes hidden-layer feature vectors. During forward propagation, these vectors carry high-dimensional feature information, offering more flexibility than the predicted logits. The teacher model's intermediate activations are similar for semantically similar inputs and dissimilar for different ones [20]. This distillation module guides the student model to generate corresponding feature maps in its inner hidden layer.

In intermediate layer knowledge distillation (ILKD) based on ILFSP, we transform and normalize the feature map produced by the teacher model's Backbone to obtain ${G_T}$. We extract the student model's feature map from the output of the communication-informed auxiliary system block and apply the same operation to obtain ${G_S}$. We define the loss ${L_{SP}}$ for preserving the feature similarity of the inner hidden layer as:

$${L_{SP}}({G_T},{G_S}) = \frac{1}{{b{s^2}}}||{{G_T} - {G_S}} ||_F^2$$

Here, ${||\mathrm{\cdot} ||_F}$ is the Frobenius norm and $bs$ is the batch size during training. This formula computes the mean element-wise squared difference between ${G_T}$ and ${G_S}$. In the training process of the student model, the total loss is $Los{s_{ILKD}}$:

$$Los{s_{ILKD}} = \alpha {L_{SP}}({G_T},{G_S}) + (1 - \alpha ){L_{CE}}(y,\sigma ({z_S}))$$

As before, $\alpha$ is a balance hyperparameter.
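A sketch of ${L_{SP}}$ and $Los{s_{ILKD}}$ (Eqs. (12)–(13)), following the similarity-preserving formulation of [20] in which G is the row-normalized batch Gram matrix of the flattened features:

```python
import torch.nn.functional as F

def sp_loss(f_t, f_s):
    """Eq. (12): mean squared Frobenius distance between the
    normalized batch-similarity matrices G_T and G_S [20]."""
    bs = f_t.size(0)
    a_t, a_s = f_t.reshape(bs, -1), f_s.reshape(bs, -1)
    g_t = F.normalize(a_t @ a_t.t(), p=2, dim=1)  # (bs, bs), row-normalized
    g_s = F.normalize(a_s @ a_s.t(), p=2, dim=1)
    return ((g_t - g_s) ** 2).sum() / bs ** 2

def loss_ilkd(f_t, f_s, z_s, y, alpha=0.5):
    """Eq. (13): balance of the SP term and the hard CE term."""
    return alpha * sp_loss(f_t, f_s) + (1 - alpha) * F.cross_entropy(z_s, y)
```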

RVD is a KD module that transfers information from the logits vectors, which contain semantic information close to the final prediction. Conventional distillation methods treat target and non-target logits equally, limiting the flexibility of knowledge transfer. Non-target logits, or "dark knowledge" [21], are crucial for prediction accuracy. To exploit this, we adopt decoupled logits distillation [21]: target class knowledge distillation (TCKD) and non-target class knowledge distillation (NCKD). TCKD is a binary distillation of the target logits that indicates classification difficulty, while NCKD is a weighted distillation of the non-target logits, assigned a larger weight to ensure rich knowledge transfer to the student model.

In [21], ${p_t}$ is the probability of the target category, ${p_{\backslash t}}$ is the sum of the probabilities of all non-target categories, and $\hat{p}$ is the probability distribution vector with the target category removed. $b = [{p_t},{p_{\backslash t}}] \in {R^{1 \times 2}}$ represents the binary probabilities. TCKD is then the KL divergence between ${b^T}$ of the teacher model and ${b^S}$ of the student model, and NCKD is the KL divergence between ${\hat{p}^T}$ of the teacher model and ${\hat{p}^S}$ of the student model. The total knowledge distillation loss ${L_{Decouple}}$ is:

$${L_{Decouple}} = {L_{KL}}({b^T},{b^S}) + (1 - p_t^T){L_{KL}}({\hat{p}^T},{\hat{p}^S})$$
where $p_t^T$ is the probability of the target class predicted by the teacher model. The total loss is a linear combination of the KD loss and the CE loss.
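${L_{Decouple}}$ (Eq. (14)) can be sketched as follows, after the decoupled formulation of [21] (the temperature is fixed at 1 since the paper quotes none; the eps clamp and the −1000 logit mask are numerical-safety assumptions):

```python
import torch
import torch.nn.functional as F

def dkd_loss(z_s, z_t, y, eps=1e-8):
    """Eq. (14): TCKD on the binary target/non-target probabilities,
    plus NCKD on the non-target distribution weighted by (1 - p_t^T)."""
    gt = F.one_hot(y, z_s.size(1)).float()
    p_s, p_t = F.softmax(z_s, dim=1), F.softmax(z_t, dim=1)
    # Binary split b = [p_t, p_\t] for student and teacher.
    pt_s = (p_s * gt).sum(1, keepdim=True)
    pt_t = (p_t * gt).sum(1, keepdim=True)
    b_s = torch.cat([pt_s, 1 - pt_s], dim=1).clamp_min(eps)
    b_t = torch.cat([pt_t, 1 - pt_t], dim=1).clamp_min(eps)
    tckd = (b_t * (b_t.log() - b_s.log())).sum(1)      # L_KL(b^T, b^S)
    # Non-target distributions p-hat: mask the target logit out.
    log_hat_s = F.log_softmax(z_s - 1000.0 * gt, dim=1)
    hat_t = F.softmax(z_t - 1000.0 * gt, dim=1)
    nckd = (hat_t * (hat_t.clamp_min(eps).log() - log_hat_s)).sum(1)
    return (tckd + (1 - pt_t.squeeze(1)) * nckd).mean()
```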

We propose communication-informed knowledge distillation (CIKD), which combines the auxiliary system and two knowledge distillation patterns. One of the most important modules of CIKD is the communication-informed auxiliary system. This module obtains a clearer image by covering constellation points that are far away from the origin, so that the linear classifier can perform more accurate classification.

It is worth noting that the input to the model is two-dimensional RGB image data, because for image data the computational effort of generating the mask in the communication-informed block is lower. Figure 2(a) shows the training phase of CIKD. The student model and the teacher model take the training set with a batch size of 32 as input. After feature extraction by the teacher model's Backbone and the auxiliary system computation in the student model, we perform the inner feature distillation of Eq. (12) and record the loss term as ${L_1}$. Both models generate logits vectors at the last linear output layer. We denote the KD loss term for the decoupled representation vectors as ${L_2}$, and the CE loss term between the probability distribution vector of the student model and the ground truth's one-hot vectors as ${L_3}$. Since the magnitudes of the three loss values are inconsistent, we de-dimensionalize each loss term as follows:

$${\hat{L}_1} = {L_1}/\psi ([{L_1}/{L_3}]),{\hat{L}_2} = {L_2}/\psi ([{L_2}/{L_3}])$$
where $\psi ({\cdot} )$ is the function that blocks the gradient backpropagation, so that the de-dimensionalization operation does not affect ${L_1}$. $[{\cdot} ]$ is the rounding function, which finds the order of magnitude difference between the numerator and the denominator.
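In code, `detach()` plays the role of $\psi ({\cdot})$ in Eq. (15), blocking gradient flow through the scale factor (the clamp guarding against a zero divisor is our addition):

```python
import torch

def dedimension(l, l3):
    """Eq. (15): divide a loss term by the rounded, gradient-blocked
    ratio to L3 so that all terms share L3's order of magnitude."""
    scale = torch.round((l / l3).detach()).clamp_min(1.0)  # psi([L / L3])
    return l / scale
```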

Fig. 2. The (a) training logic of our proposed communication-informed knowledge distillation algorithm and inference processes for (b) teacher model with complex construction and (c) distilled student model with only one linear output layer.

The total loss function $Los{s_{CIKD}}$ in CIKD can be expressed as:

$$Los{s_{CIKD}} = \alpha {\hat{L}_1} + \beta {\hat{L}_2} + (1 - \alpha - \beta ){L_3}$$
where $\alpha $ is the balance hyperparameter of the loss term ${L_1}$, and $\beta $ is the balance hyperparameter of the loss term ${L_2}$.
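Combining the sketches above (`sp_loss`, `dkd_loss`, and `dedimension`), one CIKD training step might look as follows ($\alpha$ and $\beta$ are placeholders; the paper does not list its final hyperparameter values):

```python
import torch
import torch.nn.functional as F

def cikd_step(student, teacher, x, y, alpha=0.3, beta=0.3):
    """Compute Loss_CIKD (Eq. (16)) for one batch. The teacher is
    frozen; only the student's r and linear head receive gradients."""
    with torch.no_grad():
        z_t, f_t = teacher(x)
    z_s, f_s = student(x)
    l1 = sp_loss(f_t, f_s)        # ILFSP term, Eq. (12)
    l2 = dkd_loss(z_s, z_t, y)    # RVD term, Eq. (14)
    l3 = F.cross_entropy(z_s, y)  # hard-label CE term
    return (alpha * dedimension(l1, l3)
            + beta * dedimension(l2, l3)
            + (1 - alpha - beta) * l3)
```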

In the model testing phase, Fig. 2(b) shows that the test dataset passes through the CNN, Flatten, and linear output layers during the teacher model's forward inference to obtain the final prediction. Figure 2(c) shows that the student model first passes the data through an auxiliary system with only one trainable parameter and then outputs the result through a dense layer without an activation function. The student model has only one linear output layer, achieving an ultra-lightweight architecture.

3. Experimental setup

Figure 3 shows the experimental setup of the PSK/PAM/QAM/APSK-CAP modulated UVLC system. We set the attenuator at the transmitter to 5 dB, the current source to 145 mA, and the system bit rate to 2.5 Gbps. We adjust the voltage source from 100 mV to 1200 mV with a step size of 100 mV. At the receiver, we set both attenuators to 10 dB.

Fig. 3. The experimental setup of our single carrier CAP modulated UVLC system.

At the transmitter, a raw sequence of binary bits is randomly generated and mapped to one of the eight modulation formats. After four-times upsampling, the signal is separated into in-phase and quadrature components. After I/Q separation, we pulse-shape the signal with a Hilbert transform to complete CAP modulation [26], and then use an AWG to convert the digital signal into an analog signal. Finally, a silicon-based green LED converts the electrical signal into an optical signal and emits it.

At the receiver, after the signal passes through a water tank with a width of 1.2 meters, a PIN photodiode driven by a differential amplifier receives the optical signal and converts it into an electrical signal. The two electrical signals are combined into one after passing through the ADC. An LMS-Volterra post-equalizer is then used to compensate for the linear and nonlinear distortion in the received signal. The real and imaginary parts of the signal are separated with a matched filter before downsampling. Then, the signal is restored to the original binary bit sequence through downsampling, LMS symbol-level equalization, and de-mapping. Note that both LMS-Volterra waveform-level equalization and LMS symbol-level equalization require data from the transmitter: data at both the transmitter and the receiver are needed to compute the error signals of the LMS and LMS-Volterra algorithms so that their parameters can be iteratively updated. We collect transmitted and received data at the 900 mV operating point and obtain the optimal parameters of these two algorithms. To avoid data leakage at the receiver, these parameters are then fixed at the other operating points; in this case the algorithms require no transmitter data and act as blind equalizers. In the actual experiment, we map the I-channel and Q-channel data from LMS symbol-level equalization to the rectangular coordinate system to form image data. We train and test the model separately on the datasets of the twelve driving-voltage operating points. Each dataset has 4800 samples, with 600 images per modulation format, and we randomly split it into a training dataset and a testing dataset in a ratio of 6:4.

4. Experimental results

This section presents an ultra-lightweight student model scheme based on CIKD. We also describe the process of selecting the hyperparameters for the auxiliary system, and conduct an ablation experiment to verify the effectiveness of the double modules knowledge distillation. We then visualize the results of different knowledge distillation models and compare the computational complexity of the teacher model, the pure student model, and the CIKD-based student model.

4.1 Performance of the ultra-lightweight model based on CIKD

Figure 4 illustrates how different models perform on the test set as the LED driving voltage varies. In UVLC, the driving voltage clearly delineates the noise working area, the optimal working area, and the nonlinear working area, enabling the algorithm's performance to be explored in different environments. ACC denotes the accuracy of the model in the inference stage. The light blue and pink regions indicate where noise and nonlinear effects, respectively, dominate the signal interference. The student models, S Model (Partial Data) and S Model (Whole Data), are trained with the auxiliary-system data and the original data, respectively. Partial data are images passed through the communication-informed module; whole data are images that bypass this module, i.e., no circular mask blacks out any pixels. Both models have low, nearly constant accuracy below 20%, due to their limited representation and learning abilities with only a linear output layer and no activation function. The teacher model's accuracy fluctuates with the voltage, depending on the signal distortion, image clarity, and interference level. The optimal point is 800 mV, where noise and nonlinear effects cancel out and the image is the clearest.

Fig. 4. Accuracy versus Vpp for the teacher model, the student model fed with whole data, the student model fed with partial data, and the CIKD model.

The CIKD student model's accuracy is close to the teacher model's, but varies with Vpp. In the noisy region, the points on the constellation diagram are scattered, widening the accuracy gap between the models. At the optimal point, noise and nonlinear effects reach a balanced state and the constellation diagram is clear. In this case, the teacher model's knowledge has low information entropy, and the student model matches the teacher model's accuracy through knowledge transfer. In the nonlinear region, the device's nonlinear characteristics produce phase noise and radial tailing on the constellation diagram. Severe nonlinear effects also distort the constellation diagram's shape, interfering with APSK and QAM recognition. Both models' accuracy drops and the gap increases, indicating the difficulty of knowledge transfer in this region. The CIKD model improves accuracy by up to 87% compared to the pure student model.

The teacher model and the CIKD model perform poorly when the driving voltage is 100 mV, with an accuracy difference of 6%. However, when the driving voltage increases to 1200 mV, the accuracy difference widens to 23%, demonstrating that the proposed distillation method has greater resilience to ambient noise. Even under low-intensity light emission from the LED, the ultra-lightweight model can accurately identify the modulation format at the receiver end, ensuring signal demodulation.

4.2 Effect of hyperparameters in auxiliary system

The auxiliary system adds Gaussian noise whose magnitude is tied to the circular mask's radius r, which involves r in the loss function calculation. Gaussian noise can enhance the image-processing model's robustness and generalization, but too much noise impairs image clarity and model accuracy. Since r is on the order of ${10^{ - 1}}$, using it directly as the Gaussian noise's variance is excessive, so an attenuation coefficient $\alpha$ is needed. Figure 5 illustrates how varying the attenuation coefficient affects the CIKD model's accuracy, with r initially set to 0.4. We tested three driving voltages: 100 mV (noise-dominated regime), 700 mV (balanced regime), and 1200 mV (nonlinearity-dominated regime). As $\alpha$ increases from ${10^{ - 4}}$ to ${10^{ - 3}}$, the accuracy at all three voltages improves, indicating that moderate noise helps the model capture common features and enhances its robustness. As $\alpha$ increases further from ${10^{ - 3}}$ to ${10^{ - 2}}$, the accuracy at all three voltages deteriorates, suggesting that excessive noise masks the image and amplifies the channel's environmental noise. The results demonstrate that the optimal value of $\alpha$ is ${10^{ - 3}}$, which yields the highest accuracy at all three voltages. Hence, we fixed $\alpha$ at ${10^{ - 3}}$ for all comparative experiments in this paper.

Fig. 5. Selection of the Gaussian noise variance attenuation coefficient of the auxiliary system under the noise operating point 100 mV, the optimal operating point 700 mV, and the nonlinear operating point 1200 mV.

Notably, the radius r tends to increase over the course of training. This suggests that the data-driven mask favors retaining more complete patterns over producing sharper images and sparser data.

4.3 Ablation study of knowledge distillation models

To evaluate the effectiveness of the two aforementioned distillation modules, we compare the performance of (1) the student baseline without an auxiliary system, (2) classical KD, (3) intermediate layer knowledge distillation (ILKD), and (4) our CIKD model.

Table 2 shows the results of the different models. The student model has less than 20% accuracy, but improves significantly after KD. Classical KD increases the accuracy by 28%–56%, bringing the student model close to the teacher model. ILKD guides the student model with the teacher model's intermediate implicit features and improves the accuracy by 76% over the student model. CIKD is the best KD model; it modifies the logits (namely, the representation vectors) and incorporates ILKD. Among the logits terms, target class knowledge distillation (TCKD) works better for complex patterns, while non-target class knowledge distillation (NCKD) provides information similar to the original logits. TCKD overcomes nonlinear effects, enabling CIKD to achieve 100% accuracy at 800 mV, even surpassing the teacher model. NCKD demonstrates the importance of "dark knowledge". CIKD improves the accuracy by 87% compared to the student model. CIKD reveals data patterns from different perspectives and instructs the student model to classify correctly.

Table 2. Ablation of the CIKD model at different Vpp. Baseline KD: classical knowledge distillation without intermediate feature-map relationship extraction and the decoupling module. ILKD: knowledge distillation with intermediate feature-map relationship extraction only.

4.4 Visual performance comparison of knowledge distillation models

To compare the three algorithms more intuitively, we use the correlation-matrix difference, the confusion matrix, and t-distributed stochastic neighbor embedding (t-SNE) dimensionality reduction to visualize the performance differences. We set the LED driving voltage to 100 mV, where the data are fuzzy and complex and the results of the KD models differ significantly.

Figure 6 shows the difference between the correlation coefficient matrices of the teacher and student logits vectors. We first calculate the correlation coefficient matrix of the teacher model's logits vector, then subtract the corresponding matrices of the KD, ILKD, and CIKD models, respectively. To simplify the presentation, we keep only the lower triangular matrix and use blue squares to indicate the magnitude of each correlation coefficient: the darker and larger the blue square, the larger the difference between the teacher and student correlation coefficients. Figure 6(a) shows that classical KD's logits correlation coefficient matrix differs most from the teacher model's; Fig. 6(b) shows that the ILKD model's matrix is closer to the teacher model's; Fig. 6(c) shows that the CIKD model's matrix is almost identical to the teacher model's in terms of values. This indicates that, through hidden-layer feature similarity preservation and representation vector decoupling, knowledge is well transferred to the student model, making the student's logits output close to the teacher's and demonstrating the optimality of the CIKD model from this perspective.

Fig. 6. Difference of correlation matrices of teacher and student logits. CIKD (rightmost) yields the smallest difference (i.e., predictions most similar to the teacher model) among the three models: (a) classical KD, (b) ILKD, and (c) our final model, CIKD.

The confusion matrix evaluates the model's classification performance for the different modulation formats: the closer the confusion matrix is to a diagonal matrix, the better the model performs. Figure 7 compares the confusion matrices of three KD models: classical KD, ILKD, and CIKD. The results show that KD improves on the baseline student model, ILKD improves on KD, and CIKD improves the most. CIKD has the most diagonal confusion matrix, indicating the best classification performance. It is worth noting that 16QAM and PAM4 are prone to classification errors in CIKD. The reason is that an unknown shift occurs when the points are affected by noise or nonlinear effects, which leads to adhesion of the data clusters inside the mask.

Fig. 7. Confusion matrices of the different KD models with very low accuracy at 0.5 V, 0.8 V, and 1.1 V.

In KD, logits are features that are "closer to the end" of the network. To examine how this kind of feature affects downstream tasks, we use t-SNE for dimensionality reduction and visual analysis. t-SNE is an unsupervised machine learning algorithm that reduces high-dimensional data to a low-dimensional space (usually two or three dimensions) [27,28] while preserving the similarity structure of the high-dimensional data; it is well suited to visualizing high-dimensional data because it captures both local and global structure. After t-SNE reduces the dimensions of clear, highly distinguishable features, each data cluster is further separated and the points within a cluster move closer together. Figure 8 illustrates how different data types behave after dimensionality reduction. The original testing dataset is flattened and then reduced by t-SNE, resulting in Fig. 8(a): the eight modulation formats are indistinguishable, indicating the need for automatic feature extraction. Figure 8(b) shows the logits output by classical KD after t-SNE. Low-order modulation formats, such as PAM4, QPSK, 8QAM-CIR, and 8QAM-DIA, are well separated, as they have low feature information entropy and simple patterns; high-order modulation formats remain mixed, as classical KD struggles to extract complex features. Figure 8(c) shows the result of ILKD, which guides the student model with the teacher model's intermediate implicit features: except for 32QAM and 32APSK, the formats are separated. Figure 8(d) shows the result of CIKD, which adds decoupled representation-vector distillation and an enhanced feature extraction strategy: 32APSK and 32QAM are more separated than with ILKD, although not completely. The accuracy at 100 mV is low, as the images are unclear and the patterns complex; as the voltage increases, the images become clearer and the patterns simpler, making the data easier to separate after reduction.

Fig. 8. Visualization of feature vectors from (a) the original dataset, (b) KD, (c) ILKD, and (d) CIKD after the t-SNE dimensionality reduction algorithm on the 100 mV image dataset.

4.5 Computation complexity comparison of knowledge distillation models

The aim of KD is to transfer the teacher model's knowledge to the student model and enhance prediction accuracy. Training a student model requires passing data through both models, which is computationally costly and time-consuming; testing, however, involves only the student model, which has fewer parameters and faster inference. The goal of KD is therefore a model that is both efficient and accurate. Two metrics are used to evaluate the inference performance of the model: inference time and Top-1 accuracy. The inference time is the average duration of ten inferences by the model on the test set, and the Top-1 accuracy is the proportion of predictions that match the true labels. In a plot with inference time on the x-axis and Top-1 accuracy on the y-axis, the closer a model is to the y-axis (shorter inference time) and the higher it sits (greater accuracy), the better its performance.

Figure 9 shows that when the LED driving voltage is 800 mV, the CIKD model is the best and the two student models are the worst. Moreover, the weight matrix of the classical KD is not sparse, resulting in longer inference time than the teacher model, while the weight matrix of the CIKD model is sparse and the inference time is the shortest. This indicates that a good KD model has a regularization effect.

Fig. 9. Inference time versus Top-1 accuracy on the same test dataset.

In the analysis of computational complexity, the number of multipliers serves as a crucial metric, accurately reflecting the primary resource demands of hardware implementations of algorithms. This metric helps assess and improve both the performance and the energy efficiency of an algorithm. This paper examines the differences in computational complexity among the algorithms during the inference phase. Since the CIKD framework uses only the student model (enhanced with the communication-informed module) at inference, we compare the multiplier counts of the teacher model and the student model.

In this study, the teacher model is composed of a convolutional layer followed by a fully-connected layer, with an input image size of $150 \times 150 \times 3$. Within the convolutional layer, denote the convolution kernel size as F, the number of input channels as N, the number of convolution kernels as M, and the width and height of the output feature map as W and H, respectively. The number of multipliers in the convolutional layer is then $F \times F \times N \times M \times W \times H$. In our experiment, we employed 64 filters, each of size $3 \times 3$, with a stride of 2 and "same" padding. This configuration results in $3 \times 3 \times 3 \times 64 \times 75 \times 75 = 9{,}720{,}000$ multipliers. Because the ReLU activation function is used, the activation involves no multiplication operations. The Flatten layer likewise involves no multipliers and outputs a one-dimensional vector of 360,000 features. The linear output layer then generates a one-dimensional vector with 8 features, contributing $360{,}000 \times 8 = 2{,}880{,}000$ multipliers. In summary, the teacher model contains a total of $9{,}720{,}000 + 2{,}880{,}000 = 12{,}600{,}000$ multipliers.
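The teacher-model tally can be reproduced in a few lines of Python:

```python
# Teacher multiplier count (Section 4.5): F*F*N*M*W*H for the
# convolutional layer, input features x classes for the linear head.
F_, N, M, W, H = 3, 3, 64, 75, 75
conv_mults = F_ * F_ * N * M * W * H  # 9,720,000
head_mults = 75 * 75 * 64 * 8         # 360,000 x 8 = 2,880,000
print(conv_mults + head_mults)        # 12,600,000
```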

In the CIKD model, the inference phase uses only the student model, so we count the multipliers of the communication-informed block and the linear output layer. The communication-informed block involves two primary operations: generating a single-channel mask and performing pixel-level multiplication. Generating the mask requires $150 \times 150 = 22{,}500$ Euclidean distance calculations, each with two multiplication operations, for a total of 45,000 multipliers in this phase. Expanding the mask into a three-channel matrix for pixel-level multiplication then requires $150 \times 150 \times 3 = 67{,}500$ multipliers. Given that the number of multipliers in the linear output layer is the same as that of the teacher model, the CIKD model altogether requires $45{,}000 + 67{,}500 + 2{,}880{,}000 = 2{,}992{,}500$ multipliers. Consequently, during the inference stage, the multiplier count of CIKD is merely 23.75% of that required by the teacher model, underscoring CIKD's advantage in terms of hardware deployment efficiency.

We also compare the parameter sizes of the different models. The pure student model has 540,000 parameters; all KD models have one additional parameter, the radius r. The teacher model has 2,882,000 parameters. The student model thus has only 18% of the teacher model's parameters, demonstrating its ultra-lightweight nature.

5. Conclusion

In summary, we present a communication-informed knowledge distillation (CIKD) model for modulation format recognition (MFR) that achieves near-perfect accuracy with an ultra-lightweight student model. The student model has only 18% of the parameters of the complex teacher model, which significantly reduces algorithmic complexity. By applying hidden-layer feature similarity preservation and representation vector decoupling, we successfully train a teacher-student pair with vastly different parameter sizes. Ablation experiments demonstrate that the two KD modules are effective and that the CIKD model can achieve 100% accuracy for MFR. The low computational cost and rapid inference speed make CIKD a promising candidate algorithm for MFR.

Funding

National Natural Science Foundation of China (61925104, 62031011); National Key Research and Development Program of China (2022YFB2802803).

Disclosures

The authors declare no conflicts of interest.

Data availability

Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request.

References

1. S. Arnon, "Underwater optical wireless communication network," Opt. Eng. 49(1), 015001 (2010).

2. H. Kaushal and G. Kaddoum, "Underwater optical wireless communication," IEEE Access 4, 1518–1547 (2016).

3. N. Chi, Y. Zhou, Y. Wei, et al., "Visible light communication in 6G: advances, challenges, and prospects," IEEE Veh. Technol. Mag. 15(4), 93–102 (2020).

4. S. Q. Duntley, "Light in the sea," J. Opt. Soc. Am. 53(2), 214–233 (1963).

5. B. Wozniak and J. Dera, "Light absorption by dissolved organic matter (DOM) in sea water," in Light Absorption in Sea Water, 112–166 (2007).

6. F. A. Dahri, S. Ali, M. Jawaid, et al., "A review of modulation schemes for visible light communication," International Journal of Computer Science and Network Security 18, 117 (2018).

7. Y.-C. Liang, K. C. Chen, G. Ye Li, et al., "Cognitive radio networking and communications: an overview," IEEE Trans. Veh. Technol. 60(7), 3386–3407 (2011).

8. L. Guesmi, A. Mohamed Ragheb, H. Fathallah, et al., "Experimental demonstration of simultaneous modulation format/symbol rate identification and optical performance monitoring for coherent optical systems," J. Lightwave Technol. 35(4), 868–875 (2017).

9. J. Thrane, J. Wass, M. Piels, et al., "Machine learning techniques for optical performance monitoring from directly detected PDM-QAM signals," J. Lightwave Technol. 35, 868–875 (2016).

10. Z. Wan, Z. Yu, L. Shu, et al., "Intelligent optical performance monitor using multi-task learning based artificial neural network," Opt. Express 27(8), 11281–11291 (2019).

11. D. H. Al-Nuaimi, V. A. Hashim, I. A. Hashim, et al., "Performance of feature-based techniques for automatic digital modulation recognition and classification—a review," Electronics 8(12), 1407 (2019).

12. O. A. Dobre, A. Abdi, Y. Bar-Ness, et al., "Survey of automatic modulation classification techniques: classical approaches and new trends," IET Commun. 1(2), 137–156 (2007).

13. F. Li, X. Lin, J. Shi, et al., "Modulation format recognition in a UVLC system based on reservoir computing with coordinate transformation and folding algorithm," Opt. Express 31(11), 17331–17344 (2023).

14. N. Chi, Y. Zhao, M. Shi, et al., "Gaussian kernel-aided deep neural network equalizer utilized in underwater PAM8 visible light communication system," Opt. Express 26(20), 26700–26712 (2018).

15. Y. Wei, L. Yao, H. Zhang, et al., "An optimal adaptive constellation design utilizing an autoencoder-based geometric shaping model framework," Photonics 10, 809 (2023).

16. W. Xiao, Z. Luo, Q. Hu, et al., "A review of research on signal modulation recognition based on deep learning," Electronics 11(17), 2764 (2022).

17. L. Yao, H. Zhang, C. Chen, et al., "MPT-Transformer based post equalizer utilized in underwater visible light communication system," in 2023 Opto-Electronics and Communications Conference (OECC) (IEEE, 2023), pp. 1–4.

18. Y. Zeng, M. Zhang, F. Han, et al., "Spectrum analysis and convolutional neural network for automatic modulation recognition," IEEE Wireless Commun. Lett. 8(3), 929–932 (2019).

19. S. Chen, Y. Zhang, Z. He, et al., "A novel attention cooperative framework for automatic modulation recognition," IEEE Access 8, 15673–15686 (2020).

20. F. Tung and G. Mori, "Similarity-preserving knowledge distillation," in Proceedings of the IEEE/CVF International Conference on Computer Vision (2019), pp. 1365–1374.

21. B. Zhao, Q. Cui, R. Song, et al., "Decoupled knowledge distillation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2022), pp. 11953–11962.

22. H. Chen, W. Niu, Y. Zhao, et al., "Adaptive deep-learning equalizer based on constellation partitioning scheme with reduced computational complexity in UVLC system," Opt. Express 29(14), 21773–21782 (2021).

23. A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems 25 (2012).

24. T. Tsiligkaridis and A. Tsiligkaridis, "Diverse Gaussian noise consistency regularization for robustness and uncertainty calibration," in 2023 International Joint Conference on Neural Networks (IJCNN) (IEEE, 2023), pp. 1–8.

25. G. Hinton, O. Vinyals, and J. Dean, "Distilling the knowledge in a neural network," arXiv:1503.02531 (2015).

26. N. Chi, Y. Zhou, S. Liang, et al., "Enabling technologies for high-speed visible light communication employing CAP modulation," J. Lightwave Technol. 36(2), 510–518 (2018).

27. A. C. Belkina, C. O. Ciccolella, R. Anno, et al., "Automated optimized parameters for T-distributed stochastic neighbor embedding improve visualization and analysis of large datasets," Nat. Commun. 10(1), 5415 (2019).

28. L. van der Maaten, E. Postma, and J. van den Herik, "Dimensionality reduction: a comparative review," Journal of Machine Learning Research 10, 13 (2009).
