
Performance vs. complexity in NN pre-distortion for a nonlinear channel

Open Access

Abstract

Optical communications at high bandwidth and high spectral efficiency rely on the use of a digital-to-analog converter (DAC). We propose the use of a neural network (NN) for digital pre-distortion (DPD) to mitigate the quantization and band-limitation impairments introduced by the DAC in such systems. We experimentally validate our approach with a 64 Gbaud 8-level pulse amplitude modulation (PAM-8) signal. We examine NN-DPD training with both direct and indirect learning methods, and compare the performance with typical Volterra, look-up table (LUT) and linear DPD solutions. We sweep regimes where nonlinear quantization becomes more prominent to highlight the advantages of the NN-DPD. The proposed NN-DPD trained via direct learning outperforms the Volterra, LUT and linear DPDs by almost 0.9 dB, 1.9 dB and 2.9 dB, respectively. We find that an indirect learning recurrent NN offers better performance at the same complexity as Volterra, while a direct learning recurrent NN pushes performance to a level higher than a Volterra DPD can achieve.

© 2023 Optica Publishing Group under the terms of the Optica Open Access Publishing Agreement

1. Introduction

The rapid growth of internet traffic driven by bandwidth-intensive applications (e.g., virtual reality and real-time video conferencing) has pushed optical networks close to their maximum capacity. Optimal use of the available optical spectrum is essential to enhance system throughput. This requires higher-order modulation at high symbol rates on a tightly packed spectral grid [1,2]. Higher-order modulation has stringent signal-to-noise ratio (SNR) requirements, while wider signal bandwidths can lead to intersymbol interference (ISI). Digital pre-distortion (DPD) of the transmitted signal can address both challenges.

Transmitter impairments may come from the chain of digital-to-analog converter (DAC), radio frequency (RF) amplifier and optical modulator. Any transmitter component with a bandwidth smaller than the signal bandwidth causes memory effects, also known as ISI. Systems rely on the DAC for Nyquist signaling (to confine signals to their minimum spectrum) and higher-order modulation formats. Modulator architectures that eliminate the need for a DAC [3] are restricted to a resolution of only a few bits per symbol and are incompatible with Nyquist pulse shaping. In state-of-the-art systems, the DAC commonly imposes a constraint on the attainable baud rate, exhibits the most restricted bandwidth within the transmitter chain, and introduces nonlinear quantization noise.

Transmitter distortions are typically pre-compensated in a digital signal processing (DSP) block known as DPD. A linear DPD can address the memory effects [4], but it restricts the signal amplitude, thereby resulting in lower SNR. Adding a nonlinear DPD can increase the SNR by allowing a larger signal swing. To achieve higher throughput, the DPD should address both linear and nonlinear effects [5]. Volterra-series-based nonlinear DPDs have been widely used for coherent optical transmitters [6,7] and RF amplifiers [8,9]. This approach is mostly based on indirect learning (IL), which gives limited performance due to noise enhancement and estimation error at low SNR and high nonlinearity. The look-up table (LUT) is another nonlinear DPD method, used to pre-compensate both optical systems [10,11] and RF amplifiers [12]. While the LUT is a low-complexity solution, its scalability to higher modulation formats is challenging due to large memory requirements.

Recently, neural networks (NNs) have come under consideration for DPD. Various structures have been proposed to achieve the linearization of RF amplifiers, including the feed-forward neural network (FNN) [13], convolutional neural network (CNN) [14] and time delay neural network (TDNN) [15–17]. For optical transmitters, a DPD based on an FNN mitigated the nonlinearity of a low-resolution DAC [18] and a Mach-Zehnder modulator (MZM) [19]. Most of the above approaches used IL, which may not provide the best performance. For example, direct learning (DL) based DPDs using an FNN for a high-baudrate coherent optical transmitter [5] and an advanced recurrent neural network (RNN) for a simulated optical transmitter [20] have shown better performance than IL.

We propose an NN-DPD architecture using CNN and bidirectional RNN (BiRNN) layers to mitigate impairments stemming from the DAC and power amplifier (PA). We use both learning methods (IL and DL) to evaluate the performance of our proposed NN-DPD. We contrast our solution with linear and nonlinear DPDs, including Volterra and LUT solutions. The DPDs are applied to an 8-level pulse amplitude modulation (PAM-8) signal at 64 Gbaud in an electrical back-to-back (EB2B) experimental setup. We train and examine the performance of all DPDs for various levels of nonlinear quantization distortion. The results show our NN-DPD trained using DL is the most effective approach, as it outperforms the Volterra, LUT and linear DPDs by almost 0.9 dB, 1.9 dB and 2.9 dB, respectively. We further analyze the performance/complexity trade-off of DPDs based on various NNs (FNN, simple RNN, long short-term memory (LSTM) and gated recurrent unit (GRU)) against the Volterra DPD.

The rest of the paper is organized as follows: Section 2 describes the nonlinear system; we review Volterra and LUT solutions and introduce the BiRNN. In section 3 we explore many choices for the RNN, including several channel inversion strategies, the use of DL and IL, and the use of iteration in training, and we present experimental results comparing DPD performance. In section 4 we examine the complexity of the various DPD solutions on a per-symbol basis and discuss several performance/complexity trade-offs. Finally, we offer some concluding remarks in section 5.

2. Digital pre-distortion for a nonlinear channel

System nonlinearity can limit total capacity, with multiple devices contributing to the nonlinear behavior. In this study we focus on quantization nonlinearity to compare the performance and complexity of various nonlinear pre-distortion techniques. We review the common approaches, Volterra and LUT, and propose a recurrent neural network solution.

2.1 Nonlinear channel under test and figure of merit

Optical systems improve capacity via multilevel modulation and extremely high bandwidths. The need for pulse shaping, equalization filters and multilevel modulation leads to the use of a DAC, followed by a PA. This configuration introduces nonlinear quantization noise and may also include some saturation effects.

In Fig. 1 we show the block diagram of the nonlinear system to be examined. The transmitter (TX) and receiver (RX) employ linear DSP. The TX-DSP includes raised cosine (RC) filtering with a 0.01 roll-off factor to keep the signal bandwidth to a minimum. A finite impulse response (FIR) filter compensates the limited system frequency response. The quantization applied before the DAC is uniform; nonlinear clipping enhances performance by removing the large and infrequent excursions (created by the linear filtering) that would otherwise compromise the SNR. The receiver low pass filter (LPF) limits out-of-band additive white Gaussian noise (AWGN); we synchronize (sync) for optimal sampling. The signal is then filtered to minimize the mean square error. In our comparison of DPD solutions we hold these receiver DSP elements fixed.
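To make the TX-DSP chain concrete, the following minimal Python/NumPy sketch applies raised-cosine pulse shaping, a linear pre-emphasis FIR, clipping and uniform quantization to a PAM-8 frame. The filter lengths, clipping definition and the 8-bit mapping to the range [0, 255] are illustrative assumptions rather than the exact experimental settings.

```python
# Minimal sketch of the TX-DSP chain (RC pulse shaping, pre-emphasis FIR,
# clipping, uniform quantization). Parameters are illustrative assumptions.
import numpy as np

def raised_cosine(beta=0.01, sps=4, num_taps=513):
    """Raised-cosine impulse response; beta = roll-off, sps = samples/symbol."""
    t = (np.arange(num_taps) - (num_taps - 1) / 2) / sps
    # +1e-12 crudely avoids the 0/0 points of the closed-form expression
    h = np.sinc(t) * np.cos(np.pi * beta * t) / (1 - (2 * beta * t) ** 2 + 1e-12)
    return h / np.sum(h)

def tx_dsp(symbols, fir_taps, clip_ratio=0.85, bits=8, sps=4):
    up = np.zeros(len(symbols) * sps)                 # upsample to 4 samples/symbol
    up[::sps] = symbols
    shaped = np.convolve(up, raised_cosine(sps=sps), mode="same")
    pre = np.convolve(shaped, fir_taps, mode="same")  # linear pre-emphasis FIR
    a = clip_ratio * np.max(np.abs(pre))
    clipped = np.clip(pre, -a, a)                     # remove rare large excursions
    levels = 2 ** bits
    q = np.round((clipped + a) / (2 * a) * (levels - 1))  # uniform quantization
    return q.astype(int)                              # integer codes sent to the DAC

# Example: pam8 = np.random.choice(np.arange(-7, 8, 2), 8192)
#          dac_codes = tx_dsp(pam8, fir_taps=np.array([1.0]))
```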

Fig. 1. Block diagram of the experimental setup, including DSP steps and hardware.

The center portion of Fig. 1, the channel, contains three hardware components: a DAC, an amplification stage and a real-time oscilloscope (RTO). The DAC introduces nonlinear quantization error and is band limited. The RTO captures the signal for offline processing and reception. We operate at 64 Gbaud, while the DAC has a 3 dB bandwidth of 18 GHz. The PA and RTO have sufficient 3 dB bandwidth for the signal.

The amplification stage consists of a tunable-gain PA and an attenuator, allowing us to sweep the signal swing, i.e., V$_{pp}$. A larger signal swing increases both the signal power and the nonlinear distortion equally, while the thermal noise remains fixed in our experiment. We attack the nonlinear distortion with pre-distortion; when pre-distortion is successful, the increased V$_{pp}$ translates into improved performance. Our performance metric is the effective SNR, given by

$$\text{effective SNR} = 10\log_{10} \bigg(\frac{\mathbb{E} [|X|^2]}{\mathbb{E} [|X-Y|^2]}\bigg),$$
where $X$ and $Y$ are the transmitted and received symbol coordinates, respectively. We use a statistical mean for the $\mathbb {E}$ operator. The numerator is the signal power, while the denominator is the residual noise from all sources (nonlinear distortion and thermal noise). As we approach the largest V$_{pp}$ examined, the PA enters saturation, with potentially increased nonlinearity.
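The effective SNR of Eq. (1) is straightforward to compute from the transmitted and received symbol coordinates; a direct NumPy implementation is sketched below.

```python
# Direct implementation of the effective SNR figure of merit in Eq. (1).
import numpy as np

def effective_snr_db(x, y):
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    signal_power = np.mean(np.abs(x) ** 2)       # E[|X|^2]
    noise_power = np.mean(np.abs(x - y) ** 2)    # E[|X - Y|^2], all noise sources
    return 10 * np.log10(signal_power / noise_power)
```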

2.2 Common DPD solutions

The common approach to DPD is to first apply linear compensation. Our RC filter has four samples per symbol, 513 taps and a roll-off factor of 0.01. We apply an FIR filter in the form of a minimum mean square error (MMSE) equalizer. We estimate the filter coefficients when transmitting a PAM-8 signal in our electrical back-to-back experiment. As the RF amplifier has a wider bandwidth than the DAC, this filter mostly inverts the DAC frequency response. The filter has 256 taps, sufficient to cover signal reflections from RF cables and connections.

The output of the RC filter is re-sampled to one sample per symbol for experimental transmission. We apply uniform quantization. We find the best clipping level manually by observing the resultant bit error rate. We examine two common nonlinear compensation stages applied before this linear stage and the clipping/quantization: Volterra and LUT.

2.2.1 Volterra DPD

The Volterra series is a widely used approach to model nonlinear systems with memory. For DPD we find the Volterra weights of the inverse system. During training we transmit $N$ known logical symbols $x[n]$, or vector $\overrightarrow {x}_{N \times 1}$, where $\overrightarrow {(.)}$ represents a vector quantity. We capture experimentally the system outputs $y[n]$. We use this training set to find the weights solving [21]

$$\begin{aligned} x[n]& = \sum_{i_{1}={-}M_{1}}^{M_{1}} w_{1}[i_{1}]y[n+i_{1}]+\sum_{i_{1}={-}M_{2}}^{M_{2}}\sum_{i_{2}=i_{1}}^{M_{2}}w_{2}[i_{1},i_{2}]y[n+i_{1}]y[n+i_{2}]+ \\ &\qquad\sum_{i_{1}={-}M_{3}}^{M_{3}}\sum_{i_{2}=i_{1}}^{M_{3}}\sum_{i_{3}=i_{2}}^{M_{3}}w_{3}[i_{1},i_{2},i_{3}]y[n+i_{1}]y[n+i_{2}]y[n+i_{3}],\end{aligned}$$
where $M_{p}$ are the memory lengths of the $p^{\text {th}}$ order terms. Table 1 gives $C_{p}$, the number of $p^{\text {th}}$ order weights. The Volterra kernel weights $w_{p}$ (also known as coefficients or taps) are gathered into a single column vector $\overrightarrow {w}$. To find the weights we write (2) in matrix form. We define $\overrightarrow {y}_{k}=[y[k],y[k+1],\ldots,y[k+N-1]]^T$, where $(.)^T$ denotes the transpose. We form a matrix $\boldsymbol {Y}$ whose columns are generated from delayed versions of the received symbols per
$$\boldsymbol{Y} = [\overrightarrow{y}_{{-}M_{1}} \hspace{2mm} \cdots\hspace{2mm} \overrightarrow{y}_{{-}M_{2}} \odot \overrightarrow{y}_{{-}M_{2}} \hspace{2mm} \cdots\hspace{2mm} \overrightarrow{y}_{M_{3}} \odot \overrightarrow{y}_{M_{3}} \odot \overrightarrow{y}_{M_{3}}].$$
where $\odot$ denotes the element-wise multiplication. The matrix form of (2) is then [5]
$$\overrightarrow{x}_{N \times 1}=\boldsymbol{Y}_{N \times C}\overrightarrow{w}_{C \times 1},$$
where $C$ is the total number of Volterra weights. The weights $\overrightarrow {w}$ can be estimated using the Moore-Penrose inverse relation per
$$\overrightarrow{w}_{C \times 1} = [(\boldsymbol{Y}^H\boldsymbol{Y})^{{-}1}\boldsymbol{Y}^H]_{C \times N}\overrightarrow{x}_{N \times 1}.$$

Table 1. Number of weights with increasing Volterra order

We use a Volterra solution with $M_1=125$, $M_2=5$, and $M_3=2$. For training, $N=8192$, and we use the transmitted logical symbols $x[n]$ and experimentally captured test statistics $y[n]$ to find the weights via (5). For the test phase, we take a block of logical symbols and form the matrix $\boldsymbol{X}$ of element-wise multiplied, delayed data blocks per (3). The pre-distorted block for transmission becomes

$$\text{DPD}=\boldsymbol{X}_{N \times C}\overrightarrow{w}_{C \times 1}.$$
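A minimal sketch of this Volterra DPD is given below: it builds the feature matrix of Eq. (3) from delayed and element-wise multiplied blocks, estimates the weights by least squares per Eq. (5), and applies them to a new block per Eq. (6). The cyclic shifts used for the delayed copies (edge handling is ignored) are a simplification of this sketch, not part of the method described above.

```python
# Sketch of the Volterra DPD of Eqs. (2)-(6). Memory lengths follow the text.
import numpy as np

def volterra_matrix(y, M1=125, M2=5, M3=2):
    """Rows: time index n; columns: 1st, 2nd and 3rd order Volterra terms."""
    def shift(k):                 # approximate y[n + k] by a cyclic shift
        return np.roll(y, -k)
    cols = [shift(i1) for i1 in range(-M1, M1 + 1)]
    for i1 in range(-M2, M2 + 1):
        for i2 in range(i1, M2 + 1):
            cols.append(shift(i1) * shift(i2))
    for i1 in range(-M3, M3 + 1):
        for i2 in range(i1, M3 + 1):
            for i3 in range(i2, M3 + 1):
                cols.append(shift(i1) * shift(i2) * shift(i3))
    return np.stack(cols, axis=1)                     # N x C matrix

# Training: x = transmitted logical symbols, y = captured test statistics.
# w, *_ = np.linalg.lstsq(volterra_matrix(y), x, rcond=None)   # Eq. (5)
# Test: pre-distorted block from a new block of logical symbols, per Eq. (6).
# dpd_out = volterra_matrix(x_new) @ w
```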

2.2.2 Look-up table DPD

A look-up table (LUT) introduces a correction term to the middle symbol of a sequence to address the nonlinear distortions; we follow the LUT implementation in [22]. We transmit a long sequence of symbols, $X$, and examine sub-sequences of length $\beta$, where $\beta$ is the order of the LUT, i.e., a LUT-$\beta$. We parse the long transmit sequence via a sliding window into $X_{n}=(x[n-(\beta-1) / 2], \ldots, x[n], \ldots, x[n+(\beta-1) / 2])$. The received coordinate $y[n]$ corresponds to the test statistic for detecting the middle symbol of $X_n$; their difference is the error, i.e., $e_{n} = x[{n}]- y[{n}]$.

For PAM-M modulation, there are $M^\beta$ possible patterns; a unique index $p$ is assigned to each pattern $s(p)$. The LUT assigns correction $\delta (p)$ to pattern $s(p)$. In estimating the optimal $\delta (p)$, we reduce the thermal noise during training by averaging over all occurrences of sequence $s(p)$ in the long sequence $X$. That is,

$$\delta (p) = \frac{1}{N_{p}} \sum_{\forall n \ni X_{n} = s(p)} e_{n},$$
where $N_p$ is the number of times the pattern $s(p)$ occurs in the transmitted sequence $X$. The estimated pattern errors $\delta (p)$ are stored in the LUT. The LUT DPD pre-distorts the middle symbol for $X_{n} = s(p)$ to
$$x_{n}^{TX} = x_{n} + \delta (p),$$

We use $\beta =3$, i.e., a LUT of order three.
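The following sketch illustrates the LUT-3 construction and application of Eqs. (7) and (8): the signed error is averaged over every occurrence of each 3-symbol pattern during training, and the stored correction is then added to the middle symbol at transmission. The PAM-8 alphabet of odd integers is an assumption made for illustration.

```python
# Sketch of the LUT-beta DPD of Eqs. (7)-(8), here with beta = 3.
import numpy as np

def build_lut(x, y, beta=3):
    half = beta // 2
    lut_sum, lut_cnt = {}, {}
    for n in range(half, len(x) - half):
        pattern = tuple(x[n - half:n + half + 1])   # s(p), indexed by the pattern
        e = x[n] - y[n]                             # signed error e_n, Eq. (7)
        lut_sum[pattern] = lut_sum.get(pattern, 0.0) + e
        lut_cnt[pattern] = lut_cnt.get(pattern, 0) + 1
    return {p: lut_sum[p] / lut_cnt[p] for p in lut_sum}   # delta(p)

def apply_lut(x, lut, beta=3):
    half = beta // 2
    out = np.array(x, dtype=float)
    for n in range(half, len(x) - half):
        pattern = tuple(x[n - half:n + half + 1])
        out[n] += lut.get(pattern, 0.0)             # Eq. (8): correct middle symbol
    return out

# Example: lut = build_lut(x_train, y_train); x_tx = apply_lut(x_new, lut)
```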

2.3 Bidirectional RNN

To accommodate various applications, machine learning (ML) offers numerous neural network (NN) architectures that can be very effective in combating nonlinear effects. We propose the NN architecture in Fig. 2 for DPD, inspired by [5]. This architecture includes leading and trailing CNN stages, internal NN layers and a bypass connection. The two one-dimensional (1-D) CNN layers act as FIR filters to compensate the linear responses. The layers in the middle section correct the nonlinear response, with the output going to a linear FNN layer to reshape the output. Before the final CNN, a bypass connection is used to improve performance and enhance the training speed per [23].

Fig. 2. Neural network architecture for auxiliary (AUX) and digital pre-distortion (DPD); the batch normalization (BN) and hard tanh are for DPD training with $x_3[n]$ in Fig. 1.

Our modifications to the approach in [5] are shaded in Fig. 2. We replace the leading inner FNN used in [5] with bidirectional RNNs whose contextual window adjusts dynamically [24]. This approach is appropriate for memory-dependent nonlinearities. In section 4, we see that the bidirectional RNNs provide good performance/complexity trade-offs. We also examine three separate points, $x_i[n]$ in Fig. 1, for defining our channel inversion. For the third channel inversion approach, using $x_3[n]$ for training, we add the final stages of batch normalization (BN) and hard tanh to adapt to the scenario of a clipped and quantized DPD output.
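A PyTorch sketch of the resulting architecture is given below: a leading 1-D CNN, bidirectional recurrent layers, a linear FNN reshaping layer, a bypass connection before the trailing 1-D CNN, and the optional BN plus hard tanh output stage used for the $x_3[n]$ case. The layer widths, kernel sizes and the exact origin of the bypass are illustrative placeholders, not the values of Table 2.

```python
# Sketch of the Fig. 2 architecture; widths/kernels are placeholder assumptions.
import torch
import torch.nn as nn

class BiRnnDpd(nn.Module):
    def __init__(self, hidden=16, kernel=41, cell=nn.GRU, quantized_output=True):
        super().__init__()
        self.cnn_in = nn.Conv1d(1, 1, kernel, padding=kernel // 2)   # linear FIR stage
        self.rnn = cell(input_size=1, hidden_size=hidden,
                        num_layers=2, batch_first=True, bidirectional=True)
        self.fnn = nn.Linear(2 * hidden, 1, bias=False)              # linear reshaping
        self.cnn_out = nn.Conv1d(1, 1, kernel, padding=kernel // 2)
        self.quantized_output = quantized_output
        if quantized_output:                       # extra stages for x3[n] training
            self.bn = nn.BatchNorm1d(1)
            self.clip = nn.Hardtanh(-1.0, 1.0)

    def forward(self, x):                          # x: (batch, 1, sequence_length)
        z = self.cnn_in(x)
        r, _ = self.rnn(z.transpose(1, 2))         # (batch, seq, 2*hidden)
        y = self.fnn(r).transpose(1, 2)
        y = self.cnn_out(y + z)                    # bypass connection before final CNN
        if self.quantized_output:
            y = self.clip(self.bn(y))              # force output into [-1, 1]
        return y

# Example: out = BiRnnDpd()(torch.randn(1, 1, 8192))
```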

We determined the most suitable hyper-parameters through multiple experiments, retaining for each DPD solution the parameters yielding the highest figure of merit, i.e., the highest effective SNR. We examined various hyper-parameters, initialization techniques and numbers of layers and nodes for each NN section. The configurations offering the best performance are provided in the appendix in Table 2 and Table 3.

3. Bidirectional RNN for digital pre-distortion

In section 3.1 we consider three different channel inversion strategies for a direct learning solution, and in section 3.2 we identify the best inversion strategy experimentally; we also apply it in the indirect learning approach. In section 3.3 we compare the performance of conventional and RNN solutions. In section 3.4 we see how iterating the learning process (for RNN, Volterra and LUT) can further improve performance.

3.1 Channel inversion strategies

We begin with a direct learning approach: we train an AUX model to mimic the system we wish to invert. We capture the transmitted and received data to train the AUX offline. Once the AUX training is complete, we noiselessly train another NN to invert the AUX. This becomes our DPD neural network.

We examine three strategies for channel inversion, i.e., using $x_i[n]$ in Fig. 4(a) to find the auxiliary model ${AUX}_{x_i}$. For each ${AUX}_{x_i}$ we find a ${DPD}_{x_i}$ inverting it. The complete transmitter for each case is given in Fig. 3. For example, when training on $x_{1}$, the ${DPD}_{x_1}$ is followed by the standard TX-DSP. The ${AUX}_{x_1}$ captures that DSP, so the inverted ${AUX}_{x_1}$ must be followed by the TX-DSP such that the net effect is inverting the hardware channel in Fig. 1.

Fig. 3. Transmitter with a digital pre-distortion block that (a) inverts the channel as well as the standard TX-DSP, (b) inverts the channel and clipping/quantization stage, or (c) inverts channel directly.

Fig. 4. Direct learning method block diagram of (a) AUX training for three input scenarios, and (b) DPD training to invert the (now fixed) AUX.

When the AUX input is $x_{3}$, the DPD has an additional BN layer and hard tanh activation function (see Fig. 2) to limit the output range. For this channel inversion approach, the AUX is trained on quantized inputs: the range [0,255] is mapped to [-1,1] to obtain smooth convergence. We use the hard tanh to force the DPD output into the range [-1,1] during the DPD training.

We train each of the three AUX models with labels $y[n]$, the experimental soft symbols after the RX-DSP. The AUX training scenarios are illustrated in Fig. 4. For an input of $x_{i}[n]$, the ${AUX}_{x_i}$ output $\hat {y}_{i}[n]$ is our prediction of the soft symbols. The AUX weights minimize the normalized mean square error (NMSE),

$$NMSE(u,v) = 10 \log_{10} \Bigg[ \frac{\sum_{n=1}^{N} |u[n]-v[n]|^2}{\sum_{n=1}^{N} |u[n]|^2} \Bigg],$$
where for scenario $i$, $u=y[n]$ and $v=\hat {y}_{i}[n]$. We save the optimized weights which serve as our numerical model of the channel.

Per Fig. 4(b), the DPD training proceeds with the AUX$_{x_i}$ box now fixed (no adaptation). The AUX$_{x_i}$ weights are the NN system model. The DPD inputs are always the logical symbols $d[n]=x_{1}[n]$. Again we use the NMSE as the optimality criterion. During the DPD$_{x_i}$ training, the output $v=\hat {x}_{i}[n]$ of the AUX$_{x_i}$ approximates the DPD input $u=x_{1}[n]$. When DPD training is completed, we save the optimized weights to test performance experimentally.
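The two-stage direct learning procedure can be summarized by the sketch below, which first fits the AUX to the de-noised channel data using the NMSE of Eq. (9) and then freezes it while the DPD learns to invert it. The optimizer choice and training settings are placeholders rather than the values of Table 3.

```python
# Sketch of the direct learning procedure of Fig. 4 (AUX first, then DPD).
import torch

def nmse_db(u, v):                                   # Eq. (9)
    return 10 * torch.log10(torch.sum((u - v) ** 2) / torch.sum(u ** 2))

def train_direct_learning(aux, dpd, x_i, y, d, epochs=100, lr=1e-3):
    # Stage 1: AUX learns the channel, x_i -> y (experimental soft symbols).
    opt = torch.optim.Adam(aux.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nmse_db(y, aux(x_i))
        loss.backward()
        opt.step()
    # Stage 2: AUX frozen; DPD learns to invert it so that AUX(DPD(d)) ~ d.
    for p in aux.parameters():
        p.requires_grad_(False)
    opt = torch.optim.Adam(dpd.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nmse_db(d, aux(dpd(d)))
        loss.backward()
        opt.step()
    return dpd
```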

3.2 Finding the best channel inversion strategy

Our experimental method for collecting training data is shown in Fig. 5(a). For all three DPDs we apply the RC and FIR filters to the data frames. For DPD$_{x_{1}}$ and DPD$_{x_{2}}$ we generate $N=500$ random sequences in frames of 8192 symbols. Following filtering, we clip the sequences at a clipping ratio (CR) of 85% and send them to the DAC. For the DPD$_{x_{3}}$ case, we generate $N=650$ random frames. Following filtering, we clip with a range of CRs (72%-85%) and send them to the DAC. Each of the 13 CRs examined includes 50 sequences. We generate this enriched training set for DPD$_{x_{3}}$ to ensure it learns the clipping operation.

Fig. 5. Summary of the experimental procedure for data collection to (a) train NN-DPD and (b) test its performance by computing the SNR.

The training data frames are transmitted continuously and multiple copies of each frame are captured by the RTO. We average over 20 copies of each sequence to reduce the AWGN, increasing the effective SNR by 13 dB. The de-noised data for a given CR is used to train the AUX NN.
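The 13 dB figure follows directly from the averaging: with $K$ captured copies the AWGN power drops by a factor $K$, i.e., $10\log_{10}(20) \approx 13$ dB. A trivial sketch of the de-noising step:

```python
# De-noising by averaging K captured copies of the same frame.
import numpy as np

def denoise_frames(captures):
    """captures: array of shape (K, frame_length), K RTO copies of one frame."""
    return captures.mean(axis=0)

print(10 * np.log10(20))   # ~13.0 dB expected effective SNR gain
```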

Our experimental method for evaluating performance is given in Fig. 5(b). A stream of $2^{18}$ logical symbols is input to the trained DPD. For DPD$_{x_{1}}$, the output is sent to the pulse shaping, clipping and quantization, and then the DAC. For DPD$_{x_{2}}$, the output is sent to clipping and quantization, and then the DAC. For DPD$_{x_{3}}$, the output is sent directly to the DAC. We capture one copy of the transmission at the RTO for offline DSP processing and estimate the effective SNR using (1).

We examine the three channel inversion approaches as we sweep V$_{pp}$. As the thermal noise is fixed, low V$_{pp}$ is least affected by nonlinear distortion. In Fig. 6, we present the performance of the three bidirectional recurrent neural network (BiRNN) solutions. The performance of DPD$_{x_{3}}$ is best at low V$_{pp}$. At higher V$_{pp}$ the advantage disappears. Small differences could be due to the random initialization of the weights.

Fig. 6. Performance comparison of three DL-BiRNN.

The introduction of BN and hard tanh with DPD$_{x_{3}}$, combined with an enriched training set at several CRs, allowed us to avoid the filtering and manual clipping optimization steps. Not only does this provide a less complex solution, it is also more robust. When DPD$_{x_{1}}$ or DPD$_{x_{2}}$ is deployed, we have to find a good clipping ratio manually by observing the effective SNR, which takes several transmissions. Hence, though performance is similar, DPD$_{x_{3}}$ is the best choice.

3.3 Comparing RNN and common DPD solutions

Having fixed the choice of DPD$_{x_{3}}$, we add indirect learning as an alternative candidate to the direct learning discussed previously. As seen in Fig. 7, indirect learning consists of only one training step, i.e., we estimate the inverse of the channel without using an auxiliary model. The computational complexity of the training phase is halved compared to DL. However, this approach suffers from noise enhancement and estimation error when the nonlinearity is high.

Fig. 7. Indirect learning block diagram for DPD training.

The optimality criterion is again the NMSE, here between the DPD output $\hat {x}_3[n]$ and the labels $x_{3}[n]$. The IL DPD has a similar NN structure to the DL DPD$_{x_{3}}$, differing only in the number of nonlinear layers and cells (as detailed in Table 2).
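A sketch of this single training step, reusing the nmse_db loss from the direct learning sketch above, is shown below; the network is trained as a post-inverse mapping the captured soft symbols $y[n]$ to the labels $x_3[n]$, and is then used unchanged as the pre-distorter. Optimizer settings are placeholders.

```python
# Sketch of indirect learning (Fig. 7): one training step, no AUX model.
import torch

def train_indirect_learning(dpd, y, x3, epochs=100, lr=1e-3):
    opt = torch.optim.Adam(dpd.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nmse_db(x3, dpd(y))     # invert the channel directly from data
        loss.backward()
        opt.step()
    return dpd                         # reused as the pre-distorter at the TX
```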

We transmit PAM-8 and vary the amount of nonlinear distortion by sweeping V$_{pp}$. The effective SNR in dB is plotted vs. V$_{pp}$ in mV in Fig. 8(a). The curve with asterisk markers is the linear solution. We see that amplifying the signal brings some improvement with linear pre-distortion, but beyond a V$_{pp}$ of 340 mV the nonlinear effects become dominant.

Fig. 8. For various pre-distortion techniques (a) effective SNR versus $V_{pp}$, and (b) histograms of PAM-8 reception for $V_{pp}$=400 mV.

Consider next the two common solutions for nonlinear pre-distortion, the Volterra series and the LUT. The LUT-3, shown with triangular markers, outperforms the linear solution in all regimes. The LUT-3 sees improvement until a V$_{pp}$ of 370 mV, after which its performance falls off, though less precipitously than in the linear pre-compensation case. The Volterra solution is presented with diamond markers and outperforms the linear and LUT-3 solutions in all regimes. Volterra shows improvement until a V$_{pp}$ of 400 mV.

Finally we consider the two neural network solutions using BiRNNs and point $x_{3}[n]$. Both BiRNNs outperform the linear, LUT-3 and Volterra solutions. They, like Volterra, see improvement until a V$_{pp}$ of 400 mV. We re-train each DPD for every value of V$_{pp}$. The DL-BiRNN gives the greatest effective SNR gain over the nonlinear benchmarks. At a V$_{pp}$ of 400 mV, there is a 2.9 dB, 1.9 dB and 0.9 dB improvement over the linear, LUT-3 and Volterra DPDs, respectively. The DL-BiRNN gives a higher SNR improvement than the IL-BiRNN over all operating points, as its training is more effective. A detailed comparison of IL and DL is presented in [25].

In Fig. 8(b), we present the histograms of four DPD techniques at a V$_{pp}$ of 400 mV. Compared to the other cases, the linear DPD has histograms with higher noise and greater variation across the symbols. The outer symbols are more prone to errors than the innermost symbols; the worst-case outer symbols cause the errors that dominate the bit error rate (BER). Both the LUT and Volterra pre-distortions lower the noise level and improve the shape of the noise around the symbols. The DL-BiRNN approach exhibits the narrowest noise shapes. The NN approach shows a significant reduction in both noise level and nonlinear distortion, with enhanced symbol uniformity. However, symbols located on the outer edges still contribute the errors that dominate the BER.

3.4 Improving DPD with iterations

We can improve the DPD by iterating the process. Our first iteration (the approach used in the preceding section) collects training data when using linear pre-compensation. We use averaging to de-noise our training set, but the nonlinear distortion remains. A second iteration with nonlinear pre-distortion leads to a training set with better symbol recovery, as there is less residual nonlinear distortion. The iterative approach can be applied to the NN as well as to Volterra and LUT. When improvement stagnates from one iteration to the next, we devote more resources to receiver DSP (in the form of MMSE filter taps) to see if we can jump-start the iterative improvement in the next round. In the previous section we used 257 MMSE filter taps at reception, while now we move gradually from 101 to 501 taps.

Our experimental method for iterative NN training is shown in Fig. 9(a). At iteration $i$, we send logical symbols through the DL-BiRNN found in iteration $i-1$. We train on $N=500$ frames; there is no processing between the DPD and the DAC. The RTO captures the transmission and we average over 20 copies of each frame of 8192 symbols to reduce AWGN. We use the de-noised data to train the AUX$_i$; DPD$_i$ is trained from AUX$_i$.

Fig. 9. Summary of the experimental procedure for data collection iterations to (a) train NN-DPD and (b) test its performance by computing the SNR.

Our experimental method for evaluating performance is given in Fig. 9(b). If the effective SNR has not improved, on the next iteration we increase the number of taps in the receiver MMSE filter. We repeat this process until no further improvement in performance is observed. For Volterra and LUT, at each iteration we need to cascade the previously trained DPDs to retain the SNR improvement; for $n$ iterations, the complexity of Volterra and LUT is therefore multiplied by $n$.
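The overall iterative procedure of Fig. 9 can be summarized by the sketch below. The helper functions (transmit_and_capture, train_aux, train_dpd, measure_effective_snr) are hypothetical wrappers for the experimental and training steps described above, and the tap increment is illustrative.

```python
# High-level sketch of the iterative NN-DPD procedure (hypothetical helpers).
def iterate_dpd(dpd, mmse_taps=101, max_taps=501, max_iters=7):
    best_snr = measure_effective_snr(dpd, mmse_taps)
    for _ in range(max_iters):
        frames = transmit_and_capture(dpd, n_frames=500, copies=20)  # averaged copies
        aux = train_aux(frames)                 # AUX_i from de-noised data
        dpd = train_dpd(aux)                    # DPD_i inverts AUX_i
        snr = measure_effective_snr(dpd, mmse_taps)
        if snr <= best_snr and mmse_taps < max_taps:
            mmse_taps += 100                    # devote more resources to RX DSP
        best_snr = max(best_snr, snr)
    return dpd, best_snr
```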

In Fig. 10 we present the effective SNR at V$_{pp}$=400 mV, where the nonlinear effect limits performance and nonlinear pre-distortion thus has the greatest gain. We see that the LUT has virtually no improvement from iterations. All of its improvement is due to the greater length of the MMSE filter at reception; there is a performance plateau at a filter length of 457 taps. The Volterra sees no improvement from iteration unless the reception filter is quite long, above 400 taps. Between iterations 4 and 6 we see the maximal improvement of 0.2 dB due solely to iteration. There is an improvement of 0.8 dB due to the receiver post-processing, with Volterra achieving 20.7 dB and LUT-3 19.4 dB.

Fig. 10. Effective SNR at $V_{pp}$=400 mV vs. iteration number for three pre-distortion approaches. Annotations at each increment give tap length of receiver MMSE filter.

The DL-BiRNN approach shows improvement both from iteration and from greater filter length. The only transition where we do not see improvement is the final one, from iteration 6 to 7 with 501 filter taps. Iteration and MMSE filter length contribute roughly equally to the improvement from 20.5 dB (iteration 1) to 21.4 dB (iteration 6). It is interesting that the Volterra and LUT solutions see 0.8 dB improvement with iterations and/or increased MMSE length, but only the NN sees an improvement of 0.9 dB due mainly to iterations. The DL-BiRNN has the best performance of 21.4 dB effective SNR.

4. Complexity vs. performance

We have seen in the previous section that the DL-BiRNN achieves the best performance. In this section we focus on the complexity/performance trade-offs. We compare the trade-off for Volterra and for several NN solutions. We examine three types of recurrent neural network (RNN) cells, and include an FNN to highlight the complexity advantages of the RNN solutions.

4.1 Volterra complexity

In section 2.2 we used a Volterra series of order $P=3$. The complexity of the Volterra DPD in terms of real-valued multiplications (RVMs), $CC_V$, can be calculated from the structure of Eq. (2) as [21]

$$\begin{aligned} CC_{V}(P,M) & = \sum_{p=1}^{P} \frac {(2M_{p}+p)!}{(p-1)!(2M_{p})!},\\ \end{aligned}$$

The number of RVMs increases with both higher order and longer memory.
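Equation (10) is easy to evaluate; a direct implementation is sketched below, using as an example the memory lengths of section 3.3 ($M_1=125$, $M_2=5$, $M_3=2$).

```python
# Direct implementation of Eq. (10): RVMs of an order-P Volterra DPD.
from math import factorial

def cc_volterra(M):
    """M = [M_1, ..., M_P]; returns the total number of RVMs."""
    return sum(factorial(2 * Mp + p) // (factorial(p - 1) * factorial(2 * Mp))
               for p, Mp in enumerate(M, start=1))

print(cc_volterra([125, 5, 2]))   # order-3 Volterra used in section 3.3
```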

To identify reasonable trade-offs of performance and complexity, we found the SNR improvement for an order-3 Volterra when sweeping the memory lengths $M_{2}$ and $M_{3}$. We fixed the linear memory length to $M_{1}=125$ and examined the nonlinear operating region at $V_{pp}=400$ mV. For each $(M_{2},M_{3})$, we found the Volterra weights and experimentally estimated the effective SNR. For each $M_{3}$ we swept $M_{2}$ (with a step size of one) until no improvement in SNR was observed.

The results are presented in Fig. 11, where the best performance is 20.5 dB. As performance saturates, adding complexity does not lead to performance improvement. We identify the sets $(M_{2},M_{3})$ such that $SNR \geq \tau$ for thresholds of 19.3 dB, 20 dB and 20.4 dB. For each set we calculate the required number of RVMs and retain the $(M_{2},M_{3})$ with the lowest complexity. For our complexity/performance comparisons in section 4.4 we retain these three DPD points: (3,1), (5,2) and (11,3), identified in Fig. 11 as $V_1$, $V_2$ and $V_3$, respectively; $V_2$ is the point used in section 3.3.

Fig. 11. Volterra DPD effective SNR at $V_{pp}=400$ mV with $M_{1}=125$, sweeping $M_{2}$ and $M_{3}$. Stars are the points ($V_{1}$, $V_{2}$ and $V_{3}$) used for complexity/performance comparisons.

4.2 Neural network complexity

We examine three types of RNN cells, as illustrated in Fig. 12, to use in our bidirectional RNN. The first is the simplest, the standard RNN. It has limited contextual information for data at long separations, due to the vanishing gradient causing the input to have exponentially decreasing impact on the hidden layers [26]. Therefore, we also consider the LSTM and GRU cells; these versions of the RNN are designed to tackle the problem of the vanishing gradient [27]. We use the bidirectional version in all cases, so we process the input in both the forward and backward directions during training. We confine our examination to easily available, standard RNN modules for ease in reproducing our results.

Fig. 12. Internal structure of standard RNN, GRU and LSTM cells.

In this section, we examine the computational operations of the three recurrent cell types. Referring to Fig. 12, for an RNN,

$$\begin{aligned} h_{t} & =\tanh(Wh_{t-1}+U x_{t}+b), \\ \end{aligned}$$
for a GRU,
$$\begin{aligned} z_{t} & =\sigma(W_{z}x_{t}+U_{z}h_{t-1}+b_{z}), \\ r_{t} & =\sigma(W_{r}x_{t}+U_{r}h_{t-1}+b_{r}), \\ h_{t}^{'} & =\tanh(W_{h} x_{t}+U_{h}(r_{t}\odot h_{t-1})+b_{h}), \\ h_{t} & =(1-z_{t})\odot h_{t-1}+ z_{t}\odot h_{t}^{'}, \\ \end{aligned}$$
and for an LSTM,
$$\begin{aligned} f_{t} & =\sigma(W_{xf}x_{t}+W_{hf}h_{t-1}+b_{f}), \\ i_{t} & =\sigma(W_{xi}x_{t}+W_{hi}h_{t-1}+b_{i}), \\ o_{t} & =\sigma(W_{xo}x_{t}+W_{ho}h_{t-1}+b_{o}), \\ c_{t}^{'} & =\tanh(W_{xc}x_{t}+W_{hc}h_{t-1}+b_{c}), \\ c_{t} & =(f_{t} \odot c_{t-1})+(i_{t} \odot c_{t}^{'}), \\ h_{t} & =o_{t} \odot \tanh (c_{t}), \end{aligned}$$

In Eqs. (11) and (12), the matrices $W$ and $U$ hold the connection weights; in Eq. (12) they are associated with the reset gate ($r_{t}$), update gate ($z_{t}$) and candidate activation vector ($h_{t}^{'}$), while in Eq. (13) the matrices $W$ give the connection weights for the cell state ($c_{t}$) and the input ($i_{t}$), output ($o_{t}$) and forget ($f_{t}$) gates. The input, bias, current and previous hidden state vectors are $x_{t}$, $b$, $h_{t}$ and $h_{t-1}$, respectively. The symbol $\odot$ represents element-wise multiplication, $\sigma$ is the logistic sigmoid function, and $\tanh$ is the hyperbolic tangent function.

The complexity of an RNN depends on several factors, including the number of hidden units and weights, and the length of the input sequence. The input sequence length can be selected independently of the model architecture and should be determined by the channel memory. The complexity of each cell distinguishes the models from one another. Following Eq. (11) to Eq. (13), the number of parameters for an RNN model with two recurrent layers and an equal number of cells per layer can be computed as [21]

$$\begin{aligned} P_{BiRNN} & = 2G[H(QH+F_{1}+F_{2})+QH]+J(2H+1), \\ \end{aligned}$$
where only the number of gates, $G$, varies across RNN cell types; $G$ is 1, 3 and 4 for the RNN, GRU and LSTM, respectively. The number of hidden cells $H$ can be found in the appendix in Table 2. The factor $Q$ is 1 for IL and 2 for DL, based on the number of recurrent layers. For both learning methods, the number of input features of the first recurrent layer, $F_{1}$, and the output size, $J$, are both $1$ due to the PAM-8 data. The number of input features of the second recurrent layer, $F_{2}$, is $40$ for DL, while it is $0$ for IL.

Combining the length of input symbol sequence $I$ with the number of parameters $P_{BiRNN}$, we can compute the number of RVMs for bi-directional RNN, per

$$\begin{aligned} CC_{BiRNN} & = I \cdot \left(P_{BiRNN} - J \right), \end{aligned}$$
where $I=8192$. The subtraction reflects setting the bias of the linear layer to zero.
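Equations (14) and (15) can be evaluated directly, as sketched below; the hidden size used in the example is an illustrative value, not the one in Table 2.

```python
# Direct implementation of Eqs. (14)-(15) for the bidirectional RNN.
def p_birnn(G, H, Q, F1=1, F2=0, J=1):
    """G = gates (1 RNN, 3 GRU, 4 LSTM); H = hidden cells; Q = 1 (IL) or 2 (DL)."""
    return 2 * G * (H * (Q * H + F1 + F2) + Q * H) + J * (2 * H + 1)

def cc_birnn(G, H, Q, I=8192, F1=1, F2=0, J=1):
    return I * (p_birnn(G, H, Q, F1, F2, J) - J)

# Example (illustrative H): an IL GRU with H=16 hidden cells, Q=1.
rvms_per_symbol = cc_birnn(G=3, H=16, Q=1) / 8192
```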

For the sake of complexity comparison, we also include an FNN. Its performance is sub-optimal, as the FNN weights are updated without taking into account any temporal context. Instead, we use a sliding-window input to allow the FNN to learn the memory effect: past and future delayed symbols are used at the input. We do not optimize the FNN performance, but rather use an architecture and parameters as close as possible to the BiRNN case.

The FNN we examine has an architecture similar to that in Fig. 2. We replace the RNN layers with five FNN layers of 21, 12, 8, 8 and 1 neurons, respectively. The first and last FNN layers are linear; the others use the Leaky ReLU activation function with a negative slope of 0.1. The output of the first CNN layer is converted into a sliding window of size 11 before being fed to the FNN layers. Training hyper-parameters are similar to those in Table 3. For the DL-FNN, we use the already trained AUX (based on the BiRNN). In terms of complexity, the DL and IL FNNs have the same RVMs, as their network size hyper-parameters are identical. We find the RVMs by multiplying the input length of each FNN layer by its number of neurons (as we set the bias to zero).

4.3 Many-to-one vs. many-to-many

We do not consider the complexity of finding the Volterra weights or of training the NN. These operations are performed once, while DPD calculations are made on live data. Therefore, we focus our comparison on the number of RVMs per symbol. In the complexity comparison, we consider only the nonlinear section of the DPDs, as it is more prone to high complexity; we exclude the complexity of the CNN layers in the NN-DPD and of the first order terms in the Volterra. In the case of the Volterra series, for each single output we make $CC_V$ (Eq. (10)) real-valued multiplications (RVMs). Many inputs are included in the calculation that yields one output in Eq. (6). Hence $CC_V$ is the number of RVMs per symbol. For the same reason, the FNN is also many-to-one.

In the case of a bidirectional RNN, we operate on a block of input data. The data must be run forward and backward through the RNN to generate the output for the entire block. This choice was made due to the nature of the system, where the nonlinear quantization distortion interacts with the bandwidth limitation of the DAC. Memory and nonlinearity can be captured simultaneously in the BiRNN, especially in the case of the GRU and LSTM.

Our BiRNN approach is many-to-many [28]; $CC_{BiRNN}$ RVMs yield many symbol decisions. This approach significantly reduces the RVMs per symbol vis-à-vis the Volterra solution. Our block length is 8192, so the number of RVMs per symbol is $CC_{BiRNN}/8192$. The overall DPD architecture (Fig. 2) and network size hyper-parameters (Table 2) are similar for all BiRNNs; the only difference is the type of RNN cell.

4.4 Complexity/performance trade-offs

In Fig. 13(a) we plot the effective SNR in dB for several DPD solutions. For the Volterra DPD we include the three solutions identified in section 4.1. For the four NN solutions we show the DL performance with circle markers and the IL performance with triangle markers. As expected, the FNN does not perform as well as the RNN solutions. The DL outperforms the IL, as found in previous examinations [5]. The effective SNR shown in Fig. 8(a) is for the Volterra $V_{2}$ and the BiRNN.

Fig. 13. For $V_{pp}=400$ mV, (a) performance (effective SNR) and (b) complexity (RVMs per symbol) of several DPD solutions.

In Fig. 13(b) we plot the RVMs per symbol for the DPD solutions we examined. The Volterra solutions are spread out by design. The FNN complexity falls between that of the DL and IL solutions for the RNNs. The DL has twice the number of RNN layers as the IL for all three bidirectional RNNs, hence the separation of the two curves. For the FNN, the same structure is used for DL and IL, so they have equal complexity.

For the Volterra DPD we have a good spread of performance/complexity trade-offs. We can achieve a 1 dB increase in effective SNR by tolerating an order of magnitude increase in complexity. Increasing complexity beyond this point does not lead to further improvement, as the performance is saturated. The most complex Volterra has similar performance to the IL-BiRNN, but the IL-BiRNN has fewer RVMs per symbol. This RNN does not outperform Volterra, but it is more computationally efficient due to the many-to-many advantage. If we choose the Volterra $V_2$ solution, with very similar complexity to the IL-BiRNN, the NN outperforms the Volterra by 0.4 dB. We did not consider the numerous studies to reduce the complexity of the Volterra [29] or RNNs [30], but rather constrained our evaluation to traditional models for both Volterra and RNN.

When a performance increase is critical, we can turn to the DL-BiRNN. We see that for the nonlinear system examined, the vanishing gradient is not a significant factor. The standard BiRNN performs almost as well as the GRU and LSTM versions with their gating mechanisms to manage long-short term dependencies. The significantly higher complexity of the GRU and LSTM is not justified here. These results are for a single iteration and an MMSE equalizer of 257-tap length. Using iterations and a longer MMSE filter could increase the effective SNR to 21.4 dB for the BiRNN at similar RVMs per symbol. The BiRNN is the preferred solution for high-performance systems that can tolerate increased complexity (less than an order of magnitude) compared to Volterra.

5. Conclusion

We presented several NN-based DPD techniques to pre-compensate the impairments emerging from a DAC. Such DACs are essential for the most spectrally efficient and highest-bandwidth optical communications. Our bidirectional recurrent neural network solution was compared against the more typical Volterra, LUT and linear DPDs. We considered both DL and IL methods for training our RNN DPD. The DL approach achieves the highest performance, while the IL offers an attractive performance/complexity trade-off. For the case of PAM-8, the DL-BiRNN achieved approximately 0.9 dB, 1.9 dB and 2.9 dB SNR gains with respect to the Volterra, LUT and linear DPDs, respectively. The many-to-many approach of the RNN DPD reduces the computational complexity compared to the many-to-one approach of the FNN and Volterra.

Appendix

Table 2. Network size hyper-parameters of the NN AUX and DPD for both DL and IL. For the $x_{3}[n]$ case, the NN DPD has additional batch normalization (BN) and hard tanh units

Table 3. Training hyper-parameters of the NN AUX and DPD for both IL and DL. For the $x_{3}[n]$ case, the training epochs are 650 and the data length is 650 × 8192

Funding

Natural Sciences and Engineering Research Council of Canada (IRCPJ546377-18).

Disclosures

The authors declare no conflicts of interest.

Data Availability

Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request.

References

1. F. Buchali, V. Aref, M. Chagnon, K. Schuh, H. Hettrich, A. Bielik, L. Altenhain, M. Guntermann, R. Schmid, and M. Möller, “1.52 Tb/s single carrier transmission supported by a 128 GSa/s SiGe DAC,” in Optical Fiber Communications Conference (OFC), (2020), pp.1–3.

2. H. Sun, M. Torbatian, M. Karimi, R. Maher, S. Thomson, M. Tehrani, Y. Gao, A. Kumpera, G. Soliman, et al., “800G DSP ASIC design using probabilistic shaping and digital sub-carrier multiplexing,” J. Lightwave Technol. 38(17), 4744–4756 (2020). [CrossRef]

3. R. Dubé-Demers, S. LaRochelle, and W. Shi, “Low-power DAC-less PAM-4 transmitter using a cascaded microring modulator,” Opt. Lett. 41(22), 5369–5372 (2016). [CrossRef]  

4. J. Zhang, H.-C. Chien, Y. Xia, Y. Chen, and J. Xiao, “A novel adaptive digital pre-equalization scheme for bandwidth limited optical coherent system with DAC for signal generation,” in Optical Fiber Communications Conference (OFC), (2014), pp.1–3.

5. V. Bajaj, F. Buchali, M. Chagnon, S. Wahls, and V. Aref, “Deep neural network-based digital pre-distortion for high baudrate optical coherent transmission,” J. Lightwave Technol. 40(3), 597–606 (2022). [CrossRef]  

6. H. Faig, Y. Yoffe, E. Wohlgemuth, and D. Sadot, “Dimensions-reduced Volterra digital pre-distortion based on orthogonal basis for band-limited nonlinear opto-electronic components,” IEEE Photonics J. 11(1), 1–13 (2019). [CrossRef]  

7. G. Khanna, B. Spinnler, S. Calabró, E. De Man, and N. Hanik, “A robust adaptive pre-distortion method for optical communication transmitters,” IEEE Photon. Technol. Lett. 28(7), 752–755 (2016). [CrossRef]  

8. C. Eun and E. Powers, “A new Volterra predistorter based on the indirect learning architecture,” IEEE Trans. Signal Process. 45(1), 223–227 (1997). [CrossRef]  

9. J. Kim and K. Konstantinou, “Digital predistortion of wideband signals based on power amplifier model with memory,” Electron. Lett. 37(23), 1417–1418 (2001). [CrossRef]  

10. J. H. Ke, Y. Gao, and J. C. Cartledge, “400 Gbit/s single-carrier and 1 Tbit/s three-carrier superchannel signals using dual polarization 16-QAM with Look-up table correction and optical pulse shaping,” Opt. Express 22(1), 71–84 (2014). [CrossRef]  

11. J. Zhang, P. Gou, M. Kong, K. Fang, J. Xiao, Q. Zhang, X. Xin, and J. Yu, “PAM-8 IM/DD transmission based on modified Look-up table nonlinear predistortion,” IEEE Photonics J. 10(6), 1–9 (2018).

12. J. Hassani and M. Kamarei, “A flexible method of LUT indexing in digital predistortion linearization of RF power amplifiers,” in IEEE International Symposium on Circuits and Systems (ISCAS), vol.1 (2001), pp.53–56.

13. C. Tarver, A. Balatsoukas-Stimming, and J. R. Cavallaro, “Design and implementation of a neural network based predistorter for enhanced mobile broadband,” in IEEE International Workshop on Signal Processing Systems (SiPS), (2019), pp.296–301.

14. X. Hu, Z. Liu, X. Yu, Y. Zhao, W. Chen, B. Hu, X. Du, X. Li, M. Helaoui, W. Wang, and F. M. Ghannouchi, “Convolutional neural network for behavioral modeling and predistortion of wideband power amplifiers,” IEEE Trans. Neural Netw. Learning Syst. 33(8), 3923–3937 (2022). [CrossRef]  

15. T. Gotthans, G. Baudoin, and A. Mbaye, “Digital predistortion with advance/delay neural network and comparison with Volterra derived models,” in IEEE 25th Annual International Symposium on Personal, Indoor, and Mobile Radio Communication (PIMRC), (2014), pp.811–815.

16. M. Rawat, K. Rawat, and F. M. Ghannouchi, “Adaptive digital predistortion of wireless power amplifiers/transmitters using dynamic real-valued focused time-delay line neural networks,” IEEE Trans. Microwave Theory Techn. 58(1), 95–104 (2010). [CrossRef]  

17. D. Wang, M. Aziz, M. Helaoui, and F. M. Ghannouchi, “Augmented real-valued time-delay neural network for compensation of distortions and impairments in wireless transmitters,” IEEE Trans. Neural Netw. Learning Syst. 30(1), 242–254 (2019). [CrossRef]  

18. M. Abu-Romoh, S. Sygletos, I. D. Phillips, and W. Forysiak, “Neural-network-based pre-distortion method to compensate for low resolution DAC nonlinearity,” in European Conference on Optical Communication (ECOC), (2019), pp.1–4 .

19. M. Schaedler, M. Kuschnerov, S. Calabró, F. Pittalá, C. Bluemm, and S. Pachnicke, “AI-based digital predistortion for IQ mach-zehnder modulators,” in Asia Communications and Photonics Conference (ACP), (2019), pp.1–3.

20. G. Paryanti, H. Faig, L. Rokach, and D. Sadot, “A direct learning approach for neural network based pre-distortion for coherent nonlinear optical transmitter,” J. Lightwave Technol. 38(15), 3883–3896 (2020). [CrossRef]  

21. S. Deligiannidis, C. Mesaritakis, and A. Bogris, “Performance and complexity analysis of bi-directional recurrent neural network models versus Volterra nonlinear equalizers in digital coherent systems,” J. Lightwave Technol. 39(18), 5791–5798 (2021). [CrossRef]  

22. S. Zhalehpour, J. Lin, W. Shi, and L. A. Rusch, “Reduced-size lookup tables enabling higher-order QAM with all-silicon IQ modulators,” Opt. Express 27(17), 24243–24259 (2019). [CrossRef]

23. Y. Wu, U. Gustavsson, A. G. i. Amat, and H. Wymeersch, “Residual neural networks for digital predistortion,” in IEEE Global Communications Conference (GLOBECOM), (2020), pp.01–06.

24. S.-C.-K. Kalla, C. Gagné, M. Zeng, and L. A. Rusch, “Recurrent neural networks achieving MLSE performance for optical channel equalization,” Opt. Express 29(9), 13033–47 (2021). [CrossRef]  

25. H. Paaso and A. Mammela, “Comparison of direct learning and indirect learning predistortion architectures,” in IEEE International Symposium on Wireless Communication Systems, (2008), pp.309–313.

26. A. Graves, M. Liwicki, S. Fernández, R. Bertolami, H. Bunke, and J. Schmidhuber, “A novel connectionist system for unconstrained handwriting recognition,” IEEE Trans. Pattern Anal. Mach. Intell. 31(5), 855–868 (2009). [CrossRef]  

27. F. Gers, J. Schmidhuber, and F. Cummins, “Learning to forget: continual prediction with LSTM,” in International Conference on Artificial Neural Networks (ICANN), (1999), pp. 850–855.

28. I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” in Neural Information Processing Systems (NIPS), (2014), pp. 3104–3112.

29. N.-P. Diamantopoulos, H. Nishi, W. Kobayashi, K. Takeda, T. Kakitsuka, and S. Matsuo, “On the complexity reduction of the second-order Volterra nonlinear equalizer for IM/DD systems,” J. Lightwave Technol. 37(4), 1214–1224 (2019). [CrossRef]  

30. M. Zhu and S. Gupta, “To prune, or not to prune: exploring the efficacy of pruning for model compression,” arXiv, arXiv:1710.01878 (2017). [CrossRef]  
