## Abstract

We explore recurrent and feedforward neural networks to mitigate severe inter-symbol interference (ISI) caused by bandlimited channels, such as high speed optical communications systems pushing the frequency response of transmitter components. We propose a novel deep bidirectional long short-term memory (BiLSTM) architecture that strongly emphasizes dependencies in data sequences. For the first time, we demonstrate via simulation that for QPSK transmission the deep BiLSTM achieves the optimal bit error rate performance of a maximum likelihood sequence estimator (MLSE) with perfect channel knowledge. We assess performance for a variety of channels exhibiting ISI, including an optical channel at 100 Gbaud operation using a 35 GHz silicon photonic (SiP) modulator. We show how the neural network performance deteriorates with increasing modulation order and ISI severity. While no longer achieving MLSE performance, the deep BiLSTM greatly outperforms linear equalization in these cases. More importantly, the neural network requires no channel state information, while its performance is comparable to conventional equalizers with perfect channel knowledge.

© 2021 Optical Society of America under the terms of the OSA Open Access Publishing Agreement

## 1. Introduction

To meet rapidly growing traffic demand, optical communications systems are turning to advanced modulation formats combined with coherent detection. The electrical bandwidth limitations of the transceiver components pose the major challenges in achieving higher data rates. Inter-symbol interference (ISI) due to this band-limitation, rather than signal-to-noise ratio, can be the principal impairment in higher order QAM modulation in high-speed optical communication.

ISI can be mitigated via post compensation of the received signal. The maximum-likelihood sequence estimator (MLSE) is an excellent solution to combat ISI, providing the optimal performance by finding the most probable transmitted sequence. However, an MLSE is highly complex and becomes infeasible with increasing modulation order and ISI memory length. Moreover, to achieve this optimal performance, the MLSE equalizer requires accurate channel state information (CSI). Research focus has thus shifted towards sub-optimal solutions. The minimum mean squared error (MMSE) equalizer provides an optimal linear solution. However, the MMSE performance quickly deteriorates with severe ISI.

Recently, machine learning [1] and deep learning [2,3] techniques have been applied in many areas of communication [4–7]. Neural networks (NN) hold the potential to learn the channel indirectly from the data during the training, without the need for explicit CSI.

In [8], a recurrent neural network (RNN) is used to mitigate ISI induced by a Poisson channel; the performance approaches that of MLSE. However, a Poisson channel is a poor model for high data-rate optical communications, and useful only for very low data rate channels. In [9], the authors estimate the weights of the Viterbi trellis for the MLSE using a deep neural network (DNN). This model performs well under unknown channel conditions, but like MLSE, its computational complexity grows exponentially with modulation order and channel memory length.

In [10], several RNN structures are trained to decode convolutional codes over an additive white Gaussian noise (AWGN) channel. It is shown that the enhanced bidirectional RNN-based decoder can approach the performance of the maximum likelihood Viterbi decoder when the encoding memory is six or lower. The long short-term memory (LSTM) version of the RNN is found to be particularly effective. However, only binary and QPSK data are examined and the only impairment is AWGN.

In [11], an LSTM neural network is utilized to compensate fiber nonlinearities in digital coherent systems. Numerical results illustrate that LSTM is highly efficient in combating non-linear impairments in fiber. Additionally, various NN structures, such as feedforward NN [12], cascade RNN [13] and bidirectional RNN [14], have been proposed for equalizing non-linear effects in non-coherent short-reach direct detection systems, and they outperform conventional feedforward equalization, decision feedback equalization and the Volterra series-based equalizer in nonlinear equalization scenarios.

In this paper, we examine both recurrent and feedforward NNs to target high data rates in severely bandlimited channels. Compared with other optical communications studies, we focus on data rates greatly exceeding available bandwidth [15] rather than nonlinearities. We therefore also focus on pattern dependencies induced by various channel types, rather than the correlations induced in convolutional codes at the transmitter [10]. We also examine higher order modulation formats where complexity is particularly challenging for MLSE solutions, while [10] was limited to binary and QPSK.

We study two families of ISI channels, one multipath family and another super Gaussian family, to sweep through ISI severity. While these are synthetic channels, we also examine an experimentally measured optical channel from a silicon photonic (SiP) modulator [15] operated at 100 Gbaud. We propose to use LSTM [16] for ISI compensation, and extend examinations begun in [17,18]. The recent success of LSTM for convolutional codes [10], as well as speech recognition [19–21] and machine translation [22,23] motivated our examination.

We show the deep bidirectional LSTM (BiLSTM) architecture is very promising for processing sequential dependencies. To the best of our knowledge, ours was the first demonstration of LSTM-RNNs achieving the MLSE performance in bandlimited channels with severe ISI for QPSK. We further examine higher modulation orders where deep BiLSTM achieves performance better than MMSE and approaches that of MLSE. The severity of the ISI determines how closely we can approach MLSE performance. Nonetheless, deep BiLSTM still yields substantial gain over MMSE, and offers better scalability than MLSE.

The rest of the paper is organized as follows. In section 2, we introduce conventional optimal equalizers and bandlimited channels (specifically those examined in our simulations) with their performance metrics. In section 3, we present the proposed feedforward NN and deep BiLSTM in detail. In section 4, we demonstrate that deep BiLSTM outperforms other NN solutions for QPSK, attaining MLSE performance. In section 5, we move to higher order QAM and quantify performance penalties with increasing levels of ISI. In section 6, we provide some discussion of relevant NN characteristics, such as convergence and complexity. Concluding remarks are provided in section 7.

## 2. System description and preliminaries

The functional block diagram shown in Fig. 1 illustrates the signal flow through a typical communication system. The modulated QAM signal is generated at the transmitter and passes through a channel with bandlimited components. At the receiver, additive white Gaussian noise (AWGN) corrupts the received symbols. When high data rate signals are transmitted through these bandlimited components, the signal is distorted due to severe attenuation at high frequencies. This results in severe ISI, even when using Nyquist pulse shaping to mitigate these effects.

To achieve reliable transmission, we can use post-compensation equalizers. In the next subsection we describe the optimal linear and nonlinear conventional equalizers. Following that we describe several collections of bandlimited channels that will be examined, and the motivation for their use. In particular, we quantify the conventional equalizer performance for these channels. We will present the NNs we adopted in the next section.

#### 2.1 Conventional receivers

The MMSE receiver is the optimal linear approach to symbol-by-symbol detection to mitigate ISI. It is a model-based equalizer taking the form of a finite-impulse response filter. For a known channel, that is for perfect CSI, the exact MMSE equalizer tap coefficients can be found. When CSI is not available, we can use data-driven adaptation of tap weights. The MMSE equalizer is relatively easy to implement and is an efficient solution for low ISI channels. Its performance is highly sub-optimal for compensating high ISI.
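As a rough illustration of data-driven tap adaptation, the sketch below trains a linear FIR equalizer with the least-mean-squares (LMS) rule. LMS is one common choice; the text does not state which adaptation algorithm is used, and the two-tap channel, real-valued symbols, step size `mu`, tap count and decision delay are all illustrative assumptions.

```python
import random

def lms_equalizer(received, targets, num_taps=7, delay=3, mu=0.01):
    """Adapt FIR equalizer taps with LMS; return (taps, squared errors)."""
    taps = [0.0] * num_taps
    sq_errors = []
    for n in range(num_taps - 1, len(received)):
        # Most recent sample first: window[k] = received[n - k].
        window = [received[n - k] for k in range(num_taps)]
        estimate = sum(w * x for w, x in zip(taps, window))
        error = targets[n - delay] - estimate
        taps = [w + mu * error * x for w, x in zip(taps, window)]
        sq_errors.append(error * error)
    return taps, sq_errors

random.seed(0)
channel = [1.0, 0.4]  # hypothetical mild-ISI channel, not one from the paper
symbols = [random.choice((-1.0, 1.0)) for _ in range(5000)]
received = [
    sum(h * symbols[n - k] for k, h in enumerate(channel) if n - k >= 0)
    + random.gauss(0.0, 0.05)
    for n in range(len(symbols))
]
taps, sq_errors = lms_equalizer(received, symbols)
early_mse = sum(sq_errors[:200]) / 200    # error while the taps are still adapting
late_mse = sum(sq_errors[-200:]) / 200    # error after convergence
```

On a mild channel such as this one the squared error settles to a small value; on channels with deep spectral nulls a linear filter of this kind is expected to suffer from noise enhancement, which is the limitation discussed above.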

For an ISI channel, the optimal nonlinear equalizer uses sequence detection rather than the symbol-by-symbol approach in an MMSE equalizer. It is known as the maximum likelihood sequence estimator (MLSE) equalizer. The MLSE is also a model-based equalizer, applicable to channels where a trellis-like architecture can describe symbol dependencies (as is the case for ISI channels) [24]. An exhaustive examination of all sequences is not necessary to find the optimal one, but the algorithm is highly complex and scales exponentially with both channel memory length and modulation order. Not only is prior CSI required, but MLSE performance also depends strongly on the quality of the CSI. The MMSE and MLSE provide “bookends” to the performance/complexity trade-off in equalization.

#### 2.2 ISI channels for simulation

The impact of ISI on BER performance depends on two factors: the receiver used and the channel frequency response. We use BER performance of MMSE (linear) and MLSE (nonlinear) receivers as a baseline of comparison with our proposed NN solutions. The relative performance of these two solutions (MMSE & MLSE) depends on the severity of the channel. In this subsection, we first introduce the collection of channels we simulate (including an experimental SiP modulator response), and secondly a means of quantifying the ISI impact of these channels vis-à-vis the performance baselines.

This work is motivated by the challenges in operating silicon photonic (SiP) modulators at baud rates that greatly exceed their nominal bandwidth. Our simulations include the ISI created by a system with the experimental SiP modulator frequency response for 100 Gbaud operation [15]. The measured impulse response is complex and estimated to 512 taps. We truncate the taps to three, representing 90% of the total energy in the taps. This was done to facilitate comparison with other channels examined with similar tap lengths. The two-sided frequency response of the truncated SiP channel is plotted in Fig. 2(a) over a 100 GHz range (impulse response at 100 Gsamples/sec). The SiP modulator has a 3 dB bandwidth of 35 GHz.
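The truncation step can be sketched as follows. The text does not say how the three retained taps are selected, so keeping the contiguous window with the most energy is an assumption, and the impulse response below is a hypothetical stand-in for the measured 512-tap response.

```python
def strongest_window(taps, length):
    """Return (start, window, energy_fraction) for the contiguous run of
    `length` taps that carries the most energy."""
    energies = [abs(t) ** 2 for t in taps]
    total = sum(energies)
    best_start = max(
        range(len(taps) - length + 1),
        key=lambda s: sum(energies[s:s + length]),
    )
    window = taps[best_start:best_start + length]
    fraction = sum(energies[best_start:best_start + length]) / total
    return best_start, window, fraction

# Hypothetical complex impulse response (illustrative values only).
measured = [
    0.02 + 0.01j, 0.05 - 0.02j, 0.60 + 0.10j, 0.70 - 0.20j,
    0.30 + 0.15j, 0.08 + 0.03j, 0.03 - 0.01j,
]
start, kept, fraction = strongest_window(measured, 3)
```

The returned `fraction` is the share of total tap energy retained by the truncated channel, the quantity quoted as 90% for the SiP response.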

The performance of our NN solutions will fall somewhere between the MMSE & MLSE benchmarks. To generalize our conclusions we need to expand our examination to channels that can sweep through ISI severity. We investigate two series of synthetic channel responses. The first series is a collection of multipath channels with three taps. We have chosen the tap weights to cover various frequency responses, as seen in Fig. 2(a). We note that these multipath channels are variations on a classic example appearing in [25]. In Fig. 2(b) we plot five super Gaussian channels (used extensively to model optical filters). Their taps take the form of a Gaussian exponential raised to a power (one to five in our parameterization, with the tap coefficients given in Table 1). While the multipath channels have three tap weights, we use five weights for the super Gaussian channels.

All channel impulse responses are normalized to have unit energy. The bit error rate (BER) vs. signal-to-noise ratio (SNR) can be easily found numerically (e.g., using existing toolboxes in Matlab) for the MMSE & MLSE receivers in the case of perfect CSI, that is, when the receiver knows the exact channel impulse response. Common forward error correction (FEC) techniques in optical communications systems are pegged to a FEC threshold of 3.8e-3 BER. Therefore each of our channels can be characterized by the SNR penalty of the equalizer vis-à-vis an ideal channel with no ISI. Figure 2(c) summarizes the gap between the SNR penalty for the MMSE and MLSE for QPSK modulation for the channels examined. We observe our selection spans a wide swath of penalties.
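The two reference quantities in this paragraph, unit-energy normalization and the ISI-free SNR at the FEC threshold, can be sketched in a few lines. The example assumes Gray-coded QPSK, takes SNR as the per-symbol ratio $E_s/N_0$ (the SNR convention is not spelled out in the text), and uses a hypothetical three-tap channel.

```python
import math

def normalize_unit_energy(taps):
    """Scale taps so that the sum of squared magnitudes equals one."""
    norm = math.sqrt(sum(abs(t) ** 2 for t in taps))
    return [t / norm for t in taps]

def qpsk_ber_awgn(snr_db):
    """Gray-coded QPSK bit error rate on an ISI-free AWGN channel.
    snr_db is the per-symbol SNR (Es/N0) in dB."""
    es_n0 = 10 ** (snr_db / 10)
    return 0.5 * math.erfc(math.sqrt(es_n0 / 2))

def snr_at_ber(target_ber, lo=0.0, hi=30.0, iters=60):
    """Bisection on the monotonically decreasing BER curve."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if qpsk_ber_awgn(mid) > target_ber:
            lo = mid      # BER still too high: need more SNR
        else:
            hi = mid
    return (lo + hi) / 2

taps = normalize_unit_energy([0.5, 1.0, 0.5])  # hypothetical three-tap channel
fec_snr_db = snr_at_ber(3.8e-3)                # ISI-free reference point
```

The SNR penalty of an equalizer on a given channel is then the SNR at which its simulated BER crosses 3.8e-3, minus `fec_snr_db`.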

## 3. Supervised machine learning for equalization

To overcome the limitations of conventional receivers and provide a lower complexity and near optimal solution, we investigate NNs as ISI channel equalizers in this section. We examine two NN architectures: a classic feed-forward NN (FFNN) and a recurrent NN (RNN), namely the long short-term memory (LSTM) [16]. The inspiration for the FFNN is to mimic the nonlinear nature of the MLSE, while the LSTM seeks to mimic the sequence estimation.

For both NN architectures, we first generate our M-QAM modulated training and validation data sets following the random uniform distribution, and transmit these sets through our simulated channel that adds ISI and AWGN. Then, we input the received symbols and target symbols to the corresponding NN for training. During training, we calculate the error performance on both training and validation sets and monitor the improvement using learning curves. A learning curve plots the error performance or model accuracy over the epochs, where an epoch is one complete presentation of the data set to the NN (see Fig. 8 in section 6.2). The cross-entropy (CE) is used as the error criterion, since we focus on classification in this paper. The CE error can be expressed as
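A standard form of the CE for an $M$-class decision, with notation assumed here rather than taken from the source, is

$$
\mathrm{CE} = -\frac{1}{N}\sum_{n=1}^{N}\sum_{m=1}^{M} t_{n,m}\,\log p_{n,m},
$$

where $N$ is the number of symbols in the batch, $t_{n,m}$ equals 1 if constellation point $m$ was transmitted at symbol $n$ and 0 otherwise, and $p_{n,m}$ is the corresponding class probability at the NN output.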

We update the weights in all NN models via a stochastic gradient approach. To assist in convergence we use an adaptive gradient, specifically the adaptive moment (Adam) optimization [26] with a learning rate of $\sim 10^{-2}$. Once the model is completely trained, we use the test data (new random data) for assessing the performance. More exactly, we first transmit the test data through the simulated channel, and then input the received in-phase (I) and quadrature (Q) coordinates into the already trained NN. The equalized QAM signal after the NN is demodulated using standard QAM decision boundaries in Matlab and on this basis, the output BER is calculated.
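The simulation chain around the NN can be sketched as below: uniformly random QPSK symbols, an FIR channel that introduces ISI, complex AWGN, and hard decisions on standard QPSK boundaries. The NN itself is omitted, the channel taps are hypothetical, and SNR is assumed to mean unit symbol energy over total noise variance.

```python
import math
import random

QPSK = [complex(i, q) / math.sqrt(2) for i in (-1, 1) for q in (-1, 1)]

def transmit(symbols, taps, snr_db, rng):
    """Convolve symbols with the channel taps and add complex AWGN."""
    noise_std = math.sqrt(10 ** (-snr_db / 10) / 2)  # per real dimension
    received = []
    for n in range(len(symbols)):
        clean = sum(h * symbols[n - k] for k, h in enumerate(taps) if n - k >= 0)
        noise = complex(rng.gauss(0, noise_std), rng.gauss(0, noise_std))
        received.append(clean + noise)
    return received

def hard_decision(sample):
    """Standard QPSK decision boundaries: the sign of I and of Q."""
    i = 1 if sample.real >= 0 else -1
    q = 1 if sample.imag >= 0 else -1
    return complex(i, q) / math.sqrt(2)

def symbol_error_rate(received, transmitted):
    errors = sum(hard_decision(r) != s for r, s in zip(received, transmitted))
    return errors / len(transmitted)

rng = random.Random(1)
tx = [rng.choice(QPSK) for _ in range(20000)]
rx_ideal = transmit(tx, [1.0], snr_db=20.0, rng=rng)       # ISI-free reference
rx_isi = transmit(tx, [0.8, 0.6], snr_db=20.0, rng=rng)    # hypothetical unit-energy ISI channel
ser_ideal = symbol_error_rate(rx_ideal, tx)
ser_isi = symbol_error_rate(rx_isi, tx)
```

Without any equalization the ISI channel produces far more decision errors than the ISI-free reference at the same SNR, which is the degradation the equalizers are meant to recover.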

#### 3.1 Feed-forward NN (FFNN)

FFNNs are the simplest and most widely used NNs; information flows only in the forward direction. We adopt the FFNN structure shown in Fig. 3 for ISI mitigation, which consists of an input layer, two hidden layers and one output layer. Note that during NN training we examine various choices for the number of hidden layers, neurons and input frame size (features).

The input layer for our application is the received IQ coordinates. To retain information on the sequential dependencies caused by ISI, we use a frame of buffered IQ measurements in a sliding window as input. That is, the input layer has 18 features: the I and Q measurements of the current symbol, plus those of four buffered past symbols and four buffered future symbols (nine symbols in total). Smaller frames led to worse BER, and larger frames did not improve BER. Given that the channels examined had only 3 or 5 taps in their impulse response, these values are not surprising.
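A minimal sketch of the sliding-window framing follows. The interleaved `[I, Q, I, Q, ...]` feature ordering and the choice to skip symbols near the sequence edges are assumptions made for illustration.

```python
def sliding_window_features(received, span=4):
    """Build one feature vector per symbol from `span` past and `span`
    future samples plus the current one, ordered as [I, Q, I, Q, ...]."""
    features = []
    for n in range(span, len(received) - span):
        frame = received[n - span:n + span + 1]
        row = []
        for sample in frame:
            row.extend((sample.real, sample.imag))
        features.append(row)
    return features

# Twelve placeholder complex samples standing in for received symbols.
received = [complex(k, -k) for k in range(12)]
features = sliding_window_features(received)
```

With `span=4`, each row holds 9 symbols times 2 coordinates, giving the 18 input features, and the current symbol sits in the middle of the row.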

The input layer components are multiplied by the NN weights and sent to two hidden layers, each with 50 neurons. We swept from one to four hidden layers. Performance improves for two layers, but remains flat for three and four layers. Therefore, we use two hidden layers for FFNN. We also examined a range of 30 to 100 nodes. We settled on 50 nodes for FFNN as this gave good performance at all SNR levels; more nodes gave no performance improvement. The *tanh* function is adopted as the nonlinear activation function in these two hidden layers as it outperformed both sigmoid and ReLU functions. The convergence properties were also improved with tanh.

As mentioned earlier, the NN weights are trained to minimize the CE between the output and known target. Accordingly, the output layer has $M$ neurons, one per constellation point. We use the soft-max function at the output layer to determine the symbol probabilities (class probabilities). We then apply the CE negative log likelihood error criterion for updating the weights.
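The output stage described here can be illustrated with a short sketch of the soft-max and the per-symbol CE. The activation values are hypothetical, and the max-subtraction trick is a standard numerical safeguard rather than something stated in the text.

```python
import math

def softmax(logits):
    """Numerically stable soft-max: subtract the maximum before exponentiating."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probabilities, target_index):
    """Negative log likelihood of the transmitted constellation point."""
    return -math.log(probabilities[target_index])

logits = [2.0, 0.5, -1.0, 0.1]   # hypothetical output-layer activations for M = 4
probs = softmax(logits)
decision = max(range(len(probs)), key=probs.__getitem__)  # most probable class
loss_if_class_0_sent = cross_entropy(probs, 0)
loss_if_class_2_sent = cross_entropy(probs, 2)
```

The loss is small when the transmitted point receives high probability and grows as the NN assigns it less probability, which is what drives the weight updates.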

#### 3.2 Long short-term memory (LSTM)

In FFNN, the output is based only on the current input features and is independent of previous frames. As a result, the FFNN output would resemble MMSE output. To approach an MLSE solution, we turn to LSTM, which was introduced to extend the conventional RNN to better capture very long-term dependencies [16].

Where the conventional RNN has neurons that only contain the hidden state (see cell $h$ in the abstraction in Fig. 4(a)), the LSTM also has the cell $c$ at each neuron. The LSTM multi-functional $c$ cells are shown in Fig. 4(b). The cell gates (in the form of sigmoid functions) perform complex operations on data (like forget, update and output) to capture the very long term dependencies during training. LSTM can be either unidirectional or bidirectional, and in this paper we consider both LSTM versions.

A unidirectional LSTM will adjust to changes in the input sequence, even if a single IQ symbol is used as input. However, it cannot achieve the pruning action in a Viterbi algorithm [27]. An MLSE trellis [24] will update the past decisions (switch to a different path) at each successive symbol interval if that decision increases likelihood. This reevaluation of previously preserved paths is a sort of “back and forth” search for the best sequence in an MLSE trellis. If the LSTM ran in two directions it would be able to harness previous paths via the LSTM running in the reverse direction. We further consider a bidirectional LSTM (BiLSTM) [28,29] so that the forget and update functions can also be applied in reverse on the data.

The abstraction of the BiLSTM architecture is shown in Fig. 4(c). It is formed from two independent LSTMs with identical structure. The input data is fed through each LSTM: one copy in the forward direction and another in the backward direction. The outputs at each symbol interval are then combined to create the final outputs. Thus, the equalized current symbol IQ coordinates at the output layer are calculated from both the buffered past and future sequence information.
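The bidirectional data flow can be sketched with a tiny, untrained LSTM in plain Python. This is an illustration of the structure only: the gate layout follows the standard LSTM formulation, the weights are random, the hidden size is reduced to 3, and combining the two directions by concatenation is an assumption (the text says only that the outputs are combined).

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def make_lstm(input_size, hidden_size, rng):
    """Random weights for the four gates (input, forget, candidate, output)."""
    cols = input_size + hidden_size + 1  # +1 for the bias term
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)]
            for _ in range(4 * hidden_size)]

def lstm_step(weights, x, h, c):
    """One LSTM update; returns the new hidden state h and cell state c."""
    size = len(h)
    z = x + h + [1.0]
    pre = [sum(w * v for w, v in zip(row, z)) for row in weights]
    new_h, new_c = [], []
    for j in range(size):
        i_gate = sigmoid(pre[j])
        f_gate = sigmoid(pre[size + j])
        g_cand = math.tanh(pre[2 * size + j])
        o_gate = sigmoid(pre[3 * size + j])
        cell = f_gate * c[j] + i_gate * g_cand   # forget old, add new
        new_c.append(cell)
        new_h.append(o_gate * math.tanh(cell))
    return new_h, new_c

def run_lstm(weights, sequence, hidden_size):
    h, c = [0.0] * hidden_size, [0.0] * hidden_size
    outputs = []
    for x in sequence:
        h, c = lstm_step(weights, x, h, c)
        outputs.append(h)
    return outputs

def run_bilstm(fwd_weights, bwd_weights, sequence, hidden_size):
    """Concatenate forward-pass and backward-pass hidden states per symbol."""
    forward = run_lstm(fwd_weights, sequence, hidden_size)
    backward = run_lstm(bwd_weights, sequence[::-1], hidden_size)[::-1]
    return [f + b for f, b in zip(forward, backward)]

rng = random.Random(0)
hidden = 3  # small for illustration only
fwd, bwd = make_lstm(2, hidden, rng), make_lstm(2, hidden, rng)
iq_sequence = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(6)]
outputs = run_bilstm(fwd, bwd, iq_sequence, hidden)
```

At each symbol interval the forward half of the output has seen only past and current inputs, while the backward half has seen only current and future inputs, so the combined vector carries context from both sides.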

For the input layer, the unidirectional LSTM has a sliding frame with 18 IQ measurements as in the FFNN. The frame input together with the feedback gating is expected to facilitate unidirectional LSTM to better react to data sequences. In contrast, the BiLSTM has a single IQ input, and thus, relies entirely on the LSTM cells (both in the forward and the backward NN) and their gating to gain context into symbol sequences.

We use one hidden layer for unidirectional LSTM, as a second hidden layer did not improve performance. For the BiLSTM architecture two hidden LSTM layers clearly improved performance, while a third did not. We refer to this NN as deep BiLSTM to highlight the depth as compared to the other LSTM solution examined. Additionally, the LSTM and deep BiLSTM both use 60 nodes (by nodes we refer to the LSTM cells). We examined a range of 30 to 100 nodes, and found that 60 nodes was best. The simulation parameters we adopted for all three NNs considered are summarized in Table 2.

## 4. Performance results for QPSK

We simulated BER vs. SNR performance for two channels: the SiP channel and the multipath channel MP5 which has considerably more severe ISI. The corresponding SiP and MP5 results are presented in Fig. 5(a) and 5(b), respectively. The BER for an ideal channel (ISI-free) is included with the annotation “theory”. The conventional receiver performance, both MLSE and MMSE, is included in dashed blue lines. The NN performance is given in red markers. A horizontal line is traced at the 7% overhead FEC threshold at BER of 3.8e-3.

Consider first the SiP channel. The NNs achieve performance that falls between the MMSE and MLSE receivers, that is, between the optimal linear and nonlinear solutions. The LSTM outperforms MMSE with a gain of 4.6 dB, with the FFNN providing similar improvement. At the 7% FEC threshold, LSTM approaches MLSE with a small 1.2 dB penalty. The deep BiLSTM equalizer actually achieves the same performance as the optimal MLSE receiver. This is probably because the deep BiLSTM equalizer can well capture the long-term dependencies in the ISI channels, and compensate ISI without noise enhancement.

The same trends can be seen for multipath channel MP5. As the ISI is more severe, the disparity in the performance of the linear MMSE solution and the MLSE solution is much more pronounced. The penalty from MLSE to LSTM performance is only 0.5 dB greater for this channel than for the SiP channel, despite a much greater gap between MMSE and MLSE performance (13.3 vs. 5.8 dB). Once again the deep BiLSTM achieves the same performance as the MLSE, further verifying its effectiveness.

We evaluated the BER vs. SNR performance of deep BiLSTM for all our remaining synthetic bandlimited channels, i.e., all SG and all MP channels. At the 7% FEC threshold, the deep BiLSTM again achieved the same performance as MLSE for all these channels, i.e., zero penalty, showing that the deep BiLSTM is extremely effective in mitigating severe ISI for QPSK (the SNR values of BiLSTM and MLSE at 3.8e-3 BER for MP and SG channels are listed in Table 3). This is especially remarkable as the MLSE had access to perfect CSI, while the deep BiLSTM garnered its information only from the training set.

#### 4.1 Discussion

The classical FFNN and the LSTM offered similar performance. The LSTM had the same sliding window input as the FFNN, but even with the additional internal feedback it was unable to outperform the FFNN. We suspected that the framed input could actually be holding back the LSTM from achieving even better performance. That is, framed input may provide the LSTM with too much context, and thus, stifle its ability to build context using the long short-term memory cells in the NN.

Confining the LSTM to only a single pair of IQ coordinates at the input did lead to mild improvement in performance. However, moving to a bidirectional LSTM was required to reach the theoretical limit of MLSE detection. The BiLSTM architecture was effective in learning the sequential nature of data dependencies. This change, combined with the deep (two layer) structure, greatly enhanced the ability of LSTM to address sequential correlations.

## 5. Extension to higher order QAM

As NNs are data-driven models, their performance depends acutely on input data quality. Performance degradation with increased modulation order can be severe. In a channel with high ISI, the received symbols (input data to NN) are highly perturbed by ISI distortion in addition to additive noise. We examine how ISI severity impacts the ability of the deep BiLSTM NN to achieve MLSE-levels of performance.

We use the same deep BiLSTM and BER simulator as in previous sections, but replace QPSK with $M$-QAM modulation. We vary ISI severity by examining a variety of channels. The deep BiLSTM architecture (two hidden layers and 60 LSTM cells each) remains unchanged, since we could not find a configuration with higher performance.

#### 5.1 Assessing performance gains

Let us consider the relative BER performance as a channel worsens, that is, as the ISI becomes more severe. Clearly, the resulting gap (in dB) between MMSE and the ISI-free BER curves will increase. We can use the gap width at the 3.8e-3 FEC threshold as a benchmark. Each channel can be parameterized by this gap, which we use as the $x$-axis in the typical performance plot in Fig. 6(a). While Fig. 6(a) is a sketch for generic modulation (any QAM), Fig. 7 will present calculations for 8QAM, 16QAM and 32QAM. The $y$-axis reports the other performance gaps vis-à-vis MMSE, where ‘eq’ is one of the three equalizers: NN, MLSE, or theoretical limit (ISI-free). The lines report the absolute performance gap between MMSE and each solution (see curly braces). The lines define the following three performance regions.

A large red region means the NN has obtained great gains over the linear MMSE, while a small blue area indicates the NN is performing well compared to the optimal MLSE equalizer. In the ideal NN performance case, the blue region disappears whereas the red region is maximized. A large grey area means the channel is truly challenging and even the optimal MLSE has limited performance. Note that the uppermost line is by construction $x=y$, the MMSE-theory gap.
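The three regions can be computed directly from the SNR each receiver needs at the FEC threshold. The sketch below follows the color descriptions above; the function name and the SNR values are hypothetical and serve only to exercise the arithmetic.

```python
def performance_regions(snr_theory, snr_mlse, snr_nn, snr_mmse):
    """Split the MMSE-theory gap (all SNRs in dB at the FEC threshold)
    into the three regions of the performance plot."""
    return {
        "x": snr_mmse - snr_theory,      # MMSE-theory gap (x-axis)
        "red": snr_mmse - snr_nn,        # gain of the NN over MMSE
        "blue": snr_nn - snr_mlse,       # NN penalty relative to MLSE
        "grey": snr_mlse - snr_theory,   # loss that even MLSE cannot recover
    }

# Hypothetical SNR values in dB, not results from the paper.
regions = performance_regions(
    snr_theory=10.0, snr_mlse=13.0, snr_nn=15.5, snr_mmse=20.0
)
```

By construction the red, blue and grey thicknesses sum to the $x$-axis value, which is why the uppermost line is $x=y$.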

To move from the prototypical plot to a specific plot, we can run simulations of BER vs. SNR for each channel. Take the example in Fig. 6(b) for 8QAM over the experimentally measured SiP channel frequency response. We annotate the FEC line with the relative penalty between equalizers: deep BiLSTM (black), MLSE (blue) and MMSE (red). For the SiP channel, the gap at the FEC threshold between ISI-free and a linear MMSE equalizer is 10 dB.

#### 5.2 Sweeping channels and modulation order

In Fig. 7(a) we present 8QAM results for SiP and five multipath channels described in section 2.2. The SiP stem plot is centered at $x= 10$ dB. Beside each section of the stem we note the corresponding relative penalty in dB found at the FEC threshold in Fig. 6(b). The other stem plots are found in a similar manner.

With this new graphical view of performance, we examine the effects of ISI severity and QAM modulation order. In addition to the 8QAM case in Fig. 7(a), results for 16QAM and 32QAM are given in Fig. 7(b) and 7(c), respectively. We found the super-Gaussian channels follow similar trends to those of the multipath channels; see plots in section 6.3.

For all three modulations, we can see similarities in the behavior as ISI worsens. The milder ISI channels (low MMSE-theory gap), of course, see little difference in performance between equalizers; even the linear equalizer performs well. As the ISI becomes more severe (moving right on the $x$-axis), our collection of parameterized channels manifests roughly linear growth in the performance gaps. We observe that the unrecoverable performance seems to saturate, so that at large $x$ the theory-MLSE gap (gray zone thickness) becomes roughly constant. In other words, for large $x$ the grey zone lower boundary tends to run parallel to the upper boundary, i.e., the $x=y$ line.

Consider now how NN performance deteriorates as we move to higher modulation order. Though not included, the QPSK plot would show zero gap between the NN and MLSE; the blue region would not be present. We see the blue region increase with increasing modulation order. Therefore our NN can no longer achieve optimal performance. However, the red region never disappears completely, indicating that the NN can always provide improvement over the linear solution.

For 16QAM the NN can recover roughly half the performance gap between the practical (MMSE) and the optimal (MLSE). The larger the combined blue/red regions, the more room for improvement our channel has in using a NN over the practical linear case. For channels with severe ISI, the NN can offer 4.2 dB of gain over a linear solution even for 32QAM.

## 6. Discussion

We examined the regression (mean square error (MSE) criteria) and classification versions (CE criteria) of all NNs. In Fig. 5(b) for QPSK we saw that the FFNN and LSTM achieved similar performance for CE; this was also true for MSE results. For higher order modulation formats the CE performance was best, so we use CE for all results presented in this paper.

#### 6.1 FFNN vs. LSTM with sliding window inputs

As the LSTM and FFNN training achieved the same performance, we compared the weights in the hidden layer. For instance, at 11 dB SNR for the MP5 channel, these two NN solutions had only 50% of their erroneous symbols in common. Therefore, they converged to different solutions, but with similar performance. Both FFNN and LSTM had sliding window inputs. We were surprised that despite the feedback available in the LSTM, it did not outperform the FFNN. We concluded that, while the LSTM led to a distinct solution, it was not exploiting the LSTM cells for sequence detection. Reducing the inputs to a single pair of IQ inputs did somewhat improve LSTM performance.

#### 6.2 Convergence of BiLSTM

To examine the convergence of the CE error in classification for the deep BiLSTM, we examine two multipath channels: channel MP2 with moderate ISI and channel MP5 with high ISI. We consider 16QAM modulation with 25 dB SNR where the NN can achieve 1.2e-5 and 1.8e-2 BER for channels MP2 and MP5, respectively. The learning curves (not shown) for channel MP2 are smooth, with fast convergence within a few hundred epochs. Even for much lower SNR, the MP2 convergence was not problematic.

In Fig. 8 we present the CE learning curve for the validation set for deep BiLSTM. We see clear convergence anomalies, with spikes appearing often in the learning curve. The greater ISI of channel MP5 leads to less robust convergence. This is not unexpected as the ISI distortion makes training challenging. To overcome this behavior, we regularly saved the NN parameters. As seen in the red traces in Fig. 8, the parameters are discarded when the error increases. Once a sufficient number of epochs has been examined, we recover the parameters with the lowest error. We use this parameter set in the deep BiLSTM architecture to estimate the BER. Saving the parameters is important for high ISI and extremely low SNR, but is also beneficial for other scenarios.
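The save-and-recover procedure amounts to keeping the parameter set with the lowest validation error seen so far. A minimal sketch follows; the training step is replaced by a scripted validation curve, and all names are hypothetical.

```python
import copy

def train_with_checkpoint(params, run_epoch, num_epochs):
    """Keep a copy of the parameters from the epoch with the lowest
    validation error, ignoring later epochs where the error spikes."""
    best_error = float("inf")
    best_params = copy.deepcopy(params)
    for epoch in range(num_epochs):
        validation_error = run_epoch(params, epoch)
        if validation_error < best_error:
            best_error = validation_error
            best_params = copy.deepcopy(params)
    return best_params, best_error

# Stand-in for a real training step: a scripted validation curve with spikes.
scripted_errors = [0.90, 0.50, 0.30, 0.80, 0.25, 0.60, 0.27]

def fake_epoch(params, epoch):
    params["epoch"] = epoch          # pretend the weights changed
    return scripted_errors[epoch]

best_params, best_error = train_with_checkpoint(
    {"epoch": -1}, fake_epoch, len(scripted_errors)
)
```

The deep copy matters: the live parameters keep changing during training, so the checkpoint has to be a snapshot rather than a reference.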

#### 6.3 Scalability of MLSE versus deep BiLSTM

For modulation order $M$ and a channel represented by $L$ taps in an FIR filter, i.e., channel memory of $L-1$ symbols, the MLSE equalizer computes $M^{L-1}$ metrics for each new received symbol. Due to this exponentially increasing complexity, the MLSE receiver is infeasible to implement for higher order QAM (16QAM and above) and/or long channel memory (five symbols and more). Common hardware solutions for wireless communications can handle $2^8=256$ states. The MLSE is an excellent indicator of optimal performance, but is unattainable in higher order modulation systems.
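To make the exponential growth concrete, the $M^{L-1}$ count can be tabulated for the modulation orders and tap lengths considered in this paper (three-tap multipath and five-tap super Gaussian channels). The function name is illustrative.

```python
def mlse_metrics_per_symbol(modulation_order, num_taps):
    """Number of metrics, M**(L-1), computed per received symbol."""
    return modulation_order ** (num_taps - 1)

table = {
    (m, taps): mlse_metrics_per_symbol(m, taps)
    for m in (4, 8, 16, 32)
    for taps in (3, 5)
}

for (m, taps), count in sorted(table.items()):
    print(f"M={m:2d}, L={taps}: {count} metrics per symbol")
```

QPSK over a three-tap channel needs only 16 metrics per symbol, whereas 32QAM over a five-tap channel needs $32^4 = 1048576$.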

For the deep BiLSTM, the number of floating point operations required for training is given by $( (2 \times 60 +60\times 60)\times 2 + (60\times 2\times M) + 60\times 4\times I_1 )\times I_2\times I_3= 7440\times I_2\times I_3+ 240 \times I_1\times I_2 \times I_3 + 120\times I_2\times I_3 \times M$, where $I_1$ is the number of floating point operations for each LSTM cell and $I_2$ is the number of training symbols. Both $I_1$ and $I_2$ are constants. The number of epochs required for convergence is $I_3$, which is a variable less than 800, as shown in Fig. 8. Once the training is done, the complexity of processing each received symbol reduces to $(2 \times 60 +60\times 60)\times 2 + (60\times 2\times M) + 60\times 4\times I_1 = 7440 + 240 \times I_1 + 120 \times M$. This processing complexity should be used for a fair comparison with MLSE. It is clear that this complexity increases linearly with $M$, an immense improvement over the exponential scaling of MLSE.

Our examination of 30 to 100 nodes and 2 or 3 hidden layers did not see significant performance improvement over the 60-node/2-layer solution. For QPSK we attained the same performance as MLSE, but the performance decreased as we moved to 32QAM. Nonetheless, 8QAM and 16QAM saw significant improvement in performance, greatly outperforming the linear solution.

The memory length did not appear to have a significant impact on the performance of the deep BiLSTM with fixed complexity. The super Gaussian family has memory length 4, while the memory length is 2 for the multipath channels we examined in section 5. In Fig. 9(a) and (b) we report super Gaussian results for 8QAM and 16QAM, respectively. At 32QAM ($M=32$) transmission for SG channels ($L=5$), our Matlab simulator could not handle the $32^4$ states in the decoder trellis, hence we have no results for MLSE for this channel.

From Fig. 2 we can see that the multipath family has some frequency response shapes exhibiting dips at higher frequency. For the super Gaussian family, the roll-off is the primary change from one channel to another, with some shallow dips at low frequency. The $y$-axis scale has changed from Figs. 6–9 as the multipath family has more severe ISI. Nonetheless, despite these differences and the greater channel memory length, the qualitative behavior of the two families is quite similar.

In this article, we assume ideal CSI for MLSE. In practice, however, obtaining ideal CSI for ISI channels is challenging, especially when the ISI is severe. For a fair comparison, future work should therefore compare the performance of NNs to that of MLSE with estimated CSI. Additionally, it is worth examining whether a clear scaling law can be discerned for the deep BiLSTM solution. Would significant increases in complexity (beyond the sweep we made) uncover architectures that continue to achieve MLSE performance?

## 8. Conclusion

We examined several NN architectures to identify which is most effective at mitigating severe ISI in bandlimited channels. For the first time, we demonstrated (via simulation) that our proposed deep BiLSTM achieves optimal MLSE performance for QPSK. This NN exploits memory cells within each node and two independent NNs of identical structure that process the data in the forward and backward directions before outputting the equalized data.

We also examined how NN performance scaled as we swept the severity of the ISI, the length of channel memory, and the modulation level. The severity of the ISI determines the best attainable performance, while the other two factors determine the complexity a traditional MLSE requires to reach it. Performance was qualitatively similar for the two memory lengths examined. While performance degraded with modulation order, the improvement over simple linear MMSE filtering was still compelling, even at 32QAM. Of particular importance, the MMSE and MLSE benchmarks assumed perfect channel state information, while the NN solution used only the training set.

## Funding

Natural Sciences and Engineering Research Council of Canada (537311-18); Huawei Industrial Research Chair (537311-18).

## Disclosures

The authors declare no conflicts of interest.
