## Abstract

We propose two reduced-complexity (RC) LDPC decoders, which can be used in combination with large-girth LDPC codes to enable ultra-high-speed serial optical transmission. We show that optimally attenuated RC min-sum sum algorithm performs only 0.46 dB (at BER of 10^{−9}) worse than conventional sum-product algorithm, while having lower storage memory requirements and much lower latency. We further study the use of RC LDPC decoding algorithms in multilevel coded modulation with coherent detection and show that with RC decoding algorithms we can achieve the net coding gain larger than 11 dB at BERs below 10^{−9}.

©2010 Optical Society of America

## 1. Introduction

In response to high-bandwidth demands due to rapid growth of data-centric applications and deployment of broadband access networks, the network operators are upgrading their dense wavelength division multiplexing (DWDM) networks from 10Gb/s per channel to more spectrally efficient 40 Gb/s and 100 Gb/s [1]. The 400 Gb/s and 1 Tb/s serial optical transmissions are regarded to be the next steps after 100Gb/s and have already started attracting interests from research community in both academia and industry [2]. In order to achieve ultra-high-speed optical transmission at 400Gb/s and beyond with commercially available equipment operating at 40 Giga symbols/s (40 GS/s), we have recently proposed the use of iterative polarization quantization (IPQ)-based modulation scheme with component codes being large-girth low density parity check (LDPC) codes [3]. This scheme, however, requires the implementation of sum-product algorithm (SPA), commonly used in decoding of LDPC codes, at 40 Gb/s, which is challenging to implement even with state-of-the-art electronic integration circuits technology.

In order to reduce the complexity of SPA, a number of different approximate algorithms were proposed [2,4,5]. The main focus was to reduce the complexity of check-node (c-node) update rule. Because c-node update rule is the key step in SPA, imperfect approximations lead to significant BER performance degradation. In this paper we follow a different strategy. Instead of trying to reduce the complexity of c-node update rule, we try to reduce the complexity of variable-node (v-node) update rule. Two reduced complexity (RC) LDPC decoding algorithms are introduced: (i) RC min-sum algorithm and (ii) RC a *posteriori* probability (APP) algorithm. We show that even complete elimination of v-node update rule leads to only 0.46 dB degradation when large-girth LDPC codes are used. We further study the use of RC decoding algorithm in polarization multiplexed (PolMUX) 32-IPQ with symbol rate of 50 Giga symbols/s (50 GS/s) and show that that net coding gains (NCGs) beyond 11 dB (at BER ≤10^{−9}) are possible.

The paper is organized as follows. In Section 2, we provide two RC LDPC decoding algorithms and evaluate their BER performance, decoding complexity, and memory requirements. In Section 3, we describe polarization-multiplexed IPQ scheme suitable for ultra-high-speed serial optical transmission and evaluate its performance when the proposed RC LDPC decoding algorithms are used. Concluding remarks are given in Section 4.

## 2. Reduced-complexity LDPC decoders

Before we fully describe the proposed RC algorithms, we introduce several definitions that will facilitate the description. A *regular* (*n*, *k*) LDPC code is a linear block code whose parity-check matrix ** H** contains exactly

*W*1's per column and exactly

_{c}*W*=

_{r}*W*(

_{c}*n*/(

*n*-

*k*)) 1’s per row, where

*W*<< (

_{c}*n*-

*k*). Decoding of LDPC codes is based on SPA, which is an iterative decoding algorithm where extrinsic probabilities are iterated forward and back between variable and check nodes of bipartite (Tanner) graph representation of a parity-check matrix

**. The Tanner graph of an LDPC code is drawn according to the following rule: check node**

*H**c*is connected to variable node

*v*whenever element

*h*in

_{cv}**is a 1. For**

*H**v*-node

*v*(c-node

*c*) we define its neighborhood

*N*(

*v*) (

*N*(

*c*)) as the set of c-nodes (v-nodes) connected to it. For the completeness of presentation we briefly describe the log-domain SPA (for detailed description an interested reader is referred to [2]).

Gallager SPA

- 0. Initialization: For
*v =*0,1,…*,n*-1; initialize the messages*L*_{v}_{→}to be sent from v-node_{c}*v*to c-node*c*to channel log-likelihood ratios (LLRs)*L*_{ch}(*v*)*, namely*L*_{v}_{→}=_{c}*L*_{ch}(*v*). - 1. c-node update rule: For
*c*= 0,1,…,*n*-*k*-1; compute ${L}_{c\to v}=\underset{N\left(c\right)\backslash \left\{v\right\}}{\begin{array}{||}\hline +\\ \hline\end{array}}\text{\hspace{0.17em}}{L}_{v\to c}.\text{\hspace{0.17em}}$The box-plus operator is defined iteratively by ${L}_{1}\begin{array}{||}\hline +\\ \hline\end{array}{L}_{2}={\displaystyle \prod _{k=1}^{2}\mathrm{sign}\left({L}_{k}\right)}\cdot \varphi \left({\displaystyle \sum _{k=1}^{2}\varphi \left(\left|{L}_{k}\right|\right)}\right),\text{\hspace{0.17em}}\varphi \left(x\right)=-\mathrm{log}\mathrm{tanh}\left(x/2\right)$. The box operator for |*N*(*c*)\{*v*}| components is obtained by recursively applying 2-component version defined above. - 2. v-node update rule: For
*v =*0,1,…*,n*-1; set ${L}_{v\to c}={L}_{\mathrm{ch}}\left(v\right)+{\displaystyle \sum _{N\left(v\right)\backslash \left\{c\right\}}{L}_{c\to v}}$ for all c-nodes for which*h*= 1._{cv} - 3. Bit decisions: Update
*L*(*v*) (*v*= 0,…,*n*-1) by $L\left(v\right)={L}_{\mathrm{ch}}\left(v\right)+{\displaystyle \sum _{N\left(v\right)}{L}_{c\to v}}$ and set $\widehat{v}=1$ when*L*(*v*)<0 (otherwise, $\widehat{v}=0$). If $\widehat{\mathit{v}}{\mathit{H}}^{T}=\mathbf{0}$ or pre-determined number of iterations has been reached then stop, otherwise go to step 1.- *The channel LLR is defined by
*L*_{ch}(*v*) = log[*P*(*v*= 0|*y*)/*P*(*v*= 1|*y*)], where*y*is the channel sample. For example, for asymmetric AWGN channel*L*_{ch}(*v*) = log(σ_{1}/σ_{0})-(*y*-μ_{0})^{2}/2σ_{0}^{2}+ (*y*-μ_{1})^{2}/2σ_{1}^{2}, while for symmetric AWGN (σ_{1}= σ_{0}= σ) channel*L*_{ch}(*v*) = 2*y*/σ^{2}, where μ(σ_{i}) denote the mean (standard deviation) corresponding to symbol_{i}*i*(*i*= 0,1).

Because the c-node update rule involves log and tanh functions, it is computationally intensive, and there exist many approximation approaches. A popular one is the min-sum-plus-correction-term approximation [4]. Namely, it can be shown that “box-plus” operator + can also be calculated by ${L}_{1}\begin{array}{||}\hline +\\ \hline\end{array}{L}_{2}={\displaystyle \prod _{k=1}^{2}\mathrm{sign}\left({L}_{k}\right)}\cdot \mathrm{min}\left(\left|{L}_{1}\right|,\left|{L}_{2}\right|\right)+c\left(x,y\right),$ where *c*(*x*,*y*) denotes the correction factor defined by *c*(*x*,*y*) = log[1 + exp(-|*x* + *y*|)]-log[1 + exp(-|*x*-*y*|)], commonly implemented as a look-up table (LUT). In our approach, we concentrate on variable node update rule in order to reduce the complexity of SPA. It is clear from algorithm above that the computational complexity of c-node update rule is much higher than that of v-node, while memory storage requirements are comparable. Because the v-node update rule step is not that critical compared to c-node update rule, we propose to completely eliminate it. Our reduced complexity SPA, which will be called here RC-min-sum algorithm given below, is therefore composed only steps 1 and 3. To compensate for BER performance loss due to elimination of v-node update rule we propose to re-formulate the c-node update rule by introducing the attenuation factor α in box-plus operator as follows: ${L}_{1}\begin{array}{||}\hline +\\ \hline\end{array}{L}_{2}\cong \alpha \cdot \left[{\displaystyle \prod _{k=1}^{2}\mathrm{sign}\left({L}_{k}\right)}\cdot \mathrm{min}\left(\left|{L}_{1}\right|,\left|{L}_{2}\right|\right)\right].$ In addition to reducing memory storage requirements, the proposed RC algorithm improves the latency, which is of crucial importance for ultrahigh-speed implementation. The RC min-sum algorithm can be formulated as follows.

Reduced-complexity min-sum algorithm

- 0. Initialization: For
*v =*0,1,…*,n*-1; initialize the v-node reliabilities*L*to channel LLRs,_{v}*L*=_{v}*L*_{ch}(*v*). - 1. c-node update rule: For
*c*= 0,1,…,*n*-*k*-1; compute ${L}_{c\to v}=\underset{N\left(c\right)\backslash \left\{v\right\}}{\begin{array}{||}\hline +\\ \hline\end{array}}\text{\hspace{0.17em}}{L}_{v}.\text{\hspace{0.17em}}$ The box-plus operator is defined by ${L}_{1}\begin{array}{||}\hline +\\ \hline\end{array}{L}_{2}=\alpha {\displaystyle \prod _{k=1}^{2}\mathrm{sign}\left({L}_{k}\right)}\cdot \mathrm{min}\left(\left|{L}_{1}\right|,\left|{L}_{2}\right|\right).$ - 2. Bit decisions: Update
*L*(*v*) (*v*= 0,…,*n*-1) by $L\left(v\right)={L}_{\mathrm{ch}}\left(v\right)+{\displaystyle \sum _{N\left(v\right)}{L}_{c\to v}}$ and set $\widehat{v}=1$ when*L*(*v*)<0 (otherwise, $\widehat{v}=0$). If $\widehat{\mathit{v}}{\mathit{H}}^{T}=\mathbf{0}$ or pre-determined number of iterations has been reached then stop, otherwise go to step 1.

The second RC algorithm to be described here is based on the RC decoding algorithm proposed by Fossorier *et al*. [5]. The original algorithm was applicable to AWGN channels only. Our proposed algorithm, which will be called here reduced complexity *a posteriori* probability (RC-APP) algorithm, is independent on channel assumption and can be formulated as follows.

Reduced complexity *a posteriori* probability algorithm

- 0. Initialization: For
*v =*0,1,…*,n*-1; determine hard decisions z_{v}= {1-sign[*L*_{ch}(*v*)]}/2 and magnitudes*m*_{v}= |*L*_{ch}(*v*)| from channel LLRs*L*_{ch}(*v*). - 1. c-node update rule: For
*c*= 0,1,…,*n*-*k*-1; compute the magnitudes*m*_{c}_{→}to be sent from c-node_{v}*c*to v-node*v*by ${m}_{c\to v}=\alpha \cdot \left(\underset{N\left(c\right)\backslash \left\{v\right\}}{\mathrm{min}}\text{\hspace{0.17em}}{m}_{v}\right),$ where α is the attenuation factor. Compute also ${z}_{c\to v}=\left({\displaystyle \sum _{N\left(c\right)\backslash \left\{v\right\}}^{}{z}_{v}}\right)\mathrm{mod}2\text{\hspace{0.17em}}.$ - 2. Bit decisions: Update the v-node magnitudes
*m*(_{v}*v*= 0,…,*n*-1) by ${m}_{v}=\left|{L}_{\mathrm{ch}}\left(v\right)\right|+{\displaystyle \sum _{N\left(c\right)}\left(1-2{z}_{{}_{c\to v}}\right){m}_{c\to v}}$ and set z_{v}= z_{v}⊕1 for*m*<0. If $\mathit{z}{\mathit{H}}^{T}=\mathbf{0}$ or pre-determined number of iterations has been reached then stop; otherwise go to step 1._{v}

Note that this algorithm requires only the use of summations, comparators and mod 2 adders. The attenuations can be implemented by appropriate register shifts (e.g., 0.5 corresponds to one register shift to the right). The decoding complexity can be estimated in terms of number of required operations per single iteration. For example, the c-node update rule in RC-min-sum algorithm is dominated by number of comparators, which is (*d*
_{c}-2), where *d*
_{c} is the c-node degree. The conventional min-sum algorithm requires *d _{v}*-additions (where

*d*is the v-node degree) in v-node update rule, while the bit-decision step requires (

_{v}*d*+ 1) additions. Therefore, the RC-min-sum algorithm requires

_{v}*nd*additions less compared to conventional min-sum algorithm. The complexity of Gallager SPA is even higher since the c-node update rule requires 15(

_{v}*d*

_{c}-2) additions per check node. To study the memory allocation requirements, we assume that partially parallel implementation is used, with bit-processing elements (BPEs) and check processing elements (CPEs) being assigned as shown in Fig. 1 . In Table 1 , we summarize the memory allocation for quasi-cyclic LDPC(16935, 13550) code, where we use the following notation: MEM B and MEM C denote the memories used to store bit node and check node edge values; MEM E stores the codeword estimate; MEM I stores the initial LLRs, and MEM F stores final LLRs (required in bit decision step). When RC-min-sum algorithm is used, the MEM B block is not needed at all. Finally, since v-node update rule does not exist in RC-min-sum algorithm, the decoding latency will be lower. Although exact latency improvement is dependent on the implementation platform, our study indicates that proposed RC-min-sum algorithm has lower complexity compared to conventional min-sum algorithm and SPA.

We turn our attention now to the BER performance evaluation of proposed reduced complexity algorithms. In Fig. 2
, we show the results of Monte Carlo simulations for different LDPC decoding algorithms and optimum attenuation factors (obtained numerically). The large-girth LDPC(16935, 13550) code is used in simulations. It is interesting to notice that min-sum-plus-correction-term algorithm faces negligible performance loss, which is in agreement with [2,4]. The min-sum algorithm with optimum attenuation factor α = 0.8 shows similar performance as SPA, while RC-min-sum with optimum attenuation factor of 0.44 faces only 0.46 dB degradation at BER of 10^{−9}. The storage complexity and latency of optimally attenuated RC-min-sum algorithm are much lower, which makes the proposed algorithm as an excellent candidate for high-speed implementation. The optimally attenuated RC-APP algorithm performs 0.2 dB worse than optimally attenuated RC-min-sum algorithm. It is interesting to notice that it is possible to improve performance of SPA by 0.1 dB at BER = 10^{−9} for attenuation factor of 0.9. In Fig. 3
we study the influence of quantization effect on BER performance of various decoders. When 4 bits (3 bits for magnitude and one bit for sign) are used, there is only extra 0.05 dB penalty for both 0.8-min-sum and 0.4-RC-APP algorithms (at BER of 10^{−7}). When 3 bits are used (2 for magnitude and 1 for sign), the corresponding degradations are 0.26 dB and 0.38 dB respectively. Finally, when only 1 bit is used for magnitude and 1 for sign, 0.8-min-sum algorithm exhibits about 0.89 dB degradation, while degradation for 0.4-RC-APP is much worse. Given this description of proposed RC LDPC decoding algorithms, in next Section we study their application in PolMUX multilevel coded modulation with coherent detection.

## 3. PolMUX IPQ coded-modulation based on large-girth LDPC codes and RC decoders

The PolMUX IPQ-based coded modulation scheme suitable for beyond 400 Gb/s per wavelength optical transmission is shown in Fig. 4
. The *m*
_{x} + *m*
_{y} (index x (y) corresponds to x-(y-) polarization) independent data streams are encoded using different LDPC codes of code rates *R _{i}* =

*K*/

_{i}*N*(

*i*∈{x,y}), where

*K*

_{x}(

*K*

_{y}) denotes the number of information bits used in the binary LDPC code corresponding to x- (y-) polarization, and

*N*denotes the codeword length. The

*m*

_{x}(

*m*

_{y}) input bit streams from

*m*

_{x}(

*m*

_{y}) different information sources, pass through identical encoders that use LDPC codes with code rate

*R*

_{x}(

*R*

_{y}) designed using a quasi-cyclic code design [2]. The outputs of the encoders are then bit-interleaved by an

*m*

_{x}×

*N*(

*m*

_{y}×

*N*) bit-interleaver where the sequences are written row-wise and read column-wise from a block-interleaver. The mapper maps each

*m*

_{x}(

*m*

_{y}) bits, taken from interleaver, into a ${2}^{{m}_{x}}$-ary (${2}^{{m}_{y}}$-ary) IPQ signal constellation point based on the LUT, which is for

*M*= 32 given in Table 2 (

*m*denotes the

_{i}*i*th circle radius,

*L*denotes the number of constellation points per

_{i}*i*th circle,

*r*denotes the decision regions in amplitude coordinate). The corresponding constellation diagram is shown in Fig. 5 , as we described in [3]. The IPQ mapper x (y) assigns

_{i}*m*

_{x}(

*m*

_{y}) bits to a constellation point represented in polar coordinates as

*s*

_{i}_{,x}= |

*s*

_{i}_{,x}|exp(jϕ

_{i}_{,x}) [

*s*

_{i}_{,y}= |

*s*

_{i}_{,y}|exp(jϕ

_{i}_{,y})]. One output of IPQ mapper is used as input of amplitude modulator (AM), while the second output is used as input to phase modulator (PM), as shown in Fig. 4(a). Notice that we use (

*m*

_{x}+

*m*

_{y}) encoders/decoders operating in parallel at data rate of

*R*

_{b}instead of only one encoder/decoder operating at date rate of (

*m*

_{x}+

*m*

_{y})

*R*

_{b.}At the receiver side (see Fig. 4(b)), the outputs at I- and Q-branches in two polarizations, are sampled at the symbol rate, while the symbol LLRs are calculated by λ(

**) = log[**

*s**P*(

**|**

*s***)/**

*r**P*(

*s*_{0}|

**)], where**

*r***and**

*s***denote the transmitted signal constellation point and received symbol at time instance**

*r**i*(in either x- or y- polarization), respectively, and

*s*_{0}represents the reference symbol. The turbo equalization (TE) principle is used to compensate for various channel impairments such as PMD, residual chromatic dispersion, and fiber nonlinearities [2]. Since the turbo equalization is not the main topic of this paper, we assume that channel impairments are ideally compensated for. The results of simulations of PolMUX based LDPC-coded IPQ scheme with 50 GS/s in symbol rate are shown in Fig. 6 . Corresponding encoders, decoders, mappers and demappers operate at data rate of 50 Gb/s. Notice that proposed LDPC decoding schemes are independent on data rate. The aggregate data rate of this particular simulation example is obtained as 2 × 5 × 0.8 × 50 Gb/s = 400 Gb/s (the factor 2 originates from two orthogonal polarizations, the factor 5 from 5 bits carried per single 32-IPQ symbol, and the factor 0.8 from code rate). The NCG at BER of 3 × 10

^{−9}, when min-sum-with-correction-term algorithm is used, is 11.6 dB. Larger coding gains can be expected at lower BERs. The optimally attenuated min-sum algorithm performs comparable to min-sum-with-correction-term algorithm. The RC-min-sum algorithm provides the NCG of 11 dB. This RC decoding based scheme, therefore, represents an excellent candidate to be used for 400G Ethernet technologies.

## 4. Conclusions

In this paper, we proposed two RC LDPC decoders, suitable for use in ultra-high-speed serial optical transmission: (i) attenuated RC min-sum algorithm and (ii) RC APP algorithm. We show that optimally attenuated RC min-sum sum algorithm performs only 0.46 dB (at BER = 10^{−9}) worse than the conventional sum-product algorithm, when used in combination with large-girth LDPC codes. The RC APP algorithm performs 0.2 dB worse than RC min-sum algorithm. The proposed algorithms simplify the implementation complexity and have lower storage memory requirements and lower latency, and can be used to support ultra-high-speed implementation. We further evaluate the proposed algorithm for use in PolMUX multilevel coded modulation with coherent detection, in combination with PolMUX 32-IPQ-based signal constellation. We show that low BERs can be achieved for medium optical SNRs, while achieving the NCG above 11 dB.

## Acknowledgments

This work was supported in part by the National Science Foundation (NSF) under Grants CCF-0952711 and ECCS-0725405.

## References and links

**1. **R. Saunders, M. Traverso, T. Schmidt, and C. Malouin, “Economics of 100Gb/s transport,” in Proc. OFC/NFOEC 2010, Paper No. NMB2, San Diego, CA, March 21–25, 2010.

**2. **I. B. Djordjevic, W. Ryan, and B. Vasic, *Coding for Optical Channels* (Springer, 2010).

**3. **I. B. Djordjevic, and H. G. Batshon, Lei Xu and T. Wang, “Coded polarization-multiplexed iterative polar modulation (PM-IPM) for beyond 400 Gb/s serial optical transmission,” in Proc. OFC/NFOEC 2010, Paper No. OMK2, San Diego, CA, March 21–25, 2010.

**4. **J. Chen, A. Dholakia, E. Eleftheriou, M. Fossorier, and X.-Y. Hu, “Reduced-complexity decoding of LDPC codes,” IEEE Trans. Commun. **53**(8), 1288–1299 (2005). [CrossRef]

**5. **M. P. C. Fossorier, M. Mihaljevic, and H. Imai, “Reduced-complexity iterative decoding of low-density parity-check codes based on belief propagation,” IEEE Trans. Commun. **47**(5), 673–680 (1999). [CrossRef]