## Abstract

A blind frequency and phase search algorithm for joint frequency and phase recovery is introduced. The algorithm achieves low complexity by processing in polar coordinates, which reduces the number of multiplications. We show an implementation for real-time processing at 32 GBd on FPGA hardware. The hardware design allows for dynamic multi-format operation, where the format can be switched flexibly after each clock cycle (250 MHz, 128 symbols) between 4QAM, 8QAM, and 16QAM. The performance of the algorithm is evaluated with respect to laser phase noise, carrier frequency offset, and carrier frequency offset drift. The effect of working with limited hardware resources is investigated. An FPGA implementation shows the feasibility of our carrier recovery algorithm with a negligible penalty when compared to a floating-point simulation.

Published by The Optical Society under the terms of the Creative Commons Attribution 4.0 License. Further distribution of this work must maintain attribution to the author(s) and the published article's title, journal citation, and DOI.

## 1. Introduction

An efficient multi-format carrier recovery is an essential processing block in coherent receivers for future elastic optical networks (EON) [1,2]. A particular challenge in such networks is that transceivers need to adapt flexibly to actual traffic demands. Thus, coherent receivers need to dynamically follow the modulation format and signal bandwidth of the transmitters. For this, transceivers need efficient, universal, and flexible digital signal processing (DSP) units.

Two vital blocks in such DSP units are the carrier frequency recovery (CFR) and the carrier phase recovery (CPR) [3]. The CFR detects, tracks, and corrects for the carrier frequency offset (CFO) between the transmitter and receiver lasers. The CPR compensates for laser phase noise and phase offsets. The phase noise originating from the lasers' finite linewidth varies much faster than the frequency drift. For the CFR, only slow changes have to be tracked, which simplifies the implementation. Here, we confine the discussion to non-data-aided algorithms, as they achieve higher spectral efficiencies because they avoid additional training symbols or pilot tones.

There are two main categories of CFR algorithms: the $M$th-power algorithm, which can be implemented either in the time domain [4] or in the frequency domain [5], and the blind frequency search (BFS) [6]. CPR algorithms can also be subdivided into two main groups. The most common algorithm for QPSK is the Viterbi-Viterbi phase estimation (VVPE) algorithm [7]. The VVPE algorithm has been adapted with QPSK partitioning [8], multi-stage approaches [9], and for higher-order modulation formats [10]. A second group of CPR algorithms uses the blind phase search (BPS) [11]. The BPS typically suffers from a large complexity. Meanwhile, the complexity has been reduced by multi-stage approaches [12,13] and with a simplified cost function [14]. Both approaches have already been demonstrated in real time at low symbol rates [15]. Another group of CPR algorithms takes advantage of a nonlinear transformation of the received signal. They use either a harmonic decomposition [16,17] or the nonlinear least squares method [18]. Recently, an additional approach has been shown for a real-time implementation at 25 GBd [19] where a multi-symbol delay detection (MSDD) scheme is used [20].

However, a flexible, blind, and joint CFR and CPR that operates in real time with low hardware complexity at the highest data rates has not yet been shown. This is in part because such hardware implementations in coherent optical communication links with >100 Gbit/s are challenging: the hardware has to process several 100 Gbit/s to Tbit/s of raw data at DSP clock frequencies that are typically below 1 GHz. Therefore, massively parallel processing is required [21], and large blocks of data need to be processed simultaneously. Yet, due to the large number of parallel operations, a real-time capable hardware implementation is only feasible if the processing complexity is minimized. Minimum complexity is also necessary to achieve a reasonably low power consumption.

In this paper, we present a real-time, flexible, blind, and joint CPR [22] and CFR [23] algorithm that operates in polar coordinates for operation with low complexity. The algorithms are based on the BFS and BPS techniques. We explain the operation principle, show the performance through simulations and demonstrate a real-time hardware implementation for the 4QAM, 8QAM, and 16QAM modulation formats. The implementation is realized without using hardware multiplications on the FPGA. The resilience of the algorithms to impairments such as laser phase noise (LPN), low signal-to-noise ratio (SNR), and a static and varying carrier frequency offset (CFO) has been analyzed using MATLAB and ModelSim simulations. The complexity and hardware requirements have been studied for an FPGA-based prototyping platform with Vivado design tools (Xilinx).

## 2. Operation principle and implementation

The multi-format carrier recovery (CR) algorithm comprises two main blocks. The first part is a CFR algorithm that compensates for CFO and CFO drifts between transmitter and receiver. The second part is a CPR algorithm that compensates for the laser phase noise. The CFR is adapted in parts from [6] and has been modified for lowest hardware complexity as introduced in [20]. The CPR is partially based on [11] and a real-time implementation with a modified metric is presented in [19]. The keys to the lowest processing complexity, and therefore to a hardware implementation of our CR, are processing in polar coordinates and new simplified metrics.

The general principle for the CFR and CPR algorithms is shown in Fig. 1(a). For both algorithms, the input data is needed in polar coordinates. If the data is provided in Cartesian coordinates, a coordinate transformation block is needed. Such a coordinate transformation can be implemented with few hardware resources using the CORDIC algorithm [24]. Both algorithms consist of three steps. In the first step, test frequencies or test phases are used to blindly correct the received signal. In the second step, the corrected signal is evaluated with a cost function to judge the quality of the applied correction. The cost function becomes minimal when the optimum correction frequency or phase has been found. For different modulation formats like 4QAM, 8QAM, and 16QAM, see Fig. 1(b), different cost functions are used. The cost function can also be adapted for other modulation formats, especially if their symbols can adequately be described in polar coordinates. This applies to modulation formats such as phase shift keying (m-PSK) and amplitude and phase shift keying (m-APSK). In the last step, the selected correction frequency or phase is applied to the signal. For the CFR, the frequency error changes slowly. Therefore, the test frequencies can be applied in sequence over a large number of clock cycles. The phase error, however, varies quickly, so that the test phases in the CPR have to be applied and evaluated in parallel. In the following sections, we describe the implementations of CFR and CPR in detail.

#### 2.1 Multi-format frequency recovery implementation

The schematic of the CFR is depicted in Fig. 2. As the CFO typically drifts by only 0.2 MHz/μs [25], which is slow on the time scale of the symbol rate, the CFR can operate at a lower speed. Therefore, it is sufficient for our implementation of the CFR to evaluate only one test frequency per clock cycle and select the correct frequency offset once all test frequencies have been evaluated.

If the signal is not available in polar coordinates, it is first converted from Cartesian coordinates to polar coordinates to reduce the complexity of the subsequent steps. Here, we exploit the CORDIC algorithm [24], which converts the incoming complex time samples ${r}_{l}$ (sampled at times ${t}_{l}$ with $l=1,\dots ,{L}_{\text{CFR}}$) to polar coordinates with amplitude ($\left|{r}_{l}\right|$) and phase ($\angle {r}_{l}$) without the use of multiplications.
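The shift-and-add principle behind such a conversion can be illustrated with a minimal vectoring-mode CORDIC sketch. This is not the VHDL design of this work: the function name `cordic_to_polar`, the 16 iterations, and the 16-bit fixed-point scaling are assumptions chosen for illustration, and the constant CORDIC gain is divided out in floating point purely for readability.

```python
import math

ITERATIONS = 16   # assumed number of micro-rotations (illustrative)
FRAC_BITS = 16    # assumed fixed-point scaling (illustrative)
# Rotation angles atan(2^-i); in hardware this would be a small lookup table.
ATAN_TABLE = [math.atan(2.0 ** -i) for i in range(ITERATIONS)]
# Constant amplitude gain accumulated by the micro-rotations.
GAIN = math.prod(math.sqrt(1.0 + 4.0 ** -i) for i in range(ITERATIONS))


def cordic_to_polar(x: int, y: int) -> tuple:
    """Return (amplitude, phase in radians) of the integer point (x, y)."""
    x <<= FRAC_BITS
    y <<= FRAC_BITS
    phase = 0.0
    # Pre-rotate by +/-90 degrees so that the vector lies in the right half-plane.
    if x < 0:
        if y >= 0:
            x, y, phase = y, -x, math.pi / 2
        else:
            x, y, phase = -y, x, -math.pi / 2
    # Drive y towards zero using only shifts, additions, and subtractions.
    for i in range(ITERATIONS):
        if y > 0:
            x, y = x + (y >> i), y - (x >> i)
            phase += ATAN_TABLE[i]
        else:
            x, y = x - (y >> i), y + (x >> i)
            phase -= ATAN_TABLE[i]
    # Remove the constant gain (floating point here, for readability only).
    amplitude = x / (GAIN * (1 << FRAC_BITS))
    return amplitude, phase
```

For example, `cordic_to_polar(3, 4)` returns an amplitude close to 5 and a phase close to `math.atan2(4, 3)`.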

The CFR itself operates on a block of ${L}_{\text{CFR}}$ samples per clock cycle and comprises four steps: A test frequency is applied to the signal, a cost function is evaluated, the optimal test frequency is detected, and the CFO is corrected. In the following, we describe the respective steps in more detail.

We first apply one of $K$ test frequencies ${f}_{k}$ ($k=1,\dots ,K$) per clock cycle to the received signal. This is implemented by adding to the signal phase ($\angle {r}_{l}$) a linear phase ramp ${\phi}_{k,1},\dots ,{\phi}_{k,{L}_{\text{CFR}}}$ that corresponds to the test frequency ${f}_{k}=\text{d}{\phi}_{k}/\text{d}t$. For the correct test frequency, this results in a sequence of ${L}_{\text{CFR}}$ samples with a minimal frequency offset.
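As a rough illustration of this step, the following sketch adds a linear phase ramp to one block of received phases. The function names, the assumption of one sample per symbol at 32 GBd, and the convention that a test frequency in Hz corresponds to a phase increment of $2\pi f_k/R_s$ per sample are choices made for this example; a streaming implementation would additionally have to continue the ramp across block boundaries.

```python
import math

TWO_PI = 2.0 * math.pi
SYMBOL_RATE = 32e9  # one sample per symbol assumed


def phase_ramp(f_test_hz, n_samples=128, symbol_rate=SYMBOL_RATE):
    """Linear phase ramp (radians, wrapped to [0, 2*pi)) for one test frequency."""
    step = TWO_PI * f_test_hz / symbol_rate  # phase increment per sample
    return [(step * l) % TWO_PI for l in range(n_samples)]


def apply_test_frequency(phases, f_test_hz, symbol_rate=SYMBOL_RATE):
    """Add the ramp to the received phases; only additions act on the signal."""
    ramp = phase_ramp(f_test_hz, len(phases), symbol_rate)
    return [(p + r) % TWO_PI for p, r in zip(phases, ramp)]
```

Applying a test frequency of −100 MHz to a block whose phase rotates with a +100 MHz offset yields a constant phase over the block.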

In the second step, we calculate the cost function ${J}_{\text{CFR}}\left({f}_{k}\right)$ to obtain a measure for the residual frequency offset. For the minimal frequency offset that corresponds to the correct test frequency, the phase of the samples should now be nearly constant. As a result, the variance of the phase, which is a superposition of the added noise of the signal and the phase drift due to the frequency offset, will be minimal as well. To calculate the variance of the phase, additional processing steps are required to remove the symbol information. Depending on the received format (4QAM, 8QAM, or 16QAM), we remove the symbol information differently. For 4QAM, we map the four symbols to the first quadrant by performing a modulo $\pi /2$ operation on the phase, see Fig. 3(a).

We then calculate the phase difference $\Delta {\phi}_{k,l}$ to the reference phase ${\phi}_{\text{ref}}=\pi /4$, which is the angle at which an ideal symbol is found after performing the modulo $\pi /2$ operation. Finally, the variance of the phase differences $\Delta {\phi}_{k,l}$ has to be calculated.
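The three operations described for 4QAM (modulo $\pi/2$, difference to $\phi_{\text{ref}}=\pi/4$, variance) can be sketched as follows. This is a floating-point illustration using the textbook variance, which involves multiplications; the simplified hardware metric of this work is not reproduced here, and the function name `cfr_cost_4qam` is hypothetical.

```python
import math

HALF_PI = math.pi / 2
PHI_REF = math.pi / 4  # ideal 4QAM phase after folding into the first quadrant


def cfr_cost_4qam(phases):
    """Variance of the phase differences to PHI_REF after the modulo pi/2 fold."""
    folded = [p % HALF_PI for p in phases]        # remove the symbol information
    delta = [f - PHI_REF for f in folded]         # difference to the reference phase
    mean = sum(delta) / len(delta)
    return sum((d - mean) ** 2 for d in delta) / len(delta)
```

A block without residual frequency offset gives a cost near zero, whereas a residual phase ramp across the block increases the cost, which is what the minimum search over the test frequencies exploits.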

In the third step, all results of the cost function are evaluated to determine the optimum test frequency and therefore the frequency offset. As only one test frequency is evaluated per clock cycle, the results are buffered. After calculating the result of the cost function for all $K$ test frequencies, the buffer contains results of ${J}_{\text{CFR}}\left({f}_{k}\right)$ for all test frequencies and will be continuously updated. The estimated optimal offset frequency is found by the minimum of the cost function ${J}_{\text{CFR}}\left({f}_{k}\right)$. The shape of the cost function for 4QAM, 8QAM, and 16QAM is displayed in Fig. 4(a). Here, we neglected the laser phase noise to clearly show the unperturbed cost function. It can be observed that a larger block of samples leads to a steeper cost function. In our implementation, an exponential average on the selected CFO reduces the impact of noise and of the decreased precision due to approximations in the implementation. The exponential average is implemented by ${\bar{i}}_{\mathrm{min},t}=\alpha \cdot {i}_{\mathrm{min},t}+\left(1-\alpha \right)\cdot {\bar{i}}_{\mathrm{min},t-1}$, where ${i}_{\mathrm{min},t}$ is the most recent minimum index of the cost function and ${\bar{i}}_{\mathrm{min},t-1}$ is the previous averaged index. The variable $\alpha $ is the weighting factor and is set to $\alpha =0.125$. Figure 4(b) shows the CFR without (blue) and with exponential averaging (green) tracking the CFO of a 16QAM signal (red, drift 1 MHz/μs).
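The exponential averaging of the minimum index can be sketched in a few lines. Since $\alpha = 0.125 = 2^{-3}$, the update can also be written with a bit shift instead of a multiplication; whether the hardware design does exactly this is an assumption of this example, as are the function names and the 8 fractional bits.

```python
def smooth_min_index(indices, alpha=0.125):
    """Floating-point exponential average of the minimum-index sequence."""
    avg = float(indices[0])
    smoothed = []
    for i in indices:
        avg = alpha * i + (1.0 - alpha) * avg
        smoothed.append(avg)
    return smoothed


def smooth_min_index_shift(indices, frac_bits=8, shift=3):
    """Same recursion in fixed point: avg += (new - avg) >> 3, i.e. alpha = 1/8."""
    avg_fp = indices[0] << frac_bits
    smoothed = []
    for i in indices:
        avg_fp += ((i << frac_bits) - avg_fp) >> shift
        smoothed.append(avg_fp / (1 << frac_bits))
    return smoothed
```

For the index sequence `[10, 10, 18]`, both variants move the average from 10 to 11 after the jump, so a single outlier shifts the selected test frequency by only one eighth of the jump.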

In the last step, the CFO is corrected by adding the respective phase vector of the selected test frequency to the signal. After CFO compensation, the CPR stage described in the following section corrects the remaining phase offset.

#### 2.2 Multiplier-free phase recovery implementation

Figure 5 shows the block diagram of the multi-format carrier phase recovery (CPR). The CPR algorithm uses the same processing structure as the CFR, with the main difference that all test phases have to be applied in parallel to track the quickly changing carrier phase. Similar to the CFR, we use polar instead of Cartesian coordinates in order to implement a multiplier-free system. Since the received samples ${r}_{l}$ have already been converted to polar coordinates for the CFR block, no further CORDIC stage is needed in the CPR block. While the phase information is used for processing in all of the following steps, the amplitude is only needed for the partitioning in the case of 8QAM and 16QAM.

The CPR algorithm corrects the phase offset in four steps. First, test phases are added. Second, the quality of the test phases is calculated with the cost function. Third, the optimal phase is detected by minimizing the cost function. Finally, the signal phase is corrected. We will now describe these processing blocks in more detail.

First, we add $B$ test phases (${\phi}_{b}={\phi}_{1},\dots ,{\phi}_{B}$) to $B$ copies of the received phase values ($\angle {r}_{l}$) in a block of length ${L}_{\text{CPR}}$. This results in $B$ blocks of ${L}_{\text{CPR}}$ estimated phases ${\vartheta}_{\text{test}}$, out of which the block with the correct phase has to be selected.

Second, we analyze the $B$ blocks of samples with a cost function ${J}_{\text{CPR}}\left({\phi}_{b}\right)$ to detect the correct phase offset. The cost function performs an averaging over the phase difference to one or multiple reference phases ${\phi}_{\text{ref}}$.

In the third step, we select the correct phase out of all test phases. For the correct carrier phase, the cost function ${J}_{\text{CPR}}\left({\phi}_{b}\right)$ becomes minimal as depicted in Fig. 6(a) with a block size of ${L}_{\text{CPR}}=128$ and a number of test angles of $B=99$.

In the final step, the carrier phase is corrected by adding the selected test phase to the received phase value. As all values are available in polar coordinates, this can be realized with minimum effort.
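The four CPR steps can be summarized in a short sketch for 4QAM. Several details are assumptions of this example rather than statements about the actual design: the test phases are spread uniformly over $[-\pi/4, \pi/4)$, the cost is taken as the mean absolute phase difference to $\phi_{\text{ref}}=\pi/4$ after a modulo $\pi/2$ fold, and the loop over test phases stands in for what would be $B$ parallel branches in hardware.

```python
import math

HALF_PI = math.pi / 2
PHI_REF = math.pi / 4


def cpr_4qam(phases, n_test=32):
    """Blind phase search on one block of phases; returns (best phase, corrected block)."""
    # Step 1: B test phases covering one quadrant (the fold leaves a pi/2 ambiguity).
    test_phases = [b * HALF_PI / n_test - math.pi / 4 for b in range(n_test)]

    # Step 2: cost per test phase, using additions, a fold, and absolute values only.
    def cost(phi):
        return sum(abs((p + phi) % HALF_PI - PHI_REF) for p in phases) / len(phases)

    # Step 3: the test phase with minimum cost is selected.
    best = min(test_phases, key=cost)
    # Step 4: the correction is a plain addition in polar coordinates.
    corrected = [(p + best) % (2.0 * math.pi) for p in phases]
    return best, corrected
```

For a noise-free block rotated by 0.2 rad, the search returns a correction of about −0.2 rad, limited by the test-phase spacing of $\pi/(2B)$. The remaining $\pi/2$ ambiguity of the fold is the kind of ambiguity that differential encoding resolves.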

## 3. Complexity and hardware utilization

We implemented the proposed multi-format carrier recovery in VHDL and synthesized the design in order to evaluate the hardware consumption on an FPGA chip. The implementation includes the CORDIC algorithm for conversion from Cartesian to polar coordinates, the CFR algorithm, and the CPR algorithm. The implementations of CFR and CPR are multi-format capable and operate for 4QAM, 8QAM, and 16QAM. The design was evaluated with the Vivado design tools. The hardware design processes 128 samples in parallel with a clock frequency of 250 MHz, which results in a symbol rate of 32 GBd. Each sample represents one symbol and has a resolution of 8 bit per in-phase and quadrature component. For the CFR we choose *L*_{CFR} = 128, *I* = 24, and *K* = 51. The CPR is implemented with *L*_{CPR} = 32, which results in four parallel CPR blocks in order to process 128 symbols per clock cycle. The size of *L*_{CPR} has no significant influence on the hardware complexity as long as the overall number of processed symbols remains the same. We use a Xilinx Virtex 7 FPGA chip for our design considerations. The hardware requirements for designs with different numbers of test phases are presented in Table 1. As expected, the implementations utilize no DSP units.
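The throughput figures quoted above follow from simple arithmetic, reproduced here as a quick check. The raw input data rate is derived under the assumption of a single polarization with 8 bit each for the in-phase and quadrature components; the variable names are illustrative.

```python
PARALLEL_SAMPLES = 128     # samples processed per clock cycle
CLOCK_HZ = 250e6           # FPGA clock frequency
L_CPR = 32                 # CPR block length
BITS_PER_SAMPLE = 8 + 8    # in-phase + quadrature word width

symbol_rate = PARALLEL_SAMPLES * CLOCK_HZ       # 32e9 symbols/s, one sample per symbol
cpr_blocks = PARALLEL_SAMPLES // L_CPR          # parallel CPR blocks per clock cycle
raw_bit_rate = symbol_rate * BITS_PER_SAMPLE    # raw input data rate in bit/s

print(f"{symbol_rate / 1e9:.0f} GBd, {cpr_blocks} CPR blocks, "
      f"{raw_bit_rate / 1e9:.0f} Gbit/s raw input")
```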

## 4. Simulations and performance evaluation

We studied the performance of our algorithm by numerical simulations. As impairment factors, we modeled additive white Gaussian noise (AWGN), combined laser phase noise (LPN) from the transmitter and receiver, carrier frequency offset (CFO), and a CFO drift. The laser phase noise was modeled as a Wiener process as suggested in [11]. To focus on the limiting factors of the CR, other impairments like timing offset, chromatic dispersion, and polarization mode dispersion were excluded from the simulation. The simulation was performed with a PRBS sequence of length 2^{15} – 1, which was repeated to generate a sequence with more than 10^{6} symbols. The algorithm was analyzed for 4QAM, 8QAM, and 16QAM. We considered a symbol rate of 32 GBd and we used differential encoding [26]. We studied the performance in two steps. First, we investigated the performance of the CPR and, in the second step, the performance of the combined CFR and CPR algorithm.
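A Wiener phase-noise process of this kind can be generated as a cumulative sum of independent Gaussian increments with variance $\sigma^2 = 2\pi\,\Delta\nu\,T_s$, where $\Delta\nu$ is the combined laser linewidth and $T_s$ the symbol duration. The sketch below uses Python's standard library; it is not the MATLAB code used for the simulations, and the function name and fixed seed are illustrative.

```python
import math
import random


def wiener_phase_noise(n_symbols, linewidth_hz, symbol_rate=32e9, seed=0):
    """Phase-noise trajectory (radians) sampled once per symbol."""
    rng = random.Random(seed)  # fixed seed for reproducibility of the example
    sigma = math.sqrt(2.0 * math.pi * linewidth_hz / symbol_rate)
    phase = 0.0
    trajectory = []
    for _ in range(n_symbols):
        phase += rng.gauss(0.0, sigma)  # independent Gaussian increment
        trajectory.append(phase)
    return trajectory
```

For a combined linewidth of 100 kHz at 32 GBd, the increment variance is about $2\times 10^{-5}\,\text{rad}^2$ per symbol, so the phase changes only slightly within a block of 32 symbols.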

The simulation results of the CPR without CFR and neglected impairments like CFO and CFO drift are presented in Fig. 7. In each case, we show a penalty for the required SNR to achieve a BER of 10^{-3}. The penalty is calculated in relation to the theoretical limit. The theoretical limits assumed for differentially encoded 4QAM, 8QAM, and 16QAM are 10.35 dB, 14.08 dB, and 16.97 dB, respectively [26]. The number of test phases was chosen to be ${B}_{\text{4QAM}}=31$ for 4QAM and ${B}_{\text{8QAM,16QAM}}=51$ for 8QAM and 16QAM to avoid the influence of a limited resolution of the test phases.

Figure 7(a) shows the influence of the processing block size ${L}_{\text{CPR}}$ of the CPR. For different LPN, different block sizes offer advantages in performance. For low LPN, long window lengths ${L}_{\text{CPR}}$ are beneficial as AWGN-related errors average out. For larger LPN, shorter window lengths ${L}_{\text{CPR}}$ are required to avoid phase fluctuations within the window length ${L}_{\text{CPR}}$. For the smallest window lengths, the estimate of the algorithm degrades significantly, as the AWGN can no longer be reduced by averaging. For the following simulations, we considered window lengths of 32 and 64. Figure 7(b) shows the algorithm's performance under the influence of an increasing LPN. The shorter window length ${L}_{\text{CPR}}=32$ shows better tolerance to LPN, as is expected from Fig. 7(a). Considering an SNR penalty smaller than 0.5 dB, the algorithm tolerates a laser linewidth of up to 4 MHz, 3 MHz, and 800 kHz for 4QAM, 8QAM, and 16QAM, respectively. Figure 7(c) presents the SNR penalty for different numbers of test angles $B$. The block size was fixed to ${L}_{\text{CPR}}=32$ and the combined laser linewidth was assumed to be 100 kHz. The number of test angles has a direct influence on the complexity of the implementation since the different test angles are applied in parallel. Therefore, it is important to investigate the minimal number of test phases required for a reasonable performance. For 4QAM, 8QAM, and 16QAM, the performance degradation is below 0.25 dB for $B\ge 9$, $B\ge 12$, and $B\ge 24$ test angles, respectively. In Fig. 7(d), we show the performance degradation under the influence of a limited ADC word width. The signal is impaired by AWGN and by laser phase noise with a combined linewidth of 100 kHz. The block size is fixed to ${L}_{\text{CPR}}=32$. At a limited ADC word width of 6 bit, we observe an SNR penalty of 0.16 dB, 0.22 dB, and 0.36 dB for 4QAM, 8QAM, and 16QAM, respectively. For word widths larger than 8 bit, only a minor performance improvement is visible for the three modulation formats.

In the next step, we added our CFR unit to the simulation model. We evaluated the performance of the CFR for varying CFOs and drifting CFOs under the influence of a fixed combined LPN of 100 kHz. The CPR has a fixed processing length of ${L}_{\text{CPR}}=32$ and $B=51$ test angles. We used a large number of test angles to exclude any influence from a limited number of test angles. The CFR is implemented with a fixed number of $K=51$ test frequencies in a range of ±150 MHz. For larger frequency ranges, the number of test frequencies $K$ can be increased. Since the different test frequencies are applied sequentially, the number of parallel processing steps does not increase with an increasing $K$; only a larger memory for the stored test frequencies is needed. The speed of tracking a drifting CFO, however, will decrease with an increasing $K$ since it takes a longer time to apply all the test frequencies sequentially.
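The trade-off between the number of test frequencies and the tracking speed can be made concrete with a back-of-the-envelope calculation. It assumes that the $K$ test frequencies are spaced uniformly with both end points of the ±150 MHz range included, and that one test frequency is evaluated per 250 MHz clock cycle without additional pipeline latency; both are simplifications for illustration, and the names are hypothetical.

```python
CLOCK_HZ = 250e6       # one test frequency evaluated per clock cycle
K = 51                 # number of test frequencies
F_RANGE_HZ = 150e6     # search range is +/- F_RANGE_HZ

spacing_hz = 2 * F_RANGE_HZ / (K - 1)                        # grid resolution
test_frequencies = [-F_RANGE_HZ + k * spacing_hz for k in range(K)]
sweep_time_s = K / CLOCK_HZ                                  # time for one full sweep


def drift_during_sweep(drift_hz_per_s):
    """CFO change accumulated while all K test frequencies are evaluated once."""
    return drift_hz_per_s * sweep_time_s


print(f"spacing {spacing_hz / 1e6:.0f} MHz, sweep {sweep_time_s * 1e9:.0f} ns, "
      f"drift per sweep at 1 MHz/us: {drift_during_sweep(1e12) / 1e3:.0f} kHz")
```

Under these assumptions the grid spacing is 6 MHz and a full sweep takes 204 ns, during which a CFO drifting at 1 MHz/μs moves by about 0.2 MHz. Doubling $K$ doubles the sweep time and therefore the drift accumulated per sweep.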

Figure 8(a) depicts the results of our simulations with different CFO values and a fixed ${L}_{\text{CFR}}=128$. We present the performance difference between processing without and with CFR. Without the CFR stage, a CFO of 37.5 MHz, 25 MHz, and 12.5 MHz for 4QAM, 8QAM, and 16QAM, respectively, is still tolerable with an SNR penalty below 0.5 dB. Thus, the CPR alone is also able to track and correct a phase evolution such as that caused by a CFO up to a certain amount. With the CFR stage, no performance dependency on the CFO can be observed. The influence of CFO drifts on the proposed CR is presented in Fig. 8(b). The results are presented for processing lengths of ${L}_{\text{CFR}}=128$ and ${L}_{\text{CFR}}=256$. A larger processing length results in a larger performance penalty with increasing speeds of a drifting CFO. This is associated with the larger number of time samples that are needed to calculate a CFO estimate. Consequently, the CFO estimates are not calculated fast enough to track the drifting CFO.

To compare the performance of the hardware implementation (HW) with the floating-point MATLAB simulation (SW), we implemented a MATLAB-ModelSim co-simulation. We compared the BER performance of the SW and the HW simulation for 24 test angles, a CFO of 80 MHz, a CFO drift of −1 MHz/µs, and a laser linewidth of 300 kHz. We increased the linewidth to 300 kHz to stress test the implementation. The hardware implementation, including the CORDIC algorithm, was designed with a word width of 8 bit for the in-phase and quadrature components or amplitude and phase, respectively. The block sizes for the CFR and the CPR are set to *L*_{CFR} = 128 and *L*_{CPR} = 32. We simulated >10^{6} symbols for each BER value and investigated the dependence on the SNR for 4QAM, 8QAM, and 16QAM. The same multi-format capable hardware design was used for all modulation formats. Figure 9(a) shows the results for HW and SW simulation in relation to the differentially coded theoretical limits of the respective modulation format. We observe minimal penalties for the required SNR for a BER of 10^{−3} for SW and HW simulations. For 4QAM and 8QAM, the penalty is below 0.2 dB; for 16QAM, the penalty is below 0.4 dB. The minor SNR penalty between SW and HW originates from the fixed-point calculation in the case of HW processing. The larger SNR penalty of 16QAM can be attributed to the required partitioning, which is not ideal in polar coordinates. Constellation diagrams of the output of the HW simulation for all formats are shown in Fig. 9(b). In each case, the SNR that is theoretically required for a BER of 10^{−3} is used for the simulation. Zooming in, one may observe the 256 quantized phase states allowed by the 8 bit resolution.
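The 256 phase states mentioned above can be illustrated with a small sketch. It assumes that the 8 bit phase word maps the range $[0, 2\pi)$ uniformly onto the integers 0 to 255, which is not stated explicitly in the text; under that assumption, adding a test phase is a wrapping 8 bit addition and the modulo $\pi/2$ fold for 4QAM reduces to keeping the six least significant bits. All function names are hypothetical.

```python
import math

PHASE_BITS = 8
LEVELS = 1 << PHASE_BITS  # 256 phase states


def quantize_phase(phase_rad):
    """Map a phase in radians to the nearest of 256 uniformly spaced states."""
    return round(phase_rad / (2.0 * math.pi) * LEVELS) % LEVELS


def add_phases(a, b):
    """Phase addition is an 8-bit addition with natural wrap-around at 2*pi."""
    return (a + b) & (LEVELS - 1)


def fold_4qam(q):
    """Modulo pi/2 equals modulo 64 states, i.e. masking the two top bits."""
    return q & (LEVELS // 4 - 1)


step_deg = 360.0 / LEVELS  # phase resolution in degrees
```

With this mapping the phase resolution is about 1.4°, and the 4QAM reference phase $\pi/4$ corresponds to state 32 after the fold.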

## 5. Conclusion

We have introduced a joint multi-format frequency and phase recovery algorithm relying on processing in polar coordinates for low hardware complexity and demonstrated its operation at 32 GBd. Processing in polar coordinates is especially beneficial for the parallel application of test phases and the calculation of the cost function in the BPS-based CPR. The performance of our CR algorithm, working for 4QAM, 8QAM, and 16QAM, has been tested under the influence of laser linewidths, carrier frequency offsets, and carrier frequency offset drifts. For the hardware implementation, we investigated the influence of design parameters like processing block length, resolution of the test phases, and limited word widths. The CR algorithm has been implemented in VHDL and the chip utilization of an FPGA implementation shows its feasibility. The algorithm can dynamically switch the modulation format after each clock cycle (250 MHz, 128 symbols). We compared the BER performance of our hardware implementation with the software simulation under the influence of a 300 kHz linewidth laser, a CFO of 80 MHz, and a CFO drift of −1 MHz/µs. The SNR penalty for the hardware implementation when compared to the theoretical limit is negligible (<0.2 dB for 4QAM and 8QAM, <0.4 dB for 16QAM).

## Funding

We acknowledge financial support by the European Commission under FP7 program, project FOX-C (grant no. 318415) and by the Xilinx University Program (XUP).

## References and links

**1.** O. Gerstel, M. Jinno, A. Lord, and S. J. B. Yoo, “Elastic optical networking: a new dawn for the optical layer?” IEEE Commun. Mag. **50**(2), 12–20 (2012).

**2.** A. Lau, Y. Gao, Q. Sui, D. Wang, Q. Zhuge, M. Morsy-Osman, M. Chagnon, X. Xu, C. Lu, and D. Plant, “Advanced DSP techniques enabling high spectral efficiency and flexible transmissions: toward elastic optical networks,” IEEE Signal Process. Mag. **31**(2), 82–92 (2014).

**3.** S. J. Savory, “Digital coherent optical receivers: algorithms and subsystems,” IEEE J. Sel. Top. Quantum Electron. **16**(5), 1164–1179 (2010).

**4.** A. Leven, N. Kaneda, U.-V. Koc, and Y.-K. Chen, “Frequency estimation in intradyne reception,” IEEE Photonics Technol. Lett. **19**(6), 366–368 (2007).

**5.** M. Selmi, Y. Jaouen, and P. Ciblat, “Accurate digital frequency offset estimator for coherent PolMux QAM transmission systems,” in *Proc. ECOC* (2009), paper P3.08.

**6.** X. Zhou, J. Yu, M.-F. Huang, Y. Shao, T. Wang, L. Nelson, P. Magill, M. Birk, P. I. Borel, D. W. Peckham, R. Lingle, and B. Zhu, “64-Tb/s, 8 b/s/Hz, PDM-36QAM transmission over 320 km using both pre- and post-transmission digital signal processing,” J. Lightwave Technol. **29**(4), 571–577 (2011).

**7.** A. M. Viterbi, “Nonlinear estimation of PSK-modulated carrier phase with application to burst digital transmission,” IEEE Trans. Inf. Theory **29**(4), 543–551 (1983).

**8.** I. Fatadin, D. Ives, and S. J. Savory, “Laser linewidth tolerance for 16-QAM coherent optical systems using QPSK partitioning,” IEEE Photonics Technol. Lett. **22**(9), 631–633 (2010).

**9.** K. P. Zhong, J. H. Ke, Y. Gao, and J. C. Cartledge, “Linewidth-tolerant and low-complexity two-stage carrier phase estimation based on modified QPSK partitioning for dual-polarization 16-QAM systems,” J. Lightwave Technol. **31**(1), 50–57 (2013).

**10.** S. M. Bilal, C. R. S. Fludger, V. Curri, and G. Bosco, “Multistage carrier phase estimation algorithms for phase noise mitigation in 64-quadrature amplitude modulation optical systems,” J. Lightwave Technol. **32**(17), 2973–2980 (2014).

**11.** T. Pfau, S. Hoffmann, and R. Noé, “Hardware-efficient coherent digital receiver concept with feedforward carrier recovery for M-QAM constellations,” J. Lightwave Technol. **27**(8), 989–999 (2009).

**12.** X. Zhou, “An improved feed-forward carrier recovery algorithm for coherent receivers with M-QAM modulation format,” IEEE Photonics Technol. Lett. **22**(14), 1051–1053 (2010).

**13.** J. Li, L. Li, Z. Tao, T. Hoshida, and J. C. Rasmussen, “Laser-linewidth-tolerant feed-forward carrier phase estimator with reduced complexity for QAM,” J. Lightwave Technol. **29**(16), 2358–2364 (2011).

**14.** H. Zhou, J. Dong, S. Yan, Y. Zhou, and X. Zhang, “Low-complexity carrier phase recovery for square M-QAM based on S-BPS algorithm,” IEEE Photonics Technol. Lett. **26**(18), 1 (2014).

**15.** A. Al-Bermani, C. Wördehoff, K. Puntsri, O. Jan, U. Rückert, and R. Noé, “Real-time synchronous 16-QAM optical transmission system using blind phase search and QPSK partitioning carrier recovery techniques,” in *ITG-Fachtagung Photonische Netze* (Leipzig, Germany, 2012).

**16.** T.-H. Nguyen, M. Joindot, P. Scalart, M. Gay, L. Bramerie, O. Sentieys, J.-C. Simon, and C. Peucheret, “Carrier phase recovery for optical coherent M-QAM communication systems using harmonic decomposition-based maximum loglikelihood estimators,” in *Proc. SPPCom, Advanced Photonics* (2015), paper SpT4D.3.

**17.** T.-H. Nguyen, P. Scalart, M. Gay, L. Bramerie, C. Peucheret, O. Sentieys, J.-C. Simon, and M. Joindot, “Bi-harmonic decomposition-based maximum loglikelihood estimator for carrier phase estimation of coherent optical M-QAM,” in *Proc. OFC* (2016), paper Tu3K.3.

**18.** N. Argyris, S. Dris, C. Spatharakis, and H. Avramopoulos, “High performance carrier phase recovery for coherent optical QAM,” in *Proc. OFC* (2015), paper W1E.1.

**19.** A. Tolmachev, I. Tselniker, M. Meltsin, I. Sigron, D. Dahan, A. Shalom, and M. Nazarathy, “Multiplier-free phase recovery with polar-domain multisymbol-delay-detector,” J. Lightwave Technol. **31**(23), 3638–3650 (2013).

**20.** I. Tselniker, N. Sigron, and M. Nazarathy, “Joint phase noise and frequency offset estimation and mitigation for optically coherent QAM based on adaptive multi-symbol delay detection (MSDD),” Opt. Express **20**(10), 10944–10962 (2012).

**21.** A. Leven, N. Kaneda, and S. Corteselli, “Real-time implementation of digital signal processing for coherent optical digital communication systems,” IEEE J. Sel. Top. Quantum Electron. **16**(5), 1227–1234 (2010).

**22.** B. Baeuerle, A. Josten, F. C. Abrecht, E. Dornbierer, J. Boesser, M. Dreschmann, J. Becker, J. Leuthold, and D. Hillerkuss, “Multiplier-free, carrier-phase recovery for real-time receivers using processing in polar coordinates,” in *Proc. OFC* (2015), paper W1E.2.

**23.** B. Baeuerle, A. Josten, F. Abrecht, E. Dornbierer, D. Hillerkuss, and J. Leuthold, “Blind real-time multi-format carrier recovery for flexible optical networks,” in *Proc. SPPCom, Advanced Photonics* (2015), paper SpT4D.5.

**24.** J. E. Volder, “The CORDIC trigonometric computing technique,” IRE Trans. Electron. Comput. **EC-8**(3), 330–334 (1959).

**25.** S.-H. Fan, J. Yu, D. Qian, and G.-K. Chang, “A fast and efficient frequency offset correction technique for coherent optical orthogonal frequency division multiplexing,” J. Lightwave Technol. **29**(13), 1997–2004 (2011).

**26.** E. Ip and J. M. Kahn, “Feedforward carrier recovery for coherent optical communications,” J. Lightwave Technol. **25**(9), 2675–2692 (2007).

**27.** L. M. Pessoa, H. M. Salgado, and I. Darwazeh, “Performance evaluation of phase estimation algorithms in equalized coherent optical systems,” IEEE Photonics Technol. Lett. **21**(17), 1181–1183 (2009).