Large data center interconnect bottlenecks are dominated by the switch ASIC I/O BW and by the front panel BW imposed by pluggable modules. To overcome the front panel BW and switch ASIC BW limitations, one approach is to move the optics onto the mid-plane or to integrate the optics into the switch ASIC package. Over the last 4 years, VCSEL based optical engines have been integrated into the packages of large-scale HPC routers, moderate size Ethernet switches, and even FPGAs. Competing solutions based on Silicon Photonics (SiP) have also been proposed for integration into HPC and Ethernet switch packages, with a better integration path through the use of TSV (Through Silicon Via) stacked dies. Integrating either VCSEL or SiP based optical engines into a complex ASIC package that operates at high temperature, where achieving the required reliability is not trivial, demands a clear technical or economic advantage before embarking on such a complex integration. High density Ethernet switches addressing data centers currently in development are based on 25G NRZ signaling and the QSFP28 optical module, and can support up to 3.6 Tb/s of front panel bandwidth.
© 2015 Optical Society of America
The need for optical interconnects in HPC applications was clearly demonstrated in the case of the POWER7 server, where 28 VCSEL based MicroPOD™ (MicroPOD is a trademark of Avago Technologies.) optical engines, each having 12 lanes, were placed on the POWER7 router module with an aggregate BW of 2.68 Tb/s. More recently, a VCSEL based optical engine integrated at the wafer level, called the “Holey Optochip”, has been demonstrated, with the potential to overcome substrate size and congestion and to reduce footprint and power. The 10 GbE MMF link reach on OM3 (2000 MHz·km) fiber is 300 m, but the 100 GbE link reach, based on 4x25.78 GBd with RS-FEC(528,514), is 70 m on OM3 and 100 m on OM4 (4700 MHz·km). Unlike HPC applications, these massive data centers require a minimum fiber reach of 500 m, which is well beyond the reach of any MMF at 25.78 GBd. A single mode solution operating at 50 Gb/s or 100 Gb/s based on silicon photonics (SiP) in an MCM (2.5D integration) or with TSV (3D integration) has the potential not only to meet the reach objective, but also to address the next frontier in ASIC I/O BW requirements.
The front panel bandwidth density of large data center switches has increased from 1.44 Tb/s to 3.6 Tb/s, assuming a 19” rack with 36 ports of QSFP28, simply by increasing the signaling rate from 10 Gb/s to 25 Gb/s. With 25 Gb/s signaling the ASIC I/O has increased to the point where MicroPOD™ optics no longer need to be integrated on the substrate to achieve 2.67 Tb/s of ASIC BW. To address 25 Gb/s/lane interconnects, IEEE 802.3bm created a low power interface optimized for chip-to-module applications, called CAUI-4, based on NRZ signaling. The CAUI-4 transmitter uses a 3-tap transmit FFE, and the low power receive CTLE (Continuous Time Linear Equalizer) with 9 dB of gain supports a channel loss of up to 10 dB at Nyquist. A large multi-Tb/s ASIC on a 19” blade can be routed to 32-36 QSFP28 ports, each operating at 100 Gb/s, with trace lengths up to 250 mm while meeting the CAUI-4 loss budget of 10 dB, provided Megtron 6 HVLP PCB material manufactured by Panasonic is used. Most ASICs/FPGAs supporting copper cabling (CR4) and/or backplanes implement 100GBase-KR4/CR4 SerDes per IEEE 802.3bj. The KR4/CR4 SerDes receiver implements a CTLE followed by an advanced multi-tap DFE. The KR4/CR4 SerDes has higher power dissipation but can support PCB loss in the range of 32-35 dB at Nyquist.
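The 10 dB CAUI-4 budget claim can be sanity-checked with a back-of-the-envelope loss calculation. The per-length loss and connector allowance below are illustrative assumptions for the Megtron 6 HVLP material class, not datasheet values:

```python
# Rough check of the CAUI-4 host-channel budget for a ~250 mm trace at the
# 25.78 GBd Nyquist frequency (~12.9 GHz). LOSS_DB_PER_MM and CONNECTOR_DB
# are assumed, illustrative figures, not measured Megtron 6 HVLP data.

NYQUIST_GHZ = 25.78 / 2        # Nyquist frequency of 25G NRZ signaling
TRACE_MM = 250                  # host trace length, ASIC ball to QSFP28 cage
LOSS_DB_PER_MM = 0.03           # assumed stripline loss at Nyquist
CONNECTOR_DB = 1.5              # assumed connector + via/breakout allowance
CAUI4_BUDGET_DB = 10.0          # CAUI-4 channel loss budget at Nyquist

trace_loss = TRACE_MM * LOSS_DB_PER_MM
total_loss = trace_loss + CONNECTOR_DB
verdict = "within" if total_loss <= CAUI4_BUDGET_DB else "exceeds"
print(f"trace ≈ {trace_loss:.1f} dB, total ≈ {total_loss:.1f} dB "
      f"({verdict} the {CAUI4_BUDGET_DB:.0f} dB CAUI-4 budget)")
```

Under these assumed numbers a 250 mm route lands just inside the budget, which is consistent with the text's point that a lower loss laminate is what makes the retimer-free layout feasible.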
2. Front panel bandwidth bottlenecks
Table 1 lists module form factors for 10 GbE, 40 GbE, and 100 GbE and the associated bandwidth density per millimeter. With the introduction of the SFP+ in 2007, the front panel bandwidth bottleneck was eliminated by delivering an aggregate BW of 480 Gb/s, assuming 48 ports of SFP+. In 2010, with the introduction of the QSFP+, the front panel BW increased to 1.44 Tb/s, assuming 36 QSFP+ modules each operating at 40 Gb/s. Module form factors such as the 100 GbE CFP and CFP2 resulted in a front panel BW setback, but they were required to support 100 GbE PMDs such as 100GBase-LR4/ER4 (10 km/40 km), which are based on LAN-WDM spacing and require cooling. The industry is now migrating to CFP4 and QSFP28 to overcome the front panel BW setback created by CFP/CFP2. The CWDM4/CLR4 MSAs are defining a 100 GbE duplex SMF PMD with 2 km reach that does not require cooling, allowing a QSFP28 implementation with a maximum power dissipation of 3.5 W. The CDFP MSA, currently in development with a 16x25G module electrical interface, raises the front panel bandwidth to 4.4 Tb/s, and QSFP56 based on 50 Gb/s signaling is expected to increase the front panel BW to 7.2 Tb/s. Higher front panel bandwidth would require mid-board optics, see Fig. 1. The main drawbacks of mid-board optics are being locked to a single PMD, losing the lower cost Cu DAC option, and reduced serviceability.
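The front panel aggregate bandwidths quoted above all follow from port count times per-port rate; the short sketch below reproduces them for the module generations discussed (port counts per the text, assuming a 19” blade):

```python
# Front-panel aggregate bandwidth per module generation, using the port
# counts from the text: 48x SFP+ on a 1U faceplate, 36x for the QSFP family.

form_factors = {
    "SFP+   (10 GbE)":  (48, 10),
    "QSFP+  (40 GbE)":  (36, 40),
    "QSFP28 (100 GbE)": (36, 100),
    "QSFP56 (200 GbE)": (36, 200),
}

for name, (ports, gbps) in form_factors.items():
    total_tbps = ports * gbps / 1000
    print(f"{name}: {ports} x {gbps} Gb/s = {total_tbps:.2f} Tb/s")
# → 0.48, 1.44, 3.60, and 7.20 Tb/s, matching the figures in the text
```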
3. Switch ASIC bandwidth bottleneck
Next, the switch ASIC bandwidth bottleneck due to package size and PCB break out will be investigated. Figure 2 illustrates a suitable BGA ball map for 25 Gb/s signaling with 4 rows of high-speed signals. At 25 Gb/s the top and bottom PCB layers have higher loss due to solder mask and are not suitable for high speed routing. A package with 4 rows of high speed signals would have to use 4 stripline layers to avoid crosstalk. As the number of high speed rows increases to clear the BGA field, the resulting signal degradation may require a retimer (CDR) prior to driving the optics. As the number of stripline layers increases, either the board thickness increases, or narrower traces with higher loss must be used.
The BW density for a switch ASIC with 25G I/O can be estimated from the ball map of Fig. 2, where the ball map is 8x14 and supports 8 duplex links. Assuming an operating speed of 25 Gb/s and a square BGA package with 1 mm pitch, the BW density is only 1.8 Gb/s/mm². The I/O BW for any package size can then be estimated by scaling the Fig. 2 ball map density. Figure 3 shows the package I/O BW assuming 100% area utilization for high-speed signals, and for 2 rows, 4 rows, and 6 rows of high-speed signals. For the switch ASIC to match the 4.4 Tb/s of front panel bandwidth that CDFP can deliver in 2015, 6 rows of high-speed signals are required. A package supporting 4.4 Tb/s would need to be a ~62x62 mm ceramic LGA instead of the lower cost organic BGA, currently limited to about ~57x57 mm. Moving to 50G signaling allows doubling the ASIC BW without moving to larger and higher cost ceramic LGA packages.
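The density estimate above can be reproduced directly from the Fig. 2 ball map: 8 duplex links at 25 Gb/s over an 8x14 field of 1 mm pitch balls gives roughly 1.8 Gb/s per mm² per direction. A minimal sketch, scaling that density to full package sizes (100% high-speed utilization is an upper bound; real packages use only the outer ball rows):

```python
# I/O density from the Fig. 2 ball map: 8 duplex 25 Gb/s links over an
# 8 x 14 ball field at 1 mm pitch (area ≈ 112 mm^2), per direction.

LINKS = 8
LANE_GBPS = 25
FIELD_MM2 = 8 * 14

density = LINKS * LANE_GBPS / FIELD_MM2   # ≈ 1.8 Gb/s per mm^2
print(f"I/O density ≈ {density:.2f} Gb/s/mm^2")

# Upper-bound package I/O BW at 100% area utilization for two package
# sizes from the text (organic BGA ~57 mm, ceramic LGA ~62 mm):
for side_mm in (57, 62):
    bw_tbps = density * side_mm ** 2 / 1000
    print(f"{side_mm} x {side_mm} mm package: up to ~{bw_tbps:.1f} Tb/s")
```

That the 100% utilization bound for a 62x62 mm package (~6.9 Tb/s) only modestly exceeds the 4.4 Tb/s target illustrates why 6 rows of high-speed signals, rather than the 4 rows of Fig. 2, are needed at that package size.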
4. Current and foreseen optical implementations
Figure 4 shows the most common implementation based on the pluggable module, followed by mid-board optics, SiP co-packaged in an MCM, and SiP 3D integration with the switch using TSV. The Optical Internetworking Forum has recently started projects to standardize SerDes interfaces for TSV/MCM (USR) and mid-board optics (XSR) for operation from 39.9 to 56.2 Gb/s to optimize SerDes/OE power dissipation.
Option 1, “pluggable module”, with the more capable KR4 SerDes and higher power dissipation, is the preferred implementation to support DAC (Direct Attach Copper) cables or retimed optical modules. It is typically implemented with higher cost PCB material such as Megtron 6 to avoid a costly retimer. In option 2, “mid-board optics”, the on-board optics may require a retimer due to ISI and the associated reflections resulting from signal break out through a large complex BGA package. In option 3, “MCM”, the co-packaged OE implementation can eliminate the retimers and dial down the SerDes I/O to save power, but would still require driving 50 Ω transmission lines unless the electrical roundtrip delay is ~1/5 of a bit period. In option 4, “TSV”, SerDes power is further reduced, as 50 Ω back termination is no longer required and a CMOS inverter may drive the silicon modulator directly. Table 2 shows duplex link implementations and power dissipation for the pluggable module, mid-board optics, MCM, and TSV [8, 9]. The power dissipation listed in Table 2 excludes the OE/EO power dissipation. Figure 5 shows an example KR4 SerDes and the blocks eliminated in going from a KR4 SerDes to a TSV SerDes. The largest sources of power saving are eliminating on-board retimers and moving from a DFE based SerDes to a lightweight SerDes.
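The power trend across the four options can be sketched as a sum of per-lane SerDes blocks, with each step toward TSV removing equalization, retimer, and termination blocks. The milliwatt figures below are hypothetical placeholders chosen only to show the trend; they are not the Table 2 values:

```python
# Illustrative per-lane power model for the four options of Fig. 4.
# All block power values are hypothetical placeholders (NOT Table 2 data);
# the point is which blocks each option eliminates.

BLOCKS_MW = {
    "FFE/CTLE": 40,          # assumed transmit FFE + receive CTLE
    "DFE": 80,               # assumed multi-tap DFE (KR4-class receiver)
    "retimer": 120,          # assumed on-board retimer (CDR)
    "50-ohm driver": 60,     # assumed 50 Ω terminated line driver
    "inverter driver": 15,   # assumed unterminated CMOS inverter drive
}

OPTIONS = {
    "1: pluggable (KR4)":    ["FFE/CTLE", "DFE", "50-ohm driver"],
    "2: mid-board + retimer": ["FFE/CTLE", "retimer", "50-ohm driver"],
    "3: MCM (USR)":          ["FFE/CTLE", "50-ohm driver"],
    "4: TSV":                ["inverter driver"],
}

for name, blocks in OPTIONS.items():
    total = sum(BLOCKS_MW[b] for b in blocks)
    print(f"{name}: ~{total} mW/lane ({' + '.join(blocks)})")
```

Even with placeholder numbers, the model reflects the text's two main savings: dropping the retimer and replacing the DFE-based, back-terminated SerDes with a lightweight driver.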
QSFP28 is well matched to the current state-of-the-art switch ASIC with 3.2 Tb/s of bandwidth, offering flexible choices of MMF optics, SMF optics, and Cu cable. Over the next 2-3 years QSFP56, based on 50G signaling, could offer 7.2 Tb/s of front panel bandwidth. The QSFP56 interface, based on 50G NRZ or PAM4 signaling, may require an additional mid-board retimer, which will increase cost and power dissipation. Moving away from the pluggable module requires an overwhelming cost-power advantage or a clear bandwidth bottleneck. Only the MCM or TSV implementations can potentially offer sufficient cost-power savings to forgo the flexibility of the pluggable module for data center applications.
Large data centers have a genuine bandwidth bottleneck; the lower cost and power VCSEL MMF links can no longer address the reach requirements. Parallel SMF or duplex SMF implemented with SiP is an attractive option to address the large data center reach requirement of 300-700 m. SiP, either placed on a common substrate (MCM) or implemented with TSV, has the potential to offer cost on par with the VCSEL MMF link and to meet the overwhelming majority of applications with single mode optics. Mid-board, MCM, or TSV optics become more attractive with the migration to 50 Gb/s signaling, compared to the pluggable QSFP56, due to channel impairments and the possible need for additional mid-board retimers. As silicon photonics manufacturing matures, it will enable lower cost and power embedded SiP for the next generation 7.2 Tb/s front panel applications, but QSFP56, offering flexible MMF/SMF/Cu interfaces, will remain a viable alternative.
References and links
1. J. A. Kash, A. F. Benner, F. E. Doany, D. M. Kuchta, B. G. Lee, P. K. Pepeljugoski, L. Schares, C. L. Schow, and M. Taubenblatt, “Optical interconnects in future servers”, OFC (2011), paper QWQ1.
2. C. L. Schow, “Power efficient transceivers for high-bandwidth, short reach interconnects”, OFC (2012), paper OTh1E.4.
3. A. Ghiasi, F. Tang, and S. Bhoja, IEEE 802.3, 100GNGOPTX study group. http://www.ieee802.org/3/100GNGOPTX/public/mar12/plenary/ghiasi_02_0312_NG100GOPTX.pdf.
4. M. Watts, “Moore’s law, silicon photonics, and the remaining challenges”, HSD (2011).
5. S. Bhoja, A. Ghiasi, F. Chang, M. Dudek, S. Inano, and E. Tsumura, “Next-generation 10 Gbaud module based on emerging SFP+ with host-based EDC,” IEEE Communications Magazine 45(3), S32–S38 (2007). [CrossRef]
6. IEEE Std 802.3bj-2014 IEEE Standard for Ethernet Amendment 2: Physical Layer Specifications and Management Parameters for 100 Gb/s Operation Over Backplanes and Copper Cables.
7. J. Bulzacchelli, T. Beukema, D. Storaska, P. Hsieh, S. Rylov, D. Furrer, D. Gardellini, A. Prati, C. Menolfi, D. Hanson, J. Hertle, T. Morf, V. Sharma, R. Kelkar, H. Ainspan, W. Kelly, G. Ritter, J. Garlett, R. Callan, T. Toifl, and D. Friedman, “A 28 Gb/s 4-tap FFE/15-tap DFE serial link transceiver in 32 nm SOI CMOS technology”, Session 19.1, ISSCC (2012).
8. A. Ghiasi, “Is there a need for on-chip photonic integration for large data warehouse switches”, IEEE Photonics Group IV, (2012).
9. Optical Inter Networking Forum, http://www.oiforum.com/public/currentprojects.html.