RIFL: a reliable link layer network protocol for data center communication

Qianfeng Shen; Jun Zheng; Paul Chow

doi:10.1364/JOCN.443448

1. INTRODUCTION

Major data center services and applications such as remote direct memory access (RDMA), machine learning, and cloud storage demand the network interconnect to be low latency and lossless while preserving high bandwidth. Previous works, such as [1,2], demonstrate how the performance of applications in various fields can be drastically impacted by interconnect latency. It is important to realize that most of the technologies and concepts used in today’s data center networks (DCNs) existed before the large-scale data centers of today were even imagined. For example, IPs were first defined in 1974 [3], well before any massive data center was built. Today, with rapidly evolving technologies, it is time to explore new approaches for the DCN that are designed for the needs of today’s data center.

The conventional Transmission Control Protocol (TCP)/IP stack is designed to work reliably not only in a local area network (LAN), but also in a wide area network (WAN). The physical properties of a LAN and a WAN are significantly different. Both bandwidth-wise and latency-wise [4], TCP/IP and User Datagram Protocol (UDP)/IP carry too much redundancy when used in a LAN. Considering the diameter of a data center server room is rarely more than 100 m, a DCN is essentially a LAN. There should be a more efficient protocol stack that fulfills the exact needs of a DCN.

Nevertheless, protocols based on TCP/IP and UDP/IP [5,6] still dominate the data center market. One of the most important reasons for cloud providers to use these protocols is that hardware changes would be required to both the end devices and the network switches to deploy a new protocol in a data center. Traditionally, the network switches and the network interface controllers are all implemented using application-specific integrated circuits (ASICs). It would take years to design, fabricate, test, and deploy the ASICs for a new protocol.

Compatibility with the established infrastructure and the barrier to developing new ASICs makes it extremely difficult to introduce major changes. However, it is still interesting to know what opportunities exist that might influence DCN infrastructure over time. The basis of our work is to build an experimental platform that enables us to explore what might be possible if we could start over, i.e., how would we build the DCN infrastructure starting with what we know is feasible today and not be constrained by any legacy requirements, either technical or business? In this paper, we will show what we can do by leveraging the capabilities of modern field-programmable gate array (FPGAs).

Today, the number of high-speed transceivers is quickly increasing in modern FPGAs. Off-the-shelf FPGAs containing multiple Quad Small Form-factor Pluggable 28 (QSFP28) ports are already available on the market [7], showing that a flexible and economically efficient approach to redesigning DCNs starting from the very bottom layer of the protocol stack can be prototyped without needing new ASICs.

There are many network protocols apart from TCP/IP and UDP/IP. However, some of them [8,9] are dedicated to the link layer, providing limited scalability and flexibility. Some of them are based on the media-independent interface (MII) [10] or UDP [6], and you cannot remove the redundancy carried with the conventional network stack. Others such as InfiniBand [11] implement re-transmission in their transport layers. We will discuss its inefficiency in Section 2.

To meet the exact demands of a DCN, we propose a new protocol stack as follows:

Layer 1: Link Layer
This layer is implemented immediately next to the transceivers. It is a combination of the data link layer (layer 2) and the physical layer (layer 1) in the Open Systems Interconnection model. It should provide a line protocol with appropriate data packetization, channel bonding, and clock compensation. Re-transmission should also be a part of this layer to resolve link level data corruption. The benefits of implementing re-transmission at this layer is discussed in Section 2. Beyond this layer, there should be no data corruption caused by link noise.
Layer 2: Network Layer
This layer should provide a low latency routing scheme that avoids using a centralized routing table. Switch initiated congestion control mechanisms should also be implemented in this layer. Beyond this layer, all the data transfers should be lossless. Anything sitting above this layer does not have to worry about checksums, re-transmission, or congestion at all.
Layer 3: Application Layer
This layer consists of two parts: hardware and software. The hardware serves as an accelerator for common DCN applications and services, e.g., a near-memory computing engine to reduce the round trips for RDMA. The software abstracts the usage of the hardware and provides the software programmer an easy-to-use user interface.

With this protocol stack, we envision a lossless network can be built. In our prototype, at its layer 2 interface, this network can provide lossless links with less than 300 ns typical latency per hop with bandwidths beyond 100 Gbps.

This paper focuses on the link layer design named RIFL. The network layer and the application layer designs will be the subject of our future work. The rest of this paper is organized as follows: Section 2 discusses the physical properties of a DCN and how they can be leveraged to build a more efficient link layer protocol. In Section 3, we define the RIFL frames. Section 4 introduces the RIFL protocols. Section 5 presents the hardware implementation of RIFL. Section 6 provides performance results. Section 7 discusses related work, and Section 8 concludes this work.

2. LAYER 1 - THE LINK LAYER

The goal of our layer 1 is to provide a reliable link layer point-to-point protocol as a foundation for the higher layers. This layer should be low latency, high bandwidth, and use minimal hardware resources. Reliability here means correcting any bit errors that occur during transmission across the link. With a reliable link, the higher layers need not be concerned with any data integrity issues resulting from the physical transmission.

In this section, we cover the following topics: Development of our layer 1 first requires the selection of the mechanism for error detection and correction. After selecting re-transmission, we show that it can work within the constraints of a DCN. After justifying hop-by-hop link layer re-transmission, we show that an additional property can be introduced. Finally, we explain why we can solely rely on negative acknowledgments (NACKs) as the re-transmission notifications in DCNs, and why doing so is critical for the efficiency. Given these justifications we can then develop the circuit for our protocol implementation.

We start by imposing the first constraint:

A the distance between any two nodes within a DCN is less than 500 m.

A. Forward Error Correction versus Re-transmission

There are two major approaches to eliminate the effect of data corruption caused by bit errors: forward error correction (FEC) and re-transmission.

FEC is widely used in wireless and low-level wired communication. It requires the sender to send redundant data along with the payload. The redundant data, which is usually an error correction code (ECC), can be used to detect the errors in the payload as well as correct the errors.

Re-transmission requires redundant data as well. The redundant data are usually a checksum. However, the checksum is not used to correct the errors. Instead, it needs to carry only enough information to detect the errors in the payload. While sending data to the receiver, the sender keeps a copy of the most recent transmitted data. Once an error is detected, the receiver notifies the sender to resend the corrupted data.

While FEC detects and corrects the errors, the checksum only detects the errors. Consequently, for the same size of payload, the size of the ECC used by FEC is much larger than the size of the checksum used by re-transmission, which means the bandwidth overhead for FEC is much larger than re-transmission. Moreover, because FEC usually involves large matrix multiplications, the typical latency overhead for FEC is much larger as well. Therefore, FEC is more suitable in situations where re-transmission is impossible or very expensive, e.g., in one-way communications such as radio networks or simplex links, or in any bidirectional communication that operates on a very high bit error ratio (BER).

In current DCNs, 100G Ethernet is slowly becoming the dominant interconnect technology [12]. The commercially available QSFP28 cables used by 100G Ethernet can guarantee BERs better than ${10^{- 12}}$ without using FEC. Under such a low BER, re-transmission is much more efficient than FEC. However, as the next generation cable technologies pursue higher throughput per lane, their associated BER can be significantly higher than ${10^{- 12}}$. Thus, for better compatibility with future technologies, our BER constraint is

B the effective BER of the link that RIFL operates on must not exceed ${10^{- 7}}$.

We set the minimal BER requirement as ${10^{- 7}}$ because in our simulations, we found that in any link shorter than 500 m with a BER better than ${10^{- 7}}$, re-tranmission can be done efficiently. Plus, a minimal BER of ${10^{- 7}}$ means RIFL can work not only with the current popular cables, but also with any future physical links providing BERs better than ${10^{- 7}}$. For links whose BERs are worse than ${10^{- 7}}$, FEC must be incorporated to guarantee reliable transmissions. Otherwise, the bandwidth will be mainly occupied by re-transmissions instead of regular data transmissions. Nevertheless, even if FEC is used, RIFL still has advantages because it needs only a lightweight FEC code to improve BER to better than ${10^{- 7}}$ while other protocols, such as Ethernet, require much lower post-FEC BERs [13]. Even with FEC, they still cannot guarantee lossless transmissions.

To summarize, we choose re-transmission as the main error recovery method for RIFL. When BER is higher than ${10^{- 7}}$, FEC has to be applied to improve the BER so that constraint B can be satisfied.

B. Re-transmission Efficiency versus Round Trip Time

To guarantee a lossless link, the re-transmission mechanism should be designed for the worst case. Because any frame (the basic unit of data transmitted across the link; any data are transmitted along the link by the means of one or multiple frames) transmitted during the round trip time (RTT) may have errors, and the size of the re-transmission buffer, denoted as ${{S}_{{\rm retrans}}}$, must be larger than the size of the data transmitted during the largest RTT between the sender and the receiver, namely,

(1)$${{S}_{{\rm retrans}}}\geqslant {\lambda _{{\rm line}}} * {\rm RTT},$$

where ${\lambda _{{\rm line}}}$ denotes the line rate.

The larger the RTT is, the larger the re-transmission buffer needs to be. It is worth noting that when the line rate is larger than 100 Gbps and RTT exceeds 100 µm, it requires more than 1 Mb of re-transmission buffer. It is no longer suitable to use embedded memories such as static random-access memory as the buffer. Otherwise, the circuit area will be too large. This issue is encountered by some TCP implementations [14,15]. Their solution is to use double data rate (DDR) memory as an alternative. However, it further increases the RTT and complexity because the latency of a DDR memory is not constant and is sometimes more than 100 ns [16], whereas the latency of an embedded memory is much more stable and is usually a few nanoseconds.

Moreover, a shorter RTT also lowers the latency and bandwidth overhead introduced by re-transmission: a shorter RTT means quicker interaction between the sender and the receiver, and a shorter stalling time after a frame error is detected. Therefore, for optimal efficiency, re-transmission should be implemented in a protocol layer where the RTT is minimized.

The RTT consists of two parts: circuit delay (${{T}_{{\rm circuit}}}$) and cable delay (${{T}_{{\rm cable}}}$). Circuit delay is the time the circuit logic spends to process and forward the data, including the latency introduced by the transceivers (${{T}_{{\rm gt}}}$), upper layer protocols (${{T}_{{\rm proto}}}$), and buffer queues (${{ T}_{{\rm buffer}}}$). Cable delay is the time the data travel along the cable, determined by the speed of light and total link length. Assuming both directions of the link are symmetric, we have

(2)$${\rm RTT} = 2*\left({{T_{\rm{circuit}}} + {T_{\rm{cable}}}} \right),$$

(3)$${T_{\rm{circuit}}} = {T_{\rm{gt}}} + {T_{\rm{proto}}} + {T_{\rm{buffer}}},$$

(4)$${T_{\rm{cable}}} = \frac{{{L_{\rm{cable}}}}}{C},$$

where $C$ denotes the speed of light in the cable, and ${L_{\rm{cable}}}$ denotes the link length.

While the ${{T}_{{\rm cable}}}$ is a constant, as the link length will not grow or shrink over time, the ${{ T}_{{\rm circuit}}}$ can vary in a very wide range, depending on the protocol layer where the RTT is measured. If re-transmission is implemented within or above the network layer, where more than two nodes are involved and the data need to go across a switching node to be routed to the destination, then end-to-end RTT is used. Otherwise, if re-transmission is done hop-by-hop within the link layer, then hop-by-hop RTT is used.

Figure 1 shows the difference between end-to-end and hop-by-hop. For end-to-end, the worst case RTT can be hundreds or thousands of times larger than the typical RTT. When the network is congested, the ${{ T}_{{\rm buffer}}}$ can be unpredictably large. Furthermore, congestion can cause frame losses, frame losses lead to re-transmission, and re-transmission can intensify network congestion, causing a positive feedback. For hop-by-hop, because there is no congestion at this level, the RTT will be constant, and there will be no congestion-caused frame loss. Although end-to-end re-transmission is adopted by protocols such as TCP and InfiniBand, according to the above discussion, hop-by-hop is better for minimizing memory usage, latency, and bandwidth overhead because it achieves the minimal RTT.

Fig. 1. Hop-by-hop versus end-to-end.

Download Full Size | PDF

However, despite its significant advantages, re-transmission is seldom included in existing link layer protocols. One of the reasons, we believe, is related to the circuit area and complexity. The hardware implementation of a link layer protocol should not be heavy or power hungry. Specifically, a link layer protocol should not need megabytes of memory to function properly. In our case, assuming the line rate is 100 Gbps and the ${T_{\rm{Circuit}}}$ is 100 ns, according to Eqs. (1), (2), and (4) and constraint A, the ${{S}_{{\rm retrans}}}$ required is no larger than 45 KB. The size is comparable to a CPU L1 cache, making link layer re-transmission feasible.

In conclusion, in a DCN, re-transmission should be done hop-by-hop within the link layer.

C. Leveraging Hop-by-Hop Link Layer Re-transmission

Once hop-by-hop link layer re-transmission is chosen, a unique and vital property can be added to the constraint set, that is,

C in the hop-by-hop link layer transmission, the receiver can assume that frame ${N} + {1}$ will always arrive immediately after frame $N$ from the same sender.

Such an assumption is not true for end-to-end transmission protocols such as any Ethernet-based protocol, where frames from multiple senders can be routed to the same receiver. The receiver may receive frame ${N}$ and frame ${N} + {1}$ from different sources. The traffic can also stop at frame ${N}$ if none of the senders continues to send data to the receiver after frame ${N}$. However, for the link layer, a receiver is always paired to the same sender at the other end of the cable. If the user at the sender stops sending valid data after frame ${N}$, the link layer protocol can pack invalid/idle data into frames to create frame ${N} + {1}$ and subsequent frames. The invalid frames can be used by the protocol internally without being delivered to the user. This is an extremely useful property for hop-by-hop link layer re-transmission. We will discuss how it can be leveraged in the upcoming sections.

There is another equivalent expression of constraint C that is worth emphasizing; i.e., the receiver will never receive frame ${\textit{N}} + {1}$ before receiving frame ${\textit{N}} $ because in hop-by-hop link layer transmission, there is no buffer overflow caused by congestion. Starting from the sender logic, the data are handed over to the transceiver and then serialized, cross the cable, are de-serialized, and finally handed over to the receiver logic. There can be a few bits that are not sampled, causing the link to be out-of-sync. However, there is no way that a whole frame is lost during this process.

D. ACK versus NACK

ACK (acknowledgment) and NACK are the two possible acknowledgment mechanisms for re-transmission. For ACK, the receiver sends acknowledgments whenever it receives correct frames. For NACK, the receiver sends acknowledgments whenever it receives frames with bit errors.

In a DCN context, NACKs have significantly better efficiency over ACKs: let $p$ denote the frame error ratio [(FER): ratio of frames received with errors over total frames received] and ${N}$ denote the total number of frames to be transmitted during a certain period. For ACK, at least ${N}*({1 - p})$ acknowledgments need to be transmitted from the receiver to the sender; for NACK, at least ${N} * { p}$ NACKs are needed. In DCNs, as a result of constraint B, $p$ is much smaller than ${1 - p}$. Therefore, with NACKs, a much higher reverse channel bandwidth efficiency (ratio of the usable bandwidth to the line rate) can be achieved compared to ACKs.

Nevertheless, for end-to-end re-transmission, reliability cannot be guaranteed with only NACKs and no ACKs. Assume frame ${N}$ is the last frame to be transmitted from the sender to the receiver, and frame ${N}$ is dropped by an intermediate node (e.g., a switch). The receiver will never know that frame ${N}$ has been sent, and hence no NACK will be generated. Similarly, the sender will never know that frame ${N}$ is not received, and hence frame ${N}$ will not be re-transmitted. However, for hop-by-hop link layer re-transmission, with constraint C, it is feasible to use only NACKs to achieve reliability, because there are always frames being transmitted and none of them can be lost. They can only be corrupted. As a result, NACK is the acknowledgment mechanism we choose for RIFL.

E. Summary

In this section, we have now provided the basis for RIFL. We summarize the characteristics here before describing its implementation:

• data corruption is handled by re-transmission;
• the buffers required by re-transmission can be implemented entirely using embedded memories;
• link layer frames will always arrive in sequence;
• we will use NACKs to reduce bandwidth overhead introduced by acknowledgments.

3. DEFINING THE RIFL FRAMES

In Section 2, we justified that link layer hop-by-hop re-transmission is an efficient solution for eliminating bit errors in DCNs. However, the protocol itself and its microarchitecture will also significantly impact efficiency.

Without a concrete protocol, we are still far away from the final answer.

In this section, we will define the RIFL frames by answering the following questions:

1. Frame structure: what are the header fields in a RIFL frame?
2. Frame size: how large is a frame in RIFL?

A. High-Level Exploration of the Data Frame Structure

There is no universal definition of frame. In Section 2, we defined a frame as the basic unit of data transmitted across the link. At higher protocol layers, we use the term packet to denote a bundle of data, such as an IP packet. A packet will be transmitted as a number of RIFL link layer frames. To function properly, link layer frames carry not only the payload, but also other essential signals. For example, when re-transmission or flow control events occur, the corresponding control signals need to be exchanged between the sender and the receiver. There should be frames that carry such information. However, such events are assumed to occur much less frequently than regular data transmission. For bandwidth efficiency, there is no reason to include both control signals and payload in every frame.

We need to define different types of frames. By functionality, we divide the frames into data frames and control frames. Data frames are frames that carry the payload, and all other frames are control frames that help maintain state transitions. In a healthy link, most of the frames transmitted are data frames.

It is important to define the data frame structure well so that it serves the goal of making RIFL a low latency, high bandwidth, lightweight (small circuit area), and lossless link layer protocol. Section 2 showed that the circuit area is mainly impacted by the cable length and the microarchitecture of the protocol, and it is less relevant to the data frame structure. When defining the data frame structure, we should mainly study its impact on latency and bandwidth efficiency.

1. Header Fields

To make the bandwidth overhead small, only essential information should be included in the header of the data frames.

First, to be able to detect any errors, a checksum must be included in every data frame. Second, a data frame should carry a frame ID. Usually, there will be more than one data frame transmitted during an RTT, so the frame ID is used as the identifier to indicate which data frames should be re-transmitted when errors are detected. Third, for better granularity, a data frame should carry the information to indicate how many bytes in the payload are valid. Also, because any packet is divided into one or multiple data frames, there should be a marker in the data frame header to distinguish the end-of-packet data frames from other data frames, so that packet boundaries can be defined. Finally, for any link layer protocol, a line code should be adopted to re-align the data after deserialization. For 64b/66b encoding in Ethernet and Aurora [9] and 64b/67b encoding in Interlaken [8], the encoding is done independently from the protocol framing. Different from conventional protocols, in RIFL, to minimize the complexity and latency, the line code is integrated into every frame.

In summary, the data frame header should carry the following essential information: checksum, frame ID, count of valid bytes in the payload, end-of-packet marker, and line code header.

2. Data Frame Size

The first decision RIFL makes for the data frame size is to use a fixed frame size instead of a variable frame size. While a variable frame size is overall good for bandwidth efficiency, it is more complicated to implement, introduces longer latency, and requires a much larger buffer. Most importantly, a variable frame size introduces variable frame intervals (the difference of the arrival times between two adjacent frames), which can greatly increase the complexity of the re-transmission protocol. It is not worth sacrificing so much to save only 3% of the bandwidth. Thus, we study only the frame size impact of fixed size data frames. We start with exploring the impact of the data frame size on bandwidth efficiency.

The following equation can yield the bandwidth efficiency:

(5)$${\textit{Eff}_{\rm{bandwidth}}} = \left({1 - \frac{{{S_{\rm{Dheader}}}}}{{{S_{\rm{DFrame}}}}}} \right) \times {R_{\rm{DFrame}}},$$

where ${\textit{Eff}_{\rm{bandwidth}}}$ denotes bandwidth efficiency, ${{S}_{\rm{Dheader}}}$ denotes the size of the header in a data frame, ${{S}_{\rm{DFrame}}}$ denotes the data frame size, and ${R_{\rm{DFrame}}}$ denotes the fraction of data frames transmitted to all frames transmitted.

By constraint C, there are continuous frames transmitted, regardless of whether there are valid data to transmit. Let ${R_{\rm{NDFrame}}}$ denote the fraction of all non-data frames, and we get

(6)$${R_{\rm{DFrame}}} = 1 - {R_{\rm{NDFrame}}}.$$

Assuming when an error is detected, on average, there are ${{N}_{\rm{stall}}}$ subsequent non-data frames (including the re-transmitted data frames and control frames) transmitted, we get

(7)$${R_{\rm{NDFrame}}} = {N_{\rm{stall}}} \times {\rm FER}.$$

Combining Eqs. (5)–(7), we get

(8)$${\textit{Eff}_{\rm{bandwidth}}} = \left({1 - \frac{{{S_{\rm{Dheader}}}}}{{{S_{\rm{DFrame}}}}}} \right) \times \left({1 - {N_{\rm{stall}}} \times {\rm FER}} \right),$$

where

(9)$${\rm FER} = 1 - {\left({1 - {\rm BER}} \right)^{{S_{\rm{DFrame}}}}}.$$

According to Eq. (8), a higher bandwidth efficiency is achieved by reducing the ratio of ${{S}_{\rm{Dheader}}}$ to ${{S}_{\rm{DFrame}}}$ and minimizing ${{N}_{\rm{stall}}}$ and FER. Among the three factors, ${{N}_{\rm{stall}}}$ is mainly affected by the protocol design, while the others are mainly determined by ${{S}_{\rm{DFrame}}}$.

For a good re-transmission protocol, most of the frames transmitted should be data frames. For environments with a low BER, ${{R}_{\rm{NDFrame}}}$ will be much smaller compared to the ratio of ${{S}_{\rm{Dheader}}}$ to ${{S}_{\rm{DFrame}}}$. So, the ${\textit{Eff}_{\rm{bandwidth}}}$ will be mainly impacted by the ratio of ${{S}_{\rm{Dheader}}}$ to ${{S}_{\rm{DFrame}}}$. As ${{S}_{\rm{DFrame}}}$ increases, ${{S}_{\rm{Dheader}}}$ will also increase because some of the header fields, such as the checksum, need to be expanded for a larger ${{S}_{\rm{DFrame}}}$, but ${{S}_{\rm{Dheader}}}$ increases more slowly than ${{S}_{\rm{DFrame}}}$ increases. For example, among all the cyclic redundancy check (CRC) codes that feature a Hamming distance [17] (HD) of four (can detect at most three errors), 8-bit CRC codes can protect at most 119 bits of payload, while 16-bit CRC codes can protect at most 32,751 bits of payload [18,19]. Therefore, generally, the ratio of ${{S}_{\rm{Dheader}}}$ to ${{S}_{\rm{DFrame}}}$ decreases when ${{S}_{\rm{DFrame}}}$ increases. Nevertheless, this does not mean that ${{S}_{\rm{DFrame}}}$ can be infinitely large. For the same BER, the larger ${{S}_{\rm{DFrame}}}$ is, the larger the FER is. Even though by constraint B, the BER should be smaller than ${10^{- 7}}$, if ${{S}_{\rm{DFrame}}}$ is large enough, ${{R}_{\rm{NDFrame}}}$ can still impact ${\textit{Eff}_{\rm{bandwidth}}}$.

In addition, a larger ${{S}_{\rm{DFrame}}}$ also means a larger latency. During transmission, the receiver can verify the correctness of a data frame only after all the bits of the data frame are received. To guarantee a lossless transmission, before examining the entire data frame, not a single bit of the data frame can be delivered from the receiver. That is to say, the larger ${{S}_{\rm{DFrame}}}$ is, the larger the latency that will be introduced by checksum verification.

In summary, ${{S}_{\rm{DFrame}}}$ cannot be too small; otherwise, the bandwidth overhead of the header will be too large. On the other hand, ${{S}_{\rm{DFrame}}}$ cannot be too large either; otherwise, the bandwidth can also be reduced because of a high FER, and the latency will also be too large.

B. Data Frame

With the conclusion of Section 3.A, we define the following data frame header fields:

Table 1. Meta Code Encoding^a

View Table | View all tables in this article

Fig. 2. Data frame structure.

Download Full Size | PDF

1. Syncword

SYN is a 2-bit line code header. It is also used as a marker to mark whether a frame is a data frame or a control frame. Using the verilog constant notation, in data frames, SYNs are set to 2’b01; in control frames, SYNs are set to 2’b10. A SYN of 2’b00 or 2’b11 is illegal, indicating that data are not aligned.

2. Payload

User payload.

3. Meta Code

The meta code is used to indicate whether the payload is not valid, partially valid, or all bytes of the payload are valid. The end-of-packet marker is also encoded by the meta code. Table 1 shows the meta code encoding and the corresponding interpretation. With only two bits, the meta code cannot indicate how many bytes in the payload are valid. It can indicate only whether all bytes of the payload are valid. When not all bytes of the payload are valid, the last byte of the payload, which is certainly invalid as user data, becomes the format code.

4. Format Code

The format code is an 8-bit field. It is used to indicate how many bytes in the payload are valid when the meta code indicates that not all bytes of the payload are valid. By combing the meta code and the format code, the count of the valid bytes in the payload and the end-of-packet marker mentioned in Section 3.A can be represented with only a cost of two bits in the data frame header. Meanwhile, because the format code is limited to eight bits, it works only when the payload size is not larger than 2048 bits (256 bytes).

5. Verification Code

The verification code is the exclusive-or result of the checksum and the frame ID. It combines the functionalities of the checksum and the frame ID; i.e., the verification code is used to verify the correctness of the data frames as well as to locate the error frame when an error is detected. More details of usage of the verification code will be illustrated in the next section.

Figure 2 shows the data frame structure, where ${{S}_{\rm{DFrame}}}$ (we use bit as the size unit for the rest of the paper) denotes the data frame size, ${{S}_{\rm{payload}}}$ denotes the size of the payload, and ${{S}_{\rm{verification}}}$ denotes the size of the verification code. We use Xilinx FPGAs for prototyping, and the available transceivers offer 32-bit, 64-bit, and 128-bit interfaces [20–22]. To minimize the latency and complexity of data width conversion, ${{S}_{\rm{DFrame}}}$ should be a multiple of the interface width of the transceiver. In our prototype, we set ${{S}_{\rm{DFrame}}}$ to be a power of two, and no less than 128. According to Fig. 2, we get

(10)$$\begin{split}{{S_{\rm{DFrame}}}}&{= {S_{\rm{payload}}} + {S_{\rm{Dheader}}}}\\ &= {{S_{\rm{payload}}} + {S_{\rm{verification}}} + 4}\end{split}.$$

If we assume ${R_{\rm{NDFrame}}}$ is small, then ${R_{\rm{DFrame}}}$ is close to one. Combining Eqs. (5) and (10), we get

(11)$${\textit{Eff}_{\rm{bandwidth}}} = 1 - \frac{{{S_{\rm{verification}}} + 4}}{{{S_{\rm{DFrame}}}}}.$$

As discussed in Section 3.A, to minimize the latency and maximize bandwidth efficiency, both ${{S}_{\rm{DFrame}}}$ and ${{S}_{\rm{verification}}}$ need to be small. Because ${{S}_{\rm{DFrame}}}$ is set to be a power of two, and no less than 128, and the frame code can support only up to 2048 bits of payload, the ${{S}_{\rm{DFrame}}}$ options are limited to 128, 256, 512, 1024, and 2048.

Let ${{S}_{\rm{FrameID}}}$ denote the size of the frame ID field and ${{S}_{\rm{checksum}}}$ denote the size of the checksum. Because the verification code is the exclusive-or result of the checksum and the frame ID, we get

(12)$${S_{\rm{verification}}} = {\rm Max}\left({{S_{\rm{FrameID}}},{S_{\rm{checksum}}}} \right).$$

A valid tuple of (${{S}_{\rm{DFrame}}}$, ${{S}_{\rm{verification}}}$) should satisfy the following requirements:

1. The size of the frame ID should provide enough unique data frame IDs to cover all the data frames sent during an RTT.
2. For any BER that is better than ${10^{- 7}}$, the mean time before failure (MTBF) (in this paper, we define the MTBF as the time to make the system failure possibility equal to 1%) associated with the checksum should be at least longer than the lifetime of the circuit, say 100 years.

The first requirement can be quantitatively described as

(13)$$2^{S_{\rm {FrameID }}} \geqslant \frac{\lambda_{\rm{line }} \times {\rm RTT}}{S_{\rm{DFrame }}} .$$

The second requirement can be expressed by

(14)$${\left({1 - {\rm FFR}} \right)^{\frac{{{\lambda _{\rm{line}}} \times {\rm MTBF}}}{{{S_{\rm{DFrame}}}}}}} = 99\% ,$$

where FFR denotes the frame failure ratio, representing the ratio of the error frames that cannot be detected by verifying the checksum to the total number of frames transmitted. In RIFL, we use a CRC code as the checksum. For an m-bit CRC code that features an HD [17] of $n + 1$, it can detect all error frames that carry no more than $n$ error bits. If the number of error bits is more than $n$, one over ${2^m}$ of the error frames cannot be detected. Therefore,

(15)$${\rm FFR} = \frac{1}{{{2^m}}} \times \left({1 - \mathop \sum \limits_{i = 0}^n P\left(i \right)} \right),$$

where $P(i)$ denotes the possibility of a frame carrying exactly ${ i}$ bits of errors:

(16)$$P\left(i \right) = \left({\begin{array}{*{20}{c}}{{S_{\rm{DFrame}}}}\\i\end{array}} \right){{\rm BER}^i}{\left({1 - {\rm BER}} \right)^{{S_{\rm{DFrame}}} - i}}.$$

There are a wide range of CRC codes listed in [19]. Let the line rate be 100 Gbps, RTT be 500 ns, and the BER be ${10^{- 7}}$. Combining Eqs. (13)–(16), and the CRC codes in [19], the minimal ${{S}_{\rm{FrameID}}}$ and ${{S}_{\rm{checksum}}}$ for different ${{S}_{\rm{DFrame}}}$ can be found in Table 2.

Table 2. Minimal ${{S}_{\rm{FrameID}}}$ and ${{S}_{\rm{checksum}}}$ for Different ${{S}_{\rm{DFrame}}}$^a

View Table | View all tables in this article

Because the payload is input from the user interface, and following the convention that the data width of the user interface should be a power of two, there should be a data width conversion module to convert the user input to the payload. To minimize the latency and complexity of the conversion module, the payload should be byte-aligned:

(17)$${S_{\rm{payload}}} \equiv 0\,{\rm mod}\,8.$$

Because we have limited ${{S}_{\rm{DFrame}}}$ to a power of two and to be no less than 128, we get

(18)$${S_{\rm{DFrame}}} \equiv 0\,\,{\rm mod}\,\,8.$$

Combining Eqs. (10), (17), and (18), we get

(19)$${S_{\rm{verification}}} \equiv 4\,\,{\rm mod}\,\,8.$$

The minimal ${{S}_{\rm{verification}}}$ and the corresponding ${\textit{Eff}_{\rm{bandwidth}}}$ for various values of ${S_{\rm{DFrame}}}$ can be found in Table 3.

Table 3. ${{S}_{\rm{verification}}}$ versus ${{S}_{\rm{DFrame}}}$

View Table | View all tables in this article

Let 90% be the acceptance threshold of the bandwidth efficiency; then the available options for the data frame size are 256, 512, 1024, and 2048 bits, and ${S_{\rm{verification}}}$ should always be 12 bits. Because the ${S_{\rm{verification}}}$ should be 12 bits, we extend the CRC code to 12 bits for stronger protection. We choose not to extend the frame ID field, because a larger ${{S}_{\rm{FrameID}}}$ means larger ${S_{\rm{retrans}}}$, which leads to a larger circuit area.

In summary, we defined the data frame fields and the size of each field in this subsection.

C. Control Frame

As discussed in Section 3.A, there should be control frames in RIFL to help maintain state transitions. Because control frames are used much less than data frames, the size of the control frames does not have much effect on the protocol efficiency. Therefore, we do not need to further analyze the impact of the control frame size as we did for the data frame size. To minimize complexity, the control frame size is set to be equal to the data frame size.

Figure 3 shows the control frame structure, where ${{S}_{\rm{DFrame}}}$ denotes the data frame size, and ${{S}_{\rm{verification}}}$ denotes the size of the verification code.

Fig. 3. Control frame structure.

Download Full Size | PDF

The SYN and the verification code do the same thing in the control frames as they do in the data frames. The control codes are listed below.

1. Idle

This code indicates the sender is not in the normal data transfer state. This code is sent out when the sender is in the transition state between the pause state, re-transmit state, and normal state. Detailed explanations of each state will be introduced in the next section.

2. Pause Request

This code is sent by the receiver when the link is out-of-sync. It notifies the sender to pause from sending data.

3. Re-transmit Request

This code is sent by the receiver when a bad verification code is encountered. It tells the sender to switch from normal data transmission to the re-transmission procedure.

D. Summary

In this section, we analyzed the frame structure’s impact on the bandwidth efficiency and latency of the protocol. We defined the structures of data frames and control frames based on our analysis. It is worth noting that the frame sizes we chose are based on the interface data width of the transceivers we used for prototyping. For other types of transceivers that offer different interface data widths, the same analysis can be done again to determine the best frame size options.

4. DEFINING THE RIFL PROTOCOL

In this section, we will introduce how RIFL operates with the frames we defined in Section 3. By functionality, this section is divided as follows:

1. Transmit (TX) and receive (RX) protocols: how RIFL TX and RX sides operate.
2. Re-transmission: how re-transmission is done with the verification code we defined in Section 3.
3. Flow control and clock compensation: explanation of the flow control procedure and the clock compensation procedure.
4. Channel bonding: how RIFL aggregates multiple transceivers to achieve higher line rates.

A. TX Protocol

There are six states for the TX logic:

• Init. In this state, invalid data frames are generated with meta code 2’b00, and frame ID from zero to the max. (The max value depends on how many bits are used for the frame ID, e.g., if ${{S}_{\rm{FrameID}}}$ is 8 bits, then the max value is 255. The ${{S}_{\rm{FrameID}}}$ can be at most 12 bits because the ${S_{\rm{verification}}}$ is set to 12 bits.) The corresponding verification codes are also computed and inserted into each frame. These invalid data frames will fill the re-transmission buffer during initialization. Throughout this state, the TX logic sends out back-to-back pause request frames.
• Send pause. Transmitting falls into this state when the RX logic detects that the link is out-of-sync, or right after the TX logic finishes initialization. Throughout this state, the TX logic sends out back-to-back pause request frames.
• Pause. Transmitting falls into this state when pause request frames are received by the RX logic. Throughout this state, the TX logic sends back-to-back idle frames.
• Retrans. Transmitting falls into this state when re-transmit request frames are received by the RX logic or a re-transmission is resumed from an interruption caused by higher priority events. In this state, the TX logic can send three types of frames: re-transmitted data frames, idle frames, or re-transmitted request frames. More details will be elaborated in the upcoming re-transmission subsection.
• Send retrans. Transmitting falls into this state when an error is detected by the RX logic and there is no other higher priority condition. Throughout this state, the TX logic sends out back-to-back re-transmit request frames.
• Normal. The normal data transmission state. As discussed previously, the link should stay in this state most of the time if the BER is within the designed operation range (${10^{- 7}}$ in our case). In this state, a user is allowed to transmit data. When valid user data are input, the data are transformed into the payload of one or multiple data frames. When the user does not input valid data, invalid data frames with meta code 2’b00 are generated. In other words, in this state, the TX logic constantly sends out back-to-back data frames and copies them to the re-transmission buffer. Whenever user input is not valid, protocol-generated invalid data frames will be transmitted along the link to fill in the gaps.

Figure 4 shows the state transition diagram for the TX logic. Except for the init state, all other states follow the same transition logic.

Fig. 4. TX state transition diagram.

Download Full Size | PDF

B. RX Protocol

There are in total five special events in RIFL: out-of-sync, pause request, re-transmit request, frame error, and flow control. The reactions of the TX logic to the first four events are already described in Section 4.A. The flow control protocol will be introduced in Section 4.D. The RX logic is responsible for monitoring such events and generating event flags. Once an event is detected, the RX logic turns on the corresponding flag to notify the TX logic to make a proper reaction.

There is no state in the RX logic. All special events are monitored independently and concurrently. The priority order of these events is presented in Fig. 4. To prevent a frame that carries errors from being recognized as a control frame, eight consecutive pause requests or re-transmit requests need to be received by the RX logic to activate the pause or re-transmit control flag. The out-of-sync flag is turned on whenever an illegal syncword is received. The frame error flag is turned on whenever a data frame with an incorrect verification code is received.

C. Re-transmission

When both directions of the link are synchronized, the TX logic will switch between normal, retrans, and send retrans states. The re-transmission falls into three scenarios.

1. No Error for Either Direction

When there is no error for either direction of the link, both ends stay in the normal state. In this scenario, the SYN of every frame is always set to 2’b01 to represent a data frame. The meta code and the payload are generated based on different scenarios of user input. Every time a new meta code and payload are generated, the 12-bit CRC checksum will be calculated. The verification code is then yielded by performing exclusive-or between the frame ID and the checksum. After the TX logic sends out a data frame, the frame ID will increment by one. Each data frame sent out will also be copied to the re-transmission buffer. The re-transmission buffer is essentially a shift register: when a new entry is written, the oldest entry will be removed. Because ${S_{\rm{retrans}}}$ is set to equal $2_{\rm{FrameID}}^{S}$, each entry in the re-transmission buffer holds a frame with a unique frame ID. When a new frame is written into the buffer, the old frame to be removed has the same frame ID with the new frame.

2. Errors Are Detected in One of the Directions

When errors are detected in only one of the directions, the endpoint where the errors are detected enters the send retrans state, and the other end enters the retrans state. In the endpoint in the send retrans state, a frame error flag is raised, and its TX logic will send out back-to-back re-transmit request frames. In the endpoint in the retrans state, a re-transmit request flag will be raised after the most recent received eight control frames are all re-transmit requests. The TX logic will then perform the re-transmission procedure. Throughout the re-transmission procedure, the TX logic will send $2.5*2_{\rm{FrameID}}^{S}$ frames. The first $2*2_{\rm{FrameID}}^{S}$ frames are interleaved with idle frames and re-transmitted data frames. The last $0.5*2_{\rm{FrameID}}^{S}$ frames are idle frames. After the last frame of the re-transmission procedure is sent, if the re-transmit request flag is still raised, the TX logic will perform the re-transmission procedure all over again, until the re-transmit request flag is down.

3. Errors Are Detected in Both Directions

When errors are detected in both direction, both endpoints will enter the retrans state and start the re-transmission procedure. Different from the situation where only one direction detects the errors, for this scenario, the first $2*2_{\rm{FrameID}}^{S}$ frames will be re-transmitted data frames interleaved with idle frames or re-transmit requests frames. The last $0.5*2_{\rm{FrameID}}^{S}$ frames can also be either idle frames or re-transmit requests frames. Whether to send the re-transmit requests frames depends on whether the frame error flag is down or not.

By interleaving the idle/re-transmit request frames with the re-transmitted data frames in the first $2*2_{\rm{FrameID}}^{S}$ frames, even if there are errors in both direction, both endpoints can perform re-transmission while sending re-transmission notifications at the same time. In addition, when one of the endpoints stops sending the re-transmit request frames, it will take half of the RTT for the last re-transmit request frame to arrive at the other end, and only if the other end stops receiving the re-transmit request frames, it can put down the re-transmit request flag. To cover this delay, the last $0.5*2_{\rm{FrameID}}^{S}$ frames are designed to be the buffer frames.

Thus far, we have introduced the re-transmission procedure for the TX logic. On the RX side, there is also a procedure to verify whether a data frame should be delivered to the user and whether the frame error flag should be raised. Pseudo code of the verification procedure is shown in Listing 1.

Listing 1. RX Verification Procedure

View Table | View all tables in this article

As shown in Listing 1, the RX logic keeps its own frame ID counter (${\rm FrameID}$) and a threshold counter (${\rm Threshold_{\rm{FrameID}}}$). ${\rm FrameID}$ is initialized as zero, and ${\rm Threshold_{\rm{FrameID}}}$ is initialized as 16. When a frame is received, the RX side will first calculate the CRC checksum of the frame. The exclusive-or result of the checksum and ${\rm FrameID}$ will then be compared against the verification code in the frame. If the compared result is not equal, it implies the frame has errors and the verification failed. The frame error flag will be raised, and the frame will not be delivered to the user. If the compared result is equal, meaning the verification passed, the RX logic will then examine the syncword. If the syncword is 2’b10, meaning the frame is a control frame, the RX verification logic will not do anything. If the frame is a data frame that carries a syncword of 2’b01, ${\rm FrameID}$ will then be compared against ${\rm Threshold_{\rm{FrameID}}}$, and only if ${\rm FrameID}$ is equal to ${\rm Threshold_{\rm{FrameID}}}$ can the data frame be delivered to the user, and both ${\rm FrameID}$ and ${\rm Threshold_{\rm{FrameID}}}$ will then increment by one. If ${\rm FrameID}$ is not equal to ${\rm Threshold_{\rm{FrameID}}}$, then only ${\rm FrameID}$ will increment by one, and the frame will not be delivered to the user. In the case that the verification fails, ${\rm FrameID}$ will be rolled back to ${\rm Threshold_{\rm{FrameID}}}$ minus 16.

The RX verification procedure is designed to deal with a special sequence of errors that can cause a false-positive verification result without the verification procedure. Here is an example of the special sequence of errors: assume frame 68 has an error. A re-transmission is requested. Meanwhile, the subsequent frames, such as frame 69 and frame 70, are already on the fly. Because the 12-bit verification code is the exclusive-or result of the frame ID and the CRC checksum, if either frame 69 or frame 70 has an error, it can be misrecognized as a correct frame 68—there is only one bit difference between the binary representations of 69 and 70 from 68. Also, because the TX logic will start re-transmission whenever the re-transmit request flag is raised, the re-transmission will not start exactly from frame 68. Instead, it will start from a frame sent before frame 68. If some of the re-transmitted frames before frame 68 carry errors, they may also look like frame 68 for the same reason. Thus, when an error is detected in frame 68, the frame ID will be rolled back to 52. We require the RX logic to see a correct sequence from frame 52 to frame 67 before accepting frame 68. This means the RX logic must see a correct sequence of 16 12-bit verification codes. In this way, even a frame with white noise (${\rm BER} = {0.5}$) has a chance of only one over ${{(2^{12}})^{16}}$ to be misrecognized as frame 68. For BER better than ${10^{- 7}}$, the probability of a false-positive is even more negligible.

D. Flow Control

As we discussed in Section 1, congestion control should be done at the network layer. However, besides congestion control, flow control is still necessary—the receiver may not be able to receive data all the time, and a method for the receiver to notify the sender to stop transmitting data is needed. To provide flow control, a buffer is added between the RX logic and the user interface. A simple on–off flow control mechanism is adopted for low complexity. When the buffer queue length exceeds the on threshold (${\rm Thr_{\rm{ON}}}$), the TX logic of the receiver will send out a flow control pause frame. (The flow control pause frame is different from the pause request control frame.) When the buffer queue length drops below the off threshold (${\rm Thr_{\rm{OFF}}}$), the TX logic of the receiver will send out a flow control resume frame. The sender completely stops transmitting any data after receiving the flow control pause frame, and it resumes transmitting at the line rate after receiving the flow control resume frame.

The size of the flow control buffer (${S_{\rm{FC}}}$) must be carefully chosen to prevent any buffer overflow or starving during a flow control process—buffer overflow will cause frame losses, and buffer starving will cause bandwidth under-utilization. Because it takes half of the RTT for a flow control notification frame to arrive from the receiver to the sender, during this period, the flow control buffer must reserve enough space to receive the frames sent from the sender at the line rate, and hence,

(20)$$S_{\rm FC}-{\rm Thr}_{\rm ON} \geqslant \lambda_{\rm line} * \frac{\rm RTT}{2}.$$

During this period, the buffer must also be able to deliver frames to the user at the line rate, and then we get

(21)$${\rm Thr}_{\rm OFF} \geqslant \lambda_{\rm line} * \frac{\rm RTT}{2} .$$

Last, ${\rm Thr_{\rm{ON}}}$ and ${\rm Thr_{\rm{OFF}}}$ must not be too close. Otherwise, frequently switching between on and off will cause the flow control notification frames to occupy too much bandwidth of the reverse channel. For convenience, we set

(22)$${\rm Thr}_{\rm ON}-{\rm Thr}_{\rm OFF} \geqslant \lambda_{\rm {line }} * \frac{\rm RTT}{2}.$$

Combining Eqs. (20)–(22), we get

(23)$$S_{\rm FC} \geqslant \frac{3}{2} * \lambda_{\rm {line }} * {\rm RTT},$$

and we can set

(24)$${\rm Thr_{\rm{ON}}} = \frac{2}{3}*{S_{\rm{FC}}},$$

(25)$${\rm Thr_{\rm{OFF}}} = \frac{1}{3}*{S_{\rm{FC}}}.$$

After defining the flow control mechanism and the flow control buffer size, there is one remaining issue for flow control: bit error. Every frame, including flow control notification frames, can end up being corrupted during transmitting. If there is a bit error in the flow control pause frame, it can result in a buffer overflow and a frame loss. If there is a bit error in the flow control resume frame, the link may stop transmitting data forever. In our case, we extended the meta code encoding scheme and defined flow control notification frames as special data frames. Previously, when the meta code was 2’b00, it indicated the frame was an invalid data frame. Now, three types of frames share meta code 2’b00. Only if the last byte of the payload is 2’h00 it does it represent an invalid data frame. Otherwise, 2’h01 represents a flow control pause frame, and 2’h02 represents a flow control resume frame.

By defining the flow control notification frames as special data frames, the flow control notifications are guaranteed delivered to the sender. Even when there are bit errors, the flow control notifications will only be delayed, but not missing. During the delay time, regular data transmissions at both sides of the link will be completely stopped because of re-transmission. Hence, there will be no data loss because of the flow control pause notifications not taking effect on time.

E. Clock Compensation

Although both sides of the link should operate at the same nominal line rate, the actual frequencies of their clocks will not be exactly the same because of the crystal oscillator frequency deviation. The endpoint with the faster clock will send data slightly faster than the slower end can receive. This will eventually overflow the slower end’s receive buffer. With flow control, the issue can be resolved. However, it comes with a price of higher latency. Because flow control relies on the buffer queue length to slowly increase to ${\rm Thr_{\rm{ON}}}$ for a pause, the frames at the end of the queue will experience large latency. It will be ideal if the TX logic at the faster endpoint can proactively regulate its rate. Because clocks are embedded into the data streams for serial transmission between transceivers, and RIFL directly interfaces with the transceivers, we are able to compare the frequency of the recovered clock with the frequency of the local clock to determine whether and when the TX logic should pause for one cycle for clock compensation. Details of the clock compensation implementation will be introduced in Section 5.

F. Channel Bonding

So far, we have introduced the single-lane protocol of RIFL. It works when both ends of the link use only a single transceiver for transmission. Nevertheless, although transceiver technology evolves rapidly, transceivers that support above 100 Gbps line rate are still rare to see. To achieve a bandwidth of hundreds of gigabytes per second, channel bonding has to be done to aggregate the bandwidths of multiple transceivers. In RIFL, when multiple transceivers are used, every single pair of transceivers runs the single-lane protocol. The channel bonding logic is responsible for dispatching user data to each lane and aggregating the received data from each lane. To simplify the logic, we divide user data into segments, and the size of each segment is ${S_{\rm{Payload}}}$. At the TX side, the first segment goes to lane 1, the second goes to lane 2, and so on. The same applies to the RX side: the frame received from lane 1 is delivered first, followed by the frame received from lane 2, and so on. Because of the lane skew, lane 1 is not guaranteed to be the first lane to receive a frame. The flow control buffer at each lane is used to overcome the lane skew. Details of the channel bonding implementation will be introduced in Section 5.

G. Summary

In this section, we have defined the RIFL protocols. We first introduced how TX and RX logic operates in general. We then added more details of re-transmission, flow control, and clock compensation. Finally, we presented the channel bonding protocol.

More details on the implementation of the protocols are presented in Section 5.

5. IMPLEMENTATION

In this section, we present the FPGA implementation of RIFL that is open sourced at [23]. RIFL is fully parameterized. Implementation options such as frame size and transceiver line rate are exposed as synthesis parameters. For convenience, in this section, we demonstrate a four-lane implementation. In the implementation, each lane runs at 28 Gbps, and the frame size is set to 256 bits.

A. Top-Level Architecture

Figure 5 shows the top-level architecture. RIFL provides a pair of AXI4-stream [24] interfaces to the user. Both interfaces consist of TDATA, TVALID, TKEEP, TLAST, and TREADY fields. With these fields, each flit (data transmitted in a single clock cycle) of the user data stream carries all the essential information we discussed in Section 3.A. Adjacent to the user interfaces is the AXI4-stream data width conversion block. It converts the stream width from any power of two to a multiple of ${S_{\rm{Payload}}}$.

Fig. 5. RIFL top-level architecture.

Download Full Size | PDF

When more than one transceiver is used, the AXI4-stream data width converter will then be connected to the channel bonding module. In the TX path, the channel bonding module splits a single data stream into multiple data streams. In the RX path, it does the inverse. To provide more flexibility, two different channel bonding methods can be used in the channel bonding module: temporal channel bonding and spacial channel bonding. Temporal channel bonding splits a single data stream that runs at a higher frequency into multiple data streams that run at a lower frequency. After being split, the data width remains unchanged. Spacial channel bonding splits a single wider data stream into multiple narrower data streams, and it does not change the frequency. In the example shown in Fig. 5, both methods are used: the 512-bit AXI4-stream is first converted into a 480-bit AXI4-stream. Then, inside the channel bonding module, it is split into two 480-bit AXI4-streams running at half the original frequency. Finally, each of the 480-bit AXI4-streams is split into two 240-bit streams. With two channel bonding methods, more user interface data width options are provided. For a four lane implementation with a frame size of 256 bits, the data width can be 256 bits, 512 bits, or 1024 bits. When implementing RIFL on a low speed device such as a low end FPGA, wider interfaces with lower frequency can help timing closure, while on a high speed device, narrower interfaces are ideal for smaller circuit areas.

Fig. 6. RIFL single-lane architecture.

Download Full Size | PDF

If there is only a single lane, the channel bonding module will be omitted. The AXI4-stream data width converter will directly connect the single-lane logic. Details of the single-lane architecture will be presented in the next subsection.

B. Single-Lane Architecture

Figure 6 shows the single-lane architecture of RIFL. As shown in the figure, there are two clock domains: the RX domain is driven by the recovered clock generated by the transceiver, and the TX domain is driven by two local clocks—a high speed clock drives transceiver-facing logic and a low speed clock drives the rest of the protocol logic. The high and low speed clocks are derived from the same clock source. The frequency of the faster one is a power of two times the frequency of the slower one. Hence, the two TX clocks are synchronous to each other. In the example, the high speed clock runs at 437.5 MHz, and the lower speed clock runs at 109.4 MHz.

In the RX domain, the lane aligner converts the unaligned transceiver output stream to an aligned stream by locating the position of the syncword. The lane aligner is essentially a two-level cascaded multiplexer chain. After the lane aligner, the verification code validator is used to verify the correctness of the verification code. It is responsible for raising the frame error flag. The scrambler and the descrambler used in RIFL are implemented in linear-feedback shift registers (LFSRs). The standard 33-bit scrambler code $(1 + {x^{13}} + {x^{33}}$) is adopted for good DC balance and transition density [25]. After descrambling, the clock domain crossing (CDC) module filters out the non-data frames by checking the syncword. It then converts the filtered stream from the RX domain to the TX domain using a low latency asynchronous first-in, first-out (FIFO) buffer. The control event monitor and the flow control monitor are responsible for checking every frame and generating the pause request flag, the re-transmit flag, and the flow control on–off notifications.

In the TX domain, the modules that are closer to the transceiver are driven by the high speed clock. They are the scrambler and verification code generator. A pair of multi-gigabit transceiver data width converters is used to perform the conversion between the high speed narrow stream used by the transceiver and the low speed wide stream used by the rest of the protocol logic. The modules driven by the low speed clock are the TX controller, meta code encoding and decoding modules, and flow control buffer. The finite-state machine (FSM) in the TX controller implements the TX logic described in Section 5.A. The meta code encoding and decoding modules convert the AXI4-stream signals to meta code signals. The flow control buffer is a synchronous FIFO. It monitors its buffer queue length and issues flow control requests to the TX controller.

Finally, the clock compensation module takes the TX clock and the RX clock from the transceiver and a free-running clock as inputs. Each transceiver clock drives a gray code counter. Both counters are then brought to the free-running clock domain for comparison. If the counter of the TX clock increases faster than the counter of the RX clock, the difference of the counter values will be kept in a register. Whenever the difference increases, the clock compensation module will issue ${N}$ cycles of pause signals to the TX controller. ${N}$ is equal to the change of the difference between comparisons.

6. PERFORMANCE EVALUATION

We have validated the functional correctness of RIFL on both Intel and Xilinx devices for line rates from 25 Gbps to 200 Gbps. In this section, we present the performance results of RIFL that we obtained from Xilinx devices. We will first introduce our test setup. Then, we will compare the bandwidth efficiency, latency, and resource usage between RIFL and Xilinx’s Aurora [9], Interlaken [26], and 100G Ethernet (CMAC) [27] implementations. We will then provide RIFL’s performance result under various BERs to demonstrate RIFL’s reliability.

A. Experimental Setup

Our prototype is implemented on Fidus Sidewinder-100 (SW100) [28] boards. There are two QSFP28 ports on the board, connected to an XCZU19EG FPGA; 10 m active optical cable (AOC) and 3 m direct attach copper (DAC) cables are used for the QSFP28 connections. For the sake of simplicity, we present only the results for the AOC in this section.

Fig. 7. Performance test setups.

Download Full Size | PDF

A software-defined AXI4-stream traffic generator is built to generate the testing traffic. This traffic generator allows AXI4-stream traffic to be defined cycle by cycle in comma-separated values (CSV) format. The CSV file is then encoded into binary format and moved from an X86/ARM host to FPGA memory. The hardware driver of the traffic generator retrieves traffic data from the FPGA memory, performs decoding, and generates the traffic in a cycle-accurate manner according to the CSV definition.

A traffic validator is also built. It can cache the transmitted packets and compare them against loopback traffic to verify the correctness. It also internally time stamps each packet to monitor the bandwidth and latency.

Two different tests are designed for the performance comparison and reliability test. The setup shown in Fig. 7(a) is used for performance comparison between the RIFL implementations and Xilinx cores. The designs under test (DUTs) are placed in two FPGA boards to represent their general use case. The bandwidth efficiency and RTT are measured in the first board. The point-to-point latency is yielded by halving the RTT—assuming the latencies for both directions are the same. For fair comparison, all DUTs use four Xilinx GTY transceivers. The line rate of each transceiver is set to 25.78 Gbps.

The reliability test setup is shown in Fig. 7(b). In this test, the same BER is imposed in both directions. To make the error patterns of the two directions independent, their random seeds are set different. In this case, the point-to-point latency cannot be considered as half of the RTT anymore because the link is not symmetric. For example, in a round trip, errors may happen in one of the directions, causing the latencies of both directions to be unequal. Therefore, the point-to-point latency has to be directly measured. As a result, both RIFL cores are placed in the same FPGA. Traffic generators and traffic validators are connected to both RIFL cores. The bandwidth efficiency and average latency are computed by averaging the test results of both directions. The tail latencies are computed from the aggregated results of both directions. In the reliability test, each RIFL core uses four GTYs [21] running at 28 Gbps. The aggregated line rate is 112 Gbps, which is the maximum line rate a QSFP28 cable can support.

B. RIFL versus Aurora versus Interlaken versus CMAC

In this subsection, we compare the bandwidth efficiency, latency, and resource usage performance among RIFL, Aurora, Interlaken, and CMAC.

For bandwidth efficiency comparison, we test the bandwidth efficiency results for different user payload sizes. The payload sizes sweep from 1 byte to 8192 bytes, with a step of 1 byte. (CMAC starts at 64 bytes because its minimal accepted payload size is 64 bytes.) When the size of a payload is larger than the maximum frame size of the DUT (32 bytes for RIFL256, 64 bytes for RIFL512, Interlaken, and Aurora, 9600 bytes for CMAC), it is divided into multiple frames for transmission. For each payload size, traffic of 10 GB is sent. The traffic generator saturates the available bandwidth of the DUT by sending out a flit of traffic whenever the DUT can accept one.

Figures 8(a)–8(c) show the bandwidth efficiency comparison among RIFL, Aurora, Interlaken, and CMAC. In the figures, RIFL256 represents the RIFL implementation with a frame size of 256 bits, and RIFL 512 represents RIFL with a frame size of 512 bits. To preserve more details for small payload sizes, the results for payload sizes that are larger than 1500 bytes are not included in the figures. As the figures show, in terms of bandwidth efficiency, from the best to the worst, it is CMAC, RIFL512, RIFL256, and Interlaken. Unlike the zigzag curves of the other three cores, CMAC shows a much smoother curve. It is because for RIFL, Aurora, and Interlaken, if the payload size is not a multiple of the user interface data width, for the last flit of the packet, only a fraction of the user interface will receive valid data. After receiving the partial valid flit, the entire flit is fed into the pipeline, and the invalid bits are replaced with bubbles. Meanwhile, for CMAC, the data received from the user interface are first buffered and then reconstructed. The last flit of packet ${N}$ can be concatenated with the first flit of packet ${N} + {1}$ to eliminate the pipeline bubbles as much as possible. While buffering and reconstructing benefit the bandwidth efficiency, they come with a trade-off of the latency and complexity.

Fig. 8. Performance comparison among RIFL, Aurora, Interlaken, and CMAC.

Download Full Size | PDF

For latency comparison, the same traffic patterns are used. The same as the bandwidth comparison, the traffic generator saturates the available bandwidth of the DUT.

Figure 8(d) shows the point-to-point latency comparison result. From the best to the worst, it is RIFL256, RIFL512, Aurora, CMAC, and Interlaken. For CMAC, as previously mentioned, by buffering and reconstructing the user packets, the latency is increased. The latency for small packets varies significantly more than the large ones. For Aurora and Interlaken, without knowing their implementation details, we cannot infer what forms their latency. However, we are confident that it is our microarchitecture optimizations mentioned in previous sections that make RIFLs the lowest latency implementations.

Table 4 shows the resource usage comparison among three different implementations of RIFL and Aurora. In Table 4, ${\rm RIFL}({X},{Y})$ represents RIFL with a frame size of ${X}$ bits and a user interface width of ${Y}$ bits. Interlaken and Aurora are not included in the resource usage comparison because they are both hard cores, i.e., they are not implemented in FPGA soft logic.

Table 4. Resource Comparison

View Table | View all tables in this article

It can be learned from the table that RIFL uses more resources than Aurora. One of the main reasons is that RIFL adds the re-transmission buffer and flow control buffer for reliability. Another reason is that our FPGA prototype is not fully optimized for resource usage. For example, the data widths of block random access memory (BRAM) in the Sidewinder board are at most 64 bits, while the buffer data width in RIFL is equal to its frame size, being at least 256 bits. Although the capacity of a single BRAM is enough for the flow control buffer, we have to use multiple BRAMs for enough data width. Both reasons are related to the FPGA itself. If RIFL is hardened, the resource usage can be significantly reduced.

C. Reliability Test

In this subsection, we present the bandwidth ratio, latency, and MTBF result of RIFL256 under different BERs. The bandwidth ratio is the ratio of the bandwidth under the current BER to the bandwidth of an error-free link.

In the test, the size of the traffic is set to 10 GB. The traffic consists of mixed length packets. Payload sizes are randomly distributed from 1 byte to 8192 bytes. The BERs sweep from ${10^{- 12}}$ to ${10^{- 5}}$, with a step of ${10^{0.25}}$.

Fig. 9. Bandwidth and latency under different BERs.

Download Full Size | PDF

As shown in Figs. 9(a) and 9(b), the bandwidth and latency of RIFL do not degrade until the BER increases beyond about ${10^{- 7}}$. The bandwidth ratio starts to drop when the BER is $5.6 \times {10^{- 10}}$, and it drops to 96.3% when the BER is ${10^{- 7}}$. The results agree with the theoretical calculation result of Eq. (11).

The latency of RIFL starts to increase when the BER is worse than $1.7 \times {10^{- 6}}$. When the BER is better than ${10^{- 7}}$, the average latency and tail latencies remain within 107 ns. This also agrees with the theoretical calculation.

As we discussed in Section 4.C, during a re-transmisson, even a frame of white noise is impossible to be mis-detected as a correct frame. Therefore, for RIFL, Eq. (14) should be modified as

(26)$${(1 - {\rm FFR})^{\frac{{{\lambda _{\rm{actual}}} \times {\rm MTBF}}}{{{S_{\rm{DFrame}}}}}}} = 99\% ,$$

where ${\lambda _{\rm{actual}}}$ denotes the actual bandwidth. With the bandwidth result, MTBF can be calculated.

As shown in Table 5, when the BER is ${10^{- 7}}$, the MTBF is $1.88 \times {10^7}$ years. Therefore, it is safe to claim that RIFL is reliable for any BER better than ${10^{- 7}}$.

Table 5. MTBF versus BER

View Table | View all tables in this article

D. Cross-Vender Communication

We have successfully validated RIFL on a link between an Intel Agilex device and an Xilinx Vertex Ultrascale+ device.

E. Summary

In this section, we compare the latency and bandwidth efficiency result between two implementations of RIFL and three other link layer protocol implementations. We show that RIFL has the best latency and second best bandwidth efficiency, while it is the only protocol that ensures lossless transmission. We also show RIFL can keep good performance and long MTBF when the BER is better than ${10^{- 7}}$.

7. RELATED WORK

In this section, we describe the works that are most relevant to RIFL.

Ethernet [13] was introduced in the 1980s, and it is the most common protocol used in modern data centers [12]. In the three-layer model we introduced in Section 1, Ethernet includes not only layer 1 functionalities, but also some layer 2 functionalities, such as switching. Ethernet (layer 1) allows variable frame sizes from 72 bytes to 1530 bytes (some implementations allow jumbo frames larger than 9000 bytes, but it is not compatible with the IEEE 802.3 standard). A 32-bit CRC is included in every Ethernet frame, enabling error detection but not error correction. Any re-transmission protocol working on top of Ethernet has to be end-to-end, which means constraint C is not met any more. Moreover, the re-transmission buffer has to be large enough to handle a burst of the maximum-size frames. To summarize, a re-transmission protocol working on top of Ethernet would be more complex and less efficient than RIFL. Also, the experimental results in Section 6 show that RIFL performs better than CMAC, which is the Xilinx 100G Ethernet implementation [27].

Aurora [9] is a link layer protocol developed by Xilinx. It is made for point-to-point communication between FPGAs. There are two versions of Aurora, using two different line codes: 8b/10b for lower line rates and 64b/66b for higher line rates. The user payload is broken into multiple 8-byte frames called data blocks. The remaining bytes are transmitted using a special frame called the separator block. The separator block serves as an indicator of the end of a packet. A 32-bit CRC code is used in Aurora for error detection. Flow control directives are also provided.

Interlaken [8] was invented by Cisco Systems and Cortina Systems. It uses 64b/67b encoding for better DC balance. There are two methods of packetization for Interlaken: BurstMax and BurstShort. The user payload is first broken into multiple 64-byte blocks and then transmitted using the BurstMax method. The remaining bytes are transmitted using BurstShort. The size of BurstShort can be from 32 bytes to 56 bytes, with 8-byte increments. Both BurstMax and BurstShort are ended with an 8-byte block named the control word. A 24-bit CRC code is integrated into the control word. Interlaken also provides in-band and out-of-band flow control, as well as out-of-band re-transmission.

Correa et al. [10] created a protocol stack for FPGA-based high performance computing. Their layer 1 is based on the 10 Gb MII, limiting the throughput per lane to 10 Gbps. Their work is based on the assumption that link channels are error free, and hence reliability is not taken care of at all.

None of the related works described here can provide or implement the low latency, high bandwidth, and especially reliable protocol that we require for our layer 1 link layer protocol.

8. CONCLUSION

We have presented RIFL, a low latency and reliable link layer network protocol. Because of its novel in-band re-transmission protocol, RIFL is capable of providing lossless point-to-point links with ultra-low latency and high bandwidth. We implemented RIFL on Sidewinder boards and showed that, at the line rate of 112 Gbps, approximately 100 ns point-to-point latency is achieved. We have also demonstrated that RIFL is capable of correcting all data corruptions for standard point-to-point links.

With RIFL at the bottom layer, there is no need for the upper layer protocols to deal with any checksum. Therefore, the logic of the upper layer protocols can be simplified, and more resources can be used to deal with congestion control. This suggests that it is feasible to build a low latency, high bandwidth network for a data center environment based on RIFL. Our future work will address the network layer to enable congestion-free multi-hop communication.

Funding

Xilinx; Alibaba; Natural Sciences and Engineering Research Council of Canada.

Acknowledgment

This work was generously supported by Xilinx, Alibaba, and NSERC.

Disclosures

The authors declare no conflicts of interest.

REFERENCES

1. D. Sidler, Z. Wang, M. Chiosa, A. Kulkarni, and G. Alonso, “StRoM: smart remote memory,” in Proceedings of the Fifteenth European Conference on Computer Systems (2020), article 29.

2. Y. Zhu, H. Eran, D. Firestone, C. Guo, M. Lipshteyn, Y. Liron, J. Padhye, S. Raindel, M. H. Yahia, and M. Zhang, “Congestion control for large-scale RDMA deployments,” ACM SIGCOMM Comput. Commun. Rev.45, 523–536 (2015). [CrossRef]

3. R. Kahn and V. Cerf, “A protocol for packet network intercommunication,” IEEE Trans. Commun.22, 637–648 (1974). [CrossRef]

4. A. Dragojević, D. Narayanan, M. Castro, and O. Hodson, “FaRM: fast remote memory,” in 11th USENIX Symposium on Networked Systems Design and Implementation (NSDI) (2014), pp. 401–414.

5. M. Alizadeh, A. Greenberg, D. A. Maltz, J. Padhye, P. Patel, B. Prabhakar, S. Sengupta, and M. Sridharan, “Data center TCP (DCTCP),” SIGCOMM Comput. Commun. Rev.40, 63–74 (2010). [CrossRef]

6. A. Langley, A. Riddoch, A. Wilk, A. Vicente, C. Krasic, D. Zhang, F. Yang, F. Kouranov, I. Swett, and J. Iyengar, “The QUIC transport protocol: design and internet-scale deployment,” in Proceedings of the Conference of the ACM Special Interest Group on Data Communication (2017), pp. 183–196.

7. “ADM-PCIE-9H7,” 2021, https://www.alpha-data.com/alpha-data-announces-12x100g-network-accelerator-board-featuring-samtec-twinax-flyover-systems-and-xilinx-ultrascale-fpga.

8. “Interlaken protocol definition,” 2008, http://interlakenalliance.com/wp-content/uploads/2019/12/Interlaken_Protocol_Definition_v1.2.pdf.

9. “Aurora 64B/66B protocol specification,” 2014, https://www.xilinx.com/support/documentation/ip_documentation/aurora_64b66b_protocol_spec_sp011.pdf.

10. R. S. Correa and J. P. David, “Ultra-low latency communication channels for FPGA-based HPC cluster,” Integration63, 41–55 (2018). [CrossRef]

11. “InfiniBand architecture specification,” 2020, https://www.infinibandta.org.

12. “Data center interconnect system share,” 2021, https://www.top500.org/statistics/list/.

13. “IEEE standard for Ethernet,” IEEE Std 802.3-2018 (Revision of IEEE Std 802.3-2015) (2018), pp. 3492–4199.

14. M. Ruiz, D. Sidler, G. Sutter, G. Alonso, and S. López-Buedo, “Limago: an FPGA-based open-source 100 GBE TCP/IP stack,” in International Conference on Field Programmable Logic and Applications (FPL) (IEEE, 2019), pp. 286–292.

15. D. Sidler, G. Alonso, M. Blott, K. Karras, K. Vissers, and R. Carley, “Scalable 10Gbps TCP/IP stack architecture for reconfigurable hardware,” in IEEE 23rd Annual International Symposium on Field-Programmable Custom Computing Machines (IEEE, 2015), pp. 36–43.

16. Z. P. Wu, Y. Krish, and R. Pellizzoni, “Worst case analysis of dram latency in multi-requestor systems,” in IEEE 34th Real-Time Systems Symposium (IEEE, 2013), pp. 372–383.

17. R. W. Hamming, “Error detecting and error correcting codes,” Bell Syst. Tech. J.29, 147–160 (1950). [CrossRef]

18. P. Koopman and T. Chakravarty, “Cyclic redundancy code (CRC) polynomial selection for embedded networks,” in International Conference on Dependable Systems and Networks (IEEE, 2004), pp. 145–154.

19. P. Koopman, “CRC polynomial zoo,” 2019, https://users.ece.cmu.edu/~koopman/crc/hd3.html.

20. Xilinx, “UltraScale architecture GTH transceivers,” 2021, https://www.xilinx.com/support/documentation/user_guides/ug576-ultrascale-gth-transceivers.pdf.

21. Xilinx, “UltraScale architecture GTY transceivers,” 2017, https://www.xilinx.com/support/documentation/user_guides/ug578-ultrascale-gty-transceivers.pdf.

22. Xilinx, “Virtex UltraScale+ FPGAs GTM transceivers,” 2020, https://www.xilinx.com/support/documentation/user_guides/ug581-ultrascale-gtm-transceivers.pdf.

23. Q. Shen, “RIFL,” 2021, https://github.com/swift-link/RIFL.

24. ARM, “AMBA AXI and ACE protocol specification AXI3, AXI4, and AXI4-Lite ACE and ACE-Lite,” 2011, https://developer.arm.com/documentation/ihi0022/d/.

25. S. Pandey, “Scrambler options for multi-Gig PHYs,” 2018, http://grouper.ieee.org/groups/802/3/ch/public/nov18/Pandey_3ch_01_1118.pdf.

26. Xilinx, “Interlaken 150G v1.6 LogiCORE IP product guide,” 2017, https://www.xilinx.com/support/documentation/ip_documentation/interlaken_150g/v1_6/pg212-interlaken-150g.pdf.

27. Xilinx, “UltraScale+ devices integrated 100G Ethernet subsystem v2.4,” 2018, https://www.xilinx.com/support/documentation/ip_documentation/cmac_usplus/v2_4/pg203-cmac-usplus.pdf.

28. Fidus, “Sidewinder-100 datasheet,” 2018, https://fidus.com/wp-content/uploads/2019/01/Sidewinder_Data_Sheet.pdf.

Protocol	LUTs	Flip Flops	BRAM36Ks
RIFL (256,256)	15,308	15,935	16
RIFL (256,1024)	20,048	14,098	16
RIFL (512,512)	28,995	28,960	32
Aurora	10,192	9447	4

BER	MTBF (Year)
1.00 E-11	$1.81 E + 23$
1.00 E-09	$1.81 E + 15$
1.00 E-07	$1.88 E + 8$
1.00 E-05	6.33

Protocol	LUTs	Flip Flops	BRAM36Ks
RIFL (256,256)	15,308	15,935	16
RIFL (256,1024)	20,048	14,098	16
RIFL (512,512)	28,995	28,960	32
Aurora	10,192	9447	4

BER	MTBF (Year)
1.00 E-11	$1.81 E + 23$
1.00 E-09	$1.81 E + 15$
1.00 E-07	$1.88 E + 8$
1.00 E-05	6.33

$S_{D F r a m e}$	$S_{F r a m e I D}$	$S_{c h e c k s u m}$	HD	MTBF (Year)
128	9	8	4	$9.7 * 10^{4}$
256	8	9	4	$2.4 * 10^{4}$
512	7	10	4	$5.9 * 10^{3}$
1024	6	11	4	$1.5 * 10^{3}$
2048	5	12	4	$3.6 * 10^{2}$

Meta Code	Payload Valid	EOP	ABV
00	No	No	No
01	Yes	No	Yes
10	Yes	Yes	Yes
11	Yes	Yes	No

$S_{D F r a m e}$	$S_{v e r i f i c a t i o n}$	${Eff}_{b a n d w i d t h}$
128	12	87.5%
256	12	93.25%
512	12	96.87%
1024	12	98.43%
2048	12	99.22%

Meta Code	Payload Valid	EOP	ABV
00	No	No	No
01	Yes	No	Yes
10	Yes	Yes	Yes
11	Yes	Yes	No

$S_{D F r a m e}$	$S_{v e r i f i c a t i o n}$	${Eff}_{b a n d w i d t h}$
128	12	87.5%
256	12	93.25%
512	12	96.87%
1024	12	98.43%
2048	12	99.22%

Abstract

1. INTRODUCTION

2. LAYER 1 - THE LINK LAYER

A. Forward Error Correction versus Re-transmission

B. Re-transmission Efficiency versus Round Trip Time

C. Leveraging Hop-by-Hop Link Layer Re-transmission

D. ACK versus NACK

E. Summary

3. DEFINING THE RIFL FRAMES

A. High-Level Exploration of the Data Frame Structure

1. Header Fields

2. Data Frame Size

B. Data Frame

1. Syncword

2. Payload

3. Meta Code

4. Format Code

5. Verification Code

C. Control Frame

1. Idle

2. Pause Request

3. Re-transmit Request

D. Summary

4. DEFINING THE RIFL PROTOCOL

A. TX Protocol

B. RX Protocol

C. Re-transmission

1. No Error for Either Direction

2. Errors Are Detected in One of the Directions

3. Errors Are Detected in Both Directions

D. Flow Control

E. Clock Compensation

F. Channel Bonding

G. Summary

5. IMPLEMENTATION

A. Top-Level Architecture

B. Single-Lane Architecture

6. PERFORMANCE EVALUATION

A. Experimental Setup

B. RIFL versus Aurora versus Interlaken versus CMAC

C. Reliability Test

D. Cross-Vender Communication

E. Summary

7. RELATED WORK

8. CONCLUSION

Funding

Acknowledgment

Disclosures

REFERENCES

Cited By

Figures (9)

Tables (6)

Equations (26)

Journal of Optical Communications and Networking