
Silicon photonic neuromorphic accelerator using integrated coherent transmit-receive optical sub-assemblies

Open Access

Abstract

Neural networks, which have achieved breakthroughs in many applications, require extensive convolution and matrix-vector multiplication operations. To accelerate these operations, silicon photonic neural networks, which benefit from high power efficiency, low latency, large bandwidth, massive parallelism, and CMOS compatibility, have been proposed as a promising solution. In this study, we propose a scalable architecture based on a silicon photonic integrated circuit and optical frequency combs to offer high computing speed and power efficiency. A proof-of-concept silicon photonic neuromorphic accelerator based on integrated coherent transmit–receive optical sub-assemblies, operating at over 1 TOPS with only one computing cell, is experimentally demonstrated. We apply it to process fully connected and convolutional neural networks, achieving a competitive inference accuracy of up to 96.67% in handwritten digit recognition compared to its electronic counterpart. By leveraging optical frequency combs, the computing speed of the approach can potentially scale with the square of the cell number to reach over 1 Peta-Op/s. This scalability opens possibilities for applications such as autonomous vehicles, real-time video processing, and other high-performance computing tasks.

© 2024 Optica Publishing Group under the terms of the Optica Open Access Publishing Agreement

Corrections

22 April 2024: Typographical corrections were made to the author affiliations.

1. INTRODUCTION

Neural networks (NNs) have achieved remarkable breakthroughs in image recognition, language processing, decision-making, autonomous driving, and more in recent years [1–4]. In NN algorithms, convolutions and matrix-vector multiplications (MVMs) are the cornerstone and consume most of the computing resources [5,6]. The performance of these applications is ultimately determined by how efficiently the computing hardware processes convolutions and MVMs over large amounts of data, in terms of both speed and power. To improve hardware performance for artificial intelligence (AI) algorithms, many efforts have been made, ranging from computing materials and devices (e.g., phase-change materials (PCM) [7], memristors [8], and ferroelectric field-effect transistors [9,10]) and architectures (e.g., near-memory [11] and in-memory computing [12,13]) to computing mechanisms (e.g., TrueNorth [14] and BrainScaleS [15]) and software–hardware co-design [16–19]. Among them, optoelectronic and optical computing hardware are promising platforms.

Optical hardware has several advantages over its electronic counterparts, including low power consumption, high speed, large bandwidth, and massive parallelism [20,21]. Optical devices can process linear transformations at the speed of light and detect data at rates over 100 GHz with minimal power consumption [22–24]. Many inherent optical nonlinearities can directly process nonlinear operations in optical neural networks (ONNs) and offer diverse nonlinear transformations for algorithm design. Additionally, NNs implemented in optical hardware operate in the analog domain, reducing the energy and time consumed in computing and in moving data back and forth between the arithmetic logic unit (ALU) and storage, thereby avoiding the von Neumann bottleneck [25]. The optical domain offers an abundant bandwidth resource, allowing signals to be processed or calculated on multiple wavelengths simultaneously [26,27]. Moreover, the maturing silicon photonic process, which is complementary metal oxide semiconductor (CMOS) compatible, provides the potential for large-scale integration of optical and electronic materials and devices, reaching an ultrahigh computation density [28,29].

So far, many optoelectronic and optical NNs have been proposed. According to the NN structures, frequently researched optical NNs include optical reservoir NNs and optical feedforward NNs. Reservoir NNs utilize the dynamic behaviors and memory effects of physical systems to process sequential and temporal data [30], and many optical reservoir NNs have been developed [31–34]. Different from reservoir NNs, feedforward optical NNs mainly accelerate multiply–accumulate (MAC) operations as light propagates through the optical systems. These systems generally fall into three categories according to their working principles [21]. The first category is the cascaded Mach–Zehnder interferometer (MZI) grid. By tuning the phase shifters, the MZI grid transform matrix can be programmed, and an MZI-based ONN can complete arbitrary matrix-vector multiplications within 0.01 ns [35]. The second category utilizes wavelength division multiplexing (WDM) to realize MAC operations with incoherent light. A photonic accelerator can achieve 11 tera-operations per second (TOPS) by utilizing fiber chromatic dispersion to accumulate optical intensities on different wavelengths [36]. A PCM-based integrated photonic tensor core can also achieve TOPS-level speeds [37]. A silicon photonic-electronic NN can perform fiber nonlinearity compensation over a 10,080 km link, comparable to a software-based NN running on a 32-bit graphics processing unit (GPU) [38]. An optoelectronic neuromorphic accelerator with only one dual-polarization IQ modulator and one balanced photodetector (BPD) can also achieve over 500 GOPS and potentially several peta-operations per second (POPS) [39]. The third category is based on spatial diffraction, and such photonic neural networks can achieve several POPS [40,41]. However, these photonic NNs still face many challenges in integration, scalability, flexibility, and reconfiguration. In addition, the existing photonic devices, which are designed mainly for communication and interconnection, do not meet the computing requirements (e.g., quantization resolution and linearity range), which impacts the computing performance of the optical hardware. Therefore, clear guidance for the design and optimization of computing devices and circuits is significant.

Here, we demonstrate what we believe is a novel silicon photonic neuromorphic accelerator (SiPh-NA) based on optical frequency combs (OFCs), IQ modulators, and one-channel coherent receivers. The proposed architecture separates the to-be-calculated data into several groups. The data in each group are assigned to multiple orthogonal-frequency electronic signals, which modulate one optical frequency comb tooth, respectively. The modulated signal groups are then stitched with multiplexers (MUX) and detected by a one-channel coherent receiver to complete MAC operations. The required devices are off-the-shelf in the commercial optical communication domain; moreover, they are integrable and compatible with CMOS technology through the silicon photonic technique. OFCs increase the scalability of the accelerator, while adding data in the electronic domain, where the amplitude values and the orthogonal frequencies can be adjusted for optimal computing precision and speed, guarantees flexibility. The accelerator can be programmed to perform convolutions, MVMs, and neural networks. We experimentally evaluate the capacity of the SiPh-NA with one carrier (i.e., one computing cell), based on an integrated coherent transmit–receive optical sub-assembly (IC-TROSA), for convolutions and handwritten digit (0–9) recognition. The results demonstrate that this SiPh-NA can complete convolutions at a speed of 1.024 TOPS/cell and, in principle, several POPS with multiple OFC teeth. Furthermore, it achieves accuracies of 95.78%, 94.89%, and 96.67% when used as a one-layer fully connected NN, a one-layer convolutional NN, and the convolution layer of a multilayer feedforward NN, respectively, which is comparable to the accuracies obtained from the same NN structures running on a GPU. Based on the experimental observations, we also discuss the factors limiting the speed, precision, and power efficiency of the current SiPh-NA. Furthermore, to improve the performance, we propose possible solutions and optimization requirements for the accelerator and its silicon photonic computing devices.

2. PRINCIPLE OF THE ARCHITECTURE

The SiPh-NA is based on the convolution theorem, which states that convolution in one domain (e.g., the frequency domain) equals point-wise multiplication in the other domain (e.g., the time domain) [42]. The latter can be accomplished by a one-channel coherent receiver, which consists of a 180° hybrid and a BPD. For example, as shown in the computing cell of Fig. 1(c), when two optical signals $S = {A_1}\cos 2\pi ({f_c} + {f_A} + {f_0})t$ and $L = {B_1}\cos 2\pi ({f_c} + {f_B} + {f_0})t$ are transmitted into the signal and local paths of the coherent receiver, respectively, the value of the amplitude spectrum of the output current $I$ at $f = |{f_B} - {f_A}|$ is proportional to ${A_1}{B_1}$, where ${f_c}$ is the carrier frequency, ${f_A}$ and ${f_B}$ are the signal start frequencies, and ${f_0}$ is the frequency interval. Then, another two optical signals ${A_2}\cos 2\pi ({f_c} + {f_A} + 2{f_0})t$ and ${B_2}\cos 2\pi ({f_c} + {f_B} + 2{f_0})t$ are added to the signal and local paths, respectively, as shown in Fig. 1(c). In this scenario, the values of the $I$ amplitude spectrum at $f = |{f_B} - {f_A} - {f_0}|$, $|{f_B} - {f_A}|$, and $|{f_B} - {f_A} + {f_0}|$ are proportional to $|{A_2}{B_1}|$, $|{A_1}{B_1} + {A_2}{B_2}|$, and $|{A_1}{B_2}|$, respectively. Figure 1(c) presents an example of the SiPh-NA computing cell processing $[1,1] \otimes [1,1]$: the data, represented by the amplitudes of waveforms with orthogonal frequencies, modulate the light carrier via MZM-based IQ modulators, and the MAC operation is completed by the hybrid and BPD, yielding [1,2,1] at the corresponding frequencies of the BPD current. Generally, when

$$\begin{split}S &= \sum\limits_{i = 1}^N {A_i}\cos 2\pi ({f_c} + {f_A} + i{f_0})t,\\L & = \sum\limits_{j = 1}^M {B_j}\cos 2\pi ({f_c} + {f_B} + j{f_0})t,\end{split}$$
in the $I$ amplitude spectrum at ${f_k} = |{f_B} - {f_A} + k{f_0}|$, where $k = j - i$, the value is
$$\mathfrak{A}({f_k}) = \alpha \left| {\sum\limits_{i = \max (1,1 - k)}^{\min (N,M - k)} {A_i}{B_{i + k}}} \right|,$$
where $\alpha$ is a proportionality parameter determined by the system status (e.g., the photodetector responsivity and driver voltages). Therefore, to realize the convolution between the vectors $\textbf{A} = [{A_1},{A_2}, \cdots ,{A_N}]$ and $\textbf{B} = [{B_1},{B_2}, \cdots ,{B_M}]$, they are encoded into the amplitudes of a series of waveforms with orthogonal frequencies ${f_A} + [{f_0},2{f_0}, \cdots ,N{f_0}]$ and ${f_B} + [{f_0},2{f_0}, \cdots ,M{f_0}]$ in the signal and local paths, respectively, and modulated onto optical signals as in Eq. (1). The convolution results are the values of the $I$ amplitude spectrum at the corresponding frequencies as in Eq. (2). After optical or optoelectronic filter banks and PDs, the amplitudes can be directly obtained by peak detectors [43–45]. Within this process, when the symbol rate is set to $R$, the computing speed is
Fig. 1. Conceptual diagram of the SiPh-NA computing structure. (a) Silicon photonic integration. (b) Computing array in the SiPh-NA. (c) Computing cell in the SiPh-NA. PS: phase shifter; BPD: balanced photodetector; MZM: Mach–Zehnder modulator; and MUX/DEMUX: multiplexer and demultiplexer.

$$P = 2R \cdot NM\;\;{\rm OPS}.$$
Note that ${R_{{\max}}} = {f_0}$ guarantees that all orthogonal-frequency signals have integer numbers of periods in every symbol. Since the amplitude extraction can be processed with an analog implementation, the computing speed can be maintained. More importantly, the frequencies $|{f_B} - {f_A} + k{f_0}|$ should be distinct for different $k$ (i.e., for all combinations of $i$ and $j$), which can easily be satisfied by choosing ${f_A}$, ${f_B}$, and ${f_0}$ appropriately (e.g., ${f_B} \ge {f_A} + N{f_0}$). Otherwise, at $f = |{f_B} - {f_A} + {k_1}{f_0}| = |{f_B} - {f_A} + {k_2}{f_0}|$ in the amplitude spectrum, the value is $\alpha |\sum\nolimits_i {A_i}{B_{i + {k_1}}} + \sum\nolimits_i {A_i}{B_{i + {k_2}}}|$, which is meaningless for convolutions. According to Eq. (3), the larger $R$, $M$, and $N$ are, the larger the computing capacity. However, when ${f_0}$ is fixed (usually determined by the electrical circuit) and the maximum values of $N{f_0}$ and $M{f_0}$ are constrained by the bandwidth of the electrical circuit and the modulators, the maximum sizes of the vectors $\textbf{A}$ and $\textbf{B}$ are limited; thereby, the upper limit of the computing speed is determined.
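
The single-cell principle in Eqs. (1)–(3) can be illustrated with a short numerical sketch. The baseband model below is a simplification under idealized assumptions (unit responsivity, no noise, no carrier residue, and frequency values chosen only for illustration); it checks that the balanced-detection beat notes at $|f_B - f_A + kf_0|$ carry the sums $\sum_i A_i B_{i+k}$ of Eq. (2).

```python
import numpy as np

# Baseband sketch of one computing cell (idealized: alpha = 1, no noise).
f0, fA, fB = 0.5e9, 5e9, 18e9          # tone spacing and start frequencies (Hz), illustrative
fs = 128e9                              # simulation sampling rate (Hz)
t = np.arange(0, 1 / f0, 1 / fs)        # one symbol period, R = f0

A = np.array([1.0, 2.0, 3.0])           # signal-path vector
B = np.array([4.0, 5.0])                # local-path vector

S = sum(a * np.cos(2 * np.pi * (fA + (i + 1) * f0) * t) for i, a in enumerate(A))
L = sum(b * np.cos(2 * np.pi * (fB + (j + 1) * f0) * t) for j, b in enumerate(B))

# Balanced detection of the 0/180-degree hybrid outputs keeps only the S*L beat term.
I = 0.5 * (S + L) ** 2 - 0.5 * (S - L) ** 2          # = 2*S*L

amp = 2 * np.abs(np.fft.rfft(I)) / len(t)            # single-sided amplitude spectrum
freqs = np.fft.rfftfreq(len(t), 1 / fs)

for k in range(-(len(A) - 1), len(B)):               # beat-note index k = j - i
    fk = abs(fB - fA + k * f0)
    read = amp[np.argmin(np.abs(freqs - fk))]
    ref = abs(sum(A[i] * B[i + k] for i in range(len(A)) if 0 <= i + k < len(B)))
    print(f"k = {k:+d}: read {read:.3f} at {fk/1e9:.2f} GHz, expected {ref:.3f}")
```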

To improve the convolution speed by adding more ${A_i}{\rm s}$ and ${B_j}{\rm s}$, an OFC, which provides multiple equally spaced, phase-coherent optical carriers [46], is introduced to stitch multiple orthogonal-frequency signal groups, expanding the computing cell into the computing array shown in Fig. 1(b). To guarantee that the stitched signals satisfy the relationship above, the maximum frequency ${f_m}$ that the electronic circuit can offer should be no smaller than ${f_r}$, and ${f_0}$ should satisfy ${f_r} = N{f_0}$, where ${f_r}$ is the OFC repetition frequency and $N$ is the (integer) number of data values loaded per OFC tooth. For example, ${f_m} = {f_r}$ and ${f_0} = {f_r}/2$ mean that only two numbers can be added for every OFC tooth. To convolve ${A^\prime} = [{A_1},{A_2},{A_3},{A_4}]$ and ${B^\prime} = [{B_1},{B_2},{B_3},{B_4}]$, they are first separated into two groups each: $[{A_1},{A_2}],[{A_3},{A_4}]$ and $[{B_1},{B_2}],[{B_3},{B_4}]$. These groups are then encoded as the amplitudes of two orthogonal-frequency signals each: ${S_1} = {A_1}\cos 2\pi {f_0}t + {A_2}\cos 2\pi 2{f_0}t$, ${S_2} = {A_3}\cos 2\pi {f_0}t + {A_4}\cos 2\pi 2{f_0}t$ and ${L_1} = {B_1}\cos 2\pi {f_0}t + {B_2}\cos 2\pi 2{f_0}t$, ${L_2} = {B_3}\cos 2\pi {f_0}t + {B_4}\cos 2\pi 2{f_0}t$. Afterward, these waveforms are loaded onto the OFC teeth with frequencies $f_c^\prime + (k_A^s + 1){f_r}$, $f_c^\prime + (k_A^s + 2){f_r}$ and $f_c^\prime + (k_B^s + 1){f_r}$, $f_c^\prime + (k_B^s + 2){f_r}$ through IQ modulators, respectively, where $f_c^\prime$ is the OFC central frequency, and $k_A^s$ and $k_B^s$ are the start indices of the OFC teeth for ${A^\prime}$ and ${B^\prime}$. With the MUX, ${S^\prime} = \sum\nolimits_{i = 1}^4 {A_i}\cos 2\pi (f_c^\prime + (k_A^s + 1){f_r} + i{f_0})t$ and ${L^\prime} =\sum\nolimits_{j = 1}^4 {B_j}\cos 2\pi (f_c^\prime + (k_B^s + 1){f_r} + j{f_0})t$ are received by the one-channel receiver. Eventually, at the frequencies ${f_k} = |(k_B^s - k_A^s){f_r} + k{f_0}|$, $k = j - i$, the convolution result between ${A^\prime}$ and ${B^\prime}$ is obtained, the same as in Eq. (2). Figure 1(b) demonstrates an example of the convolution of two eight-value vectors, where four comb teeth are used; on each tooth, two data values of ${A^\prime}$ and ${B^\prime}$ are located in the local and signal paths, respectively. After the MUX, eight data values are combined in the frequency domain in the local and signal paths. Eventually, the BPD output contains the 15-value convolution result at the corresponding frequencies. A general and precise description is presented in Supplement 1, Section 1. Moreover, in the computing array, when $K$ OFC teeth are utilized and $N$ data values are appended in each modulator, the computing speed is

$$P = 2R \cdot {(NK)^2}\;\;{\rm OPS}.$$
Therefore, without requiring higher-performance electronic modules, the accelerator achieves a much higher computing speed, which increases with the square of the number of implemented OFC teeth $K$.
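
As a concrete illustration of the stitching rule ${f_r} = N{f_0}$, the sketch below builds the frequency plan for one vector: it assigns each value to an OFC tooth and a baseband subcarrier, and checks that the result forms the single contiguous grid $f_c^\prime + (k^s + 1){f_r} + i{f_0}$ assumed in the text. The repetition frequency and vector values are placeholders, not measured parameters.

```python
def frequency_plan(vec, fr, n_per_tooth, k_start, fc=0.0):
    """Map vec[i-1] (1-based i) to an OFC tooth and subcarrier so that the MUX
    output is sum_i vec[i-1] * cos(2*pi*(fc + (k_start + 1)*fr + i*f0)*t)."""
    f0 = fr / n_per_tooth                                # stitching condition f_r = N f_0
    plan = []
    for i, value in enumerate(vec, start=1):
        tooth = k_start + 1 + (i - 1) // n_per_tooth     # which tooth carries this value
        subcarrier = ((i - 1) % n_per_tooth + 1) * f0    # baseband tone on that tooth
        stitched = fc + (k_start + 1) * fr + i * f0      # absolute frequency after the MUX
        assert abs(stitched - (fc + tooth * fr + subcarrier)) < 1e-3
        plan.append((value, tooth, subcarrier, stitched))
    return plan

# Example from the text: A' = [A1, A2, A3, A4], two values per tooth (f0 = fr/2).
for row in frequency_plan([1.0, 2.0, 3.0, 4.0], fr=50e9, n_per_tooth=2, k_start=0):
    print(row)
```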

3. EXPERIMENT AND RESULT

In the SiPh-NA, the IQ modulators and coherent receivers are the core devices determining the power consumption and footprint. To reduce the costs, silicon photonic technology is introduced to manufacture the IC-TROSA, which consists of a silicon photonic integrated circuit (PIC), a driver chip, and two trans-impedance amplifier (TIA) chips [47]. The component utilizes a ball grid array (BGA) package, allowing it to be integrated with microprocessors using microelectronic-compatible packaging techniques. The footprint of the IC-TROSA is $13\;{\rm mm} \times 12\;{\rm mm}$, as presented in Fig. 2(a), providing the advantages of low cost and a small footprint. Inside the IC-TROSA, the PIC was fabricated on an 8-inch silicon-on-insulator (SOI) wafer using a 180 nm lithography process. All device blocks were monolithically integrated into a footprint as small as $5.8 \times 6.9\;{{\rm mm}^2}$ [Fig. 2(b)]. The chip has two input ports and one output port. Input port 1, ${{\rm in}_1}$ in Fig. 2(c), is for the source laser coupled into the integrated coherent transmitter (ICT) part or serves as the local-light receive port of the integrated coherent receiver (ICR); the other port, ${{\rm in}_2}$ in Fig. 2(c), is the signal-light receive port. The output port, ${{\rm out}_1}$ in Fig. 2(c), is for the modulated signals XI, XQ, YI, and YQ from the ICT. In the ICT part, the source laser is first equally distributed to two IQ Mach–Zehnder modulators (MZMs) for electro-optical conversion and data loading, after which the modulated optical signals are equalized through variable optical attenuators (VOAs). The VOA utilizes a PIN-doped structure fabricated on the optical waveguide; by applying a forward bias voltage to the PIN structure, the attenuation of the optical signal can be adjusted. The signals are then synthesized via a $\pi /2$ phase shifter to form a single-polarization coherent IQ signal. The two IQ signals are then polarization multiplexed through a polarization combining device, the Pol.MUX in Fig. 2(c), and output through the chip output port ${{\rm out}_1}$ in Fig. 2(c). In the ICR, the signal light is coupled into the chip via ${{\rm in}_2}$ and demultiplexed by a polarization splitter. Afterward, the signal light is incident onto a 90° hybrid together with the local laser from the other input port ${{\rm in}_1}$ to recover the modulated signals XI, XQ, YI, and YQ. Note that here we describe the basic functions of the IC-TROSA, which was originally designed for communication systems; in the next paragraph, we demonstrate how to use IC-TROSAs to build a SiPh-NA prototype. The 3 dB EO and OE bandwidths of the MZMs and PDs in the fully packaged IC-TROSA are around 45 GHz, and the approximately flat frequency response ranges from 0 to 40 GHz, as shown in Figs. 2(d) and 2(e), supporting the SiPh-NA in processing signals up to 40 GHz. Higher-bandwidth transmitters and receivers have been realized [48,49], which can improve the bandwidth utilization for a higher computing speed.

Fig. 2. SiPh IC-TROSA. (a) Highly compact SiPh IC-TROSA integrating a silicon coherent transceiver chip, a driver chip, and two TIA chips inside. (b) Microscope image of silicon coherent transceiver chip. (c) Silicon coherent transceiver structure. (d) Frequency response of the Tx EO S21 (driver + MZM). (e) Frequency response of the Rx OE S21 (PD + TIA).

Based on two IC-TROSAs, we built an experimental platform for the SiPh-NA with one computing cell, as shown in Fig. 3 (see Supplement 1, Section 2 for the system construction). An arbitrary waveform generator (AWG) generates the waveforms of $S$ and $L$. These waveforms then modulate a laser light via the ICT with drivers in IC-TROSA 1, producing two single-sideband $S$ and $L$ signals in the $X$ and $Y$ polarizations, respectively, as in Eq. (1). The powers of the signals are amplified by an erbium-doped fiber amplifier (EDFA) to compensate for the coupling loss and the insertion loss from the polarization multiplexer and demultiplexer, which can be removed in future work. Subsequently, $S$ and $L$ are separated by a polarization beam splitter (PBS) and collected by the two input ports of the ICR with a TIA in IC-TROSA 2, respectively. Note that, while the ICR is designed for coherent communication systems and includes eight output ports in two 90° hybrids connected to four BPDs with four TIAs, in our architecture only one BPD output is needed, using the 0° and 180° outputs of the hybrid. In future work, the ICR will be simplified to a one-channel integrated receiver. The receiver output current, sampled by a digital storage oscilloscope (DSO), includes the waveforms with the corresponding amplitudes and frequencies.

Fig. 3. Experimental diagram. AWG: arbitrary waveform generator; ICT: integrated coherent transmitter; EDFA: erbium-doped fiber amplifier; PC: polarization controller; PBS: polarization beam splitter; ICR: integrated coherent receiver; TIA: trans-impedance amplifier; DSO: digital storage oscilloscope; and $\tau$: symbol time width. The ICT and ICR are the photonic integrated chips, which are co-packaged with drivers and TIAs in the printed circuit boards (green solid boxes). After removing the polarization multiplexer from ICT, the function of the components in the gray dashed box can be realized in only one photonic integrated chip without EDFA, PC, and PBS.

It should be noted that in the experimental platform, fibers are utilized to connect the integrated devices, so the path lengths of $S$ and $L$ may not be equal, which introduces phase deviations into Eq. (1) and results in computing errors. The phase-deviation problem caused by the path difference may also exist in integrated chips. To solve this problem, we propose a path-difference calibration method: a digital twin of the system is set up in software, the real path difference is obtained by minimizing a cost function that assesses the output difference between the real hardware system and the software digital twin, and the path difference is pre-compensated in the modulation waveforms (see Supplement 1, Section 3). In our experiments, the path difference is estimated as 0.0163 m, which may come from the length difference of the fibers connecting the equipment and chips and of the cables connecting the AWG to the drivers. Once the system is built, the path difference is fixed; therefore, only one calibration is needed before running the system. Besides, the PC and PBS between the ICT and ICR modules are utilized to demultiplex the $S$ and $L$ combined by the polarization multiplexer in the ICT; they are not necessary and can be removed when the IQ modulators and the one-channel receiver are directly connected with waveguides and integrated into one chip. Furthermore, the AWG could be replaced by analog circuits that generate mixed orthogonal-frequency signals with the data vectors as inputs. While our experimental work involved extracting the amplitude spectrum using the fast Fourier transform in MATLAB, which incurs additional computational demands, we introduce an alternative approach through a photonic-assisted channelization scheme utilizing coherent dual OFCs [43–45]. A brief introduction can be found in Supplement 1, Section 4. The photonic-assisted scheme could maintain the data throughput and the computing speed.
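
The sketch below outlines the digital-twin idea behind the path-difference calibration, under simplifying assumptions: a single scalar path difference adds a frequency-proportional phase to the local-path tones, and the helper functions are hypothetical stand-ins for the hardware readout and the software model detailed in Supplement 1, Section 3; the actual cost function and optimizer used in the paper may differ.

```python
import numpy as np
from scipy.optimize import minimize_scalar

C = 3e8        # speed of light (m/s)
NEFF = 1.468   # assumed effective index of the fiber path

def twin_spectrum(A, B, fB, f0, dL):
    """Software model of the BPD amplitude spectrum for a trial path difference dL (m)."""
    out = {}
    for k in range(-(len(A) - 1), len(B)):               # beat-note index k = j - i
        acc = 0j
        for i in range(len(A)):
            j = i + k
            if 0 <= j < len(B):
                phase = 2 * np.pi * (fB + (j + 1) * f0) * NEFF * dL / C
                acc += A[i] * B[j] * np.exp(1j * phase)  # extra phase from the longer path
        out[k] = abs(acc)
    return out

def estimate_path_difference(A, B, fB, f0, measured, max_dL=0.1):
    """Fit dL by minimizing the mismatch between the measured and twin spectra."""
    def cost(dL):
        twin = twin_spectrum(A, B, fB, f0, dL)
        return sum((measured[k] - twin[k]) ** 2 for k in twin)
    return minimize_scalar(cost, bounds=(0.0, max_dL), method="bounded").x

# Usage: drive the cell with a known calibration pair (A, B), read `measured`
# (a dict keyed by k) from the DSO spectrum, then pre-compensate the waveforms
# with the fitted dL.
```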

To demonstrate the computational capacity of the SiPh-NA, we conduct two experiments: applying the SiPh-NA to realize convolutions and using it to process four different NNs.

A. SiPh-NA for Convolutions

To evaluate the effectiveness of the SiPh-NA for convolutions, we test convolutions between ${\textbf{1}_n}$ and ${\textbf{1}_n}$ with different $n$, where ${\textbf{1}_n} = [{1,1, \cdots ,1}] \in {{\cal R}^n}$. The start frequencies of $\textbf{A}$ and $\textbf{B}$ are ${f_A} = 5\;{\rm GHz}$ and ${f_B} = 18\;{\rm GHz}$, respectively, and the bandwidths that they occupy are constrained to 4 GHz to avoid overlap of the information frequency spectra caused by the incompletely compressed carrier and to stay within the available analog bandwidth of the AWG, modulators, and BPDs. The intervals between the orthogonal frequencies are shown in Table 1, and the symbol rate is $R = {f_0}$. The convolution speed is $2{n^2}{f_0}$. For example, when two vectors with 128 values each are convolved and the symbol rate is set to $R = {f_0} = 0.03125\;{\rm GHz}$, the resulting convolution speed is 1.024 TOPS. The theoretical convolution result between two ${\textbf{1}_n}{\rm s}$ is $[1,2, \cdots ,n, \cdots ,2,1]$. To assess the similarity between the theoretical results and the experimental outputs, the correlation coefficients are used and presented in Table 1 (${\rm Corrcoef.}{\textbf{1}_n}$), accompanied by the corresponding convolution speeds. The experimental output spectra, presented in Fig. 4, closely resemble the theoretical results and remain consistent even for input sizes larger than 128, with the convolution speed surpassing 1 TOPS.

Table 1. Performance of the SiPh-NA for Convolutions

Fig. 4. Output normalized amplitude spectrum of SiPh-NA processing ${\textbf{1}_n}$ with different $n$ convolutions.

More generally, we test the convolution performance of the SiPh-NA for randomly generated $n$-dimensional positive real-number vectors, conducting 100 trials for each value of $n$ and presenting the averaged correlation coefficients, the averaged error standard deviation, and the effective bit precision [50] in Table 1: ${\rm Avg.Corrcoef.}({\textbf{R}_n})$, ${\rm Avg.std.E}({\textbf{R}_n})$, and Eff.B. Examples of the experimental outputs compared to the theoretical results are shown in Fig. 5. As the input size $n$ increases, the correlation coefficient and effective bit precision decrease, and the points deviate further from the ideal line $y = x$. According to the data fitting, the error standard deviation scales with the square root of the vector dimension $n$. Nonetheless, the experimental outputs remain similar to the theoretical ones even with $n = 128$, achieving a speed of 1.024 TOPS. These experiments demonstrate that the SiPh-NA provides high-speed convolution and is able to accelerate convolutions in various applications, such as image and signal processing.
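
For reference, the similarity metrics in Table 1 can be reproduced offline from the measured and theoretical outputs as in the sketch below. The effective-bit estimate shown treats the residual error as uniform quantization noise over the output full scale, which is one common convention; the exact definition used in Ref. [50] may differ.

```python
import numpy as np

def convolution_metrics(measured, theoretical):
    """Correlation coefficient, error standard deviation, and an effective-bit estimate."""
    m = np.asarray(measured, dtype=float)
    t = np.asarray(theoretical, dtype=float)
    corr = np.corrcoef(m, t)[0, 1]                        # Corrcoef. in Table 1
    std = (m - t).std()                                   # std. of the computing error
    full_scale = t.max() - t.min()
    eff_bits = np.log2(full_scale / (std * np.sqrt(12)))  # quantization-noise view (assumed)
    return corr, std, eff_bits
```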

Fig. 5. One trial of the measured results of SiPh-NA processing ${\textbf{R}_n}$ with different $n$ convolutions.

Furthermore, since the system exploits IQ modulators and a coherent receiver, the proposed approach is able to process convolutions between two vectors of complex (including negative real) values. The moduli and arguments of the complex numbers are first encoded into the amplitudes and phases, respectively, of a series of waveforms with orthogonal frequencies. The encoded waveforms then modulate the optical carrier via the IQ modulators and are received by the coherent receiver with the same working principle as demonstrated above. The amplitudes and phases of the output waveforms at the corresponding frequencies are the complex convolution results. This extension will be addressed in future work.

B. SiPh-NA for NNs

To evaluate the effectiveness of the SiPh-NA for neural networks, we test two one-layer fully connected neural networks, $F{C_{10 \times 32}}$ and $F{C_{10 \times 64}}$, a one-layer convolutional neural network, ${{\rm Conv}_{1 \times 424}}$, and a feedforward neural network with a convolution layer followed by a fully connected layer, $C{F_{3{,}16}}$, to recognize handwritten digits with $8 \times 8$ pixels from the sklearn library [51]. The dataset is split into training and test sets with a ratio of 4:1, and $F{C_{10 \times 32}}$, $F{C_{10 \times 64}}$, ${{\rm Conv}_{1 \times 424}}$, and $C{F_{3{,}16}}$ are first trained offline. In $F{C_{10 \times 32}}$, the $8 \times 8$ pixel images are downsampled to $4 \times 8$. The weight matrix sizes in $F{C_{10 \times 32}}$ and $F{C_{10 \times 64}}$ are (10,32) and (10,64), respectively. In ${{\rm Conv}_{1 \times 424}}$, unlike general convolutional neural networks (CNNs), where the kernel size is smaller than the image size and at least one fully connected layer follows the convolution layers, the images are flattened into a (1,64) vector and convolved with a (1,424) kernel vector with a stride of 40. The $C{F_{3{,}16}}$ is a conventional CNN with two feedforward layers: the first is a convolution layer with 16 kernels of size [1,3], and the second is a fully connected layer with a (512,10) weight matrix following a max-pooling function with a stride of 2. The training is based on PyTorch [52] with cross-entropy as the cost function and stochastic gradient descent (SGD) as the optimizer. The cross-entropy includes the nonlinear softmax function, which benefits the training phase by amplifying the differences between the NN outputs for multiclass classification [53]. After the offline training, the trained weight matrices/kernels and the test images are mapped onto the SiPh-NA for online inference.
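
A minimal PyTorch sketch of the $C{F_{3{,}16}}$ offline training setup is given below. The 'same' padding on the convolution is our assumption (it makes the $16 \times 64$ feature map become 512 values after max pooling, matching the (512,10) fully connected layer), the absolute value mirrors the amplitude readout of the hardware, and the learning rate and batch are illustrative.

```python
import torch
import torch.nn as nn

class CF316(nn.Module):
    """Sketch of CF_{3,16}: Conv1d with 16 kernels of size 3, max pooling (stride 2),
    then a (512, 10) fully connected layer, for flattened 8x8 digit images."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv1d(1, 16, kernel_size=3, padding=1, bias=False)  # padding assumed
        self.pool = nn.MaxPool1d(kernel_size=2, stride=2)
        self.fc = nn.Linear(512, 10)

    def forward(self, x):                 # x: (batch, 1, 64) flattened images
        h = torch.abs(self.conv(x))       # |.| matches the amplitude-readout nonlinearity
        h = self.pool(h).flatten(1)       # (batch, 16 * 32) = (batch, 512)
        return self.fc(h)

model = CF316()
criterion = nn.CrossEntropyLoss()         # softmax included, as in the text
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# One illustrative training step on a random batch (real data: sklearn digits).
x, y = torch.rand(8, 1, 64), torch.randint(0, 10, (8,))
loss = criterion(model(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```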

Fig. 6. Handwritten digit image classification demonstration. (a) Demonstration for $F{C_{10 \times 64}}$ and ${{\rm Conv}_{1 \times 424}}$. (b) Demonstration for convolution layer in $C{F_{3{,}16}}$.

To process the fully connected neural network $F{C_{10 \times 64}}$ in the SiPh-NA, the $8 \times 8$ pixel image and the $10 \times 64$ matrix are flattened into a 64-length vector and a 640-length vector, respectively, as shown in Fig. 6(a). The numbers of the 64-length image vector then encode the amplitudes of a series of orthogonal-frequency waveforms whose frequencies start from 7.00 GHz with an interval of 0.01 GHz. Meanwhile, the numbers of the 640-length vector encode another series of waveforms with frequencies from 16.00 to 22.40 GHz. Afterward, the waveforms modulate two optical carriers split from one source. Eventually, the convolution results between the two vectors are the amplitudes from 8.37 to 15.39 GHz with an interval of 0.01 GHz, and the MVM results are those at the frequencies [14.76, 14.12, …, 9.00] GHz. Details can be found in Supplement 1, Section 5. Similarly, to complete the multiplication in $F{C_{10 \times 32}}$ for the flattened downsampled images in the SiPh-NA, the start frequencies of the series of frequency-orthogonal waveforms for the 32-length image vector and the weight matrix flattened into a $320\;(10 \times 32)$-length vector are 7.00 GHz and 16.00 GHz, respectively, and the frequency intervals are 0.02 GHz for both. In the output, the amplitudes at the same frequencies as for $F{C_{10 \times 64}}$ are the MVM results for $F{C_{10 \times 32}}$.
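
The mapping from the $F{C_{10 \times 64}}$ MVM to the correlation readout can be checked numerically as in the sketch below (an ideal, noiseless model with random stand-in data; the weight matrix is assumed to be flattened row by row). Each weight row aligns with the image at a lag that is a multiple of 64, i.e., at the beat frequencies 9.00, 9.64, ..., 14.76 GHz.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.random((10, 64))                  # stand-in for the trained (10, 64) weight matrix
x = rng.random(64)                        # stand-in for a flattened 8x8 image
w = W.reshape(-1)                         # 640-length vector on the weight path

fA, fB, f0 = 7.00, 16.00, 0.01            # GHz: image start, weight start, tone interval

for r in range(10):                       # row r of W aligns with the image at lag k = 64*r
    k = 64 * r
    lag_value = sum(x[i] * w[i + k] for i in range(64))
    freq = (fB - fA) + k * f0             # beat frequency where this row's result appears
    assert np.isclose(lag_value, W[r] @ x)
    print(f"row {r}: |amplitude| at {freq:.2f} GHz -> {lag_value:.3f}")
```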

For ${{\rm Conv}_{1 \times 424}}$, as in Fig. 6(a), the numbers of the flattened images and the $1 \times 424$ kernel vector first encode the amplitudes of two series of orthogonal-frequency waveforms. In this scenario, the start frequencies for the image and kernel vectors are 5 GHz and 12 GHz, respectively, and the frequency interval in both series is 0.01 GHz. In the output current, the amplitudes at the frequencies $[12.6{,}12.2, \cdots ,9]\;{\rm GHz}$ are the required 10 outputs of ${{\rm Conv}_{1 \times 424}}$. For $C{F_{3{,}16}}$, as in Fig. 6(b), the convolution layer is processed in the experimental system. The numbers of the flattened images encode a series of orthogonal-frequency waveforms whose start frequency is 5 GHz with a frequency interval of 0.0625 GHz. The convolutions between the image and the 16 kernels are processed one after another; in each, the corresponding kernel values encode a series of orthogonal-frequency waveforms with the same start frequency of 12 GHz and the same frequency interval of 0.0625 GHz. The second, fully connected layer is processed offline in the digital computer.

Fig. 7. Confusion matrices (%) of recognizing the handwritten digits. (a) Confusion matrix of $F{C_{10 \times 64}}$. (b) Confusion matrix of $F{C_{10 \times 32}}$. (c) Confusion matrix of ${{\rm Conv}_{1 \times 424}}$. (d) Confusion matrix of $C{F_{3{,}16}}$.

Note that the amplitudes at the corresponding output frequencies, taken without the phases, represent the absolute values of the convolution results. Taking the absolute value is the nonlinear operation implemented in our experiments and in the implemented NNs. No additional nonlinear operations are needed in the inference phase because the softmax function in the cross-entropy does not change the position of the maximum value, which determines the predicted class. Besides, our structure offers the potential to apply trigonometric nonlinear operations to the operands during modulation, before MVMs and convolutions, while other nonlinear operations (e.g., ReLU) require extra electronic circuits, which is still a challenge for our structure.
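
The claim that no extra nonlinearity is needed at inference time follows from the monotonicity of softmax, as the tiny check below illustrates (the scores are arbitrary stand-ins for the |amplitude| readouts of the ten classes).

```python
import numpy as np

scores = np.array([0.3, 2.1, 0.4, 1.7])            # |amplitude| readouts for four classes
softmax = np.exp(scores) / np.exp(scores).sum()    # softmax is applied during training only
assert scores.argmax() == softmax.argmax()         # the predicted class is unchanged
```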

The confusion matrices for $F{C_{10 \times 64}}$, $F{C_{10 \times 32}}$, ${{\rm Conv}_{1 \times 424}}$, and $C{F_{3{,}16}}$ are presented in Fig. 7. The online inference of the SiPh-NA achieves 89.11% ($F{C_{10 \times 32}}$), 95.78% ($F{C_{10 \times 64}}$), 94.89% (${{\rm Conv}_{1 \times 424}}$), and 96.67% ($C{F_{3{,}16}}$), which is comparable to the theoretical accuracies of 90.67%, 96.67%, 95.33%, and 97.11% obtained on the electronic computer.

C. Performance Analysis

Table 2 shows performance comparisons of state-of-the-art optical computing frameworks and the NVIDIA A100 GPU [57]. For a fair and effective comparison, the data and the conditions under which they were obtained are presented; most of them were measured experimentally or claimed in the original works. When calculating the power and energy efficiency, the considered devices are listed in the power column. Our SiPh-NA achieves one of the largest MVM dimensions and provides a competitive inference accuracy. For the peak computing speeds, the SiPh-NA has realized up to 1.024 TOPS with only one computing cell and can achieve over 16 TOPS and 156 TOPS with a computing array consisting of four cells and with 10 four-cell arrays, respectively. The computing speed can thus scale linearly or quadratically with the cell number. It has to be acknowledged that the SiPh-NA, as analog computing hardware, has limited bit precision compared to electronic digital hardware (e.g., GPUs). Owing to NN robustness, the application accuracies can be maintained by methods such as quantization-aware training [16,58], adaptive training [59], and statistical training [18]. Note that photonic computing solutions based on diffraction (e.g., Accel [41]) can achieve several POPS, where most computing operations come from spatial dot products; however, they still face significant challenges in collimation, modulation, phase-mask fabrication, and flexibility. The energy efficiency of the proposed SiPh-NA is calculated as 5.485 pJ/MAC; more details can be found in Supplement 1, Section 6. The energy efficiency is competitive even with several electronic devices counted, which cannot be removed in the current photonic computing framework because optoelectronic conversion is necessary and all-optical control and storage are lacking.

Table 2. Comparison of the State-of-the-Art Computing Frameworks and SiPh-NAa

4. DISCUSSION

Although the SiPh-NA can accelerate convolutions and MVMs for NNs, there are still challenges in the computing speed and precision. Next, we will discuss the factors limiting the SiPh-NA performance in the current silicon photonic modules and the potential solutions for performance improvements.

In the experiments, the SiPh-NA is built with only one computing cell and achieves 1.024 TOPS. When $K$ OFC teeth are implemented to expand the computing cell to a computing array while the other hardware settings are maintained, the computing speed can potentially reach $1.024{K^2}\;{\rm TOPS}$, increasing with the square of the OFC teeth number. Limited by the OFC tooth power, the BPD sensitivity, and the AD/DA quantization precision, there could be a trade-off between the system's scalability, computing speed, and precision [60,61]. It has to be acknowledged that silicon modulators may encounter challenges such as high insertion losses and dynamic penalties, which can hinder the architecture's scalability; to achieve large scalability, lithium niobate modulators are a promising solution. A comprehensive formulation relating the computing speed to the related elements is presented in Supplement 1, Section 7, which provides a numerical instrument for chip design and optimization. In future work, we plan to investigate the scalability of the SiPh-NA and of accelerators based on other processes.

As in the results above, the correlation coefficients between the experimental and theoretical outputs decrease as the input vector size increases. However, we find that the coefficients can be increased by reducing the symbol rate $R$ at the cost of computing speed. This phenomenon is due to the incompletely compressed carrier, the limited linear modulation range, and the non-flat EO and OE frequency responses of the IC-TROSA. In the SiPh-NA, the carrier is expected to be fully compressed, and all the optical power is supposed to carry the vectors $\textbf{A}$ and $\textbf{B}$. When the optical power increases and the number of values $n$ in the vectors $\textbf{A}$ and $\textbf{B}$ decreases, each value in the vectors occupies more power, leading to a higher signal-to-noise ratio (SNR) and higher computing precision. However, the carrier power cannot be fully compressed, and the linear modulation range is limited. As $n$ increases, the power of each value in $\textbf{A}$ and $\textbf{B}$ is reduced and can even fall below that of the carrier, resulting in a lower SNR and reduced computing precision. Furthermore, both $\textbf{A}$ and $\textbf{B}$ modulate light from the same source, which causes self-homodyne interference and further decreases the SNR of the BPD outputs. Using two independent sources for $\textbf{A}$ and $\textbf{B}$ does not work because the phase difference between the two sources results in computing errors. Therefore, at the architecture level, the computing speed and precision can be maintained and improved by using multiple computing cells simultaneously, either with fewer values in each cell to obtain a higher SNR or with the same vectors in multiple cells to improve the SNR by averaging the outputs. At the device and material levels, SNR gains can be realized by enhancing the linear range of the modulators and detectors and by lowering the noise from the optical devices and the light source, which can fundamentally address these concerns.

Besides, the EO and OE frequency responses of the IC-TROSA are not ideally flat, as shown in Figs. 2(d) and 2(e), introducing deviations when loading data and reading the results. A numerical analysis of the impacts of the non-ideal frequency response is presented in Supplement 1, Section 8. These impacts could be mitigated by measuring the frequency responses of the modulators and photodetectors, pre-compensating the modulation amplitudes, and rescaling the output amplitudes at the corresponding frequencies, which is similar to the frequency equalization in [62] and should be realized in the hardware implementation rather than through software-based calibration.
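
A simple way to realize the amplitude part of this pre-compensation is sketched below (assumptions: the measured Tx/Rx S21 magnitudes of Figs. 2(d) and 2(e) are available as frequency/dB arrays; phase equalization and the hardware realization are omitted).

```python
import numpy as np

def precompensate(amplitudes, tone_freqs, s21_freqs, s21_mag_db):
    """Scale each drive amplitude by the inverse of the measured end-to-end response."""
    mag = 10 ** (np.interp(tone_freqs, s21_freqs, s21_mag_db) / 20.0)  # dB -> linear
    mag = np.clip(mag, 0.1, None)          # cap the boost to stay within the linear range
    return np.asarray(amplitudes, dtype=float) / mag

# The same measured response can also rescale the amplitudes read at the beat
# frequencies before they are passed to the next network layer.
```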

5. CONCLUSION

In this work, we propose a novel scalable, integrable, and flexible SiPh-NA and demonstrate a proof of concept of the SiPh-NA with commercial IC-TROSAs. The experiments confirm that our SiPh-NA can operate at 1.024 TOPS/cell and achieve up to 96.67% accuracy in handwritten digit recognition when processing fully connected and convolutional neural networks. The energy efficiency with advanced CMOS and silicon photonic technology can reach 5.485 pJ/MAC, which is competitive among state-of-the-art optical computing frameworks. This work lays the groundwork for expanding the SiPh-NA cell into an array and for utilizing higher-linearity modulators to obtain computing speeds over several POPS and higher computing precision, which will be verified in future work.

Funding

National Natural Science Foundation of China (U21A20454); Young Top-notch Talent Cultivation Program of Hubei Province; Natural Science Foundation of Hubei Province (2023AFB528).

Acknowledgment

Y. Zhu designed the architecture, conducted the experiments, collected and analyzed the data, and wrote the manuscript. M. Luo, X. Hua, C. Yang, and Q. Wang provided guidance and discussion for experiments. X. Hua, L. Xu, and J. Liu revised the manuscript. L. Xu finished the theory simulation. M. Lei provided a discussion on circuit design. J. Liu, Y. Liu, and M. Liu tested the device. X. Xiao provided guidance in architecture and experiment design and critical comments on the manuscript. All authors reviewed and approved the final version of the manuscript.

Disclosures

The authors declare no conflicts of interest.

Data availability

Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request.

Supplemental document

See Supplement 1 for supporting content.

REFERENCES

1. Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature 521, 436–444 (2015). [CrossRef]  

2. D. Silver, A. Huang, C. J. Maddison, et al., “Mastering the game of Go with deep neural networks and tree search,” Nature 529, 484–489 (2016). [CrossRef]  

3. A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” Commun. ACM 60, 84–90 (2017). [CrossRef]  

4. T. Brown, B. Mann, N. Ryder, et al., “Language models are few-shot learners,” in Advances in Neural Information Processing Systems (2020), Vol. 33, pp. 1877–1901.

5. E. L. Denton, W. Zaremba, J. Bruna, et al., “Exploiting linear structure within convolutional networks for efficient evaluation,” in Advances in Neural Information Processing Systems (2014), Vol. 27.

6. S. Han, X. Liu, H. Mao, et al., “EIE: Efficient inference engine on compressed deep neural network,” ACM SIGARCH Comput. Archit. News 44(3), 243–254 (2016). [CrossRef]  

7. M. Le Gallo, R. Khaddam-Aljameh, M. Stanisavljevic, et al., “A 64-core mixed-signal in-memory compute chip based on phase-change memory for deep neural network inference,” Nat. Electron. 6, 680–693 (2023). [CrossRef]  

8. D. B. Strukov, G. S. Snider, D. R. Stewart, et al., “The missing memristor found,” Nature 453, 80–83 (2008). [CrossRef]  

9. M. Jerry, P.-Y. Chen, J. Zhang, et al., “Ferroelectric FET analog synapse for acceleration of deep neural network training,” in IEEE International Electron Devices Meeting (IEDM) (IEEE, 2017), pp. 2–6.

10. K. Ni, X. Yin, A. F. Laguna, et al., “Ferroelectric ternary content-addressable memory for one-shot learning,” Nat. Electron. 2, 521–529 (2019). [CrossRef]  

11. G. Singh, L. Chelini, S. Corda, et al., “Near-memory computing: past, present, and future,” Microprocess. Microsyst. 71, 102868 (2019). [CrossRef]  

12. K. Roy, I. Chakraborty, M. Ali, et al., “In-memory computing in emerging memory technologies for machine learning: An overview,” in 57th ACM/IEEE Design Automation Conference (DAC) (IEEE, 2020), pp. 1–6.

13. A. Sebastian, M. Le Gallo, R. Khaddam-Aljameh, et al., “Memory devices and applications for in-memory computing,” Nat. Nanotechnol. 15, 529–544 (2020). [CrossRef]  

14. M. V. DeBole, B. Taba, A. Amir, et al., “TrueNorth: Accelerating from zero to 64 million neurons in 10 years,” Computer 52, 20–29 (2019). [CrossRef]  

15. S. Schmitt, J. Klähn, G. Bellec, et al., “Neuromorphic hardware in the loop: training a deep spiking network on the BrainScaleS wafer-scale system,” in International Joint Conference on Neural Networks (IJCNN) (IEEE, 2017), pp. 2227–2234.

16. S. Han, H. Mao, and W. J. Dally, “Deep compression: compressing deep neural networks with pruning, trained quantization and Huffman coding,” arXiv, arXiv:1510.00149 (2015).

17. S. Han, “Efficient methods and hardware for deep learning,” Ph.D. thesis (Stanford University, 2017).

18. Y. Zhu, G. L. Zhang, T. Wang, et al., “Statistical training for neuromorphic computing using memristor-based crossbars considering process variations and noise,” in Design, Automation & Test in Europe Conference & Exhibition (DATE) (IEEE, 2020), pp. 1590–1593.

19. P. Spilger, E. Müller, A. Emmel, et al., “hxtorch: PyTorch for BrainScaleS-2: perceptrons on analog neuromorphic hardware,” in IoT Streams for Data-Driven Predictive Maintenance and IoT, Edge, and Mobile for Embedded Machine Learning: 2nd International Workshop, IoT Streams 2020, and 1st International Workshop, ITEM 2020, Co-located with ECML/PKDD, Revised Selected Papers 2, Ghent, Belgium, 14 –18 September 2020 (Springer, 2020), pp. 189–200.

20. G. Wetzstein, A. Ozcan, S. Gigan, et al., “Inference in artificial intelligence with deep optics and photonics,” Nature 588, 39–47 (2020). [CrossRef]  

21. H. Zhou, J. Dong, J. Cheng, et al., “Photonic matrix multiplication lights up photonic accelerator and beyond,” Light Sci. Appl. 11, 1–21 (2022). [CrossRef]  

22. J. Cardenas, C. B. Poitras, J. T. Robinson, et al., “Low loss etchless silicon photonic waveguides,” Opt. Express 17, 4752–4757 (2009). [CrossRef]  

23. L. Vivien, A. Polzer, D. Marris-Morini, et al., “Zero-bias 40gbit/s germanium waveguide photodetector on silicon,” Opt. Express 20, 1096–1101 (2012). [CrossRef]  

24. L. Yang, L. Zhang, and R. Ji, “On-chip optical matrix-vector multiplier,” Proc. SPIE 8855, 100–104 (2013). [CrossRef]  

25. S. Ambrogio, P. Narayanan, H. Tsai, et al., “Equivalent-accuracy accelerated neural-network training using analogue memory,” Nature 558, 60–67 (2018). [CrossRef]  

26. J. Gu, C. Feng, Z. Zhao, et al., “Efficient on-chip learning for optical neural networks through power-aware sparse zeroth-order optimization,” in Proceedings of the AAAI Conference on Artificial Intelligence (2021), Vol. 35, pp. 7583–7591.

27. Y. Zhu, M. Liu, L. Xu, et al., “Multi-wavelength parallel training and quantization-aware tuning for WDM-based optical convolutional neural networks considering wavelength-relative deviations,” in Proceedings of the 28th Asia and South Pacific Design Automation Conference (2023), pp. 384–389.

28. P. Dong, Y.-K. Chen, G.-H. Duan, et al., “Silicon photonic devices and integrated circuits,” Nanophotonics 3, 215–228 (2014). [CrossRef]  

29. A. N. Tait, T. F. De Lima, E. Zhou, et al., “Neuromorphic photonic networks using silicon photonic weight banks,” Sci. Rep. 7, 1–10 (2017). [CrossRef]  

30. G. Tanaka, T. Yamane, J. B. Héroux, et al., “Recent advances in physical reservoir computing: a review,” Neural Netw. 115, 100–123 (2019). [CrossRef]  

31. D. Brunner, M. C. Soriano, C. R. Mirasso, et al., “Parallel photonic information processing at gigabyte per second data rates using transient states,” Nat. Commun. 4, 1364 (2013). [CrossRef]  

32. K. Vandoorne, P. Mechet, T. Van Vaerenbergh, et al., “Experimental demonstration of reservoir computing on a silicon photonics chip,” Nat. Commun. 5, 3541 (2014). [CrossRef]  

33. K. Liu, T. Zhang, B. Dang, et al., “An optoelectronic synapse based on α-In2Se3 with controllable temporal dynamics for multimode and multiscale reservoir computing,” Nat. Electron. 5, 761–773 (2022). [CrossRef]  

34. Y.-W. Shen, R.-Q. Li, G.-T. Liu, et al., “Deep photonic reservoir computing recurrent network,” Optica 10, 1745–1751 (2023). [CrossRef]  

35. Y. Shen, N. C. Harris, S. Skirlo, et al., “Deep learning with coherent nanophotonic circuits,” Nat. Photonics 11, 441–446 (2017). [CrossRef]  

36. X. Xu, M. Tan, B. Corcoran, et al., “11 tops photonic convolutional accelerator for optical neural networks,” Nature 589, 44–51 (2021). [CrossRef]  

37. J. Feldmann, N. Youngblood, M. Karpov, et al., “Parallel convolutional processing using an integrated photonic tensor core,” Nature 589, 52–58 (2021). [CrossRef]  

38. C. Huang, S. Fujisawa, T. F. de Lima, et al., “A silicon photonic–electronic neural network for fibre nonlinearity compensation,” Nat. Electron. 4, 837–844 (2021). [CrossRef]  

39. Y. Zhu, X. Zhang, X. Hua, et al., “Optoelectronic neuromorphic accelerator at 523.27 GOPS based on coherent optical devices,” in Optical Fiber Communication Conference (Optica Publishing Group, 2023), paper M2J-4.

40. X. Meng, G. Zhang, N. Shi, et al., “Compact optical convolution processing unit based on multimode interference,” Nat. Commun. 14, 3000 (2023). [CrossRef]  

41. Y. Chen, M. Nazhamaiti, H. Xu, et al., “All-analog photoelectronic chip for high-speed vision tasks,” Nature 623, 48–57 (2023). [CrossRef]  

42. C. D. McGillem and G. R. Cooper, Continuous and Discrete Signal and System Analysis (Oxford University, 1991).

43. X. Xie, Y. Dai, K. Xu, et al., “Broadband photonic RF channelization based on coherent optical frequency combs and I/Q demodulators,” IEEE Photonics J. 4, 1196–1202 (2012). [CrossRef]  

44. N. Picqué and T. W. Hänsch, “Frequency comb spectroscopy,” Nat. Photonics 13, 146–157 (2019). [CrossRef]  

45. Z. Tang, D. Zhu, and S. Pan, “Coherent optical RF channelizer with large instantaneous bandwidth and large in-band interference suppression,” J. Lightwave Technol. 36, 4219–4226 (2018). [CrossRef]  

46. T. Fortier and E. Baumann, “20 years of developments in optical frequency comb technology and applications,” Commun. Phys. 2, 153 (2019). [CrossRef]  

47. X. Xiao, L. Wang, M. Luo, et al., “High baudrate silicon photonics for the next-generation optical communications,” in European Conference on Optical Communication (ECOC) (IEEE, 2022), pp. 1–4.

48. X. Hu, D. Wu, H. Zhang, et al., “Ultrahigh-speed silicon-based modulators/photodetectors for optical interconnects,” in Optical Fiber Communications Conference and Exhibition (OFC) (IEEE, 2023), pp. 1–3.

49. M. Xu, M. He, H. Zhang, et al., “High-performance coherent optical modulators based on thin-film lithium niobate platform,” Nat. Commun. 11, 3911 (2020). [CrossRef]  

50. M. J. Filipovich, Z. Guo, M. Al-Qadasi, et al., “Silicon photonic architecture for training deep neural networks with direct feedback alignment,” Optica 9, 1323–1332 (2022). [CrossRef]  

51. F. Pedregosa, G. Varoquaux, A. Gramfort, et al., “Scikit-learn: Machine learning in Python,” J. Mach. Learn. Res. 12, 2825–2830 (2011).

52. A. Paszke, S. Gross, F. Massa, et al., “PyTorch: An imperative style, high-performance deep learning library,” in Advances in Neural Information Processing Systems (2019), Vol. 32.

53. M. A. Nielsen, Neural Networks and Deep Learning (Determination, 2015), Vol. 25.

54. A. Sludds, S. Bandyopadhyay, Z. Chen, et al., “Delocalized photonic deep learning on the internet’s edge,” Science 378, 270–276 (2022). [CrossRef]  

55. H. Zhu, J. Zou, H. Zhang, et al., “Space-efficient optical computing with an integrated chip diffractive neural network,” Nat. Commun. 13, 1044 (2022). [CrossRef]  

56. F. Ashtiani, A. J. Geers, and F. Aflatouni, “An on-chip photonic deep neural network for image classification,” Nature 606, 501–506 (2022). [CrossRef]  

57. NVIDIA Corporation, “NVIDIA A100 tensor core GPU,” https://www.nvidia.com/en-us/data-center/a100/.

58. J. Gu, Z. Zhao, C. Feng, et al., “ROQ: A noise-aware quantization scheme towards robust optical neural networks with low-bit controls,” in Design, Automation & Test in Europe Conference & Exhibition (DATE) (IEEE, 2020), pp. 1586–1589.

59. Z. Zheng, Z. Duan, H. Chen, et al., “Dual adaptive training of photonic neural networks,” Nat. Mach. Intell. 5, 1119–1129 (2023). [CrossRef]  

60. M. Al-Qadasi, L. Chrostowski, B. Shastri, et al., “Scaling up silicon photonic-based accelerators: Challenges and opportunities,” APL Photonics 7, 020902 (2022). [CrossRef]  

61. G. Giamougiannis, A. Tsakyridis, M. Moralis-Pegios, et al., “Analog nanophotonic computing going practical: silicon photonic deep learning engines for tiled optical matrix multiplication with dynamic precision,” Nanophotonics 12, 963–973 (2023). [CrossRef]  

62. C. Yang, R. Hu, M. Luo, et al., “IM/DD-based 112-Gb/s/lambda PAM-4 transmission using 18-Gbps DML,” IEEE Photonics J. 8, 7903907 (2016). [CrossRef]  
