
Photonics-enabled spiking timing-dependent convolutional neural network for real-time image classification

Open Access

Abstract

A photonics-enabled spiking timing-dependent convolutional neural network (CNN) is proposed by manipulating multiple photonic dimensions, namely wavelength, time and space, which maps the traditional CNN architecture from a spatially parallel structure to a time-dependent serial structure. The proposed CNN, applied to real-time image recognition, comprises a photonics convolution processor to accelerate the computing and an electronic fully connected layer to execute the classification task. A timing-dependent series of matrix-matrix operations is conducted in the photonics convolution processor based on multidimensional multiplexing: carrier accumulation from an active mode-locked laser, dispersion latency induced by a dispersion compensation fiber, and wavelength spatial separation via a waveshaper. Incorporating the electronic fully connected layer, the photonics-enabled CNN is shown to perform a real-time recognition task on the MNIST database of handwritten digits with a prediction accuracy of 90.04%. Photonics enables conventional neural networks to accelerate machine learning and neuromorphic computing and has the potential to be widely used in information processing and computing, such as goods classification, vowel recognition, and speech identification.

© 2022 Optica Publishing Group under the terms of the Optica Open Access Publishing Agreement

1. Introduction

Information processing using CNNs, a major category of artificial neural networks (ANNs) for two-dimensional data processing, has flourished in electronic hardware [1] for emerging application areas such as image classification [2–5], computer vision [6,7], and speech recognition [8]. A CNN can abstract the features of the original information and predict the result with high accuracy and largely reduced parametric complexity. The key challenge for object recognition and classification using machine learning on massive data is the trade-off among computational rate, latency and energy efficiency [9–11]. In digital neuromorphic hardware based on the von Neumann architecture, central processing units (CPUs), graphical processing units (GPUs), field-programmable gate arrays (FPGAs) and application-specific integrated circuits (ASICs) are the main hardware platforms and have been developed to dramatically improve computing performance. Although the computing speed is high, the processor unit is separate from the memory unit in von Neumann hardware. Hence, performance is largely restricted by the time spent reading and moving data back and forth between the memory and the processor, known as the von Neumann bottleneck [12,13]. Moreover, the charging and discharging of the memory and processor units results in low power efficiency.

Light, as a promising medium of information transmission, is not yet widely used in computing. Compared with electronics, light signals offer low-loss interconnectivity and highly linear operations, which makes photonics well suited to implementing neural networks. Photonics, as an alternative approach to machine learning, has created a renewed case for neural networks by taking advantage of a higher computation rate, better power efficiency and lower latency [14–18] than its electrical analog. An optical neural network (ONN) can perform linear matrix multiplication at the speed of light with subnanosecond latency, and frequency-dependent distortions are minimal for massively parallel connections, which means that the data processing rate can reach up to 65 GHz. In addition, the optical signal experiences lower transmission loss and generates less heat in the medium. Once an ONN is trained for a specific task, it becomes a passive system and performs identification with minimal power consumption. In general, photonics computing, as a supplement to the conventional computing architecture, does not replace electronic components but expands the present application areas, such as quantum signal processing and intelligent microwave signal processing [19].

ONNs have been reported as large-scale parallel, high-speed on-chip and discrete optical systems [20–32]. One of them introduced a sequential multilayer diffractive neural network using two-dimensional phase diffraction plates as the all-optical diffractive processing unit [20], which achieved a classification accuracy of 91.75% on 10,000 images from the MNIST test dataset. When a new classification task must be executed, a redesigned 3D-printed diffraction plane is substituted in the hidden layer to reconfigure the ONN. A programmable unitary matrix processor with a cascaded array of 56 programmable Mach–Zehnder interferometers was fabricated on a silicon photonic platform [21] and demonstrated vowel recognition with an accuracy of 76.7%. The dominant limiting factor is thermal crosstalk among the thermo-optic phase shifters, which restricts the resolution. The all-optical spiking neuron based on wavelength division multiplexing in [22] successfully demonstrated pattern recognition, and the accuracy was improved by introducing a nonlinear phase-change material (PCM) as the neural activation function. However, switching, reading and monitoring the transmission of the PCM-based nanophotonic memory device requires multiple pulses to change the phase states, which makes it difficult to control all the states.

A high-speed serial CNN composed of an optical convolutional accelerator, an electrical pooling layer and an optical fully connected layer was demonstrated in [23] to achieve image classification, where the CNN can theoretically reach a vector computing speed as high as 11.3 TOPS for ten simultaneous 3×3 kernels. That CNN recognized 50 images from the MNIST database of handwritten digits with an accuracy of 88%. In [23], the fully connected layer is implemented in the optical domain, which limits the length of the vector to be fully connected and leads to a decrease in recognition accuracy. To improve the recognition accuracy, the vector length must be increased. Most notably, the dispersion coefficient of the fiber in the fully connected layer has an approximately linear relation with wavelength. As the number of neurons increases (corresponding to an increase in the wavelength range), more wavelengths are used to implement the interconnections among the neurons. Consequently, the time latency is not fixed but variable, which leaves the neurons only partially connected and prevents the recognition accuracy from being improved further.

Using photonics to enable conventional computing with low latency, high bandwidth and low energy consumption, a photonics-enabled spiking timing-dependent CNN is proposed, which comprises a photonics convolution processor for accelerated signal computing and an electronic fully connected layer for real-time image classification. We demonstrate a photonics convolution processor operating at 591.12 GOPS for two 3×3 kernels with a convolution window vertical sliding stride of 1 and real-time recognition of 600 images. Matrix-vector multiplication is performed photonically, expanding the available information dimensions from a one-dimensional spatially parallel form to a multidimensional timing-dependent series (wavelength, time and space). Incorporating the electronic fully connected layer, we achieve a classification accuracy of 90.04% on the 10,000 test images from the MNIST database of handwritten digits. This is sufficiently powerful for goods classification, vowel recognition, speech identification and so on.

2. Principle

The basic CNN model generally comprises a series of convolution layers, fully connected layers and nonlinear activation functions, in which the convolution layers execute image feature extraction and the subsequent fully connected layers conduct image classification. The proposed photonics-enabled CNN is illustrated in Fig. 1(A) and is composed of a photonics convolution processor and an electronic fully connected layer. The photonics convolution processor, driven by a pulsed laser, is implemented in the optical domain and performs matrix-matrix multiplications via an optical pulse series; a dispersion compensation fiber (DCF) generates the spiking time-dependent convolution processing that compresses the neural network from a two-dimensional spatially parallel structure to a one-dimensional time-dependent serial structure, as shown in Fig. 1(B). The rest of the photonics-enabled CNN is executed on a digital computer.


Fig. 1. The proposed photonics-enabled CNN. (A) The full neural network structure including an optical convolution layer and an electrical fully connected layer. (B) Schematic diagram of the photonics convolution processor. MLL: mode-locked laser; IM: intensity modulator; DCF: dispersion compensation fiber; BPD: balanced photodetector; OSC: oscilloscope; AWG: arbitrary waveform generator. a: temporal waveform encoded in AWG corresponding to the two-dimensional origin image flattened to one-dimensional serial vector X; b: spectrum (${\lambda _i}({i = 1, \ldots ,n} )$) of optical pulses from MLL; c: the reshaped spectrum ($({{\lambda_i},{W_i}} )({i = 1, \ldots ,4} )$) after amplitude weight adjustment with kernel matrix; d: temporal waveforms ($({{\lambda_i},{W_i}X} )({i = 1, \ldots ,4} )$) of each spectral component modulated by one-dimensional serial vector X; e: temporal distribution ($({{\lambda_i},{W_i}X({t - ({i - 1} )T} )} )({i = 1, \ldots ,4} )$) of spectral components via dispersion delay T in DCF; f-1, f-2: spectral spatial demultiplexing ($({{\lambda_i},{W_i}X} )({i = 2,4} )$ and $({{\lambda_i},{W_i}X} )({i = 1,3} )$) that makes the negative values and nonnegative values of one kernel divided into two physical channels; g: temporal waveform output from BPD.


The photonic convolution processor, as the predominant portion of the proposed photonics-enabled CNN, is illustrated in Fig. 1(B). Suppose the 2×2 kernel matrix is $W = \left[ \begin{array}{cc} w_1 & w_3\\ w_2 & w_4 \end{array} \right]$ and the original 4×4 pixel image matrix is $X = \left[ \begin{array}{cccc} x_1 & x_3 & x_5 & x_7\\ x_2 & x_4 & x_6 & x_8\\ x_9 & x_{11} & x_{13} & x_{15}\\ x_{10} & x_{12} & x_{14} & x_{16} \end{array} \right]$. The matrix-matrix multiplication is mapped onto vector-vector multiplications as follows. The input image X, given in matrix form, first needs to be flattened into a vector. The 4×4 image matrix is sliced horizontally into (4-2 + 1) = 3 sub-matrices of size 2×4 with a vertical sliding stride of 1. These 2×4 sub-matrices are then flattened into 1×8 vector slices and connected head-to-tail to form the 1×24 vector ${X_0} = [ x_1 \quad x_2 \quad x_3 \quad \ldots \quad x_2 \quad x_9 \quad x_4 \quad \ldots \quad x_9 \quad x_{10} \quad x_{11} \quad \ldots ]$. The amplitudes of the wavelengths $\lambda_1 \sim \lambda_n$ of the optical pulse from the active mode-locked laser (MLL) are reshaped in the waveshaper to apply the amplitude weights of the kernel matrix. In the intensity modulator (IM), the one-dimensional vector X, sequentially encoded in the arbitrary waveform generator (AWG), is modulated onto the optical pulse train at a synchronized frequency, generating a series of weighted replicas. The temporal waveform is then multicast through a DCF to produce a wavelength-to-time delay. To match the latency to the repetition period of the optical pulses, the repetition frequency ${f_{rep}}$ of the optical pulses, the group velocity dispersion coefficient $|\ddot{\beta }|$ and the length ${L_T}$ of the DCF must fulfil the following relationship:

$$\frac{1}{{f_{rep}^2}} = 2\pi |\ddot{\beta }|{L_T}.$$

When Eq. (1) is satisfied, the latency equals the period of the serial data, and the system effectively achieves interleaved interconnections in wavelength, time and space. The negative weights and positive weights are split into two channels at a demultiplexer. The delayed weight replicas are then accumulated at a balanced photodetector (BPD), which realizes the negative signs of the weights, and recorded on the oscilloscope (OSC). Each recombined optical pulse yields a convolution value between the kernel matrix $W$ and the original input image matrix X in one convolution window. The peak values of the output waveform are expressed as

$$\begin{array}{l} i_1 = w_4 x_1\\ i_2 = w_3 x_1 + w_4 x_2\\ i_3 = w_2 x_1 + w_3 x_2 + w_4 x_3\\ i_4 = w_1 x_1 + w_2 x_2 + w_3 x_3 + w_4 x_4\\ \vdots \end{array}$$

It can be seen from Eq. (2), as well as inset g in Fig. 1(B), that not all convolution values are valid for feature extraction. For one slice such as ${X_1} = [ x_1 \quad x_2 \quad x_3 \quad \ldots \quad x_8 ]$, the peak values of the output waveform are ${I_1} = [ i_1 \quad i_2 \quad i_3 \quad \ldots \quad i_8 ]$, where the first two peak values $i_1$ and $i_2$ are invalid; thereafter, the last of every two adjacent values in ${I_1}$ is valid (gray background rectangles in insets (e, f-1, f-2, g) of Fig. 1(B)). Following this rule, the sub-vectors $[ i_4 \quad i_6 \quad i_8 ]$, $[ i_{12} \quad i_{14} \quad i_{16} ]$ and $[ i_{20} \quad i_{22} \quad i_{24} ]$ for the respective sub-matrices are formed and recombined into a 3×3 image feature matrix P as follows:

$$P = \left[ \begin{array}{ccc} i_4 & i_6 & i_8\\ i_{12} & i_{14} & i_{16}\\ i_{20} & i_{22} & i_{24} \end{array} \right].$$
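To make the serial mapping concrete, the following numpy sketch (our illustration, not the authors' code; the random X and W are placeholders) reproduces Eqs. (2) and (3) for the 2×2 kernel and 4×4 image: the flattened slices are convolved with the time-reversed serial kernel to emulate the delayed, wavelength-weighted replicas, and the valid peaks are recombined into the feature matrix P.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.integers(0, 10, size=(4, 4)).astype(float)   # stand-in 4x4 image
W = rng.integers(-2, 3, size=(2, 2)).astype(float)   # stand-in 2x2 kernel

k, n = 2, 4
# Slice the image into (n - k + 1) overlapping k x n sub-matrices (vertical stride 1),
# flatten each column by column, and join them head-to-tail into the serial vector X0.
slices = [X[r:r + k, :] for r in range(n - k + 1)]
X0 = np.concatenate([s.flatten(order="F") for s in slices])      # 1 x 24
w_serial = W.flatten(order="F")                                  # [w1, w2, w3, w4]

# Each output peak is a sum of delayed, weighted replicas of X0 (Eq. (2)); in the
# processor each wavelength carries one weight and the DCF delays it by one symbol.
i_full = np.convolve(X0, w_serial[::-1])[:X0.size]

# Within each slice of length k*n, the first k*k - 1 peaks are partial sums; thereafter
# every k-th peak corresponds to a complete convolution window (i4, i6, i8, ...).
Ls = k * n
valid = np.array([r * Ls + (k * k - 1) + k * c
                  for r in range(n - k + 1) for c in range(n - k + 1)])
P = i_full[valid].reshape(n - k + 1, n - k + 1)                  # Eq. (3)

# Cross-check against a direct sliding-window correlation of X with W.
P_ref = np.array([[np.sum(W * X[r:r + k, c:c + k]) for c in range(n - k + 1)]
                  for r in range(n - k + 1)])
assert np.allclose(P, P_ref)
```

Running the sketch confirms that the peaks picked out of the serial output equal the direct sliding-window correlation of X with W for every convolution window.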

The image feature matrix P is then sent to an electronic fully connected layer on a digital computer to perform the N-class classification task. The 9×1 flattened feature vector is first fed into the fully connected layer for training with the backpropagation algorithm in TensorFlow 1.14. The resulting N×9 fully connected matrix is then used to verify the classification task on the test set. The classification result is an N×1 vector, i.e., the matrix-vector product of the N×9 fully connected matrix and the 9×1 feature vector. The maximum value in the N×1 vector indicates the recognition result, and the recognition accuracy is the proportion of correct recognition results in the total data.
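For the classification step itself, a minimal sketch (ours) of the matrix-vector multiplication and maximum search described above is given below; the random weights and feature vector are placeholders, and N = 10 is assumed for the digit task.

```python
import numpy as np

N = 10                                    # number of classes (assumed: ten digits)
rng = np.random.default_rng(1)
W_fc = rng.normal(size=(N, 9))            # placeholder for the trained N x 9 FC matrix
p = rng.normal(size=9)                    # placeholder 9 x 1 flattened feature vector

scores = W_fc @ p                         # N x 1 classification vector
predicted_class = int(np.argmax(scores))  # the maximum value indicates the result
```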

3. Experiment and results

The photonics-enabled CNN is set up as shown in Fig. 2 to verify the ten-class classification task. A sine signal from a vector signal generator (VSG, ROHDE&SCHWARZ, SMW200A) with a frequency of 16.42 GHz is injected into an MLL (PriTel, UOC-05-20 GHz) to generate ultranarrow optical pulses, whose comb lines are mutually phase locked, span 1550.46 nm to 1552.73 nm, and are spaced by 0.13 nm. The VSG not only generates the microwave signal that actively locks the MLL but also emits a 10 MHz reference signal for synchronization of the neural network system. Images with 28×28 pixels from the MNIST handwritten digits database, in digital grayscale values, are flattened into one-dimensional data series and encoded into the AWG (Tektronix, AWG70001A). The 16.42 Gbaud temporal waveform from the AWG, with a sampling rate of 49.26 GSa/s, passes through an electric amplifier (EA, iXblue, DR-DG-40-MO) and then drives the IM (iXblue, MX-LN-40). As mentioned above, the repetition period of the optical pulses exactly matches the symbol period of the encoded data series. Since the photonics-enabled CNN is a clock-synchronized system with a 10 MHz clock signal, each symbol of ∼60.90 ps duration is sampled by exactly one optical pulse. The sampled optical pulses are multicast along multiple coherent wavelengths and pass through a roll of DCF with a dispersion coefficient of -462 ps/nm. Substituting the repetition frequency of the optical pulses and the dispersion coefficient into Eq. (1), the latency is calculated to be 60.91 ps, corresponding to a one-symbol delay.
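As a sanity check (ours), the delay-matching condition of Eq. (1) can be verified numerically from the quoted parameters; the 1551.6 nm centre wavelength is our assumption, taken as the midpoint of the 1550.46–1552.73 nm comb, and the -462 ps/nm figure is treated as the accumulated dispersion of the whole DCF roll.

```python
import math

c = 2.998e8                  # speed of light, m/s
f_rep = 16.42e9              # comb line spacing / symbol rate, Hz
D_total = 462e-12 / 1e-9     # accumulated dispersion magnitude of the DCF roll, s/m (462 ps/nm)
lam = 1551.6e-9              # assumed centre wavelength, m

symbol_period = 1 / f_rep                            # ~60.90 ps
beta2_L = D_total * lam ** 2 / (2 * math.pi * c)     # |beta''| * L_T in s^2
line_delay = 2 * math.pi * beta2_L * f_rep           # delay between adjacent comb lines

print(f"symbol period: {symbol_period * 1e12:.2f} ps")  # ~60.90 ps
print(f"line delay   : {line_delay * 1e12:.2f} ps")     # ~60.9 ps, so Eq. (1) is met
```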


Fig. 2. Experimental structure diagram of the photonics-enabled CNN for real-time image classification. VSG: vector signal generator; EA: electric amplifier; SOA: semiconductor optical amplifier; EDFA: erbium-doped fiber amplifier; PD: photodetector; PC: personal computer.


Each recombined optical pulse, which contains disparate spectral components originating from different input optical pulses, is amplified in a semiconductor optical amplifier (SOA, Thorlabs, SOA1117P) and reshaped via a waveshaper (Finisar, WaveShaper 4000s). The waveshaper, as a main component of the proposed convolution processor, simultaneously performs three functions: (1) flattening the amplitudes of the used wavelengths and removing the unused wavelengths, (2) adjusting the amplitude weights of the used wavelengths, and (3) demultiplexing the wavelengths carrying the negative and nonnegative values of a common kernel matrix. Therefore, two kernels can be used simultaneously to extract image features with the four-output-channel waveshaper available in our lab.

The spectrum from the waveshaper is amplified by four erbium-doped fiber amplifiers (EDFAs, Keopsys, CEFA-C-HG) and then detected by photodetectors (PDs, HEWLETT PACKARD, 11982A). Four independent PDs are used here rather than two BPDs because no matched pair of BPDs was available in our lab. The four detected electronic signals are recorded on an OSC (Tektronix, DPO73304D) with a sampling rate of 100 GSa/s and sent to a digital computer through a LAN interface using the virtual instrument software architecture (VISA) protocol. The convolution result for each kernel is then obtained in the digital computer by subtracting the negative component from the nonnegative component.

In the experiment, two kernels are used to extract the features of the original image: kernel 1 $= \left[ \begin{array}{ccc} -1 & 0 & 1\\ -1 & 0 & 1\\ -1 & 0 & 1 \end{array} \right]$ extracts the right edge, and kernel 2 $= \left[ \begin{array}{ccc} -1 & -1 & -1\\ 0 & 0 & 0\\ 1 & 1 & 1 \end{array} \right]$ extracts the bottom edge. The kernel-weighted spectra at the outputs of the waveshaper are shown in Fig. 3, where the orange and green lines represent the negative and nonnegative values of kernel 1, while the red and blue lines represent the negative and nonnegative values of kernel 2.
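The split of each signed kernel into a nonnegative and a negative channel, with the subtraction performed after detection, can be emulated with a short numpy sketch (ours; the random image is a placeholder for an MNIST digit).

```python
import numpy as np

kernel1 = np.array([[-1, 0, 1],
                    [-1, 0, 1],
                    [-1, 0, 1]], dtype=float)   # right-edge kernel
kernel2 = np.array([[-1, -1, -1],
                    [ 0,  0,  0],
                    [ 1,  1,  1]], dtype=float) # bottom-edge kernel

def correlate_valid(img, ker):
    """Direct sliding-window correlation over the 'valid' region, stride 1."""
    kh, kw = ker.shape
    out = np.empty((img.shape[0] - kh + 1, img.shape[1] - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(ker * img[r:r + kh, c:c + kw])
    return out

def split_kernel_convolution(img, ker):
    # Nonnegative and negative parts travel on separate waveshaper ports / PDs;
    # the signed result is recovered digitally by subtraction.
    ker_pos = np.where(ker > 0, ker, 0.0)
    ker_neg = np.where(ker < 0, -ker, 0.0)
    return correlate_valid(img, ker_pos) - correlate_valid(img, ker_neg)

img = np.random.default_rng(2).random((28, 28))  # stand-in for one MNIST image
feat1 = split_kernel_convolution(img, kernel1)   # 26 x 26 right-edge feature map
feat2 = split_kernel_convolution(img, kernel2)   # 26 x 26 bottom-edge feature map
assert np.allclose(feat1, correlate_valid(img, kernel1))  # subtraction recovers the signed result
```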


Fig. 3. The spectra for two convolution kernels at four output ports of the waveshaper. The orange and green lines represent the negative and nonnegative values of convolution kernel 1. The red and blue lines represent the negative and nonnegative values of convolution kernel 2.


A total of 70,000 images with 28×28 pixels each from the MNIST handwritten digits database are processed in the photonics convolution processor. The matrix-matrix multiplication is verified using two 3×3 kernels with a stride of 1. Following the data flattening method illustrated in Fig. 1(B), since a 3×3 kernel is used, three adjacent rows of the input image matrix are selected and flattened in turn. The 28×28 image matrix is first sliced horizontally into (28-3 + 1) = 26 sub-matrices of size 3×28. These 3×28 sub-matrices are then flattened into 1×84 vector slices and connected head-to-tail to form a 1×2184 vector. In our demonstration, 600 images are encoded sequentially in one process. To precisely distinguish contiguous images, each 1×2184 vector is padded with ten zeros into a 1×2194 vector. The data input rate is set to 16.42 Gbaud, so the convolution duration for one image is (2184 + 10)/16.42 GBd ≈ 133.62 ns. Therefore, the computing speed of the convolution operation is 2×9×16.42×2 = 591.12 billion operations per second (GOPS).
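The bookkeeping behind these figures can be reproduced with a few lines (ours):

```python
n, k = 28, 3
baud = 16.42e9                                 # encoded symbol rate, baud

slices = n - k + 1                             # 26 horizontal 3x28 sub-matrices
serial_len = slices * k * n                    # 26 * 84 = 2184 symbols per image
padded_len = serial_len + 10                   # ten guard zeros between images
duration = padded_len / baud                   # ~133.62 ns per image

kernels, taps = 2, k * k                       # two 3x3 kernels, 9 MACs each
ops_per_symbol = kernels * taps * 2            # multiply + add per tap
speed_gops = ops_per_symbol * baud / 1e9       # 2 * 9 * 2 * 16.42 = 591.12 GOPS

print(slices, serial_len, padded_len)          # 26 2184 2194
print(f"{duration * 1e9:.2f} ns, {speed_gops:.2f} GOPS")
```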

The main factor limiting the computational speed to ∼591.12 GOPS is the experimental equipment available in our lab, namely the AWG with a sampling rate of 50 GSa/s and the four-channel waveshaper. There is enormous potential for improving the computing capability by increasing the number of wavelengths and spatial channels along with the encoded data rate, which can be implemented with off-the-shelf components and devices. First, the optical pulse from the mode-locked laser without spectral reshaping would yield 125 wavelengths versus the 18 wavelengths used here, which could increase either the kernel size or the number of kernels. Additionally, the encoding speed could reach 65 Gbaud using commercially available components and devices, including a high-speed AWG (Keysight, M8199A, 65 GHz), IM (EOspace, AZ-DV5-65, 65 GHz) and PD (Finisar, XPDV3120R, 70 GHz). Finally, a waveshaper, or a similar device such as a wavelength selective switch (Lumentum, Twin 1×35 WSS), can offer 35 channels or more to spatially separate the wavelengths and thereby simultaneously realize at least 17 kernels. Taking power spatial-division multiplexing into account, the photonics convolution processor can be split into several parallel paths with several waveshapers to further increase the number of kernels.

A total of 600 images are sequentially encoded in the AWG, and feature extraction is performed by the photonics convolution processor. The output waveforms are recorded by the OSC and fed back to a digital computer in real time via a LAN interface using the VISA protocol. The recorded waveforms are shown in Fig. 4(A)∼(D), where Fig. 4(A)∼(B) are the temporal waveforms of the negative and nonnegative values for convolution kernel 1, and Fig. 4(C)∼(D) are the corresponding waveforms for kernel 2. Figures 4(A)∼(D) comprise valid peak values, invalid peak values and the trigger values used to locate the initial time. Taking Fig. 4(D) as an example to illustrate the extraction of the valid peak values, the enlarged Fig. 4(E) shows a section of Fig. 4(D) containing the convolution results from 0 ns to 133.62 ns for one image and the trigger signal from -1.83 ns to 0 ns. The trigger signal contains 10 electrical pulses marking the beginning of the data samples. Once the trigger signal is found, the digital processor samples the data at intervals of 1/16.42 GBd ≈ 60.90 ps. Since each 28×28 image is convolved with a 3×3 kernel, the image is divided into 26 slices; for each slice, the first 6 peak values are invalid, and thereafter the last of every 3 adjacent values is valid. The valid values (orange dots in Fig. 4(E)) are therefore retained, and the invalid values (gray dots in Fig. 4(E)) are eliminated. Zooming in on part of Fig. 4(E), Fig. 4(F) shows the extracted peak values.
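The index arithmetic for retaining only the valid peaks can be sketched as follows (ours; the random trace stands in for one recorded and symbol-sampled image block):

```python
import numpy as np

n, k = 28, 3
slices = n - k + 1                  # 26 slices per image
slice_len = k * n                   # 84 symbols per serialized 3x28 slice

# 0-indexed positions of valid peaks within one image's 2184-symbol block: per slice,
# the first 6 samples are invalid, then only the last of every 3 adjacent samples
# (the 9th, 12th, ... samples) corresponds to a complete 3x3 window.
valid_idx = np.array([s * slice_len + (k * k - 1) + k * c
                      for s in range(slices) for c in range(slices)])

samples = np.random.default_rng(3).random(slices * slice_len)  # stand-in sampled trace
feature_map = samples[valid_idx].reshape(slices, slices)        # 26 x 26 feature map
assert valid_idx.size == 26 * 26 and valid_idx.max() == 2183
```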


Fig. 4. The normalized temporal waveforms from four PDs when 600 images from the training dataset of the MNIST database of handwritten digits perform the convolution operations. (A) and (B) are the temporal waveforms of negative and nonnegative values for kernel 1. (C) and (D) are the temporal waveforms of negative and nonnegative values for kernel 2. (E) shows the nonnegative values of kernel 2 for one image at 0 ns∼133.62 ns and the trigger signal at -1.83 ns∼0 ns. (F) Magnified part of (E) at 102 ns∼107 ns, where gray dots are sampled invalid values and orange dots are sampled valid values.


The convolution results of kernel 1 are obtained by subtracting the valid peak values in Fig. 4(A) from those in Fig. 4(B), and the convolution results of kernel 2 by subtracting the valid peak values in Fig. 4(C) from those in Fig. 4(D). The convolution results are rearranged into matrices, which are the feature images. Ten randomly selected convolution results are shown in Fig. 5. These images are evaluated by both a 64-bit digital computer and the photonics convolution processor for comparison. The feature maps obtained with kernel 1 extract the right edge, and those obtained with kernel 2 enhance the bottom edge.


Fig. 5. The feature images of the right and bottom edges calculated theoretically with the 64-bit computer and experimentally with the photonics convolution processor.


Each 28×28 pixel image is convolved with the two 3×3 convolution kernels at a stride of 1 to obtain two 26×26 pixel feature maps. The two feature maps are flattened into a 1352×1 vector that serves as the input nodes of the fully connected layer, and the prediction results are generated by ten output neurons. The nonlinear activation function used in the fully connected layer is the ReLU function. We first obtained feature images of the 70,000 images in the MNIST database of handwritten digits using the photonics convolution processor. Then, we trained the fully connected weight matrix on the feature images of the 60,000 training images using the backpropagation algorithm to minimize the cross-entropy loss. Finally, we tested the 10,000 test images of the MNIST database for prediction results. Figure 6(A) shows a recognition accuracy of 90.04% after 120 epochs, in contrast to a theoretical accuracy of 96.56% on a 64-bit computer with the same network structure, i.e., a CNN with one convolution layer and one fully connected layer, where the two kernels in the convolution layer extract the right and bottom edges and the kernel values are restricted to 1, -1 and 0. The confusion matrix for the 10,000 test images is shown in Fig. 6(B).
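A minimal training sketch of the fully connected stage is shown below, written with tf.keras rather than the authors' TensorFlow 1.14 script; the file names are hypothetical, and the optimizer, batch size and exact placement of the ReLU and softmax are our assumptions.

```python
import numpy as np
import tensorflow as tf

# Hypothetical files holding the photonic feature maps (70000 x 2 x 26 x 26) and labels.
feats = np.load("photonic_features.npy")
labels = np.load("mnist_labels.npy")
x = feats.reshape(len(feats), -1).astype("float32")   # flatten to 1352-element vectors

model = tf.keras.Sequential([
    tf.keras.layers.Dense(10, activation="relu", input_shape=(1352,)),  # FC layer, ten outputs
    tf.keras.layers.Softmax(),                                          # class probabilities (placement assumed)
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",  # cross-entropy loss, as in the text
              metrics=["accuracy"])
model.fit(x[:60000], labels[:60000], epochs=120, batch_size=128,
          validation_data=(x[60000:], labels[60000:]))
```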


Fig. 6. (A) Changes in the recognition accuracy and cross-entropy loss during the training process and (B) the final confusion matrix for recognizing 10,000 images in the test set.


With the OSC in our lab, real-time recognition of 600 images is realized, and more images could be recognized in real time by increasing the record length of the OSC. The lag generated in the photonics convolution processor itself, caused by signal propagation in the fiber and cable, is relatively small (measured to be 31.54 µs). Most of the time is spent recording the data, transferring the data from the OSC to the computer via the LAN interface, and processing the data to perform the target recognition. In our experiment with 600 images, these three steps take 185.375 ms, 6.11 s and 2.80 s, respectively (866 Mbps network bandwidth, Intel i5-9400 CPU). Thus, recognizing one image from input to result takes approximately 15.16 ms, and the main limitation is the low data transfer and processing speed in the electrical domain.
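The per-image latency quoted above follows directly from these measured times (a quick check, ours):

```python
record = 185.375e-3          # s, recording the waveforms on the OSC
transfer = 6.11              # s, LAN transfer to the computer (866 Mbps link)
process = 2.80               # s, digital peak extraction and classification
fiber_and_cable = 31.54e-6   # s, propagation latency of the photonic processor itself

total = record + transfer + process + fiber_and_cable
print(f"total: {total:.3f} s, per image: {total / 600 * 1e3:.2f} ms")  # ~9.095 s, ~15.16 ms
```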

The reduction in accuracy (90.04% in the experiment versus 96.56% in theory) results from the noise of the photonics convolution processor components, such as the MLL, EA, SOA and EDFAs. This noise directly reduces the signal-to-noise ratio, which corresponds to a reduced effective number of bits. Furthermore, errors introduced by the unbalanced gain of the four EDFAs may also reduce the prediction accuracy. In principle, the prediction accuracy can be increased further, close to the theoretical accuracy. For example, by choosing an MLL with high stability and high output power, the fluctuation of the optical power and the noise of the amplifiers become insignificant, thereby improving the signal-to-noise ratio and the recognition accuracy. Selecting EDFAs or other optical amplifiers with matched performance would also improve the accuracy.

4. Conclusion

In this paper, we have proposed and experimentally demonstrated a photonics-enabled spiking timing-dependent CNN based on multidimensional multiplexing, composed of a photonics convolution processor and an electrical fully connected layer, for real-time image classification. Large-scale matrix-matrix operations for two simultaneous kernels are achieved by employing wavelength division multiplexing for weighted addition, time division multiplexing for weight latency, and space division multiplexing for separating the negative and nonnegative kernel weights. The results demonstrate a high-performance implementation of the photonics convolution processor, with a computing speed of 591.12 GOPS. Combined with the electronic fully connected layer, the image classification task is performed on the MNIST database of handwritten digits with an accuracy of 90.04%. Furthermore, the proposed photonics-enabled CNN is highly scalable: the number of kernels can be increased by using a waveshaper with more ports or a comparable optical device, and the parallelization can be increased by employing power division multiplexing; the number of convolution layers can also be expanded by reentering the output into the photonic convolution processor. This enables applications such as goods classification, image recognition, natural language processing, image classification-assisted autonomous driving and other artificial intelligence applications.

Funding

National Key Research and Development Program of China (2018YFB2201802); National Natural Science Foundation of China (61925505, 62075212).

Acknowledgments

The authors thank José Azaña for helpful discussions and excellent suggestions during the revision of this paper.

Disclosures

The authors declare no conflicts of interest.

Data availability

Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request.

References

1. Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature 521(7553), 436–444 (2015). [CrossRef]  

2. C. Szegedy, L. Wei, J. Yangqing, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2015), pp. 1–9.

3. A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems (NIPS) (2012), pp. 1097–1105.

4. J. Tompson, A. Jain, Y. LeCun, and C. Bregler, “Joint training of a convolutional network and a graphical model for human pose estimation,” in Advances in Neural Information Processing Systems (NIPS) (2014), pp. 1799–1807.

5. C. Farabet, C. Couprie, L. Najman, and Y. Lecun, “Learning hierarchical features for scene labeling,” IEEE Trans. Pattern Anal. Mach. Intell. 35(8), 1915–1929 (2013). [CrossRef]  

6. S. Lawrence, C. L. Giles, A. C. Tsoi, and A. D. Back, “Face recognition: a convolutional neural-network approach,” IEEE Trans. Neural Netw. 8(1), 98–113 (1997). [CrossRef]  

7. K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arxiv:1409.1556 (2015).

8. T. N. Sainath, A. Mohamed, B. Kingsbury, and B. Ramabhadran, “Deep convolutional neural networks for LVCSR,” in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2013), pp. 8614–8618.

9. D. A. B. Miller, “Attojoule optoelectronics for low-energy information processing and communications,” J. Lightwave Technol. 35(3), 346–396 (2017). [CrossRef]  

10. R. J. Schwabe, S. Zelinger, T. S. Key, and K. O. Phipps, “Electronic lighting interference,” IEEE Ind. Appl. Mag. 4(4), 43–48 (1998). [CrossRef]  

11. B. Sengupta and M. B. Stemmler, “Power consumption during neuronal computation,” Proc. IEEE 102(5), 738–750 (2014). [CrossRef]  

12. D. Patterson, T. Anderson, N. Cardwell, R. Fromm, K. Keeton, C. Kozyrakis, R. Thomas, and K. Yelick, “Intelligent RAM (IRAM): chips that remember and compute,” in 1997 IEEE International Solids-State Circuits Conference. Digest of Technical Papers (ISSCC) (1997), pp. 224–225.

13. H. S. Stone, “A Logic-in-Memory Computer,” IEEE Trans. Comput. C-19(1), 73–78 (1970). [CrossRef]  

14. J. Huang, C. Li, R. Lu, L. Li, and Z. Cao, “Beyond the 100 Gbaud directly modulated laser for short reach applications,” J. Semicond. 42(4), 041306 (2021). [CrossRef]  

15. Y. Fei, T. Yang, Z. Li, W. Liu, X. Wang, W. Zheng, and F. Yang, “Design of the low-loss waveguide coil for interferometric integrated optic gyroscopes,” J. Semicond. 38(4), 044009 (2017). [CrossRef]  

16. M. Wang, S. Zhang, Z. Liu, X. Zhang, Y. He, Y. Ma, Y. Zhang, Z. Zhang, and Y. Liu, “High-frequency characterization of high-speed modulators and photodetectors in a link with low-speed photonic sampling,” J. Semicond. 42(4), 042303 (2021). [CrossRef]  

17. D. A. B. Miller, M. H. Mozolowski, A. Miller, and S. D. Smith, “Non-linear optical effects in InSb with a c.w. CO laser,” Opt. Commun. 27(1), 133–136 (1978). [CrossRef]  

18. L. Larger, M. C. Soriano, D. Brunner, L. Appeltant, J. M. Gutierrez, L. Pesquera, C. R. Mirasso, and I. Fischer, “Photonic information processing beyond Turing: an optoelectronic implementation of reservoir computing,” Opt. Express 20(3), 3241–3249 (2012). [CrossRef]  

19. D. R. Solli and B. Jalali, “Analog optical computing,” Nat. Photonics 9(11), 704–706 (2015). [CrossRef]  

20. X. Lin, Y. Rivenson, N. T. Yardimci, M. Veli, Y. Luo, M. Jarrahi, and A. Ozcan, “All-optical machine learning using diffractive deep neural networks,” Science 361(6406), 1004–1008 (2018). [CrossRef]  

21. Y. Shen, N. C. Harris, S. Skirlo, M. Prabhu, T. Baehr-Jones, M. Hochberg, X. Sun, S. Zhao, H. Larochelle, D. Englund, and M. Soljačić, “Deep learning with coherent nanophotonic circuits,” Nat. Photonics 11(7), 441–446 (2017). [CrossRef]  

22. J. Feldmann, N. Youngblood, C. D. Wright, H. Bhaskaran, and W. H. P. Pernice, “All-optical spiking neurosynaptic networks with self-learning capabilities,” Nature 569(7755), 208–214 (2019). [CrossRef]  

23. X. Xu, M. Tan, B. Corcoran, J. Wu, A. Boes, T. G. Nguyen, S. T. Chu, B. E. Little, D. G. Hicks, R. Morandotti, A. Mitchell, and D. J. Moss, “11 TOPS photonic convolutional accelerator for optical neural networks,” Nature 589(7840), 44–51 (2021). [CrossRef]  

24. D. Pierangeli, V. Palmieri, G. Marcucci, C. Moriconi, G. Perini, M. De Spirito, M. Papi, and C. Conti, “Deep optical neural network by living tumour brain cells,” arxiv:1812.09311 (2018).

25. Y. Luo, D. Mengu, N. T. Yardimci, Y. Rivenson, M. Veli, M. Jarrahi, and A. Ozcan, “Design of task-specific optical systems using broadband diffractive neural networks,” Light Sci Appl 8(1), 1–14 (2019). [CrossRef]  

26. Z. Lin, S. Sun, J. Azana, W. Li, and M. Li, “High-speed serial deep learning through temporal optical neurons,” Opt. Express 29(13), 19392–19402 (2021). [CrossRef]  

27. J. Chang, V. Sitzmann, X. Dun, W. Heidrich, and G. Wetzstein, “Hybrid optical-electronic convolutional neural networks with optimized diffractive optics for image classification,” Sci. Rep. 8(1), 12324–10 (2018). [CrossRef]  

28. Y. R. Qu, H. Z. Zhu, Y. C. Shen, J. Zhang, C. N. Tao, P. T. Ghosh, and M. Qiu, “Inverse design of an integrated-nanophotonics optical neural network,” Sci. Bull. 65(14), 1177–1183 (2020). [CrossRef]  

29. E. Khoram, A. Chen, D. Liu, L. Ying, Q. Wang, M. Yuan, and Z. Yu, “Nanophotonic media for artificial neural inference,” Photonics Res. 7(8), 823–827 (2019). [CrossRef]  

30. J. Feldmann, N. Youngblood, M. Karpov, H. Gehring, X. Li, M. Stappers, M. Le Gallo, X. Fu, A. Lukashchuk, A. S. Raja, J. Liu, C. D. Wright, A. Sebastian, T. J. Kippenberg, W. H. P. Pernice, and H. Bhaskaran, “Parallel convolutional processing using an integrated photonic tensor core,” Nature 589(7840), 52–58 (2021). [CrossRef]  

31. C. Qian, X. Lin, X. Lin, J. Xu, Y. Sun, E. Li, B. Zhang, and H. Chen, “Performing optical logic operations by a diffractive neural network,” Light Sci Appl 9(1), 59 (2020). [CrossRef]  

32. M. Li, Z. Lin, and X. Meng, “Temporal optical neurons for serial deep learning,” in 2021 IEEE Photonics Society Summer Topicals Meeting Series (SUM) (2021), pp. 1–2.




