Training of photonic neural networks through in situ backpropagation and gradient measurement

Tyler W. Hughes; Momchil Minkov; Yu Shi; Shanhui Fan

doi:10.1364/OPTICA.5.000864

1. INTRODUCTION

Artificial neural networks (ANNs), and machine learning in general, are becoming ubiquitous for an impressively large number of applications [1]. This has brought ANNs into the focus of research in not only computer science, but also electrical engineering, with hardware specifically suited to perform neural network operations actively being developed. There are significant efforts in constructing ANN architectures using various electronic solid-state platforms [2,3], but ever since the conception of ANNs, a hardware implementation using optical signals has also been considered [4,5]. In this domain, some of the recent work has been devoted to photonic spike processing [6,7] and photonic reservoir computing [8,9], as well as to devising universal, chip-integrated photonic platforms that can implement any arbitrary ANN [10,11]. Photonic implementations benefit from the fact that, due to the non-interacting nature of photons, linear operations—such as the repeated matrix multiplications found in every neural network algorithm—can be performed in parallel, and at a lower energy cost, when using light as opposed to electrons.

A key requirement for the utility of any ANN platform is the ability to train the network using algorithms such as error backpropagation [12]. Such training typically demands significant computational time and resources, and it is generally desirable for error backpropagation to be implemented on the same platform. This is indeed possible for the technologies of Refs. [2,13,14] and has also been demonstrated, e.g., in memristive devices [3,15]. In optics, as early as three decades ago, an adaptive platform that could approximately implement the backpropagation algorithm experimentally was proposed [16,17]. However, this algorithm requires a number of complex optical operations that are difficult to implement, particularly in integrated optics platforms. Thus, the current implementation of a photonic neural network using integrated optics has been trained using a model of the system simulated on a regular computer [11]. This is inefficient for two reasons. First, this strategy depends entirely on the accuracy of the model representation of the physical system. Second, unless one is interested in deploying a large number of identical, fixed copies of the ANN, any advantage in speed or energy associated with using the photonic circuit is lost if the training must be done on a regular computer. Alternatively, training using a brute force, in situ computation of the gradient of the objective function has been proposed [11]. However, this strategy involves sequentially perturbing each individual parameter of the circuit, which is highly inefficient for large systems.

In this work, we propose a procedure to compute the gradient of the cost function of a photonic ANN by use of only in situ intensity measurements. Our procedure works by physically implementing the adjoint variable method (AVM), a technique that has typically been implemented computationally in the optimization and inverse design of photonic structures [18–20]. Furthermore, the method scales in constant time with respect to the number of parameters, which allows for backpropagation to be efficiently implemented in a hybrid opto-electronic network. Although we focus our discussion on a particular hardware implementation of a photonic ANN, our conclusions are derived starting from Maxwell’s equations, and may therefore be extended to other photonic platforms.

The paper is organized as follows: in Section 2, we introduce the working principles of a photonic ANN based on the hardware platform introduced in Ref. [11]. We also derive the mathematics of the forward and backward propagation steps and show that the gradient computation needed for training can be expressed as a modal overlap. Then, in Section 3, we discuss how the adjoint method may be used to describe the gradient of the ANN cost function in terms of physical parameters. In Section 4, we describe our procedure for determining this gradient information experimentally using in situ intensity measurements. We give a numerical validation of these findings in Section 5 and demonstrate our method by training a model of a photonic ANN in Section 6. We provide final comments and conclude in Section 7.

2. PHOTONIC NEURAL NETWORK

In this section, we introduce the operation and gradient computation of a feed-forward photonic ANN. In its most general case, a feed-forward ANN maps an input vector to an output vector via an alternating sequence of linear operations and element-wise nonlinear functions of the vectors, also called “activations.” A cost function, $L$ , is defined over the outputs of the ANN, and the matrix elements involved in the linear operations are tuned to minimize $L$ over a number of training examples via gradient-based optimization. The “backpropagation algorithm” is typically used to compute these gradients analytically by sequentially utilizing the chain rule from the output layer backwards to the input layer.

Here, we will outline these steps mathematically for a single training example, with the procedure diagrammed in Fig. 1(a). We focus our discussion on the photonic hardware platform presented in Ref. [11], which performs the linear operations using optical interference units (OIUs). The OIU is a mesh of controllable Mach–Zehnder interferometers (MZIs) integrated in a silicon photonic circuit. By tuning the phase shifters integrated in the MZIs, any unitary $N \times N$ operation on the input can be implemented [21,22], which finds applications in both classical and quantum photonics [23,24]. In the photonic ANN implementation from Ref. [11], an OIU is used for each linear matrix-vector multiplication, whereas the nonlinear activations are performed using an electronic circuit, which involves measuring the optical state before activation, performing the nonlinear activation function on an electronic circuit such as a digital computer, and preparing the resulting optical state to be injected to the next stage of the ANN.

Fig. 1. (a) Schematic of the ANN architecture demonstrated in Ref. [11]. The boxed regions correspond to OIUs that perform a linear operation represented by the matrix ${\hat{W}}_{l}$ . Integrated phase shifters (blue) are used to control the OIU and train the network. The red regions correspond to nonlinear activations $f_{l} (\cdot)$ . (b) Illustration of operation and gradient computation in an ANN. The top and bottom rows correspond to the forward and backward propagation steps, respectively. Propagation through a square cell corresponds to matrix multiplication. Propagation through a rounded region corresponds to activation. $⊙$ is element-wise vector multiplication.

Download Full Size | PDF

We first introduce the notation used to describe the OIU, which consists of a number, $N$ , of single-mode waveguide input ports coupled to the same number of single-mode output ports through a linear and lossless device. In principle, the device may also be extended to operate on a different number of inputs and outputs. We further assume directional propagation, such that all power flows exclusively from the input ports to the output ports, which is a typical assumption for the devices of Refs. [11,21,22–26]. In its most general form, the device implements the linear operation

\hat{W} X_{in} = Z_{out},

where

X_{in}

and

Z_{out}

are the modal amplitudes at the input and output ports, respectively, and

\hat{W}

, which we will refer to as the transfer matrix, is the off-diagonal block of the system’s full scattering matrix,

(\begin{matrix} X_{out} \\ Z_{out} \end{matrix}) = (\begin{matrix} 0 & {\hat{W}}^{T} \\ \hat{W} & 0 \end{matrix}) (\begin{matrix} X_{in} \\ Z_{in} \end{matrix}) .

Here, the diagonal blocks are zero because we assume forward-only propagation, while the off-diagonal blocks are the transpose of each other because we assume a reciprocal system.

Z_{in}

and

X_{out}

correspond to the input and output modal amplitudes, respectively, if we were to run this device in reverse, i.e., sending a signal in from the output ports.

Now we may use this notation to describe the forward and backward propagation steps in a photonic ANN. In the forward-propagation step, we start with an initial input to the system, $X_{0}$ and perform a linear operation on this input using an OIU represented by the matrix ${\hat{W}}_{1}$ . This is followed by the application of an element-wise nonlinear activation, $f_{1} (\cdot)$ , on the outputs, giving the input to the next layer. This process repeats for each layer $l$ until the output layer, $L$ . Written compactly, for $l = 1 \dots L$ ,

X_{l} = f_{l} ({\hat{W}}_{l} X_{l - 1}) \equiv f_{l} (Z_{l}) .

Finally, our cost function

L

is an explicit function of the outputs from the last layer,

L = L (X_{L})

. This process is shown in Fig. 1(a).

To train the network, we must minimize this cost function with respect to the linear operators, ${\hat{W}}_{l}$ , which may be adjusted by tuning the integrated phase shifters within the OIUs. While a number of recent papers have clarified how an individual OIU can be tuned by sequential, in situ methods to perform an arbitrary, pre-defined operation [25–28], these strategies do not straightforwardly apply to the training of ANNs, where nonlinearities and several layers of computation are present. In particular, the training of ANN requires gradient information, which is not provided directly in the methods of Refs. [25–28].

In Ref. [11], the training of the ANN was done ex situ on a computer model of the system, which was used to find the optimal weight matrices ${\hat{W}}_{l}$ for a given cost function. Then, the final weights were recreated in the physical device, using an idealized model that relates the matrix elements to the phase shifters. Reference [11] also discusses a possible in situ method for computing the gradient of the ANN cost function through a serial perturbation of every individual phase shifter (“brute force” gradient computation). However, this gradient computation has an unavoidable linear scaling with the number of parameters of the system. The training method that we propose here operates without resorting to an external model of the system, while allowing for the tuning of each parameter to be done in parallel, therefore scaling significantly better with respect to the number of parameters when compared to the brute force gradient computation.

To introduce our training method, we first use the backpropagation algorithm to derive an expression for the gradient of the cost function with respect to the permittivities of the phase shifters in the OIUs. In the following, we denote $ε_{l}$ as the permittivity of a single, arbitrarily chosen phase shifter in layer $l$ , as the same derivation holds for each of the phase shifters present in that layer. Note that ${\hat{W}}_{l}$ has an explicit dependence on $ε_{l}$ , but all field components in the subsequent layers also depend implicitly on $ε_{l}$ .

As a demonstration, we take a mean squared cost function

L = \frac{1}{2} {(X_{L} - T)}^{†} (X_{L} - T),

where

T

is a complex-valued target vector corresponding to the desired output of our system given input

X_{0}

.

Starting from the last layer in the circuit, the derivative of the cost function with respect to the permittivity $ε_{L}$ of one of the phase shifters in the last layer is given by

\frac{d L}{d ε_{L}} = R {{(X_{L} - T)}^{†} \frac{d X_{L}}{d ε_{L}}},

= R {{(Γ_{L} ⊙ f_{L}^{'} (Z_{L}))}^{T} \frac{d {\hat{W}}_{L}}{d ε_{L}} X_{L - 1}},

\equiv R {δ_{L}^{T} \frac{d {\hat{W}}_{L}}{d ε_{L}} X_{L - 1}},

where

⊙

is element-wise vector multiplication, defined such that, for vectors

a

and

b

, the

i

-th element of the vector

a ⊙ b

is given by

a_{i} b_{i}

.

R {\cdot}

gives the real part,

f_{l}^{'} (\cdot)

is the derivative of the

l

th layer activation function with respect to its (complex) argument. We define the vector

δ_{L} \equiv Γ_{L} ⊙ f_{L}^{'}

in terms of the error vector

Γ_{L} \equiv {(X_{L} - T)}^{*}

.

For any layer $l < L$ , we may use the chain rule to perform a recursive calculation of the gradients

Γ_{l} = {\hat{W}}_{l + 1}^{T} δ_{l + 1},

δ_{l} = Γ_{l} ⊙ f_{l}^{'} (Z_{l}),

\frac{d L}{d ε_{l}} = R {δ_{l}^{T} \frac{d {\hat{W}}_{l}}{d ε_{l}} X_{l - 1}} .

Figure 1(b) diagrams this process, which computes the

δ_{l}

vectors sequentially from the output layer to the input layer. A treatment for non-holomorphic activations is derived in Supplement 1.

We note that the computation of $δ_{l}$ requires performing the operation $Γ_{l} = {\hat{W}}_{l + 1}^{T} δ_{l + 1}$ , which corresponds physically to sending $δ_{l + 1}$ into the output end of the OIU in layer $l + 1$ . In this way, our procedure “backpropagates” the vectors $δ_{l}$ and $Γ_{l}$ physically through the entire circuit.

3. GRADIENT COMPUTATION USING THE ADJOINT VARIABLE METHOD

In the previous section, we showed that the crucial step in training the ANN is computing gradient terms of the form $R {δ_{l}^{T} \frac{d {\hat{W}}_{l}}{d ε_{l}} X_{l - 1}}$ , which contains derivatives with respect to the permittivity of the phase shifters in the OIUs. In this section, we show how this gradient may be expressed as the solution to an electromagnetic adjoint problem.

The OIU used to implement the matrix ${\hat{W}}_{l}$ , relating the complex mode amplitudes of input and output ports, can be described using first-principles electrodynamics. This will allow us to compute its gradient with respect to each $ε_{l}$ , as these are the physically adjustable parameters in the system. Assuming a source at frequency $ω$ , at steady state, Maxwell’s equations take the form

[\hat{\nabla} \times \hat{\nabla} \times - k_{0}^{2} {\hat{ε}}_{r}] e = - i ω μ_{0} j,

which can be written more succinctly as

\hat{A} (ε_{r}) e = b .

Here,

{\hat{ε}}_{r}

describes the spatial distribution of the relative permittivity (

ε_{r}

),

k_{0} = ω^{2} / c^{2}

is the free-space wavenumber,

e

is the electric field distribution,

j

is the electric current density, and

\hat{A} = {\hat{A}}^{T}

, due to Lorentz reciprocity. Equation (12) is the starting point of the finite-difference frequency-domain (FDFD) simulation technique [29], where it is discretized on a spatial grid, and the electric field

e

is solved given a particular permittivity distribution,

ε_{r}

, and source,

b

.

To relate this formulation to the transfer matrix $\hat{W}$ , we now define source terms $b_{i}$ , $i \in 1 \dots 2 N$ , which correspond to a source placed in one of the input or output ports. Here, we assume a total of $N$ input and $N$ output waveguides. The spatial distribution of the source term, $b_{i}$ , matches the mode of the $i$ -th single-mode waveguide. Thus, the electric field amplitude in port $i$ is given by $b_{i}^{T} e$ , and we may establish a relationship between $e$ and $X_{in}$ , as

X_{in, i} = b_{i}^{T} e,

for

i = 1 \dots N

over the input port indices, where

X_{in, i}

is the

i

-th component of

X_{in}

, or, more compactly,

X_{in} \equiv {\hat{P}}_{in} e .

Similarly, we can define

Z_{out, i} = b_{i + N}^{T} e,

for

i + N = (N + 1) \dots 2 N

over the output port indices, or,

Z_{out} \equiv {\hat{P}}_{out} e,

and, with this notation, Eq. (1) becomes

\hat{W} {\hat{P}}_{in} e = {\hat{P}}_{out} e .

We now use the above definitions to evaluate the cost function gradient in Eq. (10). In particular, with Eqs. (10) and (17), we arrive at

\frac{d L}{d ε_{l}} = - R {δ_{l}^{T} {\hat{P}}_{out} {\hat{A}}^{- 1} \frac{d \hat{A}}{d ε_{l}} {\hat{A}}^{- 1} b_{x, l - 1}} .

Here,

b_{x, l - 1}

is the modal source profile that creates the input field amplitudes

X_{l - 1}

at the input ports.

The key insight of the AVM is that we may interpret this expression as an operation involving the field solutions of two electromagnetic simulations, which we refer to as the “original” (og) and the “adjoint” (aj):

\hat{A} e_{og} = b_{x, l - 1},

\hat{A} e_{aj} = {\hat{P}}_{out}^{T} δ,

where we have made use of the symmetric property of

\hat{A}

.

Equation (18) can now be expressed in a compact form as

\frac{d L}{d ε_{l}} = - R {e_{aj}^{T} \frac{d \hat{A}}{d ε_{l}} e_{og}} .

If we assume that this phase shifter spans a set of points, $r_{ϕ}$ in our system, then, from Eq. (11), we obtain

\frac{d \hat{A}}{d ε_{l}} = - k_{0}^{2} \sum_{r' \in r_{ϕ}} {\hat{δ}}_{r, r^{'}},

where

{\hat{δ}}_{r, r^{'}}

is the Kronecker delta.

Inserting this into Eq. (21), we thus find that the gradient is given by the overlap of the two fields over the phase-shifter positions:

\frac{d L}{d ε_{l}} = k_{0}^{2} R {\sum_{r \in r_{ϕ}} e_{aj} (r) e_{og} (r)} .

This result now allows for the computation in parallel of the gradient of the loss function with respect to all phase shifters in the system, given knowledge of the original and adjoint fields.

4. EXPERIMENTAL MEASUREMENT OF GRADIENT

We now propose a method to compute the gradient from the previous section through in situ intensity measurements. This represents the most significant result of this paper. Specifically, we wish to generate an intensity pattern with the form $R {e_{og} e_{aj}}$ , matching that of Eq. (23). We note that interfering $e_{og}$ and $e_{aj}^{*}$ directly in the system results in the intensity pattern:

I = {| e_{og} |}^{2} + {| e_{aj} |}^{2} + 2 R {e_{og} e_{aj}},

the last term of which matches Eq. (23). Thus, the gradient can be computed purely through intensity measurements if the field

e_{aj}^{*}

can be generated in the OIU.

The adjoint field for our problem, $e_{aj}$ , as defined in Eq. (20), is sourced by ${\hat{P}}_{out}^{T} δ_{l}$ , meaning that it physically corresponds to a mode sent into the system from the output ports. As complex conjugation in the frequency domain corresponds to time reversal of the fields, we expect $e_{aj}^{*}$ to be sent in from the input ports. Formally, to generate $e_{aj}^{*}$ , we wish to find a set of input source amplitudes, $X_{TR}$ , such that the output port source amplitudes, $Z_{TR} = \hat{W} X_{TR}$ , are equal to the complex conjugate of the adjoint amplitudes, or $δ_{l}^{*}$ . Using the unitarity property of transfer matrix ${\hat{W}}_{l}$ for a lossless system, along with the fact that ${\hat{P}}_{out} {\hat{P}}_{out}^{T} = \hat{I}$ for output modes, the input mode amplitudes for the time-reversed adjoint can be computed as

X_{TR}^{*} = {\hat{W}}_{l}^{T} δ_{l} .

As discussed earlier,

{\hat{W}}_{l}^{T}

is the transfer matrix from output ports to input ports. Thus, we can experimentally determine

X_{TR}

by sending

δ_{l}

into the device output ports, measuring the output at the input ports, and taking the complex conjugate of the result.

We now summarize the procedure for experimentally measuring the gradient of an OIU layer in the ANN with respect to the permittivities of this layer’s integrated phase shifters:

1. Send in the original field amplitudes $X_{l - 1}$ and measure and store the intensities at each phase shifter.
2. Send $δ_{l}$ into the output ports and measure and store the intensities at each phase shifter.
3. Compute the time-reversed adjoint input field amplitudes as in Eq. (25).
4. Interfere the original and the time-reversed adjoint fields in the device, measuring again the resulting intensities at each phase shifter.
5. Subtract the constant intensity terms from steps 1 and 2 and multiply by $k_{0}^{2}$ to recover the gradient as in Eq. (23).

This procedure is also illustrated in Fig. 2.

Fig. 2. Schematic illustration of our proposed method for experimental measurement of gradient information. The box region represents the OIU. The colored ovals represent tunable phase shifters, and we illustrate computing the gradient with respect to the red and the yellow phase shifters, labeled 1 and 2, respectively. (a) We send the original set of amplitudes $X_{l - 1}$ and measure the constant intensity terms at each phase shifter. (b) We send the adjoint mode amplitudes, given by $δ_{l}$ , through the output side of our device, recording $X_{TR}^{*}$ from the opposite side, as well as ${| e_{aj} |}^{2}$ in each phase shifter. (c) We send in $X_{l - 1} + X_{TR}$ , interfering $e_{og}$ and $e_{aj}^{*}$ inside the device and recovering the gradient information for all phase shifters simultaneously.

Download Full Size | PDF

5. NUMERICAL GRADIENT DEMONSTRATION

We numerically demonstrate this procedure in Fig. 3 with a series of FDFD simulations of an OIU implementing a $3 \times 3$ unitary matrix [21]. These simulations are intended to represent the gradient computation corresponding to one OIU in a single layer, $l$ , of a neural network with input $X_{l - 1}$ and delta vector $δ_{l}$ . In these simulations, we use absorbing boundary conditions on the outer edges of the system to eliminate back-reflections. The relative permittivity distribution is shown in Fig. 3(a) with the positions of the variable phase shifters in blue. For demonstration, we simulate a specific case where $X_{l - 1} = {[0 0 1]}^{T}$ , with unit amplitude in the bottom port, and we choose $δ_{l} = {[0 1 0]}^{T}$ . In Fig. 3(b), we display the real part of $e_{og}$ , corresponding to the original, forward field.

Fig. 3. Numerical demonstration of the time-reversal procedure of Section 4. (a) Relative permittivity distribution for three MZIs arranged to perform a $3 \times 3$ linear operation. Blue boxes represent where phase shifters would be placed in this system. As an example, we compute the gradient information for a layer with $X_{l - 1} = {[0 0 1]}^{T}$ and $δ_{l} = {[0 1 0]}^{T}$ , corresponding to the bottom left and middle right ports, respectively. (b) Real part of the simulated electric field $E_{z}$ corresponding to injection from the bottom left port. (c) Real part of the adjoint $E_{z}$ , corresponding to injection from the middle right port. (d) Time-reversed adjoint field as constructed by our method, fed in through all three ports on the left. (e) Gradient information $\frac{d L}{d ε_{l}} (x, y)$ as obtained directly by the adjoint method, normalized by its maximum absolute value. (f) Gradient information as obtained by the method introduced in this work, normalized by its maximum absolute value. Namely, the field pattern from (b) is interfered with the time-reversed adjoint field of (d), and the constant intensity terms are subtracted from the resulting intensity pattern. Panels (e) and (f) match with high precision.

Download Full Size | PDF

The real part of the adjoint field, $e_{aj}$ , corresponding to the cost function $L = R {δ_{l}^{T} {\hat{W}}_{l} X_{l - 1}}$ is shown in Fig. 3(c). In Fig. 3(d), we show the real part of the time-reversed copy of $e_{aj}$ as computed by the method described in the previous section, in which $X_{TR}^{*}$ is sent in through the input ports. There is excellent agreement, up to a constant, between the complex conjugate of the field pattern of Fig. 3(c) and the field pattern of Fig. 3(d), as expected.

In Fig. 3(e), we display the gradient of the objective function with respect to the permittivity of each point of space in the system, as computed with the adjoint method, described in Eq. (23). In Fig. 3(f), we show the same gradient information, but instead computed with the method described in the previous section. Namely, we interfere the field pattern from panel (b) with the field pattern from panel (d), subtract constant intensity terms, and multiply by the appropriate constants. Again, Figs. 3(b) and 3(d) agree with good precision.

We note that in a realistic system, the gradient must be constant for any stretch of waveguide between waveguide couplers because the interfering fields are at the same frequency and are traveling in the same direction. Thus, there should be no distance dependence in the corresponding intensity distribution. This is largely observed in our simulation, although small fluctuations are visible because of the proximity of the waveguides and the sharp bends, which were needed to make the structure compact enough for simulation within a reasonable time. In practice, the importance of this constant intensity is that it can be detected after each phase shifter, instead of inside of it.

Finally, we note that this numerically generated system experiences a total power loss of 41% due to scattering caused by very sharp bends and stair-casing of the structure in the simulation. We also observe approximately 5–10% mode-dependent loss, as determined by measuring the difference in total transmitted power corresponding to injection at different input ports. Minimal amounts of reflection are also visible in the field plots. Nevertheless, the time-reversal interference procedure still reconstructs the adjoint sensitivity with very good fidelity.

6. EXAMPLE OF ANN TRAINING

Finally, we use the techniques from the previous sections to numerically demonstrate the training of a photonic ANN to implement a logical XOR gate, defined by the following input to target ( $X_{0} \to T$ ) pairs:

{[\begin{matrix} 0 & 0 \end{matrix}]}^{T} \to 0, {[\begin{matrix} 0 & 1 \end{matrix}]}^{T} \to 1, {[\begin{matrix} 1 & 0 \end{matrix}]}^{T} \to 1, {[\begin{matrix} 1 & 1 \end{matrix}]}^{T} \to 0 .

This problem was chosen as demonstration of learning a nonlinear mapping from input to output [9] and is simple enough to be solved with a small network with only four training examples.

As diagrammed in Fig. 4(a), we choose a network architecture consisting of two $3 \times 3$ unitary OIUs. On the forward-propagation step, the binary representation of the inputs, $X_{0}$ , is sent into the first two input elements of the ANN and a constant value of 1 is sent into the third input element, which serves to introduce artificial bias terms into the network. These inputs are sent through a $3 \times 3$ unitary OIU and then the element-wise activation $f (z) = z^{2}$ is applied. The output of this step is sent to another $3 \times 3$ OIU and sent through another activation of the same form. Finally, the first output element is taken to be our prediction, $X_{L}$ , ignoring the last two output elements. Our network is repeatedly trained on the four training examples defined in Eq. (26) and using the mean-squared cost function presented in Eq. (4).

Fig. 4. Numerical demonstration of a photonic ANN implementing an XOR gate using the backpropagation algorithm and adjoint method described in this work. (a) Architecture of the ANN. Two layers of $3 \times 3$ OIUs with $z^{2}$ activations. (b) Mean-squared error (MSE) between the predictions and targets as a function of training iterations. (c) Absolute value of the network predictions (blue circles) and targets (black crosses) before training. (d) Absolute value of the network predictions after training, showing that the network has successfully learned the XOR function.

Download Full Size | PDF

For this demonstration, we utilized a matrix model of the system, as described in Refs. [21,22], with mathematical details described in Supplement 1. This model allows us to compute an output of the system given an input mode and the settings of each phase shifter. Although this is not a first-principle electromagnetic simulation of the system, it provides information about the complex fields at specific reference points within the circuit, which enables us to implement training using the backpropagation method as described in Section 2, combined with the adjoint gradient calculation from Section 3. Using these methods, at each iteration of training we compute the gradient of our cost function with respect to the phases of each of the integrated phase shifters, and sum them over the four training examples. Then, we perform a simple steepest-descent update to the phase shifters, in accordance with the gradient information. This is consistent with the standard training protocol for an ANN implemented on a conventional computer. Our network successfully learned the XOR gate in around 400 iterations. The results of the training are shown in Figs. 4(b)–4(d).

We note that this is meant to serve as a simple demonstration of using the in situ backpropagation technique for computing the gradients needed to train photonic ANNs. However, our method may equally be performed on more complicated tasks, which we show in Supplement 1.

7. DISCUSSION AND CONCLUSION

Here, we justify some of the assumptions made in this work. Our strategy for training a photonic ANN relies on the ability to create arbitrary complex inputs. We note that a device for accomplishing this has been proposed and discussed in Ref. [30]. Our recovery technique further requires an integrated intensity detection scheme to occur in parallel and with virtually no loss. This may be implemented by integrated, transparent photodetectors, which have already been demonstrated in similar systems [28]. Furthermore, as discussed, this measurement may occur in the waveguide regions directly after the phase shifters, which eliminates the need for phase shifter and photodetector components at the same location. Finally, in our procedure for experimentally measuring the gradient information, we suggested running isolated forward and adjoint steps, storing the intensities at each phase shifter for each step, and then subtracting this information from the final interference intensity. Alternatively, one may bypass the need to store these constant intensities by introducing a low-frequency modulation on top of one of the two interfering fields in Fig. 2(c), such that the product term of Eq. (24) can be directly measured from the low-frequency signal. A similar technique was used in Ref. [28].

We now discuss some of the limitations of our method. In the derivation, we had assumed the $\hat{W}$ operator to be unitary, which corresponds to a lossless OIU. In fact, we note that our procedure is exact in the limit of a lossless, feed-forward, and reciprocal system. However, with the addition of any amount of uniform loss, $\hat{W}$ is still unitary up to a constant, and our procedure may still be performed with the added step of scaling the measured gradients depending on this loss (see a related discussion in Ref. [30]). Uniform loss conditions are satisfied in the OIUs experimentally demonstrated in Refs. [11,26]. Mode-dependent loss, such as asymmetry in the MZI mesh layout or fabrication errors, should be avoided, as its presence limits the ability to accurately reconstruct the time-reversed adjoint field. Nevertheless, our simulation in Fig. 3 indicates that an accurate gradient can be obtained even in the presence of significant mode-dependent loss. In the experimental structures of Refs. [11,26], the mode-dependent loss is made much lower due to the choice of the MZI mesh. Thus, we expect our protocol to work in practical systems. Our method, in principle, computes gradients in parallel and scales in constant time. In practice, to get this scaling would require careful design of the circuits controlling the OIUs.

Conveniently, since our method does not directly assume any specific model for the linear operations, it may gracefully handle imperfections in the OIUs, such as deviations from perfect 50-50 splits in the MZIs. Last, while we chose to make an explicit distinction between the input ports and the output ports, i.e., we assume no backscattering in the system, this requirement is not strictly necessary. Our formalism can be extended to the full scattering matrix. However, this would require special treatment for subtracting the backscattering.

The problem of overfitting is one that must be addressed by “regularization” in any practical realization of a neural network. Photonic ANNs of this class provide a convenient approach to regularization based on “dropout” [31]. In the dropout procedure, certain nodes are probabilistically and temporarily “deleted” from the network during training time, which has the effect of forcing the network to find alternative paths to solve the problem at hand. This has a strong regularization effect and has become popular in conventional ANNs. Dropout may be implemented simply in the photonic ANN by “shutting off” channels in the activation functions during training. Specifically, at each time step and for each layer $l$ and element $i$ , one may set $f_{l} (Z_{i}) = 0$ with some fixed probability.

In conclusion, we have demonstrated a method for performing backpropagation in an ANN based on a photonic circuit. This method works by physically propagating the adjoint field and interfering its time-reversed copy with the original field. The gradient information can then be directly measured out as an in situ intensity measurement. While we chose to demonstrate this procedure in the context of ANNs, it is broadly applicable to any reconfigurable photonic system. One could imagine this setup being used to tune phased arrays [32], optical delivery systems for dielectric laser accelerators [33], or other systems that rely on large meshes of integrated optical phase shifters. Furthermore, it may be applied to sensitivity analysis of photonic devices, enabling spatial sensitivity information to be measured as an intensity in the device.

Our work should enhance the appeal of photonic circuits in deep learning applications, allowing for training to happen directly inside the device in an efficient and scalable manner. Furthermore, this method is broadly applicable to integrated and adaptive optical systems, enabling the possibility for automatic self-configuration and optimization without resorting to brute force gradient computation or model-based methods, which often do not perfectly represent the physical system.

Funding

Gordon and Betty Moore Foundation (GBMF4744); Schweizerischer Nationalfonds zur Förderung der Wissenschaftlichen Forschung (SNF) (P300P2_177721); Air Force Office of Scientific Research (AFOSR) (FA9550-17-1-0002).

See Supplement 1 for supporting content.

REFERENCES

1. Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature 521, 436–444 (2015). [CrossRef]

2. P. A. Merolla, J. V. Arthur, R. Alvarez-Icaza, A. S. Cassidy, J. Sawada, F. Akopyan, B. L. Jackson, N. Imam, C. Guo, Y. Nakamura, B. Brezzo, I. Vo, S. K. Esser, R. Appuswamy, B. Taba, A. Amir, M. D. Flickner, W. P. Risk, R. Manohar, and D. S. Modha, “A million spiking-neuron integrated circuit with a scalable communication network and interface,” Science 345, 668–673 (2014). [CrossRef]

3. M. Prezioso, F. Merrikh-Bayat, B. D. Hoskins, G. C. Adam, K. K. Likharev, and D. B. Strukov, “Training and operation of an integrated neuromorphic network based on metal-oxide memristors,” Nature 521, 61–64 (2015). [CrossRef]

4. Y. S. Abu-Mostafa and D. Pslatis, “Optical neural computers,” Sci. Am. 256, 88–95 (1987). [CrossRef]

5. S. Jutamulia, “Overview of hybrid optical neural networks,” Science 28, 59–72 (1996). [CrossRef]

6. D. Rosenbluth, K. Kravtsov, M. P. Fok, and P. R. Prucnal, “A high performance photonic pulse processing device,” Opt. Express 17, 22767–22772 (2009). [CrossRef]

7. A. N. Tait, M. A. Nahmias, B. J. Shastri, and P. R. Prucnal, “Broadcast and weight: an integrated network for scalable photonic spike processing,” J. Lightwave Technol. 32, 4029–4041 (2014). [CrossRef]

8. D. Brunner, M. C. Soriano, C. R. Mirasso, and I. Fischer, “Parallel photonic information processing at gigabyte per second data rates using transient states,” Nat. Commun. 4, 1364–1367 (2013). [CrossRef]

9. K. Vandoorne, P. Mechet, T. Van Vaerenbergh, M. Fiers, G. Morthier, D. Verstraeten, B. Schrauwen, J. Dambre, and P. Bienstman, “Experimental demonstration of reservoir computing on a silicon photonics chip,” Nat. Commun. 5, 3541 (2014). [CrossRef]

10. J. M. Shainline, S. M. Buckley, R. P. Mirin, and S. W. Nam, “Superconducting optoelectronic circuits for neuromorphic computing,” Phys. Rev. Appl. 7, 1–27 (2017). [CrossRef]

11. Y. Shen, N. C. Harris, S. Skirlo, M. Prabhu, T. Baehr-Jones, M. Hochberg, X. Sun, S. Zhao, H. Larochelle, D. Englund, and M. Soljacic, “Deep learning with coherent nanophotonic circuits,” Nat. Photonics 11, 441–446 (2017). [CrossRef]

12. D. E. Rumelhart, G. E. Hinton, and R. J. Williams, Parallel Distributed Processing (MIT, 1986), Vol. 1.

13. A. Graves, G. Wayne, M. Reynolds, T. Harley, I. Danihelka, A. Grabska-Barwińska, S. G. Colmenarejo, E. Grefenstette, T. Ramalho, J. Agapiou, A. P. Badia, K. M. Hermann, Y. Zwols, G. Ostrovski, A. Cain, H. King, C. Summerfield, P. Blunsom, K. Kavukcuoglu, and D. Hassabis, “Hybrid computing using a neural network with dynamic external memory,” Nature 538, 471–476 (2016). [CrossRef]

14. M. Hermans, M. Burm, T. Van Vaerenbergh, J. Dambre, and P. Bienstman, “Trainable hardware for dynamical computing using error backpropagation through physical media,” Nat. Commun. 6, 6729 (2015). [CrossRef]

15. F. Alibart, E. Zamanidoost, and D. B. Strukov, “Pattern classification by memristive crossbar circuits using ex situ and in situ training,” Nat. Commun. 4, 2072 (2013). [CrossRef]

16. K. Wagner and D. Psaltis, “Multilayer optical learning networks,” Appl. Opt. 26, 5061–5076 (1987). [CrossRef]

17. D. Psaltis, D. Brady, and K. Wagner, “Adaptive optical networks using photorefractive crystals,” Appl. Opt. 27, 1752–1759 (1988). [CrossRef]

18. N. Georgieva, S. Glavic, M. Bakr, and J. Bandler, “Feasible adjoint sensitivity technique for EM design optimization,” IEEE Trans. Microw. Theory Tech. 50, 2751–2758 (2002). [CrossRef]

19. G. Veronis, R. W. Dutton, and S. Fan, “Method for sensitivity analysis of photonic crystal devices,” Opt. Lett. 29, 2288–2290 (2004). [CrossRef]

20. T. Hughes, G. Veronis, K. P. Wootton, R. J. England, and S. Fan, “Method for computationally efficient design of dielectric laser accelerator structures,” Opt. Express 25, 15414–15427 (2017). [CrossRef]

21. M. Reck, A. Zeilinger, H. J. Bernstein, and P. Bertani, “Experimental realization of any discrete unitary operator,” Phys. Rev. Lett. 73, 58–61 (1994). [CrossRef]

22. W. R. Clements, P. C. Humphreys, B. J. Metcalf, W. S. Kolthammer, and I. A. Walsmley, “Optimal design for universal multiport interferometers,” Optica 3, 1460–1465 (2016). [CrossRef]

23. J. Carolan, C. Harrold, C. Sparrow, E. Martín-López, N. J. Russell, J. W. Silverstone, P. J. Shadbolt, N. Matsuda, M. Oguma, M. Itoh, G. D. Marshall, M. G. Thompson, J. C. Matthews, T. Hashimoto, J. L. O’Brien, and A. Laing, “Universal linear optics,” Science 349, 711–716 (2015). [CrossRef]

24. N. C. Harris, G. R. Steinbrecher, M. Prabhu, Y. Lahini, J. Mower, D. Bunandar, C. Chen, F. N. C. Wong, T. Baehr-Jones, M. Hochberg, S. Lloyd, and D. Englund, “Quantum transport simulations in a programmable nanophotonic processor,” Nat. Photonics 11, 447–452 (2017). [CrossRef]

25. D. A. B. Miller, “Self-aligning universal beam coupler,” Opt. Express 21, 6360–6370 (2013). [CrossRef]

26. D. A. B. Miller, “Self-configuring universal linear optical component,” Photon. Res. 1, 1–15 (2013). [CrossRef]

27. D. A. B. Miller, “Perfect optics with imperfect components,” Optica 2, 747–750 (2015). [CrossRef]

28. A. Annoni, E. Guglielmi, M. Carminati, G. Ferrari, M. Sampietro, D. A. Miller, A. Melloni, and F. Morichetti, “Unscrambling light-automatically undoing strong mixing between modes,” Light Sci. Appl. 6, e17110 (2017). [CrossRef]

29. W. Shin and S. Fan, “Choice of the perfectly matched layer boundary condition for frequency-domain maxwell’s equations solvers,” J. Comput. Phys. 231, 3406–3431 (2012). [CrossRef]

30. D. A. B. Miller, “Setting up meshes of interferometers; reversed local light interference method,” Opt. Express 25, 29233–29248 (2017). [CrossRef]

31. N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” J. Mach. Learn. Res. 15, 1929–1958 (2014).

32. J. Sun, E. Timurdogan, A. Yaacobi, E. S. Hosseini, and M. R. Watts, “Large-scale nanophotonic phased array,” Nature 493, 195–199 (2013). [CrossRef]

33. T. W. Hughes, S. Tan, Z. Zhao, N. V. Sapra, Y. J. Lee, K. J. Leedle, H. Deng, Y. Miao, D. S. Black, M. Qi, O. Solgaard, J. S. Harris, J. Vuckovic, R. L. Byer, and S. Fan, “On-chip laser power delivery system for dielectric laser accelerators,” Phys. Rev. Appl. (to be published).

Training of photonic neural networks through in situ backpropagation and gradient measurement

Abstract

1. INTRODUCTION

2. PHOTONIC NEURAL NETWORK

3. GRADIENT COMPUTATION USING THE ADJOINT VARIABLE METHOD

4. EXPERIMENTAL MEASUREMENT OF GRADIENT

5. NUMERICAL GRADIENT DEMONSTRATION

6. EXAMPLE OF ANN TRAINING

7. DISCUSSION AND CONCLUSION

Funding

REFERENCES

Supplementary Material (1)

Cited By

Figures (4)

Equations (26)

Optica