Efficient method for accelerating line searches in adjoint optimization of photonic devices by combining Schur complement domain decomposition and Born series expansions

Nathan Z. Zhao; Salim Boutami; Salim Boutami; Shanhui Fan

doi:10.1364/OE.451718

1. Introduction

Optimization approaches based on gradient descent have been well established for the inverse design and optimization of both linear and nonlinear photonic devices [1–8]. In each step of the optimization, one starts with a candidate structure and computes the gradient in the design parameter space of its objective function. One then generates a new candidate structure by updating all parameters along the direction of the gradient. This process iterates until a structure with sufficiently good performance is found. What enables the efficient implementation of the gradient descent optimization is the development of the adjoint variable method [3,9], which allows one to determine the gradient of the objective function with respect to all design parameters with only two simulations of a given structure.

In each step of the gradient descent optimization, in addition to the information of the gradient, one would also need to provide the distance over which the design parameters need to be varied along the gradient direction. This distance is usually controlled by a hyperparameter known as the learning rate. The optimal learning rate can in principle be determined if one performs an exact line search along the gradient direction. Any line search, however, can be demanding computationally, as it requires simulations of many structures along the gradient direction. Therefore, in practice, the learning rate is often obtained heuristically. The use of an approximate learning rate as opposed to the optimal rate can significantly increase the number of iterations required in the gradient-based optimization in order to reach an optimal structure. For further improvement of the gradient descent optimization, it is certainly of interest to develop algorithms that can efficiently determine the optimal learning rate [10,11]. Moreover, given that in such optimizations many structures need to be evaluated, it would also be of interest to reduce the computational costs associated with simulating individual structures.

Our paper builds upon two recent advances in the development of gradient descent approaches for the optimization of photonic devices. First, in Refs. [10,11] it was shown that the Lippman-Schwinger equation and its corresponding Born series can provide an effective way to determine the optimal learning rate, by enabling an efficient line search along the gradient direction. Second, in Ref. [12] it was shown that a Schur’s complement approach can be efficient for reducing the computational cost associated with evaluating candidate structures. In many photonics optimization problems, one only optimizes a subset of degrees of freedom inside the entire computational domain. The Schur’s complement approach can achieve significant acceleration in the evaluation of the candidate structures by reducing the solution of the problem to only those degrees of freedom being optimized. In this paper, we show that these two advances can be combined together to significantly improve the gradient descent optimization of nanophotonic devices.

The remainder of the paper is organized as follows. In Section 2, we provide a problem set-up and the mathematical formalism. In Section 3, we verify our results numerically using an example system where applying the line search in Ref. [10] alone may not yield any speedup. In Section 4, we analyze the nonzero fill-in of the matrices used in Section 3 to see how our line search algorithm can experience significant acceleration. We conclude in Section 5.

2. Mathematical formalism

2.1 Evaluating the objectivefFunction: solving Maxwell’s equations

In a typical optimization procedure of a nanophotonic device, in order to evaluate the objective function, we simulate a device by solving the Maxwell’s equations, here written in the frequency domain:

(1)$$\bigg(\frac{1}{\mu_0}\nabla \times \nabla \times{-} \omega^2 \epsilon_0 \epsilon_r(\mathbf{r}) \bigg)\mathbf{E(r)} ={-}i\omega \mathbf{J(r)}$$

In Eq. (1), $\mathbf {E(r)}$ denotes the electric field, consisting of three components $E_x(\mathbf {r})$, $E_y(\mathbf {r})$, $E_z(\mathbf {r})$, $\mathbf {J(r)}$ is the source, $\epsilon _r(\mathbf {r})$ is the relative dielectric permittivity, $\epsilon _0$ is the permittivity of free space, and the magnetic permeability is assumed to be the permeability of free space $\mu _0$ in the entire computational domain. In Eq. (1), $\mathbf {E(r)}$ and $\mathbf {J(r)}$ are complex functions of the spatial coordinate $\mathbf {r}$.

By discretizing Eq. (1) via finite differences on a Yee grid [13], we can convert this into a system of linear equations:

(2)$$\hat A \mathbf{E} ={-}i\omega \mathbf{J}$$

with the solution:

(3)$$\mathbf{E} ={-}i\omega \hat G\mathbf{J}$$

where we now, for simplicity, suppress the spatial coordinate vector $\mathbf {r}$. $\hat A$ is a matrix which represents the discretized approximation of the left hand side of Eq. (1). ${\hat G}$ is the Green’s function corresponding to $\hat A$. The $\mathbf {E}$ in Eq. (2) is now a vector over the field components and over each spatial grid point. Likewise, $\mathbf {J}$ is also a vector over all spatial grid points containing information about the sources on this discretized domain.

To illustrate the computational technique developed, in this paper we will consider a design problem shown in Fig. 2. The device includes one input and two output waveguides. All these waveguides are single-moded. These waveguides are connected by a design region. The objective of the design to achieve high-efficiency, equal splitting of the input to the two output waveguides. In the simulation, the device is surrounded by a region consisting of perfectly matched layers (PML) [14,15]. The modes are excited by a source in the waveguide outside the design region, and the fluxes are computed along lines perpendicular to the output waveguides to determine the transmitted flux. For this problem, the matrix $\hat A$ describes the entire structure including the PML. $\mathbf {J}$ describes the excitation source in the input waveguide. $\mathbf {E}$ is the electric field distribution inside the computational domain.

2.2 Gradient-based optimization

To optimize an optical device, we typically would like to adjust the dielectric permittivity distribution $\epsilon _r(\mathbf {r})$ to improve an objective function. In our example of Fig. 1, the objective function $L$ is a function of the two fluxes in the two output waveguides, and we will adjust the dielectric function within the design region. For this purpose we will need to compute the gradient $\partial L/\partial \epsilon _r(\mathbf {r})$. In the finite-difference setting as described above $\epsilon _r(\mathbf {r})$ is measured on a set of cells which can be integer indexed as $(p,q,s)$ in three dimensions. As a result, $\epsilon _r(\mathbf {r}) \rightarrow \epsilon _{r}(p,q,s)$. Here for notation simplicity we suppress all such spatial indices $(p,q,s)$. To proceed we note that:

(4)$$\frac{\partial L}{\partial \epsilon_r} = \frac{\partial L}{\partial \mathbf{E}}\frac{\partial \mathbf{E}}{\partial \epsilon_r}$$

The gradient $\frac {\partial L}{\mathbf {\partial E}}$ can be determined analytically since the objective function $\hat L$ is typically an analytic function of $\mathbf {E}$. To determine the second term $\partial \mathbf {E}/ \partial \epsilon _r$, the derivative of a vector with respect to a scalar value, we take derivatives of Eq. (2) with respect to $\epsilon _r$ to obtain:

(5)$$\frac{\partial \mathbf{E}}{\partial \epsilon_r} ={-}\hat A^{{-}1} \frac{\partial \hat A}{\partial \epsilon_r} \mathbf{E}$$

The left-hand side of Eq. (5) is a vector assuming $\epsilon _r$ is isotropic or a Jacobian matrix if it is a vector. Likewise, the derivative on the right-hand side of Eq. (5) is a vector for an isotropic dielectric function. With a simple substitution of Eq. (5) into Eq. (4), we get:

(6)$$\frac{\partial L}{\partial \epsilon_r} ={-}\bigg(\frac{\partial L}{\partial \mathbf{E}} \hat A^{{-}1} \bigg) \frac{\partial \hat A}{\partial \epsilon_r} \mathbf{E} ={-} \mathbf{E}_{adj}^T\frac{\partial \hat A}{\partial \epsilon_r}\mathbf{E}$$

The term $\mathbf {E}_{adj}$ is the adjoint field and can be determined by solving:

(7)$$\hat A^T \mathbf{E}_{adj} = \bigg(\frac{\partial L}{\partial \mathbf{E}}\bigg)^T$$

where the term on the right hand side is the adjoint source. The combined solutions of $\mathbf {E}$ and $\mathbf {E}_{adj}$ are enough to compute the gradient of the objective function $\hat L$ with respect to all design variables:

(8)$$\frac{\partial L}{\partial \epsilon_r} ={-}\mathbf{E}_{adj}^T \frac{\partial \hat A}{\partial \epsilon_r} \mathbf{E}$$

Moreover, in many numerical techniques used to solve Eq. (2) and Eq. (7), such as through the use of LU decomposition, where a representation of $\hat G=\hat A^{-1}$ is acquired, one can directly reuse such a representation of $\hat G$ to calculate the adjoint field $\mathbf {E}_{adj}$ more efficiently.

Fig. 1. a) The starting structure for a waveguide mode power splitter. The light gray region has a relative permittivity $\epsilon _r = 2.25$, the dark gray region has a permittivity of $\epsilon _r = 6.25$. The light blue region consists of perfectly matched layer. The dotted red lines represent the lines along which we measure the objective function. b) Division of the system into the design region and the background. In the optimization process, we modify only the permittivity inside the design region.

Download Full Size | PDF

With the gradient determined, we can then update or modify all the design variables $\epsilon _r$ in the design region accordingly:

(9)$$\epsilon_r^{(i+1)} = \epsilon_r^{(i)} +\alpha\bigg(\frac{\partial L}{\partial \epsilon_r^{(i)}}\bigg)$$

the superscripts $i$ and $i+1$ denote the step/epoch number in the gradient-based optimization algorithm. $\alpha$ is the learning rate hyperparameter.

2.3 Line search in gradient descent

In the design parameter update step of every gradient descent iteration, we see that the parameter update in Eq. (9) involves a free hyperparameter $\alpha$ called the learning rate. Line searches solve for the optimal learning rate or step size at every iteration of a gradient-based algorithm:

(10)$$\alpha^* = \min_{\alpha}(L(\epsilon_r(\mathbf{r})+\alpha\Delta\epsilon_r(\mathbf{r})))$$

where $\Delta \epsilon _r(\mathbf {r})$ corresponds to any change in the dielectric distribution function in space, such as a gradient update as shown in Eq. (9). The problem in Eq. (10) is solved by evaluating the objective function for different $\alpha$ and taking $\alpha ^*$ to be the one which generates the best improvement in the objective function. To determine $\alpha ^*$ , one must sample the space of possible $\alpha$’s with sufficient density. Moreover, for each $\alpha$, the value of $L(\epsilon _r(\mathbf {r}) + \alpha \Delta \epsilon _r(\mathbf {r}))$ requires the solution or simulation of a new structure. Thus the computational cost of a line search is substantial.

In [10], a methodology was developed which circumvented determining $\hat G$ in Eq. (3) from scratch for each value of $\alpha$. The basis of this method is to express the solution to a line search as a perturbation theory problem. The equation that describes the system with a dielectric distribution $\epsilon _r(\mathbf {r}) + \alpha \Delta \epsilon _r(\mathbf {r})$ is:

(11)$$\bigg[ \bigg(\frac{1}{\mu_0}(\nabla \times \nabla \times )- \omega^2 \epsilon_0 \epsilon_r(\mathbf{r})\bigg) - \omega^2\epsilon_0\big(\alpha \Delta \epsilon_r(\mathbf{r})\big)\bigg] \mathbf{E(r)} ={-}i\omega \mathbf{J(r)}$$

By similarly discretizing this system in the frequency domain, we can solve for the new field as a function of $\alpha$:

(12)$$(\hat A+\alpha \hat V)\mathbf{E}(\alpha) ={-}i\omega \mathbf{J}$$

where $\hat V$ is a diagonal matrix with elements $V_{ij} = k_0^2\Delta \epsilon _r \delta _{ij}$ and $i$ and $j$ are integer indices corresponding to the row and column indices respectively of the matrix. We have again suppressed the spatial coordinate $\mathbf {r}$. $\delta _{ij}$ represents the Kronecker delta function. From Eq. (12), one can derive the Lippman-Schwinger equation:

(13)$$\mathbf{E}(\alpha) = \mathbf{E}(\alpha = 0)+ \alpha(\hat G \hat V)\mathbf{E}(\alpha)$$

where $\hat G = \hat A^{-1}$. Subsequently, Eq. (12) can be solved using a Born series as:

(14)$$\mathbf{E}(\alpha) =\mathbf{E}(0)+ \alpha (\hat G\hat V)\mathbf{E}(0)+ \alpha^2 (\hat G\hat V)^2 \mathbf{E}(0)+\cdots = \sum_{k=0}^{\infty}\alpha^k(\hat G\hat V)^k \mathbf{E}(0)$$

In Eq. (14), we can express the electric field for any $\alpha$ using the Green’s function and the corresponding field solution for $\alpha = 0$. Therefore, if the Green’s function for the current candidate structure is computed, one can use Eq. (14) to efficiently perform a line search.

In general, the convergence of the Born series can be slow or nonexistent, such as for large perturbations with high-index media [16]. On the other hand, it was also noted that the infinite sum in the Eq. (14) can be efficiently estimated for certain cases using the technique of the Shanks transformation [10]. We denote the sum of the first $n$ terms in Eq. (14) as $\mathbf {E}_n(\alpha )$. The Shanks transformation [17–19] of the series $\mathbf {E}_n(\alpha )$ has the form:

(15)$$\mathbf{T}(\mathbf{E}_n(\alpha)) = \frac{\mathbf{E}_{n+1}(\alpha)\mathbf{E}_{n-1}(\alpha) -\mathbf{E}_{n}(\alpha)}{\mathbf{E}_{n+1}(\alpha)-2\mathbf{E}_{n}(\alpha)+\mathbf{E}_{n-1}(\alpha)}$$

where $\mathbf {T}(\mathbf {E}_n)$ is an element-wise transformation of the vector $\mathbf {E}_n$, and all operations in Eq. (15) are performed element-wise. $\mathbf {T}(\mathbf {E}_n)$ can provide a very accurate estimation of the infinite sum in Eq. (14). Therefore, the use of Shanks transformation can allow us to perform a very efficient line search.

2.4 Reducing the cost of a line search via a Schur complement

With the use of the Shank transformation technique, the key computational cost in the line search becomes the evaluation of the Green’s function in each term of Eq. (14) for each candidate structure. To reduce the cost associated with evaluating Eq. (14), we observe that in typical inverse design problems, the design region consists of a limited subset of all degrees of freedom being simulated. This observation can be exploited to reduce the computational cost of determining the Green’s function for each candidate structure [12]. To do this, we partition the degrees of freedom on the grid into two subdomains. One consists of the design region and $\Omega _i$, and the other consists of a background which is the region of the computational domain outside of the design region. (See Fig. 2 for an illustration). Equation (2) can then be written in the form as:

(16)$$\begin{bmatrix} \hat A_O & \hat A_{OB} \\ \hat A_{BO} & \hat A_B \end{bmatrix}\begin{bmatrix} \mathbf{E}_O \\ \mathbf{E}_B \end{bmatrix} ={-}i\omega \begin{bmatrix} \mathbf{J}_O \\ \mathbf{J}_B \end{bmatrix}$$

The submatrices $\hat A_O$ and $\hat A_B$ are now the part of the matrix $\hat A$ restricted to the design region and the background, respectively. The off-diagonal blocks $\hat A_{OB}$ and $\hat A_{BO}$ represent the couplings between grid points of these two regions. This coupling occurs between the design region and the background region. $\mathbf {J}_B$ and $\mathbf {J}_O$ represent the sources in either subdomain.

Fig. 2. The spatial distribution of the dielectric $\epsilon _r$ of an optimized mode splitter after 800 iterations using a) gradient descent with a constant $\alpha = 0.02$ and b) using the optimal learning rate as determined by a line search at each iteration. Corresponding to a) and b) are the $|E_z|$ field profiles of the resulting structures. The perfectly matched layer boundary is omitted from view in a)-d).

Download Full Size | PDF

Using the second row of Eq. (16), we obtain:

(17)$$\hat A_{BO}\mathbf{E}_O + \hat A_B\mathbf{E}_B ={-}i\omega \mathbf{J}_B$$

We can solve for $\mathbf {E}_B$ in Eq. (17) and substitute the solution back into the first row of Eq. (16) to get an equation for $\mathbf {E}_O$ only:

(18)$$\hat S\mathbf{\mathbf{E}_O} ={-}i\omega \mathbf{J}_S$$

where:

(19)$$\hat S \equiv \hat A_O - \hat A_{OB}\hat A_B^{{-}1}\hat A_{BO}$$

(20)$$\mathbf{J}_S \equiv \mathbf{J}_O - \hat A_{OB} \hat A_B^{{-}1}\mathbf{J}_B.$$

Below, we will refer to Eq. (16) as the “original” system and Eq. (19) as its corresponding “reduced” system. Because $\hat S$ will typically be much smaller in dimension than $\hat A$, we can expect solving the linear system with $\hat S$ to be much faster in many situations as discussed in Ref. [12]

We now show that one can perform the line search using the reduced system as opposed to the original system. Since the only modification of the structure occurs in the design region, within each line search we need to solve for

(21)$$(\hat S+\alpha \hat V_S)\mathbf{E}_O(\alpha) ={-}i\omega \mathbf{J}_S$$

where $V_S$ is the same as $V$ but is reduced so the dimension matches that of $\hat S$. The solution of this equation can be written analogous to Eq. (14) as:

(22)$$\mathbf{E}_O(\alpha) =\mathbf{E}_O(0)+ \alpha (\hat G_S\hat V_S)\mathbf{E}_O(0)+ \alpha^2 (\hat G_S\hat V_S)^2 \mathbf{E}_O(0)+\cdots = \sum_{k=0}^{\infty}\alpha^k(\hat G_S\hat V_S)^k \mathbf{E}_O(0)$$

where $\hat G_S = \hat S^{-1}$. Thus, the technique of Shanks transformation can be applied to the Born series as well. Therefore, we have shown that the techniques of Schur’s complement and Shanks transformation can be combined to further reduce the computational cost associated with the line search. In the next section, we will provide a numerical demonstration of such a combination for an optimization problem in photonics.

3. Results

We consider a two-dimensional problem of a mode power splitter introduced in Section 1. In the center is a design region with a size of $2 \times 2$ $\mu$m. We simulate this device for the electric field polarization perpendicular to the 2D plane ($E_z$, $H_x$, $H_y$), at a wavelength of $\lambda = 1.55$ $\mu$m. Further specifications can be found in Table 1.

Table 1. Test system specifications.^a

View Table | View all tables in this article

The objective function for this mode power splitter is related to the overlap integral of the simulated field with the $\textrm{TE}_0$ modes propagating in the two output waveguide arms. For each arm in the left and right direction (normal to $y$-axis), as labelled by $i=l,r$:

(23)$$L_i(\epsilon_r) = \frac{1}{8A_i} \bigg|\int_{\Omega_{i}}{\bigg(E_{z}H_{y,\textrm{TE}_0}^*(x,y)+ E_{z,\textrm{TE}_0}^*(x, y)H_{y}(x,y)\bigg) d\Omega_{i} \bigg|^2}$$

where $H_{y,\textrm{TE}_0}(x,y)$ and $E_{y, \textrm{TE}_0}(x,y)$ are the $H_y$ and $E_z$ fields $\textrm{TE}_0$ mode of waveguide $i$ respectively. An asterisk denotes taking the complex conjugate and $E_z$ and $H_y$ correspond to the field obtained by the simulation using the current iteration of the device. $\Omega _i$ denotes an integration line that is perpendicular to waveguide $i$. $A_i$ is a normalization constant as defined by:

(24)$$A_i = \int_{\Omega_i}\textrm{Re}\bigg(E_{z, TE_0}(x,y)H_{y,TE_0}^*(x,y)\bigg) d\Omega_{i}$$

Using the $L_i$’s, we define an objective function $L(\epsilon _r)$ as:

(25)$$L(\epsilon_r) = 4L_l(\epsilon_r)L_r(\epsilon_r)$$

$L(\epsilon _r)$ reaches a maximum value of $1$ when the power is evenly split between the two arms with zero loss. The derivative of $L_i$ with respect to the dielectric permittivity $\epsilon _r$:

(26)$$\frac{\partial L_i}{\partial \epsilon_r} = 2 k_0^2 \textrm{Re}\bigg((E_z(x,y) {E}_{z, adj}(x,y)) L_i^*(\epsilon_r)\bigg)$$

where ${E}_{z, adj}$ is the $z$-component of the adjoint field as derived in Eq. (7). $k_0 = 2\pi /\lambda$.

In order to account for the fact that $\epsilon _r$ is bounded above and below in typical optimization problems, we modify the gradient:

(27)$$\textrm{proj}\bigg(\frac{\partial L}{\partial \epsilon_r}\bigg) = \frac{\partial L}{\partial \epsilon_r}\begin{cases} \frac{(\epsilon_r - \epsilon_{min})}{\max(|\frac{\partial L}{\partial \epsilon_r}|)}, & \frac{\partial L}{\partial \epsilon_r} \geq 0 \\ \frac{(\epsilon_{max} - \epsilon_{r})}{\max(|\frac{\partial L}{\partial \epsilon_r}|)}, & \frac{\partial L}{\partial \epsilon_r} < 0 \\ \end{cases}$$

In our design, we choose $\epsilon _{min}= 2.25$ and $\epsilon _{max} = 6.25$. Equation (27) amounts to a projection of the gradient based on the lower and upper bounds on the design parameter values [20]. The projection of the gradient allows us to effectively reparametrize the space of the hyperparameter $\alpha$ to be between (0,1]. As a result, the line search in Eq. (10) becomes a constrained optimization problem based on the bound constraints of the design parameters. The $\epsilon _r$ is initialized to a constant value of $4$. For the Born series described in Eq. (14), we use $n=5$ terms. Furthermore, since $\mathbf {T}(\mathbf {E}_n)$ produces another sequence of values, the Shanks transform can be applied multiple times on the same series. For $n=5$ terms, we apply the Shanks transform 3 times. With these sets of parameters, we also restrict the search space of $\alpha$ to $(0,0.6]$ to ensure the accuracy of the line search.

In Fig. 3, we compare the objective function with respect to the number of iterations for different learning rate strategies. The orange line is our baseline, which is a simple gradient descent with a constant fixed $\alpha = 0.02$, chosen so that the objective increases as close to monotonically throughout the optimization process. The green line represents the same simple gradient descent but the learning rate is optimized with a line search using Eq. (15). The maroon line represents the optimal learning rate determined at each iteration. Clearly, the use of the line search accelerates the convergence with respect to the number of iterations. Also, we see that the learning rate can vary significantly throughout the optimization over the possible range of 0 to 0.6. However, Fig. 3 does not compare the amount of time it takes for the optimization to complete.

Fig. 3. Comparison of the performance of a line search based gradient descent versus gradient descent with a constant learning rate over a single run. The green line represents the value of the objective function versus the number of iterations. In each iteration the optimal hyperparameter $\alpha ^*$ is used. The maroon line represents the optimal $\alpha ^*$ determined at each iteration. The orange line is the same as the green line, except a constant $\alpha = 0.02$ is used for each iteration. There is slight dip in the value of $L(\epsilon _r)$ at around iteration 500, indicating the given learning rate is still slightly too large.

Download Full Size | PDF

In Fig. 4, we compare the full time cost taken by our method compared to a variety of baselines. In this way, we can now evaluate the performance of our combined line search and Schur complement method. The green and orange lines correspond to the same setups as in Fig. 3. The red line is an "adaptive" gradient descent method [21,22] whereby we start out with $\alpha =0.6$ but successively decrease it by a geometric factor $\beta$ whenever an update of the parameters causes a decrease in the objective function. Note that there is still a hyperparameter here in choosing the size of the geometric decay factor $\beta$. In comparing such an adaptive method (red line), with the exact line search (green line), we note that while the initial performance of the adaptive method is good, in the later times the adaptive method underperforms as compared to the exact line search. This is because in the adaptive method the learning rate decreases as a function of time, but as we can see in Fig. 3, the optimal learning rate may still be large in later iterations. Moreover, we can see that the learning rate in the adaptive method is still too large at certain iterations as evidenced by the substantial decreases in objective function value between 20-40 seconds. The time taken for the line search based on the Shanks transformation can be further improved by combining it with the Schur complement domain decomposition. This reduces the size of the computational domain that needs to be solved at each iteration, leading to a substantial speed-up as can be seen by comparing the green line with the blue line

Fig. 4. Comparison of the time in seconds taken by our domain decomposition enhanced line search compared to various baselines. The plot is truncated once we reach a target value corresponding to 94% splitting efficiency to indicate how long it takes to reach a particular performance threshold across the four methods. The blue line shows our domain decomposition enhanced line search based gradient descent. The green line shows our gradient descent with the Shanks-based line search but no domain decomposition acceleration. The orange line is our baseline gradient descent using a constant learning rate ($\alpha = 0.02$). The red line represents gradient descent with the adaptive decay learning rate method using $\beta = 0.45$.

Download Full Size | PDF

In Table 2, we provide a more detailed examination of the time cost of various steps required when an exact line search is performed. The original system corresponds to the original system matrix A as described in Eq. (2). The reduced system corresponds to the case where the Schur complement domain decomposition method has been applied to generate a reduced system matrix $\hat S$, as described in Eq. (18). For each iteration step when performing the exact line search, the major components are the "LU decomposition" of the system matrix, the "back substitution" after the LU decomposition has been performed, and the application of the Shanks transformation which determines the optimal $\alpha$ in the "line search". (The quotations here correspond to various entries in Table 2). We note that performing domain decomposition significantly reduces the time cost for every major component in an iteration step.

Table 2. Measured times for different processes (measured in seconds (s)) within a single iteration step when performing the method of exact line search.^a

View Table | View all tables in this article

In comparing the usual gradient descent methods, with the exact line search, we note that while the exact line search reduces the number of iterations required as shown in Fig. 3, the computational cost for each iteration step is higher. Thus, in applying the method of exact line search, there is a particular advantage of using the domain decomposition method. We also note that in applying the domain decomposition method there is an initial overhead in forming the reduced system matrix $\hat S$. In our case, the cost of such overhead is not significant since the number of iteration steps is sufficiently large.

To further understand how the use of domain decomposition speeds up the line search algorithm, we can look at the numerical properties of the matrices that come out of Eq. (2) and Eq. (19). Numerically, these equations are solved through LU decomposition and back-substitution. Because of the fact that the matrices are sparse, the linear systems associated with Eq. (2) and Eq. (18) are solved using sparse direct solvers [23–25]. For our work we developed a code based on UMFPACK [23]. Additionally, we use a perfect electric conductor/perfect magnetic conductor (PEC/PMC) boundary (rather a periodic boundary) condition on the boundary edges that are covered with a perfectly matched layer (PML) [14].

Now, assume we have a factorization $\hat L$, $\hat U$ for $\hat A$ and $\hat L_S$, $\hat U_S$ for $\hat S$. To evaluate the numerical cost of forming Eq. (14) or Eq. (22), as is required for performing the exact line search, we need to determine the cost of back-substitution in the form of $\hat U^{-1}(\hat L^{-1}b)$ where $b$ is a vector. Thus, the time to determine the Born series in Eqs. (14) and (22) is proportional to the non-zeros in the LU factors of $\hat A$ and $\hat S$. In general, we expect $\hat S$ to have less fill-in than $\hat A$ for two reasons. One, $\hat S$ is smaller in dimension than $\hat A$, and two, $\hat S$ is block sparse with fewer non-zeros than $\hat A$, which further reduces the degree of fill-in during the factorization process. For the system studied in this paper, the reduced system itself actually contains a fewer number of nonzero elements than the original system as illustrated in Table 3. Thus, for the reduced system, the LU factors of the system matrix have far less non-zero elements as compared with that of that the original system. It is for this reason that the time cost of each iteration step of exact line search is far lower in the reduced system.

Table 3. Number of nonzeros in the system matrix and LU factors of the original and reduced systems for the system described in Table 3.

View Table | View all tables in this article

4. Conclusion

In this paper, we have demonstrated that the method of exact line search, as enabled by applying the Shanks transformation on a Born series resulting from the Lippman Schwinger formalism, can be further speed up with the use of a Schur complement domain decomposition method. In our examples we did not try to generate a binary structure. An optimal structure that is binary can be generated using an implicit density parameterization approach [26] or the level set method [27]. Our method here can be combined with either of these approaches. For future works, we note that our technique can be further enhanced by minimizing the evaluations needed to find the optimal learning rate, such as by using a back-tracking line search [28]. Additionally, this technique has significant implications for Quasi-Newton methods such as limited memory Broyden–Fletcher–Goldfarb–Shanno algorithm (L-BFGS) [29], where line searches take up the majority of the time in each iteration. Because Quasi-Newton methods require a line search to be done every iteration, our technique can effectively render every iteration of a L-BFGS the same cost as normal first order methods.

Funding

Air Force Office of Scientific Research (FA9550-17-1-0002, FA9550-21-1-0244).

Disclosures

The authors declare no conflicts of interest.

Data availability

Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request.

References

1. A. Y. Piggott, J. Lu, K. G. Lagoudakis, J. Petykiewicz, T. M. Babinec, and J. Vucković, “Inverse design and demonstration of a compact and broadband on-chip wavelength demultiplexer,” Nat. Photonics 9(6), 374–377 (2015). [CrossRef]

2. V. Liu, Y. Jiao, D. A. B. Miller, and S. Fan, “Design methodology for compact photonic-crystal-based wavelength division multiplexers,” Opt. Lett. 36(4), 591 (2011). [CrossRef]

3. C. M. Lalau-Keraly, S. Bhargava, O. D. Miller, and E. Yablonovitch, “Adjoint shape optimization applied to electromagnetic design,” Opt. Express 21(18), 21693 (2013). [CrossRef]

4. G. Veronis, R. W. Dutton, and S. Fan, “Method for sensitivity analysis of photonic crystal devices,” Opt. Lett. 29(19), 2288 (2004). [CrossRef]

5. N. Georgieva, S. Glavic, M. Bakr, and J. Bandler, “Feasible adjoint sensitivity technique for EM design optimization,” IEEE Trans. Microwave Theory Tech. 50(12), 2751–2758 (2002). [CrossRef]

6. Z. Lin, B. Groever, F. Capasso, A. W. Rodriguez, and M. Lončar, “Topology-optimized multilayered metaoptics,” Phys. Rev. Appl. 9(4), 044030 (2018). [CrossRef]

7. J. Lu and J. Vučković, “Objective-first design of high-efficiency, small-footprint couplers between arbitrary nanophotonic waveguide modes,” Opt. Express 20(7), 7221 (2012). [CrossRef]

8. G. Zhang, D.-X. Xu, Y. Grinberg, and O. Liboiron-Ladouceur, “Topological inverse design of nanophotonic devices with energy constraint,” Opt. Express 29(8), 12681–12695 (2021). [CrossRef]

9. R.-E. Plessix, “A review of the adjoint-state method for computing the gradient of a functional with geophysical applications,” Geophys. J. Int. 167(2), 495–503 (2006). [CrossRef]

10. S. Boutami, N. Zhao, and S. Fan, “Determining the optimal learning rate in gradient-based electromagnetic optimization using the shanks transformation in the lippmann–schwinger formalism,” Opt. Lett. 45(3), 595–598 (2020). [CrossRef]

11. S. Boutami, N. Zhao, and S. Fan, “Large permittivity increments for efficient predictive photonic devices optimization,” Proc. SPIE 11290, 112900Q (2020). [CrossRef]

12. N. Z. Zhao, S. Boutami, and S. Fan, “Accelerating adjoint variable method based photonic optimization with schur complement domain decomposition,” Opt. Express 27(15), 20711–20719 (2019). [CrossRef]

13. K. Yee, “Numerical solution of initial boundary value problems involving maxwell’s equations in isotropic media,” IEEE Trans. Antennas Propag. 14(3), 302–307 (1966). [CrossRef]

14. J.-P. Berenger, “A perfectly matched layer for the absorption of electromagnetic waves,” J. Comput. Phys. 114(2), 185–200 (1994). [CrossRef]

15. W. Shin and S. Fan, “Choice of the perfectly matched layer boundary condition for frequency-domain Maxwell’s equations solvers,” J. Comput. Phys. 231(8), 3406–3431 (2012). [CrossRef]

16. G. Osnabrugge, S. Leedumrongwatthanakun, and I. M. Vellekoop, “A convergent born series for solving the inhomogeneous helmholtz equation in arbitrarily large media,” J. Comput. Phys. 322, 113–124 (2016). [CrossRef]

17. D. Shanks, “Non-linear Transformations of Divergent and Slowly Convergent Sequences,” J. Math. Phys. 34(1-4), 1–42 (1955). [CrossRef]

18. S. Singh and R. Singh, “On the Use of Shanks’s Transform to Accelerate the Summation of Slowly Converging Series,” IEEE Trans. Microwave Theory Tech. 39(3), 608–610 (1991). [CrossRef]

19. N. Guérin, S. Enoch, and G. Tayeb, “Combined method for the computation of the doubly periodic green’s function,” IEEE Trans. Microwave Theory Tech. 15(2), 205–221 (2001). [CrossRef]

20. L. M. G. Drummond and A. N. Iusem, “A Projected Gradient Method for Vector Optimization Problems,” Comput. Optim. Appl. 28(1), 5–29 (2004). [CrossRef]

21. R. Ge, S. M. Kakade, R. Kidambi, and P. Netrapalli, “The step decay schedule: A near optimal, geometrically decaying learning rate procedure for least squares,” arXiv:1904.12838 (2019).

22. K. You, M. Long, J. Wang, and M. I. Jordan, “How does learning rate decay help modern neural networks?” arXiv:1908.01878 (2019).

23. T. A. Davis, “Algorithm 832: UMFPACK V4.3—an unsymmetric-pattern multifrontal method,” ACM Transactions on Math. Softw. 30(2), 196–199 (2004). [CrossRef]

24. T. A. Davis, Direct Methods for Sparse Linear Systems (Society for Industrial and Applied Mathematics, 2006).

25. T. A. Davis, S. Rajamanickam, and W. M. Sid-Lakhdar, “A survey of direct methods for sparse linear systems,” Acta Numerica 25, 383–566 (2016). [CrossRef]

26. T. W. Hughes, I. A. Williamson, M. Minkov, and S. Fan, “Forward-mode differentiation of maxwell’s equations,” ACS Photonics 6(11), 3010–3016 (2019). [CrossRef]

27. S. Osher and R. P. Fedkiw, “Level set methods: An overview and some recent results,” J. Comput. Phys. 169(2), 463–502 (2001). [CrossRef]

28. S. Boyd and L. Vandenberghe, Convex Optimization (Cambridge University Press, 2004).

29. J. Nocedal, “Updating quasi-newton matrices with limited storage,” Math. Comput. 35(151), 773–782 (1980). [CrossRef]

Descriptor	Values
$(N_{x}, N_{y})$	(141,141)
$x_{r a n g e}$	(−3.525 3.525) $μ$ m
$y_{r a n g e}$	(−3.525 3.525) $μ$ m
$d x = d y$	0.05 $μ$ m
design region	(−1.0, 1.0) $μ$ m
PML thickness	1.5 $μ$ m

	Original System ( $\hat{A}$ )	Reduced system ( $\hat{S}$ )
LU factorization	0.11 s	0.01 s
Back-substitution	0.01 s	0.001 s
Line search	0.03 s	0.003 s

	Original System ( $\hat{A}$ )	Reduced System ( $\hat{S}$ )
linear system matrix	96609	33361
L/U factor of the linear system matrix	434960	48610

Descriptor	Values
$(N_{x}, N_{y})$	(141,141)
$x_{r a n g e}$	(−3.525 3.525) $μ$ m
$y_{r a n g e}$	(−3.525 3.525) $μ$ m
$d x = d y$	0.05 $μ$ m
design region	(−1.0, 1.0) $μ$ m
PML thickness	1.5 $μ$ m

	Original System ( $\hat{A}$ )	Reduced system ( $\hat{S}$ )
LU factorization	0.11 s	0.01 s
Back-substitution	0.01 s	0.001 s
Line search	0.03 s	0.003 s

Efficient method for accelerating line searches in adjoint optimization of photonic devices by combining Schur complement domain decomposition and Born series expansions

Abstract

1. Introduction

2. Mathematical formalism

2.1 Evaluating the objectivefFunction: solving Maxwell’s equations

2.2 Gradient-based optimization

2.3 Line search in gradient descent

2.4 Reducing the cost of a line search via a Schur complement

3. Results

4. Conclusion

Funding

Disclosures

Data availability

References

Data availability

Cited By

Figures (4)

Tables (3)

Equations (27)

Optics Express