## Abstract

We develop a higher-order method for non-paraxial beam propagation based on the wide-angle split-step spectral (WASSS) method previously reported [Clark and Thomas, Opt. Quantum. Electron., 41, 849 (2010)]. The higher-order WASSS (HOWASSS) method approximates the Helmholtz equation by keeping terms up to third order in the propagation step size in the Magnus expansion. A symmetric exponential operator splitting technique is used to simplify the resulting exponential operators. The HOWASSS method is applied to the problem of waveguide propagation, where an analytical solution is known, to demonstrate the performance and accuracy of the method. The performance enhancement gained by implementing the HOWASSS method on a graphics processing unit (GPU) is demonstrated. When highly accurate results are required, the HOWASSS method is shown to be substantially faster than the WASSS method.

© 2013 OSA

## 1. Introduction

Beam propagation methods (BPM) are a large class of numerical methods for solving the scalar Helmholtz equation, and are popular for simulating guided waves and laser beams, as they are typically both fast and efficient. Early methods made use of the paraxial approximation, which greatly simplifies the problem by reducing the propagation equation to first order. However, these methods were severely limited in their application: any beam profile containing spatial frequencies with angles greater than a few degrees, with respect to the propagation axis, incurs significant phase errors. Many methods have been developed to drop the paraxial approximation and include wide-angle waves. In several of these, the Helmholtz equation is formally rewritten as a first-order differential equation which includes the square root of an operator. The square root operator is either approximated using real or complex Padé approximants and a finite-difference or iterative method is used to solve the equation [1, 2], or the analytical solution is found, which results in an exponential of the square root of an operator that is approximated with a Padé approximant [3]. Recently, Sharma extended an operator-splitting technique used on the paraxial wave equation to the non-paraxial wave equation (Helmholtz equation) [4]. The splitting allows diffraction and the refractive index variations to be handled separately. Various numerical methods can be used once the operator has been split, such as collocation or finite-difference [4, 5].

In a recent publication, we described a numerical beam propagation method that represents the beam profile in the basis of the eigenvectors of the Laplacian operator and uses a symmetric operator splitting technique to account for the refractive index variations, known as the wide-angle split-step spectral (WASSS) method [6]. In general, the method provided a two-fold speedup over the finite-difference method reported by Sharma [4]. This improvement could be increased by use of a fast Fourier transform algorithm. Here we develop a higher-order WASSS (HOWASSS) method that extends the approximation to higher order, providing a more efficient method when high accuracies are required. We apply the HOWASSS method to the problem of waveguide propagation to demonstrate the performance and accuracy of the method. An additional performance enhancement is obtained by implementing the method on a graphics processing unit (GPU) using compute unified device architecture (CUDA) technology from NVIDIA™.

## 2. Formulation

Beam propagation in a medium with a non-uniform refractive index is described by the scalar Helmholtz equation,

$$\frac{\partial^{2}\psi}{\partial z^{2}} + {\nabla}_{\mathbf{r}}^{2}\psi + k_{0}^{2}\,n^{2}(z,\mathbf{r})\,\psi = 0, \qquad (1)$$

where *ψ*(*z*, **r**) is the complex scalar electric field, *z* is a Cartesian coordinate in the direction of propagation, **r** represents the transverse coordinates, and ${\nabla}_{\mathbf{r}}^{2}$ is the transverse Laplace operator. As usual, *k*_{0} is the free-space wavenumber, *n*(*z*, **r**) is the refractive index distribution, and *n̄*^{2} = min[*n*^{2}(*z*, *x*)]. For simplicity we will only consider the two-dimensional (2D) case; however, the generalization to three dimensions (3D) is straightforward. In two dimensions the electric field is denoted *ψ*(*z*, *x*). Note that *x* is not necessarily a Cartesian coordinate but could, for example, be a radial coordinate. Expanding *ψ*(*z*, *x*) in terms of the eigenfunctions ${\varphi}_{j}\left(x\right)$ of the transverse Laplace operator, we write

$$\psi(z,x) = \sum_{j} A_{j}(z)\,\varphi_{j}(x), \qquad (2)$$

where the ${\varphi}_{j}\left(x\right)$ satisfy the eigenvalue relation and orthonormality condition, Eqs. (3) and (4). Multiplying the 2D form of Eq. (1) by ${\varphi}_{j}^{*}\left(x\right)$, integrating over the transverse coordinate, and applying Eqs. (3) and (4), we are left with a coupled system of second-order ordinary differential equations for the mode amplitudes *A _{j}*(*z*), Eq. (5).
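To make the expansion in Eq. (2) concrete, the sine-basis case used later in Section 3 can be sketched numerically. The grid, domain, and beam profile below are hypothetical placeholders, not the paper's simulation parameters.

```python
import numpy as np

# Hypothetical grid: Nx interior points of (x0, xf); hard boundaries excluded.
Nx, x0, xf = 256, 0.0, 10.0
dx = (xf - x0) / (Nx + 1)
x = x0 + dx * np.arange(1, Nx + 1)

# Sine eigenfunctions of the transverse Laplacian with psi(x0) = psi(xf) = 0,
# sampled on the grid and normalized so the transform matrix S is orthogonal.
j = np.arange(1, Nx + 1)
S = np.sqrt(2.0 / (Nx + 1)) * np.sin(np.pi * np.outer(j, j) / (Nx + 1))

psi = np.exp(-((x - 5.0) ** 2))        # placeholder beam profile
A = S @ psi                            # spectral coefficients, Eq. (2) analogue
psi_back = S.T @ A                     # transform back to the spatial basis
assert np.allclose(psi_back, psi)
```

The matrix built here is the type-I discrete sine transform, which is orthogonal and symmetric, so the forward and inverse transforms are simply `S` and `S.T`.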

To discretize the problem space, let *N _{x}* and *N _{z}* be the number of discrete grid points in *x* and *z*, respectively, and let *i*, *j* ∈ [1, *N _{x}*] and *l* ∈ [1, *N _{z}*]. We let *x _{i}* = *i*Δ*x* + *x*_{0} and *z _{l}* = *l*Δ*z* + *z*_{0}, where Δ*x* and Δ*z* are the corresponding step sizes. The matrix **N**(*z*) is defined in terms of the transformation matrix **S** [6], which transforms a vector into the spectral basis, while **S**^{−1} transforms vectors back into the spatial basis. In the case of Cartesian coordinates with hard boundary conditions this matrix becomes the discrete Fourier transform matrix, and in cylindrical coordinates with azimuthal symmetry and hard boundary conditions, **S** takes the form of a discrete Hankel transform [7]. We must also define the constant diagonal matrix **M**, with elements

$$M_{i,j} = \delta_{i,j}\sqrt{k_{0}^{2}{\bar{n}}^{2} - \lambda_{i}^{2}},$$

where *λ _{i}* is the eigenvalue associated with ${\varphi}_{i}\left(x\right)$. Notice that if *λ _{i}* > *k*_{0}*n̄* then *M _{i,j}* becomes imaginary. If we allow eigenvalues such that *λ _{i}* > *k*_{0}*n̄*, the trigonometric functions appearing in **P** become hyperbolic (see Eq. (25)), making the method numerically unstable. For this reason we limit *N _{x}* so that *λ _{i}* < *k*_{0}*n̄*. However, we must include at least enough eigenfrequencies to represent the initial conditions adequately; otherwise large errors will be incurred. Using these definitions, Eq. (5) becomes Eq. (11).

The exact solution to Eq. (11) is given by a Magnus expansion [8], Eq. (12), where [**H**(*t*_{1}), **H**(*t*_{2})] is the usual commutator. Note that Eq. (11) is essentially the time-dependent Schrödinger equation, in which context the Magnus expansion is better known as the time-ordered exponential operator [9]. However, we will refer to Eq. (12) as a Magnus expansion, to be consistent with discussions of initial value problems of the form of Eq. (11).
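The role of the commutator is easy to see numerically: for a family of matrices that commutes at different arguments, the time-ordered exponential collapses to the exponential of the first Magnus term alone. A toy check with a diagonal (hence commuting) family, standing in for the actual **H**(*z*):

```python
import numpy as np

def mat_exp(m, terms=40):
    # Power-series evaluation of the matrix exponential (small-norm input).
    out, term = np.eye(len(m)), np.eye(len(m))
    for k in range(1, terms):
        term = term @ m / k
        out = out + term
    return out

# H(t) = f(t) * D with D constant: H(t1) and H(t2) commute for all t1, t2.
D = np.diag([0.3, -0.2, 0.1])
f = np.sin

# Brute-force ordered product of many small steps (time-ordered exponential).
n, T = 2000, 1.0
dt = T / n
U = np.eye(3)
for k in range(n):
    U = mat_exp(f((k + 0.5) * dt) * D * dt) @ U

# First Magnus term alone: exp( D * integral of f over [0, T] ).
U_magnus = mat_exp((1.0 - np.cos(T)) * D)
assert np.allclose(U, U_magnus, atol=1e-6)
```

When the commutator does not vanish, the second Magnus term (the double integral of the commutator) supplies the leading correction, which is exactly the term the HOWASSS method retains.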

**H**(*z*) can be written as the sum of two matrices, one constant and one *z*-dependent: **H**(*z*) = **H**_{1} + **H**_{2}(*z*). To evaluate the second-order Magnus term over the step from *z _{l}* to *z*_{l+1} we first need to compute the commutator. After making use of Eq. (15) and simplifying, we find [**H**_{2}(*z*_{1}), **H**_{2}(*z*_{2})] = 0. We apply the trapezoid method twice to approximate the double integral in Eq. (14). The next term in the Magnus expansion involves a triple integral and is of order (Δ*z*)^{3}. Keeping terms up to (Δ*z*)^{3}, we use a symmetric operator splitting to separate the Δ*z* terms in the exponential from the (Δ*z*)^{2} terms; we then split the terms involving **H**_{1} from those involving **H**_{2}, and are left with a product of exponential operators.

Each of **P**, **Q**(*z*), and **C**(*z*_{1}, *z*_{2}) can be expressed as a 2 × 2 block matrix in which the blocks are diagonal. This allows the exponential operators to be evaluated simply by expanding the exponential into its power series. **P** is the operator corresponding to propagation of the pulse through a homogeneous medium with an index of refraction of *n̄*. **Q**(*z*) takes into account the difference *n*(*z*, *x*) − *n̄*, while **C**(*z*_{1}, *z*_{2}) depends on the change in the refractive index over the step taken. In this sense, **Q**(*z*) acts like the constant term of a Taylor series expansion, and **C**(*z*_{1}, *z*_{2}) acts in a manner similar to the first-derivative term of such a series. We can save some additional matrix-vector multiplications by combining **Q**(*z*_{l+1})**Q**(*z _{l}*) into a single operator.
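The claim that these block operators are easy to exponentiate can be verified directly: with all four blocks diagonal, a 2*N _{x}* × 2*N _{x}* exponential factors into *N _{x}* independent 2 × 2 exponentials. A sketch with arbitrary stand-in blocks, not the actual **P**, **Q**, or **C**:

```python
import numpy as np

def mat_exp(m, terms=60):
    # Power-series evaluation of the matrix exponential (small-norm input).
    out, term = np.eye(len(m)), np.eye(len(m))
    for k in range(1, terms):
        term = term @ m / k
        out = out + term
    return out

rng = np.random.default_rng(0)
n = 5
a, b, c, d = (np.diag(rng.uniform(-0.5, 0.5, size=n)) for _ in range(4))

# H = [[A, B], [C, D]] with all four blocks diagonal.  Pairing index i with
# index i + n block-diagonalizes H into n independent 2x2 matrices, so
# exp(H) reduces to n separate 2x2 exponentials.
H = np.block([[a, b], [c, d]])
E = np.zeros((2 * n, 2 * n))
for i in range(n):
    e2 = mat_exp(np.array([[a[i, i], b[i, i]], [c[i, i], d[i, i]]]))
    E[i, i], E[i, i + n] = e2[0, 0], e2[0, 1]
    E[i + n, i], E[i + n, i + n] = e2[1, 0], e2[1, 1]

assert np.allclose(E, mat_exp(H))
```

This is why the cost of applying each operator scales with the number of matrix-vector products rather than with a full dense matrix exponential.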

## 3. Numerical example

To test our method we simulate beam propagation through a two-dimensional (Cartesian) symmetric Epstein-layer waveguide tilted at an angle *θ* from the positive *z* axis [5]. Hard boundary conditions, *ψ*(*z*, *x*_{0}) = 0 and *ψ*(*z*, *x _{f}*) = 0, are assumed. The eigenfunctions of the transverse Laplace operator are then sine functions, so the boundary conditions at *x*_{0} and *x _{f}* are automatically satisfied and these points need not be included in the grid, which is defined by Δ*x* = (*x _{f}* − *x*_{0})/(*N _{x}* + 1) and Δ*z* = (*z _{f}* − *z*_{0})/*N _{z}*. The refractive index profile is that of a symmetric Epstein layer, where *x̃* is a shifted coordinate that aligns the waveguide with the center of the computational region, making the effect of the hard boundary conditions as negligible as possible, Δ*n* is the height of the refractive index shift, and *w* is the width of the waveguide. The shifted coordinate is given by *x̃* = *x* − (1/2)(*x _{f}* − *x*_{0}) + (1/2)(*z _{f}* − *z*_{0}) tan(*θ*). The initial electric field is given by the zeroth-order mode of the Epstein-layer waveguide, where *i* is the unit imaginary number (not to be confused with the counting index used elsewhere) and *W* and *K*_{0} are constants determined by the waveguide parameters. The exact solution for this tilted Epstein-layer waveguide is given by Adams [10].

C and CUDA versions of both the WASSS and HOWASSS methods were implemented. The CUDA code was compiled using nvcc version 4.2, and the C code using gcc version 4.6.3. The code was run on a workstation equipped with an Intel Xeon X5690 CPU, a hyper-threaded six-core CPU clocked at 3.47 GHz with 48 GB of RAM, and an NVIDIA QUADRO 6000 GPU, which has 448 CUDA cores at a 574 MHz core clock (750 MHz memory clock) and 6 GB of dedicated GDDR5 RAM.

The simulation was run using the parameters listed in Table 1. The value of *k*_{0} was selected to ensure that we could include 1000 transverse modes while still keeping **M** real-valued. When the waveguide was aligned (*θ* = 0°), roughly an order-of-magnitude decrease in the error was observed with the HOWASSS method over the WASSS method (see Fig. 1). The improvement was more pronounced when the waveguide was tilted at an angle of 50 degrees, especially for large step sizes (see Fig. 2). Figure 3 shows the calculated electric field next to the exact solution to further illustrate the accuracy of the HOWASSS method. The simulations shown in Figs. 1, 2, and 3 were run using double-precision arithmetic on the GPU. Both methods were capable of producing accurate results using single-precision arithmetic down to about Δ*z* = 0.05 *μ*m. At that point, round-off error began to affect the results of the HOWASSS method because of the increased number of matrix-vector multiplications required in each step. The WASSS method does not suffer from this problem as quickly because it is computationally simpler. The limit at which round-off error begins to affect the double-precision code was not observed for either method.

## 4. Discussion

#### 4.1. Stability

By virtue of being a higher-order method, the HOWASSS method has improved stability over the WASSS method (see Fig. 4). The WASSS method performs progressively worse as the waveguide tilt increases, and at large step sizes even shows signs of instability; the HOWASSS method, by contrast, actually becomes more accurate at larger angles. Traditionally, beam propagation methods struggle to obtain accurate results when there are rapid changes in the index of refraction. To simulate a rapidly changing index of refraction, we increase the index difference between the waveguide and the surrounding medium, Δ*n*, while keeping the width of the waveguide fixed. At a 50° waveguide tilt the WASSS method becomes unstable at large step sizes, while the HOWASSS method remains stable for at least an additional order-of-magnitude increase in Δ*n*. Both methods do lose accuracy as the change in the refractive index becomes steeper, but by decreasing Δ*z* accurate results can still be obtained in a reasonable run time. The same analysis was performed for a 0° tilted waveguide; the results were quite similar and so are not shown here.

#### 4.2. Speed

Computationally, the HOWASSS method requires the multiplication of matrices with vectors. If we restrict *N _{x}* so that the matrix **M** is real then, for the example in Section 3, all the matrices will be real; however, the vector **A** will be complex. In general the eigenfunctions and the refractive index could be complex, but for simplicity we will assume they are real.

Diagonal matrix multiplication is equivalent to element-by-element vector multiplication. Hence, multiplying an *N _{x}* × *N _{x}* diagonal matrix by an *N _{x}*-element complex vector requires 2*N _{x}* operations. Multiplying an *N _{x}* × *N _{x}* dense matrix by an *N _{x}*-element complex vector is equivalent to performing 2*N _{x}* dot products, each requiring *N _{x}* multiplications and *N _{x}* − 1 additions, for a total of $2{N}_{x}\left(2{N}_{x}-1\right)=4{N}_{x}^{2}-2{N}_{x}$ operations. Applying **P** requires 4 diagonal matrix-vector multiplications, for a total of 8*N _{x}* operations. Applying **Q**(*z*_{l+1})**Q**(*z _{l}*) requires 2 diagonal matrix-vector multiplications, 2 dense matrix-vector multiplications, 2 vector-vector additions, and 1 scalar-vector multiplication, for a total of $8{N}_{x}^{2}+6{N}_{x}=2{N}_{x}\left(4{N}_{x}+3\right)$ operations. Applying **C**(*z*_{l+1}, *z _{l}*) requires 4 dense matrix-vector multiplications, 1 vector-vector addition, and 2 element-by-element exponential operations (assumed to cost 1 operation per element), for a total of $16{N}_{x}^{2}-2{N}_{x}=2{N}_{x}\left(8{N}_{x}-1\right)$ operations.

In each step, **P** is applied four times, **Q**(*z*_{l+1})**Q**(*z _{l}*) is applied twice, and **C**(*z*_{l+1}, *z _{l}*) is applied once, giving a total of $32{N}_{x}^{2}+42{N}_{x}=2{N}_{x}\left(16{N}_{x}+21\right)$ operations. Following the same logic for the WASSS method, we find that it requires $8{N}_{x}^{2}+6{N}_{x}=2{N}_{x}\left(4{N}_{x}+3\right)$ operations. Note that these operation counts differ from the one reported by Clark and Thomas [6] because here we count both multiplications and additions.
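The bookkeeping above is easy to mis-tally; a few lines of code reproduce the per-step totals and the asymptotic factor-of-four cost ratio.

```python
# Per-step operation counts quoted in the text (multiplications + additions).
def p_ops(nx):
    return 8 * nx               # one application of P: 4 diagonal products

def q_pair_ops(nx):
    return 8 * nx**2 + 6 * nx   # one application of Q(z_{l+1})Q(z_l)

def c_ops(nx):
    return 16 * nx**2 - 2 * nx  # one application of C(z_{l+1}, z_l)

def howasss_step_ops(nx):
    # Per z-step: 4 applications of P, 2 of the Q pair, and 1 of C.
    return 4 * p_ops(nx) + 2 * q_pair_ops(nx) + c_ops(nx)

def wasss_step_ops(nx):
    return 8 * nx**2 + 6 * nx

nx = 1000
assert howasss_step_ops(nx) == 32 * nx**2 + 42 * nx
# To leading order in nx the HOWASSS/WASSS cost ratio approaches 32/8 = 4.
print(howasss_step_ops(nx) / wasss_step_ops(nx))
```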

Thus, to leading order in *N _{x}*, the HOWASSS method is approximately a factor of 4 slower for a given Δ*z* (see Fig. 5). However, our hypothesis was that the significant improvement in accuracy could yield a more efficient algorithm overall. To test this hypothesis, we executed the HOWASSS and WASSS methods using the GPU implementation with various step sizes, holding all other aspects of the problem constant. Figure 6 summarizes the results. For a propagation length of 100 micrometers and a waveguide tilt angle of 50 degrees, we show the compute time per micron of propagation as a function of the maximum absolute error. For two different values of *N _{x}*, the efficiency at constant error is better for the HOWASSS method, as indicated by shorter compute times. In fact, for errors smaller than about 10^{−4} the HOWASSS method is much better, with its efficiency advantage rapidly exceeding an order of magnitude for absolute errors below 10^{−6}.

The data presented in this paper were generated using double-precision arithmetic. This does not make a significant difference in run time when the code is run on a CPU. However, GPUs are intrinsically designed for single-precision arithmetic: to compute one double-precision operation the GPU must perform two single-precision calculations plus some additional overhead to combine the results, so using single-precision numbers yields at least a factor-of-two speed-up. This speed-up also depends on the specific graphics card used, as some have less double-precision capability than others. Rounding errors can affect the single-precision HOWASSS method if *N _{z}* is large, around 2000 for the specific example considered in this paper. If the application does not require extreme accuracy, a substantial speed-up can therefore be achieved by using single-precision arithmetic on the GPU.
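The kind of round-off growth described here can be illustrated with a toy iteration, unrelated to the HOWASSS operators themselves: repeatedly applying a norm-preserving rotation and watching the vector norm drift in single versus double precision.

```python
import numpy as np

# Apply a 2x2 rotation (exactly norm-preserving in exact arithmetic) many
# times; accumulated round-off appears as drift of the norm away from 1.
theta = 0.01
c, s = np.cos(theta), np.sin(theta)
R64 = np.array([[c, -s], [s, c]])
R32 = R64.astype(np.float32)

v64 = np.array([1.0, 0.0])
v32 = v64.astype(np.float32)
for _ in range(100_000):
    v64 = R64 @ v64
    v32 = R32 @ v32

err64 = abs(float(np.linalg.norm(v64)) - 1.0)
err32 = abs(float(np.linalg.norm(v32)) - 1.0)
assert err32 > err64   # single precision drifts far sooner
```

The per-step error is of order the machine epsilon of the chosen precision, so the number of steps at which the drift becomes visible is dramatically smaller for float32, mirroring the *N _{z}* ≈ 2000 threshold reported above.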

#### 4.3. Parallelization

Both the WASSS and HOWASSS methods readily lend themselves to parallelization. For our implementation we chose to use both OpenMP, to exploit multi-core CPUs, and NVIDIA CUDA, to exploit the processing power of the GPU. Modern GPUs possess orders of magnitude more computational power than a typical CPU; however, to utilize this power fully an algorithm must exhibit a very high level of parallelism, so that the many processing units of the GPU are saturated.

Unfortunately, neither the WASSS nor the HOWASSS method is ideal for utilizing the full power of the GPU, because many of the matrix-vector multiplications involve diagonal matrices. Although faster than their dense counterparts, these reduce computationally to simple element-by-element vector multiplications. While this is a perfectly parallel operation, it does not offer the large number of independent calculations needed to saturate the GPU. These element-by-element multiplications suffer additionally from requiring two memory reads and one memory write per multiplication, and on the GPU reading and writing memory is much slower than arithmetic. Even so, there is still a significant speed-up when the code is run on the GPU versus a single core of the CPU (see Fig. 5). This speed-up was accomplished with little attention paid to optimizing the code; with further optimization an additional factor of two or more would be likely. Additionally, extending this technique to even higher orders might yield further improvements.
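The point that a diagonal operator reduces to an element-by-element product, one multiply per element against three memory accesses, can be seen in a few lines; the sizes here are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 512
d = rng.normal(size=n)                         # diagonal entries of the matrix
v = rng.normal(size=n) + 1j * rng.normal(size=n)

# A diagonal matrix-vector product is an element-by-element vector product:
# O(n) multiplies, but still two reads and one write of memory per element,
# which is why the operation is memory-bound rather than compute-bound.
dense = np.diag(d) @ v        # O(n^2) dense product (wasteful)
fast = d * v                  # O(n) element-by-element equivalent
assert np.allclose(dense, fast)
```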

#### 4.4. Generalizations to other coordinate systems, boundary conditions, and higher dimensions

Both the WASSS and HOWASSS methods can be generalized to different coordinate systems or different boundary conditions. In fact the only complication is that the eigenfunctions and eigenvalues must be known. For a more detailed discussion on this topic see Clark and Thomas [6].

## 5. Conclusion

We have presented a method that is more efficient than previous non-paraxial beam propagation methods. The method casts the analytic solution of the Helmholtz equation as a Magnus expansion. Keeping terms up to (Δ*z*)^{3} in the Magnus expansion, we use a symmetric operator splitting technique to analytically reduce the exponential matrices to a simpler form. The solution to the Helmholtz equation is then approximated via straightforward matrix multiplication. We have demonstrated the method in a simple geometry: 2D Cartesian coordinates with hard boundary conditions. The results show that our higher-order approach significantly improves the overall efficiency, measured as compute time per distance of propagation, in cases where high-accuracy results are required. The method can be easily extended to more general coordinate systems, higher dimensions, and various boundary conditions, provided that the eigenfunctions and eigenvalues of the transverse Laplace operator can be found for the geometry in question.

## Acknowledgments

The authors thank the Air Force Research Laboratory, and acknowledge support through the Human Effectiveness Directorate contract FA8650-08-D-6930 for this effort. This work was partially supported by National Science Foundation Grants ECCS-1250360, DBI-1250361, and CBET-125036. B. H. H. acknowledges the support of the High Energy Laser Joint Technology Office for their sponsorship of the Directed Energy Professional Society’s Summer Research Internship Program.

## References and links

**1. **G. R. Hadley, “Multistep method for wide-angle beam propagation,” Opt. Lett. **17**, 1743–1745 (1992) [CrossRef] [PubMed] .

**2. **K. Q. Le, R. Godoy-Rubio, P. Bienstman, and G. R. Hadley, “The complex Jacobi iterative method for three-dimensional wide-angle beam propagation,” Opt. Express **16**, 17021–17030 (2008) [CrossRef] [PubMed] .

**3. **Y. Y. Lu and P. L. Ho, “Beam propagation method using a [(*p*−1)/*p*] Padé approximant of the propagator,” Opt. Lett. **27**, 683–685 (2002) [CrossRef] .

**4. **A. Sharma and A. Agrawal, “New method for nonparaxial beam propagation,” J. Opt. Soc. Am. B **21**, 1082–1087 (2004) [CrossRef] .

**5. **A. Sharma and A. Agrawal, “Non-paraxial split-step finite-difference method for beam propagation,” Opt. Quantum. Electron. **38**, 19–34 (2006) [CrossRef] .

**6. **C. D. Clark and R. Thomas, “Wide-angle split-step spectral method for 2D or 3D beam propagation,” Opt. Quantum. Electron. **41**, 849–857 (2010) [CrossRef] .

**7. **M. Guizar-Sicairos and J. C. Gutiérrez-Vega, “Computation of quasi-discrete Hankel transforms of integer order for propagating optical wave fields,” J. Opt. Soc. Am. A **21**, 53–58 (2004) [CrossRef] .

**8. **W. Magnus, “On the exponential solution of differential equations for a linear operator,” Comm. Pure Appl. Math. **7**, 649–673 (1954) [CrossRef] .

**9. **M. Bauer, R. Chetrite, K. Ebrahimi-Fard, and F. Patras, “Time-ordering and a generalized Magnus expansion,” Lett. Math. Phys. **103**, 331–350 (2012) [CrossRef] .

**10. **M. J. Adams, *An Introduction to Optical Waveguides* (Wiley, 1981).