## Abstract

The article deals with a method for calculating off-axis light propagation between parallel planes using discretization of the Rayleigh-Sommerfeld integral and its implementation by fast convolution. It analyses zero-padding in the case of different plane sizes. In the case of memory restrictions, it suggests splitting the calculation into tiles and shows that splitting leads to a faster calculation when the plane sizes differ substantially. Next, it suggests how to calculate propagation in the case of different sampling rates by splitting the planes into interleaved tiles and shows this to be faster than zero-padding and direct calculation. Neither the speedup nor the memory-saving method decreases accuracy; the aim of the proposed method is to provide reference data that can be compared to the results of faster and less precise methods.

©2010 Optical Society of America

## 1. Introduction

A fundamental tool of digital holography, or computer generated holography, is a numerical simulation of coherent light propagating in free space. We will use, as usual, the scalar approximation of the vectorial nature of light, and will not consider the time-dependent behavior of the light [1]. One of the most common tasks is to calculate light propagation between two parallel planes; the Rayleigh-Sommerfeld integral [1], its mathematical equivalent the angular spectrum [1,2], or their various approximations are used most often.

Those equations need to be discretized for numerical calculation, i.e. one has to both sample and spatially restrict the optical fields involved. The discretization itself leads to various errors well described in the literature [3,4]. The calculation tries to avoid those errors while working as fast as possible.

It is, however, often hard to decide what is an error of the discretization itself, what is an inherent error of a given fast method, and what is just an error of a particular implementation. The goal of this article is to describe a reference numerical calculation of light propagation between parallel planes that has just one “error”: the discretization. Any other method can then be compared to this one. As we will assume fine sampling that does not lead to aliasing errors, we have to deal with a large amount of data. We will try to handle it as fast as possible while retaining the accuracy of the calculation.

The proposed method focuses on off-axis light propagation between two rectangular areas that share neither size nor sampling rate, e.g. between a spatial light modulator (SLM) and a camera sensor. A reference calculation is described, e.g., in [5]; off-axis propagation, e.g., in [6,7]; different sampling rates are treated in [8] by a coordinate system change, in [9] by a shifted convolution kernel, and in [10] by a scaled Fourier transform; different sizes of the SLM and the sensor are handled in [10,11] using tiling while decreasing memory demands. This article addresses all these requirements in a unified way using the convolution approach and tiling; in contrast to the references given, it deals with optimization too.

## 2. One-dimensional case

We are going to show all the principles in a one-dimensional case before showing the full 2D version. This means that we will calculate light propagation between two line segments instead of two rectangular areas. We will call them *Source* and *Target*.

In linear optics it is assumed that light sources do not influence each other and that every single point $t(x)$ of *Target* (e.g. a camera sensor) is affected by the shining of all points $s(x)$ of *Source* (e.g. an SLM, see Fig. 1a). As mentioned in the introduction, we will assume the scalar approximation of coherent light, i.e. a light source can be completely described by amplitude *A* and phase *ϕ*, or by the complex amplitude $A\exp(\mathrm{j}\varphi)$, where $\mathrm{j}^{2}=-1$.

Light changes both amplitude and phase by propagation. This change can be described by multiplication with some complex number *p*. Light propagation is space invariant; therefore, it is just the mutual position of points on *Source* and *Target* that matters, not their absolute positions. It follows that the calculation will have a convolution form:

$$t(x)=\int s(\xi)\,p(x-\xi)\,\mathrm{d}\xi . \tag{1}$$

We omit the distance along the *z* axis between *Source* and *Target* because it is a constant and can be a part of the function *p*.

We can discretize the equation by equidistant splitting of *Source* and *Target* into *M* and *N* basic elements $s[m]$ ($0\le m\le M-1$) and $t[n]$ ($0\le n\le N-1$), i.e. by uniform sampling. Therefore, the element $t[n]$ receives from the element $s[m]$ light with complex amplitude $s[m]p[n-m]$, see Fig. 1b. Equation (1) can be written in discrete form:

$$t[n]=\sum_{m=0}^{M-1}s[m]\,p[n-m],\qquad 0\le n\le N-1. \tag{2}$$

The calculation of all elements $t[n]$ is most often done by rewriting Eq. (2) as a cyclic convolution and subsequent use of the discrete Fourier transform. The cyclic convolution has the form:

$$t[n]=\sum_{m=0}^{C-1}s[m]\,p[(n-m)\bmod C],\qquad 0\le n\le C-1, \tag{3}$$

where *s* is zero-padded to the size *C* (and *p* is extended to *C* values as well). Notice that the important values of $t[n]$ are those for $0\le n\le C-M$; the others are damaged by the cyclic behavior of indices in arithmetic $(\mathrm{mod}\ C)$. It is also worth mentioning that in cyclic convolution it is usually assumed that $M=N$. The proof of validity for the case $M\ne N$ consists just in the expansion of the equations for $t[n]$. Finally, *C* is an arbitrary number bigger than or equal to $M+N-1$. By choosing a suitable *C*, we can speed up the computation significantly, because

$$t=IDFT\big(DFT(s)\times DFT(p)\big), \tag{4}$$

where *t*, *s* and *p* are *C*-dimensional vectors (arrays) of complex numbers, $DFT$ is the discrete Fourier transform of a vector, $IDFT$ is the inverse discrete Fourier transform of a vector, and × is the Hadamard product (element by element product). The speedup is expected due to the fact that the calculation of $DFT$, or $IDFT$, can be done in time $O(C\log C)$ [12]. Choosing a suitable *C* is important because the actual calculation time is highly sensitive to its character.
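The chain of Eq. (3) and Eq. (4) can be sketched numerically. The following is a minimal 1D sketch (not the article's implementation): it zero-pads a source of *M* samples, wraps the kernel indices modulo *C*, and checks the result against the direct sum of Eq. (2). The kernel `p_func` is an arbitrary placeholder, not a physical propagation kernel.

```python
import numpy as np

def propagate_1d(s, p_func, N):
    """Propagate M source samples to N target samples via fast cyclic convolution.

    p_func(k) returns the kernel value p[k] for integer shifts k,
    -(M-1) <= k <= N-1.  C = M + N - 1 guarantees that the outputs
    t[0..N-1] are not damaged by the cyclic behavior of indices.
    """
    M = len(s)
    C = M + N - 1
    s_pad = np.zeros(C, dtype=complex)
    s_pad[:M] = s
    # Wrap the kernel so that p_pad[(n - m) mod C] == p_func(n - m)
    # for all needed shifts: indices N..C-1 hold the negative shifts.
    k = np.arange(C)
    p_pad = p_func(np.where(k <= N - 1, k, k - C))
    t = np.fft.ifft(np.fft.fft(s_pad) * np.fft.fft(p_pad))  # Eq. (4)
    return t[:N]

# Check against the direct evaluation of Eq. (2).
rng = np.random.default_rng(0)
M, N = 5, 9
s = rng.normal(size=M) + 1j * rng.normal(size=M)
p_func = lambda k: np.exp(1j * 0.3 * np.asarray(k) ** 2)  # placeholder kernel
t_fast = propagate_1d(s, p_func, N)
t_direct = np.array([sum(s[m] * p_func(n - m) for m in range(M))
                     for n in range(N)])
assert np.allclose(t_fast, t_direct)
```

Note that the sketch fixes $C=M+N-1$; a practical implementation would round *C* up to a size that the FFT library handles quickly.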

Let us assume that the sampling rate of *Target* is twice as fine as the sampling rate of *Source*. Then the subset of even samples from *Target* has the same sampling rate as *Source*. Consequently, we can easily calculate the propagation of *Source* to even samples of *Target*. Obviously, we can do the same with odd samples. It means we can split the calculation of light propagation into two calculations; they are denoted in Fig. 1c by black and magenta arrows. It follows that the same principle can be applied when the sampling rate of *Target* is *τ*-times finer than the sampling rate of *Source*, where *τ* is a natural number; we have to split *Target* into *τ* “*interleaved tiles*”, i.e. the calculation has to be split into *τ* calculations.

The same situation appears when *Source* has *σ*-times finer sampling than *Target*. The idea can be generalized: if the sampling rates of *Source* and *Target* are in a ratio $\sigma :\tau $, where *σ* and *τ* are coprime natural numbers (i.e. *σ* elements of *Source* have the same size as *τ* elements of *Target*), we can split the calculation into $\sigma \times \tau $ independent calculations. More precisely, we have to *σ*-times calculate the propagation for the sampling rate ratio $1:\tau $ and sum the results. Usually we do not care about the exact value of the sampling rate; therefore, we can choose such *σ* and *τ* that approximate the desired sampling fairly well.
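The $1:\tau$ case above can be sketched in 1D as follows (illustrative code, not the article's implementation): each interleaved tile of *Target* sees *Source* through its own shifted kernel, yet every tile reduces to an ordinary shift-invariant convolution at the coarse pitch. The kernel `p_phys`, given in units of the fine (*Target*) pitch, is a made-up placeholder.

```python
import numpy as np

def direct_fine(s, p_phys, tau):
    """Direct sum: target sample n sits at fine position n,
    source sample m at fine position tau*m."""
    M = len(s)
    return np.array([sum(s[m] * p_phys(n - tau * m) for m in range(M))
                     for n in range(tau * M)])

def interleaved(s, p_phys, tau):
    """Split Target into tau interleaved tiles; tile r holds samples
    n = tau*q + r and is a convolution with the shifted kernel
    p_r(k) = p_phys(tau*k + r) at the coarse (Source) pitch."""
    M = len(s)
    t = np.empty(tau * M, dtype=complex)
    for r in range(tau):
        t[r::tau] = [sum(s[m] * p_phys(tau * (q - m) + r) for m in range(M))
                     for q in range(M)]
    return t

rng = np.random.default_rng(1)
s = rng.normal(size=6) + 1j * rng.normal(size=6)
p_phys = lambda d: np.exp(1j * 0.1 * d * d) / (1 + abs(d))  # placeholder
assert np.allclose(interleaved(s, p_phys, 3), direct_fine(s, p_phys, 3))
```

Each per-tile convolution would, of course, be evaluated by the fast convolution of Eq. (4) in practice; the direct sums here only make the index bookkeeping explicit.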

It follows from Eq. (3) that the vectors *s* and *t* have to be padded to the size *C*. This means that for $M=N$, approximately 50% of the elements are held in memory uselessly; in a 2D convolution, it is as much as 75%. Therefore, for big *M* and *N* we can run out of memory very soon, especially when using special hardware for the $DFT$ calculation, e.g. a GPU. For example, a naive approach to the propagation of a microscopic *Source* to an extended detector *Target* could require hundreds of gigabytes.

We need two memory spaces of size $C\ge M+N-1$ for the calculation of Eq. (4); in practice, a restriction may appear that just two spaces of size $P<C$ are available. Let us assume, for example, that $M=512$, $N=1024$ and $P=1024$. We cannot make the calculation directly because $C\ge 1535$; but we can split *Target* in the middle into two parts (let us call them “*common tiles*”) with ${N}^{\prime}=512$ elements and make two calculations. For them, we need only two spaces of size at least $M+{N}^{\prime}-1=1023$, and therefore we are not limited by $P=1024$. The same idea would apply if $M=1024$, $N=512$ and $P=1024$: we would calculate the light propagation of both small parts of *Source* to *Target* and sum the results. As in the “different sampling” case, even this idea can be generalized: *Source* can be split into *S* parts, *Target* into *T* parts, and then we have to calculate $S\times T$ propagations.
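The $M=512$, $N=1024$, $P=1024$ scenario can be sketched as follows (illustrative code; the placeholder kernel and the FFT-size handling are assumptions): *Target* is split into two common tiles, each tile is computed with arrays of size $M+N'-1\le P$, and the pieces are concatenated.

```python
import numpy as np

def fft_propagate(s, p_func, N):
    """Fast convolution of M source samples to N target samples, cf. Eq. (4).
    Returns the result and the array size C actually needed."""
    M = len(s)
    C = M + N - 1
    s_pad = np.zeros(C, dtype=complex)
    s_pad[:M] = s
    k = np.arange(C)
    p_pad = p_func(np.where(k <= N - 1, k, k - C))  # wrap negative shifts
    return np.fft.ifft(np.fft.fft(s_pad) * np.fft.fft(p_pad))[:N], C

def tiled_target(s, p_func, N, T, P):
    """Split Target into T common tiles so that each FFT fits into P samples."""
    parts = []
    Np = N // T
    for i in range(T):
        off = i * Np                  # the tile offset shifts the kernel
        part, C = fft_propagate(s, lambda k: p_func(k + off), Np)
        assert C <= P                 # the memory restriction is honored
        parts.append(part)
    return np.concatenate(parts)

rng = np.random.default_rng(2)
M, N, P = 512, 1024, 1024
s = rng.normal(size=M) + 1j * rng.normal(size=M)
p_func = lambda k: np.exp(1j * 2e-4 * np.asarray(k) ** 2)  # placeholder
full, C_full = fft_propagate(s, p_func, N)   # needs C = 1535 > P
tiled = tiled_target(s, p_func, N, 2, P)     # two tiles, C = 1023 <= P
assert C_full > P and np.allclose(full, tiled)
```

The only per-tile change is the constant offset added to the kernel argument, which is exactly the bookkeeping described for the 2D case in section 3.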

## 3. Two-dimensional case

The extension of the equations from section 2 to the 2D case is straightforward: instead of sums we just put double sums there. For the calculation of light propagation of a part of *Source* to a part of *Target*, it is necessary to carefully calculate the convolution kernel, i.e. the 2D array *p*. A practical aid is the fact that its element $p[0,0]$ describes light propagation from the element $s[0,0]$ of a particular part of *Source* to the element $t[0,0]$ of a particular part of *Target*. The equations for sampling rates, sample counts and offsets are simple but technically demanding, so we will not show them here.

We should, however, mention the calculation of the convolution kernel values. The Rayleigh-Sommerfeld equation [2] for the light propagation from the plane $z=0$ to the point $[x,y,z]$ is:

$$u(x,y,z)=\frac{1}{2\pi}\iint u(\xi,\eta,0)\,\frac{z}{r}\left(\frac{1}{r}-\mathrm{j}k\right)\frac{\exp(\mathrm{j}kr)}{r}\,\mathrm{d}\xi\,\mathrm{d}\eta,$$

where $k=2\pi/\lambda$ is the wave number (*λ* is a wavelength), and *r* is the distance between the points $[x,y,z]$ and $[\xi,\eta,0]$. This equation can be written in a convolution form:

$$u(x,y,z)=u(x,y,0)\otimes\left[\frac{z}{2\pi}\left(\frac{1}{r}-\mathrm{j}k\right)\frac{\exp(\mathrm{j}kr)}{r^{2}}\right], \tag{5}$$

where $r=\sqrt{x^{2}+y^{2}+z^{2}}$ in the bracketed kernel.

The simplest method of convolution kernel discretization (the expression on the right side of the convolution operator in Eq. (5)) is plain sampling, i.e. its calculation for a particular *x*, *y* (*z* is a constant). Alternatively, we can assume that a sample of *Source* in fact expresses – using some pixel spread function – the shining of a particular non-zero-area element of *Source*, and alter the kernel accordingly. If we take the non-zero area of *Target* (sensor) elements into account, we can pre-filter the kernel. If *Source* is a mathematical model of a real display, we can even measure the kernel. The proposed method therefore does not depend on particular features of the kernel; its only assumption is the description of light propagation as a convolution. In our implementation, however, we did not deal with advanced methods of kernel calculation, and we have chosen plain sampling.
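Plain sampling of the kernel in Eq. (5) can be sketched as follows (a minimal sketch; the grid parameters and the off-axis offsets `x0`, `y0` between the tile origins are illustrative assumptions). It evaluates the kernel at all integer shifts needed by the discrete convolution between one pair of tiles.

```python
import numpy as np

def rs_kernel(shifts_x, shifts_y, dx, dy, z, wavelength, x0=0.0, y0=0.0):
    """Plain sampling of the Rayleigh-Sommerfeld convolution kernel.

    shifts_x, shifts_y: integer sample shifts (target index minus source
    index); dx, dy: sampling pitches; x0, y0: off-axis offset between the
    tile origins; z: distance between the planes.
    """
    k = 2.0 * np.pi / wavelength
    x = x0 + np.asarray(shifts_x, dtype=float) * dx
    y = y0 + np.asarray(shifts_y, dtype=float) * dy
    X, Y = np.meshgrid(x, y, indexing="ij")
    r = np.sqrt(X**2 + Y**2 + z**2)
    # z/(2*pi) * (1/r - jk) * exp(jkr) / r^2, the kernel of Eq. (5)
    return z / (2.0 * np.pi) * (1.0 / r - 1j * k) * np.exp(1j * k * r) / r**2

wavelength, dx, z = 632.8e-9, 10e-6, 0.1
M, N = 4, 6
shifts = np.arange(-(M - 1), N)         # all shifts needed by Eq. (2)
p = rs_kernel(shifts, shifts, dx, dx, z, wavelength)
assert p.shape == (M + N - 1, M + N - 1)
# For k*r >> 1 the on-axis magnitude approaches 1/(lambda*z).
assert np.allclose(abs(p[M - 1, M - 1]), 1 / (wavelength * z), rtol=1e-6)
```

The pre-filtered or measured kernels mentioned above would replace only the body of `rs_kernel`; the rest of the calculation is unaffected.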

## 4. Theoretical time of computation

The calculation works with arrays of $C={C}_{x}\times {C}_{y}$ samples. It consists of three $DFT$'s (more precisely, two forward and one backward), the calculation of the convolution kernel *p* with complexity $O(C)$, and the Hadamard product of the same complexity. For a large *C*, only the time spent on the $DFT$'s matters. The time of calculation using the fast Fourier transform is therefore proportional [12] to

$$time(C)=3\,C\log C. \tag{6}$$

Let us assume that *Source* has $M\times M$ samples, *Target* has $N\times N$ samples, $N=\omega M$, and that *Source* and *Target* share the sampling. Then we can also split *Target* into $T\times T$ common tiles of size $(N/T)\times (N/T)$. It follows that we calculate $DFT(Source)$ once and calculate $DFT(p)$ and $IDFT(DFT(Source)\times DFT(p))$ $T\times T$-times, while we assume the array sizes for $DFT$ as $C=(M+N/T)\times (M+N/T)$. Using Eq. (6), we get the time of calculation

$$time_{tiled}=(1+2T^{2})\,(M+N/T)^{2}\,\log\big((M+N/T)^{2}\big). \tag{7}$$

A comparison with the direct calculation (the case $T=1$) shows that for $T=2$ the tiled calculation is faster when *ω* is big enough; for a bigger *T* it is faster for even smaller *ω*. Figure 2a shows the graph for $N=1024$; for a different *N* it does not change very much.

A more interesting result arises when *Source* and *Target* have different sampling rates. Let us assume that *Target* has *τ*-times finer sampling than *Source*, i.e. $N=\tau M$. For a direct calculation, *Source* must have the same sampling rate as *Target*, i.e. it must have $\tau M\times \tau M$ samples too. We can increase the number of samples by interpolation or just by interleaving the current ones with a suitable amount of zeros. By substituting $C=(\tau M+\tau M)\times (\tau M+\tau M)$ into Eq. (6), we get the time of calculation

$$time_{direct}=3\,(2\tau M)^{2}\,\log\big((2\tau M)^{2}\big). \tag{8}$$

If we split *Target* into $\tau \times \tau$ interleaved tiles, we have to calculate $\tau \times \tau$ light propagations from *Source* to a part of *Target* of size $M\times M$ samples. By substituting $C=(M+M)\times (M+M)$ into Eq. (6), we get the time of calculation

$$time_{interleaved}=(1+2\tau^{2})\,(2M)^{2}\,\log\big((2M)^{2}\big), \tag{9}$$

as $DFT(Source)$ is calculated just once. The graph of the theoretical speedup $time_{direct}/time_{interleaved}$ as a function of *τ* for $M=1024$ is in Fig. 2b. Again, a different *M* leads to a similar graph.
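Under the $3\,C\log C$ cost model of Eq. (6), the predicted speedup of interleaved tiling over the zero-padded direct calculation can be tabulated in a few lines. This is a sketch of the reasoning above, counting one reused $DFT(Source)$ plus $2\tau^{2}$ further $DFT$'s; it is not a timing measurement.

```python
import numpy as np

def time_direct(M, tau):
    """Cost model, Eq. (6), with C = (2*tau*M)**2: three DFT's on the
    zero-padded (upsampled) arrays."""
    C = (2 * tau * M) ** 2
    return 3 * C * np.log2(C)

def time_interleaved(M, tau):
    """tau*tau propagations on (2M)x(2M) arrays; DFT(Source) is reused,
    so only 1 + 2*tau**2 DFT's are needed in total."""
    C = (2 * M) ** 2
    return (1 + 2 * tau**2) * C * np.log2(C)

M = 1024
for tau in (2, 3, 4, 8):
    speedup = time_direct(M, tau) / time_interleaved(M, tau)
    print(f"tau = {tau}: predicted speedup {speedup:.2f}x")

# Interleaved tiling is predicted to win for every tested tau.
assert all(time_direct(M, t) > time_interleaved(M, t) for t in (2, 3, 4, 8))
```

For growing *τ* the ratio approaches $3/2$ times the logarithmic factor, which is consistent with the up-to-$2\times$ speedup reported in the conclusion.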

One can see in Fig. 2 that measured speedups are scattered around theoretical curves. A discussion of this fact is contained in the following section.

Until now we have assumed that *Target* has more samples than *Source*. That led to saving nearly one-third of all $DFT$'s. In the opposite situation this does not hold any more; however, the results of the analysis are similar, although not so outstanding. The analysis of general situations, where the ratio of sampling rates or sizes is a general fraction, leads to the following results: the speedup is greater when the numerator and denominator of the sampling rate ratio are big too; splitting *Source* and *Target* of the same sampling rate into common tiles in a complicated ratio rather leads to a slowdown if we are not forced to it by memory restrictions.

## 5. Implementation notes

The implementation of the proposed method is simple, although quite demanding due to the number of variables involved. The first step is to estimate the ratio of the *Source* and *Target* sampling rates, and to split them into interleaved tiles of common sampling. In case we have enough memory to calculate the light propagations between these tiles, we can calculate them (but see later). Otherwise, we are facing a still unresolved problem.

Let us assume that two memory spaces of size *C* are available. The interleaved tiles created by splitting *Source* and *Target* due to the requirement of common sampling have to be split into ${S}_{x}\times {S}_{y}$ common tiles of size ${M}_{x}\times {M}_{y}$, and ${T}_{x}\times {T}_{y}$ common tiles of size ${N}_{x}\times {N}_{y}$ (due to integer division, one row and one column may be smaller) so that:

$$(M_{x}+N_{x}-1)\,(M_{y}+N_{y}-1)\le C. \tag{10}$$

The employed $DFT$ implementation [12] calculates a $DFT$ of special array sizes up to 6× faster than for comparable prime sizes. This is the first expected reason for the scattered look of the measured data in Fig. 2. One can see in Fig. 2b that the measured speedups are bigger than the theoretical ones. This behavior starts to appear for $M>512$ in our implementation. We expect it appears due to the following fact. When calculating a propagation without any tiling, we have to increase the *Source* side size *τ*-times, so the $DFT$ has to work with a big array. Using interleaved tiling, the $DFT$ works with smaller arrays that fit into cache memory more easily, and therefore a bigger speedup can be expected.

The implemented heuristic suggests a splitting in the following way. First, we have measured the calculation times $time[i]$ of one-dimensional $DFT$'s of array sizes *i*. From these times we have picked the “friendly” sizes, i.e. those that satisfy the condition $time[friendly\text{-}size]<time[i]$ for all measured $i>friendly\text{-}size$. A two-dimensional $DFT$ is separable, so it is expected that a two-dimensional array with friendly-size side sizes will be “friendly” too. Next, we have measured the calculation times for those arrays. The propagation itself is then calculated in 2D “friendly-size” arrays. In case we need to split into common tiles due to memory restrictions, we choose tiles of maximum width (the tiles are then in fact horizontal stripes); the heights of the *Source* and *Target* tiles are chosen to be approximately equal and as big as possible. We have compared this heuristic to the optimal solution found using brute force; it suggests an at most approximately $1.7\times$ worse tiling.
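The friendly-size selection can be sketched as a suffix-minimum scan over the measured times (illustrative code with made-up timings; the real implementation would measure actual $DFT$ runs):

```python
def friendly_sizes(times):
    """Pick sizes i with times[i] < times[j] for every measured j > i.

    Scanning in descending size order, a size is 'friendly' exactly when
    its time beats the best time seen among all larger sizes.
    """
    friendly = []
    best = float("inf")
    for size in sorted(times, reverse=True):
        if times[size] < best:
            friendly.append(size)
            best = times[size]
    return sorted(friendly)

# Made-up 1D DFT timings: powers of two are fast, primes are slow.
measured = {509: 9.1, 512: 1.0, 640: 1.5, 1021: 20.0, 1024: 2.1, 1151: 25.0}
print(friendly_sizes(measured))  # -> [512, 640, 1024, 1151]
```

Note that the largest measured size is always kept (the condition is vacuous for it); whether it is ever chosen depends on the tiling constraints.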

A topic that has not been discussed yet is the precision of the proposed algorithm compared to the precision of a direct calculation without any tiling. We have tried to propagate the *Source* to the *Target* of the same size (for several different sizes) using different tiling schemes. We have found that the relative difference of the calculated complex amplitudes is of the order of $10^{-10}$ or smaller; the difference was calculated as the absolute value of the difference of the complex amplitudes. This difference is so small that we have not tried to find where it comes from.

## 6. Conclusion

A method for the reference calculation of light propagation between two rectangular areas of parallel planes has been presented. These areas do not have to share either the sampling rate or the size, see Fig. 3. The calculation can be split into tiles to meet given memory restrictions; on the other hand, saving memory often leads to worse calculation times. We have shown that in the case of a simple sampling rate ratio, the proposed method actually speeds up the calculation up to $2\times$ by splitting the areas into interleaved tiles, compared to zero-padding and direct calculation. We have shown as well that if one area is much bigger than the other, it is faster to split the bigger one into common tiles. A still unresolved problem is how to split the calculation into common tiles (due to memory restrictions) to get the fastest calculation. We have, however, proposed a suboptimal heuristic to address this problem.

## Acknowledgments

This work has been supported by the Ministry of Education, Youth and Sports of the Czech Republic under the research program LC-06008 (Center for Computer Graphics). The author would like to thank Ivo Hanák, Václav Skala and the reviewers for their valuable comments.

## References and links

**1. **J. W. Goodman, *Introduction to Fourier Optics*, 3rd ed. (Roberts & Company Publishers, 2004).

**2. **E. Lalor, “Conditions for the validity of the angular spectrum of plane waves,” J. Opt. Soc. Am. **58**(9), 1235–1237 (1968). [CrossRef]

**3. **L. Onural, “Exact analysis of the effects of sampling of the scalar diffraction field,” J. Opt. Soc. Am. A **24**(2), 359–367 (2007). [CrossRef]

**4. **L. Onural, A. Gotchev, H. Ozaktas, and E. Stoykova, “A survey of signal processing problems and tools in holographic three-dimensional television,” IEEE Trans. Circ. Syst. Video Tech. **17**(11), 1631–1646 (2007). [CrossRef]

**5. **V. Katkovnik, J. Astola, and K. Egiazarian, “Discrete diffraction transform for propagation, reconstruction, and design of wavefield distributions,” Appl. Opt. **47**(19), 3481–3493 (2008). [CrossRef] [PubMed]

**6. **N. Delen and B. Hooker, “Free-space beam propagation between arbitrarily oriented planes based on full diffraction theory: a fast Fourier transform approach,” J. Opt. Soc. Am. A **15**(4), 857–867 (1998). [CrossRef]

**7. **J.-L. Kaiser, E. Quertemont, and R. Chevallier, “Light propagation in the pseudo-paraxial fresnel approximation,” Opt. Commun. **233**(4-6), 261–269 (2004). [CrossRef]

**8. **E. Sziklas and A. Siegman, “Diffraction calculations using fast Fourier transform methods,” Proc. IEEE **62**(3), 410–412 (1974). [CrossRef]

**9. **R. P. Muffoletto, J. M. Tyler, and J. E. Tohline, “Shifted Fresnel diffraction for computational holography,” Opt. Express **15**(9), 5631–5640 (2007). [CrossRef] [PubMed]

**10. **F. Zhang, G. Pedrini, and W. Osten, “Reconstruction algorithm for high-numerical-aperture holograms with diffraction-limited resolution,” Opt. Lett. **31**(11), 1633–1635 (2006). [CrossRef] [PubMed]

**11. **K. Matsushima and S. Nakahara, “Extremely high-definition full-parallax computer-generated hologram created by the polygon-based method,” Appl. Opt. **48**(34), H54–H63 (2009). [CrossRef] [PubMed]

**12. **M. Frigo and S. G. Johnson, “The design and implementation of FFTW3,” Proc. IEEE **93**(2), 216–231 (2005) (Special issue on “Program Generation, Optimization, and Platform Adaptation”). [CrossRef]