## Abstract

In this paper, we present fast signal reconstruction for compressive holography using a graphics processing unit (GPU). We implemented a fast iterative shrinkage-thresholding algorithm on a GPU to solve the *ℓ*_{1} and total variation (TV) regularized problems that are typically used in compressive holography. Since the algorithm is highly parallel, GPUs can compute it efficiently through data-parallel computing. For better performance, our implementation exploits the structure of the measurement matrix to compute the matrix multiplications. The results show that the GPU-based implementation is about 20 times faster than the CPU-based implementation.

© 2016 Optical Society of America

## 1. Introduction

Compressive holography is an application of compressive sensing (CS) [1–4] to holography, which can reconstruct 3D scattering density from a single 2D hologram [5–9]. In holographic 3D imaging, tomographic images obtained from a conventional backpropagation technique suffer from out-of-focus objects, twin images, and squared terms, while compressive holography can remove these unwanted terms.

A problem in compressive holography is that signal reconstruction in a CS framework is computationally intensive. This is because the sizes of reconstructed signals in compressive holography (i.e., 3D scattering density) are large. For example, assuming that we have a hologram with 1000 × 1000 pixels and try to reconstruct a 3D scattering density with 10 different depths from it, the reconstructed signal has 10 million variables. Since CS reconstruction is performed through an optimization problem (e.g., *ℓ*_{1} norm minimization), we must solve a large-scale optimization problem that has 10 million variables.
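The scale of this example is easy to make concrete: one megapixel hologram and ten depth slices already give a ten-million-variable unknown, while only the hologram pixels are measured.

```python
hologram_pixels = 1000 * 1000          # measurements: one 2D hologram
depths = 10                            # reconstructed depth slices
unknowns = hologram_pixels * depths    # variables of the optimization problem
# The system is underdetermined by the number of depth slices.
ratio = unknowns // hologram_pixels
```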

To address this problem, we propose fast CS reconstruction for compressive holography using a graphics processing unit (GPU). A GPU is a massively parallel computing device suited to data-parallel computing, in which the same program is executed on many data elements in parallel [10–12]. Since CS reconstruction can be computed using a data-parallel model, GPUs can perform the reconstruction very efficiently. GPU-accelerated CS reconstruction has been studied in many fields (e.g., magnetic resonance imaging [13, 14], optical coherence tomography [15], and hyperspectral imaging [16]), and, to the best of our knowledge, this study is the first to show that GPU-accelerated computing works efficiently for compressive holography. We implemented a GPU solver for the *ℓ*_{1} and total variation (TV) [17, 18] regularized problems that are typically used in compressive holography (Section 3). Our implementation is based on the fast iterative shrinkage-thresholding algorithm (FISTA) [19, 20] (Section 3.1). In FISTA, we accelerated the matrix computations by exploiting the structure of the measurement matrix of compressive holography (Section 3.2). We compared the computation times of the GPU-based and CPU-based implementations, and the results show that the GPU-based implementation is about 20 times faster (Section 4).

## 2. Compressive holography

We briefly explain compressive holography. Figure 1 illustrates a Gabor holography setup. The 3D object with the scattering density is illuminated by a plane wave, and then the Gabor hologram is recorded by an image sensor. Assume that $N_x \times N_y$ is the number of lateral samples, $N_z$ is the number of axial samples, and the object of size $N_x \times N_y \times N_z$ is written as the vector $\mathbf{f} = \left[ \begin{array}{cccc} \mathbf{f}_1^T & \mathbf{f}_2^T & \cdots & \mathbf{f}_{N_z}^T \end{array} \right]^T \in \mathbb{C}^{N_{xyz}}$, where $\mathbf{f}_i \in \mathbb{C}^{N_{xy}}$ is the $i$-th depth slice of size $N_x \times N_y = N_{xy}$ and $N_x \times N_y \times N_z = N_{xyz}$. The diffracted field on the image sensor $\mathbf{u} \in \mathbb{C}^{N_{xy}}$ is formed as

$$\mathbf{u} = \mathbf{F}^{-1} \mathbf{Q} \mathbf{B} \mathbf{f}, \tag{1}$$

where $\mathbf{F} \in \mathbb{C}^{N_{xy} \times N_{xy}}$ is the 2D Fourier transform matrix, $\mathbf{Q} \in \mathbb{C}^{N_{xy} \times N_{xyz}}$ consists of the diagonal matrices $\mathbf{P}_i \in \mathbb{C}^{N_{xy} \times N_{xy}}$ expressing the transfer function of wave propagation [21], and $\mathbf{B} \in \mathbb{C}^{N_{xyz} \times N_{xyz}}$ is the block diagonal matrix of $\mathbf{F}$:

$$\mathbf{Q} = \left[ \begin{array}{cccc} \mathbf{P}_1 & \mathbf{P}_2 & \cdots & \mathbf{P}_{N_z} \end{array} \right], \tag{2}$$

$$\mathbf{B} = \mathrm{bdiag}(\mathbf{F}, \mathbf{F}, \ldots, \mathbf{F}), \tag{3}$$

where $\mathrm{bdiag}(\cdot)$ denotes a block diagonal matrix. The Gabor hologram $\mathbf{g} \in \mathbb{R}^{N_{xy}}$ is the intensity of the sum of the diffracted field $\mathbf{u}$ and the reference beam $\mathbf{r}$, described as

$$g_i = |u_i + r_i|^2 = |u_i|^2 + 2\,\mathrm{Re}(u_i \bar{r}_i) + |r_i|^2, \tag{4}$$

where $\mathrm{Re}(x)$ is the operator to take the real part of $x$ and $\bar{r}_i$ denotes complex conjugation. For simplicity, we assume that (1) the reference light has a constant value on the measurement plane so that $r_i = 1$ and (2) the DC term of the reference light $|r_i|^2 = 1$ can be eliminated. This assumption simplifies Eq. (4) to

$$g_i = 2\,\mathrm{Re}(u_i) + |u_i|^2, \tag{5}$$

where $|u_i|^2$ can be regarded as a model error $e_i$ and may be eliminated by the reconstruction process [5]. Substituting Eq. (1) into Eq. (5), we obtain the linear system expressing the measurement of the Gabor hologram:

$$\mathbf{g} = 2\,\mathrm{Re}\!\left(\mathbf{F}^{-1} \mathbf{Q} \mathbf{B} \mathbf{f}\right) + \mathbf{e} = \mathbf{H} \mathbf{f} + \mathbf{e}. \tag{6}$$
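As a concrete illustration, the hologram formation described above (per-slice 2D transform, transfer-function multiplication, summation over depths, and taking twice the real part) can be sketched in NumPy. This is an illustrative re-implementation under our own naming, not the paper's CUDA code; the angular-spectrum transfer function is one standard choice [21]:

```python
import numpy as np

def transfer_function(n, pitch, wavelength, z):
    """Angular-spectrum transfer function (a stand-in for the diagonal of P_i [21])."""
    fx = np.fft.fftfreq(n, d=pitch)
    fxx, fyy = np.meshgrid(fx, fx, indexing="ij")
    arg = 1.0 / wavelength**2 - fxx**2 - fyy**2
    prop = arg > 0  # keep propagating (non-evanescent) components only
    return np.exp(2j * np.pi * z * np.sqrt(np.where(prop, arg, 0.0))) * prop

def gabor_hologram(f_slices, transfer):
    """g = 2 Re(F^{-1} sum_i P_i F f_i), i.e., the measurement without the error term e."""
    spectrum = sum(P * np.fft.fft2(fi) for P, fi in zip(transfer, f_slices))
    return 2.0 * np.real(np.fft.ifft2(spectrum))

# Toy sizes: N_x = N_y = 64 lateral samples, N_z = 3 depth slices.
n, depths = 64, [50e-3, 52e-3, 54e-3]
transfer = [transfer_function(n, 9.0e-6, 633e-9, z) for z in depths]
f = [np.zeros((n, n), dtype=complex) for _ in depths]
for i, fi in enumerate(f):
    fi[16 * (i + 1), 32] = 1.0  # one point scatterer per depth slice
g = gabor_hologram(f, transfer)  # a real-valued N_x x N_y hologram
```

The hologram has only $N_{xy}$ real samples regardless of the number of depth slices, which is exactly why the inverse problem below is underdetermined.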

The linear system Eq. (6) is an ill-posed inverse problem (i.e., $N_{xy} < N_{xyz}$) and does not have a unique solution. In the CS framework, we exploit the sparsity of desired solutions in a certain space to find solutions of underdetermined systems [4]. In this paper, we infer the solution $\mathbf{f}$ by solving the following unconstrained optimization problem:

$$\hat{\mathbf{f}} = \mathop{\mathrm{argmin}}_{\mathbf{f}} \; \mathcal{E}(\mathbf{f}) + \tau \mathcal{R}(\mathbf{f}), \qquad \mathcal{E}(\mathbf{f}) = \frac{1}{2} \left\| \mathbf{g} - \mathbf{H}\mathbf{f} \right\|_2^2, \tag{7}$$

where $\|\cdot\|_2$ is the $\ell_2$ norm, $\mathcal{E}(\mathbf{f})$ expresses the fidelity between the measurement $\mathbf{g}$ and the estimated value $\mathbf{H}\mathbf{f}$, $\mathcal{R}(\mathbf{f})$ is a regularizer that promotes sparseness of the estimated value, and $\tau$ is the regularization parameter of $\mathcal{R}(\mathbf{f})$. In this study, we used the $\ell_1$ norm and total variation (TV) [17, 18] as regularizers, which are defined by

$$\|\mathbf{f}\|_1 = \sum_{i=1}^{N_{xyz}} |f_i|, \tag{8}$$

$$\mathrm{TV}(\mathbf{f}) = \sum_{j=1}^{N_z} \sum_{i=1}^{N_{xy}} \left\| \mathbf{D}_i \mathbf{f}_j \right\|_2, \tag{9}$$

where $\mathbf{D}_i \in \mathbb{C}^{2 \times N_{xy}}$ is a 2D local finite difference operator for the $i$-th element of the slice $\mathbf{f}_j$.

## 3. GPU-accelerated CS reconstruction

#### 3.1. Fast iterative shrinkage-thresholding algorithm

For reconstruction through the optimization problem Eq. (7), we used FISTA, a simple and fast iterative algorithm [19, 20]. FISTA is based on the iterative shrinkage-thresholding algorithm and accelerates it by combining the last two iterates at each step. We chose it for its low computational cost, high scalability, and simplicity.

In FISTA, taking an initial guess $\mathbf{f}_0 = \mathbf{0}$, $\mathbf{h}_1 = \mathbf{f}_0$ and a stepsize $1/L$, we perform iterative updates of the approximate solution at the $k$-th step $\mathbf{f}_k$ by

$$\mathbf{f}_k = \mathrm{prox}_{\tau \mathcal{R}/L}\!\left( \mathbf{h}_k - \frac{1}{L} \nabla \mathcal{E}(\mathbf{h}_k) \right), \tag{10}$$

$$t_{k+1} = \frac{1 + \sqrt{1 + 4 t_k^2}}{2}, \tag{11}$$

$$\mathbf{h}_{k+1} = \mathbf{f}_k + \frac{t_k - 1}{t_{k+1}} \left( \mathbf{f}_k - \mathbf{f}_{k-1} \right), \tag{12}$$

where $\nabla \mathcal{E}$ is the gradient of $\mathcal{E}$ and $\mathrm{prox}_{\tau \mathcal{R}/L}$ is the proximal map [22] associated with the regularizer $\mathcal{R}$, defined by

$$\mathrm{prox}_{\tau \mathcal{R}/L}(\mathbf{x}) = \mathop{\mathrm{argmin}}_{\mathbf{v}} \left\{ \frac{\tau}{L} \mathcal{R}(\mathbf{v}) + \frac{1}{2} \left\| \mathbf{v} - \mathbf{x} \right\|_2^2 \right\}. \tag{13}$$

$\nabla \mathcal{E}$ can be written explicitly by [23]

$$\nabla \mathcal{E}(\mathbf{f}) = \mathbf{H}^H \left( \mathbf{H}\mathbf{f} - \mathbf{g} \right), \tag{14}$$

where $\mathbf{H}^H$ is adjoint to $\mathbf{H}$. For $\mathcal{R}(\mathbf{f}) = \|\mathbf{f}\|_1$, the proximal map is simplified to

$$\mathrm{prox}_{\tau \|\cdot\|_1 / L}(\mathbf{x}) = \mathcal{T}_{\tau/L}(\mathbf{x}), \tag{15}$$

where $\mathcal{T}_\kappa$ is the shrinkage operator (soft-thresholding) defined by

$$\left[ \mathcal{T}_\kappa(\mathbf{x}) \right]_i = \left( |x_i| - \kappa \right)_+ \frac{x_i}{|x_i|}, \tag{16}$$

where $(a)_+ = \max\{0, a\}$. For $\mathcal{R}(\mathbf{f}) = \mathrm{TV}(\mathbf{f})$, since the proximal map cannot be written explicitly, we must solve Eq. (13) as a sub-problem at each iteration. This sub-problem can be solved by iterative schemes like Chambolle's dual-based approach [18, 20], and we used the gradient projection (GP) algorithm in [20]. The GP requires internal iterations at each iteration of FISTA (in this paper, we set the number of iterations of the GP to 10 as described in Section 4); thus, TV regularization is more computationally intensive than $\ell_1$ regularization.
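The FISTA updates with the *ℓ*_{1} proximal map can be sketched as follows. This is an illustrative NumPy version on a small dense complex system, not the GPU implementation; we take the stepsize 1/*L* from the largest eigenvalue of the normal matrix, and all problem sizes below are our own toy choices:

```python
import numpy as np

def soft_threshold(x, kappa):
    """Complex soft-thresholding T_kappa, Eq. (16): shrink magnitudes, keep phases."""
    mag = np.abs(x)
    return np.maximum(mag - kappa, 0.0) / np.maximum(mag, 1e-30) * x

def fista_l1(H, g, tau, n_iter=200):
    """Minimize (1/2)||g - H f||_2^2 + tau ||f||_1 by FISTA, Eqs. (10)-(12)."""
    L = np.linalg.norm(H, 2) ** 2                        # Lipschitz constant of grad E
    f = np.zeros(H.shape[1], dtype=complex)
    f_prev, h, t = f.copy(), f.copy(), 1.0
    for _ in range(n_iter):
        grad = H.conj().T @ (H @ h - g)                  # Eq. (14)
        f = soft_threshold(h - grad / L, tau / L)        # Eqs. (10) and (15)
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2  # Eq. (11)
        h = f + (t - 1.0) / t_next * (f - f_prev)        # Eq. (12)
        f_prev, t = f, t_next
    return f

rng = np.random.default_rng(0)
H = rng.normal(size=(40, 100)) + 1j * rng.normal(size=(40, 100))
f_true = np.zeros(100, dtype=complex)
f_true[[5, 37, 80]] = [2.0, -1.5j, 1.0 + 1.0j]           # a sparse complex signal
g = H @ f_true                                           # noiseless measurements
f_hat = fista_l1(H, g, tau=0.05)
```

Because the toy problem is sparse and noiseless, `f_hat` should concentrate its energy on the true support. In compressive holography, `H` is of course never formed as a dense matrix; it is applied through its FFT structure as described in Section 3.2.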

#### 3.2. GPU implementation

Figure 2 illustrates a schematic diagram of our GPU-accelerated compressive holography. We used an NVIDIA GTX 980 and CUDA 7.5. The GTX 980 has 2048 cores at 1126 MHz and 4 GB of GDDR5 RAM. For better performance on the GTX 980, we used single-precision data types. CUDA is a parallel computing platform and programming model invented by NVIDIA, which allows developers to use a CUDA-enabled GPU for general-purpose processing. In the CUDA programming model, the system consists of a host, which is a CPU, and one or more devices, which are GPUs. We write a function called a kernel that describes the work of a single thread on a device and invoke it with a large number of threads from the host. These threads are executed in parallel on thousands of cores on a GPU. This programming model naturally fits data-parallel computing and supports developers in using GPUs as massively parallel computing devices. Our GPU-accelerated compressive holography shown in Fig. 2 first sends the hologram data from the host to the device and invokes a kernel to compute the transfer functions. Then the kernels for FISTA are invoked as many times as the number of required iterations. After the FISTA steps, we obtain the solution (i.e., a scattering density) by transferring it from the global device memory to the host. Basically, the kernels for the transfer functions and FISTA are easily parallelized, because most operations in those kernels are vector operations in which each element can be processed independently by a single thread.

In general, multiplying by the measurement matrix $\mathbf{H}$ (and its adjoint $\mathbf{H}^H$) in the gradient computation Eq. (14) is the most compute-intensive part of FISTA. For compressive holography, multiplying by $\mathbf{H}$ can be computed efficiently by exploiting its structure; the multiplication reduces to 2D fast Fourier transforms (FFTs), multiplications by the transfer functions, and the sum of the depth slices $\mathbf{f}_i$. Figure 3 illustrates the multiplication by $\mathbf{H}$ using its structure. We used the cuFFT library to execute the 2D FFTs on the GPU. As shown in Fig. 3, multiplying by $\mathbf{B}$ and $\mathbf{Q}$ can be computed in parallel over the depth slices $\mathbf{f}_i$ due to the separability with respect to $\mathbf{f}_i$. Note that this block separability is an advantage when the program is executed on a multi-GPU system.
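A CPU-side sketch of this structure, written per depth slice with NumPy: the random diagonal operators below stand in for the actual transfer functions, and the sketch keeps only the complex-linear part of the operator (it omits the 2 Re(·) of Eq. (6)). The point is that both the forward multiplication and its adjoint reduce to per-slice FFTs, elementwise multiplications, and a sum over slices:

```python
import numpy as np

rng = np.random.default_rng(1)
nx, ny, nz = 32, 32, 4
# Random "transfer functions" standing in for the diagonals of the P_i of Eq. (2).
P = rng.normal(size=(nz, nx, ny)) + 1j * rng.normal(size=(nz, nx, ny))

def apply_A(f):
    """A f = F^{-1} Q B f: per-slice 2D FFT, transfer multiply, sum over slices."""
    spectrum = np.sum(P * np.fft.fft2(f, axes=(-2, -1)), axis=0)
    return np.fft.ifft2(spectrum)

def apply_AH(g):
    """Adjoint A^H g: 2D FFT, conjugate transfer multiply, per-slice inverse FFT."""
    return np.fft.ifft2(np.conj(P) * np.fft.fft2(g), axes=(-2, -1))

f = rng.normal(size=(nz, nx, ny)) + 1j * rng.normal(size=(nz, nx, ny))
g = rng.normal(size=(nx, ny)) + 1j * rng.normal(size=(nx, ny))
# Adjointness check: <A f, g> == <f, A^H g> for the complex inner product.
lhs = np.vdot(apply_A(f), g)
rhs = np.vdot(f, apply_AH(g))
```

Each slice's FFT and transfer multiplication is independent of the others, which is the block separability exploited for parallelization (and, potentially, for multi-GPU partitioning by depth).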

## 4. Results

We evaluated the performance of our GPU-based implementation by comparing it with a CPU-based implementation. Our CPU-based implementation uses gcc 4.8.3 as the C++ compiler, FFTW as the FFT library, and OpenMP for multi-threading, and runs on an Intel Core i7-4790K with four cores at 4 GHz using eight threads. Table 1 shows the computation times of the GPU-based and CPU-based implementations for $\ell_1$ and TV regularization. The computation times are the average of 20 runs on the GPU and 10 runs on the CPU. We set the number of iterations of FISTA to 600, which produced moderate reconstruction results in our cases. It should be noted that the total computation times of our implementations grew linearly with the number of iterations. For TV regularization, we set the number of iterations of the GP to 10. We changed the size of the reconstructed signals in the following two situations: (1) the hologram size $N_{xy} = N_x \times N_y$ was changed and the number of depths $N_z$ was fixed, and (2) the hologram size $N_{xy} = N_x \times N_y$ was fixed and the number of depths $N_z$ was changed. In both cases, the GPU-based implementation is about 20 times faster than the CPU-based implementation. Figure 4 illustrates the relationship between the computation time and the data size. As shown in Fig. 4(a), the computation time of the GPU-based implementation grows almost linearly with the data size, which indicates that data-parallel computing by the GPU-based implementation works efficiently. Figure 4(b) indicates that the slight degradation of the speed-up rate in the case of $N_x \times N_y \times N_z = 512 \times 512 \times 40$ in Table 1 is due to the performance improvement of the CPU-based implementation.

We also evaluated each calculation step in the GPU-based implementation in the case of $N_x \times N_y \times N_z = 1024 \times 1024 \times 10$, as shown in Fig. 5. Note that in Fig. 5, Data transfer (to device), Data transfer (to host), and Transfer function are executed only once in the GPU-based compressive holography, while Gradient computation, Proximal map (L1, soft), Proximal map (TV, 1 GP), and Update FISTA variables ($t_{k+1}$, $\mathbf{h}_{k+1}$, $\mathbf{f}_{k-1}$) are executed as many times as the number of iterations of FISTA, and those computation times grow linearly with the number of iterations of FISTA. Moreover, in a single FISTA iteration, the proximal map for TV is executed iteratively, and its computation time also increases linearly with the number of iterations of the GP. When the setting is the same as in Table 1 (600 iterations of FISTA and 10 iterations of the GP), the gradient computation, proximal map for $\ell_1$, proximal map for TV, and updating the FISTA variables take 5.53 sec, 0.884 sec, 27.7 sec, and 1.20 sec, respectively. In this case, the total time of data transfers (14.7 ms) is small enough to be negligible; the most time-consuming step in the $\ell_1$ case is the gradient computation, and in the TV case it is the proximal map through the GP.
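For reference, the per-step times above imply the following approximate per-regularizer totals (one-off data transfers included once; steps not listed above are excluded):

```python
# Per-step times in seconds, as reported for 600 FISTA iterations
# (with 10 GP iterations per FISTA iteration in the TV case).
gradient, prox_l1, prox_tv, update = 5.53, 0.884, 27.7, 1.20
transfers = 14.7e-3                                  # one-off host/device transfers

l1_total = gradient + prox_l1 + update + transfers   # ~7.6 s
tv_total = gradient + prox_tv + update + transfers   # ~34.4 s
```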

We identified bottlenecks of our implementation by profiling the kernels of the time-consuming steps: the gradient computation and the proximal map for TV (GP). For the gradient computation, the average compute resource utilization and memory bandwidth utilization were about 20% and 75%, respectively. For the GP, they were about 26% and 75%, respectively. This combination of low compute utilization and high bandwidth utilization indicates that the kernels were memory bound and that the bottleneck was the memory bandwidth between the multiprocessors and the device global memory in Fig. 2. The memory bandwidths of the CPU and GPU systems in our evaluation environment were 12.8 GB/sec (DDR3-1600) and 224 GB/sec (GDDR5), respectively. If we fully used these bandwidths, we could achieve a ×17.5 speed-up on the GPU. In light of this memory bandwidth analysis, we assess the performance of our GPU-based implementation (about ×20 speed-up) as a reasonable result. This analysis also indicates that efficient memory access is important for performance improvement. It should be noted that our GPU-based implementation has not been fully optimized; for example, part of the memory access in the GP is not fully coalesced. This could be improved by using the texture cache on GPUs, which is our future work. It should also be noted that when the data size is not sufficiently large, memory latency rather than memory bandwidth becomes the bottleneck. For example, when we tested the kernels of the GP with $N_x \times N_y \times N_z = 64 \times 64 \times 5$ data, the memory bandwidth utilization decreased to about 25%.
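The ×17.5 figure for memory-bound kernels follows directly from the ratio of the two peak memory bandwidths:

```python
cpu_bandwidth = 12.8   # GB/s, DDR3-1600
gpu_bandwidth = 224.0  # GB/s, GDDR5 on the GTX 980
ideal_speedup = gpu_bandwidth / cpu_bandwidth  # upper bound if both are bandwidth-limited
```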

To check our GPU-based implementation, we show an illustrative example of compressive holography by simulation. Figure 6(a) shows the 3D object used in the simulation. The 3D object has $N_x \times N_y \times N_z = 1024 \times 1024 \times 5$ samples with a 9.0 μm lateral pitch and a 2.0 mm axial pitch. It consists of 100 randomly distributed points of 10 × 10 pixels and four images of 256 × 256 pixels at different distances along the $z$ axis. The sensor has 1024 × 1024 pixels with a 9.0 μm pixel pitch and is located $z = 100$ mm away from the 3D object. A Gabor hologram with the DC term eliminated as in Eq. (6) was measured by illuminating the 3D object with a plane wave at a 633 nm wavelength. Figures 6(b) and 6(c) show images reconstructed by backpropagation and by compressive holography using our GPU-based implementation, respectively. While the images reconstructed by backpropagation are superposed with out-of-focus images, conjugate images, and squared fields, compressive holography suppresses these unwanted terms. Figure 6(c) was obtained through TV regularization with $\tau = 0.01$. We ran FISTA with 1000 iterations and 10 GP iterations, which took 29 sec on the GPU and 597 sec on the CPU.

## 5. Conclusion

In conclusion, we developed GPU-accelerated CS reconstruction for compressive holography. Our implementation uses FISTA to solve the *ℓ*_{1} and TV regularized problems that are typically used in compressive holography. In FISTA, multiplication by the measurement matrix is the compute-intensive task. This multiplication can be calculated efficiently by exploiting the structure of the measurement matrix, which in compressive holography consists of block 2D FFTs, multiplications by the transfer functions, and the sum of the depth slices. We compared the computation times of our GPU-based and CPU-based implementations, and the results show that the GPU-based implementation is about 20 times faster. We also evaluated our GPU-based implementation and identified its bottlenecks. This evaluation indicates that the most time-consuming step in the *ℓ*_{1} case is the gradient computation and in the TV case is the proximal map through the GP. The bottleneck of both calculation steps was memory bandwidth, and more efficient memory access would improve the performance.

The proposed GPU-acceleration scheme is applicable to other CS applications especially in digital holography [7, 24], for example, single-exposure in-line holography [25], compressive Fresnel holography [26], and multi-dimensional imaging [27].

## Acknowledgments

This work was partially supported by the Japan Society for the Promotion of Science Grant-in-Aid No. 25240015.

## References and links

**1. **D. L. Donoho, “Compressed sensing,” IEEE Trans. Inf. Theory **52**(4), 1289–1306 (2006). [CrossRef]

**2. **E. J. Candes, J. Romberg, and T. Tao, “Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information,” IEEE Trans. Inf. Theory **52**(2), 489–509 (2006). [CrossRef]

**3. **E. J. Candes and T. Tao, “Near-Optimal Signal Recovery From Random Projections: Universal Encoding Strategies?” IEEE Trans. Inf. Theory **52**(12), 5406–5425 (2006). [CrossRef]

**4. **E. J. Candes and M. Wakin, “An Introduction To Compressive Sampling,” IEEE Signal Proc. Mag. **25**(2), 21–30 (2008). [CrossRef]

**5. **D. J. Brady, K. Choi, D. L. Marks, R. Horisaki, and S. Lim, “Compressive Holography,” Opt. Express **17**(15), 13040–13049 (2009). [CrossRef] [PubMed]

**6. **J. Hahn, S. Lim, K. Choi, R. Horisaki, and D. J. Brady, “Video-rate compressive holographic microscopic tomography,” Opt. Express **19**(8), 7289–7298 (2011). [CrossRef] [PubMed]

**7. **S. Lim, D. L. Marks, and D. J. Brady, “Sampling and processing for compressive holography [Invited],” Appl. Opt. **50**(34), H75–H86 (2011). [CrossRef] [PubMed]

**8. **Y. Rivenson, A. Stern, and J. Rosen, “Reconstruction guarantees for compressive tomographic holography,” Opt. Lett. **38**(14), 2509–2511 (2013). [CrossRef] [PubMed]

**9. **K. Choi, R. Horisaki, J. Hahn, S. Lim, D. L. Marks, T. J. Schulz, and D. J. Brady, “Compressive holography of diffuse objects,” Appl. Opt. **49**(34), H1–H10 (2010). [CrossRef] [PubMed]

**10. **J. Nickolls, I. Buck, M. Garland, and K. Skadron, “Scalable parallel programming with CUDA,” Queue **6**(2), 40–53 (2008). [CrossRef]

**11. **J. Nickolls and W. J. Dally, “The GPU Computing Era,” IEEE Micro **30**(2), 56–69 (2010). [CrossRef]

**12. **J. D. Owens, M. Houston, D. Luebke, S. Green, J. E. Stone, and J. C. Phillips, “GPU Computing,” Proc. IEEE **96**(5), 879–899 (2008). [CrossRef]

**13. **D. S. Smith, J. C. Gore, T. E. Yankeelov, and E. B. Welch, “Real-Time Compressive Sensing MRI Reconstruction Using GPU Computing and Split Bregman Methods,” Int. J. Biomed. Imaging **2012**, 864827 (2012). [CrossRef] [PubMed]

**14. **Ç. Bilen, Y. Wang, and I. W. Selesnick, “High-Speed Compressed Sensing Reconstruction in Dynamic Parallel MRI Using Augmented Lagrangian and Parallel Processing,” IEEE J. Emerg. Sel. Topics Circuits Syst. **2**(3), 370–379 (2012). [CrossRef]

**15. **D. Xu, Y. Huang, and J. U. Kang, “GPU-accelerated non-uniform fast Fourier transform-based compressive sensing spectral domain optical coherence tomography,” Opt. Express **22**(12), 14871–14884 (2014). [CrossRef] [PubMed]

**16. **S. Bernabe, G. Martin, J. M. P. Nascimento, J. M. Bioucas-Dias, A. Plaza, and V. Silva, “GPU implementation of a hyperspectral coded aperture algorithm for compressive sensing,” in *Proceedings of IEEE International Geoscience and Remote Sensing Symposium* (IEEE, 2015), pp. 521–524.

**17. **L. I. Rudin, S. Osher, and E. Fatemi, “Nonlinear total variation based noise removal algorithms,” Phys. D **60**(1–4), 259–268 (1992). [CrossRef]

**18. **A. Chambolle, “An Algorithm for Total Variation Minimization and Applications,” J. Math. Imaging Vis. **20**(1), 89–97 (2004). [CrossRef]

**19. **A. Beck and M. Teboulle, “A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems,” SIAM J. Imag. Sci. **2**(1), 183–202 (2009). [CrossRef]

**20. **A. Beck and M. Teboulle, “Fast gradient-based algorithms for constrained total variation image denoising and deblurring problems,” IEEE Trans. Image Process. **18**(11), 2419–2434 (2009). [CrossRef] [PubMed]

**21. **J. W. Goodman, *Introduction to Fourier Optics*, 3rd ed. (Roberts & Company, 2005).

**22. **N. Parikh and S. Boyd, “Proximal Algorithms,” Foundations and Trends in Optimization **1**(3), 127–239 (2014). [CrossRef]

**23. **D. H. Brandwood, “A complex gradient operator and its application in adaptive array theory,” IEE Proc. H Microwaves Opt. Antennas **130**(1), 11–16 (1983). [CrossRef]

**24. **Y. Rivenson, A. Stern, and B. Javidi, “Overview of compressive sensing techniques applied in holography [Invited],” Appl. Opt. **52**(1), A423–A432 (2013). [CrossRef] [PubMed]

**25. **Y. Rivenson, A. Stern, and B. Javidi, “Improved depth resolution by single-exposure in-line compressive holography,” Appl. Opt. **52**(1), A223–A231 (2013). [CrossRef] [PubMed]

**26. **Y. Rivenson, A. Stern, and B. Javidi, “Compressive Fresnel Holography,” J. Disp. Technol. **6**(10), 506–509 (2010). [CrossRef]

**27. **R. Horisaki, J. Tanida, A. Stern, and B. Javidi, “Multidimensional imaging using compressive Fresnel holography,” Opt. Lett. **37**(11), 2013–2015 (2012). [CrossRef] [PubMed]