## Abstract

Digital holographic microscopy (DHM) is a well-known and powerful method that allows both the amplitude and phase of a specimen to be observed simultaneously. Obtaining a reconstructed image from a hologram requires numerous Fresnel diffraction calculations. These calculations can be accelerated with the fast Fourier transform (FFT) algorithm. However, real-time reconstruction from a hologram remains difficult even when a recent central processing unit (CPU) is used to calculate the Fresnel diffraction with the FFT algorithm. In this paper, we describe a real-time DHM system that uses a graphics processing unit (GPU) with many stream processors, which can serve as a highly parallel processor. The GPU calculates the Fresnel diffraction faster than recent CPUs. The real-time DHM system can obtain reconstructed images from holograms of 512×512 grids at 24 frames per second.

© 2008 Optical Society of America

## 1. Introduction

Digital holographic microscopy (DHM) is a well-known and powerful method that allows both the amplitude and phase of a specimen to be observed simultaneously [1, 2]. The technique electronically records a hologram, which carries the information of a specimen, using a CCD (charge-coupled device) or CMOS (complementary metal-oxide-semiconductor) image sensor. Obtaining a reconstructed image from a hologram requires numerous Fresnel diffraction calculations, which can be accelerated with the fast Fourier transform (FFT) algorithm [3]. However, real-time reconstruction from a hologram remains difficult even when a recent central processing unit (CPU) is used to calculate the Fresnel diffraction with the FFT algorithm. For example, reconstructing an image from a 512×512 hologram on an Intel Core2Duo E6300 CPU takes about 1 second for the Fresnel diffraction calculation.

Dedicated hardware is an effective way to obtain greater computational speed for the Fresnel diffraction. For example, a research group developed an FPGA (field-programmable gate array)-based board, FFT-HORN, to accelerate the Fresnel diffraction in DHPIV (digital holographic particle image velocimetry) [4, 5]. Their latest machine, FFT-HORN2, could obtain reconstructed images from holograms of 1024 × 1024 grids, captured by a DHPIV optical system, in about 33 milliseconds. The FPGA approach showed excellent computational speed; however, it has the following restrictions: the high cost of developing the FPGA board, the long development time, and the technical know-how that FPGA technology requires.

Recent GPUs (graphics processing units), on the other hand, contain many stream processors and can be used as highly parallel processors. Each stream processor is a simple scalar processor that can execute 32-bit floating-point addition, multiplication, and multiply-add instructions.

The approach of accelerating numerical calculations using a GPU chip is referred to as GPGPU (general-purpose computation on GPU) or GPU computing. The merits of GPGPU are high computational power, the low cost of the GPU board, and short development time. In the optics field, several studies have proposed using the GPGPU technique for fast calculation of a CGH (computer-generated hologram) [6, 7]. A well-known problem in CGH is the enormous calculation cost of generating a CGH from three-dimensional (3D) object data. These studies addressed this problem and generated a CGH from a simple 3D object in real time.

In this paper, we describe a real-time DHM system using the GPGPU technique. The GPU calculates the Fresnel diffraction faster than recent CPUs. The real-time DHM system can obtain reconstructed images from holograms of 512×512 grids at 24 frames per second.

## 2. Real-time digital holographic microscopy system

In this section, we describe our real-time DHM system using the GPU card.

#### 2.1. Outline of the real-time DHM system

Figure 1 shows the set-up for our real-time DHM system. The system mainly consists of two parts: the optical system for recording a hologram and the real-time calculation system using a GPU.

The optical system is a traditional DHM set-up [2]. As shown in the figure, we used a 5-mW He-Ne laser as a reference light. The wavelength of the laser is 632.8 nm. “BS” and “ND” indicate a beam splitter and a neutral density filter, and “M” and “MO” indicate a mirror and an objective lens. We used a CCD camera with a resolution of 1360×1024 and a pixel pitch of 4.65 μm × 4.65 μm, and a USAF 1951 test target as the sample. The holograms captured by the CCD camera are transferred to a personal computer (PC) via a USB 2.0 interface. The PC controls the GPU and the CCD camera. The GPU, a “GeForce 8800 GTS” made by NVIDIA, can calculate the Fresnel diffraction at high speed, thus allowing us to obtain reconstructed images from holograms at about 24 frames per second.

#### 2.2. Rapid calculation of the Fresnel Diffraction using the GPU

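Here, we briefly describe the Fresnel diffraction, which is expressed as:

$$u(x,y)=\frac{1}{i\lambda z}\mathrm{exp}\left(i\frac{2\pi}{\lambda}z\right)\iint a(\xi,\eta)\,\mathrm{exp}\left(i\frac{\pi}{\lambda z}\left({(x-\xi)}^{2}+{(y-\eta)}^{2}\right)\right)d\xi\,d\eta \tag{1}$$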

where (*x*,*y*) and (*ξ*,*η*) are the coordinates on the reconstruction plane *u*(*x*,*y*) and on the hologram *a*(*ξ*,*η*) captured by the CCD, respectively, *λ* is the wavelength of the reference light, and *z* is the distance from the hologram to the reconstruction plane. Using the convolution theorem, the Fresnel diffraction is expressed as [2, 3]:

$$u(x,y)=C\times a(x,y)\ast h(x,y)=C\times {F}^{-1}\left[F\left[a(x,y)\right]\cdot F\left[h(x,y)\right]\right] \tag{2}$$

where *C* and *h*(*x*,*y*) are defined as $C=\frac{1}{i\lambda z}\mathrm{exp}\left(i\frac{2\pi}{\lambda}z\right)$ and $h(x,y)=\mathrm{exp}\left(i\frac{\pi}{\lambda z}\left({x}^{2}+{y}^{2}\right)\right)$, and the operators *F*[∙] and *F*^{-1}[∙] denote the forward and inverse Fourier transforms, respectively. To calculate the Fresnel diffraction on a computer or a GPU, we must discretize Eqs. (1) and (2) and then apply the FFT algorithm.
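One possible discretization of Eq. (2) is sketched here. It assumes an *N* × *N* hologram sampled with the same pitch *p* in both directions; the symbols *N*, *p*, *m*, and *n* are introduced only for this illustration and do not appear in the original derivation. Setting *x* = *mp* and *y* = *np* with integer indices −*N*/2 ≤ *m*, *n* ≤ *N*/2 − 1 gives

$$h(m,n)=\mathrm{exp}\left(i\frac{\pi}{\lambda z}{p}^{2}\left({m}^{2}+{n}^{2}\right)\right),\qquad u(m,n)\approx C\,{p}^{2}\times {\mathrm{FFT}}^{-1}\left[\mathrm{FFT}\left[a(m,n)\right]\cdot \mathrm{FFT}\left[h(m,n)\right]\right]$$

The factor *p*^{2} comes from replacing the area element *dξ dη* of the integral by the area of one sampling cell. Like *C*, it is a constant and does not affect a normalized intensity image.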

Next, we describe rapid calculation of the Fresnel diffraction using the GPU. We used a GPU board made by GALAXY Technology, on which a “GeForce 8800 GTS” GPU chip is mounted. The specifications of the GPU board are a GPU clock of 1.2 GHz, a memory clock of 1.6 GHz, 96 stream processors (SPs), and 640 Mbytes of memory. The stream processor used in this paper is a scalar processor that can execute 32-bit floating-point addition, multiplication, and multiply-add instructions (note that the latest GPUs also support 64-bit floating-point operations). Therefore, the GPU chip has a peak performance of 2 operations/SP × 96 SPs × 1.2 GHz = 230.4 Gflops (giga floating-point operations per second). Thus, we can use the GPU chip as a highly parallel processor. We also used CUDA (Compute Unified Device Architecture) as the programming environment for the GPU chip [8]. In comparison with the Cg (C for Graphics) language and HLSL (High Level Shader Language), the advantage of CUDA is that source code can be written in a C-like language.
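The peak-performance figure can be checked with a few lines of arithmetic. The sketch below uses only the board specifications quoted above; the function and constant names are hypothetical, and reading the factor of 2 as one multiply-add counted as two floating-point operations is an assumption.

```python
# Theoretical peak throughput from the board specifications quoted in the text.
# Assumption: the factor of 2 counts one multiply-add as two floating-point operations.
OPS_PER_SP_PER_CYCLE = 2
STREAM_PROCESSORS = 96
SP_CLOCK_GHZ = 1.2


def peak_gflops(ops_per_cycle: int, processors: int, clock_ghz: float) -> float:
    """Return the theoretical peak in Gflops (10^9 floating-point operations per second)."""
    return ops_per_cycle * processors * clock_ghz


PEAK = peak_gflops(OPS_PER_SP_PER_CYCLE, STREAM_PROCESSORS, SP_CLOCK_GHZ)
print(f"{PEAK:.1f} Gflops")
```

This is an upper bound for single-precision arithmetic only; memory transfers and FFT data movement are not part of the estimate.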

In Eq. (2), the Fresnel diffraction involves two forward FFTs and one inverse FFT. For these we use the CUFFT library [9], which executes forward and inverse FFTs efficiently on the GPU chip. The calculation process is as follows.

First, we send the hologram captured by the CCD to the memory on the GPU board. Second, the GPU chip calculates the term *F*[*a*(*x*,*y*)] in Eq. (2) using CUFFT and stores the result in the memory on the board. Similarly, the GPU chip calculates the term *F*[*h*(*x*,*y*)] and stores the result in the memory. Third, the GPU chip calculates the complex multiplication of *F*[*a*(*x*,*y*)] and *F*[*h*(*x*,*y*)] and stores the result in the memory. Fourth, the GPU chip calculates *u*(*x*,*y*) = *F*^{-1}[*F*[*a*(*x*,*y*)]·*F*[*h*(*x*,*y*)]] using the inverse FFT and stores the result in the memory. Finally, the GPU chip calculates |*u*(*x*,*y*)|^{2}, and the host computer receives |*u*(*x*,*y*)|^{2}. On the PC, we normalize |*u*(*x*,*y*)|^{2} and convert it to an 8-bpp (bits per pixel) image. Here, we can ignore the term *C* in Eq. (2) because |*C*|^{2} is a constant factor, which the normalization of |*u*(*x*,*y*)|^{2} removes. For the above process, we used our numerical calculation library for wave optics using the GPU, the GWO (GPU-based Wave Optics) library [10, 11].
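The sequence of steps above can be illustrated with a small CPU-side sketch. It is not the CUFFT or GWO implementation: it uses a plain recursive radix-2 FFT from the Python standard library so that it runs anywhere, and all function names are hypothetical. The reconstruction distance passed in the usage example is an assumed value, while the pixel pitch and wavelength are taken from the set-up described earlier.

```python
import cmath
import math


def fft(x, inverse=False):
    """Unnormalized recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    sign = 1 if inverse else -1
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def fft2(grid, inverse=False):
    """2D FFT: transform every row, then every column; the inverse is scaled by 1/(rows*cols)."""
    rows = [fft(r, inverse) for r in grid]
    cols = [fft(list(c), inverse) for c in zip(*rows)]
    out = [list(r) for r in zip(*cols)]
    if inverse:
        scale = len(grid) * len(grid[0])
        out = [[v / scale for v in r] for r in out]
    return out


def fresnel_kernel(n, pitch, wavelength, z):
    """Sampled h(x, y) with its origin at index (0, 0), so the circular convolution adds no shift."""
    def coord(i):
        return (i if i < n // 2 else i - n) * pitch
    k = math.pi / (wavelength * z)
    return [[cmath.exp(1j * k * (coord(r) ** 2 + coord(c) ** 2)) for c in range(n)]
            for r in range(n)]


def reconstruct(hologram, pitch, wavelength, z):
    """Return |u|^2 for a square hologram; the constant C is omitted because it drops out on normalization."""
    n = len(hologram)
    fa = fft2(hologram)                                  # F[a]
    fh = fft2(fresnel_kernel(n, pitch, wavelength, z))   # F[h]
    prod = [[fa[r][c] * fh[r][c] for c in range(n)] for r in range(n)]  # complex multiplication
    u = fft2(prod, inverse=True)                         # inverse FFT gives u
    return [[abs(v) ** 2 for v in row] for row in u]     # intensity |u|^2


def to_8bpp(intensity):
    """Normalize an intensity image to integers in 0..255."""
    peak = max(max(row) for row in intensity)
    if peak == 0:
        return [[0 for _ in row] for row in intensity]
    return [[round(255 * v / peak) for v in row] for row in intensity]


if __name__ == "__main__":
    hologram = [[0.0] * 8 for _ in range(8)]
    hologram[0][0] = 1.0
    # 4.65 um pitch and 632.8 nm wavelength follow the text; z = 0.05 m is an assumed distance.
    image = to_8bpp(reconstruct(hologram, 4.65e-6, 632.8e-9, 0.05))
    print(image[0])
```

Because the FFT computes a circular convolution, the reconstructed field wraps around at the edges of the grid; zero-padding the hologram is a common remedy and is left out here for brevity.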

#### 2.3. Multiple threading for the real-time DHM system

The GPU can calculate the Fresnel diffraction faster than recent CPUs. To process frames more efficiently, we introduce multiple threading into the real-time DHM system. Multiple threading is a method by which a program splits itself into two or more simultaneously running tasks [12]. Figures 2(a) and 2(b) show flowcharts without and with multiple threading, respectively. Without multiple threading, the steps from capturing a hologram to displaying a reconstructed image are executed sequentially.

With multiple threading, in contrast, we use two threads. The main thread controls the CCD and displays the reconstructed image calculated by the GPU. The second thread controls the GPU. This allows the capture of the next hologram to overlap with the GPU calculation of a reconstructed image from the previous hologram, which improves the reconstruction rate.
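The two-thread arrangement can be sketched with standard queues. The following is a minimal illustration, not the system's actual code: `capture` and `reconstruct_on_gpu` are hypothetical stand-ins for the camera and GPU calls, and the hand-off through two queues is one assumed way to organize the threads.

```python
import queue
import threading


def capture(frame_id):
    """Hypothetical stand-in for grabbing one hologram from the CCD camera."""
    return {"id": frame_id, "hologram": [frame_id] * 4}


def reconstruct_on_gpu(frame):
    """Hypothetical stand-in for the Fresnel reconstruction on the GPU."""
    return {"id": frame["id"], "image": [2 * v for v in frame["hologram"]]}


def gpu_worker(holograms, images):
    """Second thread: reconstruct each hologram until a None sentinel arrives."""
    while True:
        frame = holograms.get()
        if frame is None:
            images.put(None)  # tell the main thread that no more images will follow
            break
        images.put(reconstruct_on_gpu(frame))


def run_pipeline(num_frames):
    """Main thread: capture and display, while the worker reconstructs the previous frame."""
    holograms = queue.Queue()
    images = queue.Queue()
    worker = threading.Thread(target=gpu_worker, args=(holograms, images))
    worker.start()
    displayed = []
    for frame_id in range(num_frames):
        # This capture overlaps with the worker's reconstruction of the previous frame.
        holograms.put(capture(frame_id))
        if frame_id > 0:
            displayed.append(images.get())  # "display" the previous frame's image
    holograms.put(None)
    while True:  # collect the images that are still in flight
        image = images.get()
        if image is None:
            break
        displayed.append(image)
    worker.join()
    return displayed


if __name__ == "__main__":
    for result in run_pipeline(3):
        print(result["id"], result["image"])
```

The stand-ins return immediately, so the sketch shows the hand-off structure rather than a measurable speed-up. With a real camera and GPU, the benefit comes from both calls spending most of their time waiting on hardware.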

## 3. Performance and optical results

Table 1 compares the reconstruction rate of the CPU alone with that of the GPU. The reconstruction rate is given in frames per second (fps) and includes the time from capturing a hologram to displaying a reconstructed image. The “CPU” in the table is an Intel Core2Duo E6300; the PC has 2 Gbytes of memory and runs Microsoft Windows XP Professional SP2. The hologram size is 512 × 512 grids. At this size, the GPU obtains reconstructed images about 20 times faster than the CPU. For this hologram size, single-precision floating-point arithmetic on the GPU chip is sufficient to obtain a reconstructed image from a hologram.

Figures 3 and 4 show snapshots of the reconstructed animation obtained using the CPU and the GPU chip, respectively. In the CPU-only case, we used the FFTW library [13] for the FFT calculations.

## 4. Conclusion and future work

We succeeded in developing a real-time DHM system using the GPGPU technique. A reconstruction rate of 24 frames per second was achieved for holograms of 512 × 512 grids. It is difficult to achieve this speed using a CPU alone. In future work, we plan to develop a real-time DHM system based on the GPGPU technique that can obtain 3D reconstructed images of a specimen using a phase-unwrapping method [2].

## Acknowledgment

This research was partially supported by Yamagata Promotional Organization for Industrial Technology and the Ministry of Education, Science, Sports and Culture, Grant-in-Aid for Young Scientists (B), 19700082, 2007.

## References and links

**1.** U. Schnars and W. Juptner, “Direct recording of holograms by a CCD target and numerical reconstruction,” Appl. Opt. **33**(2), 179–181 (1994).

**2.** U. Schnars and W. Jueptner, *Digital Holography - Digital Hologram Recording, Numerical Reconstruction, and Related Techniques* (Springer, 2005).

**3.** O. K. Ersoy, *Diffraction, Fourier Optics and Imaging* (Wiley-Interscience, 2006).

**4.** N. Masuda, T. Ito, K. Kayama, H. Kono, S. Satake, T. Kunugi, and K. Sato, “Special purpose computer for digital holographic particle tracking velocimetry,” Opt. Express **14**, 603–608 (2006).

**5.** Y. Abe, N. Masuda, H. Wakabayashi, Y. Kazo, T. Ito, S. Satake, T. Kunugi, and K. Sato, “Special purpose computer system for flow visualization using holography technology,” Opt. Express **16**, 7686–7692 (2008).

**6.** N. Masuda, T. Ito, T. Tanaka, A. Shiraki, and T. Sugie, “Computer generated holography using a graphics processing unit,” Opt. Express **14**, 587–592 (2008).

**7.** L. Ahrenberg, P. Benzie, M. Magnor, and J. Watson, “Computer generated holography using parallel commodity graphics hardware,” Opt. Express **14**, 7636–7641 (2006).

**8.** NVIDIA, “NVIDIA CUDA Compute Unified Device Architecture Programming Guide Version 1.1,” NVIDIA (2007).

**9.** NVIDIA, “CUDA FFT Library Version 1.1 Reference Documentation,” NVIDIA (2007).

**10.** T. Shimobaba, T. Ito, N. Masuda, Y. Abe, Y. Ichihashi, H. Nakayama, N. Takada, A. Shiraki, and T. Sugie, “Numerical calculation library for diffraction integrals using the graphic processing unit: the GPU-based wave optics library,” J. Opt. A: Pure Appl. Opt. **10**, 075308 (2008), http://www.iop.org/EJ/abstract/1464-4258/10/7/075308/.

**11.** The GWO library, http://sourceforge.net/projects/thegwolibrary/.

**12.** Wikipedia, http://en.wikipedia.org/wiki/Thread_(computer_science).

**13.** FFTW Home Page, http://www.fftw.org/.