## Abstract

In optical interferometers, fringe projection systems, and synthetic aperture radars, fringe patterns are common outputs and are usually degraded by unavoidable noise. The presence of noise makes phase extraction and phase unwrapping challenging. Windowed Fourier transform (WFT) based algorithms have proven effective for fringe pattern analysis in various applications. However, the WFT-based algorithms are computationally expensive, which prohibits their use in real-time applications. In this paper, we propose a fast parallel WFT-based library using graphics processing units (GPUs) and the compute unified device architecture (CUDA). Real-time WFT-based processing is achieved at 4 frames per second for 256×256 fringe patterns. Up to 132× speedup is obtained for the WFT-based algorithms on an NVIDIA GTX295 graphics card compared with sequential C code on a quad-core 2.5 GHz Intel(R) Xeon(R) E5420 CPU.

© 2009 OSA

## 1. Introduction

In optical interferometers, fringe projection systems, and synthetic aperture radars, fringe patterns are common outputs that carry unavoidable noise [1–3]. The presence of noise makes phase extraction and phase unwrapping challenging. Among the various techniques proposed for fringe pattern analysis, the recently developed windowed Fourier transform (WFT) based algorithms [4,5] are considered in this paper for the following reasons: (i) noise in fringe patterns can be effectively suppressed [4,5]; (ii) quality-guided phase unwrapping along with WFT processing has been shown to be successful [6]; (iii) phase extraction from a single fringe pattern has been demonstrated with the concept of frequency-guided demodulation [7]. The WFT algorithms have been successfully used in the measurement of 3D profiles [8], the thickness of thin films [9], the deformation of a force plate [10] and of a dragonfly's wings [11], refractive index [12], and vibration [13], as well as for comparison with newly developed algorithms [14,15]. However, the WFT algorithms are computationally expensive and time-consuming [4]. A fast solution is thus significant and highly desirable, especially for real-time applications. Many hardware devices have been proposed in photomechanics to accelerate numerical computation, including the field programmable gate array (FPGA) [16], digital signal processor (DSP) [17], computer cluster [18], multicore computer [19], and graphics processing unit (GPU) [20–22]. In this paper, we propose fast parallel WFT-based algorithms using the GPUs on a graphics card and the compute unified device architecture (CUDA). With an NVIDIA GTX295 graphics card, up to 132× speedup compared with the C code has been achieved. As an example, a 256×256 fringe pattern can be effectively filtered at a speed of 4 frames per second (fps), which is fast enough for many real applications. With the fast development pace of GPUs, even better acceleration performance can be expected in the near future.
In the rest of the paper, the WFT algorithms are briefly introduced and the parallelization of the algorithms for GPUs and CUDA is described. Their application to processing digital holographic fringe patterns is demonstrated. The speedup and accuracy of the parallelized WFT-based algorithms are reported and discussed.

## 2. Windowed Fourier transform for fringe pattern analysis and its parallelization principles

A two-dimensional windowed Fourier transform pair can be written as

$$Sf\left(u,v;\xi,\eta\right)=\int_{-\infty}^{\infty}\int_{-\infty}^{\infty}f\left(x,y\right)g_{u,v;\xi,\eta}^{*}\left(x,y\right)dxdy, \qquad (1)$$

$$f\left(x,y\right)=\frac{1}{4\pi^{2}}\int_{-\infty}^{\infty}\int_{-\infty}^{\infty}\int_{-\infty}^{\infty}\int_{-\infty}^{\infty}Sf\left(u,v;\xi,\eta\right)g_{u,v;\xi,\eta}\left(x,y\right)d\xi d\eta dudv, \qquad (2)$$

where $g_{u,v;\xi,\eta}\left(x,y\right)=g\left(x-u,y-v\right)\exp\left(j\xi x+j\eta y\right)$ is the windowed Fourier kernel, with $g\left(x,y\right)$ a window function (a Gaussian window with standard deviations $\sigma_x$ and $\sigma_y$ is used in this paper), $\left(u,v\right)$ as its translations, and $\xi$ and $\eta$ as its frequencies. Equation (1) decomposes a signal while Eq. (2) reconstructs it. Based on the windowed Fourier transform, two algorithms have been proposed: windowed Fourier filtering (WFF) and windowed Fourier ridges (WFR). The parallelization of the WFF algorithm will be described in detail. The parallelization of the WFR algorithm is similar to that of the WFF and will not be elaborated, to keep the paper concise. Nevertheless, the speedup performance of both the WFF and the WFR is given at the end of the paper.

The WFF algorithm filters an input fringe pattern by selecting a proper frequency band $\left(\xi,\eta\right)\in\left[\xi_l,\xi_h\right]\times\left[\eta_l,\eta_h\right]$ and thresholding the windowed Fourier spectrum $Sf\left(u,v;\xi,\eta\right)$, i.e., setting spectrum coefficients with small amplitudes to zero. The discrete WFF algorithm can be written as follows:

$$h_{\xi_s,\eta_s}\left(x,y\right)=g\left(x,y\right)\exp\left(j\xi_s x+j\eta_s y\right), \qquad (3)$$

$$\overline{f}\left(x,y\right)=\frac{\xi_i\eta_i}{4\pi^{2}}\sum_{\xi_s=\xi_l}^{\xi_h}\sum_{\eta_s=\eta_l}^{\eta_h}\overline{f\left(x,y\right)\otimes h_{\xi_s,\eta_s}\left(x,y\right)}\otimes h_{\xi_s,\eta_s}\left(x,y\right), \qquad (4)$$

where $\xi_i$ and $\eta_i$ are the sampling intervals of $\xi$ and $\eta$, respectively; $\otimes$ denotes convolution; $\overline{f\left(x,y\right)\otimes h_{\xi_s,\eta_s}\left(x,y\right)}$ denotes the thresholded spectrum; and $\overline{f}\left(x,y\right)$ denotes the reconstructed fringe pattern. More details of the WFF algorithm can be found in [4].

Knowing that a convolution can be realized by Fourier transforms, Eq. (4) can be rewritten, in order to reduce the computation time, as

$$\overline{f}=\frac{\xi_i\eta_i}{4\pi^{2}}\sum_{\xi_s=\xi_l}^{\xi_h}\sum_{\eta_s=\eta_l}^{\eta_h}\Im^{-1}\left\{\Im\left[\overline{\Im^{-1}\left(\Im\left(f\right)\Im\left(h_{\xi_s,\eta_s}\right)\right)}\right]\Im\left(h_{\xi_s,\eta_s}\right)\right\}, \qquad (5)$$

where $\Im$ and $\Im^{-1}$ denote the forward and inverse Fourier transforms, respectively; the coordinates $\left(x,y\right)$ on the right-hand side of the equation are omitted for conciseness. It can be seen that Eq. (5) consists of the following three major types of operations:

- (i) Pixel-based operations, including addition, multiplication, and thresholding. Since the operations on one pixel are independent of those on other pixels, operations on different pixels can be assigned to different computational units for simultaneous execution. Such operations are extremely efficient on a GPU, which contains hundreds of computational units called stream processors.
- (ii) Forward and inverse Fourier transforms. Through the divide-and-conquer paradigm and matrix-vector notation, the fast Fourier transform (FFT) has been developed into many parallel versions [23] and can be efficiently parallelized by the CUDA CUFFT library [24]. It should be noted that $\Im\left(f\right)$ can be pre-computed before the iteration starts, and that $\Im\left(h_{\xi_s,\eta_s}\right)$ appears twice but needs to be computed only once per iteration. Furthermore, since a Gaussian window is used, $\Im\left(h_{\xi_s,\eta_s}\right)$ can be theoretically expressed as $\Im\left(h_{\xi_s,\eta_s}\right)\left(\xi,\eta\right)=2\pi\sigma_x\sigma_y\exp\left[-\sigma_x^{2}\left(\xi-\xi_s\right)^{2}/2-\sigma_y^{2}\left(\eta-\eta_s\right)^{2}/2\right]$ for the unnormalized window $g\left(x,y\right)=\exp\left(-x^{2}/2\sigma_x^{2}-y^{2}/2\sigma_y^{2}\right)$, so that it can be evaluated directly without an FFT.

- (iii) Iterations over each $\left(\xi_s,\eta_s\right)\in\left[\xi_l,\xi_h\right]\times\left[\eta_l,\eta_h\right]$. These iterations are independent of one another, so different iterations can be assigned to different GPUs and calculated at the same time if multiple GPUs are available in the computer.

## 3. Graphics processing unit and compute unified device architecture

From the above analysis and discussion, it can be seen that GPUs are ideal hardware for accelerating the WFF algorithm. As suggested by its name, the GPU was designed to process thousands of threads simultaneously on underlying parallel stream processors for graphics rendering [25]. Recently, GPGPU (general-purpose computation on GPUs) was developed to leverage this powerful capability for general computation acceleration, especially image processing. Among hardware accelerators, the GPU stands out because of its tremendous cost performance, performance per watt, and short learning curve. Designed for NVIDIA GPUs, CUDA is a C-based general-purpose parallel programming model that resolves the programming difficulty and inefficient hardware mapping of traditional graphics APIs for GPGPU. The functions in CUDA code are called kernels. Based on the single-instruction, multiple-thread (SIMT) architecture [26], CUDA maps a single kernel to multiple threads that process different input data simultaneously. Under the CUDA programming model, GPUs can be regarded as computation devices operating as coprocessors to the central processing unit (CPU). The GPUs communicate with the CPU through the fast PCI-Express port on the motherboard of a personal computer (PC). Data-parallel and computationally intensive functions are offloaded to the GPUs to achieve better performance than sequential execution on the CPU. Only one copy of the source code is required, with mixed C code for the CPU and CUDA code for the GPUs. The NVIDIA "nvcc" compiler [27] extracts and compiles the code for the GPUs while the remaining code is compiled by a standard C compiler.

## 4. Parallel windowed Fourier filtering based on CUDA

The WFF algorithm is such a typical data-parallel, computationally intensive algorithm that it can benefit greatly from GPU hardware. As mentioned before, there are three types of operations in the WFF algorithm. Correspondingly, the following three approaches are explored to design the fast parallel WFF algorithm. Firstly, while the pixel-based operations are implemented by loops on a CPU, they can be implemented by a kernel on a GPU with the form "kernel_name<<<grid, block>>>()". After the kernel is launched from the CPU, it is executed "block×grid" times by "block×grid" lightweight CUDA threads according to the SIMT architecture. In this way, each iteration of the pixel-based operations is handled by one thread, and all the threads are equally distributed among the stream processors of the GPU for simultaneous execution [26]. The grid and block sizes should be large enough to hide the memory and instruction-pipeline latency and to scale across many GPU models. To make full use of the GPU capacity on the GTX295 graphics card, we configure a block size of 256 and a grid size of 90 to optimize the parallel WFF code.
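The launch-configuration arithmetic can be illustrated with a small host-side helper (hypothetical; the paper fixes block = 256 and grid = 90 for the GTX295, but a portable launch typically derives the grid size from the image size):

```c
/* Host-side helper mirroring a common CUDA launch configuration for a
 * pixel-based kernel: enough blocks of `block` threads each to cover all
 * n pixels, via ceiling division. The kernel would then guard against
 * the few surplus threads in the last block. */
unsigned int grid_size(unsigned int n, unsigned int block)
{
    return (n + block - 1) / block; /* ceiling of n / block */
}
```

For a 256×256 fringe pattern and 256 threads per block, this yields 256 blocks, comfortably above the 90 blocks the paper uses to saturate one GTX295 GPU.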

Secondly, we adopt the CUDA parallel FFT library, CUFFT, to accelerate the sequential FFT. The pseudo code in Fig. 1 describes the parallel WFF algorithm using one GPU. The functions in Fig. 1 are launched by the CPU and passed on to the GPUs for execution. It is worth mentioning that the CUFFT library achieves its best performance when the transform sizes are powers of two [28]. Therefore, we extend the padding size of the convolution to the nearest power of two to optimize the parallel WFF algorithm.
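The padding rule can be sketched as a small helper that rounds a convolution size up to the next power of two, the transform sizes for which CUFFT is documented to perform best (the helper name is ours):

```c
/* Round a convolution padding size up to the next power of two, so that
 * the padded transform hits CUFFT's fastest code path. */
unsigned int next_pow2(unsigned int n)
{
    unsigned int p = 1;
    while (p < n)
        p <<= 1;
    return p;
}
```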

Thirdly, the parallel WFF library can automatically detect the available GPUs in a PC and equally divide the frequencies along the *ξ* axis according to the number of GPUs. In Fig. 1, ${\xi}_{l}^{\text{'}}$ and ${\xi}_{h}^{\text{'}}$ indicate one of the equally divided frequency bands assigned to each GPU. After loading a fringe pattern, each GPU calculates its own section of frequencies independently. The CPU then reads back the individual results from the GPUs and synthesizes them.
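The even division of the sampled frequency band among GPUs can be sketched as follows (a hypothetical helper; `split_band` and its signature are ours, mirroring the multi-GPU split described above):

```c
/* Divide the sampled frequency range [xi_l, xi_h] (sampling step xi_i) as
 * evenly as possible among ngpu devices. Device `id` (0-based) receives the
 * contiguous sample indices [*first, *first + *count); any remainder samples
 * go to the lowest-numbered devices. */
void split_band(double xi_l, double xi_h, double xi_i,
                int ngpu, int id, int *first, int *count)
{
    int total = (int)((xi_h - xi_l) / xi_i + 0.5) + 1; /* number of samples */
    int base = total / ngpu, extra = total % ngpu;
    *first = id * base + (id < extra ? id : extra);
    *count = base + (id < extra ? 1 : 0);
}
```

With the paper's WFF settings ($\xi \in [-1,1]$, $\xi_i = 0.1$, 21 samples) and the two GPUs of a GTX295, one GPU would take 11 frequency samples and the other 10.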

## 5. Experiments and performance of parallel WFF and WFR using CUDA and GPU

The experimental code runs on a quad-core 2.5 GHz Intel(R) Xeon(R) E5420 CPU and an NVIDIA GTX295 graphics card with two GPUs. CUDA 2.3 and Visual Studio 2005 are adopted as the development software. We apply the proposed GPU-based parallel WFT library, in both single-precision and double-precision arithmetic, to the digital holographic fringe pattern shown in Fig. 2(a). Sequential MATLAB® 2008a (double precision) and C (double precision) versions of the WFT-based algorithms are also tested on the above CPU for comparison. The parameters are set as follows: for the parallel WFF algorithm, ${\xi}_{i}={\eta}_{i}=0.1$ and the threshold is 6; for the parallel WFR, ${\xi}_{i}={\eta}_{i}=0.025$; for both algorithms, ${\sigma}_{x}={\sigma}_{y}=10$ and $\left({\xi}_{s},{\eta}_{s}\right)\in \left[-1,1\right]\times \left[-1,1\right]$.

Although the accuracy of the GPU result is slightly sacrificed by using single-precision instead of double-precision arithmetic in CUDA, the loss of accuracy is tolerable. Figure 2(a) shows an input fringe pattern from a digital holographic interferometer. Figures 2(b) and 2(c) show the WFF results by the CPU (double precision) and the GPU (single precision). Their difference is shown in Fig. 2(d) and has a peak value of $1.03\times {10}^{-5}$ radians. If double-precision arithmetic is adopted on the GPUs, the peak difference decreases to ${10}^{-15}$ radians. The parallel WFR algorithm based on GPUs and CUDA has similar accuracy to the WFF in both single and double precision. Thus, although double precision is preferred, single precision can be adopted for practical use.
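The accuracy comparison above reduces to a peak absolute difference between two phase maps, which can be sketched as (helper name ours):

```c
#include <math.h>
#include <stddef.h>

/* Peak absolute difference between two phase maps of n samples each,
 * i.e. the metric used to compare single- and double-precision results
 * in Fig. 2(d). */
double peak_diff(const double *a, const double *b, size_t n)
{
    double peak = 0.0;
    for (size_t i = 0; i < n; i++) {
        double d = fabs(a[i] - b[i]);
        if (d > peak)
            peak = d;
    }
    return peak;
}
```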

Table 1 shows the computation times for the above experiments. For the WFF algorithm, the GPUs with single precision give up to 77× and 135× speedup over the sequential C and MATLAB® codes, respectively. For the WFR algorithm, 132× and 306× speedups are achieved over the sequential C and MATLAB® codes, respectively. For a 256×256 fringe pattern, only 0.25 second is needed to process one frame using the parallel WFF, which enables its application to many real-time measurements. Note that in some applications the frequency band for the WFF and the WFR can be narrower, and the proposed GPU-based parallel WFT library can consequently achieve an even higher frame rate.

## 6. Conclusions and discussions

In this paper, parallel windowed Fourier transform algorithms for fringe pattern analysis using graphics processing units (GPUs) are proposed. Windowed Fourier filtering is used as an example to show the parallelization technique for the GPU. Real-time windowed Fourier filtering is achieved at 4 fps for 256×256 digital holographic fringe patterns on a GTX295 graphics card. The dramatic speedup demonstrates the significance of parallelizing an algorithm. The GPU parallelization approach is not only technically sound but also practically attractive, as a personal computer can simply be equipped with a CUDA-supported graphics card through the PCI-Express interface. In addition, the GPU solution is cost-effective: even the most advanced CUDA-supported graphics card costs only several hundred US dollars. Finally, the developed parallel WFT code can run seamlessly on all CUDA-supported graphics cards. With either a more advanced GPU or more GPUs available in a PC, better acceleration performance is expected.

## References and links

**1. **D. W. Robinson and G. T. Reid, *Interferogram Analysis: Digital Fringe Pattern Measurement Techniques* (Institute of Physics, Bristol, England, 1993).

**2. **X. Su and W. Chen, “Fourier transform profilometry: a review,” Opt. Lasers Eng. **35**(5), 263–284 (2001).

**3. **D. C. Ghiglia and M. D. Pritt, *Two-Dimensional Phase Unwrapping: Theory, Algorithms and Software* (John Wiley & Sons, 1998).

**4. **Q. Kemao, “Two-dimensional windowed Fourier transform for fringe pattern analysis: Principles, applications and implementations,” Opt. Lasers Eng. **45**(2), 304–317 (2007).

**5. **Q. Kemao, H. Wang, and W. Gao, “Windowed Fourier transform for fringe pattern analysis: theoretical analyses,” Appl. Opt. **47**(29), 5408–5419 (2008).

**6. **Q. Kemao, W. Gao, and H. Wang, “Windowed Fourier-filtered and quality-guided phase-unwrapping algorithm,” Appl. Opt. **47**(29), 5420–5428 (2008).

**7. **H. Wang and Q. Kemao, “Frequency guided methods for demodulation of a single fringe pattern,” Opt. Express **17**(17), 15118–15127 (2009), http://www.opticsinfobase.org/oe/abstract.cfm?URI=oe-17-17-15118.

**8. **W. Chen, X. Su, Y. P. Cao, Q. C. Zhang, and L. Q. Xiang, “Method for eliminating zero spectrum in Fourier transform profilometry,” Opt. Lasers Eng. **43**(11), 1267–1276 (2005).

**9. **P. Hlubina, D. Ciprian, J. Lunacek, and R. Chlebus, “Phase retrieval from the spectral interference signal used to measure thickness of SiO2 thin film on silicon wafer,” Appl. Phys. B **88**(3), 397–403 (2007).

**10. **M. P. Arroyo, J. A. Bea, N. Andres, R. Osta, and M. Doblare, “Force plate for measuring small animal forces by digital speckle pattern interferometry,” Proc. SPIE **6616**, 66164D (2007).

**11. **P. Cheng, J. Hu, G. Zhang, L. Hou, B. Xu, and X. Wu, “Deformation measurements of dragonfly’s wings in free flight by using windowed Fourier transform,” Opt. Lasers Eng. **46**(2), 157–161 (2008).

**12. **W. Zhao, Y. Chen, L. Shen, and A. Y. Yi, “Refractive index and dispersion variation in precision optical glass molding by computed tomography,” Appl. Opt. **48**(19), 3588–3595 (2009).

**13. **Y. Fu, R. M. Groves, G. Pedrini, and W. Osten, “Kinematic and deformation parameter measurement by spatiotemporal analysis of an interferogram sequence,” Appl. Opt. **46**(36), 8645–8655 (2007).

**14. **J. A. Gómez-Pedrero, J. A. Quiroga, and M. Servín, “Adaptive asynchronous algorithm for fringe pattern demodulation,” Appl. Opt. **47**(21), 3954–3961 (2008).

**15. **S. Gorthi and P. Rastogi, “Numerical analysis of fringe patterns recorded in holographic interferometry using high-order ambiguity function,” J. Mod. Opt. **56**(8), 949–954 (2009).

**16. **T. Ito, N. Masuda, K. Yoshimura, A. Shiraki, T. Shimobaba, and T. Sugie, “Special-purpose computer HORN-5 for a real-time electroholography,” Opt. Express **13**(6), 1923–1932 (2005), http://www.opticsinfobase.org/oe/abstract.cfm?URI=oe-13-6-1923.

**17. **X. Wang, X. Peng, and J. Tan, “Three-dimensional digital imaging based on temporal phase unwrapping with parallel DSP,” Proc. SPIE **6723** (2007).

**18. **T. W. Ng, K. T. Ang, and G. Argentini, “Temporal fringe pattern analysis with parallel computing,” Appl. Opt. **44**(33), 7125–7129 (2005).

**19. **W. Gao, Q. Kemao, H. Wang, F. Lin, and H. S. Seah, “Parallel computing for fringe pattern processing: A multicore CPU approach in MATLAB® environment,” Opt. Lasers Eng. **47**(11), 1286–1292 (2009).

**20. **N. Masuda, T. Ito, T. Tanaka, A. Shiraki, and T. Sugie, “Computer generated holography using a graphics processing unit,” Opt. Express **14**(2), 603–608 (2006), http://www.opticsinfobase.org/oe/abstract.cfm?URI=oe-14-2-603.

**21. **T. Shimobaba, Y. Sato, J. Miura, M. Takenouchi, and T. Ito, “Real-time digital holographic microscopy using the graphic processing unit,” Opt. Express **16**(16), 11776–11781 (2008), http://www.opticsinfobase.org/oe/abstract.cfm?URI=oe-16-16-11776.

**22. **S. Liu, P. Li, and Q. Luo, “Fast blood flow visualization of high-resolution laser speckle imaging data using graphics processing unit,” Opt. Express **16**(19), 14321–14329 (2008), http://www.opticsinfobase.org/oe/abstract.cfm?URI=oe-16-19-14321.

**23. **C. Van Loan, *Computational Frameworks for the Fast Fourier Transform* (SIAM, 1992).

**24. **NVIDIA, “Tesla GPU computing solutions,” 2009 GPU workshop, http://www.idre.ucla.edu/events/2009/gpu-workshop/.

**25. **S. Rixner, *Stream Processor Architecture*, Springer International Series in Engineering and Computer Science **664** (2001).

**26. **NVIDIA, “CUDA Programming Guide Version 2.3” (2009), http://developer.download.nvidia.com/compute/cuda/2_3/toolkit/docs/.

**27. **NVIDIA, “The CUDA Compiler Driver NVCC” (2009), http://moss.csc.ncsu.edu/~mueller/cluster/nvidia/2.0/nvcc_2.0.pdf.

**28. **NVIDIA, “CUDA CUFFT Library 2.3” (2009), http://www.nvidia.com/object/cuda_develop.html.