In this article, we propose a massively parallel, real-time algorithm for the estimation of the dynamic phase map of a vibrating object. The algorithm implements a Fourier-based quadrature transform and temporal phase unwrapping technique. CUDA, a graphic processing unit programming architecture was used to implement the algorithm. It was tested on a fringe pattern sequence using three devices with different capabilities, achieving a processing rate greater than 1600 frames per second (fps).
© 2011 OSA
A Fourier-based quadrature transform and temporal phase unwrapping technique for estimation of the dynamic phase map from a vibrating object recently became available . The new method has significant computational advantages over traditional methods. The method allows the automatic processing of large fringe sequences, however, despite all of its advantages, its computation on a conventional computer takes long enough to make it unsuitable for real-time applications.
High-speed digital recording equipment allows to acquire a large sequence of images that record the fluctuations of the projected fringe due to object deformations. Although computer power has increased dramatically in the last 20 years, it is sill a challenge to use general purpose computing platforms to perform real-time image processing tasks. Graphic processing units (GPU), on the other hand, are massively parallel computers that are designed to work like numeric computing engines and, when paralellization is possible, they are perfectly suitable to implement image processing tasks in real-time. In this paper, we propose a real-time algorithm to estimate dynamic phase maps using a graphic processing unit. We tested the algorithms on three different CUDA capable devices , achieving a 20x speedup compared to similar code executed on a conventional computer.
The rest of this document is organized as follows: the next section provides a general description of the quadrature and temporal phase unwrapping method. Section 3 contains a general review of the capabilities of graphic processing units to implement parallel algorithms and a description of the proposed parallel algorithm. We describe experiments used to test the proposed algorithm and present experimental results in section 4, which is followed by a final discussion, and conclusions in section 5.
2. Theoretical development
The equation that models the observed dynamic fringe pattern can be defined as
The phase term ψ (r ,t) is defined as:3]. The term ϕ(r) is the carrier frequency of the fringe pattern, which is defined as 3].
A sequence of N frames are captured from the dynamic movement of the object in such way that
2.1. The general n-dimensional quadrature transform
As pointed out in , an alternative method to Fourier-based demodulation that avoids the need to select the appropriate filter manually in order to isolate the desired frequency information for every fringe pattern, is to estimate the quadrature transform of every fringe pattern [4, 5]. Assuming that the constant background illumination a(r) is filtered from the fringe pattern defined in Eq. (2), the following fringe pattern results 
Quadrature estimation consists of transforming the fringe pattern of Eq. (4) into its quadrature term defined by
The n-dimensional quadrature transform for fringe patterns with carrier frequency is defined as Eq. (6) we can write the complex fringe pattern of Eq. (5) as
2.2. Temporal phase unwrapping
Once the quadrature term has been computed using Eq. (6) the wrapped phase is obtained with6].
The transient phase change occurring over time can be retrieved by unwrapping the phase maps. The phase map φM(r) can be computed by the sum of the M – 1 phase differences using the following equation [8, 9]
3. Parallel implementation of the quadrature transform and phase unwrapping on CUDA
Graphics processing units are processors that were originally designed to perform graphics computation for visualization purposes. They recently have taken a prominent role in scientific computing and other areas where computational requirements are large. In comparison to modern general-purpose processors, using multiple cores to attain a performance increase, GPU’s have a much larger number of cores, and therefore are designed to execute tasks where the parallelism is substantial, and throughput is more important than latency .
The programming model of a GPU is single-program multiple-data (SMPD), i.e. the same program is concurrently executed in multiple processors, each one processing different data. The GPU hardware is organized as a set of multiprocessors which contain many cores that are capable to execute thousands of threads concurrently. In particular, we implement the algorithm described in this article using the CUDA architecture, a general purpose computing and development platform for NVIDIA GPU’s. CUDA provides a valuable abstraction that allows the organization of threads in a two-level hierarchy; at its most basic level, threads are grouped in blocks, while at a higher level these are organized like a grid. A block can be conceived as a three-dimensional array of threads, while a grid is a two-dimensional array of blocks. Threads within a block shares local memory and registers and can be scheduled on any of the available processor cores, either concurrently or sequentially; this allows a compiled CUDA program to run on different CUDA capable devices .
A kernel is a c-like function that is executed in parallel in the GPU. Each thread in a block executes the same kernel, and is given two unique indexes that together identify the thread. These indexes are accessible within the kernel as two variables: threadId and blockId, a 3-component and 2-component vectors that identify respectively the thread within the block and the grid. Each kernel can determine the portion of data it should process using these two indexes. The programming model considers the GPU as an external device, and it distinguishes explicitly the code executed sequentially on the host computer and the one executed in parallel on the GPU (kernels). Therefore, CUDA programs run on at least two devices, the host computer and one (or more) GPU. The memory space of the GPU device and the host computer are different and, as a result, a program needs to transfer data between host and the device.
The input of the algorithm to estimate the dynamic phase map is a sequence of images captured from the vibrating surface where a fringe pattern has been projected; the algorithm output is an estimation of the fringe pattern phase. The algorithm has two well defined stages: application of the quadrature transform to a frame and estimation of the wrapped phase and the unwrapping of the temporal phase. The first stage involves first transforming the frame to the Fourier domain, applying a band-pass filter, and finally computing the quadrature transform and transform it back into the space domain. Applying the filter and computing the quadrature transform to the frame is done in a single step: each frame element is multiplied by its corresponding filter element, while the quadrature transform –as can be seen from Eq. (8), can be applied to the filtered frame by simply blocking half the spectral domain for which ω⃗ · q < 0, and doubling the other half . The evaluation of the quadrature transform at every frame element in the Fourier domain is expressed in a linear equation as
The second stage involves the calculation of the wrapped phase map using Eq. (9) and then performing the temporal phase unwrapping using Eq. (10). The computation of the unwrapped phase map at each step depends only on the current and last wrapped phase map calculated, plus the accumulated phase map φm(r) over the last k – 1 algorithm iterations.
A very important feature of this procedure is that since every element is processed independently, the main two stages of the algorithm can be easily parallelized. This allows the concurrent kernel execution on several GPU threads, thus speeding up the computation process.
The algorithm 1 shows an outline of the method proposed. Each algorithm step is preceded by a label indicating where is executed. The labels M, H, and G indicate respectively memory transfer between the host and the GPU, execution in the host, and execution in the GPU. Three different operations are executed on the GPU: the computation of the FFT, the execution of the quadrature transform, and the phase map kernels, labelled respectively as G_FFT, G_K1 and G_K2. Although the algorithm does not show explicit synchronization barriers, it is assumed that the host waits for the GPU to complete its task before proceeding with its execution.
4. Experimental results
The methodology proposed in this article was tested using both a sequential and a parallel implementation of the algorithm. The sequential implementation uses the FFTW library  to compute the discrete Fourier transform while the CUFFT  –the DFT CUDA native library–is used in the parallel implementation. The code of the sequential implementation was a hand-optimized version of the one used in Ref. . The algorithms were tested in four different platforms: a computer with 4 GB of memory, and Intel Core2 duo processor running at 2.53 GHz, a NVIDIA GeForce 9400M GPU (16 cores), a Tesla C1060 GPU (240 cores), and a TESLA C2050 GPU (448 cores).
The algorithm was tested processing a sequence of 402 640 × 64 frames captured from a vibrating cantilever . The entire sequence was stored in the computer memory while the execution of the programs took place. The implementation used single-precision floating point numbers, in order to make the results comparable, given that not all the GPU used to test the algorithm supported double-precision numbers. Even though we expected some degradation on the results for the lack of precision, the phase maps estimated were consistent with the ones reported in Ref. . The dynamic phase map computed was used to generate a real-time 3D animation of the vibrating cantilever with the fringe pattern projected on it; two frames of that animation are shown in Fig. 1; an animation of the complete sequence can be found here ( Media 1).
Table 1 shows the execution times of the Fourier transform, the computation of the quadrature transform, the estimation of the phase map and the data transfer time data between devices. Table 1 also shows the total time needed to compute the whole sequence and the achieved speed in frames per second (FPS). In the table, we observe a linear acceleration as the number of GPU cores increases; the most time consuming task in the algorithm is the memory transfer between the host and the GPU, and the task that did benefit the most from the parallelization was the dynamic phase map estimation stage. There is no improvement in the time needed to compute the FFT, which can be explained by the relatively small frame size used in this experiment (almost one eighth of a VGA frame).
5. Discussions and conclusions
In this paper, we propose a parallel implementation of a temporal phase-unwrapping and Fourier-based quadrature transform method for real-time dynamic phase map estimation. We tested the algorithm on a sequence of frames of a vibrating surface, generating a 3D visualization of the dynamic phase map. The speed achieved is fast enough to perform the dynamic phase map estimation on real-time from images captured on high-speed cameras; the proposed algorithm would allow to perform real-time analysis on mobile devices –the GeForce-9400M is a graphic processing unit on a consumer-grade laptop computer.
R. Legarda-Saenz was supported by Consejo Nacional de Ciencia y Tecnología (México) under grants 25793-CB-2005-01-49753 and 104870-SNI08.
References and links
2. NVIDIA, NVIDIA CUDA C Programming Guide (version 3.2), http://developer.download.nvidia.com/compute/cuda/3_2_prod/toolkit/docs/ (2010).
3. K. J. Gåsvik, Optical Metrology, 3rd ed. (John Wiley & Sons Ltd., 2002). [CrossRef]
4. K. G. Larkin, D. J. Bone, and M. A. Oldfield, “Natural demodulation of two-dimensional fringe patterns. I. General background of the spiral phase quadrature transform,” J. Opt. Soc. Am. A 18, 1862–1870 (2001). [CrossRef]
5. M. Servin, J. A. Quiroga, and J. L. Marroquin, “General n-dimensional quadrature transform and its application to interferogram demodulation,” J. Opt. Soc. Am. A 20, 924–934 (2003). [CrossRef]
6. D. Ghigila and M. D. Pritt, Two-Dimensional Phase Unwrapping. Theory, Algorithms, and Software (John Wiley & Sons Ltd., 1998).
7. T. Kreis, Holographic Interferometry: Principles and Methods (Akademie Verlag, 1996).
9. S. De Nicola and P. Ferraro, “Fourier transform method of fringe analysis for moire interferometry,” J. Opt. A: Pure Appl. Opt. 2, 228–233 (2000). [CrossRef]
10. J. D. Owens, M. Houston, D. Luebke, S. Green, J. E. Stone, and J. C. Phillips “GPU computing,” Proc. IEEE 96(5), 879–899 (2008). [CrossRef]
11. M. Frigo and S. G. Johnson, “The design and implementation of FFTW3,” Proc. IEEE 93, 216–231 (2005). [CrossRef]
12. NVIDIA, NVIDIA CUDA CUFFT LIBRARY (PG-05327-032_V02), http://developer.download.nvidia.com/compute/cuda/3_2_prod/toolkit/docs/ (2010).
13. C. Meneses-Fabian, R. Rodriguez-Vera, J. A. Rayas, F. Mendoza-Santoyo, and G. Rodriguez-Zurita, “Surface contour from a low-frequency vibrating object using phase differences and the Fourier-transform method,” Opt. Commun. 272(2), 310–313 (2007). [CrossRef]