## Abstract

This paper proposes a high-speed FPGA architecture for the phase measuring profilometry (PMP) algorithm. The whole PMP algorithm is designed and implemented based on the principles of full pipelining and parallelism. The results show that the accuracy of the FPGA system is comparable with that of current top-performing software implementations. The FPGA system achieves 3D shape reconstruction from 12 phase-shifting images in 21 ms at 1024 × 768 pixel resolution. To the best of our knowledge, this is the first fully pipelined architecture for PMP systems, which makes the PMP system very suitable for high-speed embedded 3D shape measurement applications.

© 2017 Optical Society of America

## 1. Introduction

Phase measuring profilometry (PMP) uses multiple fringe patterns with a phase-shifting algorithm to obtain accurate, high-resolution three-dimensional (3D) profiles [1,2], and it has been widely used in manufacturing, medical imaging, virtual reality, computer graphics, and other fields. In recent years, with the development of industrial intelligence, industrial fields have placed increasingly high demands on embedded high-speed 3D measurement equipment. Many high-speed vision systems that can operate at hundreds of frames per second (fps) or more have been developed [3–6] using high-speed cameras and projectors. However, because of the highly complex calculations and large memory consumption, it is very hard to achieve high-speed PMP calculation on an ordinary computer, and such a computer is not a good option for embedded industrial field applications.

To solve this problem, many researchers have tried to boost PMP calculation speed by employing graphics processing units (GPUs). Zhang et al. [3] reported a GPU-assisted real-time 3D shape measurement using a 2 + 1 phase-shifting algorithm, achieving 25.56 fps at 532 × 500 pixels. Nguyen et al. [4] proposed a real-time 3D measurement system with a hybrid CPU and GPU architecture, which relied on combining three gray-scale phase-shifted fringe patterns into a single color image. This system reaches 45 fps at 640 × 480 pixels with two-fringe-frequency phase unwrapping, and 22.5 fps with four-fringe-frequency phase unwrapping for enhanced accuracy. We have also worked on GPU acceleration of PMP and other vision algorithms [5,7]. Although GPUs have proved to be an attractive speedup platform for phase matching, their high power consumption restricts their applicability, and they are not feasible for embedded applications.

Compared with GPUs, an FPGA is a promising platform for stereo vision algorithms because it has two advantages: 1) reconfigurable processing units and customized memory hierarchies and 2) low power. Existing stereo vision algorithms have been executed on FPGAs; for example, Jin et al. [8] built an FPGA-based stereo vision system using the census transform, which provides dense disparity information with sub-pixel accuracy at 230 fps for 640 × 480 resolution. Wang et al. [9] presented the first FPGA design and implementation for Kinect-like depth sensing. With an elaborate fully pipelined hardware architecture, 227 fps 3D reconstruction at 1024 × 768 resolution was achieved on an ordinary FPGA. These applications show the great potential and advantages of implementing the PMP algorithm on an FPGA platform. To the best of our knowledge, however, the PMP algorithm has never been implemented on an FPGA because of its highly complex calculations and large memory consumption.

This paper presents a high-speed FPGA architecture for the PMP algorithm. In this architecture, the phase/SNR calculation, phase unwrapping, phase rectification, phase matching, and 3D reconstruction modules are designed and implemented based on the principles of full pipelining and parallelism. We designed various strategies and data buffers to optimize resource consumption, accuracy, and the initial pipeline latency. The PMP algorithm has been re-implemented at the logic gate level in Verilog (a hardware description language), and the algorithm modules have been re-designed to meet the full-pipeline requirements. As a result, the proposed architecture completes the whole PMP process in 21 ms at 1024 × 768 resolution, and it is intensively pipelined and synchronized with a system clock. The system can be more than 100 times faster than the same calculation on an ordinary CPU. The results also show that its accuracy is comparable with that of current top-performing software implementations.

This paper makes the following contributions: 1) The FPGA design and implementation are of great significance for PMP 3D measurement sensors with embedded calculation units, like the implementation of the popular Kinect; this is the first implementation of the PMP algorithm on an FPGA. 2) It is not straightforward to migrate an algorithm written for a general-purpose platform to a parallel, fully pipelined design on an FPGA; instead, we redesigned the whole algorithm implementation, creating a fully pipelined parallel PMP architecture. 3) This paper also reports the performance in terms of accuracy, speed, and resource costs, and fully addresses the challenges of high-speed PMP measurement. This work is very suitable for the development of PMP 3D measurement sensors with embedded calculation units that do not rely on high-end calculation platforms.

## 2. FPGA implementation of PMP

This section presents our implementation of the PMP algorithm on a single FPGA. Figure 1 describes the overall system architecture. There are two main kinds of components in an FPGA, memory units (e.g., FIFOs, RAM, and buffers) and calculation units. The captured images and other data are stored in SDRAM. The memory controller is responsible for managing the DDR2 SDRAM and transferring data from the SDRAM to FIFOs so that the calculation units can read the required data from the FIFOs directly. The calculation units (phase/SNR calculation, phase unwrapping, phase rectification, phase matching, and 3D reconstruction) execute the PMP algorithm, and are described in detail in this section.

Compared with the instruction-cycle mechanism of a computer, an FPGA can process data more efficiently via a customized data path. For efficient implementation, the 3D point coordinate for the corresponding pixel is generated synchronously with the system clock after the initial pipeline latency. Parallelism and pipelining are intrinsic features of an FPGA, and many pipelining and parallelism designs were made at a reasonable resource cost. To achieve pipelining, we redesigned many modules to improve computational performance. For example, the phase calculation unit contains an arc-tangent function, which usually consumes many instruction cycles on a CPU platform, where the numerous repetitive operations cause a long delay. To solve this problem, the CORDIC (COordinate Rotation DIgital Computer) algorithm is adopted to divide the arc-tangent function into simpler functional elements, ensuring that this module works in a pipeline with a few system clock cycles. The phase calculation modules are also parallelized to ensure that the three different frequency phase values are obtained in the same clock cycle. A detailed description of each module is given below.

#### 2.1 Phase/SNR calculation module

In our FPGA-based PMP architecture, the basic idea of the phase calculation is well known: by capturing a series of phase-shifted fringe patterns, the initial phase value can be calculated to obtain the 3D information.

As described in the literature [10], the standard four-step phase-shifting method is adopted in this paper to calculate the principal phase value. The four pattern shifts are 0, π/2, π, and 3π/2, and the intensity and principal phase calculation functions are:

$$I_n(x,y) = A(x,y) + B(x,y)\cos\!\left[\varphi(x,y) - \frac{(n-1)\pi}{2}\right], \quad n = 1,2,3,4, \tag{1}$$

$$\varphi(x,y) = \tan^{-1}\!\left(\frac{I_4 - I_2}{I_3 - I_1}\right), \tag{2}$$

where $A$ is the average intensity, $B$ is the intensity modulation, and $\varphi$ is the principal phase value.

Correspondingly, the ratio of $B$ to $A$ indicates the signal-to-noise ratio (SNR) of the current pixel, with a value range of [0, 1]. The SNR determines the availability of each pixel's phase value, which is judged by comparing the SNR against a custom threshold. With $B$ and $A$ representing the signal and background intensity, respectively, the SNR $\gamma$ can be calculated as:

$$\gamma = \frac{B}{A} = \frac{2\sqrt{(I_3 - I_1)^2 + (I_4 - I_2)^2}}{I_1 + I_2 + I_3 + I_4}. \tag{3}$$
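The per-pixel phase/SNR arithmetic can be sketched as a floating-point reference model (the function name is ours; the hardware uses fixed-point CORDIC rather than a library arc-tangent):

```python
import math

def phase_and_snr(I1, I2, I3, I4):
    """Reference model of the per-pixel phase/SNR step.

    x and y follow the CORDIC inputs of Fig. 2(a): x = I3 - I1, y = I4 - I2.
    The recovered phase matches the fringe phase up to the pattern's sign
    convention, which the hardware resolves with its quadrant logic.
    """
    x = I3 - I1
    y = I4 - I2
    phi = math.atan2(y, x) % (2.0 * math.pi)  # principal phase in [0, 2*pi)
    B = 0.5 * math.hypot(x, y)                # modulation (signal) amplitude
    A = (I1 + I2 + I3 + I4) / 4.0             # average (background) intensity
    gamma = B / A if A > 0 else 0.0           # SNR; in [0, 1] for real fringes
    return phi, gamma
```

Feeding the model four synthetic fringe samples of a known phase recovers that phase and the expected B/A ratio, which is a quick way to sanity-check a fixed-point hardware trace.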

We designed the phase/SNR calculation module on the FPGA using Eqs. (2) and (3) above. Because the arc-tangent function is difficult to compute directly on an FPGA, CORDIC is adopted in this module. CORDIC is an algorithm that decomposes the arc-tangent calculation into uniform shift and add/subtract operations [11]. Given the two values to be calculated, CORDIC treats them as a vector $(x, y)$ and rotates the input vector iteratively by specified angles ${\beta}_{i}$, aiming to align the resultant vector with the x-axis while driving the $y$ coordinate to zero. The rotation equations are:

$$x_{i+1} = x_i - d_i\, y_i\, 2^{-i}, \quad y_{i+1} = y_i + d_i\, x_i\, 2^{-i}, \quad z_{i+1} = z_i - d_i {\beta}_{i}, \tag{4}$$

where ${\beta}_{i} = \tan^{-1}(2^{-i})$ and $d_i = -1$ if $y_i \geq 0$, $+1$ otherwise.

After $n$ iterations, the rotation produces the following result:

$$x_n = A_n\sqrt{x_0^2 + y_0^2}, \quad y_n = 0, \quad z_n = z_0 + \tan^{-1}\!\left(\frac{y_0}{x_0}\right), \quad A_n = \prod_{i=0}^{n-1}\sqrt{1 + 2^{-2i}}. \tag{5}$$

The main results of the vectoring operation are the angle of rotation and the magnitude of the original vector. ${A}_{n}$ is the processor gain, which depends only on the number of iterations and is independent of the input vector, so it can be computed in advance and hard-coded into the source. The phase value ${z}_{n}={\tan}^{-1}(y/x)$ is obtained directly if the angle accumulator ${z}_{0}$ is initialized to zero. In addition, the result of $\sqrt{{x}^{2}+{y}^{2}}$, which is one part of the SNR calculation, can be obtained by dividing the ${x}_{n}$ component by ${A}_{n}$. This is much simpler to implement in hardware than a Pythagorean processor.
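To make the vectoring iteration concrete, here is a small floating-point model of the CORDIC loop described above (a sketch for illustration; the hardware uses fixed-point shift registers and unrolls the loop into pipeline stages):

```python
import math

def cordic_vectoring(x, y, iterations=12):
    """Vectoring-mode CORDIC: rotate (x, y) toward the positive x-axis.

    Returns (magnitude, angle): after n iterations x_n / A_n approximates
    sqrt(x0^2 + y0^2) and z_n approximates atan2(y0, x0), for x0 > 0.
    Only shifts and add/subtract operations are used per iteration,
    mirroring the per-stage hardware of Fig. 2(b).
    """
    betas = [math.atan(2.0 ** -i) for i in range(iterations)]  # precomputed angles
    gain = 1.0
    for i in range(iterations):
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))  # processor gain A_n, known in advance
    z = 0.0
    for i, beta in enumerate(betas):
        d = -1.0 if y >= 0 else 1.0               # choose rotation to drive y to zero
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * beta                              # angle accumulator
    return x / gain, z                             # magnitude, arc-tangent
```

Dividing by the precomputed gain at the end corresponds to the hard-coded $A_n$ compensation mentioned above; in hardware this division is typically replaced by a constant multiplication.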

The pipelined structure of the CORDIC processor is shown in Fig. 2(b). Each pipeline stage processes one iteration, and each stage requires a shift register and an adder/subtractor. The number of shift operations differs between stages. The specified angles are set in advance and used as constant inputs for each adder in the angle accumulator chain. As the number of iterations increases, the precision of the result improves while the hardware resource usage grows. As Fig. 2(a) shows, the inputs of the CORDIC unit are ${x}_{0} = {I}_{3}-{I}_{1}$ and ${y}_{0} = {I}_{4}-{I}_{2}$. The phase value φ and the corresponding SNR value are generated after the initial pipeline latency and updated in every clock cycle.

To evaluate the phase/SNR calculation accuracy on different platforms, we compared the FPGA and MATLAB platforms. Figure 3(a) is a captured image of a standard ceramic plane under a modulated phase-shifting pattern. To further examine the relationship between accuracy and the number of iterations, different iteration counts were executed on the FPGA to compare the phase/SNR values. The deviation from MATLAB's results is shown in Fig. 4. Considering both accuracy and resource consumption, we used 12 iterations in the design. Figures 3(b) and 3(c) are the phase and SNR results obtained by the FPGA. The average deviation is less than 0.00015 rad for the phase calculation and 0.0003 for the SNR calculation.

#### 2.2 Phase unwrapping module

The principal phase in Eq. (2) has a value range of [0, 2π] with 2π discontinuities. To obtain the unique absolute phase, a phase unwrapping module follows the phase calculation. Many phase unwrapping algorithms [12–15] have been reported in recent years; this paper uses the multi-frequency heterodyne method [16], which is highly robust in industrial applications. As Fig. 1 shows, three sets of wrapped phase values ${\varphi}_{1},{\varphi}_{2},{\varphi}_{3}$, calculated from 12 images, enter this module synchronously. Three simultaneous heterodyne operations are executed to determine the absolute phase. Note that the filter operation has a special design to keep the module pipelined.
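One heterodyne step can be sketched as a behavioral model (a simplified sketch of our own; the actual fringe frequencies, and hence the scaling ratio between the beat phase and the fine phase, are fixed by the projected pattern design):

```python
import math

TWO_PI = 2.0 * math.pi

def heterodyne(phi_a, phi_b):
    """Beat phase of two wrapped phases, wrapped back into [0, 2*pi)."""
    return (phi_a - phi_b) % TWO_PI

def unwrap_with_beat(phi_fine, phi_beat, ratio):
    """Unwrap a fine wrapped phase using a coarser (beat) phase.

    `ratio` is lambda_beat / lambda_fine, i.e. how many fine fringe
    periods fit in one beat period (an assumption about the fringe set).
    The beat phase predicts the absolute phase coarsely; rounding picks
    the integer fringe order k that reconciles the two.
    """
    k = round((ratio * phi_beat - phi_fine) / TWO_PI)  # integer fringe order
    return phi_fine + TWO_PI * k
```

In the three-frequency scheme, this step is applied in a cascade: two heterodyne operations produce intermediate beat phases, and a final beat with an equivalent wavelength covering the whole field resolves the fringe order unambiguously.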

Before the heterodyne operations, the original phase value needs to be preprocessed by a Gaussian filter, which improves the smoothness of the phase value and avoids erroneous 2π jumps [17]. Because these erroneous jumps occur close to 0 or 2π, directly applying a weighted average to the phase value is not effective. To correct such jump errors, we express the phase value through its sine and cosine, which are calculated by CORDIC arithmetic just as the arc-tangent operation is. Subsequently, the weighted-average sub-module averages the sine and cosine results separately over a 3 × 3 block. To obtain this block, we use a shift register buffer and a window buffer. As shown in Fig. 5(a), the shift register includes two line buffers, each the length of the image width, ensuring that 3 pixels of data in one column are exported simultaneously. The window buffer is a group of nine registers that buffer the pixels. The window buffer data are combined with the kernel by the weighted-average method. This kernel is shown in Fig. 5(b), and the calculation can be realized by bit-shift operations instead of multiplication or division. Note that pixels near the image boundaries are handled by adding extra padding data. Finally, the arc-tangent sub-module restores the phase from the averaged sine and cosine results, and the output is a filtered phase value.
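The filtering idea can be illustrated with a small software model. The 3 × 3 kernel below is an assumption on our part (a binomial kernel whose weights sum to 16, so the division reduces to a 4-bit right shift, consistent with the bit-shift realization described above); the essential point is averaging sine and cosine rather than the phase itself:

```python
import math

# 3x3 kernel realizable with bit shifts (an assumed instance of Fig. 5(b));
# the weights sum to 16, so dividing becomes a >> 4 in hardware.
KERNEL = [[1, 2, 1],
          [2, 4, 2],
          [1, 2, 1]]

def filter_phase_3x3(phase_block):
    """Smooth one wrapped-phase pixel from its 3x3 neighborhood.

    Averaging sin/cos separately, then restoring the angle with an
    arc-tangent, avoids the erroneous 2*pi jumps that a direct weighted
    average of phase values produces near the 0 / 2*pi boundary.
    """
    s = c = 0.0
    for r in range(3):
        for col in range(3):
            w = KERNEL[r][col]
            s += w * math.sin(phase_block[r][col])
            c += w * math.cos(phase_block[r][col])
    return math.atan2(s, c) % (2.0 * math.pi)  # restored, filtered phase
```

A neighborhood mixing values just below 2π with values just above 0 filters to a result at the wrap boundary, whereas a naive weighted average of the raw phases would pull it toward π.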

Here, we check the unwrapping results on different computing platforms. Based on the data described in the last section, the FPGA module obtains the phase unwrapping result shown in Fig. 6(a). The calculation deviation from MATLAB is presented in Fig. 6(b), which shows that there are no 2π period errors in the phase unwrapping calculation.

#### 2.3 Rectification module

A rectification operation [18,19] is applied to the unwrapped phase map obtained by the last module. The rectified images can be thought of as acquired by a new stereo rig, obtained by rotating the original cameras around their optical centers. The most important advantage of rectification is that it turns the 2D search into a 1D search problem, reducing the complexity of corresponding-point matching. Hence, we can obtain the corresponding points easily along the horizontal raster lines of the rectified images.

The rectification coefficient matrix is generated once the camera has been calibrated, so the relative mapping coordinates of an entire image can be calculated off-line and stored in external memory. The target pixel's address is determined by the relative coordinate and the base pixel address. Because the target address has sub-pixel accuracy, the phase values of four neighboring pixels are needed to evaluate the sub-pixel phase.

The framework of this module is shown in Fig. 7. It consists of three main parts: RAM, rectify address calculation, and bilinear interpolation. To meet the pipeline design, we need a RAM to store the unwrapped phase values. However, storing an entire image in memory would be too wasteful. Based on our analysis, we assume that the target pixel's address deviates by at most ± 35 scan-lines. As a result, we use a dual-port RAM that stores 70 scan-lines of pixels to buffer the unwrapped phase values circularly.

The phase values are written into the RAM in order, with the write address increasing one by one; when the write address exceeds the length of the RAM, it wraps around to zero. Note that the write address is not the base pixel address: they are separated by a distance of 35 lines. For instance, if the first pixel of the 34th row in the RAM is the base pixel, the write address points to the first pixel of the 69th row. This distance ensures that ± 35 lines of data around the base pixel exist in the RAM. Hence, the base pixel address can be computed from the write address plus or minus 35 lines' length. The base pixel address is then sent to the rectify address calculation together with a reference coordinate. The results are four read addresses, which point to the four neighboring pixels of the target pixel in the RAM. To keep data input and output synchronous, four pixels' phase values must be read out while each new pixel's phase is written in; for this reason, the read clock frequency is four times the write clock frequency. After the four pixels' data are read out, the bilinear interpolation sub-module takes over and outputs the sub-pixel phase value of the target pixel.
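The circular scan-line buffer and the interpolation step can be modeled in software as follows (a behavioral sketch with class and function names of our own; the hardware uses a dual-port RAM and fixed-point arithmetic):

```python
def bilinear(p00, p01, p10, p11, fx, fy):
    """Bilinear interpolation from four neighbors; (fx, fy) in [0, 1)."""
    top = p00 * (1 - fx) + p01 * fx
    bot = p10 * (1 - fx) + p11 * fx
    return top * (1 - fy) + bot * fy

class LineRing:
    """Circular buffer of `lines` scan-lines, mirroring the 70-line RAM."""

    def __init__(self, lines, width):
        self.lines, self.width = lines, width
        self.data = [0.0] * (lines * width)
        self.write_addr = 0

    def push(self, value):
        """Write the next phase value; the address wraps to zero at the end."""
        self.data[self.write_addr] = value
        self.write_addr = (self.write_addr + 1) % len(self.data)

    def read(self, row, col):
        """Read a buffered phase value; rows index the ring modulo `lines`."""
        return self.data[(row % self.lines) * self.width + col]
```

In hardware the four `read` calls for one output pixel are served within a single write period by running the read port at four times the write clock, as described above.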

The rectification result obtained by the FPGA is shown in Fig. 8(a). We also compare it with the MATLAB result, which is shown in Fig. 8(b). We can see that the largest phase deviation is less than 0.05 rad, and this result satisfies the system accuracy requirements.

#### 2.4 Corresponding points matching module

After rectification, the corresponding sub-pixel positions between the camera and projector can be worked out by searching for the closest phase value along the corresponding projector line in the 1D direction. The projector pixels' phase values for searching are stored in the reference phase FIFO. Because the reference phase increases monotonically along each line, a binary search algorithm is practical in this design. We use a line buffer as a lookup table and fill it with the corresponding line's phase values, read from the reference FIFO. Because a line buffer can be occupied by only one pixel at a time, we use N match sub-modules for N pixels, where N is the number of clock cycles that one pixel's entire lookup process costs; each pixel thus occupies a sub-module in turn. The framework of this module is illustrated in Fig. 9. When the sub-module of pixel (1) finishes searching, pixel (N + 1) uses this sub-module next. Furthermore, two identical groups of line buffers are needed for a ping-pong operation: while one group is occupied by searching, the other is updated with the next scan-line's data. This ping-pong design reduces time consumption. After searching, the rectified pixel's phase value lies between two neighboring phase values in the line buffer, and the positions of these two neighboring phases are obtained. Linear interpolation is then used to calculate the sub-pixel position. Finally, the sub-pixel corresponding point positions are sent to the 3D reconstruction sub-module in order.
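The lookup performed by each match sub-module amounts to a binary search over a monotonic line followed by linear interpolation; a compact software model (names ours) might look like this:

```python
from bisect import bisect_left

def match_subpixel(ref_line, phi):
    """Sub-pixel position in a monotonically increasing reference phase
    line where the phase equals `phi`.

    A binary search locates the bracketing pair of reference phases, and
    linear interpolation refines the position to sub-pixel accuracy, as
    in the match sub-modules of Fig. 9.
    """
    j = bisect_left(ref_line, phi)
    if j == 0 or j == len(ref_line):
        return None                          # phi falls outside this scan-line
    lo, hi = ref_line[j - 1], ref_line[j]
    return (j - 1) + (phi - lo) / (hi - lo)  # sub-pixel column position
```

For a buffer of length L, the search costs about log2(L) comparisons, which fixes the number N of clock cycles per lookup and hence the number of parallel match sub-modules needed.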

The final 3D reconstruction sub-module obtains 3D point coordinates from the corresponding points using the triangulation principle. The 3D coordinates can be computed easily in the FPGA using Eq. (6). Parameters ${c}_{x}^{l},{c}_{x}^{r},f,T$ are invariable parameters determined by vision system calibration. Positions ${x}_{l}$ and ${x}_{r}$ are the horizontal positions of the points in the left and right images, and the disparity is defined simply by $d={x}_{l}-{x}_{r}$. Note that the SNR results determine whether each 3D point is calculated.

Figure 10(a) shows the 3D point clouds obtained by the FPGA and MATLAB in different colors; the two point clouds coincide well in the same coordinate system. Figure 10(b) shows the distribution of the distances between corresponding points of the two platforms' results: the mean absolute deviation is 0.033 mm, and 98% of the corresponding point distances are less than 0.1 mm. In summary, the proposed FPGA architecture works well and has high accuracy.
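Eq. (6) itself is not reproduced here, but a textbook rectified-stereo triangulation of the same form (cf. [18]) can serve as a reference model; the principal-point row `cy` and the function name are our assumptions:

```python
def reconstruct_point(xl, xr, y, cxl, cxr, cy, f, T):
    """Rectified-stereo triangulation, textbook form (a sketch, not the
    paper's exact Eq. (6)).

    cxl, cxr: principal-point columns of the left/right views;
    cy: assumed shared principal-point row; f: focal length in pixels;
    T: baseline. Disparity is corrected for the principal-point offset.
    """
    d = (xl - cxl) - (xr - cxr)  # disparity referred to the principal points
    Z = f * T / d                # depth from similar triangles
    X = (xl - cxl) * Z / f       # back-project the left-image column
    Y = (y - cy) * Z / f         # back-project the row
    return X, Y, Z
```

Because f and T are fixed by calibration, the product f*T can be hard-coded, leaving one divide and two multiplies per point, which maps naturally onto a pipelined datapath.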

## 3. Experimental results

To evaluate the FPGA architecture described above, we set up a stereo vision system consisting of a CCD camera (Basler acA2000-340km), a DLP projector (TI LightCrafter 4500), and a single FPGA processor (Terasic DE3 FPGA development board with an Altera Stratix III EP3SL150). A synchronization circuit built into the FPGA synchronizes the projector and camera. The projector modulates the fringe patterns onto the object, while the camera captures images of the object (1024 × 768). The image data are sent through the CameraLink circuit (Terasic CLR_HSMC daughter card) and stored in the DDR2 SDRAM. The PMP computing unit obtains the data from RAM and calculates the 3D coordinates. Figure 11 shows the schematic diagram and a snapshot of the implemented system. The design tool is Quartus II 13.1, and the timing simulation tool is ModelSim 10.1a.

To compare the performance and quality of the proposed platform with other platforms, corresponding software programs were implemented with Microsoft Visual Studio 2010 and the OpenCV 2.4.2 library, MATLAB R2015a, and GPU programming on the CUDA platform. The experimental environment for these platforms is as follows: Microsoft Windows 8.1 x64 operating system, Intel(R) Core(TM) i7-4770K CPU, NVidia GTX740 (384 cores), and 8 GB DDR3 1600 SDRAM. The experimental FPGA system works in capture and calculation modes separately; that is, the capture modules obtain the images from the CameraLink interface and store these data at a fixed address in the DDR2 SDRAM, and then the calculation modules start the PMP processing.

The performance of these systems is shown in Table 1, which shows that the FPGA system provides a substantial performance boost. The system clock frequency on the FPGA is 50 MHz, and the image processing speed reaches a maximum of 48 fps at a resolution of 1024 × 768. The front end's parallel processing and pipelined architecture have a substantial impact on performance. Trigonometric function calculations at each pixel consume substantially more resources and time in the software implementations, while the CORDIC module works out these functions within a few clock cycles. Hence, the FPGA implementation is nearly 100 times faster than these conventional methods. Owing to the pipelined architecture and timing constraints, the FPGA platform takes the same time regardless of the number of points. Table 2 shows the slices and memory bits used. If the size of the image increases, the system will use more resources.

Finally, we show some measurement examples in Fig. 12. The 3D results do not have obvious differences with those of conventional methods. We hence conclude that our architecture can handle the PMP method well.

## 4. Conclusions

In this paper, we proposed a full-pipeline PMP architecture and its FPGA implementation. The phase calculation, phase unwrapping, rectification, phase matching, and 3D reconstruction modules have been re-implemented as gate-level logic circuits and re-designed to meet the full-pipeline requirements. Moreover, the performance and accuracy of each module were evaluated and presented in this paper. Finally, the resource costs and calculation times shown in the experiments section verify the feasibility of the system. In conclusion, this FPGA architecture implements a high-performance PMP algorithm at a reasonable resource cost and can meet the needs of industrial embedded high-speed 3D measurement.

In addition, there are some other aspects to explore in future work. First, we plan to optimize the data width and buffer design, which will further improve measurement accuracy while reducing resource cost. Second, based on this FPGA design framework, we plan to improve and integrate the front-end camera capture module and the back-end data processing and transmission, in the hope of realizing a 3D measurement sensor with an embedded calculation unit.

## Funding

National Science Foundation of China (NSFC) (51675165, 51505169), National Science and Technology Major Project of the Ministry of Science and Technology of China (2013ZX02104004-003_IC), Major program of Science and Technology Planning Project of Hubei province (2013AEA003), China Postdoctoral Science Foundation (2014M552036, 2016T90688)

## Acknowledgment

The authors would like to thank other members in their research group.

## References and links

**1. **Y. Surrel, “Design of algorithms for phase measurements by the use of phase stepping,” Appl. Opt. **35**(1), 51–60 (1996). [CrossRef] [PubMed]

**2. **V. Srinivasan, H. C. Liu, and M. Halioua, “Automated phase-measuring profilometry of 3-D diffuse objects,” Appl. Opt. **23**(18), 3105–3108 (1984). [CrossRef] [PubMed]

**3. **S. Zhang, D. Royer, and S.-T. Yau, “GPU-assisted high-resolution, real-time 3-D shape measurement,” Opt. Express **14**(20), 9120–9129 (2006). [CrossRef] [PubMed]

**4. **H. Nguyen, D. Nguyen, Z. Wang, H. Kieu, and M. Le, “Real-time, high-accuracy 3D imaging and shape measurement,” Appl. Opt. **54**(1), A9–A17 (2015). [CrossRef] [PubMed]

**5. **K. Zhong, Z. Li, X. Zhou, Y. Shi, and C. Wang, “Hybrid parallel computing architecture for multiview phase shifting,” Opt. Eng. **53**(11), 112214 (2014). [CrossRef]

**6. **Y. Gong and S. Zhang, “Ultrafast 3-D shape measurement with an off-the-shelf DLP projector,” Opt. Express **18**(19), 19743–19754 (2010). [CrossRef] [PubMed]

**7. **X. Liu, H. Zhao, G. Zhan, K. Zhong, Z. Li, Y. Chao, and Y. Shi, “Rapid and automatic 3D body measurement system based on a GPU-Steger line detector,” Appl. Opt. **55**(21), 5539–5547 (2016). [CrossRef] [PubMed]

**8. **S. Jin, J. Cho, X. D. Pham, K. M. Lee, S.-K. Park, M. Kim, and J. W. Jeon, “FPGA Design and Implementation of a Real-Time Stereo Vision System,” IEEE Trans. Circ. Syst. Video Tech. **20**(1), 15–26 (2010). [CrossRef]

**9. **J. Wang, Z. Xiong, Z. Wang, Y. Zhang, and F. Wu, “FPGA Design and Implementation of Kinect-Like Depth Sensing,” IEEE Trans. Circ. Syst. Video Tech. **26**(6), 1175–1186 (2016). [CrossRef]

**10. **Z. Li and Y. Li, “Gamma-distorted fringe image modeling and accurate gamma correction for fast phase measuring profilometry,” Opt. Lett. **36**(2), 154–156 (2011). [CrossRef] [PubMed]

**11. **R. Andraka, “A Survey of CORDIC Algorithms for FPGA Based Computers,” in *Proceedings of the 1998 ACM/SIGDA Sixth International Symposium on Field Programmable Gate Arrays*, FPGA ’98 (ACM, 1998), pp. 191–200. [CrossRef]

**12. **M. A. Herráez, D. R. Burton, M. J. Lalor, and M. A. Gdeisat, “Fast two-dimensional phase-unwrapping algorithm based on sorting by reliability following a noncontinuous path,” Appl. Opt. **41**(35), 7437–7444 (2002). [CrossRef] [PubMed]

**13. **T. Weise, B. Leibe, and L. Van Gool, “Fast 3d scanning with automatic motion compensation,” in *IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2007)* (IEEE, 2007), pp. 1–8. [CrossRef]

**14. **R. Cusack and N. Papadakis, “New robust 3-D phase unwrapping algorithms: application to magnetic field mapping and undistorting echoplanar images,” Neuroimage **16**(3 Pt 1), 754–764 (2002). [CrossRef] [PubMed]

**15. **T. Tao, Q. Chen, J. Da, S. Feng, Y. Hu, and C. Zuo, “Real-time 3-D shape measurement with composite phase-shifting fringes and multi-view system,” Opt. Express **24**(18), 20253–20269 (2016). [CrossRef] [PubMed]

**16. **Z. Li, Y. Shi, and C. Wang, “Real-Time Complex Object 3D Measurement,” in 2009 International Conference on Computer Modeling and Simulation (2009), pp. 191–193. [CrossRef]

**17. **C. Zuo, L. Huang, M. Zhang, Q. Chen, and A. Asundi, “Temporal phase unwrapping algorithms for fringe projection profilometry: A comparative review,” Opt. Lasers Eng. **85**, 84–103 (2016). [CrossRef]

**18. **G. Bradski and A. Kaehler, *Learning OpenCV: Computer Vision with the OpenCV Library* (O’Reilly Media, Inc., 2008).

**19. **D. V. Papadimitriou and T. J. Dennis, “Epipolar line estimation and rectification for stereo image pairs,” IEEE Trans. Image Process. **5**(4), 672–676 (1996). [CrossRef] [PubMed]