## Abstract

Deep learning has been applied to fringe projection profilometry (FPP) in many recent works. However, obtaining a large amount of training data from real systems remains difficult, and the design and optimization of the networks are still worth exploring. In this paper, we use graphics software to build virtual FPP systems that generate the desired datasets conveniently. We first describe in detail how to construct a virtual FPP system, and then analyze the key factors that bring the virtual FPP system closer to reality. To accurately estimate the depth image from a single fringe image, we also design a new loss function that enhances the overall quality and restores detailed information. Two representative networks, U-Net and pix2pix, are compared in multiple aspects. Real experiments demonstrate the good accuracy and generalization of the network trained with the diverse data from our virtual systems and the designed loss, providing useful guidance for real applications of deep learning methods.

© 2021 Optical Society of America under the terms of the OSA Open Access Publishing Agreement

## 1. Introduction

Fringe projection profilometry (FPP) is a classic solution for 3D shape scanning. It projects coded fringes onto an object and captures the deformed fringe images modulated by the object’s surface. The 3D shape is then reconstructed by demodulating the fringe signals, and the 3D point cloud is obtained through further calibration algorithms. Although FPP has been applied in multiple scenarios [1–4], it still struggles to balance accuracy and speed. The N-step phase-shifting algorithm [5] is precise and commonly used, but it requires projecting and capturing at least three different fringe images, which is time-consuming for dynamic measurements. With a single-shot fringe image, Fourier-transform profilometry (FTP) [6] extracts the carrier spectra for 3D shape reconstruction. Unfortunately, its accuracy is affected by spectral overlap when processing complex shapes. To solve this problem, spectral analysis methods such as the windowed Fourier transform [7] and the wavelet transform [8] have been introduced, but they need heavy calculations and preset parameters. In addition, the methods above usually obtain a phase wrapped into a range of 2π, so unwrapping algorithms are needed to further restore the true 3D shapes. Temporal phase unwrapping methods, such as the gray-code method [9,10] and the multi-frequency method [11,12], are simple to calculate, but they require projecting multiple additional fringe images. Spatial phase unwrapping methods, such as the branch-cut method [13], the flood method [14], and the Laplacian operator method [15], can perform unwrapping from a single image, but they need massive calculation and their accuracy is sensitive to noise, shadows, and height jumps.

In recent years, deep learning has shown powerful performance thanks to improvements in neural network structures and computing power. Many studies have shown that deep learning outperforms traditional algorithms in speed and robustness when used for fringe denoising [16–18], fringe analysis [19,20], and phase unwrapping [21–23]. However, these works conduct experiments with limited datasets and focus on only a single step of the FPP pipeline, which means that multiple networks must be integrated to construct a complete system. Training the integrated networks and preparing their training sets are then troublesome, and such integration inevitably accumulates errors. More promisingly, some studies directly map a single fringe image to its height/depth image with a single network [24–27]. Some of them also simulate training samples to improve the performance of the trained model. In [24], a large number of fringe pattern and height map pairs are simulated with mathematical expressions; this approach, however, can hardly generate data close to reality. Simulating an FPP system is therefore a natural solution for generating data conveniently [25,26]. In [26], a digital twin of a real FPP system is proposed: it strictly copies the calibration parameters of a real system to build a virtual system for rendering the dataset, which enables the trained network model to be used in this specific real system. However, many other factors in reality, such as the measuring environment and the properties of the objects, cannot be copied accurately, and they also affect the accuracy of the model.

The selection of a suitable network and parameters is another major issue for deep learning methods. Existing works in FPP mostly choose convolutional neural networks (CNNs). For instance, an optical fringe pattern de-noising convolutional neural network (FPD-CNN) model is proposed in [18]; a CNN model for wrapped-phase calculation is proposed in [20]; and the U-Net [28] is improved for phase unwrapping in [23]. To guide the selection of a suitable network for single-shot FPP, three CNNs are compared in [27], namely fully convolutional networks (FCN), autoencoder networks (AEN), and U-Net, and U-Net is found to perform the best due to its symmetric structure and feature map concatenation. Besides CNNs, other network models with powerful performance have also emerged recently, such as the generative adversarial network (GAN) [29], which generates data in a certain style through an adversarial procedure. Among GANs, pix2pix [30] establishes an image-to-image conversion and shows excellent performance in generating images with details.

Through in-depth research, we found that simulating FPP systems is significant and that mapping the diverse interference factors of reality onto the virtual FPP system is necessary. Therefore, in this paper, the methods to construct the virtual system are given in detail, and the variable factors that interfere with a real FPP system, together with the way of mapping these factors to the virtual system, are studied thoroughly. With different combinations of these factors, our virtual FPP system renders different large datasets, whose influence on the accuracy and generalization of the network is investigated and evaluated. In addition, a new loss function is designed, which considers the structural similarity of objects and the detail information to improve the overall and detailed accuracy of the result. U-Net and pix2pix, representatives of CNNs and GANs respectively, are compared through multiple experiments to explore the better solution for estimating the depth image. A real experiment further verifies the accuracy and the generalization ability of our method.

## 2. Construction of a virtual FPP system and the rendering of datasets

Sufficient training data are a prerequisite for excellent performance of deep learning networks. Recently, computer graphics has been successfully introduced for dataset generation [31–33]. For the FPP technique, graphics software has even been used to simulate a system and establish diverse datasets conveniently [25,26]. This section introduces the details of constructing a virtual FPP system and rendering data samples.

#### 2.1 Selection of 3D models

The virtual objects used in the virtual FPP system can be selected from existing 3D model datasets, such as ModelNet [34], ShapeNet [35], ABC [36], Thingi10K [37], etc. Considering the effective working distance of FPP in visible light (within 1∼2m), we select the Thingi10K dataset that contains various 3D models of common objects, such as sculptures, vases, and dolls, as shown in Fig. 1. The variety and the magnitude of these models help to generate large-scale and diverse data samples as needed.

#### 2.2 Construction of a virtual FPP system

Computer graphics is good at presenting real-world scenes in a virtual form. Among various graphics software, Blender is a powerful open-source 3D creation suite that can render images in batch through Python. In Blender, a virtual camera and a virtual projector can be placed in the “Layout”, as shown in Figs. 2(a) and 2(b). The virtual system works the same as a real FPP system, i.e., the projector projects sinusoidal fringes onto an object, and the deformed fringes are captured by a camera. Blender renders fringe images by setting the compositing node “Render Layers” to “Image”, and renders depth images by setting it to “Depth” [shown in Fig. 2(c)].

Some elements of our virtual FPP system in Blender include:

- **1) Camera:** the type is set to “perspective”, and its position and rotation angle can be adjusted;
- **3) Objects:** 3D models are loaded in and are scaled to a proper size;
- **5) Rendering:** the rendering engine is set as the physically-based path tracer “Cycles”, and the sampling integrator is set as “Branched path tracing”;
- **6) File format:** any common image format is permitted for fringe images, but the depth images should be saved in Open_EXR format to retain the original depth information.

#### 2.3 Factors enhancing the reality of the virtual FPP system

To enhance the generalization of our network, the 3D models used to construct the training set should be rich and diverse, and the settings of the virtual system must also be adjusted to cover the various possible measurement environments. The input of our network is a sinusoidal fringe image, usually described mathematically as

$$I(x, y) = a(x, y)\cos\left[2\pi f x + \varphi(x, y)\right] + b(x, y) + n(x, y), \tag{1}$$

where *a*(*x*, *y*) is the amplitude intensity, *f* is the frequency deciding the fringe period, *φ*(*x*, *y*) is the phase describing the shape of an object, *b*(*x*, *y*) is the background, and *n*(*x*, *y*) is the noise. These parameters change between measurements, which changes the fringe images and, consequently, the output depth images. The main factors influencing these parameters in practice should therefore be taken into consideration in the virtual system settings, and they are analyzed as follows.
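The role of each parameter can be illustrated with a minimal sketch that synthesizes one row of a fringe image, assuming the common form *I* = *a* cos(2π*f* *x* + *φ*) + *b* + *n* with the carrier along *x*. The function name `fringe_row` and the default values are hypothetical and chosen only for illustration; they are not the settings of the virtual system.

```python
import math

def fringe_row(width, period, amplitude=0.5, background=0.5, phase=None, noise=None):
    """Return one row I(x) = a*cos(2*pi*f*x + phi(x)) + b + n(x).

    `period` is the fringe period in pixels, so f = 1 / period.
    `phase` and `noise` are optional per-pixel lists (assumed zero if omitted).
    """
    f = 1.0 / period  # fringe frequency in cycles per pixel
    row = []
    for x in range(width):
        phi = phase[x] if phase is not None else 0.0
        n = noise[x] if noise is not None else 0.0
        row.append(amplitude * math.cos(2.0 * math.pi * f * x + phi) + background + n)
    return row

# A flat reference plane: no object phase and no noise.
flat = fringe_row(width=64, period=16)

# An object that shifts the phase by pi inverts the fringes.
shifted = fringe_row(width=64, period=16, phase=[math.pi] * 64)
```

In this sketch, changing `period` mimics a change of *f*, while a non-zero `phase` list mimics the deformation introduced by an object surface.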

### 2.3.1 Period of fringes

According to the classical calibration theory [40], the image coordinate [*x*, *y*] is described as

$$x = f_x \frac{X}{Z} + c_x, \qquad y = f_y \frac{Y}{Z} + c_y, \tag{2}$$

where [*f*_{x}, *f*_{y}] are the focal lengths of the camera (projector), [*X*, *Y*, *Z*] denote the camera (projector) coordinate, and [*c*_{x}, *c*_{y}] are the optical centers of the camera (projector).

In practice, the optical centers and focal lengths vary between devices. A difference in optical centers leads to a different imaging location for the object in an image, which does not cause errors for the task of this paper. However, a change of focal length zooms the fringe period of the captured fringe image, which may correspond to a change of *f* or *φ*(*x*, *y*) in Eq. (1). This type of change influences the depth extraction and so must be considered. In the virtual FPP system, it can be simulated by setting various periods of the projected fringes (adjusting the “scale-X” in the 2nd “mapping” node in Fig. 3).
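The effect of the focal length on the imaged fringe period can be checked with a small sketch of the pinhole projection, assuming the standard form *x* = *f*_{x}*X*/*Z* + *c*_{x}. The helper names and the numeric values (focal lengths in pixels, a 0.01 m spacing on a plane 1.5 m away) are illustrative assumptions, not calibration data from the paper.

```python
def project(point, fx, fy, cx, cy):
    """Pinhole projection of a 3D point (X, Y, Z) to image coordinates (x, y)."""
    X, Y, Z = point
    return (fx * X / Z + cx, fy * Y / Z + cy)

def imaged_period(fx, spacing=0.01, depth=1.5, cx=320.0, cy=240.0):
    """Pixel distance between two points `spacing` apart on a plane at `depth`."""
    x0, _ = project((0.0, 0.0, depth), fx, fx, cx, cy)
    x1, _ = project((spacing, 0.0, depth), fx, fx, cx, cy)
    return x1 - x0

short_lens = imaged_period(fx=3000.0)  # imaged period with the shorter focal length
long_lens = imaged_period(fx=6000.0)   # doubling the focal length doubles the period
```

Shifting the optical center only translates both projected points, so the imaged period stays the same, which is consistent with the statement that the optical center does not affect this task.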

### 2.3.2 Pose between the camera and the projector

The spatial geometric relation between the camera coordinate [*X*_{c}, *Y*_{c}, *Z*_{c}] and the projector coordinate [*X*_{p}, *Y*_{p}, *Z*_{p}] can be described as

$$\begin{bmatrix} X_c \\ Y_c \\ Z_c \end{bmatrix} = R \begin{bmatrix} X_p \\ Y_p \\ Z_p \end{bmatrix} + t, \tag{3}$$

where *R* and *t* denote the rotation and the translation matrices, respectively. To simulate this relationship, we rotate the projected fringes around the optical axis of the projector (by adjusting the “rotation-Z” in the 2nd “mapping” node in Fig. 3) and set different angles between the optical axes of the camera and the projector to simulate different *R* and *t*.
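A rigid transform of this kind can be sketched in a few lines. The example below assumes the mapping goes from projector coordinates to camera coordinates and uses a rotation about the *y*-axis only; both choices, and the function names, are assumptions made for illustration.

```python
import math

def rotation_y(theta_deg):
    """3x3 rotation matrix for a rotation of theta_deg about the y-axis."""
    t = math.radians(theta_deg)
    c, s = math.cos(t), math.sin(t)
    return [[c, 0.0, s],
            [0.0, 1.0, 0.0],
            [-s, 0.0, c]]

def transform(R, t, p):
    """Apply p' = R p + t to a 3D point p given as a tuple."""
    return tuple(sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3))

# A point on the projector's optical axis, seen after a 90 degree rotation.
rotated = transform(rotation_y(90.0), (0.0, 0.0, 0.0), (0.0, 0.0, 1.0))

# A pure translation along x (no rotation).
translated = transform(rotation_y(0.0), (0.1, 0.0, 0.0), (1.0, 2.0, 3.0))
```

Varying the angle passed to `rotation_y` and the translation tuple corresponds to setting different angles and baselines between the two optical axes in the virtual system.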

### 2.3.3 Amplitude intensity and background

The amplitude intensity of a fringe image, corresponding to *a*(*x*, *y*) in Eq. (1), is generally decided by the material/texture of the objects, the power of the projector, and the brightness of the background, which can be set conveniently in the virtual FPP system (by adjusting the “strength” in the “background” node in Fig. 5). In addition, the environment map can be shifted or rotated multiple times to simulate objects located in different backgrounds (by adjusting the “rotation-Z” in the “mapping” node in Fig. 5).

With the factors above adjusted, the virtual FPP system can generate data much closer to those from a practical system, which helps to improve the practicability of the trained network.

## 3. Networks and the designed loss function

U-Net has been shown to perform the best among several CNNs used for FPP techniques [27]. In recent years, GANs have revealed a powerful ability in image generation, so this paper explores whether a GAN performs better on depth estimation. Below, the architectures of the U-Net and a conditional GAN (cGAN) named pix2pix [30] are briefly introduced, and the design of our new loss function is also explained.

#### 3.1 Network architecture

### 3.1.1 U-Net

U-Net [28] follows an encoder-decoder structure. The encoder down-samples the input images to extract features, and the decoder up-samples the feature maps to obtain a high-resolution output image. U-Net also has a special skip-connection structure, so that larger-scale feature maps can be sent directly to the up-sampling process; the output process and the input process therefore share the low-level information. Based on these structures, U-Net learns with less data but achieves higher precision. As U-Net is a part of pix2pix, its structure is given as follows.

### 3.1.2 pix2pix

pix2pix [30] contains a generator and a discriminator. The generator produces fake images, and the discriminator tries to identify the fake ones, guiding the generator to produce a fake image much closer to the target output. Figure 6 presents the architecture of pix2pix. The network of pix2pix is shown in Fig. 7, where the generator has a U-Net shape and the discriminator is “patchGAN”, a multi-layer CNN.

#### 3.2 Proposed new loss function

A loss function defines the convergence form of a network and so is the key to the quality of outputs. The mean absolute error (*L*1 loss) and the mean square error (*L*2 loss) are the most commonly used loss functions. However, both evaluate only average errors, so the resulting outputs may well show low quality in some local regions.

The task in this paper is to retrieve a depth image that records the 3D shape of a scanned object; hence the geometry and the spatial structure are a good constraint for preserving the overall quality of the outputs. An index called Structure SIMilarity (SSIM) [41] leverages structural information to evaluate image quality, and is defined as

$$\mathrm{SSIM}(u, v) = \frac{(2\mu_u \mu_v + c_1)(2\sigma_{uv} + c_2)}{(\mu_u^2 + \mu_v^2 + c_1)(\sigma_u^2 + \sigma_v^2 + c_2)}, \tag{4}$$

where *μ*_{u} is the mean of the evaluated image *u*, *μ*_{v} is the mean of the ground truth *v*, *σ*_{u}^{2} and *σ*_{v}^{2} are the variances of *u* and *v*, respectively, *σ*_{uv} is the covariance of *u* and *v*, and *c*_{1} and *c*_{2} are two constants to avoid division by zero. The SSIM ranges in [0, 1], and it is scored low if the evaluated image is compressed, blurred or noise contaminated. With this good ability to measure the overall structure of 3D shapes, we take the SSIM index as a term of the loss function. The detailed description of this term is

$$L_{T_1} = 1 - \mathrm{SSIM}(G(I), d), \tag{5}$$

where *I* is an input fringe image, G(*I*) is the fake depth image generated by U-Net or pix2pix’s generator, and *d* is the ground truth of G(*I*).
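As an illustration of how an SSIM-based term behaves, the sketch below computes the standard SSIM index over a single window (the whole input) and turns it into a loss as one minus SSIM. This is a simplification: the experiments use a sliding window, and the constants `c1` and `c2` as well as the form of the loss are assumptions made here, not values taken from the paper.

```python
def ssim(u, v, c1=1e-4, c2=9e-4):
    """Single-window SSIM between two equally long lists of pixel values."""
    n = len(u)
    mu_u = sum(u) / n
    mu_v = sum(v) / n
    var_u = sum((a - mu_u) ** 2 for a in u) / n
    var_v = sum((b - mu_v) ** 2 for b in v) / n
    cov = sum((a - mu_u) * (b - mu_v) for a, b in zip(u, v)) / n
    numerator = (2.0 * mu_u * mu_v + c1) * (2.0 * cov + c2)
    denominator = (mu_u ** 2 + mu_v ** 2 + c1) * (var_u + var_v + c2)
    return numerator / denominator

def ssim_loss(pred, target):
    """Loss term that is zero for identical images and grows as structure is lost."""
    return 1.0 - ssim(pred, target)

target = [0.1, 0.5, 0.9, 0.3]
blurred = [0.45, 0.45, 0.45, 0.45]  # same mean as target, all structure removed

loss_identical = ssim_loss(target, target)
loss_blurred = ssim_loss(blurred, target)
```

The blurred input keeps the mean of the target but loses all variance, so its loss is close to one, whereas an exact reproduction gives a loss of zero.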

With the overall accuracy ensured, the local detail is another essential factor to the restoration of a 3D shape. As details are always embedded in the edges of an image, we add another term involving a common tool for edge detection, the Laplacian operator, to the loss function to estimate the detail’s similarity between G(*I*) and *d*. The added term is described as

Based on the above analysis, for the U-Net, we replace the commonly used *L*1 loss or *L*2 loss with our new loss function below:

$$L_{\text{U-Net}} = \lambda_1 L_{T_1} + \lambda_2 L_{T_2}, \tag{7}$$

where *L*_{T1} and *L*_{T2} have been given in Eqs. (5) and (6), respectively, and *λ*_{1} and *λ*_{2} are the adjustable weights.
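The Laplacian-based detail term and the weighted sum of the two terms can be sketched as follows. The 4-neighbour discrete Laplacian and the mean absolute difference are assumptions made for illustration, since the exact kernel and norm are not stated here; the default weights of 100 and 10 are the empirical values reported in Section 4.2.

```python
def laplacian(img):
    """4-neighbour discrete Laplacian on the interior pixels of a 2D list."""
    h, w = len(img), len(img[0])
    return [[img[y - 1][x] + img[y + 1][x] + img[y][x - 1] + img[y][x + 1]
             - 4.0 * img[y][x]
             for x in range(1, w - 1)]
            for y in range(1, h - 1)]

def laplacian_loss(pred, target):
    """Mean absolute difference between the Laplacians of two images (assumed norm)."""
    lp, lt = laplacian(pred), laplacian(target)
    diffs = [abs(a - b) for row_p, row_t in zip(lp, lt) for a, b in zip(row_p, row_t)]
    return sum(diffs) / len(diffs)

def total_loss(l_t1, l_t2, lam1=100.0, lam2=10.0):
    """Weighted sum of the structural term and the detail term."""
    return lam1 * l_t1 + lam2 * l_t2

smooth = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
spike = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]

detail = laplacian_loss(smooth, spike)  # a missed spike is penalised strongly
combined = total_loss(0.5, detail)
```

Because the Laplacian responds to edges and isolated peaks, a prediction that smooths away a fine feature is penalized even when its average error is small.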

For the pix2pix, we define the loss function as

$$L_{\text{pix2pix}} = L_{cGAN} + \lambda_1 L_{T_1} + \lambda_2 L_{T_2}. \tag{8}$$

In Eq. (8), the last two terms are the same as those in Eq. (7), and *L*_{cGAN} is a term unique to cGAN, given by Eq. (9), that assesses the accuracy of the discriminator’s output: the discriminator distinguishes G(*I*) from its ground truth *d* by comparing their relationships to the input fringe image *I*. Note that the *L*2 loss is exploited in Eq. (9) to replace the cross-entropy loss used in the original pix2pix [30], since it shows better performance in improving the quality of the result and the stability of the training process [42].

## 4. Experiments

#### 4.1 Dataset rendering and data preprocessing

In this paper, we choose 624 models from Thingi10K, which covers a rich variety of items with various complexities. To ensure the generalization of the trained model, we separate the 624 models into 13 groups and set different rendering parameters analyzed in section 2.3 for each group to simulate the possible various situations in practice. The variation range we set for each parameter in Blender is shown in Table 1 and the rendered fringe images varying with different parameters are displayed in Fig. 8.

To enrich the dataset, we render multiple images for each model. In the camera coordinate system shown in Fig. 9, the model is first rotated around the *y*-axis 12 times in steps of 30°, and for each of these rotations it is further rotated around the *z*-axis 12 times in steps of 5°. Thus, 144 fringe images are rendered for each object, and a depth image is rendered for each fringe image. In total, 89,856 pairs of images are obtained to create the dataset. We randomly allocate the 624 models to the training and test sets in a ratio of 8.5:1.5, so no object appears in both the training and test sets.
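The pose schedule and the resulting dataset size can be verified with a short enumeration. Starting both rotations at 0° is an assumption; only the step sizes and counts are taken from the text.

```python
from itertools import product

Y_STEPS, Y_STEP_DEG = 12, 30  # rotations around the y-axis
Z_STEPS, Z_STEP_DEG = 12, 5   # rotations around the z-axis for each y rotation

def poses():
    """List every (y_angle, z_angle) pair rendered for one model."""
    return [(i * Y_STEP_DEG, j * Z_STEP_DEG)
            for i, j in product(range(Y_STEPS), range(Z_STEPS))]

N_MODELS = 624
pairs_per_model = len(poses())
total_pairs = N_MODELS * pairs_per_model
```

Each pose yields one fringe image and one depth image, so `total_pairs` counts fringe/depth pairs rather than individual images.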

Before training, each fringe image and depth image *I* is normalized to [-1, 1] by
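A common way to map an image to [-1, 1] is min-max scaling. The sketch below assumes that choice, which may differ from the formula actually used in the experiments; the function name `normalize` is hypothetical.

```python
def normalize(values):
    """Min-max scale a flat list of pixel values to the range [-1, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant image carries no contrast; map it to the middle of the range.
        return [0.0 for _ in values]
    return [2.0 * (v - lo) / (hi - lo) - 1.0 for v in values]

scaled = normalize([0.0, 127.5, 255.0])
```

Scaling inputs and targets to the same symmetric range is a usual companion to a generator whose final activation is bounded, although the text does not state which activation is used.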

#### 4.2 Comparison of different loss functions

We implement ablation experiments to compare the effect of different loss functions and their combinations based on U-Net and pix2pix. During the training process, we use Adam optimizer with momentum parameters *β*_{1}=0.5 and *β*_{2}=0.999. The batch size is 4, with a learning rate of 0.0003. The size of the SSIM window is 8. All the networks with different loss functions are trained for 13 epochs, which is enough for convergence. We set *λ*_{1} and *λ*_{2} in Eq. (7) and Eq. (8) as 100 and 10, respectively, which are the best empirical values.

Figure 10 illustrates the qualitative comparison of different loss functions. All fringe images in Fig. 10 are chosen from the test set, in which the objects have not been seen by the network during training. Figures 10(a) and 10(b) represent the rendered fringe images and depth images (ground truth), respectively. Figures 10(c)–10(f) are the results of U-Net with the proposed loss SSIM + Laplace, the loss with only the SSIM term, only the *L*1 loss, and only the *L*2 loss, respectively. Similarly, Figs. 10(g)–10(j) show the corresponding results using pix2pix. For both U-Net and pix2pix, our proposed loss function performs the best in eliminating artifacts (the red boxes in the first two rows) and keeping details (the red boxes in the last two rows). For clarity, Fig. 11 gives the magnified results in the red boxes for the U-Net in Fig. 10.

To quantify the results, we further compute the mean absolute error (MAE) and the mean standard deviation of errors (MSDE) for all the results in the test sets, as listed in Table 2. The results in Table 2 basically agree with Fig. 10. Both show that our proposed loss function is effective. Figure 10 and Table 2 also show that U-Net generally performs better than pix2pix, both qualitatively and quantitatively.

Because it gives the best accuracy, our loss function SSIM + Laplace is adopted for the following experiments.
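The two metrics can be sketched as follows. The interpretation of MSDE as the per-image standard deviation of the signed errors, averaged over the test set, is an assumption based on its name; the helper names are hypothetical.

```python
import math

def mae(pred, target):
    """Mean absolute error between two flat lists of depth values."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)

def error_std(pred, target):
    """Population standard deviation of the signed errors of one image."""
    errors = [p - t for p, t in zip(pred, target)]
    mean = sum(errors) / len(errors)
    return math.sqrt(sum((e - mean) ** 2 for e in errors) / len(errors))

def dataset_metrics(pairs):
    """Average the per-image MAE and error standard deviation over a test set."""
    maes = [mae(p, t) for p, t in pairs]
    stds = [error_std(p, t) for p, t in pairs]
    return sum(maes) / len(maes), sum(stds) / len(stds)

# Two toy "images": a constant offset (zero spread) and a single outlier.
toy_set = [([0.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0]),
           ([0.0, 2.0], [0.0, 0.0])]
mean_mae, msde = dataset_metrics(toy_set)
```

The toy set shows why the two numbers are complementary: both images have the same MAE, but only the second has a non-zero spread of errors.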

#### 4.3 Relationship of the generalization ability and the accuracy to the diversity level of the dataset

In this section, we explore the impact of different rendering parameters on the accuracy of the depth images generated by the network. With the models separated into 13 groups, the datasets are rendered in three different cases:

Appendix B gives the other common settings of D_{1}, D_{2}, and D_{3}. Each dataset above is divided into a training set and a test set by 8.5:1.5 and then U-Net and pix2pix are trained by D_{1}, D_{2}, and D_{3} separately.

Figure 12 illustrates the depth images generated from a real captured fringe image by U-Net and pix2pix trained by D_{1}, D_{2}, and D_{3}, respectively, where the fringe image is captured by an arbitrary FPP system under arbitrary indoor lighting, and the object “Venus” has not appeared in any training dataset. It is obvious that the results are very poor, and even unusable, if the factors representing the measurement environments are not considered (in D_{1} and D_{2}). Therefore, the network generalizes much better if more system variables and environment variables are considered when rendering the dataset. Table 3 records the MAE and MSDE of the test set in each dataset. It shows that for both U-Net and pix2pix, the accuracy decreases as the complexity of the dataset increases.

#### 4.4 Other unique interference factors in practical use

In section 2.3, we analyze the common factors in reality that may influence the performance of the network model, based on the components of a fringe image expressed in Eq. (1). These factors are recommended to be considered when rendering the dataset with the virtual system. In this section, we investigate other factors from the view of the measured objects. One factor is that objects may have various colorful surfaces and textures; the other is that objects may be placed casually and multiple isolated objects may be captured together. Examples of the two cases are displayed in Fig. 13. With these two factors, we create another two datasets:

- **D_{4}:** images rendered with colorful 3D models in ShapeNet and images in D_{3};
- **D_{5}:** images rendered with multiple objects and images with colorful 3D models in D_{4}.

For dataset D_{4}, 320 models in ShapeNet [35] are selected and allocated to the training set and the test set in a ratio of 8.5:1.5, so no object in the test set appears in the training set. Each model is then rotated 144 times, and 46,080 new image pairs are rendered with the different parameters in Table 1. These images and the images in D_{3} form the dataset D_{4}.
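Splitting at the level of models, rather than images, is what keeps test objects unseen during training. A minimal sketch is given below; the random seed, the rounding rule, and the function name `split_models` are assumptions.

```python
import random

def split_models(model_ids, train_fraction=0.85, seed=0):
    """Shuffle model identifiers and split them so that no model is shared."""
    ids = list(model_ids)
    random.Random(seed).shuffle(ids)
    n_train = round(len(ids) * train_fraction)
    return ids[:n_train], ids[n_train:]

POSES_PER_MODEL = 144

train_ids, test_ids = split_models(range(320))
new_pairs = (len(train_ids) + len(test_ids)) * POSES_PER_MODEL
```

All 144 renderings of a model inherit the split of that model, so rotated views of a training object never leak into the test set.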

For dataset D_{5}, 624 objects selected from Thingi10K form 312 multi-object pairs. We allocate the 312 multi-object pairs to the training set and the test set in a ratio of 8.5:1.5, so that the tested objects never appear in the training set. Each multi-object pair is then rendered 144 times according to Table 1 and the setups of D_{3}. The new images and the images rendered from ShapeNet in D_{4} form D_{5}.

The testing results of the models trained by D_{3} and D_{4} are shown in Fig. 14, where the objects are casually taken from daily necessities. Similar to Fig. 12, the comparison in Fig. 14 illustrates that the dataset rendered by the virtual system should take into account the characteristics of the real application scenarios (such as the color of the objects). Also, the depth images in Fig. 14 generated by pix2pix present more obvious strip-like artifacts, illustrating that U-Net still performs better than pix2pix. Furthermore, we evaluate the MAE and MSDE of the results from U-Net on D_{4}, which are 0.0230 and 0.0663, respectively, so good accuracy is maintained.

In Fig. 15, we use a real fringe image to test the U-Net trained by D_{5}. To quantify the overall error, we also compute the MAE and MSDE of the test set in D_{5}, and they are 0.0153 and 0.0629, respectively. The results are satisfactory both visually and quantitatively.

## 5. Discussion

#### 5.1 Comparison between U-Net and pix2pix

The comparisons in sections 4.2 to 4.4 all illustrate that U-Net performs better than pix2pix. The main reason is that U-Net is adept at extracting features and then predicting an output, while pix2pix is better at synthesizing an image with reasonably complex patterns from a simple image. For the task of this paper, the depth is embedded in the fringe and there is a mapping between the fringe and the depth. Consequently, U-Net is more suitable for this task.

#### 5.2 Generalization ability of deep learning

The purpose of simulating a virtual FPP system is to conveniently generate large-scale, diverse data close to reality. This is also why we studied the factors that interfere with FPP systems in real applications. The more diverse the data, the better the generalization of the trained network; otherwise, the trained model may even fail to work. However, the interference factors in reality are complex and diverse and cannot all be listed here, and greater dataset diversity is inevitably accompanied by lower accuracy, as evaluated in section 4.3. Therefore, generalization can only be achieved to a relative degree, which is an inherent limitation of deep learning methods.

In this paper, we analyze the common factors influencing a real FPP system based on the mathematical expression of a fringe image, and we also list two further factors that are specific to the measured objects. The common factors are recommended to be considered in most cases, while the object-specific factors should be analyzed and selected case by case. Users can customize the datasets according to their needs by adding such specific factors (if any) to the common ones, as in the design of D_{4} or D_{5} in section 4.4.

Furthermore, some default factors are not considered in this paper, for example, the objects in all experiments are supposed to be Lambertian, and the lens distortion is neglected or thought to be corrected by pre-calibration, etc.

#### 5.3 Scale ambiguity

The scale ambiguity problem is caused by the inconsistency of the scale space for the depth value in the virtual system and the real system. As shown in Fig. 16, the point A and the point B on two objects are imaged at the same point C by the camera, but the corresponding points are “imaged” in the projector as P_{A} and P_{B}, respectively. Therefore, the trained network can distinguish the depths of A and B by pairing C to P_{A} or P_{B} in a single fringe image under a fixed system setting. However, once the system settings are changed, the points P_{A} and P_{B} corresponding to C will be “imaged” differently, and this is the cause of the ambiguity.

To eliminate this ambiguity, one solution is to completely copy the real system to form a virtual system, as proposed in [26]; this is limiting, since the rendered images and the trained network model can only be used with one fixed real FPP system. One of our ongoing studies is to first render a large-scale dataset with the virtual FPP system set to varying calibration parameters, and then train a network that takes the fringe images and the corresponding calibration parameters as input. In this paper, however, this problem has not yet been solved. Thus, the virtual system has to be set with the same calibration parameters as the real system if the real size of the 3D reconstruction is needed; otherwise, we can only get a shape without the real size.

## 6. Conclusion

Sufficient and diverse data guarantee the application scope of learning-based methods. Therefore, in this paper, we build a virtual FPP system with Blender for conveniently generating data, and we also analyze the key factors that can be set in the virtual FPP system to render images much closer to reality. To enhance the accuracy of the output depth image, we also propose an effective new loss function combining the SSIM index and the Laplace operator. Abundant experiments are conducted, which show that U-Net performs better in the task of depth image estimation and that our proposed loss function improves the overall and detailed accuracy of the result. Furthermore, the experiments investigate the relationship of the generalization ability and accuracy to the diversity level of the datasets. This work provides a good reference for improving the deep learning methods used in FPP.

## Appendix A

The following explains the shading tree nodes of the projector in Blender:

- 1) Geometry-normal: “normal” refers to the vector pointing from the projector to a certain point on the surface of the object.
- 2) Mapping-Point: the rotation (X/Y/Z) here decides the direction of the emitting light.
- 3) SeparateXYZ-Divide-CombineXYZ: project the 3D vector to the *xy*-plane, so that the projected pattern is only related to the *x* and *y* coordinates, not to the *z* coordinate.
- 4) The second Mapping-point: change the position, direction, and size of the projection pattern. Because the origin of the projection pattern is in the upper left corner, the “X” and “Y” coordinates of the “Location” are offset by 0.5 meters.
- 5) sin0.bmp: set the projection pattern (fringe image).
- 7) Emission: add Lambertian luminous shader for light output.
- 8) Light Output: light output.

## Appendix B

**Common settings:**

Camera mode: Perspective

Camera field of view: 7°

Projector size: 0.001m

Position of the background wall when rendering depth image: (0, 0.05m, 0)

Position of the 3D model: (0, 0, -0.02m)

Position of the projector: (0, -1.5m, 0)

Size of the 3D model: the maximal dimension is scaled to 0.14m.

The distance from the camera to the object: 1.55m

The intersection of the optical axis of the camera and the optical axis of the projector: (0, 0, 0)

Background of the fringe images in D_{1} and D_{2}: all white

**Note:** When rendering depth images, keep the positions of the camera and the object unchanged, and import a plane behind the object, otherwise the depth image would record the depth of the background regions as infinite.

## Funding

National Natural Science Foundation of China (61828501); Basic Research Program of Jiangsu Province (BK20192004C); Natural Science Foundation of Jiangsu Province (BK20181269).

## Disclosures

The authors declare no conflicts of interest.

## References

**1.** F. Tsalakanidou, F. Forster, S. Malassiotis, and M. G. Strintzis, “Real-time acquisition of depth and color images using structured light and its application to 3D face recognition,” RTI **11**(5-6), 358–369 (2005).

**2.** J. I. Laughner, S. Zhang, H. Li, C. C. Shao, and I. R. Efimov, “Mapping cardiac surface mechanics with structured light imaging,” Am. J. Physiol. Heart Circ. Physiol. **303**(6), H712–H720 (2012).

**3.** J. Xu, P. Wang, Y. Yao, S. Liu, and G. Zhang, “3D multi-directional sensor with pyramid mirror and structured light,” Opt. Lasers Eng. **93**, 156–163 (2017).

**4.** J. Burke, T. Bothe, W. Osten, and C. F. Hess, “Reverse engineering by fringe projection,” Proc. SPIE **4778**, 312 (2002).

**5.** V. Srinivasan, H. C. Liu, and M. Halioua, “Automated phase-measuring profilometry of 3-D diffuse objects,” Appl. Opt. **23**(18), 3105–3108 (1984).

**6.** M. Takeda and K. Mutoh, “Fourier transform profilometry for the automatic measurement of 3-D object shapes,” Appl. Opt. **22**(24), 3977–3982 (1983).

**7.** K. Qian, “Two-dimensional windowed Fourier transform for fringe pattern analysis: Principles, applications and implementations,” Opt. Lasers Eng. **45**(2), 304–317 (2007).

**8.** J. Zhong and J. Weng, “Spatial carrier-fringe pattern analysis by means of wavelet transform: wavelet transform profilometry,” Appl. Opt. **43**(26), 4993–4998 (2004).

**9.** G. Sansoni, S. Corini, S. Lazzari, R. Rodella, and F. Docchio, “Three-dimensional imaging based on Gray-code light projection: characterization of the measuring algorithm and development of a measuring system for industrial applications,” Appl. Opt. **36**(19), 4463–4472 (1997).

**10.** D. Zheng, Q. Kemao, F. Da, and H. Seah, “Ternary Gray code-based phase unwrapping for 3D measurement using binary patterns with projector defocusing,” Appl. Opt. **56**(13), 3660–3665 (2017).

**11.** J. M. Huntley and H. Saldner, “Temporal phase-unwrapping algorithm for automated interferogram analysis,” Appl. Opt. **32**(17), 3047–3052 (1993).

**12.** M. Zhang, Q. Chen, T. Tao, S. Feng, Y. Hu, H. Li, and C. Zuo, “Robust and efficient multi-frequency temporal phase unwrapping: optimal fringe frequency and pattern sequence selection,” Opt. Express **25**(17), 20381–20400 (2017).

**13.** R. M. Goldstein, H. A. Zebker, and C. L. Werner, “Satellite radar interferometry: Two-dimensional phase unwrapping,” Radio Sci. **23**(4), 713–720 (1988).

**14.** S. Zhang and S. Yau, “High-resolution, real-time 3D absolute coordinate measurement based on a phase-shifting method,” Opt. Express **14**(7), 2644–2649 (2006).

**15.** M. A. Schofield and Y. Zhu, “Fast phase unwrapping algorithm for interferometric applications,” Opt. Lett. **28**(14), 1194–1196 (2003).

**16.** K. Yan, Y. Yu, C. Huang, L. Sui, K. Qian, and A. Asundi, “Fringe pattern denoising based on deep learning,” Opt. Commun. **437**, 148–152 (2019).

**17.** F. Hao, C. Tang, M. Xu, and Z. Lei, “Batch denoising of ESPI fringe patterns based on convolutional neural network,” Appl. Opt. **58**(13), 3338–3346 (2019).

**18.** B. Lin, S. Fu, C. Zhang, F. Wang, and Y. Li, “Optical fringe patterns filtering based on multi-stage convolution neural network,” Opt. Lasers Eng. **126**, 105853 (2020).

**19.** S. Feng, Q. Chen, G. Gu, T. Tao, L. Zhang, Y. Hu, W. Yin, and C. Zuo, “Fringe pattern analysis using deep learning,” Adv. Photonics **1**(02), 1 (2019).

**20.** S. Feng, C. Zuo, W. Yin, G. Gu, and Q. Chen, “Micro deep learning profilometry for high-speed 3D surface imaging,” Opt. Lasers Eng. **121**, 416–427 (2019).

**21. **J. Zhang, X. Tian, J. Shao, H. Luo, and R. Liang, “Phase unwrapping in optical metrology via denoised and convolutional segmentation networks,” Opt. Express **27**(10), 14903–14912 (2019). [CrossRef]

**22. **T. Zhang, S. Jiang, Z. Zhao, K. Dixit, X. Zhou, J. Hou, Y. Zhang, and C. Yan, “Rapid and robust two-dimensional phase unwrapping via deep learning,” Opt. Express **27**(16), 23173–23185 (2019). [CrossRef]

**23. **K. Wang, Y. Li, Q. Kemao, J. Di, and J. Zhao, “One-step robust deep learning phase unwrapping,” Opt. Express **27**(10), 15100–15115 (2019). [CrossRef]

**24. **S. Van der Jeught and J. Dirckx, “Deep neural networks for single shot structured light profilometry,” Opt. Express **27**(12), 17091–17101 (2019). [CrossRef]

**25. **C. Wang, Q. Guan, and F. Wang, “Single stripe projection measurement method based on graphics and deep learning,” Chinese Invention Patent 201911260063 (10 Dec 2019).

**26. **Y. Zheng, S. Wang, Q. Li, and B. Li, “Fringe projection profilometry by conducting deep learning from its digital twin,” Opt. Express **28**(24), 36568–36583 (2020). [CrossRef]

**27. **H. Nguyen, Y. Wang, and Z. Wang, “Single-Shot 3D Shape Reconstruction Using Structured Light and Deep Convolutional Neural Networks,” Sensors **20**(13), 3718 (2020). [CrossRef]

**28. **O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical image computing and computer-assisted intervention (MICCAI) (2015), pp. 234–241.

**29. **I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in 27th International Conference on Neural Information Processing Systems (NIPS) (2014), pp. 2672–2680.

**30. **P. Isola, J. Zhu, T. Zhou, and A. Efros, “Image-to-image translation with conditional adversarial networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition. (CVPR) (2017), pp. 1125–1134.

**31. **F. Gomez-Donoso, A. Garcia-Garcia, J. Garcia-Rodriguez, S. Orts-Escolano, and M. Cazorla, “LonchaNet: A sliced-based CNN architecture for real-time 3D object recognition,” in 2017 International Joint Conference on Neural Networks (IJCNN) (2017), pp. 412–418.

**32. **Y. Li, A. Dai, L. Guibas, and M. Nießner, “Database-assisted object retrieval for real-time 3d reconstruction,” Comput. Graph. Forum **34**(2), 435–446 (2015). [CrossRef]

**33. **P. Stavroulakis, S. Chen, C. Delorme, P. Bointon, G. Tzimiropoulos, and R. Leach, “Rapid tracking of extrinsic projector parameters in fringe projection using machine learning,” Opt. Lasers Eng. **114**, 7–14 (2019). [CrossRef]

**34. **34. Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao, “3D ShapeNets: A deep representation for volumetric shapes,” in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2015), pp. 1912–1920.

**35. **A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, J. Xiao, L. Yi, and F. Yu, “ShapeNet: An information-rich 3d model repository,” arXiv: 1512.03012v1 (2015).

**36. **S. Koch, A. Matveev, Z. Jiang, F. Williams, A. Artemov, E. Burnaev, M. Alexa, D. Zorin, and D. Panozzo, “ABC: A big CAD model dataset for geometric deep learning,” in 2019 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2019), pp. 9601–9611.

**37. **Q. Zhou and A. Jacobson, “Thingi10K: A dataset of 10,000 3D-printing models,” arXiv: 1605.04797v2 (2016).

**38. **O. Yakovlyev, “Artist Workshop,” HDRI Haven, https://hdrihaven.com/hdri/?c=indoor&h=artist_workshop.

**39. **J. Versluis, “How to rotate a HDRI in Blender,” https://www.versluis.com/2020/07/rotate-hdri-in-blender/.

**40. **Z. Zhang, “A flexible new technique for camera calibration,” IEEE Trans. Pattern Anal. Mach. Intell. **22**(11), 1330–1334 (2000). [CrossRef]

**41. **Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE Trans. Image Process. **13**(4), 600–612 (2004). [CrossRef]

**42. **X. Mao, Q. Li, H. Xie, R. Y. K. Lau, Z. Wang, and S. P. Smolley, “Least Squares Generative Adversarial Networks,” in 2017 IEEE International Conference on Computer Vision (ICCV) (2017), pp. 2813–2821.