We discuss physical and information-theoretic limits of optical 3D metrology. Based on these principal considerations we introduce a novel single-shot 3D movie camera that almost reaches these limits. The camera is designed for the 3D acquisition of macroscopic live scenes. Like a hologram, each movie frame encompasses the full 3D information about the object surface, and the observation perspective can be varied while watching the 3D movie. The camera combines single-shot ability with a point cloud density close to the theoretical limit. No space-bandwidth is wasted on pattern codification. With 1-megapixel sensors, the 3D camera delivers nearly 300,000 independent 3D points within each frame. The 3D data display a lateral resolution and a depth precision limited only by physics. The approach is based on multi-line triangulation. The requisite low-cost technology is simple. Only two properly positioned synchronized cameras solve the profound ambiguity problem omnipresent in 3D metrology.
© 2017 Optical Society of America
Although highly desired, there is, surprisingly, no optical three-dimensional (3D) sensor that permits the single-shot acquisition of 3D motion pictures with a dense point cloud. There are approaches for real-time 3D data acquisition [1–14], but these are either multi-shot approaches, or they are not ‘local’, which means that the density of uncorrelated (independent) 3D points leaves room for improvement.
Obviously, the acquisition of a dense 3D point cloud within one single camera frame is difficult. We will discuss the physical and information limits involved. Eventually, we will introduce a novel 3D sensor principle and a 3D camera that indeed combines single-shot ability with a dense 3D point cloud. The camera works close to the discussed limits: the lateral resolution and depth precision are as good as physics allows. In other words, the camera performance is limited only by the sampling theorem and by shot noise or speckle noise. Neither color encoding nor spatial encoding is exploited. The camera displays almost the best possible 3D point cloud density in relation to the available camera pixels. With 1-megapixel cameras the sensor is able to deliver about 300,000 3D points within each single shot. The single-shot ability allows for the very fast acquisition of 3D data by flash exposure. This enables the capture of 3D motion pictures in which each camera frame includes the full-field 3D information, with free choice of the viewpoint while watching the movie (see Visualization 1).
To the best of our knowledge, no such 3D camera is currently available. Along the wide spectrum of optical 3D sensors, there are sensors with high precision, there are sensors that deliver a dense point cloud, and there are (a few) sensors that allow for single-shot acquisition of only sparse data. What are the obstacles for a single-shot 3D camera that offers dense 3D data and physically limited precision at the same time?
The key term is the “dense point cloud”. We could naively demand that each of the Npix camera pixels delivers a 3D point, completely independent of its neighbors. However, to avoid aliasing, we have to ensure that the image at the camera chip satisfies the sampling theorem, so a certain correlation between neighboring points is unavoidable. This is what has to be remembered when “point cloud density” is discussed: 100% is impossible because it contradicts linear systems theory. Indeed, all 3D sensors display artifacts at sharp edges, where the sampling theorem is violated. Satisfying the sampling theorem is also necessary in exploiting subpixel interpolation for high distance precision, as depicted in Fig. 1(a).
Keeping in mind that 100% is impossible, we nevertheless calculate the “point cloud density” ρ of a 3D sensor from ρ = N3D/Npix, where N3D is the number of independent 3D pixels and Npix is the number of pixels on the camera chip. For a 1-megapixel camera (1000 × 1000 pixels), a density of, say, 30% will yield 300,000 independent 3D points. A low-density point cloud implies a reduced lateral resolution of the sensor. For ρ < 1, and more so for ρ << 1, only a “pseudo dense” surface reconstruction is possible, which is commonly prettified via a posteriori interpolation and high-resolution texture mapping. Looking more closely at existing single-shot solutions, such as those obtained by Artec and Kinect One, we find that they lack high lateral resolution. The reason for the low density is that any type of triangulation requires the identification of corresponding points, whether this be for classical stereo or for the methods described above. The necessary encoding devours space-bandwidth that is lost for high lateral resolution. In , for example, the width of the projected stripes is encoded piecewise in the stripe direction. In [3,4], the period of projected lines is encoded and combined with different colors (the pros and cons of color encoding are discussed below). In , a pseudo-random pattern of dots is projected. Classical stereo exploits “natural” salient but spacious “features.” Is this fundamental for single-shot sensors? The bad news is: yes.
To better understand this phenomenon, let us start with the paradigm multi-shot principle that delivers a point cloud with virtually 100% density: the so-called “fringe projection triangulation”, sometimes called “phase-measuring triangulation” (PMT). This principle is local. Each pixel delivers information widely independent from its neighboring pixels (within the limits of the sampling theorem). As mentioned, this feature has its price: at least three exposures are required for one 3D height map, because the local distance has to be deciphered from the three unknowns (the ambient illumination, the object reflectivity, and the fringe phase) individually for each camera pixel. The three (or more) exposures are commonly taken in a way that allows the data to be deciphered later by phase-shifting algorithms. From only one exposure, this is impossible. Single-shot workarounds, such as single-sideband demodulation, were suggested. However, these demand a spatial bandwidth of the object smaller than 1/3 of the system bandwidth to avoid overlap of the base band with the modulated signal. This restriction is even more severe if the carrier frequency is not equal to the optimal 2/3 of the system bandwidth. The same argument, by the way, holds for holographic “3D imaging”, where the carrier frequency must be sufficiently large to avoid overlap of the reconstructions and the base band. The arguments above clarify: a single-shot 3D camera for arbitrary surface shapes with a pixel-dense point cloud is impossible if no additional modality is exploited. One camera pixel principally cannot deliver sufficient information.
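The three-unknown argument can be made concrete with a classical three-step phase-shifting evaluation (a minimal sketch; the specific shift of 2π/3 per exposure is one common choice, not necessarily the variant used in the cited work):

```python
import numpy as np

def decode_three_step(I1, I2, I3):
    """Recover ambient level A, modulation B and fringe phase phi from
    three exposures I_n = A + B*cos(phi + n*2*pi/3), n = 0, 1, 2."""
    phi = np.arctan2(np.sqrt(3.0) * (I3 - I2), 2.0 * I1 - I2 - I3)
    A = (I1 + I2 + I3) / 3.0
    B = np.sqrt((2.0 * I1 - I2 - I3) ** 2 / 9.0 + (I3 - I2) ** 2 / 3.0)
    return A, B, phi

# One pixel: ambient 0.5, modulation 0.3, phase 1.2 rad
shifts = [0.0, 2 * np.pi / 3, 4 * np.pi / 3]
I = [0.5 + 0.3 * np.cos(1.2 + d) for d in shifts]
A, B, phi = decode_three_step(*I)
print(round(A, 3), round(B, 3), round(phi, 3))  # 0.5 0.3 1.2
```

With only one of the three intensity samples, the system is underdetermined, which is exactly why a single exposure cannot yield a pixel-dense height map.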
So much for a multi-shot sensor with the best possible data density. At the other end of the spectrum of sensors is the “perfect single-shot camera”: light-sectioning triangulation. Instead of fringes, only one narrow line is projected onto the object. Along this line, a perfect 3D profile can be calculated from one camera image.
Note that even light sectioning with only one line is not perfectly local, as the calculation of the line position with subpixel precision exploits at least three neighboring pixels perpendicular to the line direction, as illustrated in Fig. 1(a). It should be noted as well that light sectioning allows for high lateral resolution along the line direction. Of course, light sectioning with only one line displays a very low data density, e.g., ρ ≈ 0.1% for the 1-megapixel camera example. According to the profound ambiguity problem, multi-line light sectioning principles are commonly not able to project more than about 10 lines at the same time (ρ ≈ 1%) if no line codification is exploited. It should be mentioned that the ambiguity problem occurs in multi-shot fringe projection as well. Here, the ambiguity is commonly solved by temporal phase unwrapping [20,21] which, however, requires even more exposures.
There is a workaround to this problem, which we called “Flying Triangulation” [22,23]. Flying Triangulation involves the use of a single-shot sensor with about 10 projected lines. The data from each exposure are sparse, but the gaps between the measured lines can be filled within seconds by on-line registration of subsequent exposures, while the sensor is guided (even by hand) around the object. Eventually, Flying Triangulation delivers dense high-quality data, even of moving objects. However, as the dense point cloud is accumulated over subsequent exposures, it demands rigid objects: speaking or walking people cannot be acquired. The obvious question is: how many lines can be maximally projected in order to obtain a high point cloud density already within each single shot?
If the significant ambiguity problem is neglected for a moment, one can estimate the maximum possible number of lines. The considerations are illustrated by Fig. 1(b). To localize each line with the best subpixel accuracy, the line images must be as narrow as possible, but wide enough to satisfy the sampling theorem. And there must be sufficient space between the lines. With Fig. 1 we find that for subpixel interpolation, the linewidth must be wider (but not much wider, to avoid overlap) than 4p (where p is the pixel pitch). With a half-width of a little more than 2p, precise subpixel interpolation is ensured. These numbers are consistent with theoretical and experimental experience. We note as well that the line image at the three evaluated pixels (Fig. 1(a)) must not be disturbed by abrupt variation of the object height or texture. This tells us again that it is not possible to acquire completely independent data within a small area.
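The three-pixel subpixel estimate can be sketched as a parabolic peak fit through the brightest pixel and its two neighbors (an illustrative reconstruction; a parabola is one common estimator, and the sensor's actual estimator may differ):

```python
def subpixel_peak(I, i):
    """Refine the peak position of a sampled line profile around the
    brightest pixel i, using a parabola through (i-1, i, i+1)."""
    num = I[i - 1] - I[i + 1]
    den = 2.0 * (I[i - 1] - 2.0 * I[i] + I[i + 1])
    return i + num / den

# Synthetic line profile: a parabola peaking at x = 5.3 pixels
profile = [1.0 - 0.1 * (x - 5.3) ** 2 for x in range(10)]
peak_px = max(range(1, 9), key=lambda x: profile[x])   # brightest pixel: 5
print(round(subpixel_peak(profile, peak_px), 3))       # 5.3
```

The fit only works if all three samples belong to an undisturbed line image, which is why abrupt height or texture steps within those pixels corrupt the estimate.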
Figure 1(b) shows that the distance between two lines must be at least 3p for low crosstalk. Fine details between the lines are not resolved, and, with proper band limitation, should not occur. Consequently, a camera with Nx pixels in the x-direction permits a maximum of L ≈Nx/3 lines, and the sensor can acquire Npix/3 valid 3D points within one single shot, yielding a density of ρ ≈33% or 330,000 3D pixels with a 1-megapixel camera.
Why are we not surprised to find this limit for multi-line triangulation? Considerations in object space (see fringe triangulation) tell us that there are three unknowns to be deciphered; considerations in Fourier space (single-sideband encoding) tell us that we must limit the object bandwidth to at most 33% of the available system bandwidth.
Obviously, we have to sacrifice at least 2/3 of the available space-bandwidth. This may explain why the magic number “3” is frequently encountered in this paper.
However, the absolute limit of ρ ≈ 33% can only be reached for relatively flat, untilted surfaces. For highly tilted areas and a large triangulation angle, the line distance in the camera image may shrink according to perspective contraction. This requires a larger line distance of 6p to 7p to be projected. The resulting density of ρ ≈ 16% leads to 160,000 3D points for the 1-megapixel camera. We demonstrate herein that this density is realistic for objects with significant depth variation and that it can be technically achieved without extreme requirements for the calibration and mechanical stability.
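The counting argument of the last paragraphs amounts to simple arithmetic (a sketch with the numbers used in the text):

```python
def multiline_budget(nx, ny, line_pitch_px):
    """Number of projectable lines and resulting 3D point density for a
    camera with nx x ny pixels and a given line pitch (in pixels)."""
    lines = nx // line_pitch_px
    points = lines * ny          # one 3D point per pixel along each line
    density = points / (nx * ny)
    return lines, points, density

# Flat scene, 3-pixel pitch: the rho ~ 33% limit
print(multiline_budget(1000, 1000, 3))   # (333, 333000, 0.333)
# Tilted scene, 6-pixel pitch: rho ~ 16%, about 160,000 points
print(multiline_budget(1000, 1000, 6))   # (166, 166000, 0.166)
```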
As to the crucial question of how to correctly identify (i.e., how “to index”), say, 330 lines, or more modestly 160 lines, we conclude that this formidable problem cannot be solved by spatial encoding of the lines if we want to exploit the full channel capacity for the acquisition of precise, high-resolution 3D data.
The two core questions to be answered by this paper are therefore:
- To what extent can we improve ρ for light sectioning, and what does it cost?
- How to make a sensor that approaches the maximum possible data density, without loss of precision and lateral resolution?

There are a few nearly perfect solutions exploiting color or time of flight as an additional modality. The so-called “rainbow sensors” [25,26] use a projected color spectrum to encode the distance via triangulation. A color camera decodes the shape from the hue of each pixel. Color-encoded triangulation may have a density of ρ = 100% if a three-chip color camera is employed for the acquisition (no spatial interpolation of the Bayer pattern). We notice, by the way, that there are three color channels in a three-chip color camera, which permit faster measurement by virtue of greater space-bandwidth ( = more pixels). Although the concept of rainbow sensors (and other color-encoding approaches) has long been known, it is not yet well established, possibly because it prevents color texture acquisition in many cases [10,27]. Another, nearly “single-shot” solution exploits the time of flight (TOF) from the camera to the object and back. For each pixel, the distance is deciphered from a fast temporal sequence of exposures by temporal phase shifting. TOF is not a genuine single-shot principle, but the image sequence can be taken quite fast, so the method is virtually suited for non-static objects such as walking people. A popular implementation is the latest Microsoft Kinect sensor, which, however, displays poor precision in the millimeter range and a limited number of pixels, not allowing photorealistic 3D images or precise 3D metrology.
From the “three chip” camera for color encoding it is a small step to ask whether we can replace the three (red, green, and blue) sensors using a couple of synchronized black-and-white cameras. With a multitude of cameras, the identification of each projected “line” or “pixel” may become much easier.
The idea of using many cameras was proposed in  approximately 10 years ago, and a principal solution was demonstrated. The authors project a pattern with binary stripes onto the object, and C cameras are required to distinguish 2^C depths. By selecting the triangulation angles properly (in exponential sequence), each stripe can be uniquely identified.
This was (to the best of our knowledge) the first “proof of principle” for a single-shot 3D camera with a potentially dense point cloud. It demonstrates that unique triangulation can be achieved from several images obtained at the same time, as opposed to a timed sequence of images. As explained above, this method may improve the density of the 3D point cloud without, however, reaching the 100% of phase-measuring triangulation.
Approaches which exploit several synchronized cameras have been suggested for the purpose of reducing the number of sequential exposures without attempting to obtain a single-shot solution. For example, multiple cameras are exploited for PMT, to speed up measurements by avoiding multi-frequency phase shifting [11–13]. The method described in , based on Fourier transform profilometry, exploits two cameras to facilitate the phase matching.
How does the idea proposed in  match our considerations above? For a setup comparable to ours (160 lines, >500 distinguishable distances), the method described in  requires a multitude of cameras. We demonstrate in this paper that two cameras are sufficient to measure up to 300,000 3D pixels. Moreover, due to proper subpixel interpolation, the precision of our method is limited only by coherent noise or electronic noise and not by the number of cameras.
2. A single-shot 3D movie camera with unidirectional lines
As discussed in the previous section, a single-shot principle does not supply sufficient information to provide data with 100% density (disregarding “rainbow triangulation”). If pattern encoding comes into play, the density will be even less. Our single-shot camera is based on multi-line triangulation, without any encoding to identify the lines - the decoding is performed just by combining the images of two properly positioned cameras. Compared to approaches that rely on line encoding, our approach preserves fine details and edges because no additional space bandwidth is consumed.
Two approaches with different projected patterns are described. The first approach exploits a projected pattern of straight, narrow lines (160 lines for a 1-megapixel sensor). The object is observed from different triangulation angles by two cameras. The projected slide displays binary lines. The projected lines at the object surface are low-pass filtered by the projector lens, which helps to satisfy the sampling theorem, as discussed above.
The next question is how to manage the necessary “indexing” of that many lines. The corresponding ambiguity problem was discussed in a previous paper . As the novel solution is an extension of these results, they are briefly summarized:
For common multi-line triangulation, the achievable line density L/Δx (where L is the number of projected lines and Δx is the width of field; see Fig. 2) is related to the triangulation angle θ and the unique measurement depth Δz as described by the following expression:

L/Δx ≤ 1/(Δz tan θ).  (1)
A violation of Eq. (1) results in false 3D data, observed as outliers (see Fig. 4(e)). Reference  explains how these outliers can be detected and corrected using data from a second camera positioned at a second angle of triangulation. The basic idea is to (virtually) project the data from the first camera back onto the camera chip of the second camera. The correctly evaluated data can be detected, as they necessarily coincide at the camera chip (but, commonly, not the outliers). However, with increasing line density, more outliers from one camera accidentally coincide with data from the second camera, and the achievable (unique) line density is only moderate. As in Flying Triangulation, registration of several frames is required for sufficient density.
Here an effective improvement of the “back-projection” idea is introduced which allows for an approximately 10 times higher line density without generating any outliers, by the proper choice of a small and a large triangulation angle. In contrast to , where outliers were detected with a probabilistic method, this new approach allows for a deterministic identification of the line indices. We demonstrate that thoughtfully designed optics for illumination and observation, in combination with moderately sophisticated software, can solve the problem.
The basic idea is as follows: a narrow line pattern with L lines is projected onto the object. The object is observed by two cameras (C1 and C2) at two triangulation angles θ1 and θ2 (see Fig. 3). The first camera C1 and the projector P create a triangulation sensor with a very small triangulation angle θ1. This first sensor delivers noisy but unique data within the demanded measurement volume Δz1, according to Eq. (1). The data are noisy, as the precision is δz ~ 1/(SNR sin θ1), with SNR being the signal-to-noise ratio (the dominant source of noise for line triangulation is, commonly, speckle noise [29–32]). The second camera C2 and the projector create a second triangulation sensor with a larger triangulation angle θ2. This second sensor delivers more precise but ambiguous data. As both sensors look at the same projected lines, the first sensor can “tell” the second sensor the correct index of each line, via a back-projection mechanism similar to that described in .
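Eq. (1) fixes how the two triangulation angles may be chosen. An illustrative calculation with the prototype parameters quoted later in the text (Δx = 300 mm field width, Δz = 100 mm measurement depth, 160 lines, θ2 = 9°); θ1 itself is not quoted, so the value below is derived, not measured:

```python
import math

def unique_depth(dx_mm, lines, theta_deg):
    """Unique measurement depth dz = dx / (L * tan(theta)), per Eq. (1)."""
    return dx_mm / (lines * math.tan(math.radians(theta_deg)))

def max_theta_for_unique(dx_mm, lines, dz_mm):
    """Largest triangulation angle keeping dz_mm free of ambiguity."""
    return math.degrees(math.atan(dx_mm / (lines * dz_mm)))

# Sensor 2 (theta2 = 9 deg): unique over only ~12 mm -> ambiguous in 100 mm
print(round(unique_depth(300, 160, 9.0), 1))        # 11.8
# Sensor 1 must stay below ~1.1 deg to be unique over the full 100 mm
print(round(max_theta_for_unique(300, 160, 100), 2))  # 1.07
```

This makes the division of labor visible: the small-angle sensor is unique but noisy, the large-angle sensor precise but ambiguous over the depth range.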
The evaluation procedure is illustrated in Fig. 4. The observed line images deviate from a straight line, depending on the triangulation angle. The lines seen by camera C1 are nearly straight and can be easily indexed, as shown in Fig. 4(b). In the sketch, the index is illustrated by a color code. The directly calculated 3D model (Fig. 4(d)) displays correct indexing but high noise. 3D points directly calculated from the image of C2 display low noise but ambiguity errors (Fig. 4(e)). To solve this problem, both sets of information are merged: the points of Fig. 4(d), including their index information, are back-projected onto the chip of C2. With precise calibration, the back-projections overlap with the line signal (Fig. 4(f)). Eventually, the back-projected line indices of C1 are assigned to the corresponding lines on the chip of C2 (Fig. 4(g)), leading to unique data with high precision (Fig. 4(h)).
As the reader might guess from Fig. 4(f), θ1 and θ2 cannot be chosen independently. Noise has to be taken into account. The correct index can be assigned uniquely if the back-projected noisy line images of C1 do not crosstalk with the neighboring line images on C2. More precisely, the back-projected (noisy) lines should not overlap with lines other than the corresponding lines seen by the second camera. This is the case if Eq. (2) is satisfied:
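The body of Eq. (2) is not reproduced here; the underlying requirement can nevertheless be sketched in object space: the depth noise δz1 of the coarse sensor, seen from C2, produces a lateral offset of about δz1·tanθ2, which must stay below half the line spacing. This is our reading of the condition, not a literal transcription of Eq. (2):

```python
import math

def index_assignment_safe(dz1_noise_mm, theta2_deg, line_spacing_mm):
    """Check whether back-projected lines from the coarse sensor stay
    within half a line spacing on camera C2 (object-space sketch)."""
    lateral_error = dz1_noise_mm * math.tan(math.radians(theta2_deg))
    return lateral_error < 0.5 * line_spacing_mm

# 300 mm field with 160 lines -> ~1.9 mm spacing; the coarse sensor's
# depth noise must then stay below ~6 mm at theta2 = 9 deg
spacing = 300 / 160
print(index_assignment_safe(5.0, 9.0, spacing))   # True
print(index_assignment_safe(8.0, 9.0, spacing))   # False
```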
To demonstrate the robustness of the principle against locally varying object texture and reflectivity, we measured a “natural” object: human faces. Figure 5 displays raw data from a single-shot acquisition of a human face, with 160 projected lines, acquired with a camera resolution of 1024 × 682 pixels. This corresponds to a 3D point density of ρ ≈16%. Figure 5(b) displays the acquired 3D data of the object (Fig. 5(a)). Note that all perspectives are extracted from the same single video frame. Black-and-white texture information is included in the 3D data. Examples comprising color texture are shown in Fig. 11 and in . In Fig. 5(c), a close-up view illustrates the low noise, which will be discussed in detail in the next section. We emphasize that the displayed 3D models are not post-processed, no interpolation or smoothing was applied. Each displayed 3D point was measured independently from its neighbors. As discussed, the absence of spatial encoding strategies allows for edge preservation, as demonstrated in Fig. 6.
With the novel single-shot method, object surfaces can be measured with considerable density and a precision that is only limited by coherent or electronic noise [24,29]. The time required for a single measurement is as short as the exposure time for one single camera frame. A static binary pattern (which can be a simple chrome-on-glass slide) is projected. No electronically controlled pattern generator is required. Hence the 3D data can be acquired in milliseconds or microseconds, only limited by the available illumination. Each camera frame delivers a 3D model, so 3D motion pictures can be acquired. Examples are shown in section 4, in Visualization 2, and in .
In this paper we essentially discuss the key feature of our novel sensor: the single-shot ability. The discussion so far was about data density and speed, not about precision. The reader, however, might suspect that the high data density is dearly bought by sacrificing precision. The following section demonstrates that the sensor is able to reach a precision close to the limit of what physics allows and that it is competitive with the paradigm “phase-measuring triangulation”. We refer to earlier research about the physical limits of 3D sensing, see e.g. [15,29–32,35,36]. Eq. (3) gives the principal uncertainty: the (speckled) image of a projected laser spot cannot be localized better than

δz = Cs λ / (2π sin uobs sin θ),  (3)

where Cs is the speckle contrast, λ is the mean wavelength, sin uobs is the observation aperture, and θ is the triangulation angle.
The influence of coherent noise is illustrated in Fig. 7. Two line profiles are shown, acquired with the same image-side observation aperture: one line projected with coherent laser illumination (Fig. 7(a)), and one line with reduced coherence (Fig. 7(b)). For both lines, the raw image and the lateral variation of the line maximum (with subpixel precision) are shown, the latter leading to the depth precision. The depth precision is calculated for the parameters of our prototype setup (see below).
The line image in Fig. 7(b) displays much lower speckle contrast, which leads to better distance precision. We note that “incoherent illumination” means more than just applying a broadband light source. At metal surfaces (which are pure surface scatterers), an effective reduction of the speckle contrast can only be achieved by reducing the spatial coherence. Measurements of volume scatterers such as skin or plastics can be further improved by exploiting low temporal coherence.
The prototype of our 3D camera displays an observation aperture of sin uobs = 0.00225, a triangulation angle θ2 = 9°, and a horizontal field width of Δx = 300 mm at the center of the measurement range (at z = 500 mm). With fully coherent illumination (Cs = 1), the achievable distance precision would be no better than δzcoh = 249 µm (see Eq. (3)).
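The quoted coherent limit can be reproduced from the speckle-noise formula of Eq. (3). The mean wavelength is not stated explicitly in the text; 550 nm is an assumed value that reproduces the quoted 249 µm:

```python
import math

def coherent_depth_limit_um(wavelength_nm, sin_u_obs, theta_deg,
                            speckle_contrast=1.0):
    """Speckle-limited depth uncertainty per Eq. (3):
    dz = C * lambda / (2*pi * sin(u_obs) * sin(theta))."""
    wavelength_um = wavelength_nm * 1e-3
    return speckle_contrast * wavelength_um / (
        2.0 * math.pi * sin_u_obs * math.sin(math.radians(theta_deg)))

# Prototype: sin(u_obs) = 0.00225, theta2 = 9 deg, fully coherent (Cs = 1)
print(round(coherent_depth_limit_um(550, 0.00225, 9.0)))  # 249
```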
The distance precision after optimization of the setup is found from the following experiment: a planar screen (laminated with thin white paper) is oriented perpendicular to the optical axis of the projector, within the measurement volume. The screen is illuminated with a pattern of 142 projected vertical lines. Measurements are taken at 11 different depths within the measurement volume (from z = 450 mm to z = 550 mm), and the precision δz of each line is calculated from the standard deviation δx’ along each line. Around the distance z = 500 mm, the best precision along an entire line is found to be δzmin = 23 µm (the very best precision in the entire measurement volume is δzmin = 17 µm).
So far we have achieved this high precision only around the center of the field, because the subpixel-precise evaluation of the line maximum is sensitive to line broadening by aberrations. With low-aberration optics the physical precision limit will be achievable in the entire field. The “worst” precision in the entire volume is better than δzmax ≤ 180 µm for the planar screen and δzmax ≤ 200 µm for human skin (the latter due to line broadening by volume scattering).
For the further discussion we consider it justified to use the best precision, as we are interested in the physical limit (rather than in the technical imperfections of our projector and camera lenses). The measured precision δzmin = 23 µm at z = 500 mm is about 10 times better than the coherent limit of 249 µm. This has been achieved by reduction of speckle noise, according to the optimization steps described below:
- i) Reduction of spatial coherence: by exploiting an illumination aperture (sin uill = 0.0075), about three times larger than the observation aperture (sin uobs = 0.00225), the speckle contrast is reduced  by a factor
- ii) At a depolarizing surface (such as our screen), illuminated with unpolarized light (LED projector), the speckle contrast is reduced  by a further factor of
Combining steps i) – iii), we achieve a reduction of speckle noise by a factor of cspat cpol cpix = 0.14, which explains a seven times improvement of the precision, compared to the coherent limit. In fact, the achieved precision is even better. This further improvement is due to volume scattering in connection with temporal incoherence. The quantitative contribution of temporal incoherence could be calculated in principle , but in fact, electronic noise is taking over the role of the dominant source of noise, if the speckle contrast is very low. Moreover, line broadening due to volume scattering partially counterbalances the positive effects of speckle reduction .
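The bookkeeping of the reduction steps is a one-line multiplication (using the combined contrast factor 0.14 and the 249 µm coherent limit quoted in the text):

```python
coherent_limit_um = 249.0    # coherent limit from Eq. (3), Cs = 1
combined_contrast = 0.14     # c_spat * c_pol * c_pix, quoted in the text

predicted_um = combined_contrast * coherent_limit_um
improvement = 1.0 / combined_contrast

print(round(predicted_um, 1))   # ~34.9 um predicted from speckle reduction
print(round(improvement, 1))    # ~7.1x, vs. the measured ~10x (23 um)
```

The gap between the predicted ~35 µm and the measured 23 µm is what the text attributes to additional temporal incoherence at volume scatterers.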
We conclude: in the center of the measurement volume, our prototype setup is able to display a precision of δzmin = 23 µm, about ten times better than the coherent limit. This means that within the measuring range of Δz = 100 mm, more than 4000 depth steps can be distinguished, which competes with the highest-quality 3D cameras available on the market. Note that sensors with better precision commonly exploit many exposures for noise averaging.
4. With crossed lines toward higher point density
In the introduction, the maximum number of lines was estimated to be about 160 lines for a 1-megapixel camera and realistic 3D scenes. From Fig. 8(a) it is obvious that it might become very difficult to implement more lines.
However, there is another option for more 3D points: our first approach can be upgraded by the projection of crossed lines. Figure 8(b) displays the original pattern to be projected, with 160 vertical lines and 100 horizontal lines, based on the aspect ratio (16:10) of the projector.
After image acquisition, the two line directions are identified, isolated, and separately evaluated. In principle, a second pair of cameras could be added to evaluate the second, perpendicular line direction. However, there is a simpler and more cost-effective solution, requiring only two cameras instead of four. As shown in Fig. 3(a), only the distance between camera and projector perpendicular to a line direction defines the triangulation angle. For a crossed-line pattern, we can therefore generate two different triangulation angles for each camera. The resulting setup (see Fig. 9) has four triangulation angles: one large and one small angle for each line direction. With only two cameras, we create four triangulation sensors.
Principally, this can be performed with even more cameras and more line directions. Such a setup with C cameras and D line directions could produce C × D triangulation sub-systems. A setup as shown in Fig. 9 (C = D = 2) with 160 vertical and 160 horizontal projected lines is able to acquire nearly 300,000 3D points from a single frame of a 1-megapixel camera (crossing points are counted only once). Again, it turns out that a proper optical setup makes things easy.
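The point count for the crossed-line pattern follows by counting each crossing only once (a sketch with a square 1000 × 1000 pixel sensor and 160 lines per direction, as in the text):

```python
def crossed_line_points(nx, ny, lines_v, lines_h):
    """3D points from a crossed-line pattern: one point per pixel along
    each line, with crossing points counted only once."""
    vertical = lines_v * ny
    horizontal = lines_h * nx
    crossings = lines_v * lines_h
    return vertical + horizontal - crossings

print(crossed_line_points(1000, 1000, 160, 160))  # 294400, nearly 300,000
```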
What is the cost of the increased number of 3D pixels? The identification of the line direction requires some (not too serious) restriction of the surface shape: to distinguish between different directions, a small line segment has to be visible, which requires some neighborhood and a certain “smoothness” of the surface. This means that not all measured 3D points are completely independent of their neighborhood anymore. We add that it is advantageous to increase the intensity of the projected pattern at the crossing points, so that the line position can still be evaluated in both directions. However, this reduces the signal-to-noise ratio at the other line segments, which reduces the precision.
Figure 10 displays frames of a 3D movie, acquired with the setup of Fig. 9. The different perspectives are each extracted from one video frame. Figures 10(b) and (c) illustrate the low noise with a close-up view. Again, all figures display unprocessed raw data.
At least one color camera must be used for the acquisition of color texture. It is possible to acquire color texture with an auxiliary color camera, using an additional flash exposure. This does not constitute “single-shot” acquisition, which is why we acquire color texture directly from the line images. The simplest solution is to replace the two black-and-white cameras by two color cameras. In this case we accept the spatial interpolation introduced by the Bayer pattern of the RGB cameras. Figure 11 displays three different video frames, taken from a color 3D movie. The full movie can be seen in Visualization 5 and .
This paper presents a single-shot 3D camera concept and device for the acquisition of up to 300,000 3D points within each single camera frame of two synchronized 1-megapixel video cameras. This number is close to the possible maximum, as theoretical estimations reveal. The 3D camera exploits triangulation with a pattern of 160 unidirectional lines or with a pattern of crossed lines with the same pitch. The fundamental problem of unique line identification is solved by combining the images from two cameras: one with a very small triangulation angle and the other with a large triangulation angle.
The 3D camera is technically simple. Special care is given to the proper geometry of the optics and illumination and to obeying the sampling theorem. The precision is limited only by coherent or electronic noise. The precision is better than 1/500 of the distance measuring range (δz ≤ 200 µm for the prototype setup).
The time for the acquisition of a 3D scene is limited only by the camera exposure time and the available illumination level (for static pattern projection). Visualization 6 illustrates the motion of a bouncing ball recorded with a camera frame rate of 30 Hz and an exposure time of ~5 ms. More videos are available on our YouTube channel.
The computational effort is moderate: 30 Hz recording and display with interactive choice of the perspective seems possible in real time.
6. A retrospective aha-experience
Reviewing the images of both cameras (see Fig. 12), one notices a striking resemblance to image-plane holograms. Indeed, the phase of the lines ("fringes") encodes the depth, as in a hologram. The image of Fig. 12(a) could even be reconstructed optically by laser illumination. After eliminating the base band and the second diffraction order, the phase of the reconstruction represents the surface in 3D space. Of course, the phase has to be rescaled: a 2π phase shift corresponds to a distance Δz1 = Δx/(L tanθ1) (see Fig. 2 and Eq. (1)). The virtual "wavelength" is λ1 = 2Δz1.
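The rescaling step above can be sketched in a few lines; the function name and the numeric value of Δz1 in the example are illustrative only, not parameters of the actual setup:

```python
import numpy as np

def phase_to_depth(phi, delta_z1):
    """Rescale a reconstructed fringe phase (radians) to depth.

    One full 2*pi phase cycle corresponds to the depth increment
    delta_z1 = dx / (L * tan(theta1)) of the small-angle sensor;
    the virtual "wavelength" is lambda1 = 2 * delta_z1."""
    return (phi / (2 * np.pi)) * delta_z1

# Illustrative example: a phase of pi maps to half of delta_z1.
z = phase_to_depth(np.pi, delta_z1=2.0)  # -> 1.0
```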
This will work for Fig. 12(a), but not for the image in Fig. 12(b). Due to the large triangulation angle, the phase modulation in Fig. 12(b) is much larger than 2π, and a unique object reconstruction is impossible without “phase unwrapping”.
In retrospect, the first sensor, with the small triangulation angle θ1, is the key component: it serves as a "phase compressor" that enables the acquisition of objects with large depth variation. In combination with the second sensor, it exploits concepts of holography and two-wavelength interferometry, here for rough, macroscopic objects (the second wavelength is λ2 = 2Δz2 = 2Δx/(L tanθ2)).
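The interplay of the two sensors can be illustrated with a minimal two-wavelength unwrapping sketch. The function and its interface are our own illustration (the actual sensor identifies lines in the camera images rather than operating on phase maps), assuming the coarse phase of the small-angle sensor is unambiguous over the measuring range:

```python
import numpy as np

def unwrap_with_coarse(phi_fine_wrapped, phi_coarse, sensitivity_ratio):
    """Minimal two-wavelength phase-unwrapping sketch.

    phi_coarse:        unambiguous phase of the small-angle sensor
                       (the "phase compressor"), in radians
    phi_fine_wrapped:  wrapped phase of the large-angle sensor, in radians
    sensitivity_ratio: Delta_z1 / Delta_z2 = tan(theta2) / tan(theta1)

    The coarse phase predicts the true fine phase; rounding the
    difference to whole 2*pi cycles selects the fringe order."""
    predicted = phi_coarse * sensitivity_ratio
    order = np.round((predicted - phi_fine_wrapped) / (2 * np.pi))
    return phi_fine_wrapped + 2 * np.pi * order
```

The result keeps the depth precision of the large-angle sensor while inheriting the unambiguous range of the small-angle sensor.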
We conclude, with a heavy heart, that the space-bandwidth constraints of single-shot principles have to be accepted. There is some comfort, however, as state-of-the-art video cameras supply a plethora of pixels. Figures 5, 8, 10, and 11 indicate that even a 1-megapixel camera can yield more than one hundred thousand 3D points, which is sufficient for 3D metrology with significant lateral resolution. Cameras with many more pixels are available, and full-HD quality will be achievable.
This paper is essentially about principles and limits, as the sensor works with simple technology. However, the sensor would not work at the limits without precise calibration. We want to acknowledge the invaluable contributions of Florian Schiffers, who was involved in many fruitful discussions, and we wish to acknowledge specifically his assistance with the calibration [33].
References and links
1. N. L. Lapa and Y. A. Brailov, “System and method for three-dimensional measurement of the shape of material objects,” U.S. Patent No. US 7,768,656 B2 (2010).
2. B. Freedman, A. Shpunt, M. Machline, and Y. Arieli, “Depth mapping using projected patterns,” U.S. Patent Application No. US 2010/0118123 A1 (2010).
3. H. Kawasaki, R. Furukawa, R. Sagawa, and Y. Yagi, “Dynamic scene shape reconstruction using a single structured light pattern,” IEEE Conference on CVPR, 1–8 (2008).
4. R. Sagawa, R. Furukawa, and H. Kawasaki, “Dense 3D reconstruction from high frame-rate video using a static grid pattern,” IEEE Trans. Pattern Anal. Mach. Intell. 36(9), 1733–1747 (2014). [CrossRef] [PubMed]
5. B. Harendt, M. Große, M. Schaffer, and R. Kowarschik, “3D shape measurement of static and moving objects with adaptive spatiotemporal correlation,” Appl. Opt. 53(31), 7507–7515 (2014). [CrossRef] [PubMed]
6. S. Heist, P. Lutzke, I. Schmidt, P. Dietrich, P. Kühmstedt, A. Tünnermann, and G. Notni, “High-speed three-dimensional shape measurement using GOBO projection,” Opt. Lasers Eng. 87, 90–96 (2016). [CrossRef]
7. N. Matsuda, O. Cossairt, and M. Gupta, "MC3D: Motion contrast 3D scanning," in 2015 IEEE International Conference on Computational Photography (ICCP) (IEEE, 2015), pp. 1–10.
9. W. Lohry, V. Chen, and S. Zhang, “Absolute three-dimensional shape measurement using coded fringe patterns without phase unwrapping or projector calibration,” Opt. Express 22(2), 1287–1301 (2014). [CrossRef] [PubMed]
11. R. Ishiyama, S. Sakamoto, J. Tajima, T. Okatani, and K. Deguchi, “Absolute phase measurements using geometric constraints between multiple cameras and projectors,” Appl. Opt. 46(17), 3528–3538 (2007). [CrossRef] [PubMed]
12. K. Zhong, Z. Li, Y. Shi, C. Wang, and Y. Lei, “Fast phase measurement profilometry for arbitrary shape objects without phase unwrapping,” Opt. Lasers Eng. 51(11), 1213–1222 (2013). [CrossRef]
13. C. Bräuer-Burchardt, P. Kühmstedt, and G. Notni, “Phase unwrapping using geometric constraints for high-speed fringe projection based 3D measurements,” Proc. SPIE 8789, 878906 (2013). [CrossRef]
14. K. Song, S. Hu, X. Wen, and Y. Yan, “Fast 3D shape measurement using Fourier transform profilometry without phase unwrapping,” Opt. Lasers Eng. 84, 74–81 (2016). [CrossRef]
15. G. Häusler and S. Ettl, “Limitations of optical 3D sensors,” in Optical Measurement of Surface Topography, R. Leach, ed. (Springer, 2011).
19. F. Willomitzer, S. Ettl, C. Faber, and G. Häusler, “Single-shot three-dimensional sensing with improved data density,” Appl. Opt. 54(3), 408–417 (2015). [CrossRef]
21. M. Servin, J. M. Padilla, A. Gonzalez, and G. Garnica, “Temporal phase-unwrapping of static surfaces with 2-sensitivity fringe-patterns,” Opt. Express 23(12), 15806–15815 (2015). [CrossRef] [PubMed]
22. S. Ettl, O. Arold, Z. Yang, and G. Häusler, “Flying Triangulation: an optical 3D sensor for the motion-robust acquisition of complex objects,” Appl. Opt. 51(2), 281–289 (2012). [CrossRef] [PubMed]
23. F. Willomitzer, S. Ettl, O. Arold, and G. Häusler, “Flying Triangulation - a motion-robust optical 3D sensor for the real-time shape acquisition of complex objects,” AIP Conf. Proc. 1537, 19–26 (2013). [CrossRef]
26. J. Geng, “Rainbow three-dimensional camera: New concept of high-speed three-dimensional vision systems,” Opt. Eng. 35(2), 376–383 (1996). [CrossRef]
27. C. Schmalz and E. Angelopoulou, "Robust single-shot structured light," in IEEE Workshop on Projector–Camera Systems (2010).
28. M. Young, E. Beeson, J. Davis, S. Rusinkiewicz, and R. Ramamoorthi, "Viewpoint-coded structured light," in IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR) (2007).
30. G. Häusler, “Ubiquitous coherence - boon and bale of the optical metrologist,” Speckle Metrology, Trondheim. Proc. SPIE 4933, 48–52 (2003). [CrossRef]
31. G. Häusler, “Speckle and Coherence,” in Encyclopedia of Modern Optics, B. Guenther, ed. (Elsevier, 2004).
32. J. Habermann, “Statistisch unabhängige Specklefelder zur Reduktion von Messfehlern in der Weißlichtinterferometrie,” Diploma Thesis, University Erlangen-Nuremberg (2002).
33. F. Schiffers, F. Willomitzer, S. Ettl, Z. Yang, and G. Häusler, “Calibration of multi-line-light-sectioning,” DGaO-Proceedings 2014, 12 (2014).
34. YouTube-Channel of the authors’ research group: www.youtube.com/user/Osmin3D
36. G. Häusler, C. Faber, F. Willomitzer, and P. Dienstbier, “Why can’t we purchase a perfect single shot 3D-sensor?” DGaO-Proceedings 2012, A8 (2012).