
High-speed structured light based 3D scanning using an event camera


Abstract

For a structured light system, scan speed and reconstruction accuracy are usually compromised due to limited sensor bandwidth. The bio-inspired camera, also known as the event camera, has high temporal resolution and redundancy-suppressing properties, showing potential for use in a high-speed structured light system. In this paper, we present an event-based structured light system for high-speed 3D scanning, which is composed of an event camera (CeleX-V) and a high-speed digital light projector (TI-DLP6500). Events are triggered by controlling the projector to blink a single pseudo-random pattern. A simple yet effective algorithm is proposed to generate event frames from the event stream, and a digital image correlation method is then applied to calculate the displacements, from which the 3D surfaces of the target objects are derived. A prototype of the proposed system is built with off-the-shelf devices and tested in both static and dynamic scenes. Experiments verify that the proposed system achieves up to a 1000 fps scan rate with an accuracy of 0.27 mm at a distance of 90 cm.

© 2021 Optical Society of America under the terms of the OSA Open Access Publishing Agreement

1. Introduction

In 3D vision applications such as 3D face recognition and industrial inspection, structured light (SL) based 3D scanning has been widely adopted due to its simplicity, accuracy and non-contact property [1,2]. Unlike the classic stereoscopic system, which passively extracts features from stereo images, the structured light system actively projects temporally or spatially coded 2D patterns onto the surfaces of target objects. A camera is then utilized to capture the scene under structured illumination. After analyzing the correspondences between the captured images and the original patterns, the 3D information of the scene can be obtained.

Accuracy and scan speed are the two main factors used to evaluate the performance of a structured light system [3,4]. There is a trade-off between accuracy and scan speed due to limited sensor bandwidth. For example, in a structured light system equipped with a standard frame-based camera, higher reconstruction accuracy demands higher camera resolution. However, as the resolution increases, the time needed for data I/O also increases, which slows down the scan speed. In order to maintain both a high scan rate and high resolution, a popular scheme is to utilize a high-speed camera [5-7], which offers large bandwidth but is extremely expensive. Besides expanding the bandwidth, another way to increase the scan rate of the SL system is to reduce the redundant information. Recently, the bio-inspired vision sensor, also known as the event camera, has become increasingly popular in computer vision due to its high temporal resolution, low latency, low power and high dynamic range [8]. Since the event camera transmits only non-redundant information, the required bandwidth is reduced, which shows great potential for implementing high-speed SL systems.

Event cameras, such as the Dynamic Vision Sensor (DVS) [9,10], have a different circuit design from traditional frame-based cameras. Each pixel in a DVS can asynchronously respond to relative voltage changes (i.e., intensity changes) on the logarithmic photo-detector and output an event with microsecond temporal resolution if a predefined threshold is exceeded. For pixels with zero or minor intensity changes, no events are fired or transmitted. Compared with a conventional camera, which must sample the entire image for a valid scan, the imaging mechanism of the event camera drastically reduces the amount of required sampling data.

A crucial problem in utilizing the event camera in a structured light system is how to generate events. Since static scenes with constant illumination cannot bring about any brightness changes, a light source is needed to provide varying illumination on the surfaces of the scanned objects to trigger events. Recent studies have introduced several approaches for adopting an event camera in a structured light system. Brandli et al. [11] first combine an event camera with a pulsed line laser to perform terrain mapping tasks. They propose an adaptive event-based filtering algorithm based on the temporal histogram to efficiently extract the laser stripe position. Matsuda et al. [12] adopt a DVS and a laser diode with a micro-mirror and trigger events by blinking the laser diode. They register events according to their columns and timestamps, thereby obtaining the correspondence between the event location and the projector location estimated from the event timestamps. Martel et al. [13] demonstrate an event-based stereo system associated with a laser diode. They alleviate the stereo-matching problem by pairing the timestamps of events triggered in a pair of event cameras. Although the high temporal resolution and redundancy-suppressing property of event cameras greatly boost the scan rate, these laser-based systems still require a long acquisition time waiting for the laser to scan over the target object, which makes them unsuitable for reconstructing fast-moving objects. In addition, the applications of laser-based SL systems are restricted by the power of the laser scanner due to safety concerns. Besides the laser scanner, an alternative active light source is the digital light projector (DLP). Mangalore et al. [14] trigger events by projecting a sequence of moving fringe patterns, which is equivalent to performing several line scans simultaneously. After phase unwrapping, the depth map is derived by conventional fringe projection profilometry (FPP) [15]. Leroux et al. [16] trigger events by projecting sinusoidal signals varying in intensity and tag regions with different code-words (i.e., signal frequencies). They use a burst filter to extract the frequency and then obtain corresponding pairs between the projector and the camera. However, the high temporal resolution of the event camera is not fully utilized in these DLP-based methods, since they require sequential projections (i.e., multiple shots) for a valid depth reconstruction, which reduces the scan rate of the SL system.

In this paper, we present a novel event-based SL system that can perform high-speed 3D scanning for both static and dynamic scenes. Distinct from the time-multiplexing strategy that projects multiple patterns, we project a single pattern onto the scene to fully exploit the high temporal resolution of the event camera. Specifically, we utilize a binary pattern with pseudo-random black-and-white dots, similar to the one used in the Kinect depth sensor [2]. The DLP projector repeatedly projects the binary pattern with the LED lights switched ON and OFF, so that the event camera can detect the intensity changes and output an event stream. The deformation frames and the reference frame are generated by accumulating events from the event stream with a simple yet effective algorithm. The displacement map is calculated with the digital image correlation method by comparing the deformation frame and the reference frame. The depth map is then derived from the displacement map via triangulation. Figure 1 shows the diagram of the proposed system. We build a prototype of the proposed system with off-the-shelf devices, composed of an event camera (CeleX-V) and a high-speed digital light projector (TI-DLP6500). Experimental results demonstrate that this system can perform 3D scanning at a 1000 fps scan rate with an accuracy of 0.27 mm at a distance of 90 cm.


Fig. 1. The diagram of our event-based structured light system, which consists of a high-speed projector and an event camera. To trigger events, we blink the pseudo-random pattern with the lights ON and OFF repeatedly. Event frames are extracted from the event stream via a scan cycle separation algorithm.


The rest of the paper is organized as follows. Sec. 2 briefly introduces the principle of the event-based structured light system as well as the reconstruction algorithms. Secs. 3 and 4 present our experimental setups and depth reconstruction results on static and dynamic scenes. Sec. 5 provides conclusions and a discussion of the limitations.

2. Principle

2.1 Projection pattern

For high-speed SL, it is essential to utilize a small set of projection patterns. The ideal case is that only one pattern is utilized, which avoids motion between captured frames. The binary pattern encodes the location information through the distinguishability of local spatial neighborhoods, and has been widely utilized in consumer-grade products such as the Microsoft Kinect [17]. Essentially, the pattern is a $W \times H$ image with binary pseudo-random dots, and any sub-window of a particular size sliding over the full pattern should be unique. Following the seminal work in this field [18], we generate the pseudo-random pattern (shown in Fig. 1) according to the following rules; an illustrative generation sketch is given after the list:

  1. The pattern is a black image with white dots. The size of each white dot is $K \times K$ pixels.
  2. The pattern is divided into $(\frac {W}{K})\times (\frac {H}{K})$ regions, and each $3 \times 3$ block of regions contains only one white dot.
  3. The position of each white dot is randomly selected such that there are no adjacent white dots in its 8-connected neighborhood.
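
The following Python sketch shows one way such a pattern could be generated under our reading of these rules; the function name, the NumPy-based implementation and the greedy random selection are illustrative choices, not the authors' code.

import numpy as np

def generate_pattern(W=1920, H=1080, K=8, seed=0):
    # Illustrative sketch: split the image into K x K cells, place exactly one
    # white cell per 3 x 3 block of cells, and reject candidate cells whose
    # 8-connected neighborhood (in the cell grid) already contains a white cell.
    rng = np.random.default_rng(seed)
    cols, rows = W // K, H // K                    # size of the cell grid
    grid = np.zeros((rows, cols), dtype=bool)
    for by in range(0, rows, 3):
        for bx in range(0, cols, 3):
            candidates = [(y, x)
                          for y in range(by, min(by + 3, rows))
                          for x in range(bx, min(bx + 3, cols))
                          if not grid[max(y - 1, 0):y + 2, max(x - 1, 0):x + 2].any()]
            if candidates:                         # a block may rarely be skipped
                y, x = candidates[rng.integers(len(candidates))]
                grid[y, x] = True
    # Up-sample the cell grid to pixel resolution: each white cell becomes a K x K dot.
    return np.kron(grid.astype(np.uint8), np.ones((K, K), dtype=np.uint8)) * 255

pattern = generate_pattern()                       # 1080 x 1920 binary image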

2.2 Event frame generation

Unlike the standard frame-based camera, pixels in the event camera operate asynchronously and independently. Each pixel in the event camera continuously compares the previously stored intensity with the current intensity and fires an event if the change in brightness exceeds a given contrast threshold. If the brightness increases, it fires a positive (ON) event; otherwise, it fires a negative (OFF) event. Mathematically, the firing process of an event $e_k(x, y, t)$ can be formulated as

$$e_k(x, y, t) = \left\{ \begin{array}{lr} \ \ 1\;\textit{if} \; \log \frac{I(x,y,t)}{I(x,y,t - \Delta t)} \geq C \\ -1\;\textit{if} \; \log \frac{I(x,y,t)}{I(x,y,t - \Delta t)} \leq{-}C \\ \end{array} \right.$$
where $I$ denotes the brightness of the pixel, $C>0$ is the contrast threshold and $\pm 1$ represents the polarity $p$ of the event. Here $x$, $y$ and $t$ denote the horizontal coordinate, vertical coordinate and timestamp of the $k^{th}$ event $e_k$. Note that some event-based devices do not natively support polarity detection and only encode an event as a three-element tuple $(x, y, t)$. The output of the event camera is a series of events, called an event stream, which is transmitted using the address-event representation [19]. For our event-based SL system, events are triggered by blinking the pattern repeatedly. Specifically, we control the projector to project the designed pattern for a short time and then stop the projection. After that, we restart another projection cycle with the same setting. This projection scheme generates distinct intensity changes, which facilitates event firing.
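
As a toy illustration of Eq. (1), the sketch below fires ON/OFF events per pixel from two successive brightness values; the function and parameter names are ours, and the model ignores the per-pixel reset behavior and noise of a real sensor.

import numpy as np

def fire_events(I_prev, I_curr, C=0.3, eps=1e-6):
    # Compare the log-intensity change against the contrast threshold C and
    # fire +1 (ON), -1 (OFF) or 0 (no event) at every pixel.
    d = np.log(I_curr + eps) - np.log(I_prev + eps)
    events = np.zeros_like(d, dtype=np.int8)
    events[d >= C] = 1      # brightness increased
    events[d <= -C] = -1    # brightness decreased
    return events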

A single event, or even a small number of events, cannot provide enough information to recover the 3D contours of objects in the scene. Similar to traditional SL, we use a frame-based scheme to carry out the 3D scanning. To do so, we first need to convert the event stream into 2D event frames. The crucial problem in event frame generation is to specify slicing windows for the event stream. The frame generation result is very sensitive to the lengths of the slicing windows: with a short time window, the sliced event segment may produce a frame with an unclear pattern; with a long time window, the effective scan rate slows down and motion blur may occur.

We conduct several experiments to investigate an appropriate slicing scheme. In these experiments, the events are triggered by blinking the binary pattern at 33 Hz. The durations of the projection and non-projection periods in a projection cycle are both 15 ms. At the start of the projection period, the event camera mainly fires positive events. Similarly, at the start of the non-projection period, the event camera mainly fires negative events. We collect events for a few periods and present the event counts in Fig. 2(a). It can be seen that positive and negative event bunches appear alternately and that their frequencies are approximately equal, which is consistent with our setup. An intuitive slicing strategy is to use equal-length slicing windows, which can easily be implemented with an external trigger for the event camera. However, event noise introduced by ambient illumination and electronic jitter, as well as event saturation, leads to unequal periods for both positive and negative event bunches. Thus, we have to investigate an unequal slicing scheme to better segment the event stream. To further analyze the characteristics of the event data, we choose an event segment with a time range $T$ as our target sample, which is a subset of the event stream in Fig. 2(a). This event segment is illustrated in a 3D scatter diagram with axes of row, column and time in Fig. 2(b). It can be seen that the event camera scans row by row from bottom to top, which is consistent with its circuit design [20]. After scanning the first row, the event camera goes back to the bottom row, starting the next scan cycle. The event camera scans so fast that it can perform several scan cycles in one projection period. It is worth noting that the time for a scan cycle is not constant, since rows containing no events require little scan time. The final scan time depends on the distribution of the triggered events as well as the hardware configuration of the event camera.


Fig. 2. The event stream in different forms of presentation. (a) Event bunches, where the horizontal axis represents time and the vertical axis represents event counts. (b) The 3D scatter diagram of events within the time range $T$ labeled in (a). (c) Event frames of a waving hand with stacking number $N=1$, $N=3$, $N=6$, respectively.


We propose a simple yet effective method to adaptively slice the event stream. The core of our method is to segment the event stream into scan cycles. Given an event stream, we check the events one by one and count the number of events in each row. If the number of events in a row reaches a predefined threshold, we set that row as the starting row and then continue iterating over the remaining events. Since the event camera scans and outputs events row by row, if an event's row number is larger than that of the previous event, this event is the starting event of the next scan cycle and its predecessor is the ending event of the last scan cycle. Using this method, we can easily slice the event stream into scan cycles. The above procedure is summarized in Algorithm 1. After obtaining the scan cycles, we accumulate the events within one or more cycles into a binary event frame via a logical OR operation. We refer to the number of scan cycles used for accumulation as the stacking number. The proposed method allows us to extract an event frame with the minimum time range, and the predefined threshold helps to reduce the noise involved when accumulating a frame.
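
A minimal Python sketch of this slicing and stacking logic, based on our reading of the description above (Algorithm 1 itself is not reproduced in the text); the function names, data layout and threshold default are illustrative assumptions.

import numpy as np

def slice_scan_cycles(rows, threshold=10):
    # rows: per-event row index in arrival order.
    # Skip events until some row has accumulated `threshold` events, then start
    # a new scan cycle whenever the row index increases with respect to the
    # previous event (the sensor reads out rows from bottom to top).
    counts = {}
    start = None
    for i, r in enumerate(rows):
        counts[r] = counts.get(r, 0) + 1
        if counts[r] >= threshold:
            start = i
            break
    if start is None:
        return []
    cycles, begin = [], start
    for i in range(start + 1, len(rows)):
        if rows[i] > rows[i - 1]:          # row index jumped up -> new cycle
            cycles.append((begin, i))      # events [begin, i) form one cycle
            begin = i
    cycles.append((begin, len(rows)))
    return cycles

def stack_frame(xs, ys, cycles, n_stack, height, width):
    # Accumulate the events of `n_stack` consecutive scan cycles into one
    # binary event frame via a logical OR.
    frame = np.zeros((height, width), dtype=bool)
    for begin, end in cycles[:n_stack]:
        frame[ys[begin:end], xs[begin:end]] = True
    return frame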

We record events for the scene of a waving hand and extract event frames with different stacking numbers. The generated event frames are presented in Fig. 2(c). It can be seen that stacking more scan cycles provides better details. However, motion blur is introduced if too many scan cycles are used. Thus, the stacking number needs to be carefully tuned for the specific experimental environment.

2.3 Depth estimation

For depth estimation, we need to capture two frames: a reference frame with no objects in the scene, which is reusable as long as the alignment of the camera and projector remains unchanged, and a deformation frame with the target objects in the scene. The presence of objects causes a deformation of the pseudo-random pattern in the event frames, from which we can derive the depth of the objects. To calculate the displacements between the deformation frame and the reference frame, we adopt digital image correlation (DIC) algorithms. Compared with 2D-DIC, 3D-DIC is considered to be more accurate and practical [21-24]. However, applying 3D-DIC requires multiple cameras in the setup. Given that there is only one event camera in our setup, 2D-DIC is more appropriate.

The 2D-DIC algorithms assume the displacement field to be homogeneous within a small region. Under this assumption, the points within a small region can be transformed from the reference frame to the deformation frame with the following first-order equations:

$$\begin{array}{l}x' = x + u + \frac{\partial u}{\partial x}(x-x_c) +\frac{\partial u}{\partial y}(y-y_c) \\ y' = y + v + \frac{\partial v}{\partial x}(x-x_c)+\frac{\partial v}{\partial y}(y-y_c),\end{array}$$
where $(x,y)$ in the reference frame and $(x',y')$ in the deformation frame are corresponding points, $(x_c,y_c)$ is the center of the small region and $u$ and $v$ represent displacements.
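
As a small illustration of Eq. (2), the sketch below maps a reference-frame point to the deformation frame using the six shape-function parameters; the function name and parameter packing are our own.

def warp_point(x, y, xc, yc, p):
    # p packs the six parameters (u, v, du/dx, du/dy, dv/dx, dv/dy) of the
    # first-order shape function; (xc, yc) is the center of the small region.
    u, v, ux, uy, vx, vy = p
    xp = x + u + ux * (x - xc) + uy * (y - yc)
    yp = y + v + vx * (x - xc) + vy * (y - yc)
    return xp, yp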

To evaluate the transformation, the 2D-DIC algorithm computes the similarity between the pixels within the small region in the deformation frame and those in the reference frame after transformation. A commonly used criterion of similarity is zero-mean normalized cross-correlation (ZNCC):

$$C_{ZNCC}=\frac{\sum_{i=1}^{M} \sum_{j=1}^{N} [f(x_i,y_j) -\bar{f}]\times [g(x_i',y_j')-\bar{g}]} {\sqrt{\sum_{i=1}^{M} \sum_{j=1}^{N} [f(x_i,y_j) -\bar{f}]^2}\times \sqrt{\sum_{i=1}^{M} \sum_{j=1}^{N} [g(x_i',y_j') -\bar{g}]^2}},$$
where $f$ and $g$ are functions returning the intensity value at location $(x,y)$, and $\bar {f}$ and $\bar {g}$ are the mean intensities within the region of size $M\times N$ in the reference frame and the deformation frame, respectively.
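
Eq. (3) can be computed directly on two patches of equal size, for instance as follows (an illustrative helper, not part of Ncorr):

import numpy as np

def zncc(f_patch, g_patch):
    # Zero-mean normalized cross-correlation between a reference patch f and
    # a (warped) deformation patch g of the same M x N size.
    f = f_patch - f_patch.mean()
    g = g_patch - g_patch.mean()
    denom = np.sqrt((f ** 2).sum()) * np.sqrt((g ** 2).sum())
    return (f * g).sum() / denom if denom > 0 else 0.0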

In order to optimize the transform function in Eq. (2) with its six variables $(u,v,\frac {\partial u}{\partial x},\frac {\partial u}{\partial y},\frac {\partial v}{\partial x},\frac {\partial v}{\partial y})$, we adopt a popular 2D-DIC package called Ncorr [25], an open-source package that integrates RG-DIC [26] with a multi-threaded implementation. The basic principle of the Ncorr algorithm is to randomly select a small region in the deformation frame within a given mask, which specifies the region of interest (ROI) of the target objects. The mask can easily be acquired by performing erosion and dilation operations on the deformation frames. Using the ZNCC criterion, Ncorr traverses the reference frame to find an initial guess of the displacements with integer values. It then feeds this initial guess to a nonlinear optimizer to obtain more precise displacements in the following iterations. With the help of B-spline interpolation, the displacement results can achieve sub-pixel precision. Based on the displacements, the depth can be directly calculated via triangulation once the SL system is accurately calibrated. The whole workflow of our event-based SL system is presented in Fig. 3.
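
The paper does not spell out its triangulation step, but for a calibrated camera-projector pair with a reference plane, a common relation between displacement and depth is the reference-plane model sketched below; the formula, sign convention and parameter names are assumptions for illustration only.

def depth_from_displacement(d, f_pix, baseline, z_ref):
    # Reference-plane triangulation sketch: with focal length f_pix (in pixels),
    # baseline (in meters) and a reference plane at distance z_ref (in meters),
    # a displacement d (in pixels, relative to the reference frame) is assumed to
    # satisfy d = f_pix * baseline * (1/z - 1/z_ref), so the depth z is:
    return 1.0 / (1.0 / z_ref + d / (f_pix * baseline))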


Fig. 3. The workflow of our event-based SL system, which can be divided into three main parts. The event frame generation part extracts event frames from event streams using Algorithm 1. The DIC part calculates the displacements between the deformation frame and the reference frame, and an automatically produced mask locating the target objects is used to improve the accuracy. The depth reconstruction part converts the relative displacements to the absolute depth using system intrinsic and extrinsic parameters.


Fig. 4. (a) The hardware prototype of our system, which consists of a TI DLP6500 projector and a CeleX-V event camera. (b) The system model corresponding to (a) from the top view with specific experimental parameters.


3. System setup and calibration

Our SL system, shown in Fig. 4, consists of a CeleX-V event camera [27] and a projector equipped with a Texas Instruments DLP6500 chip. CeleX-V is a new-generation event-based vision sensor. Compared with other DVS devices, it has four advantages: high spatial resolution ($1280\times 800$ pixels), high temporal resolution (8 $\mu s$ latency), large bandwidth (100 million events per second) and high dynamic range (140 dB). All these features make CeleX-V well suited for a high-speed structured light system. The projector with the TI DLP6500 chip is able to project a binary pattern of $1920\times 1080$ resolution at a frame rate of 9500 fps. It has three controllable color LEDs: red, green and blue. The resolution of the binary pattern is the same as that of the projector, namely $1920\times 1080$ pixels. The parameter $K$ for the pattern generation mentioned in Sec. 2.1 is set to 8. The predefined threshold for slicing the event stream is set to 10, which is kept constant in our experiments.

In the experiments, we align the event camera lens with the projector lens along the vertical axis with a baseline of 6.8 cm. We place a white screen in front of the projector at a distance of approximately 100 cm to capture the reference frame. The size of the captured projection pattern in the reference frame is about $1030 \times 580$ pixels. When scanning objects, we replace the white screen with a black screen, which reduces redundant events. Under these settings, the projected pattern on the background screen covers an area of 34 cm $\times$ 61 cm. We fine-tune the pose of the projector to make sure its central optical ray is perpendicular to the screen, and then adjust the pose of the event camera until the shape of the projected pattern on the event frame is approximately rectangular. In the CeleX-V circuit design, each DVS pixel is combined with a conventional active pixel sensor (APS) [28]. The APS sensor outputs grayscale intensity and shares the same photodiodes with the DVS sensor. To accurately calibrate our system, we apply Zhang's calibration algorithm [29] to the grayscale images captured in the APS mode.
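
A standard way to run Zhang's calibration [29] on such grayscale frames is via OpenCV, as in the sketch below; the checkerboard size, file names and the use of OpenCV are our own assumptions rather than details given in the paper.

import glob
import cv2
import numpy as np

board = (9, 6)                                   # inner corners of an assumed checkerboard
objp = np.zeros((board[0] * board[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:board[0], 0:board[1]].T.reshape(-1, 2)

obj_pts, img_pts = [], []
for path in glob.glob("aps_calib_*.png"):        # hypothetical APS calibration images
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, board)
    if found:
        obj_pts.append(objp)
        img_pts.append(corners)

# Intrinsic matrix and distortion coefficients from Zhang's method.
rms, cam_mtx, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_pts, img_pts, gray.shape[::-1], None, None)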

As mentioned in Sec. 2.2, we utilize the DLP projector to project the pseudo-random pattern in each projection period to trigger events. The exposure time is set to 0.85 ms, followed by a dark time of 0.15 ms (i.e., a 1000 fps projection rate). The reason for the asymmetric projection period is that CeleX-V is more sensitive to increases in intensity and thus triggers more events; it is therefore reasonable to expand the proportion of the exposure time to trigger more events under a fixed scan rate. In our experiments, we deem events of positive and negative polarity equally beneficial to frame generation, so events with different polarities are treated in the same way.

We also test varying intensities for the different LED colors, from which we find that CeleX-V is not very sensitive to blue light, since fewer events are observed. The red and green lights show comparable performance. This property may be related to the nature of the photodiode in CeleX-V. Therefore, we set the blue component of the projection light to 0 and use yellow light (i.e., red light mixed with green light) in our experiments. The light intensity of the projector and the contrast threshold of CeleX-V are the two main parameters determining the quality of the event frames. With a higher light intensity or a lower contrast threshold, the number of events, accompanied by noise (i.e., undesired events), increases; with a lower light intensity or a higher contrast threshold, it decreases. These two parameters need to be empirically determined for a specific environment. In our experiments, we set both the red and green lights to an intensity value of 100 out of 256 and the contrast threshold of CeleX-V to 130 out of 511. For each projection period, Algorithm 1 is able to robustly extract the scan cycles, which softly synchronizes CeleX-V with the projector. When the event frames are generated from a single scan cycle, our structured light system achieves a scan rate of 1000 fps.

4. Experimental results

4.1 Static scenes

To verify our proposed event-based SL system, we first measure static scenes with some representative plaster models: a sphere, a cross-cone, an Agrippa face and a David bust, which are shown in the first row of Fig. 5. Under the settings described in Sec. 3, we reconstruct the surfaces in the DVS mode with one scan cycle. In the APS mode, CeleX-V can also generate 8-bit grayscale intensity frames at up to 100 fps, so we additionally reconstruct the surfaces from the frames captured in the APS mode for comparison. The reconstruction results in the APS and DVS modes are shown in the second and third rows of Fig. 5, respectively. The results are presented as colored meshes produced with the ball-pivoting surface reconstruction algorithm [30], where the colors correspond to the depth values of the 3D points. It can be seen that our event-based structured light system is able to reliably reconstruct objects with complex shapes, and it demonstrates a visually comparable performance to the frame-based reconstruction.


Fig. 5. 3D scanning results for different objects. (a)-(d) Plaster models including a sphere, a cross-cone, an Agrippa face model and a David bust model. (e)-(h) Reconstruction results in the APS mode. (i)-(l) Reconstruction results in the DVS mode. The color bar on the bottom of each column demonstrates the color-depth correspondence of each object.


To quantitatively evaluate the accuracy of our SL system, we place a plaster square pyramid in the scene, shown in Fig. 6(a), and crop a 70 mm $\times$ 70 mm region from it to extract an inclined plane. We measure the 3D surface in both the DVS and APS modes at a distance of 90 cm. For the DVS mode, we set a 1000 fps projection rate. For the APS mode, we set the exposure time of the projected pattern equal to the whole projection period and capture the grayscale frames at 100 fps. The 3D point cloud reconstructed in the DVS mode with one scan cycle is shown in Fig. 6(b). The cross-sections through the diagonal of the cropped region in the APS and DVS modes are presented in Fig. 6(c). Since the point cloud is derived from an inclined plane, the ground-truth cross-section should be a straight line. Visually, the reconstruction in the APS mode performs better than that in the DVS mode. For quantitative evaluation, we perform a plane fitting algorithm on the cropped region and re-sample pixels on the fitted plane to generate the ground truth. The errors are calculated as the differences between the 3D points and the ground truth, as shown in Fig. 6(d) and Fig. 6(e). We accumulate the errors in the DVS mode into the histogram shown in Fig. 6(f), which is approximately a normal distribution with mean $\mu =0$ mm and standard deviation $\sigma =0.27$ mm. The root mean square errors (RMSE) between the reconstructed 3D data and the ground truth within the cropped region are used to measure the reconstruction accuracy. The RMSE values in the DVS mode are 0.27 mm, 0.23 mm and 0.24 mm for stacking numbers $N=1, 2, 3$, respectively, and the RMSE in the APS mode is 0.10 mm for the same scene. The results reveal that stacking two scan cycles can indeed achieve higher accuracy than stacking a single scan cycle, at the cost of a reduced scan rate. Stacking three scan cycles, however, brings in more noise, so the reconstruction accuracy decreases. The accuracy in the APS mode is the highest because APS frames are 8-bit and therefore much more robust to noise than 1-bit event frames. Nevertheless, considering the relatively low RMSE in the DVS mode and its much higher scan rate (10 times faster than the APS mode), the sacrifice in accuracy is worthwhile in exchange for the significant gain in scan speed.
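
The plane-fitting evaluation described above can be reproduced with a least-squares fit, for example as in the sketch below; the fitting routine is our own illustrative choice, since the paper does not specify which one it uses.

import numpy as np

def plane_rmse(points):
    # points: (N, 3) array of 3D points from the cropped inclined region.
    # Fit z = a*x + b*y + c by least squares and report the RMSE of the residuals.
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    A = np.column_stack([x, y, np.ones_like(x)])
    coeffs, *_ = np.linalg.lstsq(A, z, rcond=None)
    residuals = z - A @ coeffs
    return np.sqrt(np.mean(residuals ** 2))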


Fig. 6. The evaluation of reconstruction accuracy. We crop a 70 mm $\times$ 70 mm region from a plaster square pyramid (a) to extract an inclined plane, and reconstruct the 3D point cloud of this cropped region shown in (b). The cross-sections through the diagonal of the cropped region (blue line in (a)) under the APS and DVS modes are presented in (c). We derive a fitting plane based on the point clouds and consider it as the ground truth. The errors are obtained by calculating the differences between the 3D points and the ground truth. The error maps for the APS and DVS modes are shown in (d) and (e). The distribution of the errors under the DVS mode is shown as a histogram in (f), which is approximately a normal distribution with mean $\mu =0$ mm and standard deviation $\sigma =0.27$ mm.


4.2 Dynamic scenes

Our event-based SL system is able to scan dynamic scenes at a high speed. To verify this, we perform 3D scanning on a challenging dynamic scene: tearing a piece of A4 paper. The difference between static and dynamic scenes for our event-based SL system is that the motion of dynamic objects can trigger additional undesired events. These additional events do not carry any pattern information and are considered noise that may interfere with the DIC algorithm. However, at the 1000 fps projection rate, the proportion of events triggered by object motion is relatively small. In this dynamic scene, the event frames are clear enough that we do not apply any denoising operations during the event frame generation procedure. At the 1000 fps projection rate, we record all the events during the process of tearing a piece of paper, which lasts 2.3 seconds, and then extract 2300 frames using Algorithm 1. The depth reconstruction results with timestamps every 450 ms (i.e., every 450 frames) are shown in Fig. 7. The results reveal that our system is able to capture the deformation of the paper and reconstruct its irregular shapes (e.g., convex and concave) at high speed. It is difficult for SL systems with conventional frame-based cameras to achieve such a high scan rate, and motion blur can be a fatal problem for them when performing 3D scanning on dynamic objects. As mentioned in Sec. 1, among previous event-based structured light systems, neither the laser-based methods nor the DLP-based time-multiplexing methods can deal with this dynamic scene, since they require a long acquisition time to scan over the target object. This 3D scanning example demonstrates the superior performance of our event-based SL system on challenging dynamic scenes at a high scan rate. The full results for all event frames are presented in video format in Visualization 1.


Fig. 7. 3D scanning results for tearing a piece of A4 paper at different timestamps. (a) The first event frame generated with one scan cycle. (b)-(f) The sampled depth reconstruction results for every 450 frames, respectively. We present these results as colored meshes to better unfold their 3D textures, and the complete results are displayed in Visualization 1.


5. Conclusion and discussion

In this paper, we present a novel high-speed structured light system using an event camera. A pseudo-random pattern is adopted to encode the spatial information. A DLP projector is utilized to trigger events by projecting the binary pattern with the lights switched ON and OFF repeatedly. A simple yet effective algorithm is proposed to robustly detect the scan cycles and generate clear event frames without hardware synchronization. With the help of the DIC algorithm, the displacements between the reference frame and the deformation frame are derived at sub-pixel precision, and the depth information is then retrieved via triangulation. Experiments verify that our system can scan 3D objects at a maximum scan rate of 1000 fps. The quantitative evaluation on the surface reconstruction of an inclined plane reveals that our system achieves an accuracy of 0.27 mm at a distance of 90 cm with one scan cycle (i.e., at 1000 fps). Compared with existing frame-based or event-based structured light systems, our system performs 3D scanning with high speed, high accuracy, dense reconstruction, low I/O demand and low power consumption, which makes it a promising 3D scanning solution for many time-critical engineering fields.

As a preliminary attempt, we prove that event cameras can be successfully utilized in a high-speed structured light system. However, the measurement accuracy of our system still has room for improvement. Many methods may help enhance the accuracy, such as deploying multiple event cameras, utilizing 3D-DIC algorithms, matching against multiple reference images and switching to temporally coded patterns. In addition to accuracy, our system still has some limitations that require further investigation. First, object motion brings in intensity changes in dynamic scenes, which leads to undesired events; more efficient denoising algorithms are needed to filter them out of the event streams. Second, as a prototype, our system does not make use of the polarity information of the events. How to integrate polarity information into an event-based structured light system remains an open problem. We regard these aspects as our future work.

Funding

National Natural Science Foundation of China (61901435).

Disclosures

The authors declare no conflicts of interest.

Data availability

Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request.

References

1. S. S. Gorthi and P. Rastogi, “Fringe projection techniques: Whither we are?” Opt. Lasers Eng. 48(2), 133–140 (2010). [CrossRef]  

2. J. Sell and P. O’Connor, “The xbox one system on a chip and kinect sensor,” IEEE Micro 34(2), 44–53 (2014). [CrossRef]  

3. J. Geng, “Structured-light 3D surface imaging: a tutorial,” Adv. Opt. Photonics 3(2), 128 (2011). [CrossRef]  

4. S. Zhang, “High-speed 3d shape measurement with structured light methods: A review,” Opt. Lasers Eng. 106, 119–131 (2018). [CrossRef]  

5. T. Tao, Q. Chen, S. Feng, J. Qian, Y. Hu, L. Huang, and C. Zuo, “High-speed real-time 3D shape measurement based on adaptive depth constraint,” Opt. Express 26(17), 22440 (2018). [CrossRef]  

6. W. Yin, S. Feng, T. Tao, L. Huang, M. Trusiak, Q. Chen, and C. Zuo, “High-speed 3D shape measurement using the optimized composite fringe patterns and stereo-assisted structured light system,” Opt. Express 27(3), 2411 (2019). [CrossRef]  

7. A. Roth, E. Kristensson, and E. Berrocal, “Snapshot 3D reconstruction of liquid surfaces,” Opt. Express 28(12), 17906 (2020). [CrossRef]  

8. G. Gallego, T. Delbruck, G. M. Orchard, C. Bartolozzi, B. Taba, A. Censi, S. Leutenegger, A. Davison, J. Conradt, K. Daniilidis, and D. Scaramuzza, “Event-based vision: A survey,” IEEE Transactions on Pattern Analysis and Machine Intelligence p. 1 (2020).

9. P. Lichtsteiner, C. Posch, and T. Delbruck, “A 128 × 128 120 dB 15 μs latency asynchronous temporal contrast vision sensor,” IEEE J. Solid-State Circuits 43(2), 566–576 (2008). [CrossRef]

10. J. A. Leñero-Bardallo, T. Serrano-Gotarredona, and B. Linares-Barranco, “A 3.6 μ s latency asynchronous frame-free event-driven dynamic-vision-sensor,” IEEE J. Solid-State Circuits 46(6), 1443–1455 (2011). [CrossRef]  

11. C. Brandli, T. A. Mantel, M. Hutter, M. A. Höpflinger, R. Berner, R. Siegwart, and T. Delbruck, “Adaptive pulsed laser line extraction for terrain reconstruction using a dynamic vision sensor,” Front. Neurosci. 7, (2014).

12. N. Matsuda, O. Cossairt, and M. Gupta, “MC3D: Motion Contrast 3D Scanning,” in 2015 IEEE International Conference on Computational Photography (ICCP), (IEEE, Houston, TX, USA, 2015), pp. 1–10.

13. J. N. P. Martel, J. Muller, J. Conradt, and Y. Sandamirskaya, “An Active Approach to Solving the Stereo Matching Problem using Event-Based Sensors,” in 2018 IEEE International Symposium on Circuits and Systems (ISCAS), (IEEE, Florence, 2018), pp. 1–5.

14. A. R. Mangalore, C. S. Seelamantula, and C. S. Thakur, “Neuromorphic Fringe Projection Profilometry,” IEEE Signal Process. Lett. 27, 1510–1514 (2020). [CrossRef]  

15. V. Srinivasan, H. Liu, and M. Halioua, “Automated phase-measuring profilometry of 3-d diffuse objects.,” Appl. Opt. 23(18), 3105 (1984). [CrossRef]  

16. T. Leroux, S. Ieng, and R. Benosman, “Event-based structured light for depth reconstruction using frequency tagged light patterns,” CoRR abs/1811.10771 (2018).

17. J. Smisek, M. Jancosek, and T. Pajdla, “3D with Kinect,” in 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops), (2011), pp. 1154–1160.

18. Y. Zhang, Z. Xiong, Z. Yang, and F. Wu, “Real-Time Scalable Depth Sensing With Hybrid Structured Light Illumination,” IEEE Trans. on Image Process. 23(1), 97–109 (2014). [CrossRef]  

19. K. A. Boahen, “A burst-mode word-serial address-event link-i: Transmitter design,” IEEE Trans. Circuits Syst. I 51(7), 1269–1280 (2004). [CrossRef]  

20. M. Guo, J. Huang, and S. Chen, “Live demonstration: A 768 × 640 pixels 200Meps dynamic vision sensor,” in 2017 IEEE International Symposium on Circuits and Systems (ISCAS), (IEEE, Baltimore, MD, USA, 2017), pp. 1.

21. Y. Gao, T. Cheng, Y. Su, X. Xu, Y. Zhang, and Q. Zhang, “High-efficiency and high-accuracy digital image correlation for three-dimensional measurement,” Opt. Lasers Eng. 65, 73–80 (2015). [CrossRef]  

22. P. Zhou, J. Zhu, and H. Jing, “Optical 3-d surface reconstruction with color binary speckle pattern encoding,” Opt. Express 26(3), 3452–3465 (2018). [CrossRef]  

23. W. Yin, J. Zhong, S. Feng, T. Tao, J. Han, L. Huang, Q. Chen, and C. Zuo, “Composite deep learning framework for absolute 3d shape measurement based on single fringe phase retrieval and speckle correlation,” Journal of Physics: Photonics 2, 045009 (2020). [CrossRef]  

24. W. Yin, Y. Hu, S. Feng, L. Huang, Q. Kemao, Q. Chen, and C. Zuo, “Single-shot 3d shape measurement using an end-to-end stereo matching network for speckle projection profilometry,” Opt. Express 29(9), 13388–13407 (2021). [CrossRef]  

25. J. Blaber, B. Adair, and A. Antoniou, “Ncorr: Open-Source 2D Digital Image Correlation Matlab Software,” Exp. Mech. 55(6), 1105–1122 (2015). [CrossRef]  

26. B. Pan, “Reliability-guided digital image correlation for image deformation measurement,” Appl. Opt. 48(8), 1535 (2009). [CrossRef]  

27. S. Chen and M. Guo, “Live Demonstration: CeleX-V: A 1M Pixel Multi-Mode Event-Based Sensor,” in 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), (IEEE, Long Beach, CA, USA, 2019), pp. 1682–1683.

28. E. R. Fossum, “CMOS image sensors: Electronic camera-on-a-chip,” SPIE milestone series 177, 63–72 (2003). [CrossRef]  

29. Z. Zhang, “A flexible new technique for camera calibration,” IEEE Trans. Pattern Anal. Machine Intell. 22(11), 1330–1334 (2000). [CrossRef]  

30. F. Bernardini, J. Mittleman, H. Rushmeier, C. T. Silva, and G. Taubin, “The ball-pivoting algorithm for surface reconstruction,” IEEE Trans. Visual. Comput. Graphics 5(4), 349–359 (1999). [CrossRef]  

Supplementary Material (1)

Visualization 1: 3D scanning results for tearing a piece of paper at different timestamps. We present these results as colored meshes to better unfold their 3D textures.



