
Fast 3D reconstruction via event-based structured light with spatio-temporal coding

Open Access

Abstract

Event-based structured light (SL) systems leverage bio-inspired event cameras, renowned for their low latency and high dynamic range, to drive progress in high-speed SL imaging. However, existing event-based SL methods concentrate on independently constructing either time-domain or space-domain features for stereo matching, ignoring the spatio-temporal consistency towards depth. In this work, we build an event-based SL system that consists of a laser point projector and an event camera, and we devise a spatio-temporal coding strategy that realizes depth encoding in dual domains through a single shot. To exploit the spatio-temporal synergy, we further present STEM, a novel Spatio-Temporal Enhanced Matching approach for 3D reconstruction. STEM comprises two parts, the spatio-temporal enhancing (STE) algorithm and the spatio-temporal matching (STM) algorithm. Specifically, STE integrates the dual-domain information to increase the saliency of the temporal coding, providing a more robust basis for matching. STM is a stereo matching algorithm explicitly tailored to the unique characteristics of the event data modality, which computes the disparity via a meticulously designed hybrid cost function. Experimental results demonstrate the superior performance of the proposed method, achieving a reconstruction rate of 16 fps and a low root mean square error of 0.56 mm at a distance of 0.72 m.

© 2023 Optica Publishing Group under the terms of the Optica Open Access Publishing Agreement

1. Introduction

Structured light (SL) is a fundamental technique in the field of depth sensing, which actively projects pre-designed patterns onto the scene to achieve depth encoding. The illuminated scene is then captured by a camera, and the depth information can be decoded by analyzing the correlation between the projected patterns and the captured images. SL is highly regarded for its exceptional reconstruction accuracy and completeness in close-range depth estimation, distinguishing it from passive and physics-based techniques [1–5]. Consequently, SL has been widely embraced in both consumer-grade products and industrial applications [6–8]. However, conventional SL systems often require a delicate balance between speed and resolution, depending on the specific application context [9,10]. This trade-off arises primarily from the limited bandwidth of both the camera and the projector: achieving a high-resolution and accurate depth map requires high-resolution sensors and multi-frame encoding methodologies, which places significant demands on the transmission bandwidth and curtails the acquisition rate of the depth map.

Recently, the integration of neuromorphic sensors, known as event cameras, into SL systems has emerged as a solution to this challenge. With their unique circuit design, event cameras differ from traditional cameras that rely on external clock synchronization. Each pixel of an event camera independently and asynchronously responds to variations in intensity with ultra-high temporal resolution (as fine as 1 microsecond) [11]. Event cameras produce ‘events’ as their output, a concise data representation providing the position, timestamp, and polarity of brightness changes. Compared to traditional cameras that capture full frames, event cameras selectively pinpoint regions with brightness fluctuations, discarding redundant temporal information. In addition, these cameras employ photoreceptors operating on a logarithmic scale, thereby achieving a broader dynamic range. These specialized designs confer upon event cameras the advantages of ultra-high speed, high dynamic range, and low power consumption [12,13]. Thus, event-based SL systems can inherit these advantages and possess high-speed and HDR characteristics.

Existing methods for event-based SL can mainly be divided into two categories depending on how events are triggered and how their timestamps are used: temporal-coding methods [14–16] and spatial-coding methods [17,18]. Temporal-coding methods rely on the timestamps to directly construct pixel-wise correspondences between the camera and the projector. On the contrary, spatial-coding methods only utilize the timestamps to extract spatial coding structures, such as lines and random speckles, for the subsequent stereo matching. As a precursor of the spatial-coding methods, Brandli et al. [17] proposed one of the first prototypes in this domain, where they applied a pulsed line laser to spike the dynamic vision sensor. They extracted the laser line with an adaptive filtering algorithm, allowing fast terrain reconstruction for robotics. Huang et al. [18] projected a 2D random speckle pattern with a digital light processing projector and used the timestamps to assist in extracting spatially encoded event frames, achieving a high sampling rate. However, the spatial-coding methods overlook the inherent capability of event timestamps for constructing exact pixel correspondences. As for the temporal-coding methods, Matsuda et al. [14] first combined a laser point projector with a 2D galvanometer to perform raster scanning for event triggering, providing detailed event timings and thus enabling the establishment of pixel-wise correspondences. Following [14], Muglikar et al. [16] proposed to match the events by optimizing an energy function designed to exploit the correlation in event distribution. Although temporal-coding methods offer simple yet fine-grained cues utilizing the timestamps, they are hindered by reduced accuracy due to timestamp noise and signal crosstalk. Taken together, both strategies fail to fully exploit the spatio-temporal consistency of event-based SL systems.

In this work, we build an event-based SL system, which consists of a laser point projector and an event camera. To encode both spatially and temporally, we project a random speckle pattern through the laser point projector onto scenes. In this way, the spatial coding is embodied in the distribution of random speckles, while the temporal coding is embedded within the timestamps of the events. Based on that, we propose a novel Spatio-Temporal Enhanced Matching approach, termed STEM, for high-accuracy reconstruction. STEM is composed of the spatio-temporal enhancing (STE) algorithm and the spatio-temporal matching (STM) algorithm. STE takes the original output of the event camera, referred to as the raw spatio-temporal event frame, as its input. It embeds the temporal coding into the spatial domain by mapping timestamp values to a grayscale space to generate a gradient-enhanced event frame. The mapping rule is discovered by an optimization method with the objective of maximizing the sum of square of subset intensity gradients (SSSIG) [19], a widely accepted metric for assessing the quality of random speckle images. The gradient-enhanced event frame is then combined with the speckle contours, forming the final spatio-temporal enhanced event frame for subsequent matching. STM takes the spatio-temporal enhanced and the raw event frames as input and generates the final disparity. It is built on a meticulously designed hybrid cost function, which exploits the mutually reinforcing effect between the two input event frames. On the built SL prototype, we carry out comprehensive experiments, showcasing the superiority of our proposed method over competing methods. Notably, we achieve 16.1 fps 3D reconstruction with an error of 0.56 mm in a scene 0.72 m away on an Intel Core i7-10700F CPU.

The rest of this paper is structured as follows: Section 2 introduces our proposed spatio-temporal coding method applied to event-based SL systems. Section 3 provides a detailed description of our imaging system, along with the proposed system calibration method. Section 4 presents the qualitative and quantitative experiments. Finally, section 5 summarizes the work and discusses potential avenues for future research in event-based SL.

2. Method

The event-based structured light imaging system is illustrated in Fig. 1, and the overall flow of our proposed spatio-temporal coding and STEM is depicted in Fig. 2. In operation, the random speckle pattern is projected by the laser point projector, and the structure-illuminated scene is captured by the event camera, generating the raw event stream. In the event frame generation stage, the event stream is transformed into the raw spatio-temporal event frame, which subsequently undergoes STE. Finally, the raw and the spatio-temporal enhanced event frames, along with their corresponding reference event frames, are fed to STM for depth estimation. In this section, we will successively introduce the spatio-temporal coding strategy, STE, and STM in detail.


Fig. 1. An illustration of the event-based SL system, which consists of a laser point projector and an event camera. The projector projects the random speckle pattern in a point raster scanning manner. Simultaneously, the event camera captures this scanning process and generates the corresponding event data recording the position, timestamp, and polarity of the trigger.



Fig. 2. The flowchart of the proposed spatio-temporal coding and the spatio-temporal enhanced matching (STEM).


2.1 Spatio-temporal coding

As the laser point projector initiates, its built-in 2D MEMS mirror swiftly oscillates and guides the laser light meticulously across the projection zone in a raster-scanning manner. The entire projected scene is illuminated point by point, following a right-to-left and then top-to-bottom order, as indicated in Fig. 1. This point scanning results in rapid alterations in illumination at individual points within the scene, which are readily detectable by the event camera. Ideally, each pixel of the camera receives the illumination change of one scene point; therefore, each point cast by the projector excites a corresponding pixel in the event camera and fires an event. Once the laser diode completes its full-frame scan, the event camera will have collected an event stream recording the whole scanning process. By placing the events at their corresponding spatial positions and setting the timestamps of the triggered events as their grayscale values, the event stream can be reorganized into a 2D event frame, which is a dense encoding of depth in the temporal domain. The depth of the scene can then be decoded by analyzing the correspondences between the captured event frame and the reference event frame, which constitutes the temporal coding method. The visualization of the event frame is shown in Fig. 3(a). Note that the gradient color represents the distribution of the timestamps.
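As a concrete illustration of this event-frame generation step, the following minimal Python sketch (not the authors' code) places each event at its pixel and stores its timestamp as the pixel value; the array names, the sensor size, and the choice of keeping the latest timestamp at multiply-fired pixels are assumptions made for illustration.

import numpy as np

def events_to_time_frame(xs, ys, ts, height=720, width=1280):
    # xs, ys, ts: 1D NumPy arrays of event x/y coordinates and timestamps.
    frame = np.zeros((height, width), dtype=np.float64)
    order = np.argsort(ts)                   # process events in temporal order
    frame[ys[order], xs[order]] = ts[order]  # later events overwrite earlier ones
    fired = frame > 0                        # pixels that fired at least once
    if fired.any():                          # normalize timestamps to [0, 1] for visualization
        t0, t1 = frame[fired].min(), frame[fired].max()
        frame[fired] = (frame[fired] - t0) / max(t1 - t0, 1e-9)
    return frame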


Fig. 3. Diagrams of captured event frames of three different coding strategies and subsequent workflow of the spatio-temporal enhancing algorithm. (a) The captured event frame of pure temporal coding. The gradient color only indicates the different timestamps. (b) The collected event frame of pure spatial coding. (c) The raw event frame of spatio-temporal coding. (d) The event frame with the temporal feature embedding. (e) The detected contour of the random speckles. (f) The final spatio-temporal enhanced event frame.


However, such an approach mainly considers how to trigger the events and only uses the timestamps for matching, without any substantial coding design in the spatial domain, often leading to unsatisfactory 3D reconstruction due to non-negligible timestamp noise. Hence, we introduce random speckle coding, which is commonly used in the digital image correlation field, to event-based SL, aiming to provide sufficient distinguishability for pattern deformation detection. Here we opt for a dot-based random speckle pattern as in [20], which consists of a white background and random black dots. In practice, the pattern can be directly relayed to the projector, which displays it in the form of point scanning. Events are triggered in the white background area, but not in the areas of the black dots where the brightness is almost constant. Considering only whether an event is triggered or not constitutes pure spatial coding. The captured spatially coded event frame is shown in Fig. 3(b). Since each triggered event also carries a timestamp recording when it fired during the scan, we incorporate this temporal coding into the spatial coding to form a spatio-temporal coding. By simply setting the grayscale values of events to their timestamps on the event frame, a compact representation of the spatio-temporal coding result can be derived, which is referred to as the raw spatio-temporal event frame and illustrated in Fig. 3(c).

Employing the above spatio-temporal coding presents several key advantages: (1) The enriched data from dual-domain encoding provide robust cues for subsequent matching. (2) The precision of the timestamps is largely affected by the data throughput rate of the event camera [21,22]. In the proposed coding method, the black speckle areas do not trigger events; compared with [16], this reduces the data throughput by nearly half and thus guarantees accurate event timing.

2.2 Spatio-temporal enhanced matching

2.2.1 Spatio-temporal enhancing

As shown in Fig. 3(c), the spatial and temporal information are both incorporated. However, as can be observed, spatially adjacent regions possess nearly identical timestamps due to the equidistant raster scanning of the laser point projector. Compared to the spatially coded event frame, the raw spatio-temporal event frame does not introduce significant features that would enhance local distinguishability. In this case, the limited prominence of the temporal coding causes its contribution to be largely overshadowed by the spatial coding, leading to a feature bias during matching. As a result, the precision gains attributed to this naive incorporation of temporal coding are quite limited. We notice that, once synchronized, mapping a given timestamp on both the captured and the reference event frames to any common value will not affect the correspondence of pixel-wise matching pairs or the subsequent decoding result. Thus, an intuitive way to enhance the saliency of the temporal coding is to map the timestamps of the raw spatio-temporal event frame to specific values following a specific rule, so that the newly transformed event frame has more local distinguishability.

We introduce a spatio-temporal enhancing (STE) algorithm, an optimization-based method that embeds the temporal features into the spatial domain. In view of the timestamp noise and the local consistency of its spatial distribution, we divide the event stream into specific time bins and then apply a consistent mapping rule within each time bin (i.e., the mapping rules of the events in a time bin are the same). To find the optimal mapping rule for each time bin, we introduce the sum of square of subset intensity gradients (SSSIG), a well-recognized quantitative measure for assessing the quality of random speckle patterns [19,23,24], as the optimization objective. Essentially, SSSIG quantifies the texture richness of a pattern. A texture-rich pattern provides more matching information and therefore improves the matching accuracy. We search for the optimal mapping rule for each time bin by maximizing the SSSIG, which is defined as:

$$\text{SSSIG} = \sum\left[\left(\frac{\partial I}{\partial x}\right)^2 + \left(\frac{\partial I}{\partial y}\right)^2\right],$$
where $\frac{\partial I}{\partial x}$ and $\frac{\partial I}{\partial y}$ represent the gradients of an image $I$ along the $x$ and $y$ directions, respectively, and the summation runs over all pixels. We divide the duration of a complete single-frame scan into $K$ equal-length time bins. The set of all time bins is denoted as $\mathcal{B} = \{B_1, B_2,\ldots, B_K\}$, where each time bin $B_k$ contains all timestamps falling in the $k^{\text{th}}$ bin. Without loss of generality, an event stream obtained through a single scan can be expressed as $\mathcal{E} = \{e_{k,i}\}_{k=1, i=1}^{K,N_k}$, where $N_k$ denotes the total number of events in the $k^{\text{th}}$ time bin. The $i^{\text{th}}$ event in the $k^{\text{th}}$ time bin, $e_{k,i} = (t_{k,i}, x_{k,i}, y_{k,i}, p_{k,i})$, is characterized by a timestamp $t_{k,i}$, spatial coordinates $(x_{k,i}, y_{k,i})$, and polarity $p_{k,i}$.
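For reference, Eq. (1) can be computed with a few lines of NumPy; the finite-difference approximation of the gradients via np.gradient is an implementation choice for this sketch, not a prescription from the paper.

import numpy as np

def sssig(image):
    # Approximate dI/dy and dI/dx with central finite differences and sum their squares.
    gy, gx = np.gradient(image.astype(np.float64))
    return np.sum(gx ** 2 + gy ** 2)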

The searching process of the best mapping rule can be formulated as the following optimization problem:

$$\begin{array}{ll} \underset{f_{B_k}}{\text{maximize}} &\text{SSSIG}(I) \\ \text{subject to} & 0 \leq I_{x_{k,i},y_{k,i}} = f_{B_k}(e_{k,i}) \leq 255, \; \forall e_{k,i} \in \mathcal{E}, \; \forall B_k \in \mathcal{B}, \end{array}$$
where $f_{B_k}$ is the mapping function for the $k^{\text{th}}$ time bin and $I$ is the event frame reconstructed after mapping. This optimization problem aims to maximize the SSSIG of the reconstructed event frame $I$ by finding the optimal mapping function $f_{B_k}$ for each time bin $B_k$. Taking efficiency and generality into account, we limit the target mapping space to an 8-bit grayscale space. Therefore, the function $f_{B_k}$ maps the timestamp of each event within the time bin $B_k$ to a grayscale value within the range of 0 to 255.

The discrete optimization problem is difficult to solve directly. Nonetheless, given the constraint that the target embedding domain (the spatial domain) is restricted to 0–255, we can reparameterize the mapping function $f_{B_k}$. Specifically, each $f_{B_k}$ can be written as $255 \times \alpha_k$, where $\alpha_k \in [0, 1]$ corresponds to the $k^{\text{th}}$ time bin. Since all the above computational processes are differentiable, we employ the SSSIG as our objective function and use gradient descent to optimize the parameters $\alpha_k$. After multiple experimental trials, we adopt the Adam optimizer [25] with an initial learning rate of $0.005$, which strikes a good balance between rapid convergence and convergence to an appropriate solution. The final mapped image is depicted in Fig. 3(d). The optimized mapping yields repetitive patterns whose values consist mostly of 0 and 255. This phenomenon can also be found in [23,26].
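A minimal PyTorch sketch of this optimization is given below, under stated assumptions: each time bin holds a scalar $\alpha_k$ kept in $[0, 1]$ via a sigmoid reparameterization (our choice, not necessarily the authors'), events are mapped to $255\,\alpha_k$, and Adam maximizes a finite-difference SSSIG of the resulting frame. Tensor names and the iteration count are illustrative.

import torch

def optimize_bin_mapping(xs, ys, bin_ids, height, width, n_bins, steps=500):
    # xs, ys, bin_ids: LongTensors with pixel coordinates and the time-bin index of each event.
    theta = torch.zeros(n_bins, requires_grad=True)    # unconstrained parameters
    opt = torch.optim.Adam([theta], lr=0.005)
    for _ in range(steps):
        opt.zero_grad()
        alpha = torch.sigmoid(theta)                   # keep alpha_k in [0, 1]
        values = 255.0 * alpha[bin_ids]                # f_{B_k}(e) = 255 * alpha_k
        frame = torch.zeros(height, width).index_put((ys, xs), values)
        gx = frame[:, 1:] - frame[:, :-1]              # finite-difference gradients
        gy = frame[1:, :] - frame[:-1, :]
        loss = -(gx.pow(2).sum() + gy.pow(2).sum())    # negative SSSIG (Adam minimizes)
        loss.backward()
        opt.step()
    return torch.sigmoid(theta).detach()               # optimized alpha_k per time bin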

However, the direct culling of entire segments of events, to a certain extent, destroys the geometric structures of the random speckles. Specifically, the gradient changes in the $x$ direction are partially removed, implying that the direct embedding causes a loss of spatial coding. To maintain the integrity of the geometries, we first reorganize the event stream into an event frame and extract the geometries of the random speckles through a contour detection algorithm [27]; the result is shown in Fig. 3(e). Subsequently, we integrate the outcome of the direct embedding with the contour map, finalizing the coding result, as illustrated in the pipeline of Fig. 3. At this point, we have completed the enhancement of the time-domain and space-domain features, embedding the time-domain features into the spatial domain.
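A minimal OpenCV sketch of this contour step follows; cv2.findContours implements the border-following algorithm of Suzuki and Abe [27], and the choice of overlaying the contours at a fixed gray level is an illustrative assumption rather than the paper's exact merge rule.

import cv2
import numpy as np

def enhance_with_contours(mapped_frame, binary_event_frame):
    # binary_event_frame: nonzero where events fired; OpenCV >= 4 returns (contours, hierarchy).
    contours, _ = cv2.findContours(binary_event_frame.astype(np.uint8),
                                   cv2.RETR_LIST, cv2.CHAIN_APPROX_NONE)
    enhanced = mapped_frame.astype(np.uint8).copy()
    cv2.drawContours(enhanced, contours, -1, color=128, thickness=1)  # overlay speckle contours
    return enhanced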

To sum up, our proposed STE has two distinct benefits: (1) Temporal information is embedded into the original spatial domain, resulting in overall richer textures in the spatial domain. The derived result overcomes the obstacle that event-based imaging systems cannot distinguish fine-grained details due to quantization error, and it also resolves the feature bias during matching that arises when the raw spatio-temporal event frame is used directly. (2) STE can be inserted into generic stereo matching algorithms such as BM and SGBM [28] as a plug-and-play block for enhancing spatio-temporal event frames, as sketched below.
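For example, the enhanced event frame and its enhanced reference frame can be handed directly to an off-the-shelf matcher; the following sketch uses OpenCV's StereoSGBM with illustrative parameters that are not the paper's settings.

import cv2
import numpy as np

def sgbm_on_enhanced_frames(left_enhanced, right_enhanced, num_disp=128, block=7):
    # Inputs: 8-bit spatio-temporal enhanced event frames forming a rectified pair.
    # (With a vertical baseline, the frames would be rectified/rotated so the search runs along rows.)
    matcher = cv2.StereoSGBM_create(minDisparity=0, numDisparities=num_disp,
                                    blockSize=block,
                                    P1=8 * block * block, P2=32 * block * block)
    disp = matcher.compute(left_enhanced, right_enhanced)
    return disp.astype(np.float32) / 16.0  # SGBM outputs fixed-point disparities scaled by 16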

2.2.2 Spatio-temporal matching

To date, there are few stereo matching algorithms tailored to the characteristics of event data, and most of the algorithms used in previous works are quite time-consuming, which hinders real-time 3D reconstruction. Here, we propose STM, a fast event-based stereo matching algorithm that fully exploits the spatio-temporal consistency of the events.

As mentioned above, STE effectively augments the matching textures. However, during the enhancement, in the pursuit of robustness, multiple timestamps are mapped to a single gray value, which inevitably sacrifices the granularity and continuity of the timestamps. To alleviate this, we propose to reintroduce the raw spatio-temporal event frame and combine it with the spatio-temporal enhanced event frame in our matching method. Comparing the two frames, we find that the density, numerical values, and scale of their grayscale values differ considerably. Thus, they play different roles in stereo matching: the enhanced frame provides texture-rich cues for global matching, while the raw frame contributes fine-grained temporal features that refine the local matching. Besides, on the raw spatio-temporal event frame, since the timestamps increase strictly according to the laser scanning order, large disparity mismatches can be reduced significantly. The two event frames are functionally complementary, and combining them brings extra gains. This mutually reinforcing effect is similar to the relationship between the absolute-difference features and the census features in AD-Census [29]. To capitalize on their synergy, we design a combined cost measure. The hybrid cost is shaped by both frames, leading to the deployment of two sliding windows along the epipolar line, as shown in Fig. 2.

Let $L_1$ be the spatio-temporal enhanced event frame, $L_2$ be the raw spatio-temporal event frame, and $R_1$ and $R_2$ be their corresponding reference event frames. Note that $(L_k, R_k)$ are rectified pairs. For any pixel located at $(x, y)$, the hybrid matching cost can be formulated as,

$$\begin{gathered} C_{hybrid} (x,y,d) = \frac{1}{(2w+1)^2} \sum_{i={-}w}^{w} \sum_{j={-}w}^{w} \Bigg( \sum_{k=1}^{2} \Bigg( \frac{L_k(x+i, y+j) - \mu_{B_{x,y}^{L_k}}}{\sigma_{B_{x,y}^{L_k}}} \\ - \frac{R_k(x+d+i, y+j) - \mu_{B_{x+d,y}^{R_k}}}{\sigma_{B_{x+d,y}^{R_k}}} \Bigg)^2 \Bigg), \end{gathered}$$
where $B_{x,y}^{L_k}$ and $B_{x+d,y}^{R_k}$ denote the blocks in $L_k$ and $R_k$, $d$ is the candidate horizontal disparity, and $\mu_{B_{x,y}^{L_k}}$, $\mu_{B_{x+d,y}^{R_k}}$ and $\sigma_{B_{x,y}^{L_k}}$, $\sigma_{B_{x+d,y}^{R_k}}$ are the mean values and standard deviations of $B_{x,y}^{L_k}$ and $B_{x+d,y}^{R_k}$, respectively. Note that the sum of squared differences (SSD) is calculated on each event frame separately, and zero-mean normalization is applied to eliminate the difference in numerical scale between the two frames. By aggregating the costs from the two windows, the ultimate hybrid cost is formed. The resulting disparity search is thus jointly determined by both frames, effectively harnessing the strengths of each.
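A minimal NumPy sketch of Eq. (3) is shown below; it evaluates the zero-normalized SSD over both frame pairs for one pixel and one candidate disparity. Border handling is omitted for brevity, and the small epsilon guarding against zero variance is an added assumption.

import numpy as np

def hybrid_cost(L1, L2, R1, R2, x, y, d, w=5):
    def znorm(block):
        # Zero-mean, unit-variance normalization of a (2w+1)x(2w+1) block.
        return (block - block.mean()) / (block.std() + 1e-9)
    cost = 0.0
    for L, R in ((L1, R1), (L2, R2)):                  # enhanced pair, then raw pair
        bl = znorm(L[y - w:y + w + 1, x - w:x + w + 1].astype(np.float64))
        br = znorm(R[y - w:y + w + 1, x + d - w:x + d + w + 1].astype(np.float64))
        cost += np.sum((bl - br) ** 2)                 # SSD between normalized blocks
    return cost / (2 * w + 1) ** 2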

To increase computing efficiency, we first perform a point matching based on the timestamps to derive an initial disparity $d_{init}$, and then perform matching in a small neighborhood around $d_{init}$, which effectively narrows down the disparity search range. The final disparity $d^*$ can be calculated as follows,

$$d^*(x, y) = \underset{d}{\mathrm{argmin}} \, C_{\textit{hybrid}} (x, y, d),$$
where $d^*(x, y) \in [d_{\text{init}}(x,y) - \Delta d, d_{\text{init}}(x,y) + \Delta d]$, with $\Delta d$ representing a fixed disparity adjustment away from the initial estimate $d_{\text{init}}$. For areas where no events are triggered, this restriction is not applied and the maximum search range is used.
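The restricted search of Eq. (4) can be sketched as follows; the cost function, the window $\Delta d$, and the maximum disparity are illustrative parameters rather than the paper's exact settings.

def best_disparity(cost_fn, x, y, d_init=None, delta_d=4, d_max=128):
    # cost_fn(x, y, d) evaluates the hybrid cost, e.g. hybrid_cost above with frames bound in.
    if d_init is None:                        # no events triggered: fall back to the full range
        candidates = range(0, d_max + 1)
    else:                                     # search only around the timestamp-based estimate
        candidates = range(max(0, d_init - delta_d), d_init + delta_d + 1)
    return min(candidates, key=lambda d: cost_fn(x, y, d))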

To achieve sub-pixel accuracy, we resort to quadratic interpolation, which carefully balances algorithmic complexity and accuracy. With additional engineering effort, we achieve real-time 3D reconstruction on an Intel Core i7-10700F CPU.
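For reference, the standard three-point quadratic (parabola) refinement can be sketched as follows; the guard against a non-convex cost triple is an added assumption.

def subpixel_disparity(d_star, c_minus, c_0, c_plus):
    # Fit a parabola through the costs at d*-1, d*, d*+1 and return its vertex.
    denom = c_minus - 2.0 * c_0 + c_plus
    if denom <= 0:                            # flat or non-convex cost curve: keep the integer d*
        return float(d_star)
    return d_star + 0.5 * (c_minus - c_plus) / denom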

3. Hardware

3.1 Setup

Our SL system comprises an event camera and a laser point projector, as shown in Fig. 4(a). We adopt a vertical layout, with the camera on top and the projector on the bottom. The vertical baseline is 12.4 cm.


Fig. 4. (a) Hardware setup of our event-based SL system. (b) The layout of the two checkerboards used for system calibration. (c) Captured event frame of the two checkerboards.


Laser Point Projector: The light source in our system is the Sony MP-CL1A laser projector [30]. This compact projector adopts laser-beam-scanning technology and provides a resolution of 1920 $\times$ 1080, a scanning frame rate of 60 Hz, a brightness of 32 lumens, and an FOV of 20 degrees.

Event Camera: The event camera is Prophesee EVK4 equipped with a Sony IMX636 sensor [31]. The camera provides a spatial resolution of 1280 $\times$ 720, a temporal resolution of 1 microsecond, and a dynamic range of above 120 dB.

3.2 Calibration

For the event camera calibration, unlike a traditional camera or an event camera with an APS mode (i.e., grayscale mode) [32] that can capture intensity images directly, our event camera operates solely in pure event mode, so accurate calibration remains challenging. In this study, we utilize a robust and efficient calibration method [33] and revise it to fit the event-based paradigm. First, we print a 6 $\times$ 9 checkerboard on a piece of A3 paper and affix it to a movable and tiltable whiteboard. In parallel, the projector projects a 5 $\times$ 8 checkerboard onto the whiteboard. The layout is shown in Fig. 4(b). To ensure that events are triggered correctly, the printed checkerboard should be placed in an area that does not overlap the projected checkerboard but is still illuminated by the projector. As shown in Fig. 4(c), both patterns can be captured concurrently by carefully fine-tuning the sensitivity of the event camera. Finally, the intrinsic and extrinsic parameters of the event camera are calculated from the identified 6 $\times$ 9 checkerboard using Zhang's method [34]. The projector is modeled as an inverse camera, and its calibration is based on the identified 5 $\times$ 8 checkerboard. Given the known intrinsic and extrinsic parameters of the event camera, the intrinsic and extrinsic parameters of the projector can be easily calibrated, and the relative extrinsic parameters between the camera and the projector can also be solved. It is worth noting that only positive events are employed during this phase. In addition, we perform a vertical rectification of the captured event frame and the reference event frame, which can then be fed to generic stereo matching algorithms. With the above pipeline, a calibration error of less than 1 pixel is attained.
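The camera-side part of this pipeline can be sketched with OpenCV as below: checkerboard corners are detected on event frames accumulated from positive events, and Zhang's method is run via cv2.calibrateCamera. The board size is treated as the inner-corner count, and the square size and variable names are hypothetical assumptions; the projector (inverse-camera) step is omitted.

import cv2
import numpy as np

def calibrate_event_camera(event_frames, board=(6, 9), square_mm=25.0):
    # event_frames: list of 8-bit frames built from positive events showing the printed board.
    objp = np.zeros((board[0] * board[1], 3), np.float32)
    objp[:, :2] = np.mgrid[0:board[0], 0:board[1]].T.reshape(-1, 2) * square_mm
    obj_pts, img_pts = [], []
    for frame in event_frames:
        found, corners = cv2.findChessboardCorners(frame, board)
        if found:
            obj_pts.append(objp)
            img_pts.append(corners)
    # Zhang's method: recover intrinsics K, distortion, and per-view extrinsics.
    rms, K, dist, rvecs, tvecs = cv2.calibrateCamera(
        obj_pts, img_pts, event_frames[0].shape[::-1], None, None)
    return rms, K, dist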

4. Experiments

4.1 Baselines and evaluation metrics

We conduct experiments to validate the effectiveness of our proposed method and compare it with methods based on various coding strategies. For temporal coding, we implement ESL [16]. For spatial coding, we select BM and SGBM, two representative stereo matching algorithms, for depth recovery. Further, we build four additional baselines based on spatio-temporal coding. These four baselines extend BM and SGBM with distinct event frames as input: BM/SGBM-ST operates on the raw spatio-temporal event frame, and BM/SGBM-STE operates on the spatio-temporal enhanced event frame. All experiments are conducted on an Intel Core i7-10700F CPU.

For quantitative evaluation, we apply three metrics: the root mean square error (RMSE), the mean absolute error (MAE), and the normalized Pearson's correlation coefficient (NPCC). RMSE measures the Euclidean distance between estimates and ground truth, while MAE measures the absolute distance. NPCC is a statistical measure that quantifies the linear correlation, ranging from −1 (perfect negative correlation) to +1 (perfect positive correlation). The formulas of these metrics are:

$$\text{RMSE} = \sqrt{\frac{1}{N} \sum \left( D(i, j) - \hat{D}(i, j) \right)^2},$$
$$\text{MAE} = \frac{1}{N} \sum |D(i, j) - \hat{D}(i, j)|,$$
$$\text{NPCC} = \frac{\sum \left( D(i, j) - \overline{D} \right) \left( \hat{D}(i, j) - \overline{\hat{D}} \right)}{\sqrt{\sum \left( D(i, j) - \overline{D} \right)^2} \cdot \sqrt{\sum \left( \hat{D}(i, j) - \overline{\hat{D}} \right)^2}},$$
where $D(i, j)$ is the true depth value at pixel coordinates $(i, j)$ within the ROI, $\hat{D}(i, j)$ is the corresponding estimated depth value, and $\overline{D}$ and $\overline{\hat{D}}$ denote their mean values over the ROI. $N$ is the total number of pixels in the ROI of the image.
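A minimal NumPy sketch of Eqs. (5)–(7) is given below; the boolean ROI mask passed as input is an assumed interface, and the small epsilon in the NPCC denominator is an added safeguard.

import numpy as np

def depth_metrics(d_true, d_est, roi):
    # d_true, d_est: depth maps; roi: boolean mask selecting the evaluation region.
    D = d_true[roi].astype(np.float64)
    Dh = d_est[roi].astype(np.float64)
    rmse = np.sqrt(np.mean((D - Dh) ** 2))
    mae = np.mean(np.abs(D - Dh))
    npcc = np.sum((D - D.mean()) * (Dh - Dh.mean())) / (
        np.sqrt(np.sum((D - D.mean()) ** 2)) * np.sqrt(np.sum((Dh - Dh.mean()) ** 2)) + 1e-12)
    return rmse, mae, npcc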

4.2 Quantitative results

To quantitatively evaluate the 3D reconstruction accuracy, we perform a plane fitting and calculate the root mean square error (RMSE). The plane is the front surface of a plaster pyramid. In two rounds of experiments, the pyramid is located 0.72 m and 1.20 m away from the system, respectively. An inclined rectangular region of 80 mm $\times$ 60 mm is chosen for the plane fitting (shown in Fig. 5(a)). These two distances represent a relatively far (1.20 m) and a relatively close (0.72 m) range within the typical operational bounds of the system, allowing us to comprehensively evaluate the performance of both the system and the algorithm under standard working conditions. The chosen values are not rigid and allow for slight variations. A similar approach to quantitative evaluation has also been used in [33]. The quantitative results are shown in Table 1, and the error maps are shown in Fig. 5. For fairness, none of the compared methods applies post-processing or filtering.
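The evaluation can be sketched as a least-squares plane fit followed by an RMSE of the residuals; this is an assumed formulation of the plane-fitting step, with hypothetical input arrays holding the reconstructed points inside the selected region.

import numpy as np

def plane_fit_rmse(xs, ys, zs):
    # Fit z = a*x + b*y + c by least squares over the ROI points, then report residual RMSE.
    A = np.column_stack([xs, ys, np.ones_like(xs)])
    coeffs, *_ = np.linalg.lstsq(A, zs, rcond=None)
    residuals = zs - A @ coeffs
    return np.sqrt(np.mean(residuals ** 2))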


Fig. 5. Error maps of the reconstructed inclined plane fitting at 1.20 m with different algorithms.



Table 1. The quantitative results of errors of inclined plane reconstruction at different distances, and the time consumption of various algorithms. The column RMSE Far shows the RMSE of an inclined plane placed at a distance of 1.20 m, and RMSE Near shows the RMSE of an inclined plane placed at a distance of 0.72 m. The same rule applies for MAE and NPCC. All results are unfiltered. ‘T’ and ‘S’ denote ‘Temporal’ and ‘Spatial’, respectively.

As shown in Table 1 and Fig. 5, we can draw the following conclusions: (1) The temporal coding method ESL (Fig. 5(c)) presents results with considerable noise, leading to unsatisfactory reconstruction accuracy. In contrast, the spatial coding methods BM (Fig. 5(d)) and SGBM (Fig. 5(f)) showcase higher accuracy and efficiency than ESL. Among all methods, our proposed STEM emerges as the most potent, achieving the highest accuracy on all evaluation metrics. For example, STEM offers the lowest RMSE (i.e., 0.84 mm at 1.20 m and 0.56 mm at 0.72 m), which represents reductions of 80.2% and 84.3% compared to the temporal-coding baseline ESL, and improvements of approximately 45.5% and 37.1% compared to the spatial-coding baseline SGBM. A similar trend is shown in the results of the remaining metrics. (2) When the spatio-temporal coding strategy is applied to BM and SGBM, BM-ST reduces the RMSE by 8.8% and 18.9% and SGBM-ST by 2.2% and 12.3% in the near and far scenes compared to vanilla BM and SGBM, respectively. This result demonstrates the feasibility of the spatio-temporal coding strategy; however, due to the feature bias during matching, only limited improvement is achieved. Further, the RMSE of BM-STE is reduced by 40.1% and 30.1%, and that of SGBM-STE by 21.3% and 33.1%, in the near and far scenes compared to BM and SGBM, respectively. The overall trend holds when evaluated with the other metrics. This remarkable improvement substantially surpasses the results attained by BM-ST and SGBM-ST, which proves that STE effectively resolves the feature bias during matching and fully explores the potential of spatio-temporal coding. Crucially, the proposed STE accomplishes these enhancements without imposing much computational burden, making it an efficient and versatile solution. (3) In terms of speed, STEM operates at 16.1 fps, roughly a 900-fold speedup over ESL (0.018 fps). Besides, STEM maintains a competitive speed even when compared with commonly used fast stereo matching algorithms.

4.3 Qualitative results

4.3.1 Static scenes

For the qualitative results, we capture and reconstruct three scenes. As shown in the first row of Fig. 6, the scenes are composed of representative plaster models: a sphere, a prism and a cone, and an Agrippa statue. We assess and compare our proposed method with the baselines and illustrate the results in Fig. 6. As can be seen, ESL reconstructs the structures only at a coarse-grained level. BM and SGBM present better, but still cursory, results with several distinct mismatched areas. Overall, our proposed STEM achieves the best performance. It generates promising results with fine details and sharp boundaries while maintaining great continuity of the reconstructed scene; e.g., for the Agrippa statue, the depth variations in the eye sockets are clearly resolved, and even intricate structures such as the two ears are reconstructed completely and in detail. The encouraging results presented by our method demonstrate its superiority over existing solutions.


Fig. 6. 3D reconstruction results of plaster models with different algorithms. The models include a sphere, a large depth-of-field scene composed of a prism and a cone, and an Agrippa statue. The sizes of the frames are 290 $\times$ 377 pixels, 350 $\times$ 455 pixels, and 500 $\times$ 650 pixels, respectively.


Moreover, we can observe that BM-STE and SGBM-STE distinctly outperform BM and SGBM. For example, the error points in the reconstruction of the cone and of the nose of the Agrippa statue are completely resolved when the proposed STE is incorporated. Besides, applying STE yields better depth continuity thanks to the richer matching information. The resulting reconstructions exhibit a smoother appearance, regardless of whether the BM or SGBM algorithm is used. It is evident that the proposed STE with spatio-temporal coding stands out as a robust and universal strategy for 3D reconstruction algorithms.

4.3.2 Dynamic scenes

Since our SL system achieves 3D reconstruction with a single shot, it is suitable for reconstructing dynamic scenes. We use the system to film a box being rapidly waved by hand along two different directions and reconstruct this dynamic scene using the proposed spatio-temporal coding and STEM. Figure 7 shows the keyframes of the dynamic scenes, and the video result is available at Visualization 1. Owing to the high sampling rate of the system, the reconstruction is almost unaffected by the motion. The reconstruction outcomes provide compelling evidence for the superior depth resolution of our system; for instance, the depth distinction between the finger and the box is notably significant. Furthermore, as can be observed from the colormap visualization, the reconstructed frames exhibit commendable frame-to-frame consistency, maintaining this stability even in scenarios involving extensive motion.


Fig. 7. Reconstruction results of captured dynamic scenes. (a)-(c) The hand grips the box and oscillates it in the vertical direction. (d)-(f) The hand oscillates the box and moves it back and forth relative to the imaging system.


5. Conclusion and discussion

Conclusion: In this paper, we present an innovative 3D reconstruction method built on an event-based structured light system. Benefiting from the distinctive imaging principle of our system, we craft a spatio-temporal coding approach that realizes simultaneous coding in the spatial and temporal domains via a single shot, enabling the derivation of richer matching information. To fully exploit the spatio-temporal consistency, we propose an enhanced matching algorithm, STEM, which comprises STE and STM. Based on the SSSIG theory, STE introduces an optimization-based way to integrate dual-domain information and mitigates the matching bias inherent in the raw data. STM performs fast disparity estimation with high accuracy based on a meticulously designed hybrid cost. Experimental results demonstrate that the proposed spatio-temporal paradigm surpasses existing methods by a large margin in terms of accuracy. Moreover, STE effectively enhances the raw spatio-temporal coding information and serves as an effective and versatile component for existing methods, while the proposed STM achieves real-time and robust performance in various test scenes. Compared to other works in event-based SL, this work achieves higher-accuracy 3D reconstruction without compromising the reconstruction speed and thus has strong practical value in various application scenarios.

Discussion: We have made new attempts at incorporating spatio-temporal methods into the field of event-based SL. The upper bounds of the sampling rate and accuracy of the proposed method are predominantly constrained by hardware limitations, which are three-fold: the scanning rate of the projector, the timestamping accuracy of the event camera, and the bandwidth of the event camera. The performance of the proposed method can be further improved as the hardware evolves. We also find that the proposed method inherits some limitations of conventional random speckle structured light due to the use of random speckle coding in the spatial domain: in extreme scenarios where the speckles undergo significant deformation, only moderate reconstruction quality is attained. A potential enhancement could be the incorporation of deep learning methodologies, as exemplified in [35,36]. Finally, the proposed method cannot effectively handle certain complex imaging challenges, such as the reconstruction of reflective and transparent regions. Fortunately, some works have attempted to solve these problems [37,38], and introducing additional epipolar constraints or additional temporal coding on top of them will be considered in our future work.

Funding

National Natural Science Foundation of China (62131003).

Disclosures

The authors declare no conflicts of interest.

Data availability

Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request.

References

1. Z. Liu, Y. Liu, Y. Ke, et al., “Geometric phase doppler effect: when structured light meets rotating structured materials,” Opt. Express 25(10), 11564–11573 (2017). [CrossRef]  

2. Z. Cai, X. Liu, G. Pedrini, et al., “Accurate depth estimation in structured light fields,” Opt. Express 27(9), 13532–13546 (2019). [CrossRef]  

3. P. Zhou, J. Zhu, and H. Jing, “Optical 3-d surface reconstruction with color binary speckle pattern encoding,” Opt. Express 26(3), 3452–3465 (2018). [CrossRef]  

4. L. Wang, D. Lu, R. Qiu, et al., “3d reconstruction from structured-light profilometry with dual-path hybrid network,” EURASIP J. Adv. Signal Process. 2022(1), 14 (2022). [CrossRef]  

5. Y. Zhang, P. Xiong, and F. Wu, “Unambiguous 3d measurement from speckle-embedded fringe,” Appl. Opt. 52, 7797–7805 (2014). [CrossRef]  

6. J. Sell and P. O’Connor, “The xbox one system on a chip and kinect sensor,” IEEE Micro 34(2), 44–53 (2014). [CrossRef]  

7. C. Neupane, A. Koirala, Z. Wang, et al., “Evaluation of depth cameras for use in fruit localization and sizing: Finding a successor to kinect v2,” Agronomy 11(9), 1780 (2021). [CrossRef]  

8. B. Li, Z. Xu, F. Gao, et al., “3d reconstruction of high reflective welding surface based on binocular structured light stereo vision,” Machines 10(2), 159 (2022). [CrossRef]  

9. S. Zhang, “High-speed 3d shape measurement with structured light methods: A review,” Opt. Lasers Eng. 106, 119–131 (2018). [CrossRef]  

10. Z. Xiong, Y. Zhang, F. Wu, et al., “Computational depth sensing: toward high-performance commodity depth cameras,” IEEE Signal Process. Mag. 34, 55–68 (2022). [CrossRef]  

11. G. Gallego, T. Delbrück, G. Orchard, et al., “Event-based vision: A survey,” IEEE Trans. Pattern Anal. Mach. Intell. 44(1), 154–180 (2022). [CrossRef]  

12. P. Lichtsteiner, C. Posch, and T. Delbruck, “A 128 × 128 120 db 15μs latency asynchronous temporal contrast vision sensor,” IEEE J. Solid-State Circuits 43(2), 566–576 (2008). [CrossRef]  

13. Y. Suh, S. Choi, M. Ito, et al., “A 1280× 960 dynamic vision sensor with a 4.95-μm pixel pitch and motion artifact minimization,” in IEEE International Symposium on Circuits and Systems (ISCAS), (IEEE, 2020), pp. 1–5.

14. N. Matsuda, O. Cossairt, and M. Gupta, “Mc3d: Motion contrast 3d scanning,” in IEEE International Conference on Computational Photography (ICCP), (IEEE, 2015), pp. 1–10.

15. J. N. Martel, J. Müller, J. Conradt, et al., “An active approach to solving the stereo matching problem using event-based sensors,” in IEEE International Symposium on Circuits and Systems (ISCAS), (IEEE, 2018), pp. 1–5.

16. M. Muglikar, G. Gallego, and D. Scaramuzza, “Esl: Event-based structured light,” in 2021 International Conference on 3D Vision (3DV), (IEEE, 2021), pp. 1165–1174.

17. C. Brandli, T. A. Mantel, M. Hutter, et al., “Adaptive pulsed laser line extraction for terrain reconstruction using a dynamic vision sensor,” Front. Neurosci. 7, 275 (2014). [CrossRef]  

18. X. Huang, Y. Zhang, and Z. Xiong, “High-speed structured light based 3d scanning using an event camera,” Opt. Express 29(22), 35864–35876 (2021). [CrossRef]  

19. B. Pan, H. Xie, Z. Wang, et al., “Study on subset size selection in digital image correlation for speckle patterns,” Opt. Express 16(10), 7037–7048 (2008). [CrossRef]  

20. Y. Zhang, Z. Xiong, Z. Yang, et al., “Real-time scalable depth sensing with hybrid structured light illumination,” IEEE Trans. on Image Process. 23(1), 97–109 (2014). [CrossRef]  

21. H. E. Ryu, “Industrial dvs design; key features and applications,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, vol. 3 (2019).

22. M. Muglikar, D. P. Moeys, and D. Scaramuzza, “Event guided depth sensing,” in 2021 International Conference on 3D Vision (3DV), (IEEE, 2021), pp. 385–393.

23. X. Xu, X. Ren, F. Zhong, et al., “Optimization of speckle pattern based on integer programming method,” Opt. Lasers Eng. 133, 106100 (2020). [CrossRef]  

24. Y. Su, Z. Gao, Z. Fang, et al., “Theoretical analysis on performance of digital speckle pattern: uniqueness, accuracy, precision, and spatial resolution,” Opt. Express 27(16), 22439–22474 (2019). [CrossRef]  

25. D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv, arXiv:1412.6980 (2014). [CrossRef]  

26. X. Xu, Y. Lin, H. Zhou, et al., “A unified spatial-angular structured light for single-view acquisition of shape and reflectance,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (2023), pp. 206–215.

27. S. Suzuki and K. Abe, “Topological structural analysis of digitized binary images by border following,” Comput. Vision, Graph. Image Process. 30(1), 32–46 (1985). [CrossRef]  

28. H. Hirschmuller, “Accurate and efficient stereo processing by semi-global matching and mutual information,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, vol. 2 (IEEE, 2005), pp. 807–814.

29. X. Mei, X. Sun, M. Zhou, et al., “On building an accurate stereo matching system on graphics hardware,” in IEEE International Conference on Computer Vision Workshops (ICCV Workshops), (IEEE, 2011), pp. 467–474.

30. Sony, “Sony mobile point projector mpcl1a,” https://www.sony.com/electronics/support/televisions-projectors-projectors/mp-cl1a/specifications (2023).

31. Prophesee, “Event camera evaluation kit 4 hd imx636,” https://www.prophesee.ai/event-camera-evk4/ (2023).

32. S. Chen and M. Guo, “Live demonstration: Celex-v: A 1m pixel multi-mode event-based sensor,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), (IEEE, 2019), pp. 1682–1683.

33. G. Falcao, N. Hurtos, and J. Massich, “Plane-based calibration of a projector-camera system,” VIBOT Master 9, 1–12 (2008).

34. Z. Zhang, “A flexible new technique for camera calibration,” IEEE Trans. Pattern Anal. Mach. Intell. 22(11), 1330–1334 (2000). [CrossRef]  

35. W. Yin, Y. Hu, S. Feng, et al., “Single-shot 3d shape measurement using an end-to-end stereo matching network for speckle projection profilometry,” Opt. Express 29(9), 13388–13407 (2021). [CrossRef]  

36. Y. Li, J. Peng, Y. Zhang, et al., “Self-distilled depth from single-shot structured light with intensity reconstruction,” IEEE Transactions on Computational Imaging 9, 678 (2023). [CrossRef]  

37. X. Liu, J. D. Rego, S. Jayasuriya, et al., “Event-based dual photography for transparent scene reconstruction,” Opt. Lett. 48(5), 1304–1307 (2023). [CrossRef]  

38. X. Yang, Q. Liao, X. Hu, et al., “Sepi-3d: soft epipolar 3d shape measurement with an event camera for multipath elimination,” Opt. Express 31(8), 13328–13341 (2023). [CrossRef]  

Supplementary Material (1)

Visualization 1: video of a dynamic scene.
