
Time-of-flight camera characterization with functional modeling for synthetic scene generation

Open Access

Abstract

In this manuscript, we design, describe, and present a functional model of Time-of-Flight (ToF) cameras. The model can be used to generate randomized scenes containing various objects at different depths, orientations, and illumination intensities. In addition to generating arbitrary random depth scenarios, the camera, pixels, and binning are modelled incorporating radial distortion based on the camera intrinsics and extrinsics. The model also includes ToF artifacts such as signal noise, crosstalk, and multipath. We measured the Time-of-Flight noise experimentally; we fitted the crosstalk effect experimentally and simulated it with a state-of-the-art simulator, and characterized multipath according to the existing literature. Our work can be used to generate as many images as needed for neural network (NN) training and testing. The proposed approach can also be used to benchmark and evaluate both end-to-end ToF algorithms as well as specialized algorithms for denoising, unwrapping, crosstalk, and multipath correction.

© 2021 Optical Society of America under the terms of the OSA Open Access Publishing Agreement

1. Introduction

Time of Flight is an active imaging technique that provides the distance between the camera and an object. There are two main kinds of Time-of-Flight (ToF) techniques: (i) direct ToF, such as LIDAR (Light Detection and Ranging) or LADAR (Laser imaging, Detection and Ranging), derived from the so-called laser ranging technique [1] and based on counting the time that a pulse of light travels after being scattered by a surface, and (ii) indirect time of flight (i-ToF), such as continuous-wave time of flight (CW-ToF), where a scene is illuminated by amplitude-modulated light to measure the time taken for its reflection to be detected by the camera [2,3]. While LIDAR/LADAR typically relies on moving parts (a scanner), iToF cameras can acquire depth and amplitude images at once based on CMOS/CCD technology with in-chip processing [3–6].

Elaborated signal models can be found in the literature [7,8]; conceptually, the phase shift $\mathrm{\Delta }\phi $ between the transmitted modulation envelope and the reflection from the scene encodes the distance d that the light has traveled.

$$\mathrm{\Delta }\phi = \frac{{4\pi d\nu }}{c}$$
where c is the speed of light, and $\nu $ the modulation frequency of the light. To extract the phase shift, the back-reflected signal $s(t )$ (Eq. (2)) is cross-correlated with the reference signal $r(t )$ of the same modulation frequency $\nu $. This method resembles the concept of phase-shifting interferometry (PSI) [9], where a coherent light beam interferes with a reference beam at several phase shifts, known as buckets, to retrieve the phase. Assuming that the reflected and reference signals can be expressed as sinusoids, we represent them in Eq. (2),
$$\begin{aligned} s(t) &= a\,\sin({\omega t + \mathrm{\Delta}\phi}) + b\\ r(t) &= \sin({\omega t}) \end{aligned}$$
where a is the amplitude of the backscattered light collected by the camera, $\omega $ is the angular modulation frequency ($2\pi \nu )$, $\mathrm{\Delta }\phi $ is the phase offset due to the distance traveled, and b is the background light. The correlation signal $h(\tau )$ is measured at each pixel, where τ is the phase of the reference waveform, as presented in Eq. (3).
$$h(\tau) = s(t) \otimes r(t) = \lim_{T \to \infty} \frac{1}{T}\int_{-T/2}^{T/2} s(t) \cdot r({t + \tau})\, dt$$

The evaluation of the integral in Eq. (3) yields

$$h(\tau )= \; \frac{a}{2}cos({\omega \tau + \; \mathrm{\Delta }\phi } )$$

Multiple measurements of the correlation signal are taken by phase shifting the reference signal, so that the phase can be demodulated via the Discrete Fourier Transform (DFT). Typically, 4 measurements (buckets) are collected, each separated by 90 degrees.

$$h({{\tau_i}} )= \; \frac{a}{2}cos({\omega {\tau_i} + \; \mathrm{\Delta }\phi } )+ {b_i}$$

With these 4 measurements of the cross correlation function the phase shift $\mathrm{\Delta }\phi $ and the amplitude a of the signal can be determined using Eq. (6), given below.

$$\left\{ {\begin{array}{c} {\Delta \phi = atan\left[ {\frac{{({h({{\tau_2}} )- h({{\tau_4}} )} )}}{{h({{\tau_1}} )- h({{\tau_3}} )}}} \right]}\\ {a = \; \frac{1}{2}\sqrt {[{h({{\tau_1}} )- h({{\tau_3}} )} ]^2 + [{h({{\tau_2}} )- h({{\tau_4}} )} ]^2} } \end{array}} \right.\; $$
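As a minimal sketch of Eq. (6) (with hypothetical NumPy arrays h1..h4 holding the four bucket measurements per pixel, and using arctan2 to recover the full phase range rather than the restricted arctan), the per-pixel phase and amplitude could be computed as:

```python
import numpy as np

def phase_amplitude_from_buckets(h1, h2, h3, h4):
    """Recover phase and amplitude from four 90-degree-spaced correlation
    samples (buckets), following Eq. (6). Inputs are per-pixel arrays; the
    names h1..h4 are illustrative placeholders."""
    num = h2 - h4                       # quadrature (sine-like) component
    den = h1 - h3                       # in-phase (cosine-like) component
    # arctan2 resolves the full 2*pi range instead of the +-pi/2 of arctan
    phase = np.mod(np.arctan2(num, den), 2.0 * np.pi)
    amplitude = 0.5 * np.sqrt(den**2 + num**2)
    return phase, amplitude
```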

The first fundamental limitation of CW-ToF sensors comes from the estimated phase shift: as stated in Eq. (6), it is obtained from an arctangent function, whose range spans $[{ - \pi /2\; ,\; \pi /2\; } ]$, or, in the case of atan2, $[{0,\; 2\pi } ]$. This means that the measurable, or unambiguity, range is ${\textrm{d}_{ur}}$, provided by Eq. (7).

$${\textrm{d}_{ur}} = \frac{c}{{2\nu }}\; $$

While Eq. (7) imposes that the frequency needs to be small to measure long distances, an estimation of the depth measurement variance ${\sigma ^2}$, which can be approximated by Eq. (8), indicates that the lower the frequency the larger the variance.

$${\sigma ^2} = \; \frac{c}{{4\sqrt 2 \pi \nu }}\frac{{\sqrt {a + \bar{h}} }}{{m\cdot a}}\; $$
Here $\bar{h}$ is the offset, defined as the average of the correlation signals, and m is the modulation contrast, a dimensionless quantity that describes the efficiency of the pixel in collecting photoelectrons at a given modulation frequency. The apparent contradiction between Eqs. (7) and (8) is solved with the introduction of a multi-frequency system. Each modulation frequency has a different ambiguity distance, but the true location is the one where the different frequencies agree, as Eq. (1) states. The frequency at which the modulations agree, called the beat frequency, is usually lower and corresponds to a much longer ambiguity distance.
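As an illustrative sketch of the multi-frequency idea (the frequency values below are placeholders, not the camera's actual modulation frequencies), the per-frequency unambiguity range of Eq. (7) and the extended range at the beat frequency could be computed as:

```python
import numpy as np

C = 299_792_458.0  # speed of light, m/s

def unambiguity_range(freq_hz):
    """Eq. (7): unambiguous distance for a single modulation frequency."""
    return C / (2.0 * freq_hz)

# Hypothetical multi-frequency set, assumed to be integer MHz values.
freqs_mhz = np.array([189, 180, 171], dtype=np.int64)
beat_mhz = np.gcd.reduce(freqs_mhz)              # greatest common divisor = beat frequency

per_freq_ranges = [unambiguity_range(f * 1e6) for f in freqs_mhz]
combined_range = unambiguity_range(beat_mhz * 1e6)   # much longer ambiguity distance
```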

The variance in the recovery of the phase is affected in a quadratic fashion by the signal detected by the CMOS/CCD sensor [7]. The collected signal is affected by the sensor noise sources. Some of the most common noise sources for CMOS [10,11] are shot noise, dark current, and read noise. Shot noise is due to the detection process of the photons coming from 2 different sources (the ambient light and the active illumination), and is the dominant noise source at medium to high signal-to-noise ratios. Dark current arises from thermal energy within the silicon lattice comprising the CMOS: electrons are created over time, independently of the light falling on the detector, and are captured by the camera potential wells and counted as signal. Read noise is produced when the gate of the sensor is activated and the accumulated signal is read by the Analog-to-Digital Converter (ADC). These common noise sources, when applied to indirect ToF, have been modeled as Poisson and Gaussian processes [12,13].
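A toy sketch of such a Poisson-plus-Gaussian pixel model is given below; the dark-current and read-noise values are illustrative assumptions, not measured parameters of the sensor:

```python
import numpy as np

rng = np.random.default_rng(0)

def sensor_noise(mean_electrons, dark_current_e=5.0, read_noise_e=3.0):
    """Toy CMOS noise model: Poisson shot noise on the collected photoelectrons
    plus dark-current electrons, followed by Gaussian read noise at the ADC.
    The numeric defaults are illustrative placeholders."""
    shot = rng.poisson(mean_electrons + dark_current_e).astype(float)
    read = rng.normal(0.0, read_noise_e, size=np.shape(mean_electrons))
    return shot + read
```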

Crosstalk in CMOS pixels [14–16] can be caused by two main mechanisms: optical and electrical (diffusion). This phenomenon causes image blur and degrades the SNR (Signal to Noise Ratio); moreover, like PRNU (Photo Response Non-Uniformity), crosstalk is becoming an increasingly significant problem given the trend toward smaller pixel dimensions. Optical crosstalk comprises several effects. The first is spatial crosstalk due to the optics, which can be characterized by the Point Spread Function. The second is crosstalk due to the micro-lenses: as the pixels go from the center to the edge of an array, the chief rays are increasingly tilted with respect to the main imaging lens, and to compensate for this progressive tilt the micro-lenses must be gradually shifted off the pixel center as the pixel goes from the array’s optical axis toward the detector’s edge. A final source of optical crosstalk is due to the dielectric structures that separate the pixels, producing an internal reflection at the dielectric-film and air-gap interface that concentrates the incident light in the selected pixel. Electrical crosstalk arises due to a lateral drift/diffusion of non-equilibrium carriers created by light from the illuminated photodiode pixel to its neighbors.

Multipath can be modeled as a mixed combination of signals registered by the pixel, when the directionality of the active illumination cone together with an adequate geometry produces a superposition of signals that come from different sources and travel a similar path. Models and corrections have been proposed elsewhere [17–22]. This effect is easily perceived in iToF since, at the signal amplitude level, it affects adjacent pixels by producing a modulation of the received amplitude; this is also why multipath is referred to as multipath ‘interference’. At the depth level, provided the multipath is weak, the most noticeable effect is a reduction of the accuracy, due to an increased bias with respect to the ground truth distance. When severe, however, multipath leads to flying pixels [23].

Numerous examples of ray tracers can be found in the literature and on the market nowadays [24,25]. Many of them use physically based rendering based on Monte Carlo approaches [24]. Typically, this process takes several hours for a scene. In contrast, the objective of our work is to produce a reliable and fast simulation as a platform to test new and existing algorithms (both heuristic and classical) for i-ToF sensors. There do exist other datasets in the literature that describe 3D scenes and provide depth maps. Datasets such as the DIODE dataset [26], NYU Depth [27], and KITTI [28] provide depth maps for a variety of depth measurement cases. However, they are built for specific purposes such as depth estimation using neural networks, object detection in 3D, etc. Their focus is neither to provide accurate depth information nor to train denoising and other precision-sensitive applications. The Middlebury stereo dataset [29,30] has been used in the literature for Time-of-Flight images; though it provides depth maps, it is again not accurate in depth, although better than the others. In addition, the dataset has no crosstalk and does not simulate a real ToF scenario. A simulated light field dataset presented in [31] provides detailed simulated data; however, it has very limited simulated scenes and does not showcase realistic noise. In our work, in contrast, we model the noise using noise models available in the literature that describe it with Gaussian and Poisson distributions [32,33]. Thus, the above-described datasets can be difficult to use for training learning-based algorithms. A similar dataset is also found in the Kinect-Fusion dataset [34], but it is smaller and not as accurate. The above approaches do not showcase multipath, crosstalk, or any of the physical challenges in real ToF scenarios.

Using such datasets, there have been various approaches that attempt to solve various problems in ToF imaging. Reference [35] presents work on denoising using a wavelet-based method with noise approximation priors; [36] also showcases a similar denoising approach. With advances in artificial intelligence, there are approaches using various Convolutional Neural Networks (CNN) and their variants. Reference [37] introduces one of the first works that makes use of neural networks for end-to-end ToF processing, including phase unwrapping, denoising, and multipath. The dataset, however, is generated using the 3D modelling software Blender. Though visually realistic, this introduces a variation for the multipath and misses an opportunity to emulate the light reception in the pixel, and is thus unable to generate crosstalk and other ToF-camera-specific challenges. A similar format is followed by the FLAT dataset [38], which builds on the Blender-generated scene format by adding more images for various types of depths and models. This dataset is also used by the authors to build an end-to-end ToF pipeline.

Though the performance and applicability of such NN-based techniques in real-case scenarios can be debated, it is clear that there is a large need for datasets to train NN-based approaches. Not only the training of NNs, but also the benchmarking of algorithms, whether image driven or data driven, requires covering a great number of scenarios. Our proposed approach can not only simulate scenes similar to the above datasets, but also do so with higher accuracy in noise addition, multipath approximation, and crosstalk addition. The simulations and the simulated objects in our work can cover any arbitrary distance, different ambient light conditions, amplitudes, and orientations, making it a much stronger simulation model with the potential to create simulation scenes in a comprehensive fashion. We present quantitative results (whenever possible) about their performance and describe the individual steps in the following sections.

2. Methods

The objective of the simulation is to provide a random number of regular and irregular polygons located at different depths, with different orientations about the X, Y, and Z axes, with varying sizes, varying numbers of sides, and each with a different reflectivity (signal amplitude). The polygons are “floating” on a smooth irregular background added to a randomly oriented plane substrate. Table 1 shows all the parameters used to produce the simulated scene.


Table 1. Parameters in generating a scene. All parameters can either be randomized or controlled by the user.

A flowchart of the proposed simulator pipeline and its various stages is presented in Fig. 1. The initial 3D space is rendered based on randomized parameters generated by the randomizer (a random sampler from a normal distribution). Next, parameters for the background plane are added, to achieve a smooth tilted background plane that simulates walls and other surfaces. An irregular pattern is added to the background to further simulate distortion and other irregularities. Next, we gather camera-specific parameters such as the intrinsics and extrinsics. With the camera parameters, the scene illumination is simulated and ray tracing is performed to gather intersections and normals, which are then collected at the pixel detector. With the generated complex signals, where the amplitude and phase are interpreted as phasors, we employ the multiple-frequency scheme to increase range and accuracy. Realistic artifacts such as noise and other diffusion effects like crosstalk or optical blur are then introduced. The resultant signals and their unwrapped depth, including the ground-truth signals, are presented back to the user for training learning algorithms or evaluating other computer vision algorithms. Figure 1 shows a block diagram of the general process.

Fig. 1. Block diagram of the general process described above: (1) Background: geometry and signal, (2) Polygon addition: regular and irregular polygons in random orientations, (3) Triangulation and ray tracing, and (4) Realistic noise addition, based on experimental noise.

2.1 Simulated background

The first step in the simulator is to generate the 3D space of interest and the substrate plane. This plane is placed at a random distance within the unambiguity range of the selected frequencies, calculated from the beat frequency, i.e., the greatest common divisor of the multiple modulation frequencies. The background plane is also assigned a random amplitude a between 0 and the saturation level. Using the intrinsics of the camera, the intersection points between the rotated plane and the pixels furthest from the optical center are found to bound the plane. The substrate plane is rendered with enough resolution to prevent out-of-bounds problems due to rotation, i.e., to account for the diagonal distance at 45 deg in all 3 axes. The resultant point cloud is randomly rotated using quaternions around the X and Y axes (with respect to the camera axes). This generates randomly oriented background surfaces, simulating walls and floors in ToF images.

However, it is rare for the background surface to be smooth; to simulate this, a cloud-like pattern is added using frequency-domain filtering techniques [39]. As Fig. 2 illustrates, the pattern is generated from random Gaussian noise (a), whose power spectrum, calculated by the Fourier transform (b), is filtered by a low-pass filter that falls off with the inverse distance from the center, before taking the inverse Fourier transform (c). The result, after scaling by a random number, is shown in (d), and the combination of the substrate and the Fourier cloud is presented in (e).
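A minimal sketch of this frequency-domain cloud generation (function and parameter names are illustrative; the exact filter roll-off used in the simulator is not specified here) could look as follows:

```python
import numpy as np

def fourier_cloud(rows, cols, rolloff=1.5, seed=0):
    """Sketch of the cloud-like background texture: white Gaussian noise is
    filtered in the Fourier domain with a weight that falls off with the
    distance from the DC term, and transformed back to the spatial domain.
    The 'rolloff' exponent is an illustrative parameter."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal((rows, cols))
    spectrum = np.fft.fftshift(np.fft.fft2(noise))
    u = np.fft.fftshift(np.fft.fftfreq(rows))          # cycles/pixel, DC at the center
    v = np.fft.fftshift(np.fft.fftfreq(cols))
    radius = np.hypot(*np.meshgrid(u, v, indexing="ij"))
    radius[radius == 0] = radius[radius > 0].min()     # avoid division by zero at DC
    filt = 1.0 / radius**rolloff                       # low-pass: inverse distance from center
    cloud = np.real(np.fft.ifft2(np.fft.ifftshift(spectrum * filt)))
    return cloud / np.abs(cloud).max()                 # normalized; scale by a random factor afterwards
```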

Fig. 2. The irregular pattern is generated from random Gaussian noise (a), whose power spectrum, calculated by the Fourier transform (b), is filtered by a low-pass filter that falls off with the inverse distance from the center, leading to the inverse Fourier transform shown in (c). The result is presented in 3D in (d) after scaling by a random number. This substrate is added to the rotated plane, creating the final background (e).

2.2 Simulating polygons

The next step creates a random number of regular and irregular polygons with a random number of sides, up to 20 (this is a user-defined range that can naturally be extended and is not a constraint). The polygons are generated at random locations of the 3D space, at random depths, and are added to the Fourier-cloud background substrate. The center of these polygons is denoted by ${\vec{R}_n}$. The only constraint on this randomness is to prevent polygons from exceeding the bounds of zero depth or the unambiguity range. During creation, the polygons are assigned randomly tilted ellipses (in angle and radii), in such a manner that none of the ellipses intersects the others. A random number then determines whether the polygon will be regular or irregular. If the polygon is irregular, the ellipse is split at different angles to produce the irregular polygon. The center of each polygon is associated with an amplitude, and the polygons are rotated by a random amount using quaternions. Figure 3 illustrates the process of adding the polygons to the substrate plane: (a) generation of the mix of regular and irregular polygons at random positions in the field of view of the camera, (b) polygons at random distances and orientations, (c) amplitude (AB) assigned to each polygon within the range between 0 and the saturation state.
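A minimal sketch of how one such polygon could be drawn from a randomized, tilted ellipse (all names and parameter values are illustrative) is shown below:

```python
import numpy as np

def irregular_polygon(center, radii, tilt, n_sides, seed=0):
    """Sketch of one randomized polygon: the bounding ellipse (semi-axes 'radii',
    rotated by 'tilt') is split at sorted random angles, and the resulting points
    become the polygon vertices. All parameter names are illustrative."""
    rng = np.random.default_rng(seed)
    angles = np.sort(rng.uniform(0.0, 2.0 * np.pi, n_sides))
    x = radii[0] * np.cos(angles)
    y = radii[1] * np.sin(angles)
    rot = np.array([[np.cos(tilt), -np.sin(tilt)],
                    [np.sin(tilt),  np.cos(tilt)]])
    verts = rot @ np.vstack([x, y])            # 2 x n_sides, ellipse-bound vertices
    return verts.T + np.asarray(center)        # translate to the polygon center

poly = irregular_polygon(center=(0.5, -0.2), radii=(0.4, 0.25), tilt=0.3, n_sides=7)
```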

Fig. 3. The process of simulating polygons, from the generation of polygons from randomized ellipses (a), to their addition to the background substrate (b), and the final step of adding amplitude as Active Brightness (AB), as described in (c).

2.3 Triangulation and ray tracing

The resultant point cloud after this process is described by triangles by means of the Delaunay decomposition [40]. Delaunay decomposition is a well-described mathematical method used in computational geometry [40,41]. It uses the property of the convex hull (envelope) of the discrete set of points $\{{X;Y} \}$, and the number of obtained triangles is related to the number of points in the grid and the number of vertices on the convex hull. If m is the number of points in the grid, and l is the number of vertices, then the number of triangles t is given by t = 2m − 2 − l. In a first step, a two-dimensional Delaunay decomposition is performed (Fig. 4), where each point from the discrete data set $\{{X;Y} \}$ is a vertex of at least two triangles. This results in a 3D approximation of the surface $Z\; = \; f({X,Y} )$ (Fig. 4(a)). The next step is finding the triangle that is hit by a ray, denoted by its initial position ${R_0}({i,j} )$ and direction cosines $\vec{K}({i,j} )$, given by the projection provided by the camera calibration [42]. The selected model was the rational 6KT distortion model, widely used in the computer vision community [42]. Instead of looking for all possible impacts in the planes (triangles) in the scene, the Fermat principle is applied and the point of the cloud at the least Euclidean distance from the ray is calculated (Fig. 4(b)). Depending on where this point is located, it yields a much reduced number of triangles to investigate. With the number of possible triangles significantly reduced, the intersection point of the ray with each of the triangles of interest is given by Eq. (9) [43].

$$R({i,j} )= {R_0}({i,j} )+ t\vec{K}({i,j} )\; $$

Fig. 4. a) Ray tracing of a depth scene; the color map indicates the depth, the camera position is given by ${R_0}({i,j} )$, the direction cosines going from the camera pixels towards the scene are given by $\vec{K}({i,j} )$, and the feasible impact point is given by $R({i,j} )$. b) Delaunay triangles and vertices closest to the ray direction $\vec{K}({i,j} )$; the impact point of the ray is highlighted by the red asterisk, together with the point in Euclidean space $R({i,j} )$.

To check whether the calculated points are inside the triangle, i.e., to determine if they are the adequate solution, the barycentric technique was used. In a nutshell, if the barycentric coordinates and their sum are in the range [0,1], then the calculated point belongs to the triangle. The corresponding $Z({i,j} )$ value is given by solving the plane equation at the $\{{X({i,j} ),Y({i,j} )} \}$ intersection coordinates. Because of the way the triangulation is done (which is done for speed), it is possible for a ray to be close to a point on the edge between a polygon and the substrate. The ray will still pass the barycentric coordinate check; however, the ray intersection actually occurs at a surface behind this edge. This can be identified by always selecting the triangle of intersection with the smallest area, since the triangles on the edges are always affected by the relatively large differences in Z, leading to a larger area. At this stage, it should be noted that, in cases where the polygon is sufficiently behind the substrate, the edge intersection should not be considered because the ray will in fact not hit a surface but an infinitely long background that acts as a sink; these cases are denoted by a ‘NaN’.
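A minimal sketch of the barycentric containment test, assuming the candidate point has already been obtained from the ray-plane intersection, could be:

```python
import numpy as np

def barycentric_inside(p, a, b, c, eps=1e-9):
    """Sketch of the barycentric containment test: express point p in the
    barycentric coordinates of triangle (a, b, c); p lies inside (or on the
    boundary) when all three coordinates are within [0, 1]."""
    v0, v1, v2 = b - a, c - a, p - a
    d00, d01, d11 = v0 @ v0, v0 @ v1, v1 @ v1
    d20, d21 = v2 @ v0, v2 @ v1
    denom = d00 * d11 - d01 * d01
    v = (d11 * d20 - d01 * d21) / denom
    w = (d00 * d21 - d01 * d20) / denom
    u = 1.0 - v - w                      # the coordinates always sum to one
    return (u >= -eps) and (v >= -eps) and (w >= -eps)
```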

Once the 3-D intersection coordinates $\vec{R}({i,j} )$ are known, the radial distance $d({i,j} )$ is calculated as the Euclidean distance from the origin of the camera to each intersection. Using Eq. (1), it is possible to calculate the phases for all the $k$ modulation frequencies, $\mathrm{\Delta }\phi ({i,j,k} )$, providing a ground-truth phase tensor. The corresponding amplitude of every point $({a({i,j} )} )$ should theoretically be the same for every frequency; however, the amplitude is affected by the difference in modulation contrast per frequency ($m(k )$). As previously introduced, the amplitude is specified for any arbitrary point in the scene (${a_n}$), i.e., the substrate plane and the polygons. The amplitude is modified accordingly using the well-known inverse-square distance law with respect to the center of the polygon $({{{\vec{R}}_n}} )$, assuming that the surfaces of the elements of the scene are Lambertian, see Eq. (10).

$$a(i,j,k) = a_n\, m(k)\, \frac{\|\vec{R}_n\|^2}{\|\vec{R}(i,j)\|^2}$$
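A minimal sketch of Eqs. (1) and (10) combined, assuming the per-pixel intersection points are available as a NumPy array with the camera at the origin (names are illustrative), could be:

```python
import numpy as np

C = 299_792_458.0  # speed of light, m/s

def ground_truth_phasors(R, R_n, a_n, mod_contrast, freqs_hz):
    """Sketch of Eqs. (1) and (10): from the per-pixel intersection points R
    (H x W x 3, camera at the origin), build the phase tensor for every
    modulation frequency and scale the assigned amplitude a_n with the
    inverse-square law relative to the polygon center R_n.
    mod_contrast holds the per-frequency modulation contrast m(k)."""
    d = np.linalg.norm(R, axis=-1)                                    # radial distance per pixel
    phases = np.stack([4.0 * np.pi * d * f / C for f in freqs_hz],    # Eq. (1), one slice per k
                      axis=-1)
    scale = np.linalg.norm(R_n)**2 / np.maximum(d, 1e-9)**2           # Eq. (10) distance term
    amps = a_n * scale[..., None] * np.asarray(mod_contrast)          # per-frequency amplitude
    return phases, amps
```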

2.4 Realistic noise simulation

As many CMOS sensors behave as quantum wells, a tunnel effect is exhibited during sensing, producing a diffusion of photoelectrons from high-signal regions to lower-signal regions that is most visible at the edges, see Fig. 5(c). Typically, CMOS sensors use the Shallow Trench Isolation (STI) technique to achieve electrical isolation between the pixels. Though STI provides satisfactory isolation close to the semiconductor surface (once an electron is captured under the gate of the photodiode it will most likely stay there), there is significant electrical crosstalk between the pixels. This crosstalk is caused because the conversion of photons into electron-hole pairs takes place at a certain depth in the epi layer, up to several micrometers under the surface of the silicon. At this depth, the electrical field strengths are low, and the chance that an electron is stored in an adjacent photodiode is therefore relatively high. This effect can be modeled as an isotropic diffusion over the number of photons collected by the detector and its neighbors. Figure 5(a) illustrates the crosstalk effect for 2 adjacent pixels under a charge gradient. The incoming light (green lines) is absorbed and converted into photoelectrons (green dots) by the epitaxial layer. The charge (photoelectrons) is accumulated and attracted to one of the poly gates, denoted by PGA and PGB respectively, by the electrical field induced by the P type. As the figure illustrates, some of the electrons can move laterally to adjacent pixels, producing the crosstalk effect. Assuming that the pixel under study (red) is represented in Fig. 5(b), the effect of the isotropic diffusion explained graphically in Fig. 5(a) means that the total amount of signal $h({{\tau_\psi }} )$ for a given bucket $\psi \; \in \; \{{1,2, \ldots ,\mathrm{\Psi }} \}$ can be modeled as a weighted sum of the contributions of the rest of the photoelectrons collected by the kernel, where $\xi ({m,n} )$ are normalized weight coefficients.

$$h({{\tau_\psi };i,j,k} )= \mathop \sum \limits_{m,n} \xi ({m,n} )h({{\tau_\psi };m,n,k} )\; $$

To evaluate the crosstalk experimentally, a single pixel should be exposed to light, and the resulting values for that pixel and its neighbors (which are shielded from light by a metallization layer) must be measured. The signal measured in each neighboring pixel can then be expressed as a percentage of the total signal, thus providing a measure of the crosstalk in each direction. Given these parameters, it is possible to simulate crosstalk as follows: first, by determining the amount of signal collected by the pixel of interest, followed by calculating the amount of signal remaining in the pixel and the signals in the adjacent pixels. The process should be repeated for each pixel in the input image to generate a corresponding output image with crosstalk. In our experiments, a similar procedure was tested using Sentaurus TCAD (Synopsys, Mountain View, CA), showing results that resemble a Gaussian blur. This is presented and analyzed in the Results section. The calculation was performed using the actual configuration of the pixel, performing ray tracing plus a Finite Element Method propagation in the semiconductor. A 3 × 3 pixel neighborhood was explored at the operating NIR wavelength of 850 nm. Starting from Eq. (6), the phasors can be rewritten as a complex signal $(S )$, where the difference of the even buckets yields one quadrature component of the signal (${S_r}$) and the difference of the odd buckets the other (${S_i}$), as introduced in Eq. (12).

$$\left\{ {\begin{array}{c} {{S_r} = \; h({{\tau_2}} )- h({{\tau_4}} )\; }\\ {{S_i} = \; h({{\tau_1}} )- h({{\tau_3}} )} \end{array}} \right.$$

Though inserting Eq. (11) into Eq. (12) would provide a more exact model, it is out of the scope of this work to solve it at that level of detail. Our proposal is an approximation of the crosstalk phenomenon as a weighted sum of plane waves. Thus, we propose the use of the coefficients $\xi $ coming from the Sentaurus TCAD simulation (3 × 3), and a Gaussian blur fit, to produce results comparable to the experimental scenes.

$$S({i,j,k} )\sim \; \mathop \sum \limits_{m,n} \xi ({m,n} )a({m,n,k} )exp[{i\mathrm{\Delta }\phi ({m,n,k} )} ]\; $$
$$with\; \left\{ {\begin{array}{c} {\xi ({m,n} )= exp\left\{ { - \frac{{{{({i - m} )}^2} + {{({j - n} )}^2}}}{{2{\sigma^2}}}} \right\}}\\ {\xi ({m,n} )= \; \left\{ {\begin{array}{ccc} {0.02}&{0.06}&{0.02}\\ {0.05}&{0.70}&{0.04}\\ {0.02}&{0.06}&{0.02} \end{array}} \right\}\; } \end{array}} \right.\; $$
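Under the assumption that the crosstalk can be applied as a fixed 3 × 3 kernel, a minimal sketch of Eq. (13) acting on the complex signal of one frequency could be:

```python
import numpy as np
from scipy.signal import convolve2d

# Assumed 3x3 crosstalk kernel: the TCAD-derived coefficients of Eq. (13).
xtalk = np.array([[0.02, 0.06, 0.02],
                  [0.05, 0.70, 0.04],
                  [0.02, 0.06, 0.02]])

def add_crosstalk(S):
    """Sketch of Eq. (13): apply the crosstalk weights to the complex signal S
    (H x W, one frequency) as a 2-D convolution; the real and imaginary parts
    diffuse with the same coefficients."""
    return (convolve2d(S.real, xtalk, mode="same", boundary="symm")
            + 1j * convolve2d(S.imag, xtalk, mode="same", boundary="symm"))
```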

The noisy signal was generated by adding experimentally derived, amplitude-dependent random Gaussian noise (Eq. (14)),

$$\left\{ {\begin{array}{c} {{{\tilde{S}}_r} = \; {S_r} + \; \Delta {{\tilde{S}}_r}(a )}\\ {{{\tilde{S}}_i} = {S_i} + \; \Delta {{\tilde{S}}_i}(a )} \end{array}} \right.\; $$
where $\Delta {\tilde{S}_{r,i}}(a )$ is the experimental signal dependent noise for the real and imaginary parts of the signal.

Fig. 5. a) The process of diffusion in a shallow-trench process, b) the diffusion due to a difference in signal, c) experimental result of a ToF sensor with shallow trench isolation and the gradient in signal observed at the edges (in this case the amplitude was used).

The experimental noise of the ToF camera (denoted $\Delta {\tilde{S}_{r,i}}(a )$ for the real and imaginary parts, and $\Delta \tilde{A}(a )$ for the error of the experimental amplitude) was characterized by collecting the complex signal at 3 frequencies from a flat wall at 12 distances ranging from 300 to 4500 mm and 3 ambient light levels: 0 Klux, 3 Klux (office-space ambient light), and 25 Klux (corresponding to the ambient light expected for a cloudy day), with 100 frames per distance and condition. The standard deviation of the imaginary and real parts of the signal, and the average of the amplitude, were evaluated over the 100 frames for the 3 frequencies that the iToF camera uses. Figure 6 shows the result for the real (a) and imaginary (b) parts of the complex signal.

Fig. 6. Precision (noise) of the signal as a function of the average signal. There are small differences between the frequencies due to the modulation contrast ($m$). a) precision of the real part of the signal versus the amplitude, b) precision of the imaginary part of the signal versus the amplitude.

For this work a Gaussian noise was assumed, $\Delta S({\bar{a}} )= N[{\mu ({\bar{a}} ),{\sigma^2}({\bar{a}} )} ]$, where $\mu $ denotes the average of the precision as a function of the average amplitude, while ${\sigma ^2}({\bar{a}} )$ is its variance as a function of the average amplitude. Even though fitting the average (AV) alone would require a simpler polynomial, this is not the case for the standard deviation (SD). To use the same equation for both, the average (AV) and standard deviation (SD) (both amplitude dependent) were fitted to a rational function with a 4th-degree polynomial in the numerator, with coefficients denoted by {${p_i}$ with $i = 0,1,\ldots,4$}, and a 3rd-degree polynomial in the denominator, denoted by {${q_j}$ with $j = 0, 1,$ and 2} (Eq. (15)). Therefore, instead of using the raw data, the average and standard deviation of the precision were calculated over intervals of the average amplitude, denoted by the plus and asterisk symbols in Fig. 7(a) and (b). The resultant points were regressed to the rational function using MATLAB (Mathworks, Boston, MA) with the options LAR (Least Absolute Residuals) for “robust” and “Levenberg-Marquardt” for the “algorithm”. Figure 7(c) shows the result, with the raw data as dots, the fitted average as a solid line, and the average ± the standard deviation as dashed lines. The fitting was performed for the real and imaginary parts of the signal, for every frequency and ambient light condition, as Tables 2 and 3 summarize. The result is illustrated in Fig. 7, where the data corresponding to the real part of the signal for the first frequency at 25 Klux is shown.

$$\{{\mu ({\bar{a}} ),\sigma ({\bar{a}} )} \}= \; \frac{{{p_4}{{\bar{a}}^4} + {p_3}{{\bar{a}}^3} + {p_2}{{\bar{a}}^2} + {p_1}\bar{a} + {p_0}}}{{{{\bar{a}}^3} + {q_2}{{\bar{a}}^2} + {q_1}\bar{a} + {q_0}}}\; $$
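The fitting reported here was performed in MATLAB; purely as an illustrative sketch of the same functional form, Eq. (15) could be fitted with SciPy as follows (array names are placeholders):

```python
import numpy as np
from scipy.optimize import curve_fit

def rational_4_3(a, p0, p1, p2, p3, p4, q0, q1, q2):
    """Eq. (15): 4th-degree numerator over a monic 3rd-degree denominator."""
    num = p4 * a**4 + p3 * a**3 + p2 * a**2 + p1 * a + p0
    den = a**3 + q2 * a**2 + q1 * a + q0
    return num / den

# Illustrative use: 'amp_bins' and 'noise_stat' would hold the binned average
# amplitude and the corresponding mean (or standard deviation) of the precision.
# coeffs, _ = curve_fit(rational_4_3, amp_bins, noise_stat, p0=np.ones(8), maxfev=20000)
```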

Fig. 7. Example of the process followed for the real and imaginary parts, frequencies, and ambient conditions. Specifically, the figure exemplifies the process followed for the real part of the signal of the first frequency at 25 Klux ambient light. a) The average (AV) of the precision over intervals of the average amplitude (plus signs) and the fitted average (solid line). b) Standard deviation of the precision (SD) over intervals of the average amplitude (asterisks) and the fitted line (dashed). c) The result with the raw data as dots, the fitted average as a solid line, and the average plus/minus the standard deviation as dashed lines.


Table 2. Results of the fitting for the real and imaginary mean.

The noise for the imaginary and real parts of the signal is generated from the given amplitude ($a$) by taking the sum of the mean $({\mu (a )} )$ and the scaled standard deviation ($\sigma (a )$) and multiplying it by a normally distributed random number of mean 0 and standard deviation 1.

$$\Delta S(a )= \; \left( {\mu (a )+ \; 2\sqrt 2 \sigma (a )} \right)\cdot N({0,1} )\; $$
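A minimal sketch of Eq. (16), assuming mu_fit and sigma_fit are callables wrapping the fitted rational functions of Eq. (15), could be (the noise would be drawn independently for the real and imaginary parts, as in Eq. (14)):

```python
import numpy as np

rng = np.random.default_rng(0)

def signal_noise(amp, mu_fit, sigma_fit):
    """Sketch of Eq. (16): draw the amplitude-dependent noise for one signal
    component. mu_fit and sigma_fit are assumed to be the fitted rational
    functions of Eq. (15), evaluated at the simulated amplitude 'amp'."""
    mu = mu_fit(amp)
    sigma = sigma_fit(amp)
    return (mu + 2.0 * np.sqrt(2.0) * sigma) * rng.standard_normal(np.shape(amp))
```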

Figure 8 shows the final wrapped phase of a simulation performed for a frequency of 189 MHz, using 10 polygons (9 visible in the figure) ranging from 3 m to 10 m, over a tilted cloudy background with an average distance of 7 m. The amplitude of the polygons and background was selected to range between 10 and 100 to enhance the visual effect of the noise. Adding noise is required to produce a realistic physical model of the iToF behavior, since noise is the main factor limiting system performance. Figure 8(a) shows the phase, in radians, of the simulation without added noise, and Fig. 8(b) the result with noise added for an ambient light of 25 kLux.

Fig. 8. Phase scene without noise (a) and with noise (b) as the output of the simulator. The colormap denotes a range up to 2π.

One of the main drawbacks of indirect Time-of-Flight imagers, and of practically all imagers that use flood active light, is the presence of multipath. In this work we use a model that accounts for it as the sum of 2 complex signals [17,18], as denoted in Eq. (17). Equation (17) separates the illumination into two components: the direct component (${S_d}$) and the multipath component $({S_m})$, which comes from another source, for instance, light back-reflected by another surface or subsurface scattering. The process is illustrated in Fig. 9, where the direct illumination provides the so-called direct component, while the light reflected by the floor produces the multipath.

$$S = {S_d} + \; {S_m} = \; {a_d}{e^{i{\phi _d}(k )}} + {a_m}{e^{i{\phi _m}(k )}}$$

Fig. 9. Multipath due to a floor bounce-back in both specular and diffuse scenarios.

Here, ${a_d}$ is the amplitude of the direct illumination, ${a_m}$ is the amplitude of the multipath, and ${\phi _d}$ and ${\phi _m}$ are the phases corresponding to the direct and multipath components, respectively. It is a condition derived from the Fermat principle that ${\phi _d}$ > ${\phi _m}$. Now, the resultant phase depends on the phase of not only the direct path but also the ‘multipath’ path, and on the relative amplitudes of both signals. The amplitude is no longer constant, but depends on the evolution of the direct and multipath phases as well as on the frequency, as described in Eq. (18).

$$\left\{ {\begin{array}{c} {\textrm{a}(k )= \sqrt {a_d^2 + a_m^2 + 2{a_d}{a_m}cos[{{\phi_d}(k )- {\phi_m}(k )} ]} \; }\\ {\phi (k )= \textrm{atan}\left[ {\frac{{{a_d}sin{\phi_d}(k )+ {a_m}sin{\phi_m}(k )}}{{{a_d}cos{\phi_d}(k )+ {a_m}cos{\phi_m}(k )}}} \right]\; } \end{array}\; } \right.$$
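A minimal sketch of Eqs. (17)–(18), summing the direct and multipath phasors and returning the amplitude and phase actually registered by the pixel, could be:

```python
import numpy as np

def combine_multipath(a_d, phi_d, a_m, phi_m):
    """Sketch of Eqs. (17)-(18): sum the direct and multipath phasors and
    return the resulting per-frequency amplitude and phase seen by the pixel."""
    s = a_d * np.exp(1j * phi_d) + a_m * np.exp(1j * phi_m)   # Eq. (17)
    return np.abs(s), np.mod(np.angle(s), 2.0 * np.pi)        # Eq. (18)
```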

In this work, our approach to approximating multipath was to assign a ‘multipath flag’ to a pre-determined number of polygons. Then, a random amplitude was assigned as the multipath amplitude for the point in the polygon. To calculate the multipath phase ${\phi _m}$, a random distance was assigned to the point in the polygon (with ${\phi _d}$ > ${\phi _m}$), and ray tracing was performed as described earlier. The amplitude assigned to every ray intersection was modified using the distance calculated from the ray tracing, following Eq. (10). Finally, the noise is added to the complex signal of Eq. (6) as follows.

$$S = {\tilde{S}_d} + \; {\tilde{S}_m} = \; ({{S_d} + \; \mathrm{\Delta }{S_d}} )+ \; ({{S_m} + \; \mathrm{\Delta }{S_m}} )\; $$

The final phasors depending on the diffusion, multipath and noise are described by Eq. (20).

$$\left\{ \begin{array}{l} {\tilde{\textrm a}}\left( k \right) = \sqrt {\tilde{a}_d^2 + \tilde{a}_m^2 + 2{{\tilde{a}}_d}{{\tilde{a}}_m}cos\left[ {{{\tilde{\phi }}_d}\left( k \right) - {{\tilde{\phi }}_m}\left( k \right)} \right]} \\ \tilde{\phi }\left( k \right) = \textrm{atan}\left[ {\frac{{{{\tilde{a}}_d}sin{{\tilde{\phi }}_d}\left( k \right) + {{\tilde{a}}_m}sin{{\tilde{\phi }}_m}\left( k \right)}}{{{{\tilde{a}}_d}cos{{\tilde{\phi }}_d}\left( k \right) + {{\tilde{a}}_m}cos{{\tilde{\phi }}_m}\left( k \right)}}} \right] \end{array} \right.$$

3. Results

The experiments were performed with a fourth-generation Microsoft iToF sensor, as found in the HoloLens 2, the Azure Kinect Development Kit, and other Microsoft-licensed 4th-generation iToF products.

3.1 Noise fitting

The experimental signal noise for the real and imaginary parts ($\Delta {\tilde{S}_{r,i}}(a )$), obtained as a result of the flat-wall experiment, was independently fitted using non-linear least squares to a rational function with a 4th-degree polynomial in the numerator, with coefficients denoted by {${p_i}$ with $i = 0,1,\ldots,4$}, and a 3rd-degree polynomial in the denominator, denoted by {${q_j}$ with $j = 0, 1,$ and 2} (Eq. (15)). The fitting used Least Absolute Residuals for robustness and Levenberg-Marquardt as the curve-fitting algorithm. The results are shown in the tables below, including the goodness of fit (${R^2}$). Table 2 provides the fitting for the average of the real and imaginary parts of the signal for the 3 frequencies in use and the 3 ambient conditions. The ${R^2}$ values for the average, ranging from 0.999 to 1.000, show a high goodness of fit for every frequency and ambient light, with the minimum value at 25 Klux. The coefficients tend to be uniform in magnitude, even over the broad signal range studied in the experiment.

Table 3 provides the fitting for the standard deviation of the signal precision of the real and imaginary parts, for the 3 frequencies in use and the 3 ambient conditions. The ${R^2}$ values, ranging from 0.980 to 0.997, indicate a good fit, with the minimum value at 25 Klux, which is expected due to the extra noise introduced by the ambient light. In terms of magnitude, the coefficients tend to be less stable than the results shown in Table 2, in line with the more variable nature of the standard deviation of the noise.


Table 3. Results of the fitting for the real and imaginary std. dev.

3.2 Crosstalk

Figure 10 shows the crosstalk (diffusion) obtained from Sentaurus TCAD (a) and the result of fitting it to a Gaussian (b). Even though the Sentaurus TCAD values are not completely radially symmetric, as compared to the Gaussian, there is a certain resemblance. The Gaussian fit provided a blurring sigma = 0.46. The results for both techniques show an inverse relationship between the contribution of the blur due to crosstalk and the radial distance. This is due to a combination of the path length and the electrical and physical barriers that a photoelectron needs to travel through to produce the crosstalk effect. The results of the Gaussian fitting are radially symmetric, and higher (a bigger crosstalk contribution) for the pixels closest to the pixel under study and lower for the diagonal pixels. However, the results obtained from Sentaurus TCAD are not radially symmetric; this is because Sentaurus TCAD takes the geometry of the pixel into account, from the microlenses to the size and location of the active area, the type of isolation, the position of the electronics, etc. This lack of symmetry in the geometry of the pixel array makes it easier or harder for a photoelectron to produce crosstalk in adjacent pixels, as can be seen in the lack of uniformity between the vertical and horizontal values.

Fig. 10. a) Result of the diffusion with Sentaurus TCAD and (b) its corresponding Gaussian fitting.

An experimental scene was captured in the lab with no ambient light, using the Microsoft ToF sensor in long-range configuration, with a clamp holding a sheet of paper over a uniform background; 100 frames were collected and averaged to create a ground truth. While the ground-truth depth values provided the geometry of the scene, i.e., the location and orientation of the background and foreground with respect to the camera, the average amplitude was used to add the noise by locating the minimum distance for foreground and background and assigning to the rest of the pixels the amplitude values in accordance with Eq. (10). The crosstalk (blur) was added before adding the noise, as described in Eq. (13), using the TCAD coefficients. Figure 11 shows a qualitative comparison of the crosstalk effect with respect to a simulated result: a) one frame of the experimental scene, b) the result of the simulation after adding the crosstalk and noise, c) a cross section of the experimental and simulated scenes. The results are comparable both in terms of noise amplitude and in the degradation of the edge.

Fig. 11. The top row provides the amplitude at one frequency obtained from an experimental scene a) and a simulated scene b). A cross section is represented in the bottom row c).

3.3 Precision comparison

Two static scenes have been used to measure the correlation between the precision histograms of experimental and simulated scenes. Based on the prediction provided by the equations in the Methods section, an area presumably free from multipath and diffusion was chosen to produce a fair comparison. The experimental precision was measured by means of the standard deviation over 100 frames, while the simulated precision was generated from the average amplitude of the 100 experimental frames, using Eq. (16) to produce 100 simulated frames and then calculating the precision by means of the standard deviation. Figure 12 illustrates the experimental and simulated precision of the real component at 189 MHz for the 2 scenes, together with the corresponding histograms for the real and imaginary components. The correlation coefficient was calculated between the experimental and simulated precision of the 2 scenes. For scene #1 the correlation coefficients were 0.940 for the real and 0.947 for the imaginary component, while for scene #2 the values were 0.998 and 0.998, respectively. Notice that for scene #1 (Fig. 12(c)) the correlation is lower due to a shift between the experimental and simulated data; this shift was produced by residual ambient light when the data were collected.

Fig. 12. Experimental and simulated precision of the real component at 189 MHz for 2 scenes. The left column (scene #1) shows a post holding a disk over a uniform background and the corresponding histograms for the real and imaginary components: a) experimental precision, c) precision of the simulated noise, e) histograms of the precision for experimental and simulated noise for the real and imaginary components of the signal. The right column (scene #2) shows a circular hole with a plane in the foreground: b) experimental precision, d) precision of the simulated noise, f) histograms of the precision for experimental and simulated noise for the real and imaginary components of the signal.

3.4 Qualitative comparison of multipath

Figure 13 shows a qualitative comparison between a real corner scene and a simulated corner, using the focal length and pixel pitch of the experimental ToF camera and similar amplitudes and distances for the panels placed at the corner. Noise, multipath, and crosstalk were added to the simulated corner scene. The noise was added by using the average amplitude of 100 experimental frames, taking the amplitude at the minimum distance with respect to the ToF camera as the reference for adding the distance-dependent amplitude in the simulated image (Eq. (10)). The multipath was added by calculating the geometrical distances of the panels that form the corner. The ratio of the multipath amplitude of the “aggressor” panel (front) to that of the “victim” was obtained experimentally by minimizing the amplitude error with respect to the average experimental scene. The result was analyzed in a qualitative fashion by presenting a cross section of the radial distance for the experimental scene (top-left of Fig. 13), while the top-right shows the direct-path and multipath radial depths as the dotted and solid lines, respectively. The bottom row shows the coefficient of variation, calculated as the standard deviation ($\sigma $) of the amplitude over the 3 frequencies divided by the average (Eq. (21)).

$$CoV({i,j} )= \; \frac{{\sigma ({AB({i,j,k} )} )}}{{\overline {AB} ({i,j,k} )}}$$
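A minimal sketch of Eq. (21), assuming the active brightness is stored as an H × W × K array over the K frequencies, could be:

```python
import numpy as np

def coefficient_of_variation(ab):
    """Sketch of Eq. (21): per-pixel coefficient of variation of the active
    brightness across the modulation frequencies. 'ab' has shape H x W x K."""
    return np.std(ab, axis=-1) / np.mean(ab, axis=-1)
```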

Fig. 13. Qualitative comparison between a real corner scene and a simulated corner, using the focal length and pixel pitch of the experimental ToF camera and similar amplitudes and distances for the panels placed at the corner. Noise, crosstalk, and multipath were added to the simulated corner scene. The result was analyzed in a qualitative fashion by presenting a cross section of the radial distance for the experimental scene a), while b) presents the direct-path and multipath radial depths as the dotted and solid lines, respectively. The bottom row shows the coefficient of variation, calculated as the standard deviation ($\sigma $) of the amplitude over the 3 frequencies divided by the average, for the experimental scene c) and for the simulated scene d).

4. Conclusion and future work

We have presented a geometrical methodology to create fully randomized scenes compatible with the features modelled from natural scenes of an iToF camera. A triangle-based ray tracing has been used in combination with the intrinsics provided by the optics. Cloud noise was added to provide texture to the substrate plane, and polygons were added to simulate various objects on the substrate. The randomness of the clouds and shapes alone can potentially describe a wide variety of scenes for an iToF sensor, thus providing a benchmark to test and evaluate algorithms under various device circumstances. Crosstalk was introduced to model the warp and diffusion observed in the experimental depth and amplitude data. Using a simulation program capable of simulating the electrical and optical characteristics of silicon-based semiconductors, the result was fitted to a Gaussian blur, leading to realistic performance. The Gaussian blur fitting can be considered a first-order approach when the geometry of the pixel is not known. The experimental noise of the iToF was characterized using a flat wall at different distances and ambient light scenarios. The resultant precision was fitted to a rational function of the average amplitude registered by the camera. The fitted model was added on top of the amplitude generated randomly by the computer. The noise model could be further improved by interpolating the values between the experimental ambient light levels. Multipath was added following a model previously published in the literature to create a more realistic signal. Even though Monte Carlo ray tracing methods could bring the multipath signal closer to that observed in specific conditions such as corners, in many cases the sources of multipath cannot be easily identified from either a geometrical or a radiometric point of view. Adding a random multipath signal to the simulator can help camera manufacturers find the combination of frequencies that minimizes the impact of multipath, or, from an algorithmic perspective, help develop algorithms that can mitigate the multipath signal. Thus, the functional model of the simulator fully covers the needs for generating primitive scenes with realistic noise, as well as their respective ground truths, which can be used as a benchmark for both traditional algorithms and statistical learning algorithms.

Funding

Microsoft Corporation.

Acknowledgments

The authors would like to acknowledge the contribution of Dr. Satya Nagaraja for his support with the simulations performed in Sentaurus TCAD, and the Microsoft Sensor Team for their support in collecting the experimental data and for the discussions.

Disclosures

SOE: Microsoft corporation (I,E), MAMS: (E,F), AC: Microsoft corporation (E). The authors declare no conflicts of interest.

Data availability

Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request. Data used in this paper can be reproduced with the raw information obtained from an Azure Kinect development Kit or other cameras that contain a Microsoft licensed Time of Flight sensor of IV generation.

References

1. J. R. Bernhard and Manfred, “Empfangsleistung in Abhängigkeit von der Zielentfernung bei optischen Kurzstrecken–Radargeräten,” Appl. Opt. 13(4), 931–936 (1974). [CrossRef]  

2. R. Lange, “3D time-of-flight distance measurement with custom solid-state image sensors in CMOS/CCD-technology,” Universität Siegen (2000).

3. R. Lange and P. Seitz, “Solid-state time-of-flight range camera,” IEEE J. Quantum Electron. 37(3), 390–397 (2001). [CrossRef]  

4. K. Zarychta, E. Tinet, L. Azizi, S. Avrillier, D. Ettori, and J. M. Tualle, “Time-resolved diffusing wave spectroscopy with a CCD camera,” Opt. Express 18(16), 16289–16301 (2010). [CrossRef]  

5. S. R. Klaus, H. H. Guenther, and X. Z. A. Hartmann, “New active 3D vision system based on rf-modulation interferometry of incoherent light,” in Society of Photo-Optical Instrumentation Engineers (SPIE) Conference Series, 1995.

6. C. Bamji, S. Mehta, B. Thompson, T. Elkhatib, S. Wurster, O. Akkaya, A. Payne, J. Godbaz, M. Fenton, V. Rajasekaran, L. Prather, S. Nagaraja, and V. Mogallapu, “1Mpixel 65 nm BSI 320 MHz demodulated TOF Image sensor with 3µm global shutter pixels and analog binning,” in IEEE International Solid - State Circuits Conference - (ISSCC), 2018.

7. M. Frank, M. Plaue, H. Rapp, U. Köthe, B. Jähne, and F. Hamprecht, “Theoretical and Experimental Error Analysis of Continuous-Wave Time-Of-Flight Range Cameras,” Opt. Eng. 48(1), 013602 (2009). [CrossRef]  

8. J. L. Dai, Y. Liu, M. Hullin, and Q. Dai, “Fourier Analysis on Transient Imaging with a Multifrequency Time-of-Flight Camera,” in IEEE Conference on Computer Vision and Pattern Recognition (IEEE), 3230–3237 (2014).

9. P. de Groot, “Phase Shifting Interferometry,” in Optical Measurement of Surface Topography, ed. (Springer, 2011).

10. A. Alexandre, P. Garda, G. Vasilescu, S. Feruglio, A. Pinna, C. Chay, O. Llopis, and B. Granado, “Noise Characterization of CMOS Image Sensors,” in 10th WSEAS CSCC, Athens, Greece, 2006.

11. R. Gow, D. Renshaw, K. Findlater, L. Grant, S. McLeod, J. Hart, and R. Nicol, “A Comprehensive Tool for Modeling CMOS Image-Sensor-Noise Performance,” IEEE Trans. Electron Devices 54(6), 1321–1329 (2007). [CrossRef]  

12. H. B. Wach and E. R. Dowski, Jr., “Noise modeling for design and simulation of computational imaging systems,” Proc. SPIE 5438, 159–170 (2004). [CrossRef]  

13. L. Grant, “Characterisation of Noise Sources in CMOS Image Sensors,” in Imager Design Forum, ISSCC 2005, (2005).

14. E. Fossum, “CMOS image sensors: electronic camera-on-a-chip,” IEEE Trans. Electron Devices 44(10), 1689–1698 (1997). [CrossRef]  

15. T. Hsu, Y. Fang, C. Lin, S. Chen, C. Lin, D. Yaung, S. Wuu, H. Chien, C. Tseng, J. Lin, and C. Wang, “Light Guide for Pixel Crosstalk Improvement in Deep Submicron CMOS Image Sensor,” IEEE Electron Device Lett. 25(1), 22–24 (2004). [CrossRef]  

16. M. Furumiya, H. Ohkubo, Y. Muramatsu, S. Kurosawa, F. Okamoto, Y. Fujimoto, and Y. Nakashiba, “High-sensitivity and no-crosstalk pixel technology for embedded CMOS image sensor,” IEEE Trans. Electron Devices 48(10), 2221–2227 (2001). [CrossRef]  

17. S. K. Nayar, G. Krishnan, M. D. Grossberg, and R. Raskar, “Fast Separation of Direct and Global Components of a Scene Using High Frequency Illumination,” in SIGGRAPH ‘06, Boston, Massachusetts (2006).

18. R. Whyte, L. Streeter, M. J. Cree, and A. A. Dorrington, “Resolving multiple propagation paths in time of flight range cameras using direct and global separation methods,” Opt. Eng. 54(11), 113109 (2015). [CrossRef]  

19. D. Fuentes-Jimenez, D. Pizarro-Perez, M. Mazo, and S. Palazuelos, “Modelling and correction of multipath interference in time-of-flight cameras,” in IEEE Conference on Computer Vision and Pattern Recognition (IEEE, 2012), pp 893–900.

20. A. A. Dorrington, J. P. Godbaz, M. J. Cree, A. D. Payne, and L. V. Streeter, “Separating true range measurements from multi-path and scattering interference in commercial range cameras,” in Three-Dimensional Imaging, Interaction, and Measurement 786404 (2011).

21. J. P. Godbaz, M. J. Cree, and A. A. Dorrington, “Closed-form inverses for the mixed pixel/multipath interference problem in AMCW lidar,” Proc. SPIE 8296, 829618 (2012). [CrossRef]  

22. S. Fuchs, “Multipath Interference Compensation in Time-of-Flight Camera Images,” in 20th International Conference on Pattern Recognition, (2010).

23. A. Sabov and J. Krüger, “Identification and Correction of Flying Pixels in Range Camera Data,” in SCCG ‘08, Budmerice, Slovakia (2008).

24. H. W. Jensen, J. Arvo, P. Dutré, A. Keller, A. Owen, M. Pharr, and P. Shirley, “Monte Carlo Ray Tracing,” SIGGRAPH Course Notes (44), 2003.

25. Y. Deng, Y. Ni, Z. Li, S. Mu, and W. Zhang, “Toward Real-Time Ray Tracing: A Survey on Hardware Acceleration and Microarchitecture Techniques,” ACM Comput. Surv. 50(4), 1–41 (2017). [CrossRef]  

26. I. Vasiljevic, N. Kolkin, S. Zhang, R. Luo, H. Wang, F. Z. Dai, A. F. Daniele, M. Mostajabi, S. Basart, M. R. Walter, and G. Shakhnarovich, “DIODE: A Dense Indoor and Outdoor DEpth Dataset,” CoRR, vol. abs/1908.00463, 2019.

27. N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, “Indoor Segmentation and Support Inference from RGBD Images,” in ECCV, (2012).

28. A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for Autonomous Driving? The KITTI Vision Benchmark Suite,” in CVPR, (2012).

29. D. Scharstein, H. Hirschmüller, Y. Kitajima, G. Krathwohl, X. W. N. Nesic, and P. Westling., “High-resolution stereo datasets with subpixel-accurate ground truth,” in German Conference on Pattern Recognition (GCPR 2014), Münster, Germany, (2014).

30. D. Scharstein and R. Szeliski, “High-accuracy stereo depth maps using structured light,” in IEEE Computer Society Conference on Computer Vision and Pattern Recognition (IEEE, 2003), pp. I-I.

31. K. Honauer, O. Johannsen, D. Kondermann, and B. Goldluecke, “A Dataset and Evaluation Methodology for Depth Estimation on 4D Light Fields,” in ACCV, (2016).

32. A. Belhedi, A. Bartoli, S. Bourgeois, V. Gay-Bellile, K. Hamrouni, and P. Sayd, “Noise modelling in time-of-flight sensors with application to depth noise removal and uncertainty estimation in three-dimensional measurement,” IET Comput. Vis. 9(6), 967–977 (2015). [CrossRef]  

33. D. Falie and V. Buzuloiu, “Noise characteristics of 3D time-of-flight cameras,” in IEEE International Symposium on Signals Circuits and Systems ISSCS, (IEEE, 2007) pp. 1–4.

34. S. Meister, S. Izadi, P. Kohli, M. Hämmerle, C. Rother, and D. Kondermann, “When Can We Use KinectFusion for Ground Truth Acquisition?,” (2011).

35. T. Edeler, K. Ohliger, S. Hussmann, and A. Mertins, “Time-of-flight depth image denoising using prior noise information,” in IEEE 10th International Conference on Signal Processing Proceedings (IEEE, 2010), pp. 119–122.

36. H. Schöner, V. Wieser, B. Moser, F. Bauer, B. Heise, A. A. Dorrington, A. D. Payne, and M. J. Cree, “Image processing for three-dimensional scans generated by time-of-flight range cameras,” J. Electron. Imaging 21(2), 023012 (2012). [CrossRef]  

37. S. Su, F. Heide, G. Wetzstein, and W. Heidrich, “Deep End-to-End Time-of-Flight Imaging,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, (IEEE, 2018), pp. 6383–6392.

38. Q. Guo, I. Frosio, O. Gallo, T. Zickler, and J. Kautz, “Tackling 3D ToF Artifacts Through Learning,” in ECCV, 2018.

39. P. D. Kovesi, “MATLAB and Octave Functions for Computer Vision and Image Processing,” 2000. [Online]. Available: https://www.peterkovesi.com/matlabfns. [Accessed 08 July 2021].

40. B. Delaunay, “Sur la sphère vide,” Izvestia Akademii Nauk SSSR, Otdelenie Matematicheskikh i Estestvennykh Nauk 7, 793–800 (1934).

41. L. Guibas, D. Knuth, and M. Sharir, “Randomized incremental construction of Delaunay and Voronoi diagrams,” Algorithmica 7(1-6), 381–413 (1992). [CrossRef]  

42. D. Claus and A. W. Fitzgibbon, “A Rational Function Lens Distortion Model for General Cameras,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, (CVPR'05), (IEEE, 2005), pp. 213–219.

43. S. Ortiz, D. Siedlecki, L. Remon, and S. Marcos, “Three-dimensional ray tracing on Delaunay-based reconstructed surfaces,” Appl. Opt. 48(20), 3886–3893 (2009). [CrossRef]  

Data availability

Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request. The data used in this paper can be reproduced from the raw information obtained with an Azure Kinect Development Kit or other cameras that contain a fourth-generation Microsoft-licensed Time-of-Flight sensor.



Figures (13)

Fig. 1. Block diagram of the general process described above: (1) background: geometry and signal; (2) polygon addition: regular and irregular polygons in random orientations; (3) triangulation and ray tracing; and (4) realistic noise addition, based on experimental noise.
Fig. 2. Generation of the irregular background pattern: random Gaussian noise (a), whose power spectrum, computed with the Fourier transform (b), is filtered by a low-pass weight that decays with distance from the centre; the inverse Fourier transform then yields the smoothed pattern shown in (c). The result, scaled by a random number, is presented in 3D in (d). This substrate is added to the rotated plane, creating the final background (e).
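As a rough illustration of this background-substrate step (a sketch under assumed parameters, not the authors' code), the following Python snippet low-pass filters Gaussian noise in the Fourier domain with an inverse-distance weight, scales it by a random factor, and adds it to a tilted plane; the grid size, filter shape, scale range, and plane coefficients are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
H, W = 240, 320                                   # assumed sensor resolution
noise = rng.standard_normal((H, W))               # (a) random Gaussian noise

# (b) power spectrum, attenuated by an assumed inverse-distance-from-centre low-pass weight
spectrum = np.fft.fftshift(np.fft.fft2(noise))
v, u = np.mgrid[-H // 2:H - H // 2, -W // 2:W - W // 2]
lowpass = 1.0 / (1.0 + np.hypot(u, v))

# (c) back to the spatial domain, (d) scaled by a random number (assumed range, in metres)
substrate = np.real(np.fft.ifft2(np.fft.ifftshift(spectrum * lowpass)))
substrate *= rng.uniform(0.05, 0.5)

# (e) added to an (assumed) randomly tilted plane to form the final background depth map
x, y = np.meshgrid(np.linspace(-1, 1, W), np.linspace(-1, 1, H))
background = (2.0 + 0.3 * x + 0.2 * y) + substrate
```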
Fig. 3. The process of simulating polygons, from the generation of polygons from randomized ellipses (a) to their addition to the background substrate (b), and the final step of adding amplitude as Active Brightness (AB), as described in (c).
Fig. 4. a) Ray tracing of a depth scene; the color map indicates depth, the camera position is given by $R_0(i,j)$, the direction cosines going from the camera pixels towards the scene are given by $\vec{K}(i,j)$, and the feasible impact point is given by $R(i,j)$. b) Delaunay triangles and vertices closest to the direction cosine $\vec{K}(i,j)$; the impact point of the ray is highlighted by the red asterisk, together with the point in Euclidean space $R(i,j)$.
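A minimal sketch of this ray-casting step, assuming a Delaunay triangulation over the lateral coordinates and a standard Möller–Trumbore ray–triangle test; the function name and the brute-force loop over triangles are illustrative choices, not the paper's implementation.

```python
import numpy as np
from scipy.spatial import Delaunay

def intersect_ray(R0, K, vertices):
    """R0: (3,) camera position; K: (3,) unit direction cosine; vertices: (N, 3) scene points."""
    tri = Delaunay(vertices[:, :2])               # triangulate on the lateral (x, y) coordinates
    best_t = np.inf
    for simplex in tri.simplices:                 # Moller-Trumbore test against each triangle
        v0, v1, v2 = vertices[simplex]
        e1, e2 = v1 - v0, v2 - v0
        p = np.cross(K, e2)
        det = e1 @ p
        if abs(det) < 1e-12:                      # ray parallel to this triangle
            continue
        s = R0 - v0
        u = (s @ p) / det
        q = np.cross(s, e1)
        v = (K @ q) / det
        t = (e2 @ q) / det
        if u >= 0 and v >= 0 and u + v <= 1 and 0 < t < best_t:
            best_t = t                            # keep the closest hit along R = R0 + t*K
    return R0 + best_t * K if np.isfinite(best_t) else None
```

Triangulating only the (x, y) coordinates mirrors the 2.5D nature of a depth map; a production ray tracer would add an acceleration structure instead of testing every triangle.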
Fig. 5. a) The diffusion process across a shallow trench, b) the diffusion due to a difference in signal, and c) the experimental result from a ToF sensor with a shallow trench, showing the signal gradient observed at the edges (in this case the amplitude was used).
Fig. 6. Precision (noise) of the signal as a function of the average signal. There are small differences between the frequencies due to the modulation contrast ($m$). a) Precision of the real part of the signal versus amplitude; b) precision of the imaginary part of the signal versus amplitude.
Fig. 7. Example of the process followed for the real and imaginary parts, frequencies, and ambient conditions; specifically, the figure exemplifies the process for the real part of the signal of the first frequency at 25 klux ambient light. a) The average (AV) of the precision over intervals of the average amplitude (plus signs) and the fitted average (solid line). b) Standard deviation of the precision (SD) over intervals of the average amplitude (asterisks) and the fitted curve (dashed line). c) The raw data as dots, the fitted average as a solid line, and the average plus/minus the standard deviation as dashed lines.
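The binning-and-fitting procedure described in this caption can be sketched as follows, assuming per-pixel arrays of average amplitude and measured precision and using the rational model (quartic over monic cubic) listed among the equations below; the variable names and the SciPy-based fit are assumptions, not the authors' code.

```python
import numpy as np
from scipy.optimize import curve_fit

def rational(a, p4, p3, p2, p1, p0, q2, q1, q0):
    """Ratio of a quartic to a monic cubic polynomial in the average amplitude."""
    return (p4 * a**4 + p3 * a**3 + p2 * a**2 + p1 * a + p0) / (a**3 + q2 * a**2 + q1 * a + q0)

def fit_precision(avg_amplitude, precision, n_bins=40):
    """Bin per-pixel precision by average amplitude, then fit the binned mean and std. dev."""
    edges = np.linspace(avg_amplitude.min(), avg_amplitude.max(), n_bins + 1)
    centres = 0.5 * (edges[:-1] + edges[1:])
    idx = np.clip(np.digitize(avg_amplitude, edges) - 1, 0, n_bins - 1)
    used, mean_prec, std_prec = [], [], []
    for b in range(n_bins):
        sel = precision[idx == b]
        if sel.size:                              # skip empty amplitude intervals
            used.append(centres[b])
            mean_prec.append(sel.mean())
            std_prec.append(sel.std())
    used = np.asarray(used)
    p_mu, _ = curve_fit(rational, used, np.asarray(mean_prec), p0=np.ones(8), maxfev=20000)
    p_sd, _ = curve_fit(rational, used, np.asarray(std_prec), p0=np.ones(8), maxfev=20000)
    return p_mu, p_sd                             # coefficients for the mu(a) and sigma(a) fits
```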
Fig. 8. Phase scene with (a) and without (b) noise, as output by the simulator; the colormap spans a range up to 2π.
Fig. 9. Multipath due to a floor bounce-back in both specular and diffuse scenarios.
Fig. 10. a) Result of the diffusion simulated with Sentaurus TCAD and b) its corresponding Gaussian fitting.
Fig. 11. The top row shows the amplitude of one frequency obtained from an experimental scene a) and a simulated scene b). A cross section is shown in the bottom row c).
Fig. 12. Experimental and simulated precision of the real component at 189 MHz for two scenes. The left column (scene #1) shows a post holding a disk over a uniform background: a) experimental precision, c) precision of the simulated noise, e) histograms of the precision for experimental and simulated noise for the real and imaginary components of the signal. The right column (scene #2) shows a circular hole with a plane in the foreground: b) experimental precision, d) precision of the simulated noise, f) histograms of the precision for experimental and simulated noise for the real and imaginary components of the signal.
Fig. 13. Qualitative comparison between a real corner scene and a simulated corner, using the focal length and pixel pitch of the experimental ToF camera and similar amplitudes and distances for the panels placed in the corner. Noise, crosstalk, and multipath were added to the simulated corner scene. The result is analyzed qualitatively by presenting a cross section of the radial distance for the experimental scene a), while b) presents the direct-path and multipath radial depths as the dotted and solid lines, respectively. The bottom row shows the coefficient of variation, calculated as the standard deviation ($\sigma$) of the amplitude over the three frequencies divided by the average, for the experimental scene c) and for the simulated scene d).
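The coefficient-of-variation map in panels c) and d) reduces, per pixel, to the standard deviation of the active brightness over the three modulation frequencies divided by its mean; a minimal sketch, assuming AB is stored as an array of shape (rows, cols, 3):

```python
import numpy as np

def coefficient_of_variation(AB):
    # AB: active brightness per pixel and modulation frequency, shape (rows, cols, 3) assumed
    return AB.std(axis=2) / AB.mean(axis=2)
```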

Tables (3)


Table 1. Parameters in generating a scene. All parameters can either be randomized or controlled by the user.


Table 2. Results for the fitting for real and complex mean.


Table 3. Results for the fitting for real and imaginary std. dev.

Equations (22)


$$\Delta \phi = \frac{4\pi d \nu}{c}$$
$$\begin{aligned} s(t) &= a\,\Delta \sin(\omega \tau + \Delta \phi) + b \\ r(t) &= \sin(\omega \tau) \end{aligned}$$
$$h(\tau) = s(t) \otimes g(t) = \lim_{T \to \infty}\frac{1}{T}\int_{-T/2}^{T/2} s(t)\,g(t+\tau)\,dt$$
$$h(\tau) = \frac{a}{2}\cos(\omega \tau + \Delta \phi)$$
$$h(\tau_i) = \frac{a}{2}\cos(\omega \tau_i + \Delta \phi) + b_i$$
$$\begin{cases} \Delta \phi = \operatorname{atan}\!\left[\dfrac{h(\tau_2) - h(\tau_4)}{h(\tau_1) - h(\tau_3)}\right] \\ a = \dfrac{1}{2}\sqrt{[h(\tau_1) - h(\tau_3)]^2 + [h(\tau_2) - h(\tau_4)]^2} \end{cases}$$
$$d_{ur} = \frac{c}{2\nu}$$
$$\sigma^2 = \frac{c}{4\sqrt{2}\,\pi\nu}\,\frac{a + \bar{h}}{m\,a}$$
$$R(i,j) = R_0(i,j) + t\,\vec{K}(i,j)$$
$$a(i,j,k) = a_{nm}(k)\,\frac{R_n^2}{R(i,j)^2}$$
$$h(\tau_\psi; i,j,k) = \sum_{m,n} \xi(m,n)\, h(\tau_\psi; m,n,k)$$
$$\begin{cases} S_r = h(\tau_2) - h(\tau_4) \\ S_i = h(\tau_1) - h(\tau_3) \end{cases}$$
$$S(i,j,k) \propto \sum_{m,n} \xi(m,n)\, a(m,n,k)\, \exp[i\,\Delta\phi(m,n,k)]$$
$$\text{with}\;\begin{cases} \xi(m,n) = \exp\!\left\{-\dfrac{(i-m)^2 + (j-n)^2}{2\sigma^2}\right\} \\ \xi(m,n) = \begin{pmatrix} 0.02 & 0.06 & 0.02 \\ 0.05 & 0.70 & 0.04 \\ 0.02 & 0.06 & 0.02 \end{pmatrix} \end{cases}$$
$$\begin{cases} \tilde{S}_r = S_r + \Delta\tilde{S}_r(a) \\ \tilde{S}_i = S_i + \Delta\tilde{S}_i(a) \end{cases}$$
$$\{\mu(\bar{a}),\,\sigma(\bar{a})\} = \frac{p_4\bar{a}^4 + p_3\bar{a}^3 + p_2\bar{a}^2 + p_1\bar{a} + p_0}{\bar{a}^3 + q_2\bar{a}^2 + q_1\bar{a} + q_0}$$
$$\Delta S(a) = \left(\mu(a) + 2\sqrt{2}\,\sigma(a)\right)\mathcal{N}(0,1)$$
$$S = S_d + S_m = a_d\,e^{i\phi_d(k)} + a_m\,e^{i\phi_m(k)}$$
$$\begin{cases} a(k) = \sqrt{a_d^2 + a_m^2 + 2 a_d a_m \cos[\phi_d(k) - \phi_m(k)]} \\ \phi(k) = \operatorname{atan}\!\left[\dfrac{a_d \sin\phi_d(k) + a_m \sin\phi_m(k)}{a_d \cos\phi_d(k) + a_m \cos\phi_m(k)}\right] \end{cases}$$
$$S = \tilde{S}_d + \tilde{S}_m = (S_d + \Delta S_d) + (S_m + \Delta S_m)$$
$$\begin{cases} \tilde{a}(k) = \sqrt{\tilde{a}_d^2 + \tilde{a}_m^2 + 2\tilde{a}_d\tilde{a}_m \cos[\tilde{\phi}_d(k) - \tilde{\phi}_m(k)]} \\ \tilde{\phi}(k) = \operatorname{atan}\!\left[\dfrac{\tilde{a}_d \sin\tilde{\phi}_d(k) + \tilde{a}_m \sin\tilde{\phi}_m(k)}{\tilde{a}_d \cos\tilde{\phi}_d(k) + \tilde{a}_m \cos\tilde{\phi}_m(k)}\right] \end{cases}$$
$$CoV(i,j) = \frac{\sigma(AB(i,j,k))}{\overline{AB}(i,j,k)}$$
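For readers who prefer code to the listing above, the following sketch (an illustration under assumed array shapes, not the authors' implementation) works through three of the listed relations: the four-bucket phase and amplitude estimate, the fitted 3×3 crosstalk kernel applied to the complex signal, and the coherent sum of a direct and a multipath return.

```python
import numpy as np
from scipy.ndimage import convolve

def phase_amplitude(h1, h2, h3, h4):
    """Four correlation samples h(tau_i) per pixel -> (delta_phi, a)."""
    delta_phi = np.arctan2(h2 - h4, h1 - h3)      # arctan2 keeps the full 2*pi phase range
    a = 0.5 * np.sqrt((h1 - h3) ** 2 + (h2 - h4) ** 2)
    return delta_phi, a

# Fitted 3x3 crosstalk kernel xi(m, n) from the listing above
XI = np.array([[0.02, 0.06, 0.02],
               [0.05, 0.70, 0.04],
               [0.02, 0.06, 0.02]])

def apply_crosstalk(a, delta_phi):
    """Convolve the complex signal S = a*exp(i*delta_phi) with the crosstalk kernel."""
    S = a * np.exp(1j * delta_phi)
    S_xt = convolve(S.real, XI, mode="nearest") + 1j * convolve(S.imag, XI, mode="nearest")
    return np.abs(S_xt), np.angle(S_xt)

def add_multipath(a_d, phi_d, a_m, phi_m):
    """Coherent sum of a direct and a multipath return for one modulation frequency."""
    S = a_d * np.exp(1j * phi_d) + a_m * np.exp(1j * phi_m)
    return np.abs(S), np.angle(S)
```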