
Non-line-of-sight imaging and tracking of moving objects based on deep learning

Open Access

Abstract

Detection of objects outside the line of sight remains a challenge in many practical applications. Various studies have realized 2D or 3D imaging of static hidden objects, where the aim is to improve the resolution of the reconstructed images. When it comes to tracking continuously moving objects, however, imaging speed and positioning accuracy become the priorities to optimize. Previous works have achieved centimeter-level or finer positioning precision by marking coordinates at intervals ranging from 3 seconds to tens of milliseconds. Here, a deep learning framework is proposed to realize imaging and dynamic tracking of targets simultaneously using a standard RGB camera. Through simulation experiments, we first use the designed neural network to position a 3D mannequin with sub-centimeter accuracy (relative error under 1.8%), costing only about 3 milliseconds per estimation on average. Furthermore, we apply the system to a physical scene and successfully recover the video signal of a moving target, intuitively revealing its trajectory. We demonstrate an efficient and inexpensive approach that can present the movement of objects around a corner in real time; because the NLOS scene itself is imaged, it is also possible to identify the hidden target. This technique can be applied to security surveillance, military reconnaissance, autonomous driving, and other fields.

© 2022 Optica Publishing Group under the terms of the Optica Open Access Publishing Agreement

1. Introduction

The advent of imaging equipment with high sensitivity and high resolution, along with the boom in computational imaging technology, has made it possible to 'see around corners'. This technique, called non-line-of-sight imaging (NLOS imaging), was first proposed in 2008 by Raskar et al. at MIT and has since been widely applied to detect objects outside the field of view (FOV) of imaging equipment [1]. By breaking through the line-of-sight limitation, NLOS imaging finds very broad applications in fields such as disaster relief, military counter-terrorism, medical imaging, and assisted driving [2].

According to the time resolution of the imaging equipment, NLOS imaging can be divided into transient imaging and steady-state imaging. Early research mainly focused on transient imaging, using a high-frequency modulated pulsed laser as the light source to illuminate NLOS scenes and expensive imaging equipment such as streak cameras or single-photon avalanche diodes (SPADs) to detect the reflected laser. Benefiting from the ultra-high temporal resolution of these photodetectors, the time-of-flight (ToF) information of the laser can be captured. By exploiting a ToF ranging principle similar to that of LiDAR, a light transport model can be established, and image reconstruction can then be performed through various optimization algorithms [3–5]. In contrast, steady-state imaging usually employs a standard CMOS camera, where the speed of light is assumed to be infinite so that the propagation time of diffuse light in the scene can be ignored. Reconstruction is then accomplished by utilizing the total light intensity received by the sensor during the exposure time [6–8] or the optical memory effect of speckle coherence [9,10].

1.1 Deep learning for non-line-of-sight imaging

Whether for transient or steady-state imaging, traditional techniques require establishing a light transport model from the hidden object to the imaging device and then using the reflected light to inversely calculate the actual information of the target through optimization algorithms. This is an ill-posed inverse problem that involves a large number of high-dimensional matrix operations, and it is much more difficult to obtain an accurate model containing environmental uncertainties such as noise and aberration in the actual imaging process [11]. Deep learning has been applied to NLOS imaging in recent years. By feeding in massive data samples beforehand, neural networks can learn the complex nonlinear association between the physical measurements and the information of the targets, so as to achieve rapid reconstruction [12].

1.2 Tracking moving objects around the corner

Previous works have delved into reconstructing images of NLOS scenes. However, in certain applications, such as detecting vehicles and pedestrians approaching from blind areas in autonomous driving, or detecting indirectly accessible suspects in criminal investigation, not only imaging but also capturing the location of the target is required. In related research, Gariepy et al. and Chan et al. used pulsed lasers and SPADs to obtain the time-photon count histogram of each pixel; after Gaussian fitting, the joint probability distribution of the target location is calculated using the measurements of all pixels [13,14]. Smith et al. used speckle correlation to achieve high-precision tracking of two objects [15].

All the above studies can only present the position of the target in the form of coordinates, while our goal is to achieve intuitive, real-time tracking of moving objects through rapid imaging. Based on these requirements, a deep learning approach that requires no additional active light source is designed. We select a generative neural network and a convolutional neural network, both suitable for image processing, to complete the reconstruction work. First, a simulated scene is built to reconstruct the classic MNIST digit dataset using the proposed neural network, proving the imaging ability of the network when applied to an NLOS scene. Second, the coordinates of a non-self-luminous 3D mannequin in an NLOS room are also recovered. Moreover, in order to verify the performance of our system in practical applications, a physical scene simulating a conventional interior lighting environment is set up, and a mannequin undergoing continuous motion is detected by an RGB camera placed outside the scene. This experiment eventually realizes the combination of NLOS imaging and tracking: the image of the mannequin in the reconstructed video changes with the real target in real time, accurately reflecting its trajectory. The system provides a more intuitive way to image and track moving objects around corners.

2. Experimental procedure

2.1 Experimental setup

Because deep learning is essentially a 'data-driven' approach, a large amount of sample data must be obtained in advance to train the network before putting the system into practical use. Although this makes considerable preparatory work necessary whenever the NLOS scene changes, it is not a problem for relatively fixed scenes such as security monitoring areas. In experimental operations, the collection of training data is time-consuming, so to improve efficiency, software is first used to conduct simulation experiments. Here we select Blender, physically based 3D software with the Cycles renderer, to build the NLOS scene.

In the basic scene commonly proposed in previous studies, the camera cannot directly photograph the target due to some obstruction (e.g., a wall), but it can capture the diffuse light reflected by the relay wall to indirectly obtain the target's information (Fig. 1). We build two NLOS scenes to study the feasibility of the system for NLOS imaging and positioning, respectively.

Fig. 1. Basic non-line-of-sight scene. Light emitted from the hidden target is reflected by the diffuse surface and then captured by the camera; the ROI of the diffuse image can be used to recover a 2D image of the hidden target.

Scene No. 1 is shown in Fig. 2(a) and represents the concept of 'accidental pinhole cameras' proposed by Torralba et al. in 2012 [16]: the object is located indoors, and the light it emits is projected through a window hole in the wall onto a diffuse surface outdoors and then captured by a camera facing that surface. The window hole acts as an occluder, which forms penumbra information of the target on the relay wall. The targets are MNIST images with a resolution of 28×28 pixels. In Blender, a flat-panel display is simulated by setting a plane of the same size as an emissive material and mapping the numeral images onto it as material nodes (Fig. 2(b)). We write a piece of code to continuously switch the numeral image mapped onto the plane, and each time the software renders the corresponding diffuse image (Fig. 2(c)), so as to obtain a training set in which the numeral images and the diffuse images correspond one to one; a sketch of this procedure is given below.
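
The switch-and-render loop can be scripted through Blender's Python API. The following is only a minimal sketch under assumed object and node names (an emissive plane called "MNIST_Plane" whose material has a default "Image Texture" node, and placeholder file paths), not the exact script used in this work.

```python
# Minimal Blender (bpy) sketch: map each MNIST image onto the emissive plane and
# render the corresponding diffuse image. Object/node names and paths are assumptions.
import bpy, glob, os

plane = bpy.data.objects["MNIST_Plane"]                              # assumed object name
tex_node = plane.active_material.node_tree.nodes["Image Texture"]    # assumed node name
out_dir = "//renders/"                                               # Blender-relative output folder

for i, path in enumerate(sorted(glob.glob("/data/mnist_png/*.png"))):
    tex_node.image = bpy.data.images.load(path)                      # switch the numeral image
    bpy.context.scene.render.filepath = os.path.join(out_dir, f"diffuse_{i:05d}.png")
    bpy.ops.render.render(write_still=True)                          # Cycles renders the relay-wall view
```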

Fig. 2. Simulation scene No. 1. (a) A scene with a window pinhole as the occluder and a self-illuminating plane as the hidden object. (b) Settings of the hidden object: add a material node to a plane and map MNIST handwritten numeral images onto it. (c) An example of a rendered diffuse image taken by the camera.

In the above scene, the target is set as a self-luminous plane. However, most objects that need to be detected in real life are 3D (such as the human body) and do not shine actively; they can only transmit information by reflecting ambient light. Generally, the light is greatly weakened after multiple reflections, so very little information can be captured.

In order to study the positioning ability of the neural network in such a case, the following 'L-shaped' scene is constructed (Fig. 3): the target to be observed is a mannequin, and its surface material is set to a Diffuse BSDF (roughness: 0.0) according to the albedo characteristics of human skin. In order to record the position (x and y coordinates in Blender) of the target precisely, we write a piece of code to control its random movement within an enclosed space (simulating an indoor scene); a sketch of this control script is given below. A chair is placed between the target and the relay wall as an occluder, simulating the partial occlusion of reflected light by furniture or other objects in a real indoor environment. An ordinary lamp emitting incoherent light is installed at the top of the room; its light irradiates the mannequin, and the reflected light projects penumbra information onto the relay wall under the effect of the chair-shaped obstruction. Diffuse images taken by the camera (placed in the 'corridor') are generated by rendering.
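
As a rough sketch only (the object name, movement bounds, and file paths are assumptions), the random-walk and coordinate-logging script could look like this:

```python
# Blender (bpy) sketch: move the mannequin to random positions inside the room,
# render the diffuse view, and log the ground-truth (x, y) used later as labels.
import bpy, csv, random

mannequin = bpy.data.objects["Mannequin"]     # assumed object name
x_range, y_range = (-1.0, 1.0), (0.5, 2.5)    # assumed walkable area in Blender units

with open("/data/labels.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["frame", "x", "y"])
    for i in range(10000):
        x = random.uniform(*x_range)
        y = random.uniform(*y_range)
        mannequin.location.x, mannequin.location.y = x, y
        bpy.context.scene.render.filepath = f"//renders/diffuse_{i:05d}.png"
        bpy.ops.render.render(write_still=True)
        writer.writerow([i, x, y])
```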

Fig. 3. Simulation scene No. 2. A simplified indoor environment is built to simulate a practical scene commonly seen in anti-terrorism or rescue operations, where the location of the target must be determined before entering the room.

2.2 Deep convolutional inverse graphics network for image reconstruction

2.2.1 Network structure

Among various deep neural networks, generative networks are often used for image processing. In this paper we design a generative network called a Deep Convolutional Inverse Graphics Network (DCIGN). Its basic structure is shown in Fig. 4(a) [17]; it mainly contains an encoder before the latent variables z and a decoder after them. The encoder is a convolutional neural network (CNN) whose function is to extract and abstract features from the input image by down-sampling through convolutional and pooling layers. The latent variables z are described by a series of means and variances obtained through the reparameterization trick; we hope that z can be as close as possible to the characteristic probability distribution of the input data. The decoder is a deconvolutional neural network (DNN), which randomly samples from each feature distribution and then generates output images by up-sampling through transposed convolutional layers to restore the dimensions (Fig. 4(b)).
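
The exact layer dimensions used in this work are marked in Fig. 6; purely as an illustration, a pipeline of this kind (encoder, reparameterization, decoder) could be sketched in TensorFlow/Keras as follows. The layer sizes below are assumptions, not the architecture of the paper.

```python
# Sketch of a DCIGN-style encoder/decoder in TensorFlow 2.x (Keras).
# Layer sizes are illustrative assumptions only.
import tensorflow as tf

latent_dim = 32

def build_encoder(input_shape=(180, 200, 1)):
    inp = tf.keras.Input(shape=input_shape)
    x = tf.keras.layers.Conv2D(16, 3, strides=2, padding="same", activation="relu")(inp)
    x = tf.keras.layers.Conv2D(32, 3, strides=2, padding="same", activation="relu")(x)
    x = tf.keras.layers.Flatten()(x)
    mu = tf.keras.layers.Dense(latent_dim)(x)        # means of the latent variables z
    logvar = tf.keras.layers.Dense(latent_dim)(x)    # log-variances of the latent variables z
    return tf.keras.Model(inp, [mu, logvar])

def build_decoder():
    z = tf.keras.Input(shape=(latent_dim,))
    x = tf.keras.layers.Dense(7 * 7 * 32, activation="relu")(z)
    x = tf.keras.layers.Reshape((7, 7, 32))(x)
    x = tf.keras.layers.Conv2DTranspose(16, 3, strides=2, padding="same", activation="relu")(x)
    logits = tf.keras.layers.Conv2DTranspose(1, 3, strides=2, padding="same")(x)  # 28x28 logits
    return tf.keras.Model(z, logits)

def reparameterize(mu, logvar):
    # z = mu + sigma * eps with eps ~ N(0, 1): the reparameterization trick
    eps = tf.random.normal(shape=tf.shape(mu))
    return mu + tf.exp(0.5 * logvar) * eps
```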

Fig. 4. Basic structure of the generative neural network used in this paper. (a) Fundamental structure of the DCIGN: the encoder generates the means and variances of the latent variables, determining their probability distribution, and the decoder randomly samples the latent variables, using the sampled values to reconstruct the input. (b) Different layers contained in the DCIGN.

2.2.2 Selection of loss function

The loss function is a tool to evaluate the difference between the network outputs and the real data. According to the characteristics of the DCIGN, the loss function used in this paper is divided into two parts.

The first part is the reconstruction loss, which measures the difference between the images generated by the decoder and the labels (real images) in the data set. We select the sigmoid cross entropy function, frequently used in image processing, to evaluate the reconstruction loss. First, the pixel values of the output images are scaled to the interval (0,1) through the sigmoid function, and the cross entropy is calculated:

$$H ={-} \sum\limits_{i = 1}^{m} \sum\limits_{j = 1}^{n} \left[ x_{ij}\ln \hat{x}_{ij} + (1 - x_{ij})\ln (1 - \hat{x}_{ij}) \right]$$
$$x_{ij} = \mathrm{labels}_{ij}$$
$$\hat{x}_{ij} = \mathrm{sigmoid}(\mathrm{outputs}_{ij}) = \frac{1}{1 + {e^{ -\mathrm{outputs}_{ij}}}}$$
where $\mathrm{labels}$ and $\mathrm{outputs}$ represent the reference images in the data set and the output images of the network, respectively.
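
Equations (1)–(3) correspond directly to the sigmoid cross entropy built into TensorFlow; a minimal sketch (tensor names are assumptions) is:

```python
import tensorflow as tf

def reconstruction_loss(labels, logits):
    # Sigmoid cross entropy, Eqs. (1)-(3): labels are reference images in [0, 1],
    # logits are the raw decoder outputs before the sigmoid.
    per_pixel = tf.nn.sigmoid_cross_entropy_with_logits(labels=labels, logits=logits)
    return tf.reduce_sum(per_pixel, axis=[1, 2, 3])   # sum over all pixels of each image
```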

The second part is the KL divergence, which evaluates the error between the probability distribution of the latent variables z and the actual feature distribution of the real images. It is usually assumed that the actual feature distribution $p(z)$ obeys the standard normal distribution $N(0,1)$, while the distribution $q(z|x)$ obtained from training follows a normal distribution $N(\mu ,{\sigma ^2})$; the KL divergence between the two is then:

$${D_{K\textrm{L}}}(q\parallel p) = {D_{K\textrm{L}}}(N(\mu ,{\sigma ^2})\parallel N(0,1)) ={-} \frac{1}{2}\sum\limits_{i = 1}^n {({\sigma _i} + 1 - \exp ({\sigma _i}) - \mu _i^2)}$$
where $n$ represents the dimension of the latent variable z and ${\sigma _i}$ denotes the log-variance of the $i$-th latent dimension produced by the encoder.
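
With the encoder outputs interpreted as the mean and log-variance of each latent dimension, the closed form of Eq. (4) can be computed, for example, as:

```python
import tensorflow as tf

def kl_loss(mu, logvar):
    # Eq. (4): KL divergence between N(mu, exp(logvar)) and N(0, 1),
    # summed over the latent dimensions of each sample.
    return -0.5 * tf.reduce_sum(1.0 + logvar - tf.square(mu) - tf.exp(logvar), axis=1)
```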

To diminish the error as much as possible, both the KL divergence and the cross entropy are required to reach their minimum. Therefore, the training of the DCIGN is actually a process of minimizing this loss function through backward gradient propagation.

2.2.3 Selection of optimizer

The optimizer selected in this paper is the Adam optimizer. It combines the advantages of Stochastic Gradient Descent (SGD) with momentum and RMSprop, and offers very good parameter optimization performance [18].

The basic principle of SGD with momentum is to construct a velocity $V$ during backpropagation, accumulate the gradient of the loss function into the velocity, and then use the velocity to update the parameters:

$${V_0} = 0$$
$${V_{t + 1}} = \rho {V_t} + (1 - \rho )\nabla {f_{{\omega _t}}}$$
$${\omega _{t + 1}} = {\omega _t} - \alpha {V_{t + 1}}$$
where $\rho $ represents the friction coefficient, normally 0.9-0.99; $\omega $ represents the parameters of the network; $\alpha $ is the learning rate; $f$ stands for the loss function; and $\nabla {f_{{\omega _t}}}$ is the gradient of the loss function with respect to the parameters ${\omega _t}$.

By using the defined velocity to update the parameters, even if the gradient becomes zero at a saddle point or a local optimum, the parameter update will not stop, because the velocity is still nonzero. In addition, SGD with momentum smooths drastically changing gradients into a gentler transition, producing a smoother descent and ultimately accelerating gradient descent.
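
As a plain illustration of Eqs. (5)–(7) (not the library implementation actually used in training), the update could be written as:

```python
import numpy as np

def momentum_step(w, v, grad, lr=0.001, rho=0.9):
    # Eqs. (5)-(7): exponentially averaged velocity, then a parameter step along it.
    v = rho * v + (1.0 - rho) * grad
    w = w - lr * v
    return w, v

# usage: the velocity starts at zero (Eq. (5))
w, v = np.ones(4), np.zeros(4)
w, v = momentum_step(w, v, grad=np.array([0.1, -0.2, 0.0, 0.3]))
```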

RMSprop maintains an exponentially weighted moving average of the squared gradient and uses it to scale the update:

$${s_{t + 1}} = \beta {s_t} + (1 - \beta )\nabla f_{{\omega _t}}^2$$
$${\omega _{t + 1}} = {\omega _t} - \alpha \frac{{\nabla {f_{{\omega _t}}}}}{{\sqrt {{s_{t + 1}} + \varepsilon } }}$$
where $\beta $ is the weighting coefficient, normally 0.9-0.99; $\omega $ represents the parameters of the network; $\nabla f_{{\omega _t}}^2$ stands for the square of the gradient of the loss function $f$ with respect to the parameters ${\omega _t}$; ${s_t}$ is the moving average of the squared gradients; $\varepsilon$ is a small value added to keep the denominator from zero, usually $10^{-7}$; and $\alpha $ is the learning rate.

In a direction with a smaller gradient, the denominator $\sqrt {{s_{t + 1}} + \varepsilon }$ shrinks as the parameters are updated, which makes the update stride larger; in a direction with a larger gradient, the denominator grows, making the update stride smaller. This effectively prevents excessive oscillation of the parameter updates during gradient descent and approaches the optimal parameters faster.
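
Correspondingly, the RMSprop update of Eqs. (8)–(9) can be sketched as:

```python
import numpy as np

def rmsprop_step(w, s, grad, lr=0.001, beta=0.9, eps=1e-7):
    # Eqs. (8)-(9): the moving average of squared gradients scales the step size.
    s = beta * s + (1.0 - beta) * grad ** 2
    w = w - lr * grad / np.sqrt(s + eps)
    return w, s
```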

2.3 Convolutional neural network for positioning

Since only two coordinates (scalars) are needed to estimate the location of the object, the network structure can be greatly simplified compared with the generative network used to recover images; a convolutional neural network (CNN) suffices. According to the characteristics of the diffuse images obtained in scene No. 2, the number of convolutional layers and the size of the convolutional kernels are adjusted. The network structure is shown in Fig. 5: after five convolutional layers (interleaved with batch normalization and max-pooling layers) compress the features, three fully connected layers extract the location information and output the x and y coordinates. The cross entropy loss function and the Adam optimizer are also used to train this network.
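
A Keras sketch of such a regression CNN could look like the following; the filter counts and dense-layer sizes are illustrative assumptions, while the actual dimensions are those marked in Fig. 5.

```python
import tensorflow as tf

def build_position_cnn(input_shape=(180, 200, 1)):
    # Five conv blocks (conv + batch norm + max pooling) followed by three dense
    # layers that regress the (x, y) coordinates of the hidden target.
    model = tf.keras.Sequential()
    model.add(tf.keras.Input(shape=input_shape))
    for filters in (16, 32, 64, 64, 128):
        model.add(tf.keras.layers.Conv2D(filters, 3, padding="same", activation="relu"))
        model.add(tf.keras.layers.BatchNormalization())
        model.add(tf.keras.layers.MaxPooling2D(2))
    model.add(tf.keras.layers.Flatten())
    model.add(tf.keras.layers.Dense(256, activation="relu"))
    model.add(tf.keras.layers.Dense(64, activation="relu"))
    model.add(tf.keras.layers.Dense(2))      # outputs: x and y coordinates
    return model
```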

Fig. 5. Structure of the CNN with the dimensions of each layer's output marked.

3. Results

3.1 Image reconstruction with self-luminous plane

For scene No. 1 described in Section 2.1, 10000 MNIST handwritten numeral images are randomly sampled and the corresponding diffuse images are rendered in Blender; 8000 are used as the training set, 1000 as the validation set, and 1000 as the test set. The regions of interest (ROI, 400 pixels by 400 pixels) containing effective diffuse information are first extracted, and the perspective transformation function in OpenCV, a computer vision library, is used to transform the ROI images from an oblique perspective into a frontal perspective. The size of each processed diffuse image is 180 pixels by 200 pixels. Since all networks used in this paper are built on TensorFlow, a machine learning framework for computing tensors, it is also necessary to convert the image data into tensors and then perform normalization. After all of the above preprocessing, the data can be fed into the neural network.
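
The preprocessing chain (ROI crop, perspective rectification, tensor conversion, normalization) can be sketched with OpenCV as follows; the corner coordinates and target size orientation are placeholders.

```python
import cv2
import numpy as np

def preprocess(diffuse_path, roi_corners):
    # roi_corners: four (x, y) image corners of the ROI on the relay wall, listed as
    # top-left, top-right, bottom-right, bottom-left (placeholder values elsewhere).
    img = cv2.imread(diffuse_path, cv2.IMREAD_GRAYSCALE)
    src = np.float32(roi_corners)
    dst = np.float32([[0, 0], [180, 0], [180, 200], [0, 200]])   # rectified 180 x 200 view
    M = cv2.getPerspectiveTransform(src, dst)
    rectified = cv2.warpPerspective(img, M, (180, 200))          # frontal perspective
    tensor = rectified.astype(np.float32)[..., None] / 255.0     # HxWx1, normalized to [0, 1]
    return tensor
```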

In terms of hyper-parameters, the learning rate is set to 0.001, the number of epochs to 200, and each batch contains 50 images. The ReLU function is chosen as the activation function: compared with the previously widespread sigmoid function, the gradient of ReLU remains 1 over the positive range, which effectively avoids gradient saturation during back-propagation, so ReLU is generally used as the activation function for large-scale deep neural networks. The dimensions of each layer are shown in Fig. 6. A GTX 1080 Ti graphics card is used to execute the training, which takes about 6 hours in total.
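
Putting the earlier sketches together, the training loop could be organized roughly as below, with the hyper-parameters as stated above; the function names come from the sketches earlier in this section and are assumptions, not the authors' code.

```python
import tensorflow as tf

# Assumes build_encoder, build_decoder, reparameterize, reconstruction_loss and
# kl_loss defined as in the earlier sketches; `dataset` yields (diffuse, label) pairs.
encoder, decoder = build_encoder(), build_decoder()
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)

@tf.function
def train_step(diffuse, label):
    with tf.GradientTape() as tape:
        mu, logvar = encoder(diffuse, training=True)
        z = reparameterize(mu, logvar)
        logits = decoder(z, training=True)
        loss = tf.reduce_mean(reconstruction_loss(label, logits) + kl_loss(mu, logvar))
    variables = encoder.trainable_variables + decoder.trainable_variables
    optimizer.apply_gradients(zip(tape.gradient(loss, variables), variables))
    return loss

# for epoch in range(200):              # 200 epochs, batches of 50 images
#     for diffuse, label in dataset.batch(50):
#         train_step(diffuse, label)
```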

Fig. 6. Structure of the DCIGN for image reconstruction with the dimensions of each layer's output marked.

Figure 7 shows the training results. The network contains 96,909 parameters altogether, of which 96,881 are trainable. The decline curve of the loss function (Fig. 7(a)) shows that as the number of iterations increases, the loss drops rapidly and then flattens out, indicating that the training meets expectations. Figure 7(b) presents some reconstructions from the test set; after only 200 epochs, the DCIGN can reconstruct the handwritten numeral images very accurately, meaning the reconstruction performance of the trained network on the test set is satisfactory. In addition, the reconstruction itself is very fast, taking only about 3 ms per image. This experiment fully confirms the feasibility of applying the proposed DCIGN to an NLOS scene for imaging self-luminous planar objects.

Fig. 7. Training results and image reconstruction effect. (a) The loss function with respect to epoch: as the number of iterations increases, the loss converges sharply, indicating good training. (b) Comparison of the original images (left) and the reconstructed images (right); though the diffuse images (middle) are completely blurry, the outputs of the DCIGN are quite clear and distinguishable.

3.2 Positioning with non-self-luminous 3D model

For scene No. 2, which is used to explore the location-recovery capacity of the neural network, the diffuse images are preprocessed through a procedure similar to that of the numeral-image simulation explained in Section 3.1: they are converted into tensors after perspective transformation. The learning rate is set to 0.001, and the batch size and number of epochs are both set to 500. The diffuse images and the ground-truth coordinates are fed into the CNN as inputs and labels for training. 1000 groups of data are randomly set aside in advance to test the generalization performance of the network. Partial results are shown in Table 1; the formula $|{a - \hat{a}} |/a$ is used to calculate the relative error between the output and the ground truth. Over the 1000 test samples, the average error is about 1.831%, which shows that the CNN can use diffuse images to achieve swift, sub-centimeter positioning.
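
The error metric itself is straightforward; for example:

```python
import numpy as np

def relative_error(pred, truth):
    # |a - a_hat| / a, averaged over both coordinates of all test samples
    return np.mean(np.abs(truth - pred) / np.abs(truth))
```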

Table 1. Position estimation error between CNN output and the ground truth (10 samples selected from test set)

4. Application in physical scene

In the sections above, the reliability of deep learning for NLOS imaging and positioning has been fully confirmed. Moreover, compared with traditional methods, it can achieve rapid recovery (millisecond level) because there is no need to establish a complex light transport model. This superiority offers a new way to track objects moving continuously in NLOS scenes. However, the above experiments were designed separately, and the first numerical study was conducted under ideal conditions where the targets were static and there was no ambient light in the scene. Tancik et al. from MIT first realized dynamic tracking of 3D geometries with a CNN through simulation experiments [19]; in their work, objects trace an infinity sign and a circular path, every frame of video is sent into the CNN to estimate the position, and the moving trail of the target is eventually presented on a grid. Cao et al. also designed a CNN-based NLOS-LUCAI framework, which carries out ambient compensation and realizes the precise positioning of a 3D mannequin at different positions in a grid region under the interference of changing ambient illumination [20]. Prior to this, there had also been several studies using different devices and methods to track objects around corners, but all of them only recovered coordinates [13–15,21]. Besides, former approaches such as Tancik's and Cao's all require an additional light source to actively illuminate the NLOS scene. This may bring more information about the hidden target to the imaging device, but it also disturbs the scene; in certain applications such as criminal investigation, any interference could incur danger. In order to explore a more covert and intuitive approach to tracking objects in NLOS scenes, the following experiment is conducted to verify the rapid imaging ability of the DCIGN for non-self-luminous moving targets.

We use PVC expansion sheets to build a scene similar to scene No. 2 (Fig. 8): the target (a 2D flat mannequin) can move freely in the NLOS space, and a desk lamp emitting white incoherent light is placed at the top to illuminate the scene. In the experiment, two standard H.264-encoded camera modules produced by Rayvision are used; the camera sensor is a SONY CMOS IMX322 with a focal length of 4 mm and a viewing angle of about 70°. One camera (camera 1), located outside the scene, is responsible for photographing the diffuse surface. The other (camera 2) is fastened above the diffuse surface to directly shoot the mannequin; these pictures are used as labels in the training set.

Fig. 8. Physical scene. Incoherent light emitted from the desk lamp illuminates the mannequin and, after several reflections, is captured by camera 1, which faces the diffuse surface. Camera 2 directly shoots the model; those images are taken as references.

Since the light emitted by the lamp can also illuminate the diffuse surface without reaching the target, light that carries no effective target information is likewise captured by camera 1 after reflection; using these images directly would degrade the reconstruction and could even make the target indistinguishable. As a result, measures need to be taken to eliminate the influence of ambient light and improve the signal-to-noise ratio (SNR). Here, background subtraction is employed (a sketch is given below): before the experiment, camera 1 takes a picture of the diffuse surface with no target in the scene, and this picture is used as the background image. Differential processing is conducted between the background image and the diffuse images, and a sigmoid function is then applied to amplify the differences. After these procedures the processed images are sent into the network.
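
A minimal sketch of this background-subtraction step (the gain of the sigmoid is an assumed parameter, not a value reported in the paper):

```python
import numpy as np

def subtract_background(diffuse, background, gain=10.0):
    # Difference between the current diffuse frame and the empty-scene background image,
    # then a sigmoid to amplify small penumbra variations (gain is an assumed value).
    diff = (diffuse.astype(np.float32) - background.astype(np.float32)) / 255.0
    return 1.0 / (1.0 + np.exp(-gain * diff))
```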

The proposed DCIGN is again used for this experiment. During the training process, the position of the mannequin shifts randomly and continuously, and cameras 1 and 2 are controlled by code to simultaneously capture 10000 pairs of diffuse images and reference images as training data. After training, camera 1 shoots a 15-second video of the moving mannequin, which is sent to the DCIGN for reconstruction frame by frame. Part of the results is shown in Fig. 9: Figs. 9(a) and 9(b) show the real diffuse images and the corresponding images after background subtraction, respectively. As the mannequin moves, it is hard for the naked eye to distinguish the changes in reflection on the diffuse surface, but after differential processing the variation of the penumbra information is clearly presented. The network output clearly shows the contour and posture of the mannequin (Fig. 9(d)).
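
Frame-by-frame reconstruction of such a clip could be driven by a loop like the following; here `rectify`, `background`, and `model` are placeholders for the perspective correction of Section 3.1, the empty-scene image, and the trained DCIGN, respectively.

```python
import cv2
import numpy as np

# Hypothetical reconstruction loop: read the camera-1 video, preprocess each frame,
# and let the trained network generate the reconstructed view of the mannequin.
cap = cv2.VideoCapture("camera1_clip.mp4")
frames_out = []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    x = subtract_background(rectify(gray), background)   # assumed helpers from above
    x = x[None, ..., None].astype(np.float32)            # add batch and channel dimensions
    frames_out.append(model.predict(x)[0])               # roughly 3 ms per frame on a GPU
cap.release()
```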

Fig. 9. Reconstruction effect of the video signal, a few frames extracted (see Visualization 1). (a) Diffuse images captured by camera 1 (same as what the naked eye can see); no obvious difference as the model moves. (b) Processed diffuse images with distinguishable variation of the penumbra information. (c) Original video photographed by camera 2. (d) Network output (using (b) as input).

4.1 Test of generalization

The above experiment successfully reconstructs images of a continuously moving mannequin under the interference of ambient light. Furthermore, in order to verify the generalization of the proposed DCIGN, three more mannequins with different postures are applied to the system. The mannequins are numbered No. 1-3; mannequins No. 1 and No. 2 are used to train the network, so as to further improve its reconstruction capacity, and mannequin No. 3 is used to test generalization. The shape of each one is shown in Fig. 10.

Fig. 10. Three different-shaped mannequins. Mannequins No. 1 and No. 2 are used for training; mannequin No. 3 is used for testing.

To ensure consistency, 10000 groups of training data are captured for each training mannequin (No. 1 and No. 2). After a unified preprocessing procedure (perspective transformation, background subtraction, and so on), all data are sent to the DCIGN, which is trained for 500 epochs. Partial reconstruction results are presented in Fig. 11(a).

Fig. 11. Reconstruction results of the mannequins. (a) Partial reconstruction images of training mannequins No. 1 and No. 2. (b) Reconstruction images of testing mannequin No. 3.

After the whole training process, 400 diffuse images of mannequin No. 3 shot by camera 1 are used for testing. The reconstructed images shown in Fig. 11(b) reveal that even though images of mannequin No. 3 are not included in the training data, the DCIGN can still recover its rough shape.

Combined with the first simulation experiment, the imaging ability of the proposed DCIGN has been fully confirmed, both for static planar objects and for moving objects. Moreover, the DCIGN is fast: each reconstruction takes only around 3 ms, much less than the duration of a single frame of regular video (about 16.7 ms for 60 fps video). This reconstruction speed fully meets the requirement of real-time presentation in the form of video.

5. Discussion

In this paper, a deep-learning-based framework for non-line-of-sight imaging and positioning is proposed. Simulation experiments verify that this 'data-driven' method can achieve very fast reconstruction, which provides a better solution for the dynamic tracking of hidden objects. The deep convolutional inverse graphics network we designed realizes real-time tracking of a non-self-illuminating model moving in a blind area. Table 2 compares our method with previous works. The neural-network-based deep learning method can guarantee high positioning accuracy (sub-centimeter level) while dramatically reducing the time required for a single reconstruction to only a few milliseconds, well below the temporal resolution of the human eye. Benefiting from this, our method can intuitively demonstrate the trajectory of targets in the form of video through continuous rapid imaging. Meanwhile, since the video directly shows the recovered image of the scene, it is also possible to identify the target.

Table 2. Comparison between different techniques of tracking objects in NLOS scene (Advantages highlighted in green and weaknesses highlighted in red)

In addition, the powerful capability of neural networks in information extraction and nonlinear fitting makes the establishment of a light transport model unnecessary, thus considerably simplifying the imaging system. Only a standard RGB camera and an ordinary light source (if the target object is not self-luminous and the scene is too dark) are required, so the cost is greatly reduced. The proposed deep learning method provides an effective approach for NLOS imaging and tracking of moving targets in practical applications. In cases such as criminal investigation and security monitoring, where there may be no direct access to the scene but the whereabouts of the target must still be observed, this technology will be of great significance.

Despite this potential, our approach still has a long way to go. As seen in Fig. 9 and Fig. 11, the reconstructed images are sufficiently recognizable for users to identify the overall shape and trajectory of the hidden targets, but they are still rough. Moreover, the reconstruction quality deteriorates dramatically when the mannequin is too far from the diffuse wall (over 100 cm), mainly because little ambient light is then reflected off the mannequin. Improving the performance of the network is the main objective of our subsequent research.

Funding

Basic Research Program of Jiangsu Province (BK20212006); National Natural Science Foundation of China (6210031456); Fundamental Research Funds for the Central Universities (2242021K1G005).

Disclosures

The authors declare no conflicts of interest.

Data Availability

Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request.

References

1. R. Raskar and J. Davis, "5D time-light transport matrix: What can we reason about scene properties?" Tech. Rep. (Massachusetts Institute of Technology, 2008).

2. T. Maeda, G. Satat, T. Swedish, L. Sinha, and R. Raskar, “Recent advances in imaging around corners,” arXiv:1910.05613 (2019).

3. A. Velten, T. Willwacher, O. Gupta, A. Veeraraghavan, M. G. Bawendi, and R. Raskar, “Recovering three-dimensional shape around a corner using ultrafast time-of-flight imaging,” Nat. Commun. 3(1), 745 (2012). [CrossRef]  

4. J. Rapp, C. Saunders, J. Tachella, J. M-Bruce, Y. Altmann, J.-Y. Tourneret, S. McLaughlin, R. M. A. Dawson, F. N. C. Wong, and V. K. Goyal, “Seeing around corners with edge-resolved transient imaging,” Nat. Commun. 11(1), 5929 (2020). [CrossRef]  

5. F. Heide, L. Xiao, W. Heidrich, and M. B. Hullin, "Diffuse mirrors: 3D reconstruction from diffuse indirect illumination using inexpensive time-of-flight sensors," in 32nd Computer Vision and Pattern Recognition (CVPR) (2014), pp. 3222.

6. C. Saunders, J. Murray-Bruce, and V. K. Goyal, “Computational periscopy with an ordinary digital camera,” Nature 565(7740), 472–475 (2019). [CrossRef]  

7. T. Maeda, Y. Wang, R. Raskar, and A. Kadambi, “Thermal Non-Line-of-Sight Imaging,” in 11th Computational Photography (ICCP) (2019), pp. 1–11.

8. M. Tancik, G. Satat, and R. Raskar, “Flash photography for data-driven hidden scene recovery,” arXiv:1810.11710 (2018).

9. M. Batarseh, S. Sukhov, Z. Shen, H. Gemar, R. Rezvani, and A. Dogariu, “Passive sensing around the corner using spatial coherence,” Nat. Commun. 9(1), 3629 (2018). [CrossRef]  

10. S. Divitt, D. Gardner, and A. Watnik, “Imaging around corners in the mid-infrared using speckle correlations,” Opt. Express 28(8), 11051–11064 (2020). [CrossRef]  

11. D. Faccio, A. Velten, and G. Wetzstein, “Non-line-of-sight imaging,” Nat. Rev. Phys. 2(6), 318–327 (2020). [CrossRef]  

12. S. Li, M. Deng, J. Lee, A. Sinha, and G. Barbastathis, “Imaging through glass diffusers using densely connected convolutional networks,” Optica 5(7), 803–813 (2018). [CrossRef]  

13. G. Gariepy, F. Tonolini, R. Henderson, J. Leach, and D. Faccio, “Detection and tracking of moving objects hidden from view,” Nat. Photonics 10(1), 23–26 (2016). [CrossRef]  

14. S. Chan, R. Warburton, G. Gariepy, J. Leach, and D. Faccio, “Non-line-of-sight tracking of people at long range,” Opt. Express 25(9), 10109–10117 (2017). [CrossRef]  

15. B. M. Smith, M. O’Toole, and M. Gupta, “Tracking Multiple Objects Outside the Line of Sight Using Speckle Imaging,” in 36th Computer Vision and Pattern Recognition (CVPR) (2018), pp. 6258–6266.

16. A. Torralba and W. T. Freeman, "Accidental pinhole and pinspeck cameras: Revealing the scene outside the picture," in 30th Computer Vision and Pattern Recognition (CVPR) (2012), pp. 374–381.

17. D. P. Kingma and M. Welling, "Auto-encoding variational Bayes," arXiv:1312.6114 (2013).

18. I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning (MIT Press, 2016), Chap. 8.

19. M. Tancik, G. Satat, and R. Raskar, “Flash photography for data-driven hidden scene recovery,” arXiv:1810.11710 (2018).

20. Y. Cao, R. Liang, J. Yang, Y. Cao, Z. He, J. Chen, and X. Li, “Computational framework for steady-state NLOS localization under changing ambient illumination conditions,” Opt. Express 30(2), 2438–2452 (2022). [CrossRef]  

21. J. Klein, C. Peters, J. Martín, M. Laurenzis, and M. B. Hullin, “Tracking objects outside the line of sight using 2D intensity images,” Sci. Rep. 6(1), 32491 (2016). [CrossRef]  

Supplementary Material (1)

NameDescription
Visualization 1       Reconstruction effect of video signal. From left to right: (a) Diffuse images captured by camera 1 (same as what naked eyes can see), no obvious difference when model moves. (b) Processed diffuse images with distinguishable variation of penumbra info
