Photon-efficient 3D reconstruction employing an edge enhancement method

Open Access

Abstract

Photon-efficient 3D reconstruction under sparse photon conditions remains challenging. At scene edge locations in particular, light scattering results in a weaker echo signal than at non-edge locations. Depth images can be viewed as smooth regions stitched together along edges, yet none of the existing methods focus on improving the accuracy of edge reconstruction when performing 3D reconstruction. Moreover, the impact of edge reconstruction on overall depth reconstruction has not been investigated. In this paper, we explore how to improve edge reconstruction accuracy from several aspects: improving the network structure, employing hybrid loss functions, and taking advantage of the non-local correlation of SPAD measurements. Meanwhile, we investigate the correlation between edge reconstruction accuracy and overall depth reconstruction accuracy based on quantitative metrics. The experimental results show that the proposed method achieves superior performance in both edge reconstruction and overall depth reconstruction compared with other state-of-the-art methods. Moreover, the results demonstrate that improving edge reconstruction accuracy also improves the accuracy of the reconstructed depth map.

© 2022 Optica Publishing Group under the terms of the Optica Open Access Publishing Agreement

1. Introduction

3D depth sensing is increasingly used in applications such as augmented reality [1], human-machine interaction [2], and autonomous driving [3]. Mainstream depth sensing methods can be divided into passive 3D imaging [4] and active 3D imaging [5]. Compared with active 3D imaging, passive 3D imaging has more stringent requirements on imaging conditions and, in particular, is not suitable for low-illumination environments. Active illumination imaging is more robust to ambient light than passive illumination. The active 3D imaging technique light detection and ranging (LiDAR) uses pulses of light to illuminate the scene and a sensor to record the time-of-flight of back-scattered light from the scene. LiDAR is emerging as an appealing choice for diverse depth imaging scenarios.

In the context of LiDAR, depth sensing capability in low-photon-flux environments is mainly enabled by pulsed laser illumination coupled with emerging single-photon avalanche diodes (SPADs) capable of recording the round-trip time of individual photons with picosecond accuracy, enabling applications such as long-range imaging [6], non-line-of-sight imaging [7], and imaging through scattering media [8]. For 3D scene reconstruction, a SPAD accumulates a photon-counting histogram by recording time-of-flight (ToF) data over multiple pulse emissions. However, restrictions on optical flux and imaging time prevent adequate collection of signal photons. As a consequence, raw measurements with extremely low signal photon counts and low signal-to-background ratio (SBR) pose a tremendous challenge to depth reconstruction algorithms.

To tackle this challenge, a large number of recent works have focused on developing robust reconstruction algorithms for noisy SPAD measurements. They can be broadly divided into three directions. Kirmani et al. and Shin et al. [9-11] utilized scene structures to establish probabilistic models for the back-scattered light detected by the SPAD and employed pixel-by-pixel processing to remove background noise. Rapp et al. and Chen et al. [12,13] combined an adaptive super-pixel approach with the distribution characteristics and spatial correlation of signal photons to reconstruct sparse ToF data. Lindell et al. and Sun et al. [14,15] used convolutional neural networks in a multi-sensor fusion strategy to fuse data from multiple sensors for depth estimation. For an active 3D imaging system, the echo signal is weaker at edge locations than at non-edge locations due to light scattering. Figure 1(b) shows the ToF data acquired by the SPAD array; the echo signal at the edge of the object is significantly weaker. Considering that range images have simpler structures than intensity images, a range image can be seen as a patchwork of smooth regions divided by edges [4,16]. However, reconstructing depth images from the perspective of improving edge reconstruction accuracy has not been investigated.


Fig. 1. (a) Schematic of the SPAD-based pulsed LiDAR imaging system, including a pulsed laser source, a synchronization signal generator, and a SPAD detector with an integrated TCSPC module. The TCSPC module timestamps photon arrival events to generate 3D spatial-temporal data. Each data cube contains a histogram generated by accumulating echo photons. (b) Panel (I) shows the imaging scene, panel (II) the histogram recorded at a background pixel, panel (III) the histogram recorded at an edge pixel of the imaged object, and panel (IV) the histogram recorded at a non-edge pixel of the imaged object. The echo photons recorded at the edge position are significantly fewer than at the non-edge position.


One of the most promising directions for depth reconstruction from noisy SPAD measurements relies on neural networks, owing to their rich representation and strong feature extraction capabilities. Lindell et al. [14] and Sun et al. [15] adopted a multi-sensor fusion strategy, combining a conventional intensity sensor with a SPAD sensor, to reconstruct depth images from noisy SPAD measurements. However, the conventional intensity sensor is not suitable for low-photon-flux environments. Peng et al. [17] introduced a non-local block into the network model to exploit the long-range correlation of SPAD data in the spatial-temporal domain. Nevertheless, the non-local block has a large model size and is not easily portable. Moreover, it does not take full advantage of the correlation along the feature channels. Zang et al. [18] proposed a lightweight, highly integrated neural network that avoids the sensor fusion approach. However, that network introduces redundant information when directly cascading the feature maps of the encoding and decoding ends. It is therefore valuable to improve edge reconstruction accuracy by enhancing the neural network used for depth reconstruction in low-photon-flux environments.

In this paper, we propose a performance-enhanced neural network that promotes depth reconstruction accuracy by improving the reconstruction accuracy of depth edges. We adopt the multi-scale integrated U-Net [18] as the backbone of our network. The multi-scale integrated U-Net achieves reconstruction performance comparable to sensor fusion methods [14,15] by recovering edge information lost during down-sampling through long and short skip connections cascading the encoder and decoder. However, multiple skip connections can pass redundant information (e.g., noise and pseudo-edges) from the encoder to the decoder, which ultimately affects the reconstruction of the depth map. The features in the encoder are low-level features produced by shallow convolution layers, while the corresponding features in the decoder are higher-level features produced by deeper convolution layers; they are semantically different, so splicing them directly through skip connections is not appropriate. An attention gate (AG) structure is therefore introduced to replace the skip connections: the encoder features relevant to the decoder are retained, and the irrelevant features are filtered out by the AG module. Meanwhile, to enhance the reconstruction of weak edges, a lightweight attention module is employed to capture long-range correlations in the spatial-temporal domain of the SPAD measurements during up-sampling and feature refinement. In addition, we employ hybrid loss functions, including the Kullback-Leibler (KL) divergence, total variation (TV) spatial regularization, and gradient regularization, to further sharpen the reconstructed edges of objects in the scene. We validate the proposed method under various signal-to-background ratios (SBRs). We also evaluate the impact of edge reconstruction on depth reconstruction using the quantitative edge metric we introduce. The experimental results show that our method is well suited to SPAD-LiDAR imaging in low-SBR environments.

2. Forward model

In a representative SPAD-based pulsed LiDAR imaging system, the pulsed laser, triggered by a synchronization signal, emits multiple laser pulses to irradiate the imaging scene. The pulses returned from the scene reach the SPAD sensor, where they are recorded by the time-correlated single-photon counting (TCSPC) module. Finally, the system generates 3D spatial-temporal histogram data, as shown in Fig. 1(a).

It is worth noting that previous literature [10-12,14,17,18] generally assumes that the SPAD-based system operates under low-photon-flux conditions, meaning that in each illumination cycle very few photons return (much less than one photon on average). Thus the pile-up effect [19-21] and other effects that would distort the histogram can be neglected. The low-flux regime allows recorded photon arrival events to be modeled as independent across adjacent timestamps. Thus the number of photons arriving at the sensor in time bin n for imaging position (i, j) can be written as:

$$s_{i, j}[n]=\int_{n \Delta t}^{(n+1) \Delta t} R_{i, j} \cdot s\left(t-\frac{2 z_{i, j}}{c}\right) d t+b_{\lambda},$$
where $s_{i, j}[n]$ denotes the number of photons arriving at the sensor in time bin n, $\Delta t$ is the duration of a time bin, and $R_{i, j}$ accounts for scene reflectance, the distance fall-off effect, the BRDF, and other factors affecting the number of reflected photons. $s(t)$ represents the temporal waveform of the emitted pulse, $z_{i, j}$ denotes the distance from imaging position (i, j) to the sensor, and $c$ is the speed of light. $b_{\lambda }$ denotes the photon flux caused by ambient light at wavelength $\lambda$.

For the SPAD sensor, the quantum efficiency $\eta$, which indicates the photoelectric conversion efficiency [22], and the dark count $b_{d}$ (false detection rate) are taken into account. Consequently, the number of photons captured by the SPAD sensor within N illumination cycles can be denoted as a time-domain histogram:

$$h_{i, j}[n] \sim P\left\{N\left[\eta s_{i, j}[n]+b_{d}\right]\right\},$$
where $h_{i, j}[n]$ indicates the detection events of the sensor at time bin n for imaging position (i, j) within N illumination cycles. The temporal histogram of detection events follows a Poisson distribution. We employ the 3D time-domain histogram $h_{i, j}$ as the input to the neural network; the trained network maps the raw histogram data to denoised histogram data [18], and this mapping can be approximated as:
$$\widehat{h}_{\mathrm{i}, \mathrm{j}}=f\left(h_{i, j} ; \theta\right),$$
where $\widehat {h}_{i, j}$ is the denoised histogram, $f$ is the feed-forward function of the neural network, and $\theta$ denotes the parameters of the trained network. Finally, we convert the denoised 3D spatial-temporal data into a 2D depth map [15] using a soft argmax operator S:
$$\hat{n}_{i, j}=\mathrm{S}\left(\hat{h}_{i, j}\right)=\sum_{n} n \cdot \hat{h}_{i, j}[n], \quad \hat{d}_{i, j}=\hat{n}_{i, j} \cdot \frac{c \Delta t}{2},$$
where $\hat {n}_{i, j}$ is the index of the reconstructed depth in the denoised histogram and $\widehat {d}_{i, j}$ is the estimated depth at the illuminated position (i, j).
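
For concreteness, the sketch below simulates the Poisson measurement model of Eqs. (1)-(2) for a single depth/reflectance map and applies the soft-argmax projection of Eq. (4). It is a minimal illustration in PyTorch: the pulse is idealized as a single-bin impulse, and the function names, default rates, and parameter values are illustrative assumptions rather than the paper's simulation settings.

```python
import torch
import torch.nn.functional as F

def simulate_histogram(depth, reflectance, n_bins=896, delta_t=55e-12,
                       n_cycles=1000, eta=0.3, b_ambient=1e-3, b_dark=1e-4):
    """Toy forward model, Eqs. (1)-(2): per-pixel Poisson photon counts.

    The pulse is idealized as a single-bin impulse; rates are placeholders.
    """
    c = 3e8                                                         # speed of light (m/s)
    bin_idx = (2.0 * depth / (c * delta_t)).long().clamp(0, n_bins - 1)
    pulse = F.one_hot(bin_idx, num_classes=n_bins).to(depth.dtype)  # impulse at the ToF bin
    rate = eta * reflectance.unsqueeze(-1) * pulse + b_ambient + b_dark
    return torch.poisson(n_cycles * rate)                           # h ~ P{N[eta*s + b_d]}

def soft_argmax_depth(h_denoised, delta_t=55e-12):
    """Soft-argmax projection, Eq. (4): denoised histogram -> depth map."""
    c = 3e8
    p = torch.softmax(h_denoised, dim=-1)                 # normalized denoised histogram
    bins = torch.arange(p.shape[-1], dtype=p.dtype)
    n_hat = (bins * p).sum(dim=-1)                        # expected (soft-argmax) bin index
    return n_hat * c * delta_t / 2.0                      # round-trip ToF -> depth (m)
```

For instance, `simulate_histogram(torch.rand(32, 32) * 3.0, torch.rand(32, 32))` yields a 32×32×896 measurement cube comparable in format to the training crops described in Sec. 4.1.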

3. Edge enhancement model

In this section, the proposed network architecture, the ADAG (attention-directed attention gate) module, and the hybrid loss functions are introduced.

3.1 Network architecture

We use the multi-scale integrated U-Net proposed by Zang et al. [18] as the backbone of our neural network, which in turn is built on U-Net++ [23]. Through the U-Net++ structure, edge information lost during down-sampling is recovered by multi-level up-sampling and multiple long- and short-range cascades on the same level. Consequently, depth reconstruction accuracy comparable to that of sensor-fusion networks is obtained under low-photon-flux conditions. However, the multi-scale integrated U-Net does not filter out redundant information when skip connections directly cascade the features of the encoder and decoder. The features in the encoder and the decoder are semantically different, so it is not appropriate to splice them directly through skip connections. Moreover, the multi-scale integrated U-Net uses only max pooling for down-sampling, which causes more information to be lost during the down-sampling process. To tackle these problems, we introduce the ADAG module into the network to remove redundant information from the encoder and enhance the reconstruction accuracy of weak edges at the decoder. Meanwhile, we apply max pooling and average pooling in parallel for down-sampling to reduce the information loss.

The overall network structure is shown in Fig. 2. $X_{i, j}$ denotes the feature map, where $i \in \{0,1,2,3\}$ indexes the down-sampling level, $j \in \{0,1,2,3\}$ indexes the refinement level, and $l$ is the down-sampling depth. The network first takes the noisy 3D spatial-temporal data as input. It then extracts features through the feature extraction module, with down-sampling on the encoding side and transposed convolution on the decoding side. The extracted features then flow to the feature fusion module ADAG, which improves the accuracy of edge reconstruction. Considering the trade-off between model size and performance gain, we add the ADAG module only in the uppermost layer (the features with the largest spatial resolution), because a large spatial resolution carries richer feature information and, for the U-Net structure, the encoder and decoder features at this scale differ the most compared with other scales. The feature refinement module CBAM then refines the fused features. Finally, the denoised 3D estimate is converted into a 2D depth map through the soft argmax operator. The node-wise computation is:

$$X_{i, j}=\left\{\begin{array}{ll} \operatorname{CBR}\left(\operatorname{Cat}\left[\operatorname{MaxPool}\left(X_{i-1, j}\right), \operatorname{AvgPool}\left(X_{i-1, j}\right)\right]\right), & i>0,\; j=0 \\ \operatorname{CBR}\left(\operatorname{ADAG}\left(\operatorname{Cat}\left[X_{i, k}\right]_{k=0}^{j-1}, U\left(X_{i+1, j-1}\right)\right)\right), & i=0,\; 0<j<l \\ \operatorname{CBR}\left(\operatorname{Cat}\left[\left[X_{i, k}\right]_{k=0}^{j-1}, U\left(X_{i+1, j-1}\right)\right]\right), & 0<i<l,\; 0<j<l \\ S\left(\operatorname{CBAM}\left(\operatorname{Conv}\left(X_{i, j}\right)\right)\right), & i=0,\; j=l \end{array}\right.$$
where $CBR(.)$ indicates the sequence of convolution, batch normalization, and ReLU operations, $Conv(.)$ indicates a convolution operation, $Cat[.]$ indicates the cascade operation, $MaxPool(.)$ and $AvgPool(.)$ denote max pooling and average pooling for down-sampling, $U(.)$ denotes the transposed convolution for up-sampling, $ADAG$ denotes the added edge enhancement module, $CBAM$ denotes the feature refinement module, and $S(.)$ denotes the soft argmax operator that performs the 3D-to-2D projection.


Fig. 2. Schematic of the proposed network architecture. We adopt the multi-scale integrated U-Net proposed by Zang et al. [18] as the backbone of our network. Unlike the multi-scale integrated U-Net, which cascades features directly, we introduce the ADAG module at the nodes where the convolutional output features of the first row (dashed arrows) are cascaded with the up-sampled features from the second row (green arrows). Meanwhile, we employ hybrid loss functions to constrain the training process. Consequently, the proposed network obtains stronger performance in edge information extraction, and the reconstructed 2D depth map attains higher edge reconstruction accuracy.


3.2 ADAG module

For the U-Net architecture, directly cascading the encoder and decoder ends through skip connections introduces redundant information that is unrelated to the decoder features [24]. We therefore introduce the attention gate (AG) module [25], originally proposed for medical image segmentation, into the proposed network. The AG module learns task-relevant features with attention and can be integrated into a network model in a plug-and-play manner. Whereas the AG module was originally used to process 2D medical image data, here it is applied to SPAD-LiDAR depth reconstruction to process 3D SPAD measurement data. We introduce the AG module into the U-Net++ model to replace the skip connections between the encoder and decoder ends, with the purpose of enhancing the information related to the decoder features. To further strengthen its edge enhancement capability, the attention module CBAM and the AG module are merged to form the ADAG module.

The ADAG module is shown in Fig. 3(a). Its input consists of two branches. Along the upper branch, the node named "Data Transfer" denotes the input features passed from the encoding end; after cascading the features from the encoding end, these features (X) are the features to be filtered. Along the lower branch, the node named "CBAM" denotes the input features passed from the attention module (CBAM); these are the up-sampled features refined by the self-attention module and serve as the gating signal (G). The two inputs are convolved with 1×1×1 convolution kernels ($w_{g}, w_{x}$), changing the number of channels from $F_{g}$ to $F_{c}$. The two results are summed and activated by a ReLU function ($\sigma _{1}$). The features are subsequently convolved with a 1×1×1 convolution kernel ($\Psi$) and activated by a sigmoid function ($\sigma _{2}$); the resulting normalized coefficients are multiplied element-wise with the input features from the encoding end. The resulting edge-enhanced output is then cascaded with the refined features from the decoding end. The ADAG operation can be modeled as:

$$\begin{gathered} \alpha=\sigma_{2}\left(\psi^{T}\left(\sigma_{1}\left(w_{x}{ }^{T} X+w_{g}{ }^{T} G+b_{x}+b_{g}\right)\right)+b_{\psi}\right), \\ \mathrm{Y}=\operatorname{Cat}[\mathrm{X} \cdot \alpha, G], \end{gathered}$$
where $\alpha$ denotes the attention coefficients, Y denotes the output of the ADAG module, Cat[.] denotes the cascade operation, and $b_{g}, b_{x}, b_{\psi }$ are the bias terms of the convolutions $w_{g}, w_{x}, \Psi$.
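
As a concrete reference, the following PyTorch sketch implements the attention-gate core of Eq. (6) with 3D convolutions, assuming X and G have already been brought to the same spatial-temporal size; the class name, argument names, and channel widths are our own placeholders rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class AttentionGate3D(nn.Module):
    """Sketch of the attention-gate core of the ADAG module (Eq. (6)).

    x: encoder features to be filtered (F_x channels); g: gating signal, i.e.
    the CBAM-refined up-sampled decoder features (F_g channels). Both are
    assumed to share the same spatial-temporal size.
    """
    def __init__(self, F_x, F_g, F_c):
        super().__init__()
        self.w_x = nn.Conv3d(F_x, F_c, kernel_size=1)   # 1x1x1 conv on X (bias = b_x)
        self.w_g = nn.Conv3d(F_g, F_c, kernel_size=1)   # 1x1x1 conv on G (bias = b_g)
        self.psi = nn.Conv3d(F_c, 1, kernel_size=1)     # psi: projects to one attention map
        self.relu = nn.ReLU(inplace=True)               # sigma_1
        self.sigmoid = nn.Sigmoid()                     # sigma_2

    def forward(self, x, g):
        alpha = self.sigmoid(self.psi(self.relu(self.w_x(x) + self.w_g(g))))
        return torch.cat([x * alpha, g], dim=1)         # Y = Cat[X * alpha, G]
```

An ADAG node then amounts to applying this gate with G taken from the CBAM output and feeding the concatenated result into the following CBR block.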


Fig. 3. (a) Overview of the ADAG module. The output of the CBAM module is adopted as the gating signal (G). The attention coefficients ($\alpha$) are obtained by computing the correlation between the input features (X) and the gating signal (G). The input features (X) are scaled by the attention coefficients calculated in ADAG. The output (Y) of the ADAG module is obtained by cascading the refined input features and the output of CBAM. (b) Overview of the CBAM module. The module contains two sequential sub-modules: the channel attention module ($M_{c}$) and the spatial-temporal attention module ($M_{s}$). Through these modules we obtain the long-range correlation of feature channels and of the SPAD measurements. To improve the extraction of weak signals, the final refined result is obtained by adding the refined features to the original input features element-wise.


The captured signal photons of the light pulse share the same temporal distribution, and their arrival times can fall at any timestamp within the light pulse period [17]; therefore the signal photons have long-range temporal correlation. For most natural scenes, neighboring pixels with similar geometric structures are spatially correlated and, similar to the temporal domain, such structures may appear at arbitrary spatial locations. Therefore the long-range correlation in the spatial domain must also be considered. In [17], a non-local block [26,27] was introduced into the network to exploit the long-range correlation of SPAD data in the spatial-temporal domain. However, the non-local block has a high computational overhead and is not easily transplanted.

To tackle these problems, we introduce a plug-and-play lightweight self-attention module, the Convolutional Block Attention Module (CBAM) [28], to extract long-range correlations of features in the spatial-temporal and channel domains. We place the CBAM module after the transposed convolution to refine the up-sampled features. To improve weak edge reconstruction without affecting the original up-sampling results, we add CBAM into the network through a skip connection. The CBAM module is shown in Fig. 3(b). Refined features are extracted by a sequence of attention operations comprising the channel attention module and the spatial-temporal attention module. Finally, the output is obtained by adding the refined features to the input features via the skip connection. The CBAM operation can be modeled as:

$$\begin{gathered} F^{\prime}=M_{c}\left(F_{i}\right) \otimes F_{i}, F^{\prime \prime}=M_{s}\left(F^{\prime}\right) \otimes F^{\prime}, \\ F_{r}=F_{i} \oplus F^{\prime \prime}, \end{gathered}$$
where $M_{c}$ denotes the attention extraction operation in the channel dimension, which produces channel attention coefficients, and $M_{s}$ denotes the attention extraction operation in the spatial-temporal dimension, which produces spatial-temporal attention coefficients. The symbols ($\otimes, \oplus$) denote element-wise multiplication and addition. $F_{i}$ and $F_{r}$ denote the input and refined features, respectively, and $F^{\prime }, F^{\prime \prime }$ denote intermediate features.
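
A lightweight 3D adaptation of CBAM following Eq. (7) might look like the sketch below. The channel and spatial-temporal attention follow the original CBAM design [28], while the reduction ratio and the 7×7×7 spatial kernel are our assumptions; the residual skip $F_r = F_i \oplus F''$ follows the paper's description.

```python
import torch
import torch.nn as nn

class CBAM3D(nn.Module):
    """Illustrative 3D adaptation of CBAM for SPAD feature volumes (Eq. (7))."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.mlp = nn.Sequential(                       # shared MLP for channel attention
            nn.Conv3d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv3d(channels // reduction, channels, 1),
        )
        self.spatial = nn.Conv3d(2, 1, kernel_size=7, padding=3)

    def forward(self, f):                               # f: (B, C, T, H, W)
        # channel attention: M_c(F_i) (x) F_i
        m_c = torch.sigmoid(self.mlp(torch.amax(f, dim=(2, 3, 4), keepdim=True)) +
                            self.mlp(torch.mean(f, dim=(2, 3, 4), keepdim=True)))
        f1 = m_c * f
        # spatial-temporal attention: M_s(F') (x) F'
        pooled = torch.cat([f1.amax(dim=1, keepdim=True),
                            f1.mean(dim=1, keepdim=True)], dim=1)
        f2 = torch.sigmoid(self.spatial(pooled)) * f1
        return f + f2                                   # residual skip: F_r = F_i (+) F''
```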

3.3 Loss function

To make network training more effective, we adopt hybrid loss functions over a two-stage depth estimation process, similar to other neural-network-based methods [15,17,18,28]. Considering the similarity of the temporal distribution between the noisy SPAD histogram and the ground-truth histogram, we employ the Kullback-Leibler (KL) divergence as the objective function for the first stage, which estimates the denoised 3D histogram. After the first stage, the network projects the 3D histogram onto a 2D depth map. To maintain edge sharpness and improve robustness against noise, we apply total variation (TV) spatial regularization to the projected 2D depth map. In addition, we use gradient regularization jointly with the TV regularization to reconstruct more accurate edges.

3.3.1 Kullback-Leibler (KL) divergence

The Kullback-Leibler (KL) divergence [15] is employed to measure the similarity between the denoised histogram and the normalized ground-truth histogram at each imaging position (i, j). This loss function can be expressed as:

$$D_{K L}\left(h_{i, j}, \widehat{h}_{i, j}\right)=\sum_{n} h_{i, j}[n] \log \frac{h_{i, j}[n]}{\widehat{h}_{i, j}[n]},$$
where n is the time-bin index of the histogram.
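
In code, the KL term can be sketched as follows, assuming both histograms are normalized along the time axis before comparison; the function name is ours, and the small epsilon is added only for numerical stability.

```python
import torch

def kl_loss(h_gt, h_denoised, eps=1e-8):
    """Eq. (8): KL divergence between normalized histograms, averaged over pixels."""
    p = h_gt / (h_gt.sum(dim=-1, keepdim=True) + eps)             # normalized ground truth
    q = h_denoised / (h_denoised.sum(dim=-1, keepdim=True) + eps) # normalized estimate
    return (p * torch.log((p + eps) / (q + eps))).sum(dim=-1).mean()
```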

3.3.2 Total variation(TV) spatial regularization

To improve denoising performance and edge retention, total variation (TV) spatial regularization is introduced to constrain the projected depth map. This loss function can be written as:

$$\mathrm{TV}(\hat{\mathrm{d}})=\sum_{\mathrm{i}, \mathrm{j}}\left(\left|\hat{\mathrm{d}}_{\mathrm{i}+1, \mathrm{j}}-\hat{\mathrm{d}}_{\mathrm{i}, \mathrm{j}}\right|+\left|\hat{\mathrm{d}}_{\mathrm{i}, \mathrm{j}+1}-\hat{\mathrm{d}}_{\mathrm{i}, \mathrm{j}}\right|\right),$$
where $\widehat {\mathrm {d}}_{\mathrm {i}, \mathrm {j}}$ denotes the depth value of the i-th row and j-th column of the denoised depth map.
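
A direct implementation of this anisotropic TV term on the projected depth map is straightforward; the sketch below assumes the last two tensor dimensions are the spatial rows and columns.

```python
def tv_loss(d_hat):
    """Eq. (9): anisotropic total variation of the projected depth map."""
    dv = (d_hat[..., 1:, :] - d_hat[..., :-1, :]).abs().sum()   # vertical differences
    dh = (d_hat[..., :, 1:] - d_hat[..., :, :-1]).abs().sum()   # horizontal differences
    return dv + dh
```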

3.3.3 Gradient regularization

Natural images contain a large number of edges and details. To reconstruct a depth map with fine details and accurate boundaries, it is crucial to penalize boundary errors [4,29]. We introduce the following loss function to constrain the depth gradients:

$$L_{grad}=\frac{1}{N} \sum_{x, y}\left[\left(\nabla_{x} d_{x, y}\right)^{2}+\left(\nabla_{y} d_{x, y}\right)^{2}\right],$$
where N is the number of pixels, and $\nabla _{x} d_{x, y}$ and $\nabla _{y} d_{x, y}$ indicate the gradient difference between the ground-truth 2D depth map and the reconstructed depth map in the horizontal and vertical directions, respectively. The complete loss function can therefore be written as:
$$L_{\text{total }}=D_{K L}+\lambda_{1} L_{\text{grad }}+\lambda_{2} T V.$$
After experimentation, we set $\lambda _{1}=10^{-5}$ and $\lambda _{2}=10^{-5}$.
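
Putting the three terms together, the total loss of Eq. (11) can be sketched as below with the weights stated above ($\lambda_{1}=\lambda_{2}=10^{-5}$). The paper does not specify the exact discrete gradient operator for Eq. (10), so simple forward differences are used here for illustration; `kl_loss` and `tv_loss` refer to the sketches above.

```python
def grad_loss(d_hat, d_gt):
    """Eq. (10): squared difference of horizontal/vertical depth gradients."""
    ex = (d_hat[..., :, 1:] - d_hat[..., :, :-1]) - (d_gt[..., :, 1:] - d_gt[..., :, :-1])
    ey = (d_hat[..., 1:, :] - d_hat[..., :-1, :]) - (d_gt[..., 1:, :] - d_gt[..., :-1, :])
    return (ex ** 2).mean() + (ey ** 2).mean()

def total_loss(h_gt, h_denoised, d_hat, d_gt, lam1=1e-5, lam2=1e-5):
    """Eq. (11): KL + lambda_1 * gradient + lambda_2 * TV (sketches above)."""
    return kl_loss(h_gt, h_denoised) + lam1 * grad_loss(d_hat, d_gt) + lam2 * tv_loss(d_hat)
```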

3.4 Evaluation metrics

3.4.1 Depth estimation accuracy

In order to make a quantitative comparison with the previous work [14,18], we choose the depth reconstruction accuracy metric as one of the evaluation metrics:

  • Root mean squared error: $RMSE=\sqrt {\frac {1}{N} \sum _{i=1}^{N}\left (d_{i}-d_{i}^{GT}\right )^{2}}$,
where N denotes the number of imaging pixels, and $d_{i}$ and $d_{i}^{GT}$ denote the estimated depth and the true depth at pixel i, respectively.

3.4.2 Edge estimation accuracy

Previous works use only depth reconstruction accuracy as the evaluation metric to reflect the performance of algorithms for depth reconstruction from noisy SPAD measurements. To show edge reconstruction accuracy, we are the first to introduce a quantitative metric for evaluating it. To this end, we first apply the Sobel operator [30] to compute the gradient maps of the estimated depth map and the true depth map, and then compute the RMSE between the two gradient maps (GradRMSE).

  • RMSE between the gradient maps of the estimated depth map and the true depth map: $\operatorname {GradRMSE}=\sqrt {\frac {1}{N} \sum _{i=1}^{N}\left (g_{i}-g_{i}^{GT}\right )^{2}}$,
where N denotes the number of imaging pixels, and $g_{i}$ and $g_{i}^{GT}$ denote the gradient values of the estimated depth map and the true depth map at pixel i, respectively. GradRMSE indicates the edge reconstruction accuracy of the depth map.
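
Both metrics can be computed as sketched below for a single H×W depth map. The paper does not state how the horizontal and vertical Sobel responses are combined into a gradient map, so the gradient magnitude is used here as one reasonable choice.

```python
import torch
import torch.nn.functional as F

# Standard 3x3 Sobel kernels for horizontal and vertical gradients.
SOBEL_X = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]).view(1, 1, 3, 3)
SOBEL_Y = SOBEL_X.transpose(2, 3)

def rmse(a, b):
    """Root mean squared error between two maps."""
    return torch.sqrt(((a - b) ** 2).mean())

def grad_rmse(d_hat, d_gt):
    """GradRMSE: RMSE between Sobel gradient maps of estimated and true depth."""
    def grad_map(d):
        d = d.view(1, 1, *d.shape[-2:])                  # single H x W depth map
        gx = F.conv2d(d, SOBEL_X, padding=1)
        gy = F.conv2d(d, SOBEL_Y, padding=1)
        return torch.sqrt(gx ** 2 + gy ** 2)             # gradient magnitude
    return rmse(grad_map(d_hat), grad_map(d_gt))
```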

4. Experiments

4.1 Implementation details

We train the network on a simulated dataset and compare its performance with other methods on simulated test data and on data captured from real scenes.

The training set of SPAD measurements is simulated from the NYU v2 dataset [31]. In the simulation, the depth maps of the training set are randomly cropped to a resolution of 32×32, and a histogram of 896 time bins is generated for each pixel by sampling the inhomogeneous Poisson process of Eq. (2) according to the pixel's depth value. A total of 16,061 images and 3,633 images are used for training and validation, respectively. We employ the ADAM solver [32] as the optimizer with a batch size of 4; the learning rate is initialized to $10^{-5}$ with a decay rate of 0.9 every 100 iterations. The network is implemented in PyTorch and trained on an NVIDIA 2080Ti GPU for 4 epochs. Each epoch takes about 8.5 hours, amounting to about 16k iterations in total.
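
The stated configuration corresponds roughly to the following training skeleton. Here `model`, `train_loader`, and `loss_fn` are placeholders for the proposed network, the simulated NYU v2 crops, and the hybrid loss of Sec. 3.3, the assumption of a two-output model (denoised histogram and depth) reflects the two-stage estimation, and reading "decay of 0.9 after 100 iterations" as a step decay every 100 iterations is our interpretation.

```python
import torch

def train(model, train_loader, loss_fn, epochs=4, device="cuda"):
    """Training skeleton for the stated setup: Adam, batch size 4 (set in the
    loader), initial learning rate 1e-5, decay factor 0.9 every 100 iterations."""
    model = model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=100, gamma=0.9)
    for _ in range(epochs):
        for spad_hist, gt_hist, gt_depth in train_loader:       # simulated 32x32x896 crops
            denoised_hist, depth = model(spad_hist.to(device))  # assumed two-output model
            loss = loss_fn(gt_hist.to(device), denoised_hist, depth, gt_depth.to(device))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            scheduler.step()                                    # stepped once per iteration
    return model
```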

4.2 Simulated data

We compare the performance of our proposed method with other methods for depth reconstruction on simulated datasets from both qualitative and quantitative perspectives.

4.2.1 Quantitative evaluation

We assess our proposed network EeNet on the 6 test scenes of the Middlebury dataset under two low signal-to-background ratio (SBR) levels (2:50 and 2:100), in comparison with the approximate maximum likelihood estimator (MLE) [33], Lindell et al. [14], Peng et al. [17], and Zang et al. [18]. We retrain the network for each of the two noise levels. Table 1 lists the root mean squared error (RMSE) of the depth map and the RMSE between the gradient maps of the estimated and true depth maps (GradRMSE) for the 6 Middlebury scenes reconstructed with the various methods. As the results show, our method obtains better depth reconstruction performance on both evaluation metrics than MLE, Lindell's method, and Zang's method, and is comparable to Peng's method. This demonstrates the effectiveness of our method for enhancing edge reconstruction accuracy. Meanwhile, the results also show that improving edge reconstruction accuracy is meaningful for improving overall depth reconstruction accuracy.


Table 1. Quantitative comparison of the different reconstruction methods on 6 Middlebury scenes. The two sub-tables correspond to two low SBR levels, 0.04 and 0.02. The evaluation metrics RMSE and GradRMSE indicate the depth reconstruction accuracy and edge reconstruction accuracy, respectively; lower values indicate higher reconstruction accuracy. Bold numbers represent the best performance, underlined numbers the second best.

4.2.2 Qualitative evaluation

We use two exemplar scenes to qualitatively compare the performance of the different methods for depth reconstruction under different noise conditions, as shown in Fig. 4. The top two rows show the results of the various methods for the scene named Art, and the bottom two rows show the results for the scene named Reindeer. As seen in the closeup views, MLE leads to poor reconstruction and the signal is almost drowned in noise. For Lindell's method, we use the sensor-fusion model for comparison; although it denoises better than MLE and recovers the main structure, the fine edges of the depth map are lost. For Peng's method, the depth map is better reconstructed, but some details are partially lost. For Zang's method, we use the uncompressed model for comparison; the fine edges are restored, but blurred details are also introduced, which means that noise is introduced during feature fusion. With the proposed method, the reconstructed depth map has sharp and clean edges.


Fig. 4. Exemplar scenes "Art" and "Reindeer" from the Middlebury dataset are adopted to qualitatively compare the performance of the different methods. (a) shows the ground-truth 2D depth image of the exemplar scene "Art". (b) shows the 2D depth map reconstructed with MLE [33]. (c) shows the 2D depth map reconstructed with Lindell's method [14]. (d) shows the 2D depth map reconstructed with Peng's method [17]. (e) shows the 2D depth map reconstructed with Zang's method [18]. (f) shows the 2D depth map reconstructed with our proposed method. (a*), (b*), (c*), (d*), (e*) and (f*) are the closeup views of (a)-(f). (h)-(m) are the corresponding images for the other exemplar scene "Reindeer", with (h*)-(m*) indicating the closeup views.


4.3 Captured data

Apart from the simulated data, we use real scene data collected by our experimental setup to verify the performance of the different methods. The captured data are obtained from indoor experiments with the setup shown in Fig. 5. The experimental setup consists of two main parts: the illumination path and the imaging path. We adopt a flood illumination scheme, in which the laser (A.L.S. EIG1000AF Picosecond Diode Laser, PIL063F) emits pulses at a repetition rate of 20 MHz and a wavelength of 637 nm through a diffuser to cover the imaging scene. The laser's 2 mW average power is sufficient for this illumination scheme at a close distance (1.5 m). The imaging path consists of a SPAD array sensor (Photon Force PF32) with a spatial resolution of 32×32 and an imaging lens. All pixels of the SPAD array can synchronously timestamp photon arrival times to generate histogram data with a temporal resolution of 55 ps.


Fig. 5. Experimental setup for static scenes captured with the SPAD camera. Triggered by the synchronization signal, the pulsed laser flood-illuminates the scene through the diffuser; simultaneously, the SPAD camera, triggered by the same synchronization signal, timestamps the echo photons from the scene. The experiments were conducted in a dark environment because of the low laser power.


We use the experimental setup to capture three static scenes: a windmill, two letters, and a hat and cup forming a multi-object scene. The networks of Lindell, Peng, Zang, and ours are all trained on the simulated dataset described in Sec. 4.1 and applied directly to reconstruct the captured data. Qualitative comparisons with the other methods are shown in Fig. 6(c). For the approximate maximum likelihood estimator (MLE), the method fails to filter out noise and the reconstructed object edges appear jagged. Lindell's method denoises better than MLE but fails to reconstruct the finer edge structures of the objects. Peng's method effectively reconstructs the main structure of the targets, but the weak-echo regions marked by the blue boxes, such as the thin rod of the windmill, cannot be reconstructed. Zang's method, compared with the first three, reconstructs the weak-echo edge regions, but the recovered depth image is not sharp enough at the edges compared with the intensity image, for example at the upper part of the windmill and the inner edge of the letter N marked by the red boxes. In comparison, the depth map reconstructed by our method is more similar to the intensity map than those of the other methods: it not only reconstructs the weak-echo regions but also produces sharp edges. This qualitative comparison shows that improving edge reconstruction accuracy yields superior depth reconstruction performance compared with the previous methods.


Fig. 6. (a) Intensity images of the three scenes: letters, windmill, and hat-cup. (b) SPAD data captured by the experimental setup shown in Fig. 5. (c) Reconstructed 2D depth maps computed from the captured data. The first column is the depth reconstruction with MLE [33], the second column with Lindell's method [14], the third column with Peng's method [17], the fourth column with Zang's method [18], and the final column with our proposed method. Blue boxes mark regions with extremely weak signal return; red boxes mark regions with sharp edges.


4.4 Ablation study

In this section, we investigate the effect of the ADAG module, the AG module, and the gradient regularization on network performance. For comparison, we use four versions of EeNet: one without gradient regularization (EeNet w/o gradient), one without the ADAG module (EeNet w/o ADAG), one without the AG module (EeNet w/o AG), and the complete model (EeNet). We train each version at an SBR level of 0.04.

Quantitative results of the different versions on the Middlebury dataset are shown in Table 2. As the results show, the complete version of EeNet obtains the best result on both evaluation metrics. This again proves that improving edge accuracy is helpful for depth reconstruction.


Table 2. Quantitative comparison from the ablation study of the proposed EeNet: reconstruction results of the different versions on 6 Middlebury scenes at an SBR level of 0.04. The evaluation metrics RMSE and GradRMSE indicate the depth reconstruction accuracy and edge reconstruction accuracy, respectively; lower values indicate higher reconstruction accuracy. Bold numbers represent the best performance.

We use the exemplar scene "Art" to qualitatively compare the depth reconstruction performance of the different versions, as shown in Fig. 7. For the version lacking gradient regularization (EeNet w/o gradient), the reconstructed depth maps have false edges; for example, the long pens have burrs at the edge, as shown in Fig. 7(b*). For the version lacking the ADAG module (EeNet w/o ADAG), details of the reconstructed depth map are lost; for example, the top of the long pens is missing, as shown in Fig. 7(c*). For the version lacking the AG module (EeNet w/o AG), the top of the long pens is restored but the torso is missing, as illustrated in Fig. 7(d*). This recovery benefits from the introduction of the CBAM module, which captures the non-local spatial-temporal correlations within the 3D SPAD measurements and thereby enhances structural edge information for depth reconstruction. The reconstruction results of the complete version of EeNet are shown in Fig. 7(e*). First, the reconstruction of fine edges is improved compared with the version lacking the AG module (EeNet w/o AG), because the AG module is added to the data transfer from the encoder end to the decoder end within U-Net++. Second, cleaner and sharper edges are achieved than with the version lacking gradient regularization (EeNet w/o gradient), owing to the gradient regularization, which imposes constraints on edges to obtain higher reconstruction accuracy.


Fig. 7. The exemplar scene "Art" from the Middlebury dataset is adopted for qualitative comparison in the ablation study of the proposed EeNet. (a) shows the ground-truth 2D depth image of the exemplar scene "Art". (b) shows the 2D depth map reconstructed with the version lacking gradient regularization (EeNet w/o gradient). (c) shows the 2D depth map reconstructed with the version lacking the ADAG module (EeNet w/o ADAG). (d) shows the 2D depth map reconstructed with the version lacking the AG module (EeNet w/o AG). (e) shows the 2D depth map reconstructed with the complete version of EeNet. (a*), (b*), (c*), (d*) and (e*) are the closeup views of (a)-(e).


5. Conclusion

We analyze the distribution of SPAD data at different locations in the imaging scene and find that the echo signal recorded at the edges of an object is weaker than at non-edge locations due to light scattering. Range images can be seen as a patchwork of smooth regions divided by edges. We therefore propose an edge reconstruction accuracy enhancement method to improve depth map reconstruction accuracy. The method is implemented by adding an attention-directed attention gate module to the U-Net++ network, and a gradient regularization term is added to constrain edge reconstruction. Extensive experiments on simulated data demonstrate that our proposed method achieves significant reconstruction accuracy improvements compared with other state-of-the-art methods. Meanwhile, using our proposed evaluation metric, we verify that improving edge reconstruction accuracy promotes overall depth reconstruction accuracy. In addition to the excellent performance on simulated data, our method also shows the best performance in depth reconstruction of real scenes compared with the other methods. We therefore believe that the proposed edge enhancement method can be extended to other active 3D imaging applications with strict restrictions on photon flux.

Funding

National Natural Science Foundation of China (61875088, 62005128).

Acknowledgments

We thank David B. Lindell for providing the code of [14].

Disclosures

The authors declare no conflicts of interest.

Data availability

Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request.

References

1. A. N. Angelopoulos, H. Ameri, D. Mitra, and M. Humayun, “Enhanced depth navigation through augmented reality depth mapping in patients with low vision,” Sci. Rep. 9(1), 11230 (2019). [CrossRef]  

2. Z. Ren, J. Meng, and J. Yuan, “Depth camera based hand gesture recognition and its applications in human-computer-interaction,” in 2011 8th International Conference on Information, Communications & Signal Processing, (IEEE, 2011), pp. 1–5.

3. M. Beer, O. M. Schrey, J. F. Haase, J. Ruskowski, W. Brockherde, B. J. Hosticka, and R. Kokozinski, “Spad-based flash lidar sensor with high ambient light rejection for automotive applications,” in Quantum Sensing and Nano Electronics and Photonics XV, vol. 10540 (International Society for Optics and Photonics, 2018), p. 105402G.

4. J. Hu, M. Ozay, Y. Zhang, and T. Okatani, “Revisiting single image depth estimation: Toward higher resolution maps with accurate object boundaries,” in 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), (IEEE, 2019), pp. 1043–1051.

5. R. Horaud, M. Hansard, G. Evangelidis, and C. Ménier, “An overview of depth cameras and range scanners based on time-of-flight technologies,” Mach. Vis. Appl. 27(7), 1005–1020 (2016). [CrossRef]

6. Z.-P. Li, X. Huang, P.-Y. Jiang, Y. Hong, C. Yu, Y. Cao, J. Zhang, F. Xu, and J.-W. Pan, “Super-resolution single-photon imaging at 8.2 kilometers,” Opt. Express 28(3), 4076–4087 (2020). [CrossRef]  

7. M. O’Toole, D. B. Lindell, and G. Wetzstein, “Confocal non-line-of-sight imaging based on the light-cone transform,” Nature 555(7696), 338–341 (2018). [CrossRef]  

8. D. B. Lindell and G. Wetzstein, “Three-dimensional imaging through scattering media based on confocal diffuse tomography,” Nat. Commun. 11(1), 4517 (2020). [CrossRef]  

9. A. Kirmani, D. Venkatraman, D. Shin, A. Colaco, F. Wong, J. H. Shapiro, and V. K. Goyal, “First-photon imaging,” Science 343(6166), 58–61 (2014). [CrossRef]  

10. D. Shin, A. Kirmani, V. K. Goyal, and J. H. Shapiro, “Photon-efficient computational 3-d and reflectivity imaging with single-photon detectors,” IEEE Trans. Comput. Imaging 1(2), 112–125 (2015). [CrossRef]  

11. D. Shin, F. Xu, D. Venkatraman, R. Lussana, F. Villa, F. Zappa, V. K. Goyal, F. N. Wong, and J. H. Shapiro, “Photon-efficient imaging with a single-photon camera,” Nat. Commun. 7(1), 12046 (2016). [CrossRef]  

12. J. Rapp and V. K. Goyal, “A few photons among many: Unmixing signal and noise for photon-efficient active imaging,” IEEE Trans. Comput. Imaging 3(3), 445–459 (2017). [CrossRef]  

13. S. Chen, A. Halimi, X. Ren, A. McCarthy, X. Su, S. McLaughlin, and G. S. Buller, “Learning non-local spatial correlations to restore sparse 3d single-photon data,” IEEE Trans. on Image Process. 29, 3119–3131 (2020). [CrossRef]  

14. D. B. Lindell, M. O’Toole, and G. Wetzstein, “Single-photon 3d imaging with deep sensor fusion,” ACM Trans. Graph. 37(4), 1–12 (2018). [CrossRef]  

15. Z. Sun, D. B. Lindell, O. Solgaard, and G. Wetzstein, “Spadnet: deep rgb-spad sensor fusion assisted by monocular depth estimation,” Opt. Express 28(10), 14948–14962 (2020). [CrossRef]  

16. J. Huang, A. B. Lee, and D. Mumford, “Statistics of range images,” in Proceedings IEEE Conference on Computer Vision and Pattern Recognition. CVPR 2000 (Cat. No. PR00662), vol. 1 (IEEE, 2000), pp. 324–331.

17. J. Peng, Z. Xiong, X. Huang, Z.-P. Li, D. Liu, and F. Xu, “Photon-efficient 3d imaging with a non-local neural network,” in European Conference on Computer Vision, (Springer, 2020), pp. 225–241.

18. Z. Zang, D. Xiao, and D. D.-U. Li, “Non-fusion time-resolved depth image reconstruction using a highly efficient neural network architecture,” Opt. Express 29(13), 19278–19291 (2021). [CrossRef]  

19. F. Heide, S. Diamond, D. B. Lindell, and G. Wetzstein, “Sub-picosecond photon-efficient 3d imaging using single-photon sensors,” Sci. Rep. 8(1), 17726 (2018). [CrossRef]  

20. A. Gupta, A. Ingle, A. Velten, and M. Gupta, “Photon-flooded single-photon 3d cameras,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (2019), pp. 6770–6779.

21. A. Ingle, A. Velten, and M. Gupta, “High flux passive imaging with single-photon sensors,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (2019), pp. 6760–6769.

22. D. Renker, “Geiger-mode avalanche photodiodes, history, properties and problems,” Nucl. Instrum. Methods Phys. Res., Sect. A 567(1), 48–56 (2006). [CrossRef]  

23. Z. Zhou, M. M. R. Siddiquee, N. Tajbakhsh, and J. Liang, “Unet++: Redesigning skip connections to exploit multiscale features in image segmentation,” IEEE Trans. Med. Imaging 39(6), 1856–1867 (2020). [CrossRef]  

24. J. Liu, Q. Li, R. Cao, W. Tang, and G. Qiu, “A contextual conditional random field network for monocular depth estimation,” Image Vis. Comput. 98, 103922 (2020). [CrossRef]  

25. O. Oktay, J. Schlemper, L. L. Folgoc, M. Lee, M. Heinrich, K. Misawa, K. Mori, S. McDonagh, N. Y. Hammerla, B. Kainz, B. Glocker, and D. Rueckert, “Attention u-net: Learning where to look for the pancreas,” arXiv preprint arXiv:1804.03999 (2018).

26. X. Wang, R. Girshick, A. Gupta, and K. He, “Non-local neural networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, (2018), pp. 7794–7803.

27. K. Yue, M. Sun, Y. Yuan, F. Zhou, E. Ding, and F. Xu, “Compact generalized non-local network,” arXiv preprint arXiv:1810.13125 (2018).

28. S. Woo, J. Park, J.-Y. Lee, and I. S. Kweon, “Cbam: Convolutional block attention module,” in Proceedings of the European conference on computer vision (ECCV), (2018), pp. 3–19.

29. H. Tan, J. Peng, Z. Xiong, D. Liu, X. Huang, Z.-P. Li, Y. Hong, and F. Xu, “Deep learning based single-photon 3d imaging with multiple returns,” in 2020 International Conference on 3D Vision (3DV), (IEEE, 2020), pp. 1196–1205.

30. I. Sobel and G. Feldman, “A 3×3 isotropic gradient operator for image processing,” presented at the Stanford Artificial Intelligence Project (1968), pp. 271–272.

31. N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, “Indoor segmentation and support inference from rgbd images,” in European conference on computer vision, (Springer, 2012), pp. 746–760.

32. D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980 (2014).

33. I. Gyongy, S. W. Hutchings, A. Halimi, M. Tyler, S. Chan, F. Zhu, S. McLaughlin, R. K. Henderson, and J. Leach, “High-speed 3d sensing via hybrid-mode imaging and guided upsampling,” Optica 7(10), 1253–1260 (2020). [CrossRef]  

