
Partial hard occluded target reconstruction of Fourier single pixel imaging guided through range slice

Open Access

Abstract

Fourier single pixel imaging uses pre-programmed patterns to modulate the spatial distribution of the laser and reconstructs the intensity image of the target through reconstruction algorithms. The approach features non-locality and strong anti-interference performance. However, poor image quality results when the target of interest is occluded in Fourier single pixel imaging. To address this problem, a deep learning-based image inpainting algorithm is employed within Fourier single pixel imaging to reconstruct partially obscured targets with high quality. The method applies a distance-based segmentation approach to separate the obscured regions from the target of interest. Additionally, it uses an image inpainting network that combines multi-scale sparse convolution and a Transformer architecture, along with a reconstruction network that integrates a Channel Attention Mechanism and Attention Gate modules, to reconstruct complete and clear intensity images of the target of interest. The proposed method significantly expands the application scenarios and improves the imaging quality of Fourier single pixel imaging. Simulation and real-world experimental results demonstrate that the proposed method exhibits high inpainting and reconstruction capacity under conditions of hard occlusion and down-sampling.

© 2024 Optica Publishing Group under the terms of the Optica Open Access Publishing Agreement

1. Introduction

In comparison to imaging with array detector technology [1], Single Pixel Imaging (SPI) [2-4] offers advantages in flexibility, high sensitivity, and spectral range, leading to a broader range of applications. The single pixel detector measures the scene sequentially, and the final image is reconstructed through computational processes. Consequently, extensive measurements are necessary to acquire sufficient spatial information [5] for reconstructing images with fine details, and the resulting low image acquisition efficiency limits practicality. In this context, Fourier Single Pixel Imaging (FSPI) [6-10] provides an effective solution. In FSPI, the region of interest is illuminated with grayscale Fourier basis patterns, and the Fourier spectrum of the target image is obtained. Because a significant portion of the power in natural images is concentrated in the low-frequency components, various under-sampling strategies can be employed to reduce the number of samples. The intensity image is then reconstructed through the inverse Fourier transform.

In practical applications, situations are often encountered where the region of interest is occluded. Occlusion refers to a scenario in which part or all of the target of interest is obstructed by other objects, obstacles, or the environment, making it impossible to observe the entire target. In this scenario, accurate reconstruction of the complete region of interest becomes a challenging task. Therefore, this paper combines the optical imaging problem with an image vision task, framing the hard occlusion problem encountered in FSPI as an image inpainting task: recovering a complete and clear region of interest requires inpainting the obstructed parts. In recent years, deep learning architectures [11-14] have flourished in the fields of image inpainting [15-17] and image reconstruction [18,19]. Convolutional Neural Networks (CNNs) [20,21] and the Vision Transformer (ViT) [22-24] have demonstrated tremendous potential in computer vision tasks. However, convolution operates only on regular grids, making it challenging to handle irregular input images with random masks. The visual attention model of the Vision Transformer architecture provides a new paradigm for image processing. ViT segments the image into patches and applies attention mechanisms [25], serializing the image information and thereby enabling global-context interaction and feature extraction. Images inherently contain spatially redundant information, and lost image details can be recovered from neighboring pixels. By randomly masking a significant portion of the image, this strategy effectively reduces redundancy in the image while posing a challenging self-supervised task. However, ViT also suffers from high data requirements, heavy computational demands, and limitations on the number of stacked layers.

Based on these considerations, this paper proposes a distance-based segmentation method for slicing the occlusion and the target of interest. By exploiting the difference in distance between the occlusion and the target of interest, a single pixel detector sequentially obtains their echo signals and reconstructs their intensity images. A mask is created by identifying the region of occluded pixels in the occlusion intensity image. Accordingly, a multi-scale image inpainting network that combines sparse convolution and the Transformer architecture is proposed to repair the occluded region; the integration helps the network learn feature information more effectively, and the encoder extracts features only from non-masked (non-zero) pixels, significantly reducing computational cost. Conventional convolution operates only on regular grids, making it difficult to handle irregular input images with random masks; moreover, its overlapping sliding windows take masked pixels into account during processing, producing a significant pixel distribution shift. Therefore, the paper introduces sparse convolution [26], in which the convolution operation is performed only in the unmasked regions of the image (masked parts are set to 0). The network also performs deep-shallow feature fusion and computes the feature similarity between patches. The decoder then uses the obtained feature information to regressively predict and restore the occluded parts of the image, yielding the inpainted image. Based on the restored complete image, a reconstruction network incorporating a Channel Attention Mechanism (CAM), Attention Gates [27], and double skip connections is designed; it leverages the multi-scale feature information of the image to reconstruct the original information of the image. Simulation and real-world experiments indicate that the proposed method can effectively repair and enhance Fourier single pixel imaging in scenarios with hard occlusion. Even under conditions of high occlusion rates and low sampling rates, the method can still reconstruct relatively complete and clear images.

The main contributions of this paper are as follows:

  • 1) A range-slice method is proposed to distinguish the occlusion from the target of interest.
  • 2) The difficulty of handling occluded images with conventional convolution is addressed by integrating sparse convolution with a Transformer architecture.
  • 3) A reconstruction network combining a channel attention mechanism and attention gates is proposed to achieve high-quality Fourier single pixel imaging.

2. Related work

2.1 Fourier single pixel imaging

FSPI is an imaging technique based on the Fourier transform. The method uses pre-programmed patterns to modulate the spatial distribution of the laser for active illumination and employs algorithms to compute and reconstruct the spatial information of the target. In FSPI, the information of the target is obtained in frequency space, and the intensity image is recovered through the inverse Fourier transform. Because most of the power of natural images is concentrated in the low-frequency region, various under-sampling strategies can be employed in FSPI to reduce the number of samples while maintaining high-quality images. When selecting the sampling rate, it is necessary to balance data volume, computational cost, and the quality of the reconstructed images. Lower sampling rates may suit applications with low demands on detail, such as rapid image reconstruction or low-resolution image processing, whereas higher sampling rates are more appropriate for tasks requiring finer details, such as high-resolution image reconstruction. With the flourishing development of deep learning in computer vision, neural networks have found applications in FSPI as well. DL-based FSPI techniques often use Deep Neural Networks (DNNs) to learn implicit priors from extensive training data. A trained DNN can recover the information lost through under-sampling, enhancing imaging quality while maintaining high computational efficiency. Rizvi et al. [28] proposed a deep convolutional autoencoder network with symmetric skip connections to mitigate the impact of reduced measurements on the quality of Fourier single pixel imaging; the architecture enables real-time Fourier single pixel imaging at very low sampling rates (5-8%) with a resolution of 96 × 96. Qiu et al. [29] proposed a variable-density sampling strategy in Fourier space for high-resolution images, capable of reconstructing clear 256 × 256 images even at a sampling rate as low as 10%. Li et al. [30] employed up-sampling and jittered Fourier basis patterns to approximate fast Fourier transformation with grayscale speckle, producing high-quality 1024 × 768 images on digital micromirror devices. Yao et al. [31] addressed the limitation of existing sampling methods in capturing crucial high-frequency Fourier coefficients with an adaptive sampling method based on the continuity of spectral energy, significantly improving image quality at low sampling rates. Yang et al. [32] proposed a Fourier single pixel imaging method based on Generative Adversarial Networks (GANs). Jiang et al. [33] designed a reconstruction network for FSPI with an optimized sampling strategy. In summary, FSPI acquires the frequency-space information of the target scene, and these methods leverage the strong representational power of deep neural networks to reconstruct better images by learning implicit priors. Various strategies, including GANs and adaptive sampling, achieve high-resolution imaging at low sampling rates, marking clear advances in the field.

2.2 Image inpainting

Image inpainting is the process of repairing or reconstructing missing or damaged parts of an image and holds significant importance in computer vision. Commonly used methods include interpolation, edge-preserving techniques, and graph cut methods, in which the values of the missing regions are estimated from the pixel values of the surrounding areas. With the development of deep learning, image inpainting methods based on deep learning have made significant advances. These methods leverage large-scale image datasets for training, enabling them to learn high-level features and contextual information and thus produce more accurate and natural restoration results. Pathak et al. [34] were among the first to apply generative adversarial methods to image inpainting. Demir et al. [35] embedded residual learning and PatchGAN into the network, further enhancing its restoration capability. Yan et al. [36] proposed a Shift-Connection layer for feature rearrangement; its shifting operation allows the network to efficiently borrow information from the nearest neighbors of missing regions, and the refinement process enhances both the global semantic structure and the local texture details of the generated content. The methods above all use convolution for feature extraction. However, because convolution cannot distinguish unseen pixels, it tends to shift the pixel distribution, making it difficult to recover fine details. Sparse convolution was initially applied to 3D point clouds [37] to handle irregular and sparse data, whose sparse characteristics are analogous to 2D images with occlusions; Tian et al. [38] therefore introduced sparse convolution into visual tasks. Sparse convolution directly skips unseen pixels, thereby avoiding the influence of masked-out pixels. With the rise of Natural Language Processing (NLP), promising results have been achieved through self-supervised pre-training based on autoregressive language modeling in the Generative Pre-trained Transformer (GPT) [39] and masked autoencoding in Bidirectional Encoder Representations from Transformers (BERT). The concept is concise and ingenious: remove a portion of the data and learn to predict the deleted content. Given the success of BERT, it is natural to consider how to apply this approach to image processing. The Vision Transformer marked the first application of the Transformer architecture to visual tasks. The asymmetric structure of MAE [40] allows it to recover high-quality image information even after masking out 75% of the pixels. ConvMAE [41], employing a combination of convolutional and Transformer encoders, has demonstrated promising results in downstream tasks. Local inductive biases [42-47] and hierarchical designs [48,49] have been employed to enhance the performance of ViT. In summary, convolution contributes local information extraction and context awareness, while the Transformer provides global relationship modeling and self-attention, enabling models to better adapt to the diverse scales and structures encountered in inpainting tasks.

3. Approach

3.1 Principle of forward imaging systems

The principle of the forward imaging system is illustrated in Fig. 1. The laser emitted from the source is first expanded by a beam expander. It then passes through a transmitting antenna and is directed onto a Digital Micromirror Device (DMD). The computer-controlled DMD modulates the spatial distribution of the laser, thereby illuminating the target object and the occlusion. The reflected light from the scene is subsequently recorded by a detector through a receiving antenna. The computer processes the total reflected laser intensity from the target scene to obtain the spatial spectrum distribution of the target scene.

Fig. 1. Fourier Single Pixel Imaging Schematic.

In the FSPI, the spatial distribution of the modulation pattern can be expressed as:

$${P_\varphi }({x,y;{f_x},{f_y}} )= a + b\cos ({2\pi {f_x}x + 2\pi {f_y}y + \varphi } )$$
where a represents the average light intensity of the illumination pattern and b represents the amplitude of the modulation pattern. The coordinates (x, y) denote spatial positions, (${f_x},{f_y}$) denotes the spatial frequency pair, and φ represents the initial phase. In the proposed system, a pulsed laser is used as the illuminating source. The maximum laser intensity is denoted as E0 and the temporal profile of the laser pulse is expressed as S(t). Consequently, the modulated speckle distribution can be represented as:
$$E({x,y,t} )= {P_\varphi }({x,y;{f_x},{f_y}} ){E_0}S(t )$$
so the intensity of the illumination fluctuates in accordance with the temporal variation of the pulse.
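As a concrete illustration of Eqs. (1) and (2), the sketch below generates a grayscale Fourier basis pattern and applies a sampled pulse profile to it. The convention that ${f_x}$ and ${f_y}$ are expressed in cycles per pattern (hence the division by the pattern size), and the default values a = b = 0.5, are assumptions made for illustration only.

```python
import numpy as np

def fourier_basis_pattern(h, w, fx, fy, phi, a=0.5, b=0.5):
    """Grayscale Fourier basis pattern P_phi(x, y; fx, fy) of Eq. (1)."""
    y, x = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    return a + b * np.cos(2 * np.pi * fx * x / w + 2 * np.pi * fy * y / h + phi)

def modulated_speckle(pattern, E0, S_t):
    """Modulated illumination E(x, y, t) = P_phi * E0 * S(t) of Eq. (2);
    S_t is the sampled temporal profile of the laser pulse."""
    return pattern[None, :, :] * E0 * S_t[:, None, None]

# Example: one 256 x 256 pattern at spatial frequency (fx, fy) = (3, 5) and phase 0
P = fourier_basis_pattern(256, 256, fx=3, fy=5, phi=0.0)
```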

In conventional FSPI, a single pixel detector without time resolution is typically employed. The light field reflected by the target is accumulated over time, and the resulting total echo energy is used for reconstructing the target scene. Consequently, the reconstructed intensity image encompasses both the occluding objects and the target of interest within the field of view. However, the target of interest is often partially blocked by occluding objects, resulting in an incomplete representation, which poses challenges for downstream tasks such as object detection and recognition. To address this issue, the proposed method uses image inpainting techniques to recover the complete intensity image of the target of interest. However, determining which areas of the 2D reconstructed FSPI intensity image correspond to occlusions and which to the target of interest is still a challenging task. To overcome this challenge, the proposed method employs a pulsed laser as the light source and a single pixel detector with time resolution to detect the echo signal of the target scene. The occluding object and the target of interest differ in longitudinal distance, so the echo signal detected by the single pixel detector exhibits multiple peaks in the time dimension. The intensity values of the peaks are extracted from the echo signal, and intensity image slices at different time positions can then be obtained by the FSPI reconstruction algorithm using these peak intensities. The intensity image slice closest to the system is identified as the occluding object, while the other slices are considered targets of interest, which realizes the segmentation of the occluding object and the targets of interest. When the distance between the occlusion and the target is too small and the received waveforms mix, waveform decomposition can be achieved using the Levenberg-Marquardt (LM) algorithm [50-52]. The LM algorithm is an optimization method for nonlinear least-squares problems, often applied in curve fitting and parameter estimation; it combines gradient descent and the Gauss-Newton method to locate the minimum of the objective function effectively. Additional information can be found in Supplement 1.
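The following sketch illustrates how the occlusion and target echoes could be separated in the time dimension: well-separated peaks are taken directly, and mixed waveforms are decomposed with a Levenberg-Marquardt fit (SciPy's curve_fit uses LM for unconstrained problems). The two-Gaussian pulse model, the shared pulse width, and the nanosecond-scale initial guesses are assumptions, not the system's actual pulse shape.

```python
import numpy as np
from scipy.signal import find_peaks
from scipy.optimize import curve_fit

def two_gaussians(t, A1, t1, A2, t2, sigma, offset):
    """Sum of two Gaussian pulses modeling overlapping occlusion/target echoes."""
    g = lambda A, tc: A * np.exp(-((t - tc) ** 2) / (2 * sigma ** 2))
    return offset + g(A1, t1) + g(A2, t2)

def split_echo(t, signal, min_height=0.1):
    """Return the peak times and amplitudes attributed to the occlusion and target."""
    peaks, props = find_peaks(signal, height=min_height * signal.max())
    if len(peaks) >= 2:
        strongest = np.argsort(props["peak_heights"])[::-1][:2]
        idx = np.sort(peaks[strongest])            # earlier peak = occlusion
        return t[idx], signal[idx]
    # Overlapping echoes: decompose the mixed waveform with an LM curve fit
    t0 = t[np.argmax(signal)]
    p0 = [signal.max(), t0 - 1e-9, 0.5 * signal.max(), t0 + 1e-9, 2e-9, 0.0]
    popt, _ = curve_fit(two_gaussians, t, signal, p0=p0)  # unconstrained -> LM
    return np.array([popt[1], popt[3]]), np.array([popt[0], popt[2]])
```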

To simplify the derivation of the segmentation process, the presence of only one occlusion and one target of interest is assumed. The waveform of the echo signal intensity received by the detector is depicted in Fig. 2, where the horizontal axis represents time and the vertical axis represents signal intensity. ${R_1}({x,y} )$ represents the reflectivity distribution of the occlusion and ${R_2}({x,y} )$ represents the reflectivity distribution of the target. ${t_1}$ and ${t_2}$ respectively denote the times of the first and second peaks of the echo signal received by the single pixel detector. The signal recorded by the single pixel detector can therefore be expressed as:

$$\begin{aligned} D_\varphi ^{\prime}({{f_x},{f_y},t} )&= {D_n} + k\int {\int {{R_1}({x,y} )E({x,y} )S({t - {t_1}} )dxdy} } \\ & + k\int {\int {{R_2}({x,y} )E({x,y} )S({t - {t_2}} )dxdy} } \end{aligned}$$
where k represents the intensity modulation coefficient used to control the speckle intensity, and ${D_n}$ represents background noise. In the proposed FSPI, three-step phase shifting is employed to measure the Fourier spectrum values of the target object in the spatial frequency domain, which can be represented as:
$$\begin{aligned} F({x,y,t} )&= [{2{D_0}({{f_x},{f_y},t} )- {D_{2\pi /3}}({{f_x},{f_y},t} )- {D_{4\pi /3}}({{f_x},{f_y},t} )} ]\\ & + \sqrt 3 j \cdot [{{D_{2\pi /3}}({{f_x},{f_y},t} )- {D_{4\pi /3}}({{f_x},{f_y},t} )} ]\end{aligned}$$
where j represents the imaginary unit. Owing to the conjugate symmetry of the spatial frequency distribution, the number of Fourier coefficients that need to be measured is half the total number of pixels in the image. ${D_0}({{f_x},{f_y},t} )$, ${D_{2\pi /3}}({{f_x},{f_y},t} )$ and ${D_{4\pi /3}}({{f_x},{f_y},t} )$ represent the measurements of the single pixel detector under phase shifts of 0, 2$\pi$/3 and 4$\pi$/3.
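A minimal sketch of the three-step phase-shifting measurement of Eq. (4) is given below; the un-centered FFT index convention and the dictionary-style container for the per-frequency measurements are assumptions made for illustration.

```python
import numpy as np

def fourier_coefficient(D0, D2, D4):
    """Complex Fourier coefficient from three phase-shifted readings (Eq. (4));
    D0, D2, D4 are detector values under phase shifts 0, 2*pi/3 and 4*pi/3."""
    return (2 * D0 - D2 - D4) + 1j * np.sqrt(3) * (D2 - D4)

def assemble_spectrum(measurements, shape):
    """Fill the spectrum from per-frequency measurements; conjugate symmetry
    supplies the unmeasured half, so only half of the coefficients are sampled."""
    F = np.zeros(shape, dtype=complex)
    h, w = shape
    for (fx, fy), (D0, D2, D4) in measurements.items():
        F[fy % h, fx % w] = fourier_coefficient(D0, D2, D4)
        F[(-fy) % h, (-fx) % w] = np.conj(F[fy % h, fx % w])
    return F
```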

Fig. 2. The intensity signal diagram of occlusion and the target of interest.

Let ${F_1}({x,y,t} )$ and ${F_2}({x,y,t} )$ denote the spatial frequency spectra of the occlusion and the target of interest, respectively; Eq. (4) can then be represented as:

$$F({x,y,t} )= {F_1}({x,y,t} )S({t - {t_1}} )+ {F_2}({x,y,t} )S({t - {t_2}} )$$

If the echo signal of the occluded object and the echo signal of the target of interest can be clearly distinguished in the time dimension, then at $t = {t_1}$, $S({t - {t_1}} )$ is much greater than $S({t - {t_2}} )$, so Eq. (5) reduces to $F({x,y,t} )= {F_1}({x,y,t} )S({t - {t_1}} )$. At $t = {t_2}$, $S({t - {t_1}} )$ is much smaller than $S({t - {t_2}} )$, and Eq. (5) reduces to $F({x,y,t} )= {F_2}({x,y,t} )S({t - {t_2}} )$. Therefore, the intensity image distributions of the occlusion and the target of interest can be reconstructed separately at different time positions.

In the practical FSPI detection process, to improve imaging efficiency it is usually necessary to acquire the target scene under down-sampling. Because the power of the image spectrum in natural scenes is mostly concentrated in the low-frequency region, it is sufficient to sample the information-rich low-frequency areas to approximately reconstruct the image. The down-sampling process can be expressed as $D({\cdot} )$. Hence, the reconstructed image of the target of interest ${I_{target}}$ and the reconstructed image of the occlusion ${I_{occlusion}}$ can be represented as:

$$\left\{ \begin{array}{l} {I_{target}} = IFFT\{{D[{{F_2}({x,y,t} )} ]} \}\\ {I_{occlusion}} = IFFT\{{D[{{F_1}({x,y,t} )} ]} \}\end{array} \right.$$
where IFFT represents the inverse Fourier transform. A masking operation is performed based on the pixel region occupied by the occlusion intensity image ${I_{occlusion}}$, and the result is denoted ${I_{mask}}$. By feeding ${I_{target}}$ and ${I_{mask}}$ into the image inpainting network, the restored image is obtained; the process can be expressed as:
$${I_{inpainting}} = II({{I_{target}},{I_{mask}}} )$$
where II denotes the Image Inpainting network and ${I_{inpainting}}$ represents the inpainting result obtained through the neural network.
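The sketch below illustrates Eq. (6) for one range slice: a centered circular low-frequency mask plays the role of the down-sampling operator $D({\cdot} )$, consistent with the central circular sampling described in Section 4.1. The circular shape of the sampled region and the un-centered input spectrum convention are assumptions.

```python
import numpy as np

def circular_lowpass_mask(shape, sampling_rate):
    """Binary mask D(.) keeping a centered low-frequency disk whose area
    matches the requested sampling rate."""
    h, w = shape
    radius = np.sqrt(sampling_rate * h * w / np.pi)
    yy, xx = np.meshgrid(np.arange(h) - h / 2, np.arange(w) - w / 2, indexing="ij")
    return (yy ** 2 + xx ** 2) <= radius ** 2

def reconstruct_slice(F_slice, sampling_rate=0.1):
    """Eq. (6): I = IFFT{ D[F] } for one range slice (occlusion or target)."""
    F_centered = np.fft.fftshift(F_slice)
    F_sampled = F_centered * circular_lowpass_mask(F_slice.shape, sampling_rate)
    return np.real(np.fft.ifft2(np.fft.ifftshift(F_sampled)))
```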

When the sampling rate is low, the reconstructed image has low resolution, making it difficult to identify the target. Therefore, the inpainted image of the target object is fed into a neural network for image reconstruction, aiming to recover a high-resolution image. The process can be represented as:

$${I_{reconstruction}} = IR({{I_{inpainting}}} )$$
where ${I_{reconstruction}}$ represents the reconstructed image and IR denotes the Image Reconstruction network employed to recover the original information of the image.

3.2 Network architecture

As shown in Fig. 3, the overall system comprises three stages: the forward model, image inpainting (II), and image reconstruction (IR). The forward model performs intensity imaging of the occluded scene. Owing to the distance difference between the occlusion and the region of interest, range slicing is applied to the intensity image to separate the occlusion from the region of interest, and a mask is obtained from the region occupied by the occlusion intensity image. The mask and the intensity image of the target of interest are then input into an image inpainting network that combines multi-scale sparse convolution and a Transformer architecture; this network mainly consists of Multi-Scale Feature Fusion (MSFF) and Double Attention Regression Network (DARN) structures. To address the loss of image detail caused by under-sampling, the restored image is finally fed into an image reconstruction network to recover the original information of the image; this network primarily includes Attention Gate (AG) and Channel Attention Mechanism (CAM) structures. Detailed introductions to each module are provided below.

Fig. 3. The overall network.

3.2.1 Distance-based image segmentation

When occlusion is present in front of the target of interest, the FSPI forward system performs Fourier single pixel imaging of the target scene, as illustrated in Fig. 3. Because the occlusion and the target of interest lie at different distances, the receiver observes multiple peaks in the echo signal, as depicted in Fig. 2. The FSPI system can therefore perform distance-based intensity slicing of targets within the field of view; using the FSPI reconstruction algorithm according to Eq. (6), intensity image slices at different temporal positions are obtained, yielding the intensity images of the occlusion and the target of interest.

As shown in Fig. 4, the occlusion and target intensity images are partitioned into non-overlapping square patches. ${I_{mask}}$ is obtained by checking whether occluded pixels exist in each patch of ${I_{occlusion}}$: if they are present, all pixels in the patch are set to 0; otherwise, they are set to 1. Finally, ${I_{mask}}$ and ${I_{target}}$ are fed into the image inpainting network to inpaint the occluded regions.
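A minimal sketch of this patch-wise mask generation is shown below; the 16 × 16 patch size and the zero intensity threshold used to decide whether a patch contains occluded pixels are assumptions.

```python
import numpy as np

def occlusion_mask(I_occlusion, patch=16, thresh=0.0):
    """Build I_mask from the occlusion slice: any patch containing occluded
    (above-threshold) pixels is set to 0, all other patches to 1."""
    h, w = I_occlusion.shape
    mask = np.ones((h, w), dtype=np.float32)
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            if (I_occlusion[i:i + patch, j:j + patch] > thresh).any():
                mask[i:i + patch, j:j + patch] = 0.0
    return mask

# I_mask = occlusion_mask(I_occlusion); I_mask and I_target are then fed to the inpainting network.
```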

Fig. 4. Principle of Distance-Based Image Segmentation.

3.2.2 Image inpainting network

The framework of the inpainting network is illustrated in Fig. 5. The input to the inpainting network is a single-channel, under-sampled image with a mask, and the output is the image with the masked regions inpainted. Like other autoencoder networks, the proposed network consists of an encoder and a decoder: the encoder maps information to a latent representation, and the decoder reconstructs the information, recovering the original data of the image. The network primarily consists of six structures: sparse convolutional down-sampling, deconvolutional up-sampling, skip connections, Multi-Scale Feature Fusion (MSFF), Transformer blocks, and the Double Attention Regression Network (DARN).

Fig. 5. Image inpainting network framework.

Figure 6 compares sparse convolution with regular convolution. In the figure, the leftmost part is a binary mask M, where white blocks indicate the obscured part with pixel values of 0 and black blocks represent the unobscured part with pixel values of 1. The rightmost part, from top to bottom, shows the outputs of sparse convolution and conventional convolution, respectively. As shown in Fig. 6, sparse convolution does not alter the pixel distribution of the original image, making it more suitable for image inpainting tasks. In addition, the attention architecture of the Transformer can be leveraged to capture the global pixel distribution of the image by computing similarities between patches, so that attention weights are computed for every position in every patch and the model can allocate different attention weights to different parts of the input.
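A masked ("sparse") convolution can be sketched as below: masked pixels are zeroed before the convolution and the output is re-masked, so hidden pixels never contribute to visible outputs. The rule used to propagate the mask to a lower resolution when the stride is greater than 1 (a patch counts as visible if any of its pixels is visible) is a simplifying assumption.

```python
import torch
import torch.nn.functional as F

def sparse_conv2d(x, mask, weight, bias=None, stride=1, padding=1):
    """Masked convolution sketch: x is (N, C, H, W), mask is (N, 1, H, W) with
    1 = visible and 0 = masked, weight is (C_out, C, k, k)."""
    x = x * mask                                            # drop masked pixels
    out = F.conv2d(x, weight, bias, stride=stride, padding=padding)
    if stride > 1:                                          # propagate the mask
        mask = F.max_pool2d(mask, kernel_size=stride, stride=stride)
    return out * mask, mask                                 # re-mask the output
```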

Fig. 6. Sparse Convolution Illustration.

In image inpainting tasks, the most crucial aspect of the training process is the design of the mask ratio and mask region. In this paper, the image is divided into non-overlapping square patches and relatively large-scale random masks are applied to these patches. A larger mask ratio removes redundancy between pixels and prevents the inpainting task from being solved by simple extrapolation from neighboring patches. The highly sparse input creates favorable conditions for designing an effective inpainting network.

After the image is fed into the network, it is down-sampled by sparse convolution, generating three feature maps at different resolutions. For an image of resolution H × W, the encoder feature maps are denoted ${S_1}$, ${S_2}$ and ${S_3}$, with resolutions $\frac{H}{4} \times \frac{W}{4}$, $\frac{H}{8} \times \frac{W}{8}$ and $\frac{H}{{16}} \times \frac{W}{{16}}$; the feature maps of the up-sampling stage are denoted ${D_1}$, ${D_2}$ and ${D_3}$. Each feature map keeps its mask proportion and positions in correspondence with the masked or obscured areas of the original image. Taking ${S_3}$ as an example, it is obtained by sparse-convolution down-sampling of ${S_2}$ followed by masking with the embedded mask. Similarly, ${S_1}$, ${S_2}$, ${D_1}$, ${D_2}$ and ${D_3}$ (with resolutions $\frac{H}{4} \times \frac{W}{4}$, $\frac{H}{8} \times \frac{W}{8}$, $\frac{H}{4} \times \frac{W}{4}$, $\frac{H}{8} \times \frac{W}{8}$ and $\frac{H}{{16}} \times \frac{W}{{16}}$, respectively) are obtained. The specific implementation can be expressed as:

$${S_i} = {\gamma _i}({{S_{i - 1}}} )\odot {M_i}\quad\quad({i \in \{ 3,2,1\} } )$$
$${D_i} = {U_i}({{D_{i + 1}}} )\odot {M_i}\quad\quad({i \in \{ 3,2,1\} } )$$
where ${\gamma _i}$ represents each sparse convolution, ${M_i}$ represents the mask resized to match each feature map, ${U_i}$ represents each deconvolution operation, and ${\odot}$ denotes element-wise multiplication, which masks each feature map at the corresponding positions.
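Under the assumption that each ${\gamma _i}$ is realized by a stride-2 convolution and each ${M_i}$ is the input mask resized to the corresponding resolution, one encoder stage of Eq. (9) can be sketched as follows (the exact channel widths and the stem that brings the input down to H/4 are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedDown(nn.Module):
    """One encoder stage of Eq. (9): S_i = gamma_i(S_{i-1}) ⊙ M_i."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, kernel_size=3, stride=2, padding=1)

    def forward(self, s_prev, mask):
        s = self.conv(s_prev * mask)                                # gamma_i(S_{i-1})
        m = F.interpolate(mask, size=s.shape[-2:], mode="nearest")  # M_i
        return s * m, m                                             # element-wise masking

# Stacking such stages yields S1, S2 and S3 at H/4, H/8 and H/16; the decoder
# stages of Eq. (10) mirror this with transposed convolutions and the same masks.
```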

To prevent the loss of details and features of the input image as the network deepens, skip connections are incorporated at each up-sampling step in the deconvolution process to fuse shallow and deep features. Keeping the mask positions identical during feature fusion also effectively reduces the impact of vanishing gradients. The final feature map is converted into a sequence of tokens by extracting the non-masked regions and mapping the three-dimensional features to vectors (patch embedding), and positional encoding (position embedding) is added to the vectors. Finally, the features are fed into a Transformer block in the form of patches to compute similarity, namely attention. The attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors and the output is a weighted sum. This allows each patch to obtain semantic information from any position, enabling comprehensive interaction across the entire input. The positional encoding formulas and the attention computation are as follows:

$$P{E_{(pos,2i)}} = \sin ({pos/{{10000}^{2i/{d_{model}}}}} )$$
$$P{E_{(pos,2i + 1)}} = \cos ({pos/{{10000}^{2i/{d_{model}}}}} )$$

Sine and cosine functions with different frequencies are selected, where pos represents the position and i denotes the dimension. Each dimension of the positional encoding corresponds to a sinusoid, and ${d_{model}}$ is the dimensionality of the embedding.

$$\textrm{Attention}({Q,K,V} )= softmax\left( {\frac{{Q{K^T}}}{{\sqrt {{d_k}} }}} \right)V$$

Q, K, and V represent the query, key, and value, respectively. The softmax function is used to compute the weights, and ${d_k}$ denotes the feature dimension. The formula takes the dot product of the query and key, followed by normalization, which yields a weight for each value; multiplying these weights by the corresponding values and summing them gives the similarity between each pair of patches.
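For reference, a direct implementation of Eqs. (11)-(13) is sketched below (an even embedding dimension is assumed):

```python
import math
import torch

def sinusoidal_position_encoding(num_patches, d_model):
    """Eqs. (11)-(12): sine/cosine positional encoding for each patch index."""
    pos = torch.arange(num_patches, dtype=torch.float32).unsqueeze(1)
    two_i = torch.arange(0, d_model, 2, dtype=torch.float32)
    div = torch.exp(-two_i * math.log(10000.0) / d_model)   # 1 / 10000^(2i/d_model)
    pe = torch.zeros(num_patches, d_model)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

def scaled_dot_product_attention(Q, K, V):
    """Eq. (13): softmax(Q K^T / sqrt(d_k)) V over the patch tokens."""
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
    return torch.softmax(scores, dim=-1) @ V
```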

The decoder also follows the Transformer architecture, with the addition of the Double Attention Regression Network (DARN) and Attention Connection (AC) modules. The DARN module captures hierarchical feature information from multiple layers of the network, while the AC module helps the network converge more effectively and mitigates vanishing gradients. This part reconstructs the input by predicting the pixel values of each masked patch. The final layer of the decoder is a linear output layer whose number of channels equals the total number of pixel values of the image; the output is reshaped into two-dimensional form to compose the reconstructed image. The loss is computed only on the masked patches to facilitate training.

The loss function in the paper consists of three components: pixel-wise loss, mean squared error loss, and perceptual loss.

The pixel-wise loss and the mean squared error loss are denoted ${L_1}$ and ${L_2}$, respectively, and are expressed as follows:

$${L_1} = \frac{1}{n}\sum\limits_{i = 1}^n {|{y_a} - y|}$$
$${L_2} = \frac{1}{n}\sum\limits_{i = 1}^n {{{({{y_a} - y} )}^2}}$$
where n represents the number of pixels in each image, ${y_a}$ represents the result of image restoration, and y represents the original image.

The perceptual loss, computed with a pre-trained VGG-19, is denoted ${L_{per}}$ and is expressed as follows:

$${L_{per}} = \frac{1}{m}\sum\limits_{i = 1}^m {{{[{{\phi_i}({{y_a}} )- {\phi_i}(y )} ]}^2}}$$
where m represents the number of VGG-19 layers used for the loss and ${\phi _i}$ denotes the i-th of these layers. The total loss can be represented as:
$$L = {L_1} + {L_2} + \lambda {L_{per}}$$

During training of the proposed network model, the parameter $\lambda$ is set to 0.01. The Adam optimizer is employed with an initial learning rate of 0.001.
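A sketch of the combined objective of Eq. (17) is given below, with the perceptual term computed on features of a frozen, ImageNet pre-trained VGG-19. The choice of VGG-19 feature layers, the grayscale-to-RGB replication, and the omission of input normalization are assumptions made for brevity.

```python
import torch
import torch.nn as nn
import torchvision

class InpaintingLoss(nn.Module):
    """L = L1 + L2 + lambda * L_per (Eq. (17)), with lambda = 0.01."""
    def __init__(self, lam=0.01, layers=(3, 8, 17, 26)):
        super().__init__()
        vgg = torchvision.models.vgg19(weights="IMAGENET1K_V1").features.eval()
        for p in vgg.parameters():
            p.requires_grad_(False)
        self.vgg, self.layers, self.lam = vgg, set(layers), lam

    def _features(self, x):
        feats, h = [], x.repeat(1, 3, 1, 1)      # grayscale -> 3 channels for VGG
        for idx, layer in enumerate(self.vgg):
            h = layer(h)
            if idx in self.layers:
                feats.append(h)
        return feats

    def forward(self, y_a, y):                   # y_a: inpainted result, y: original
        l1 = (y_a - y).abs().mean()
        l2 = ((y_a - y) ** 2).mean()
        per = sum(((fa - f) ** 2).mean()
                  for fa, f in zip(self._features(y_a), self._features(y)))
        return l1 + l2 + self.lam * per
```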

3.2.3 Image reconstruction network

After the image inpainting network repairs the obscured parts, the result passes through a reconstruction network to fill in detailed information and eliminate ringing artifacts. The reconstruction network consists of an encoder that extracts image features and a decoder that reconstructs the original image information. The network is illustrated in Fig. 7.

Fig. 7. The framework of the image reconstruction network.

The input to the reconstruction network is a single-channel image with a resolution of 256 × 256, and the output is a reconstructed image of the same resolution. The architecture, as shown in Fig. 7, mainly consists of convolutional layers, deconvolutional layers, and normalization layers. The encoder performs down-sampling to extract features, producing a feature map with 1024 channels and a resolution of 8 × 8; the decoder then up-samples the feature map to restore fine-grained details. Dual skip connections are additionally introduced between the encoder and decoder to prevent vanishing gradients and accelerate training convergence, and a Channel Attention Mechanism (CAM) is incorporated. As shown in the bottom left corner of Fig. 7, an Attention Gate (AG) is added within the dual skip connections. Because images obtained by down-sampling suffer from aliasing and loss of detail, noise and ringing artifacts may be present. The AG mechanism helps the model suppress noise by focusing attention on the most relevant parts of the input, thereby reducing sensitivity to noise. CAM enables the model to selectively enhance information from specific channels: by learning a weight for each channel, the model adaptively adjusts the relative importance of channels and better captures features, attending to relevant information in a channel-wise manner. Together, AG and CAM help the model learn detailed image information, suppress noise and ringing artifacts, and improve reconstruction results.
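The two attention modules can be sketched as follows: an SE-style channel attention block and an attention gate on the skip connection. The reduction ratio, the 1 × 1 convolutions, and the assumption that the gating signal has already been resized to the skip feature's resolution are illustrative choices, not the exact configuration used in the paper.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention: re-weight channels with a gate learned from global pooling."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.gate(x)            # (N, C, 1, 1) gate broadcast over H x W

class AttentionGate(nn.Module):
    """Attention gate [27]: the decoder signal g gates the encoder skip feature x
    so that only spatially relevant regions are passed to the decoder."""
    def __init__(self, c_x, c_g, c_int):
        super().__init__()
        self.wx = nn.Conv2d(c_x, c_int, 1)
        self.wg = nn.Conv2d(c_g, c_int, 1)
        self.psi = nn.Sequential(nn.ReLU(inplace=True), nn.Conv2d(c_int, 1, 1), nn.Sigmoid())

    def forward(self, x, g):               # x and g must share spatial size here
        return x * self.psi(self.wx(x) + self.wg(g))
```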

The loss function of the reconstruction part is the Mean Squared Error (MSE), which computes the ${L_2}$ loss between the inpainted image and the reconstructed image. The ${L_2}$ loss can be expressed as:

$${L_2} = \frac{1}{m}\sum\limits_{i = 1}^m {{{({{y_b} - {y_a}} )}^2}}$$
where m represents the number of pixels in each image, ${y_b}$ represents the result after image reconstruction, ${y_a}$ represents the result after image inpainting.

4. Numerical experiments

4.1 Data set preparation and training

In the image inpainting phase, self-supervised pre-training is conducted on the ImageNet-1k training set. 100,000 natural-scene images are selected, with 80,000 randomly assigned to the training set and 20,000 to the test set. Each image is resized to 256 × 256, converted to grayscale, and subjected to large-scale random masking for network pre-training, without any down-sampling. During this stage, the Adam optimizer is used for the iterations, with a learning rate of 0.001. The network is trained for 200 epochs with a batch size of 16.

The loss curves during the training of the restoration network are shown in Fig. 8. As the number of epochs increases, all the losses gradually decrease and stabilize after epoch 150, indicating that the network converges.

Fig. 8. The loss function curves during the training process of the restoration network: (a) per-pixel loss, (b) mean squared error loss, (c) perceptual loss, and (d) total loss.

Afterwards, the reconstruction network module is trained using 15,000 images from ImageNet-1k. The images are resized to 256 × 256, converted to grayscale, and Fourier under-sampled from the central circular region of the spectrum; the original images serve as the targets for self-supervised training. Training is conducted separately at four sampling rates (1%, 3%, 5%, and 10%), resulting in four distinct sets of pre-trained weights; the reasons for opting for independent training are elaborated in Supplement 1. Throughout the training process, the Adam optimizer is used with a learning rate of 0.0002. The network is trained for 60 epochs, and the best model during training is saved. Both network models are implemented in Python 3.9 and PyTorch 1.13, and training is performed on a GeForce RTX 4090 GPU to accelerate computation.

4.2 Image inpainting results

Numerical results demonstrate the superiority of the proposed method over three other methods: GLCIC [53], CTSDG [54], and MAE [40]. During testing, mask proportions are set to 30%, 40%, 50%, 60%, and 70%, with a sampling rate of 10%. In this stage, the image is divided into non-overlapping patches and random occlusion is applied to these patches; the inpainting results of each method under different occlusion proportions are evaluated. All methods are pre-trained on the same dataset. The results for 30%, 50% and 70% mask proportions are shown in Fig. 9, and additional results are provided in Supplement 1.

Fig. 9. Results of inpainting with different occlusion proportions for different methods.

The inpainting results under random occlusion at three proportions are shown in Fig. 9. From left to right, the images display the ground truth, the masked areas, GLCIC, CTSDG, MAE, and SCT (ours). The GLCIC method, at an occlusion rate of 70%, produces images with more artifacts, and parts of the target shapes are missing. At occlusion rates of 30% and 50%, in the regions marked with red boxes, both GLCIC and CTSDG fail to restore the branches in the occluded areas, and the overall detail of the recovered images is relatively poor. MAE and SCT (ours) show good restoration results at all occlusion rates; both can recover the shape of the target, but our method demonstrates superior performance in fine texture details.

To objectively evaluate the performance of the proposed hierarchical sparse convolution and Transformer-based inpainting network, each method is quantitatively assessed on the ImageNet test set at various occlusion rates, as shown in Table 1 and Table 2.


Table 1. The PSNR of the inpainting results for different methods under various occlusion ratios is measured, where higher values indicate better results. The results with the best and second-best metrics are respectively highlighted in red and blue.


Table 2. The SSIM of the inpainting results for different methods under various occlusion ratios is evaluated, where higher values indicate better results. The results with the best and second-best metrics are respectively highlighted in red and blue.

The proposed method consistently outperforms the existing deep learning methods (GLCIC, CTSDG, MAE) across the metrics, achieving the best inpainting performance under all occlusion rates, as shown in Table 1 and Table 2. The formulas for PSNR and SSIM [55-58] are as follows.

$$PSN{R_{inpainting}} = 20 \times {\log _{10}}\left( {\frac{{MA{X_I}}}{{\sqrt {MSE} }}} \right)$$
$$SSI{M_{inpainting}} = \frac{{({2{\mu_x}{\mu_y} + {c_1}} )({2{\sigma_{xy}} + {c_2}} )}}{{({\mu_x^2 + \mu_y^2 + {c_1}} )({\sigma_x^2 + \sigma_y^2 + {c_2}} )}}$$
where MSE denotes the mean squared error between the original and inpainted images and $MA{X_I}$ denotes the maximum pixel value of the image. In the SSIM formula, ${\mu _x}$ and ${\mu _y}$ represent the mean pixel values of the original and restored images, $\sigma _x^2$ and $\sigma _y^2$ denote their variances, ${\sigma _{xy}}$ represents their covariance, and c1 and c2 are constants introduced to prevent division by zero.
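For completeness, a direct NumPy evaluation of the two metrics is sketched below; the standard constants c1 = (0.01·MAX)² and c2 = (0.03·MAX)², and the global (non-windowed) form of SSIM, are simplifying assumptions.

```python
import numpy as np

def psnr(y, y_a, max_val=255.0):
    """Eq. (19): PSNR in dB between the ground truth y and the inpainted result y_a."""
    mse = np.mean((y.astype(np.float64) - y_a.astype(np.float64)) ** 2)
    return 20.0 * np.log10(max_val / np.sqrt(mse))

def ssim_global(y, y_a, max_val=255.0):
    """Eq. (20) evaluated over the whole image (practical SSIM averages it over
    local windows; the global form is used here for brevity)."""
    c1, c2 = (0.01 * max_val) ** 2, (0.03 * max_val) ** 2
    y, y_a = y.astype(np.float64), y_a.astype(np.float64)
    mx, my = y.mean(), y_a.mean()
    vx, vy = y.var(), y_a.var()
    cov = np.mean((y - mx) * (y_a - my))
    return ((2 * mx * my + c1) * (2 * cov + c2)) / ((mx**2 + my**2 + c1) * (vx + vy + c2))
```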

4.3 Image reconstruction results

To validate the superior performance of the image reconstruction network, comparisons are made with existing deep learning reconstruction networks, including FUnIE [59], BSRGAN [60], and ESRGAN [61]. The comparison is conducted at sampling rates of 1%, 3%, 5%, and 10%. Figure 10 illustrates the reconstruction results of the various methods.

Fig. 10. Reconstruction results at different sampling rates.

As shown in Fig. 10, the leftmost image is the ground truth, and the second image from the left depicts the sampled region at each of the four sampling rates. The resulting images are restored under 50% random occlusion and then reconstructed to recover details and textures; the images on the right show the reconstruction results of the four methods. At a 1% sampling rate, the reconstructions are relatively blurry with pronounced ringing artifacts, and the restored details are poor. As the sampling rate increases, the quality of the reconstructed images gradually improves. Both visually and in terms of metrics, our method outperforms the others. Even at a 10% sampling rate, FUnIE and BSRGAN still struggle to completely eliminate the ringing artifacts; while ESRGAN recovers better detail, the images reconstructed by our method most closely resemble the ground truth.

Based on the methods described above, the inference times of the Fourier forward model data acquisition stage, the image inpainting stage, and the image reconstruction stage are tested. In the forward model (FM), the reconstruction time for generating 256 × 256 images is tested at sampling rates of 1%, 3%, 5%, and 10%; each sampling rate is tested 100 times, and Table 3 reports the average reconstruction time. In the image inpainting and image reconstruction stages, inference time is measured over 1000 images for each comparative method; the results are presented in Tables 4 and 5. All tests are conducted on an i7-12700KF @ 3.60 GHz CPU and an NVIDIA 3090 GPU, and the time units are seconds.


Table 3. The average time for Fourier forward model sampling.


Table 4. The average time for each method in the image inpainting stage.


Table 5. The average time for each method in the image reconstruction stage.

4.4 Ablation experiment

This section explores the influence of sparse convolution, hierarchical scales, and perceptual loss on the image inpainting network. The first framework uses none of these components and contains only a Transformer. The second framework retains only the L2 loss (Ours_v1). The third framework replaces sparse convolution with regular convolution (Ours_v2). The fourth framework removes the hierarchical scales for feature extraction (Ours_v3). The fifth framework includes all of the above structures (Ours). The ablation results after removing each network module are illustrated in Fig. 11.

Fig. 11. The inpainting results of the ablation experiment.

The ablation experiment is conducted on the ImageNet test set with a random occlusion rate of 50%, evaluating performance on five categories of target objects. As shown in Fig. 11, in the network with only the Transformer, the restored region exhibits a noticeable boundary and the recovered image details are relatively poor. In the network retaining only the L2 loss (Ours_v1), the inpainting results show some improvement in detail compared with the Transformer-only network, but a significant boundary effect remains between the restored and non-occluded regions. Both the Ours_v2 and Ours_v3 networks improve the restored details, recovering better textures with a reduced boundary effect, which further emphasizes the benefit of sparse convolution and hierarchical scales for feature extraction. Our final network demonstrates varying degrees of improvement over each previous version. In conclusion, the proposed network exhibits superior performance across the five object categories, which is attributed to the combined feature extraction capability of sparse convolution, hierarchical scales, and the Transformer, enabling effective restoration of target objects under challenging hard-occlusion scenarios.

5. Real-world experiments

The real-world experimental system is depicted in Fig. 12, where TA denotes the transmitting antenna and RA denotes the receiving antenna. The laser used in the experiment is an OEM-I-532 with a wavelength of 532 nm, manufactured by Beijing Ming Lei Technology Company, and is equipped with a water-cooling unit (220 V supply, 300 W cooling capacity). It operates at a repetition frequency of 2 kHz with a pulse width of 10 ns, a polarization ratio greater than 100:1, and a pulse energy of 2 mJ. The laser beam is expanded by a beam expander and directed onto a computer-controlled Digital Micromirror Device (DMD). The maximum frequency of the DMD is 22 kHz, but to maintain synchronization it is set to 2 kHz. The detector (H117006P) receives the light signals reflected by the target and the occlusion; these signals are stored on a data acquisition card (M4x.4450-x4) manufactured by SPECTRUM. The target and the occlusion are distinguished based on the time sequence of the received light intensity signals. A narrowband filter is placed in front of the detector, and the data acquisition card has two channels with a sampling rate of 400 MS/s. The models trained on the ImageNet dataset in the numerical experiments are employed for testing in the real-world experiments to assess their performance.

Fig. 12. Real-world experiment system.

To validate the effectiveness of the proposed method, a doll was chosen as the experimental subject. In the experiment, the sampling rate is set to 10%, the distance between the occlusion and the target is 10 m, and the distance between the laser and the target is 32 m. The resolution of the reconstructed images is 256 × 256. After reconstruction, the occlusion and target images are fed into the image inpainting network for repair, and the repaired images then undergo image reconstruction to produce the final result, which removes the occlusion and restores image details. Throughout the inpainting and reconstruction process, the image size remains 256 × 256. The real-world experimental scene is shown in Fig. 13(a), and the results are presented in Fig. 13(c). From left to right, Fig. 13(c) shows the original image of the doll used in the experiment, the segmentation mask region obtained from the occlusion intensity image, the image inpainting result, and the image reconstruction result. As shown in Fig. 13(b), detecting the peaks of the received intensity signal yields the times ${t_{occlusion}}$ and ${t_{target}}$ corresponding to the occlusion and the target of interest. Reconstructing the spectral information at ${t_{occlusion}}$ and ${t_{target}}$ yields the intensity images of the occlusion and the target of interest. Finally, masking is applied based on the pixel region indicated by the occlusion image to obtain the masked area, which is then fed into the inpainting and reconstruction networks to obtain a clear and complete image of the target of interest.

Fig. 13. Real-world experiment: (a) the real-world experimental scenario, (b) the signal waveforms of the occlusion and the target of interest received by the detector, (c) the inpainting and reconstruction results of the real-world experiment.

6. Conclusion

This paper introduces a self-supervised image inpainting and reconstruction network for restoring and enhancing Fourier single pixel imaging under hard occlusion, addressing situations in which the region of interest is occluded and the imaging quality is low. The proposed method includes a distance-based segmentation stage, an image inpainting stage, and an image reconstruction stage. The occlusion and the region of interest are segmented based on their distance from the single pixel detector. Features for image inpainting are extracted using hierarchical sparse convolution and a Transformer structure: sparse convolution extracts local features and the Transformer facilitates global information interaction. In the image reconstruction stage, a U-Net is employed, incorporating channel attention and attention gate modules, which help the network learn feature information from different channels and scales to reconstruct high-quality images. The effectiveness of the proposed approach is validated through simulation and real-world experiments; compared with other methods, it consistently produces superior detail and visual results. Additionally, ablation experiments verify the contribution of each component and the generalization capability of the network. The work proposes a distance-slicing approach to occlusion in the presence of hard obstacles, broadening the application scope of FSPI and opening up new possibilities for further exploration in the field.

In summary, while the method proposed in the work shows promise, there are still areas for improvement, particularly regarding imaging resolution and imaging time. Future research directions could focus on enhancing imaging resolution by refining the design and algorithms of the imaging system, leveraging deep learning for signal processing to decompose waveforms and enhance resolution. Additionally, efforts to reduce imaging time, such as increasing the modulation speed of the DMD and optimizing hardware system response times, could lead to significant advancements in single pixel imaging technology.

Funding

National Natural Science Foundation of China (62301493, 62371163).

Disclosures

The authors declare no conflicts of interest.

Data availability

Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request.

Supplemental document

See Supplement 1 for supporting content.

References

1. H. T. Philipp, M. W. Tate, K. S. Shanks, et al., “Very-high dynamic range, 10,000 frames/second pixel array detector for electron microscopy,” Microsc. Microanal. 28(2), 425–440 (2022). [CrossRef]  

2. G. M. Gibson, S. D. Johnson, and M. J. Padgett, “Single-pixel imaging 12 years on: a review,” Opt. Express 28(19), 28190–28208 (2020). [CrossRef]  

3. R. Zhu, H. Feng, Y. Xiong, et al., “All-fiber reflective single-pixel imaging with long working distance,” Opt. Laser Technol. 158, 108909 (2023). [CrossRef]  

4. Y. Wang, K. Huang, J. Fang, et al., “Mid-infrared single-pixel imaging at the single-photon level,” Nat. Commun. 14(1), 1073 (2023). [CrossRef]  

5. J. Yan, H. Liu, Y. Wu, et al., “Recent progress of self-immobilizing and self-precipitating molecular fluorescent probes for higher-spatial-resolution imaging,” Biomaterials 301, 122281 (2023). [CrossRef]  

6. H. Deng, X. Gao, M. Ma, et al., “Fourier single-pixel imaging using fewer illumination patterns,” Appl. Phys. Lett. 114(22), 221906 (2019). [CrossRef]  

7. S. Rizvi, J. Cao, K. Zhang, et al., “Deringing and denoising in extremely under-sampled Fourier single pixel imaging,” Opt. Express 28(5), 7360–7374 (2020). [CrossRef]  

8. Z. Tang, T. Tang, J. Chen, et al., “Spatial temporal Fourier single-pixel imaging,” Opt. Lett. 48(8), 2066–2069 (2023). [CrossRef]  

9. Q. Y. Wu, J. Z. Yang, J. Y. Hong, et al., “An edge detail enhancement strategy based on Fourier single-pixel imaging,” Opt. Lasers Eng. 172, 107828 (2024). [CrossRef]  

10. M. Wenwen, S. Dongfeng, H. Jian, et al., “Sparse Fourier single-pixel imaging,” Opt. Express 27(22), 31490–31503 (2019). [CrossRef]  

11. A. Kamilaris and F. X. Prenafeta-Boldú, “Deep learning in agriculture: A survey,” Computers and electronics in agriculture 147, 70–90 (2018). [CrossRef]  

12. A. Mohammed and R. Kora, “A comprehensive review on ensemble deep learning: Opportunities and challenges,” Journal of King Saud University-Computer and Information Sciences, (2023).

13. C. Zheng, W. Wu, C. Chen, et al., “Deep learning-based human pose estimation: A survey,” ACM Comput. Surv. 56(1), 1–37 (2024). [CrossRef]  

14. R. Miikkulainen, J. Liang, E. Meyerson, et al., “Evolving deep neural networks,” Artificial intelligence in the age of neural networks and brain computing, Academic Press, 2024: 269–287.

15. M. Bertalmio, G. Sapiro, V. Caselles, et al., “Image inpainting,” Proceedings of the 27th annual conference on Computer graphics and interactive techniques. 2000: 417–424.

16. J. Jam, C. Kendrick, K. Walker, et al., “A comprehensive review of past and present image inpainting methods,” Computer vision and image understanding 203, 103147 (2021). [CrossRef]  

17. H. Xiang, Q. Zou, M. A. Nawaz, et al., “Deep learning for image inpainting: A survey,” Pattern Recognition 134, 109046 (2023). [CrossRef]  

18. S. C. Park, M. K. Park, and M. G. Kang, “Super-resolution image reconstruction: a technical overview,” IEEE Signal Process. Mag. 20(3), 21–36 (2003). [CrossRef]  

19. V. Antun, F. Renna, C. Poon, et al., “On instabilities of deep learning in image reconstruction and the potential costs of AI,” Proc. Natl. Acad. Sci. 117(48), 30088–30095 (2020). [CrossRef]  

20. Z. Li, F. Liu, W. Yang, et al., “A survey of convolutional neural networks: analysis, applications, and prospects,” IEEE Trans. Neural Netw. Learning Syst. 33(12), 6999–7019 (2021). [CrossRef]  

21. J. Gu, Z. Wang, J. Kuen, et al., “Recent advances in convolutional neural networks,” Pattern recognition 77, 354–377 (2018). [CrossRef]  

22. K. Han, Y. Wang, H. Chen, et al., “A survey on vision transformer,” IEEE Trans. Pattern Anal. Mach. Intell. 45(1), 87–110 (2022). [CrossRef]  

23. S. Khan, M. Naseer, M. Hayat, et al., “Transformers in vision: A survey,” ACM computing surveys 54(10s), 1–41 (2022). [CrossRef]  

24. T. Yao, Y. Li, Y. Pan, et al., “Dual vision transformer,” IEEE Trans. Pattern Anal. Mach. Intell. 45(9), 10870–10882 (2023). [CrossRef]  

25. A. Vaswani, N. Shazeer, N. Parmar, et al., “Attention is all you need,” Advances in neural information processing systems 30, 6000–6010 (2017).

26. B. Liu, M. Wang, H. Foroosh, et al., “Sparse convolutional neural networks,” Proceedings of the IEEE conference on computer vision and pattern recognition. 2015: 806–814.

27. O. Oktay, J. Schlemper, L. L. Folgoc, et al., “Attention u-net: Learning where to look for the pancreas,” arXiv, arXiv:1804.03999 (2018). [CrossRef]  

28. S. Rizvi, J. Cao, K. Zhang, et al., “Improving imaging quality of real-time Fourier single-pixel imaging via deep learning,” Sensors 19(19), 4190 (2019). [CrossRef]  

29. Z. Qiu, X. Guo, T. Lu, et al., “Efficient Fourier single-pixel imaging with Gaussian random sampling,” Photonics 8(8), 319 (2021). [CrossRef]  

30. J. Li, K. Cheng, S. Qi, et al., “Full-resolution, full-field-of-view, and high-quality fast Fourier single-pixel imaging,” Opt. Lett. 48(1), 49–52 (2023). [CrossRef]  

31. J. Yao, Z. Jiang, X. Lv, et al., “Adaptive Fourier single-pixel imaging based on directional energy continuity in high frequencies,” Opt. Lasers Eng. 162, 107406 (2023). [CrossRef]  

32. X. Yang, P. Jiang, M. Jiang, et al., “High imaging quality of Fourier single pixel imaging based on generative adversarial networks at low sampling rate,” Opt. Lasers Eng. 140, 106533 (2021). [CrossRef]  

33. X. Yang, X. Jiang, P. Jiang, et al., “S2O-FSPI: Fourier single pixel imaging via sampling strategy optimization,” Opt. Laser Technol. 166, 109651 (2023). [CrossRef]  

34. D. Pathak, P. Krahenbuhl, J. Donahue, et al., “Context encoders: Feature learning by inpainting,” Proceedings of the IEEE conference on computer vision and pattern recognition. 2016: 2536–2544.

35. U. Demir and G. Unal, “Patch-based image inpainting with generative adversarial networks,” arXiv, arXiv:1803.07422, (2018). [CrossRef]  

36. Z. Yan, X. Li, M. Li, et al., “Shift-net: Image inpainting via deep feature rearrangement,” Proceedings of the European conference on computer vision (ECCV). 2018: 1–17.

37. X. Zhao, H. Wei, H. Wang, et al., “3D-CNN-based feature extraction of ground-based cloud images for direct normal irradiance prediction,” Sol. Energy 181, 510–518 (2019). [CrossRef]  

38. K. Tian, Y. Jiang, Q. Diao, et al., “Designing bert for convolutional networks: Sparse and hierarchical masked modeling,” arXiv, arXiv:2301.03580, (2023). [CrossRef]  

39. L. Floridi and M. Chiriatti, “GPT-3: Its nature, scope, limits, and consequences,” Minds and Machines 30(4), 681–694 (2020). [CrossRef]  

40. K. He, X. Chen, S. Xie, et al., “Masked autoencoders are scalable vision learners,” Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2022: 16000–16009.

41. P. Gao, T. Ma, H. Li, et al., “Convmae: Masked convolution meets masked autoencoders,” arXiv, arXiv:2205.03892, (2022). [CrossRef]  

42. A. Srinivas, T. Y. Lin, N. Parmar, et al., “Bottleneck transformers for visual recognition,” Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2021: 16519–16529.

43. P. Gao, J. Lu, H. Li, et al., “Container: Context aggregation network,” arXiv, arXiv:2106.01401, (2021). [CrossRef]  

44. K. Li, Y. Wang, J. Zhang, et al., “Uniformer: Unifying convolution and self-attention for visual recognition,” IEEE Trans. Pattern Anal. Mach. Intell. 45(10), 12581–12600 (2023). [CrossRef]  

45. Z. Dai, H. Liu, Q. V. Le, et al., “Coatnet: Marrying convolution and attention for all data sizes,” Advances in neural information processing systems 34, 3965–3977 (2021).

46. S. d’Ascoli, H. Touvron, M. L. Leavitt, et al., “Convit: Improving vision transformers with soft convolutional inductive biases,” International Conference on Machine Learning. PMLR, 2021: 2286–2296.

47. T. Xiao, M. Singh, E. Mintun, et al., “Early convolutions help transformers see better,” Advances in neural information processing systems 34, 30392–30400 (2021).

48. Z. Liu, Y. Lin, Y. Cao, et al., “Swin transformer: Hierarchical vision transformer using shifted windows,” Proceedings of the IEEE/CVF international conference on computer vision. 2021: 10012–10022.

49. W. Wang, E. Xie, X. Li, et al., “Pyramid vision transformer: A versatile backbone for dense prediction without convolutions,” Proceedings of the IEEE/CVF international conference on computer vision. 2021: 568–578.

50. F. Xu, F. Li, and Y. Wang, “Modified Levenberg–Marquardt-based optimization method for LiDAR waveform decomposition,” IEEE Geosci. Remote Sensing Lett. 13(4), 530–534 (2016). [CrossRef]  

51. G. Mountrakis and Y. Li, “A linearly approximated iterative Gaussian decomposition method for waveform LiDAR processing,” ISPRS journal of photogrammetry and remote sensing 129, 200–211 (2017). [CrossRef]  

52. J. de Jesús Rubio, “Stability analysis of the modified Levenberg–Marquardt algorithm for the artificial neural network training,” IEEE Trans. Neural Netw. Learning Syst. 32(8), 3510–3524 (2021). [CrossRef]  

53. S. Iizuka, E. Simo-Serra, and H. Ishikawa, “Globally and locally consistent image completion,” ACM Trans. Graph. 36(4), 1–14 (2017). [CrossRef]  

54. X. Guo, H. Yang, and D. Huang, “Image inpainting via conditional texture and structure dual generation,” Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021: 14134–14143.

55. D. R. I. M. Setiadi, “PSNR vs SSIM: imperceptibility quality assessment for image steganography,” Multimed. Tools Appl. 80(6), 8423–8444 (2021). [CrossRef]  

56. A. Hore and D. Ziou, “Image quality metrics: PSNR vs. SSIM,” 2010 20th international conference on pattern recognition. IEEE, 2010: 2366–2369.

57. A. Tanchenko, “Visual-PSNR measure of image quality,” Journal of Visual Communication and Image Representation 25(5), 874–878 (2014). [CrossRef]  

58. G. Palubinskas, “Image similarity/distance measures: what is really behind MSE and SSIM?” International Journal of Image and Data Fusion 8(1), 32–53 (2017). [CrossRef]  

59. M. J. Islam, Y. Xia, and J. Sattar, “Fast underwater image enhancement for improved visual perception,” IEEE Robot. Autom. Lett. 5(2), 3227–3234 (2020). [CrossRef]  

60. K. Zhang, J. Liang, L. Van Gool, et al., “Designing a practical degradation model for deep blind image super-resolution,” Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021: 4791–4800.

61. X. Wang, K. Yu, S. Wu, et al., “Esrgan: Enhanced super-resolution generative adversarial networks,” Proceedings of the European conference on computer vision (ECCV) workshops. 2018. [CrossRef]  

Supplementary Material (1)

Supplement 1: The revised supplementary material

Data availability

Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request.



Figures (13)

Fig. 1. Fourier single pixel imaging schematic.
Fig. 2. The intensity signal diagram of occlusion and the target of interest.
Fig. 3. The overall network.
Fig. 4. Principle of distance-based image segmentation.
Fig. 5. Image inpainting network framework.
Fig. 6. Sparse convolution illustration.
Fig. 7. The framework of the image reconstruction network.
Fig. 8. The loss function curves during training of the restoration network: (a) per-pixel loss, (b) mean squared error loss, (c) perceptual loss, and (d) total loss.
Fig. 9. Results of inpainting with different occlusion proportions for different methods.
Fig. 10. Reconstruction results at different sampling rates.
Fig. 11. The inpainting results of the ablation experiment.
Fig. 12. Real-world experiment system.
Fig. 13. Real-world experiment: (a) the real-world experimental scenario, (b) the signal waveforms of occlusion and the target of interest received by the detector, (c) the real-world inpainting and reconstruction results.

Tables (5)

Table 1. PSNR of the inpainting results for different methods under various occlusion ratios; higher values indicate better results. The best and second-best results are highlighted in red and blue, respectively.
Table 2. SSIM of the inpainting results for different methods under various occlusion ratios; higher values indicate better results. The best and second-best results are highlighted in red and blue, respectively.
Table 3. The average time for Fourier forward model sampling.
Table 4. The average time for each method in the image inpainting stage.
Table 5. The average time for each method in the image reconstruction stage.

Equations (20)


$P_\varphi(x, y; f_x, f_y) = a + b\cos(2\pi f_x x + 2\pi f_y y + \varphi)$

$E(x, y, t) = P_\varphi(x, y; f_x, f_y)\, E_0\, S(t)$

$D_\varphi(f_x, f_y, t) = D_n + k \iint R_1(x, y)\, E(x, y)\, S(t - t_1)\, \mathrm{d}x\, \mathrm{d}y + k \iint R_2(x, y)\, E(x, y)\, S(t - t_2)\, \mathrm{d}x\, \mathrm{d}y$

$F(x, y, t) = \left[ 2 D_0(f_x, f_y, t) - D_{2\pi/3}(f_x, f_y, t) - D_{4\pi/3}(f_x, f_y, t) \right] + \sqrt{3}\, j \left[ D_{2\pi/3}(f_x, f_y, t) - D_{4\pi/3}(f_x, f_y, t) \right]$

$F(x, y, t) = F_1(x, y, t)\, S(t - t_1) + F_2(x, y, t)\, S(t - t_2)$

$\begin{cases} I_{\mathrm{target}} = \mathrm{IFFT}\{ D[ F_2(x, y, t) ] \} \\ I_{\mathrm{occlusion}} = \mathrm{IFFT}\{ D[ F_1(x, y, t) ] \} \end{cases}$

$I_{\mathrm{inpainting}} = \mathrm{II}(I_{\mathrm{target}}, I_{\mathrm{mask}})$

$I_{\mathrm{reconstruction}} = \mathrm{IR}(I_{\mathrm{inpainting}})$

$S_i = \gamma_i(S_{i-1}) \odot M_i, \quad i \in \{3, 2, 1\}$

$D_i = U_i(D_{i+1}) \odot M_i, \quad i \in \{3, 2, 1\}$

$PE_{(pos, 2i)} = \sin\!\left( pos / 10000^{2i / d_{\mathrm{model}}} \right)$

$PE_{(pos, 2i+1)} = \cos\!\left( pos / 10000^{2i / d_{\mathrm{model}}} \right)$

$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left( \frac{Q K^{T}}{\sqrt{d_k}} \right) V$

$L_1 = \frac{1}{n} \sum_{i=1}^{n} \left| y_a - y \right|$

$L_2 = \frac{1}{n} \sum_{i=1}^{n} \left( y_a - y \right)^2$

$L_{\mathrm{per}} = \frac{1}{m} \sum_{i=1}^{m} \left[ \phi_i(y_a) - \phi_i(y) \right]^2$

$L = L_1 + L_2 + \lambda L_{\mathrm{per}}$

$L_2 = \frac{1}{m} \sum_{i=1}^{m} \left( y_b - y_a \right)^2$

$\mathrm{PSNR}_{\mathrm{inpainting}} = 20 \times \log_{10}\!\left( \frac{MAX_I}{\sqrt{MSE}} \right)$

$\mathrm{SSIM}_{\mathrm{inpainting}} = \frac{(2 \mu_x \mu_y + c_1)(2 \sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)}$
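
As a concrete illustration of the phase-shifting relations above (the Fourier basis pattern, the three-step coefficient combination, and the inverse-FFT reconstruction), the following NumPy sketch simulates noise-free single-pixel measurements of a known scene. The function names, the 64×64 test scene, and the ±8-frequency sampling range are illustrative assumptions rather than the authors' implementation; detector gain, noise, and the occlusion/target separation step are omitted.

import numpy as np

def fourier_basis(h, w, fx, fy, phase, a=0.5, b=0.5):
    """Grayscale Fourier basis pattern P_phi(x, y; fx, fy)."""
    y, x = np.mgrid[0:h, 0:w]
    return a + b * np.cos(2 * np.pi * fx * x / w + 2 * np.pi * fy * y / h + phase)

def measure_spectrum(img, freqs):
    """Three-step phase shifting: one complex Fourier coefficient per (fx, fy) pair."""
    h, w = img.shape
    spectrum = np.zeros((h, w), dtype=complex)
    for fx, fy in freqs:
        d = [np.sum(img * fourier_basis(h, w, fx, fy, p))      # single-pixel intensity D_phi
             for p in (0.0, 2 * np.pi / 3, 4 * np.pi / 3)]
        coeff = (2 * d[0] - d[1] - d[2]) + np.sqrt(3) * 1j * (d[1] - d[2])
        spectrum[fy % h, fx % w] = coeff
        spectrum[(-fy) % h, (-fx) % w] = np.conj(coeff)         # Hermitian symmetry of a real image
    return spectrum

# Low-frequency under-sampling followed by inverse-FFT reconstruction;
# the result is proportional to a low-pass filtered version of the scene.
scene = np.random.rand(64, 64)
freqs = [(fx, fy) for fx in range(-8, 9) for fy in range(-8, 9)]
recon = np.real(np.fft.ifft2(measure_spectrum(scene, freqs)))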
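
The two masking relations for S_i and D_i can be read as: at every scale, features are kept only at positions the visibility mask marks as observed. The sketch below mimics that behaviour in plain NumPy, using average pooling and nearest-neighbour upsampling as stand-ins for the learned encoder stages gamma_i and decoder stages U_i; the three-scale setup, the 75% masking ratio, and the mask-resizing rule are assumptions, not the paper's network.

import numpy as np

def resize_mask(mask, shape):
    """Nearest-neighbour resize of a binary visibility mask to a feature-map shape."""
    ys = np.arange(shape[0]) * mask.shape[0] // shape[0]
    xs = np.arange(shape[1]) * mask.shape[1] // shape[1]
    return mask[np.ix_(ys, xs)]

def downsample2x(x):
    """2x average pooling, standing in for an encoder stage gamma_i."""
    return 0.25 * (x[0::2, 0::2] + x[1::2, 0::2] + x[0::2, 1::2] + x[1::2, 1::2])

def upsample2x(x):
    """Nearest-neighbour upsampling, standing in for a decoder stage U_i."""
    return np.repeat(np.repeat(x, 2, axis=0), 2, axis=1)

# Encoder: S_i = gamma_i(S_{i-1}) ⊙ M_i — features survive only at visible positions.
image = np.random.rand(64, 64)
mask = (np.random.rand(64, 64) > 0.75).astype(float)   # 1 = visible, 0 = occluded/masked
s = image * mask
encoder_feats = []
for _ in range(3):
    s = downsample2x(s) * resize_mask(mask, (s.shape[0] // 2, s.shape[1] // 2))
    encoder_feats.append(s)

# Decoder: D_i = U_i(D_{i+1}) ⊙ M_i — the mask is re-applied at every upsampled scale.
d = encoder_feats[-1]
for _ in range(3):
    d = upsample2x(d)
    d = d * resize_mask(mask, d.shape)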
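
The positional-encoding and attention formulas are the standard transformer ones; the following NumPy sketch restates them directly, with a 16-token sequence and a model width of 32 chosen purely for illustration (the actual patch count and embedding size of the paper's transformer are not reproduced here).

import numpy as np

def positional_encoding(n_pos, d_model):
    """Sinusoidal encoding: PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(...)."""
    pos = np.arange(n_pos)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angle = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((n_pos, d_model))
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe

def attention(q, k, v):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))   # numerically stable softmax
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

tokens = np.random.rand(16, 32) + positional_encoding(16, 32)   # 16 patch tokens, d_model = 32
out = attention(tokens, tokens, tokens)                         # self-attention over the patches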
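
The training objective above combines a per-pixel L1 term, a mean-squared-error term, and a perceptual term computed on feature maps phi_i. A minimal sketch follows, assuming the feature maps are supplied by some external extractor and using lambda = 0.1 as a placeholder weight; the paper's actual extractor and weighting are not restated here.

import numpy as np

def combined_loss(y_pred, y_true, feats_pred, feats_true, lam=0.1):
    """L = L1 + L2 + lambda * L_per, with L_per averaged over m pairs of feature maps."""
    l1 = np.mean(np.abs(y_pred - y_true))                 # per-pixel absolute error
    l2 = np.mean((y_pred - y_true) ** 2)                  # mean squared error
    l_per = np.mean([np.mean((fp - ft) ** 2)              # squared feature-map differences
                     for fp, ft in zip(feats_pred, feats_true)])
    return l1 + l2 + lam * l_per

# Toy usage with random arrays standing in for network output, ground truth, and feature maps.
y_pred, y_true = np.random.rand(64, 64), np.random.rand(64, 64)
feats = [np.random.rand(32, 32), np.random.rand(16, 16)]
loss = combined_loss(y_pred, y_true, feats, [f + 0.01 for f in feats])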
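
Finally, the two evaluation metrics can be sketched as follows. The SSIM here is computed from global image statistics rather than the usual sliding Gaussian window, and the constants c1 and c2 follow the common (0.01 MAX_I)^2 and (0.03 MAX_I)^2 convention; both simplifications are assumptions for illustration only.

import numpy as np

def psnr(x, y, max_i=1.0):
    """PSNR = 20 * log10(MAX_I / sqrt(MSE))."""
    mse = np.mean((x - y) ** 2)
    return 20 * np.log10(max_i / np.sqrt(mse))

def ssim_global(x, y, max_i=1.0):
    """SSIM = ((2 mu_x mu_y + c1)(2 sigma_xy + c2)) / ((mu_x^2 + mu_y^2 + c1)(sigma_x^2 + sigma_y^2 + c2))."""
    c1, c2 = (0.01 * max_i) ** 2, (0.03 * max_i) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = np.mean((x - mu_x) * (y - mu_y))
    return ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / \
           ((mu_x**2 + mu_y**2 + c1) * (var_x + var_y + c2))

a, b = np.random.rand(64, 64), np.random.rand(64, 64)
print(psnr(a, b), ssim_global(a, b))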