
Adaptive locating foveated ghost imaging based on affine transformation

Open Access

Abstract

Ghost imaging (GI) has been widely used in applications including spectral imaging, 3D imaging, and other fields owing to its broad spectral range and anti-interference capability. Nevertheless, the restricted sampling efficiency of ghost imaging has impeded its extensive application. In this work, we propose a novel foveated pattern affine transformer method based on deep learning for efficient GI. This method enables adaptive selection of the region of interest (ROI) by combining the proposed retina affine transformer (RAT) network, which requires minimal computation and few parameters, with the foveated speckle pattern. For single-target and multi-target scenarios, we propose the RAT and RNN-RAT (recurrent neural network) models, respectively. The RAT network adaptively adjusts the fovea of the variable-resolution foveated patterns to targets of different sizes and positions by predicting an affine matrix with only a small number of parameters, enabling efficient GI. In addition, we integrate a recurrent neural network into the proposed RAT to form an RNN-RAT model capable of multi-target ROI detection. Simulation and experimental results show that the method achieves ROI localization and pattern generation in 0.358 ms, an efficiency improvement of about 1 × 10⁵ over previous methods, while improving the image quality of the ROI by more than 4 dB. This approach not only broadens the applicability of GI but also enhances the reconstruction quality of the ROI, creating new opportunities for real-time GI.

© 2024 Optica Publishing Group under the terms of the Optica Open Access Publishing Agreement

1. Introduction

Ghost imaging (GI) is a second-order correlation imaging technique based on optical fields. It differs from traditional pixel-array imaging techniques in that it uses spatial light modulation (SLM) devices to modulate the illumination of the object being imaged and collects the light with single-pixel detectors that have no spatial resolution; the spatial intensity information of the target is obtained through many modulation iterations and reconstructed using correlation algorithms [1–4]. GI offers the advantage of offline operation due to its independent sampling and reconstruction processes [5]. It also has anti-disturbance ability, high sensitivity, and a wide spectral range [6], making it promising for various applications such as spectral imaging [7–9], X-ray imaging [10–12], terahertz imaging [13–15], and three-dimensional imaging [16–18].

In practical applications, the imaging speed of ghost imaging is influenced by several factors, including the modulation rate of spatial light modulators such as the digital micromirror device (DMD) or mechanical modulators, the reconstruction algorithms used, and the design of the spatial light field. Achieving a balance between imaging quality and imaging efficiency is a significant challenge that must be addressed to enhance the performance of GI. Because a considerable number of patterns must be projected within a single imaging process, the design of the patterns plays a crucial role in improving the sampling speed and imaging quality. In 2017, Phillips et al. [19], inspired by the foveated region at the center of the retina, proposed a modulation pattern that emulates the retinal structure. Through experimental validation, they demonstrated that retina-like patterns enable high-quality imaging of regions of interest (ROI) and low-quality imaging of non-interest areas while maintaining a constant number of samples. This method has since continued to attract attention and has been extended in subsequent studies [20–23]. In 2021, Cao et al. [24] focused on improving the imaging quality in scenarios with non-orthogonal patterns, where the lack of a priori knowledge posed a challenge. Their approach optimized retina-like patterns by using principal component analysis (PCA) to generate the ROI for pattern filling. This ROI incorporated prior information about the sparsity of the object, leading to enhanced imaging quality in the ROI. However, the foveated region in these works cannot be quickly and adaptively adjusted to the varying sizes and positions of objects. Consequently, it is unsuitable for more intricate real-life scenarios. For instance, in certain situations we want to concentrate exclusively on vehicles on the road, pedestrians, or other ROIs with diverse sizes and positions while disregarding other background areas. It is therefore imperative that the foveated region dynamically covers the section containing these essential targets.

The landscape of GI has been drastically altered and pushed forward by the adoption of deep learning [25]. The rapid advancement of deep learning has led to the emergence of various methodologies that use this technology to address fundamental challenges in GI [26]. For instance, in 2019, Zhai et al. [27] introduced a foveated ghost imaging approach based on deep learning (DPFGI). DPFGI integrates the foveated pattern technique, inspired by the human visual system, in which resolution gradually decreases from the center towards the periphery, with a target detection network. This integration aims to enhance the imaging quality and efficiency of targets within the identified ROI. However, the pattern shape used in that method is limited to rectangles and cannot be flexibly transformed in space, and pattern generation for different ROIs is time-consuming. In 2021, Yang et al. [28] introduced a deep learning-based method for parallel single-pixel imaging, target localization, and classification. The technique locates and classifies single or multiple targets in a scene using a single-pixel camera while achieving multi-task learning with an end-to-end approach. Currently, most existing methods for extracting the ROI of the target are based on target detection. However, these methods require generating a great number of projection patterns for varying target locations and sizes, and the accumulated time seriously impedes the imaging efficiency of ghost imaging.

Therefore, inspired by the way human consciousness directs the eyes to acquire the ROI, we propose a novel foveated pattern affine transformer method based on deep learning for efficient GI. Building on the advantages of flexible spatial transformation of the foveated pattern and effective improvement of the reconstruction quality of the ROI, the approach introduces a novel network with low parameter and floating-point operation (FLOPs) requirements, the retina affine transformer (RAT) network. The backbone of the model is divided into two paths, each with a distinct purpose. The context path extracts the feature information of the object against a complex picture background, while the spatial path models the spatial distribution of the features. The information from both paths achieves a parallel adaptive change of variable-resolution foveated patterns for targets of different sizes and locations, accomplished by predicting a small number of affine matrix parameters. The proposed architecture achieves excellent speed and localization performance. Additionally, to perform multi-target ROI detection simultaneously, we integrate a recurrent neural network (RNN) into the RAT network to create the RNN-RAT network, which allows multiple targets to be continuously identified and localized in complex backgrounds. Its efficiency and superiority have been validated through numerical simulation and experiment. This work helps to promote the development of GI in real-time detection and other resource-constrained multi-task detection fields.

2. Methods

2.1 Foveated GI reconstruction method

In the foveated GI system, the detection value acquired by single-pixel detectors with no spatial resolution can be written as:

$${I_t} = \sum\limits_x {\sum\limits_y {{F_t}({x,y} )O} } ({x,y} ),$$
where O(x,y) represents the target to be imaged; x and y index the position in the Cartesian coordinate system; and Ft(x,y) represents the variable-resolution foveated patterns. While keeping a constant number of samples, variable-resolution patterns allow high-quality imaging of the ROI and low-quality imaging of non-interest regions. A log-polar transformation is applied to obtain the retina-like variable-resolution allocation structure used in GI; this design has been shown in previous work [20].
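To make the measurement model concrete, the following minimal sketch (a simplified Python/NumPy illustration; the array names and the random binary patterns are placeholders, not the authors' implementation) simulates the single-pixel detection values of Eq. (1):

```python
import numpy as np

def bucket_measurements(obj, patterns):
    """Simulate single-pixel detection values I_t = sum_xy F_t(x, y) * O(x, y).

    obj      : (H, W) array, the target scene O(x, y).
    patterns : (T, H, W) array, the modulation patterns F_t(x, y).
    Returns a length-T vector of bucket values, one per projected pattern.
    """
    T = patterns.shape[0]
    A = patterns.reshape(T, -1)      # modulation matrix, one pattern per row
    b = A @ obj.reshape(-1)          # detection values I_t (inner product per pattern)
    return b

# Example with placeholder data: a 128x128 scene and 655 binary patterns (~4% sampling).
rng = np.random.default_rng(0)
scene = rng.random((128, 128))
patterns = (rng.random((655, 128, 128)) > 0.5).astype(float)
I = bucket_measurements(scene, patterns)   # shape (655,)
```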

The reconstruction stage uses a total variation (TV) regularization prior, which is based on the principle of compressive sensing (CS); even in the case of under-sampling it still recovers a good-quality image and reduces the number of measurements needed for reconstruction [29]. Since CS reconstruction is a process of solving underdetermined equations, it is an NP-hard problem that can be transformed into an L1-norm optimization problem, provided that the restricted isometry property (RIP) is satisfied, as expressed below:

$$\begin{array}{rl} \min & {\|{c^{\prime}}\|_{l_1}}\\ \textrm{s.t.} & GO = c^{\prime},\\ & AO = b, \end{array}$$
where ${\|{c^{\prime}}\|_{l_1}}$ is the ${l_1}$ norm of the coefficient vector $c^{\prime}$ corresponding to the computed image; G is the matrix that computes the gradient of the image; $O \in {R^{a \times 1}}$ is the target to be reconstructed, arranged as a vector; $A \in {R^{m \times a}}$ is the modulation matrix of the light field; m is the number of modulated patterns; a is the number of pixels in each pattern; and $b \in {R^{m \times 1}}$ is the measurement vector.
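As a rough illustration of how the modulation matrix A and the measurement vector b enter the reconstruction, the sketch below minimizes a TV-regularized least-squares surrogate of Eq. (2) by plain gradient descent on a smoothed TV term. This is only a didactic stand-in for the CS solver referenced in [29]; the smoothing constant eps, the regularization weight lam, and the step-size bound are illustrative choices.

```python
import numpy as np

def tv_gradient(img, eps=1e-3):
    """Gradient of a smoothed isotropic TV term sum(sqrt(dx^2 + dy^2 + eps)) w.r.t. the image."""
    dx = np.zeros_like(img); dx[:, :-1] = img[:, 1:] - img[:, :-1]   # forward differences
    dy = np.zeros_like(img); dy[:-1, :] = img[1:, :] - img[:-1, :]
    mag = np.sqrt(dx**2 + dy**2 + eps)
    px, py = dx / mag, dy / mag
    g = np.zeros_like(img)
    g[:, :-1] -= px[:, :-1]; g[:, 1:] += px[:, :-1]                  # adjoint of the differences
    g[:-1, :] -= py[:-1, :]; g[1:, :] += py[:-1, :]
    return g

def tv_reconstruct(A, b, shape, lam=0.05, iters=2000, eps=1e-3):
    """Minimize ||A x - b||^2 + lam * TV(x) by gradient descent (didactic surrogate of Eq. (2))."""
    H, W = shape
    x = np.zeros(H * W)
    # Rough, conservative Lipschitz bound used only to pick a safe step size for this sketch.
    L = 2.0 * np.linalg.norm(A, 2) ** 2 + 8.0 * lam / np.sqrt(eps)
    step = 1.0 / L
    for _ in range(iters):
        grad = 2.0 * A.T @ (A @ x - b) + lam * tv_gradient(x.reshape(H, W)).ravel()
        x -= step * grad
    return x.reshape(H, W)
```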

2.2 Foveated GI method based on an affine transformer

In this section, the ghost imaging process based on the proposed RAT method is presented, shown in Fig. 1 schematically.

Fig. 1. Schematic diagram of GI method based on Retina Affine Transformer network.

The target object is initially under-sampled with uniform-resolution patterns to obtain an under-sampled object, which is then fed into the network. To achieve efficient inference, the network comprises a spatial path and a semantic path, and the final feature layers of the two paths are efficiently fused. The semantic path extracts the feature information of the object against the complex picture background, while the spatial path models the spatial distribution of the features. The fused region-aware feature embedding, denoted as F in Fig. 1, is input into the affine matrix generation module, where C represents the entire feature layer. The feature vectors are then passed through f(·), which consists of two fully connected layers; this transformation yields the six parameters of the affine matrix, and the inverse transformation matrix is obtained by inverting it. Through these steps, the information from the two paths achieves a parallel adaptive change of foveated patterns for objects of different sizes and locations, accomplished by predicting a small number of affine matrix parameters. Next, the under-sampled object and the inverse transformation matrix are fed into the under-sampled object transformer module, where the under-sampled object undergoes an affine transformation using the affine matrix, followed by Sobel filtering to separate the object from the background and filling of the connected domain to obtain the corresponding label. The Intersection over Union (IoU) loss against the ground truth is then used as feedback for training the network module. The inverse transformation matrix proceeds to the foveated pattern transformer module, where the foveated pattern, initially centered at the origin, is mapped by the inverse transformation matrix to generate the transformed pattern. This pattern covers the region of interest of the object, including its foveated area. Finally, the reconstruction module is activated: the foveated patterns are projected onto the imaging object, yielding the final reconstructed image.
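The affine-matrix generation step described above can be sketched as follows. This is a hypothetical PyTorch fragment, assuming the fused region-aware feature F has already been produced by the two paths; the layer widths and the identity-initialization choice are illustrative, not the exact configuration of the paper.

```python
import torch
import torch.nn as nn

class AffineHead(nn.Module):
    """Maps the fused region-aware feature F to the 6 parameters of a 2x3 affine matrix."""
    def __init__(self, feat_dim=128, hidden=64):
        super().__init__()
        self.f = nn.Sequential(                 # f(.) : two fully connected layers
            nn.Linear(feat_dim, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, 6),
        )
        # Start from the identity transform so early training is stable (illustrative choice).
        nn.init.zeros_(self.f[-1].weight)
        self.f[-1].bias.data = torch.tensor([1., 0., 0., 0., 1., 0.])

    def forward(self, feat):                    # feat: (N, feat_dim)
        theta = self.f(feat).view(-1, 2, 3)     # forward affine matrix A_theta
        # Invert by embedding the 2x3 matrix into a 3x3 homogeneous matrix.
        bottom = torch.tensor([0., 0., 1.], device=theta.device).expand(theta.shape[0], 1, 3)
        full = torch.cat([theta, bottom], dim=1)        # (N, 3, 3)
        theta_inv = torch.linalg.inv(full)[:, :2, :]    # inverse transformation matrix
        return theta, theta_inv
```

The forward matrix is used to transform the under-sampled object for supervision, while theta_inv is what warps the centered foveated pattern onto the ROI.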

Next, the affine transformation method applied in the proposed transformer module is described in detail. The affine transformation can dynamically and actively transform the input image or feature map according to different input samples to achieve an adaptive selection of the region of interest during training. The following equation represents the transformation of the input and output points in the 2D affine transformation:

$$\left( \begin{array}{c} x_i^s\\ y_i^s \end{array} \right) = \mathrm{T}_\theta ({G_i}) = \mathrm{A}_\theta \left( \begin{array}{c} x_i^t\\ y_i^t\\ 1 \end{array} \right) = \left[ \begin{array}{ccc} \theta_1 & \theta_2 & \theta_3\\ \theta_4 & \theta_5 & \theta_6 \end{array} \right]\left( \begin{array}{c} x_i^t\\ y_i^t\\ 1 \end{array} \right),$$
where $({x_i^t,y_i^t} )$ are the coordinates in the output feature map; $({x_i^s,y_i^s} )$ are the source coordinates in the input feature map; ${A_\theta }$ is the affine transformation matrix; $\theta $ are the transformation parameters; and ${\mathrm{T}_\theta }$ is the sampling grid that represents the affine transformation.

It is worth noting that, for the generated image, each target pixel must be mapped by the affine transformation back to the under-sampled image to obtain its original-image coordinates. In practice, the calculated original-image coordinates are fractional, and these fractional pixel positions are evaluated in the original image by bilinear interpolation:

$$V_i^c = \sum\limits_n^H {\sum\limits_m^W {U_{nm}^c} } \max ({0,1 - |{x_i^s - m} |} )\max ({0,1 - |{y_i^s - n} |} ),$$
where $V_i^c$ is the output pixel value obtained by bilinear interpolation, and $U_{nm}^c$ is the input pixel value at location (n, m) on channel c. The model can be trained directly by gradient descent because bilinear interpolation makes the sampling differentiable, allowing the gradient of the loss with respect to the parameters to flow backward through the network. Therefore, integrating the affine transformation throughout the process allows the network to be trained while transforming the variable-resolution foveated pattern, resulting in coverage of regions of interest with varying sizes and locations.
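Equations (3) and (4) correspond to the standard differentiable warping used in spatial transformer networks; a minimal PyTorch sketch (not the authors' code, with an illustrative translation matrix) is:

```python
import torch
import torch.nn.functional as F

def warp_with_affine(image, theta):
    """Apply a 2x3 affine matrix to an image with differentiable bilinear sampling.

    image : (N, C, H, W) tensor, e.g. the under-sampled object or a foveated pattern.
    theta : (N, 2, 3) tensor of affine parameters as in Eq. (3).
    """
    # Build the sampling grid T_theta(G): target coordinates mapped to source coordinates.
    grid = F.affine_grid(theta, image.shape, align_corners=False)
    # Bilinear interpolation of Eq. (4); gradients flow through both image and theta.
    return F.grid_sample(image, grid, mode='bilinear', align_corners=False)

# Example: shift a centered 128x128 pattern with a translation-only affine matrix.
pattern = torch.rand(1, 1, 128, 128)
theta = torch.tensor([[[1.0, 0.0, 0.3],
                       [0.0, 1.0, -0.2]]])   # translation in normalized coordinates
moved = warp_with_affine(pattern, theta)
```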

2.3 Network architecture

To address the inefficiency of existing methods in acquiring a priori knowledge of the target position, we combine the advantages of variable-resolution patterns and affine transformation to propose the RAT network, whose schematic structure is shown in Fig. 2. The traditional affine module uses fully connected layers for parameter prediction [30]. However, for the same capacity, a fully connected layer must learn more parameters, is computationally expensive, and is prone to overfitting, whereas a convolutional layer requires far fewer parameters and less computation and is better suited to image recognition tasks. Therefore, we use convolutional layers in place of fully connected layers, which makes the new model faster and lighter; the specific improvements are as follows:

Fig. 2. Structure of Retina Affine Transformer Network.

To improve both the operating speed and the discriminative ability of the network, spatial information and a sufficient receptive field are essential for high accuracy [31]. We therefore adopt the design concept of a bilateral network with two paths: one path learns spatial information, and the other learns semantic information. These two components counteract the loss of spatial information and the contraction of the receptive field, respectively, and identify the location information and semantic features of the target area, as shown in the Local net of Fig. 2.

For the spatial information path, the acquisition of spatial location information is improved by stacking three convolutional layers and adding a CoordConv layer. Traditional convolutional layers are translation invariant, which makes it difficult for them to perceive spatial location information. To solve this problem, a CoordConv layer [32] is added. This layer appends two extra channels, the i coordinate and the j coordinate, to its input, which helps the model locate and identify the ROI more accurately while keeping the number of parameters small and the computation efficient. In addition, the CoordConv layer can both perform an identity mapping and learn additional coordinate information, which can be adjusted dynamically according to the task requirements. It also has stronger generalization ability, which reduces the convergence time of the network during training and allows this path to achieve a high degree of positional accuracy with minimal computation. After the CoordConv layer, each convolutional layer uses a 3 × 3 kernel with a stride of 2, followed by batch normalization and ReLU. The feature map extracted by this path is therefore 1/8 of the original image size and, owing to its large spatial size, can encode rich spatial information.
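A CoordConv layer simply concatenates normalized i and j coordinate channels to its input before a regular convolution. The compact sketch below shows this layer and a spatial path built on it; the channel widths are illustrative placeholders rather than the exact configuration.

```python
import torch
import torch.nn as nn

class CoordConv2d(nn.Module):
    """Convolution whose input is augmented with i/j coordinate channels, as in [32]."""
    def __init__(self, in_ch, out_ch, **kw):
        super().__init__()
        self.conv = nn.Conv2d(in_ch + 2, out_ch, **kw)

    def forward(self, x):
        n, _, h, w = x.shape
        ii = torch.linspace(-1, 1, h, device=x.device).view(1, 1, h, 1).expand(n, 1, h, w)
        jj = torch.linspace(-1, 1, w, device=x.device).view(1, 1, 1, w).expand(n, 1, h, w)
        return self.conv(torch.cat([x, ii, jj], dim=1))   # append coordinate channels

def conv_bn_relu(in_ch, out_ch):
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1),
                         nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

# Spatial path: CoordConv followed by three stride-2 conv blocks -> 1/8-resolution feature map.
spatial_path = nn.Sequential(
    CoordConv2d(1, 32, kernel_size=3, padding=1),
    conv_bn_relu(32, 64), conv_bn_relu(64, 64), conv_bn_relu(64, 128),
)
```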

For the semantic information path: while the spatial path encodes rich spatial information, the semantic path should provide a sufficient receptive field for the network. In pixel-level localization tasks such as semantic segmentation and object detection, the receptive field is crucial to network performance, and since we need to focus the patterns precisely on the target, efficiently increasing the receptive field is equally essential for our network. To enlarge the receptive field, previous authors have used methods such as the Pyramid Pooling Module (PPM) [33] and atrous convolution [34]. However, PPM requires a large amount of computation and memory, resulting in low computational efficiency, and stacking multiple atrous convolutions with the same dilation rate causes some pixels in the grid never to participate in the computation, which is unfriendly to pixel-level prediction. To achieve a large receptive field and efficient computation simultaneously, we propose the semantic information path, which uses a lightweight model and global average pooling to provide a larger receptive field and fast downsampling of the feature map to encode high-level semantic context information.

In the semantic information path, we use an attention refinement module to refine the features at each stage. As shown in Fig. 2, this module includes an attention vector to direct the network as it learns the features and employs a global average pool to capture the global context. It can easily integrate the global context information without any up-sampling operation. Therefore, it requires negligible computational cost [35].
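The attention refinement module can be sketched as a squeeze-and-excitation-style block in the spirit of [35]: global average pooling produces an attention vector that reweights the feature channels with global context and no up-sampling. The channel count below is a placeholder.

```python
import torch
import torch.nn as nn

class AttentionRefinement(nn.Module):
    """Refines a feature map with a global-context attention vector (no up-sampling needed)."""
    def __init__(self, channels):
        super().__init__()
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),        # global average pooling -> (N, C, 1, 1)
            nn.Conv2d(channels, channels, 1),
            nn.BatchNorm2d(channels),
            nn.Sigmoid(),                   # attention vector in [0, 1]
        )

    def forward(self, x):
        return x * self.attn(x)             # reweight channels by global context

feat = torch.rand(2, 128, 16, 16)
refined = AttentionRefinement(128)(feat)
```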

Furthermore, we develop the RNN-RAT network by merging the RNN and RAT networks, allowing the network to predict multiple targets in one task [36]. The RNN-RAT network takes previous inputs into account, so earlier data affect the current result; in this work, the RNN continuously modifies the transformation matrix for different targets. We modify the model by letting an RNN predict the transformation matrices as follows:

$$\left\{ {\begin{array}{{c}} {c = {f_{RAT}}({{G_i}} )}\\ {{h_t} = f_{loc}^{rnn}({c,{h_{t - 1}}} )}\\ {{A_\theta } = g({{h_t}} )} \end{array}} \right.,$$
where $f_{RAT}$ is the RAT network with input $G_i$; $f_{loc}^{rnn}$ represents the RNN network; ${h_t}$ and ${h_{t - 1}}$ represent the hidden states at different time steps; and g(·) is a linear layer that outputs the result. From the above equation, the current output depends not only on the input of the current step but is also influenced by the output of the previous step. This allows the network to locate targets by generating them sequentially and amplifying each element before each prediction.
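Equation (5) can be sketched as follows, with a GRU cell standing in for the recurrent unit; the hidden size and the number of targets are illustrative assumptions, not the exact configuration.

```python
import torch
import torch.nn as nn

class RNNAffinePredictor(nn.Module):
    """Sequentially predicts one affine matrix per target from a shared RAT feature c."""
    def __init__(self, feat_dim=128, hidden=128, num_targets=2):
        super().__init__()
        self.num_targets = num_targets
        self.rnn_cell = nn.GRUCell(feat_dim, hidden)   # plays the role of f_loc^rnn
        self.g = nn.Linear(hidden, 6)                  # g(.) -> 6 affine parameters

    def forward(self, c):                              # c = f_RAT(G_i), shape (N, feat_dim)
        h = torch.zeros(c.shape[0], self.rnn_cell.hidden_size, device=c.device)
        thetas = []
        for _ in range(self.num_targets):
            h = self.rnn_cell(c, h)                    # h_t depends on c and h_{t-1}
            thetas.append(self.g(h).view(-1, 2, 3))    # A_theta for this target
        return torch.stack(thetas, dim=1)              # (N, num_targets, 2, 3)
```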

The loss function of the network is a weighted combination of the L2 loss and the Intersection over Union (IoU) loss. The L2 loss, also known as the least-squares loss, minimizes the sum of squared differences between the predicted and actual values. The IoU loss, based on the ratio of the intersection to the union of the predicted region and the ground-truth region, measures the agreement between the predicted and true portions. The total loss is defined as:

$$L = \alpha {L_2} + \beta {L_{IoU}},$$
$$\left\{ {\begin{array}{{c}} {{L_2} = \frac{1}{m}\sum\limits_{i = 1}^m {{{({{y_i} - f({{x_i}} )} )}^2}} }\\ {{L_{IoU}} ={-} \ln \frac{{({{y_i} \cap f({{x_i}} )} )}}{{({{y_i} \cup f({{x_i}} )} )}}} \end{array}} \right..$$
where $\alpha $ and $\beta $ are the weighting coefficients assigned to the L2 loss and the IoU loss, respectively; their values were experimentally selected as 0.3 and 0.7. $y_i$ is the true value; $f(x_i)$ is the predicted value; and m is the number of samples.
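A minimal sketch of the combined loss in Eqs. (6)–(7), assuming the L2 term is taken over the affine parameters and the IoU term over (soft) binary masks of the predicted and ground-truth ROIs; the small constant eps is an illustrative numerical safeguard.

```python
import torch

def rat_loss(pred_theta, true_theta, pred_mask, true_mask, alpha=0.3, beta=0.7, eps=1e-6):
    """Weighted L2 loss on the affine parameters plus -ln(IoU) on the ROI masks."""
    l2 = torch.mean((true_theta - pred_theta) ** 2)
    inter = (pred_mask * true_mask).sum(dim=(-2, -1))
    union = (pred_mask + true_mask - pred_mask * true_mask).sum(dim=(-2, -1))
    l_iou = -torch.log((inter + eps) / (union + eps)).mean()
    return alpha * l2 + beta * l_iou
```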

2.4 Performance evaluation

Moreover, the peak signal-to-noise ratio (PSNR) [37] and the structural similarity index measure (SSIM) [38] are used as metrics to quantitatively evaluate the quality of the reconstructed image; the higher the PSNR and SSIM, the better the reconstruction quality. The formulas for PSNR and SSIM are defined as follows:

$$\left\{ {\begin{array}{{c}} {MSE = \frac{1}{{\hat{n} \times \hat{n}}}\sum\limits_{x = 1}^{\hat{n}} {\sum\limits_{y = 1}^{\hat{n}} {{{({T({x,y} )- O({x,y} )} )}^2}} } }\\ {PSNR = 10{{\log }_{10}}\left( {\frac{{MAX_{_I}^2}}{{MSE}}} \right) = 10{{\log }_{10}}\frac{{{{({{2^k} - 1} )}^2}}}{{MSE}}}\\ {SSIM = \frac{{({2{\mu_T}{\mu_O} + {c_1}} )({2{\sigma_{TO}} + {c_2}} )}}{{({\mu_T^2 + \mu_O^2 + {c_1}} )({\sigma_T^2 + \sigma_O^2 + {c_2}} )}}} \end{array}} \right..$$
where MSE is the mean square error; T(x,y) is the original image; O(x,y) is the reconstructed image; k is the number of bits, set to 8; and $\hat{n}$ is the image size, here 128 × 128 pixels. µT and µO are the means of the original and reconstructed images, respectively; σTO is the covariance of the original and reconstructed images, while σT and σO are their standard deviations. c1 and c2 are two constants that keep the numerator and denominator of SSIM away from division by zero, with c1 = (k1 × L)² and c2 = (k2 × L)², where k1 = 0.01, k2 = 0.03, and L = 1.
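For reference, the metrics of Eq. (8) can be computed directly as below. This follows the equation as written, i.e. a global single-window SSIM rather than the windowed SSIM of [38]; note that the paper uses k = 8 for PSNR (8-bit range) and L = 1 for SSIM (normalized range).

```python
import numpy as np

def psnr(ref, rec, k=8):
    """PSNR of a reconstructed image rec against the reference ref, per Eq. (8)."""
    mse = np.mean((ref.astype(float) - rec.astype(float)) ** 2)
    return 10 * np.log10((2 ** k - 1) ** 2 / mse)

def ssim_global(ref, rec, k1=0.01, k2=0.03, L=1.0):
    """Global (single-window) SSIM as written in Eq. (8)."""
    c1, c2 = (k1 * L) ** 2, (k2 * L) ** 2
    mu_t, mu_o = ref.mean(), rec.mean()
    var_t, var_o = ref.var(), rec.var()
    cov = ((ref - mu_t) * (rec - mu_o)).mean()
    return (((2 * mu_t * mu_o + c1) * (2 * cov + c2)) /
            ((mu_t ** 2 + mu_o ** 2 + c1) * (var_t + var_o + c2)))
```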

3. Results

3.1 Numerical simulation results

In this section, the RAT network designed in Section 2.2 is combined with GI to demonstrate the effectiveness and superiority of our method through numerical simulations.

3.1.1 Single-target GI simulation

10,000 images from the dataset are used as the training set, while 60,000 images are used as the test set. All images are resized to 128 × 128 pixels and converted to grayscale. To enhance the precision of network training and bring the network inputs closer to actual GI outputs, we first run a GI simulation with traditional uniform-resolution patterns at a 0.04 sampling rate on images that have undergone a random affine transformation, as shown in Fig. 3(a). The outcomes obtained after transforming the patterns, i.e., the results of training with the RAT network, are demonstrated in Fig. 3.

Fig. 3. MNIST figures trained by the network and the corresponding pattern transformation outcomes. (a) Blurred image formed by under-sampling with uniform-resolution patterns; the image size is 128 × 128 pixels and the sampling rate is 0.04. (b) Centered image obtained by transforming (a) with the affine matrix output by the network. (c) Variable-resolution patterns with the foveated region at the center. (d) Patterns of the foveated area covering the digits, obtained by mapping the centered patterns in (c) with the inverse of the affine matrix. (e) Overlay of the patterns obtained in (d) on (a).

The MNIST dataset images at various locations are reconstructed using foveated ghost imaging (FGI) and random uniform-resolution ghost imaging (RGI), as illustrated in Fig. 4. The images are both 128 × 128 pixels in size. At a sampling ratio of 0.04, the PSNR and SSIM values obtained using RGI are clearly lower than those obtained using the proposed FGI method, which yields higher reconstruction quality. Therefore, the proposed method can provide exact coverage of targets at varied locations while reconstructing the highest-quality images.

Fig. 4. MNIST dataset simulation result.

Next, to further demonstrate the effectiveness and scalability of the proposed method, we employed three complex target images in addition to the MNIST dataset. The three scenarios involved a single car located in the lower left corner, a car with obstructing objects located in the lower right corner (Fig. 5(a), (b)), and a two-car scene as shown in Fig. 6.

Fig. 5. Comparison of single target numerical simulation results. The reconstruction results for the car at the lower left corner are presented in (a). (b) Reconstruction results of the car with interfering objects located in the lower right corner. (c) (d) Comparison of PSNR and SSIM results.

Fig. 6. Multi-objective simulation results. (a) RGI experimental reconstruction results. (b) FGI experimental reconstruction results. (c) Comparison of PSNR and SSIM results for the left and right cars.

The initial experiment is the single-target car experiment positioned at the lower left corner. Each image is 128 × 128 pixels in size, with the sampling ratio set at 0.10, 0.15, 0.20, 0.25, and 0.30. Numerical simulation results for a single target at different locations are presented, with the required regions of interest shown separately. Ghost images with uniform resolution are noisier, and the image quality acquired through FGI is markedly superior to that achieved by RGI. The quality of the image reconstructed using FGI approaches that of the original image at a sampling rate of 0.3.

The PSNR and SSIM are quantitatively assessed from the perspective of the ROI to investigate the structural data. Figures 5(c) and (d) show the comparison curves of PSNR and SSIM using RGI and FGI, respectively, in which the orange color represents the result of using the random patterns and the blue color represents the result of using the variable-resolution foveated patterns. At a constant sampling rate, the SSIM and PSNR of the foveated region obtained with RGI are consistently lower than those of FGI. Furthermore, FGI yields a much more pronounced enhancement of both PSNR and SSIM than RGI, and the degree of enhancement obtained from FGI increases faster with the rise in sampling rate. This result confirms that the FGI acquired using the method described in this paper has superior image quality over RGI. Furthermore, it demonstrates that our approach is efficient both for simple targets and for more intricate targets with backgrounds, verifying its advancement and adaptability.

To evaluate network performance under conditions closer to the real world, we selected the car situated in the bottom-right corner with obstructions for simulation. The results are presented in Fig. 5(b) and (d), revealing that FGI consistently outperforms RGI, with the improvement growing as the sampling rate increases. The numerical simulation results validate the superior imaging quality of our method.

3.1.2 Multi-target GI simulation

To further verify the performance of the system, multi-target detection simulations are conducted using the RNN-RAT architecture. Multi-target patterns are produced as follows: the two nearest foveal centers are identified, and the perpendicular line between them is used as the dividing line along which the separately generated variable-resolution patterns covering the targets are spliced. The resulting patterns are shown in the pattern column of Fig. 6, and a simple sketch of this splicing rule is given below.
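The following NumPy fragment is an illustrative implementation of that splicing rule, not the authors' code: each pixel takes the pattern whose foveal center is nearer, which divides the image along the perpendicular bisector of the two centers. The center coordinates and pattern counts are placeholders.

```python
import numpy as np

def splice_two_patterns(pat_a, pat_b, center_a, center_b):
    """Splice two foveated patterns along the perpendicular bisector of their foveal centers:
    each pixel keeps the pattern whose center (x, y) is closer."""
    h, w = pat_a.shape
    yy, xx = np.mgrid[0:h, 0:w]
    d_a = (xx - center_a[0]) ** 2 + (yy - center_a[1]) ** 2
    d_b = (xx - center_b[0]) ** 2 + (yy - center_b[1]) ** 2
    return np.where(d_a <= d_b, pat_a, pat_b)

# Example with placeholder patterns: splice a stack of T patterns, pattern by pattern.
T, H, W = 4, 128, 128
rng = np.random.default_rng(1)
pa, pb = rng.random((T, H, W)), rng.random((T, H, W))
spliced = np.stack([splice_two_patterns(pa[t], pb[t], (40, 90), (100, 30)) for t in range(T)])
```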

Figure 6 illustrates the multi-target comparison simulation results. Specifically, Fig. 6(c) displays the PSNR and SSIM comparison curves for the vehicle situated on the left side, obtained at sampling rates of 0.1, 0.15, 0.2, 0.25, and 0.3, corresponding to 1638, 2458, 3277, 4096, and 4915 projected illumination patterns, respectively. The orange curve represents the outcomes using FGI, while the blue curve represents the results obtained via RGI. The observed simulation phenomena align with prior findings. Increasing the sampling rate improves the performance of all schemes as measured by SSIM and PSNR, while also capturing richer information from the acquired target images. The difference between the two is minimal at lower sampling rates but expands as the sampling rate increases. At a sampling rate of 0.3, the PSNR achieved through FGI approaches 47.5 dB, which is more than 20 dB higher than that of RGI, and the SSIM shows an increase of approximately 0.2. Consequently, the reconstructed image exhibits significantly improved quality.

Moreover, we quantitatively analyze the pattern generation time at different resolutions for conventional object detection methods and for our method. The experimental platform is an AMD EPYC 9654 CPU, 60 GB of RAM, an Nvidia GeForce RTX 4090 graphics card, and the Ubuntu 22.04 operating system. We present the pattern generation times in Table 1; here, the pattern generation time is the time to generate 1024 patterns at the resolution given in each column of the table. It can be seen that our method significantly reduces computational time compared to conventional methods. This is attributed to the transformation from Cartesian coordinates to log-polar coordinates required to generate variable-resolution patterns with conventional methods [19]. This transformation involves numerous logarithmic and exponential operations, greatly increasing computational complexity; in addition, the compiler struggles to vectorize some of the instructions and emits a large number of branch-prediction instructions, resulting in increased computational latency. A sketch of this conventional mapping is given after Table 1.

Table 1. Comparison of pattern generation time (ms)
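To illustrate why the conventional route is slow, the fragment below sketches the per-pixel Cartesian-to-log-polar cell assignment on which conventional foveated-pattern generation relies [19]; the affine approach instead generates one centered pattern once and reuses it through the cheap warp shown earlier. The ring/wedge counts and the minimum radius are placeholder values, not the parameters used in the paper.

```python
import numpy as np

def logpolar_cell_index(h, w, n_rings=32, n_wedges=64, r_min=2.0):
    """Assign every pixel to a log-polar (ring, wedge) cell; this is the log/exp-heavy step
    that conventional variable-resolution pattern generation repeats for every new fovea."""
    yy, xx = np.mgrid[0:h, 0:w]
    x, y = xx - w / 2.0, yy - h / 2.0
    r = np.hypot(x, y) + 1e-9
    r_max = 0.5 * min(h, w)
    ring = np.floor(n_rings * np.log(r / r_min) / np.log(r_max / r_min))
    ring = np.clip(ring, 0, n_rings - 1).astype(int)          # pixels inside r_min fall in the fovea
    wedge = np.floor((np.arctan2(y, x) + np.pi) / (2 * np.pi) * n_wedges).astype(int) % n_wedges
    return ring * n_wedges + wedge                             # one retina-like cell index per pixel

cells = logpolar_cell_index(128, 128)   # assigning one random value per cell yields one pattern
```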

3.2 Experimental results

The viability of the proposed methodology has been established in the previous sections. This section reinforces its efficacy through practical experimentation and presents reconstructed results for various targets at different sampling ratios. Figure 7 displays the experimental setup, comprising a light-emitting diode (LED), a projection lens, a DMD, a single-point detector, and a data acquisition board. The LED operates in the wavelength range of 400-760 nm with a power of 20 W. The DMD is based on the DLP Discovery 4100 development kit from Texas Instruments and consists of an array of 1024 × 768 micromirrors with a maximum binary modulation rate of 22 kHz. The projection lens has a focal length of 150 mm, and for detection a single-point detector (Thorlabs PDA36A, with an active area of 13 mm²) is used in conjunction with a data acquisition card (DAQ, PICO6404E). Furthermore, the computer setup consists of two 6226R CPUs and 128 GB of RAM.

Fig. 7. Experimental setup.

3.2.1 Single-target GI experiments

Figure 8 depicts the reconstruction results for single-target objects at various locations. The reconstructed images obtained through RGI are noticeably more contaminated by noise. Additionally, Fig. 8(c) shows the PSNR and SSIM curves obtained using variable-resolution patterns and random patterns at sampling rates of 0.03, 0.05, 0.07, and 0.09, with all images having a resolution of 128 × 128 pixels. In the actual experiments, the measured PSNR and SSIM are inferior to the simulation results because neither the detector nor the light source is ideal. The network output shows that, at 128 × 128 pixels, the object to be imaged is well covered by the variable-resolution patterns. Moreover, the reconstruction quality is consistent with the simulation results and proves to be superior to that obtained with uniform-resolution patterns.

Fig. 8. Results of single-target experiments conducted at different locations. (a) Imaging results for the car located in the lower left corner. The object is the single target in the lower left corner; RGI denotes the results reconstructed using random 0/1-filled uniform-resolution patterns, and FGI denotes the results reconstructed using variable-resolution patterns at different sampling rates. (b) Imaging results for the car in the upper right corner. (c) Comparison of PSNR and SSIM curves for the single target in the lower left corner and the single target in the upper right corner.

3.2.2 Multi-target GI experiments

In practical applications, it is likely that multiple regions of interest may be present, and therefore, the utilization of the RNN-RAT model is necessary for the detection of these regions. The pattern is generated in the same way as the multi-target pattern is generated in the simulation.

The results and analysis of the experiment are presented in Fig. 9. For multi-targets at various locations, the obtained results are similar to those for single targets. Furthermore, employing the proposed FGI method produces higher PSNR and SSIM, with an improvement of approximately 2 dB in PSNR and an SSIM about 1.5 times that of RGI. The experimental results confirm that our method achieves accurate and effective coverage for individual targets at various sampling rates, and the RNN-RAT network enables efficient acquisition of the ROIs of multiple targets.

Fig. 9. Multi-target experimental results. Results for targeting car and aircraft are presented in (a) and (b) respectively. (c) Comparison of PSNR and SSIM curves for different targets.

3.3 Network performance results

To demonstrate the efficiency and rationale of the network design, ablation experiments were carried out on the network with three distinct backbones. The experiments used the COCO dataset and a custom extended dataset comprising approximately 130,000 training images and around 7,000 validation images. The L2 loss of the predicted affine matrix and the IoU loss of the affine-transformed target object were employed as the evaluation criteria for network performance. In this prediction task, a substantial discrepancy in generalization ability was observed between the linear layer of the original Spatial Transformer Networks (STN) model [30] and the lightweight semantic path model we designed. This difference is evident in Fig. 10, which displays the IoU loss on the left and the L2 loss on the right. The L2 loss between the affine matrix obtained from network inference and the ground truth diminished from 14.7% to 8.9%, while the IoU between the affine-transformed target and the ideal target location increased from 62.9% to 81.2%. We then used the semantic path as the foundation and augmented it with the spatial path for better localization accuracy. As a result, the L2 loss is reduced to 1.8%, whilst the IoU increases to 92.3%. This observation demonstrates that our spatial path encodes extensive spatial detail, establishing the superiority of the network in the pixel-level location prediction task.

Fig. 10. IoU loss and L2 loss along with the iteration steps from 1 to 500. (a) IoU loss graph. (b) L2 loss graph.

Because conventional target identification techniques require the patterns to be repeatedly regenerated to match the position of the target, they incur a significant computational delay. To address this, we assessed the inference time, latency, FLOPs, parameter count, and model memory size of the developed network against those of several real-time target detection networks, including SSD512 [39], Faster-RCNN [40], YOLOv3 [41], RTMDet [42], and YOLOX [43], to illustrate the efficiency of the proposed network in Table 2. To ensure a fair comparison, all models were trained for 300 epochs with an input resolution of 640 × 640, without distillation or pruning, and using a pattern generation number of 8192. The proposed model uses significantly fewer parameters and computational resources than mainstream object detection models, a reduction of around 80%. Moreover, the inference speed of the proposed model reaches 826 fps, demonstrating superior performance compared to the previous methods.

Table 2. Computational comparison table

4. Discussions and conclusions

In this paper, we introduce a novel foveated GI affine transform method based on deep learning, in which we present a new learnable module, the Retina Affine Transformer network, that explicitly allows the spatial manipulation of patterns within the network. Our simulation and experimental results demonstrate that the proposed method effectively focuses on the ROI at low computational cost while enhancing the quality of the reconstructed image. Our approach has several advantages. First, by predicting the parameters of the affine matrix, the speed of target localization and pattern generation for the ROI is increased, improving the efficiency of GI. Second, the imaging quality of the ROI is improved by using foveated patterns while keeping the field of view unchanged. In addition, the proposed method is applicable not only to single-target ROI acquisition but also to simultaneously localizing and reconstructing multiple ROIs in complex scenes. Finally, the proposed method expedites speckle pattern generation and image reconstruction through GPU acceleration and parallel computing. By harnessing the parallel computational power of GPUs, these tasks can be executed concurrently, so the time for ROI localization and pattern generation is reduced by a factor of about 1 × 10⁵ compared with previous methods, and the image quality of the ROI is improved by more than 4 dB. Furthermore, the method could help track and reconstruct dynamic targets of interest, reducing the time needed for such tracking and advancing the development of real-time ghost imaging.

Funding

State Key Laboratory Foundation of applied optics (SKLA02022001A11); National Natural Science Foundation of China (62105029); Young Elite Scientists Sponsorship Program by CAST (YESS20220600); Beijing Nature Science Foundation of China (4222017).

Acknowledgments

The authors thank the editor and the anonymous reviewers for their valuable suggestions.

Disclosures

The authors declare that there are no conflicts of interest related to this paper.

Data availability

Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request.

References

1. B. I. Erkmen and J. H. Shapiro, “Ghost imaging: from quantum to classical to computational,” Adv. Opt. Photon. 2(4), 405 (2010). [CrossRef]  

2. T. B. Pittman, Y. H. Shih, D. V. Strekalov, et al., “Optical imaging by means of two-photon quantum entanglement,” Phys. Rev. A 52(5), R3429–R3432 (1995). [CrossRef]  

3. J. H. Shapiro, “Computational ghost imaging,” Phys. Rev. A 78(6), 061802 (2008). [CrossRef]  

4. R. S. Bennink, S. J. Bentley, and R. W. Boyd, ““Two-Photon” Coincidence Imaging with a Classical Source,” Phys. Rev. Lett. 89(11), 113601 (2002). [CrossRef]  

5. B. Sun, S. Jiang, Y. Ma, et al., “Application and development of single pixel imaging in the special wavebands and 3D imaging,” Infrared Laser Eng. 49(3), 303016 (2020). [CrossRef]  

6. H. Cui, J. Cao, Q. Hao, et al., “Foveated panoramic ghost imaging,” Opt. Express 31(8), 12986–13002 (2023). [CrossRef]  

7. Z. Li, J. Suo, X. Hu, et al., “Efficient single-pixel multispectral imaging via non-mechanical spatio-spectral modulation,” Sci. Rep. 7(1), 41435 (2017). [CrossRef]  

8. L. Bian, J. Suo, G. Situ, et al., “Multispectral imaging using a single bucket detector,” Sci. Rep. 6(1), 24752 (2016). [CrossRef]  

9. Y. Sun, H. Jian, D. Shi, et al., “Cosinusoidal encoding multiplexed structured illumination multispectral ghost imaging,” Opt. Express 30(18), 31728–31741 (2022). [CrossRef]  

10. M. P. Olbinado, D. M. Paganin, Y. Cheng, et al., “X-ray phase-contrast ghost imaging using a single-pixel camera,” Optica 8(12), 1538–1544 (2021). [CrossRef]  

11. A. Zhang, Y. He, L. Wu, et al., “Tabletop x-ray ghost imaging with ultra-low radiation,” Optica 5(4), 374–377 (2018). [CrossRef]  

12. O. Sefi, Y. Klein, E. Strizhevsky, et al., “X-ray imaging of fast dynamics with single-pixel detector,” Opt. Express 28(17), 24568–24576 (2020). [CrossRef]  

13. L. Olivieri, J. S. T. Gongora, L. Peters, et al., “Hyperspectral terahertz microscopy via nonlinear ghost imaging,” Optica 7(2), 186–191 (2020). [CrossRef]  

14. C. M. Watts, D. Shrekenhamer, J. Montoya, et al., “Terahertz compressive imaging with metamaterial spatial light modulators,” Nat. Photonics 8(8), 605–609 (2014). [CrossRef]  

15. R. I. Stantchev, X. Yu, T. Blu, et al., “Real-time terahertz imaging with a single-pixel detector,” Nat. Commun. 11(1), 2535 (2020). [CrossRef]  

16. M. J. Sun, M. P. Edgar, G. M. Gibson, et al., “Single-pixel three-dimensional imaging with time-based depth resolution,” Nat. Commun. 7(1), 12010 (2016). [CrossRef]  

17. Z. Zhang, S. Liu, J. Peng, et al., “Simultaneous spatial, spectral, and 3D compressive imaging via efficient Fourier single-pixel measurements,” Optica 5(3), 315–319 (2018). [CrossRef]  

18. M. J. Sun and J. M. Zhang, “Single-Pixel Imaging and Its Application in Three-Dimensional Reconstruction: A Brief Review,” Sensors 19(3), 732 (2019). [CrossRef]  

19. D. B. Phillips, M. Sun, J. M. Taylor, et al., “Adaptive foveated single-pixel imaging with dynamic supersampling,” Sci. Adv. 3(4), e1601782 (2017). [CrossRef]  

20. K. Y. Zhang, J. Cao, Q. Hao, et al., “Modeling and Simulations of Retina-Like Three-Dimensional Computational Ghost Imaging,” IEEE Photonics J. 11(1), 1–13 (2019). [CrossRef]  

21. Q. Hao, Y. Tao, J. Cao, et al., “Retina-like Imaging and Its Applications: A Brief Review,” Appl. Sci 11(15), 7058 (2021). [CrossRef]  

22. E. Akbas and M. P. Eckstein, “Object detection through search with a foveated visual system,” PLoS Comput. Biol. 13(10), e1005743 (2017). [CrossRef]  

23. F. Huang, H. Ren, X. Wu, et al., “Flexible foveated imaging using a single Risley-prism imaging system,” Opt. Express 29(24), 40072–40090 (2021). [CrossRef]  

24. J. Cao, D. Zhou, Y. Zhang, et al., “Optimization of retina-like illumination patterns in ghost imaging,” Opt. Express 29(22), 36813–36827 (2021). [CrossRef]  

25. X. Liu, T. Han, C. Zhou, et al., “Low sampling high quality image reconstruction and segmentation based on array network ghost imaging,” Opt. Express 31(6), 9945–9960 (2023). [CrossRef]  

26. F. Wang, C. Wang, C. Deng, et al., “Single-pixel imaging using physics enhanced deep learning,” Photon. Res. 10(1), 104–110 (2022). [CrossRef]  

27. X. Zhai, Z. Cheng, Y. Hu, et al., “Foveated ghost imaging based on deep learning,” Opt. Commun. 448, 69–75 (2019). [CrossRef]  

28. Z. Yang, Y.-M. Bai, L.-D. Sun, et al., “SP-ILC: Concurrent Single-Pixel Imaging, Object Location, and Classification by Deep Learning,” Photonics 8(9), 400 (2021). [CrossRef]  

29. L. H. Bian, J. L. Suo, Q. H. Dai, et al., “Experimental comparison of single-pixel imaging algorithms,” J. Opt. Soc. Am. A 35(1), 78–87 (2018). [CrossRef]  

30. M. Jaderberg, K. Simonyan, A. Zisserman, et al., “Spatial Transformer Networks,” in Proceedings of the 28th International Conference on Neural Information Processing Systems (2016), pp. 2017–2025.

31. C. Yu, J. Wang, C. Peng, et al., “BiSeNet: Bilateral Segmentation Network for Real-time Semantic Segmentation,” in Proceedings of the European Conference on Computer Vision (ECCV, 2018), pp. 334–349.

32. R. Liu, J. Lehman, P. Molino, et al., “An Intriguing Failing of Convolutional Neural Networks and the CoordConv Solution,” in Proceedings of the 32nd International Conference on Neural Information Processing Systems (2018), pp. 9628–9639.

33. H. Zhao, J. Shi, X. Qi, et al., “Pyramid Scene Parsing Network,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (IEEE, 2017), pp. 6230–6239.

34. F. Yu and V. Koltun, “Multi-Scale Context Aggregation by Dilated Convolutions,” arXiv, arXiv:1511.07122 (2016). [CrossRef]  

35. J. Hu, L. Shen, S. Albanie, et al., “Squeeze-and-Excitation Networks,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (IEEE, 2019), pp. 7132–7141.

36. S. K. Sønderby, C. K. Sønderby, L. Maaløe, et al., “Recurrent Spatial Transformer Networks,” arXiv, arXiv:1509.05329 (2015). [CrossRef]  

37. W. L. Gong, “Performance comparison of computational ghost imaging versus single-pixel camera in light disturbance environment,” Opt. Laser Technol. 152, 108140 (2022). [CrossRef]  

38. Z. Wang, A. C. Bovik, H. R. Sheikh, et al., “Image Quality Assessment: From Error Visibility to Structural Similarity,” IEEE Trans. on Image Process. 13(4), 600–612 (2004). [CrossRef]  

39. W. Liu, D. Anguelov, D. Erhan, et al., “SSD: Single Shot MultiBox Detector,” in Proceedings of the European Conference on Computer Vision (ECCV, 2016), pp. 21–37.

40. S. Ren, K. He, R. Girshick, et al., “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks,” in IEEE Transactions on Pattern Analysis and Machine Intelligence (IEEE, 2016), pp. 1137–1149.

41. J. Redmon and A. Farhadi, “YOLOv3: An Incremental Improvement,” arXiv, arXiv:1804.02767 (2018). [CrossRef]  

42. C. Lyu, W. Zhang, H. Huang, et al., “RTMDet: An Empirical Study of Designing Real-Time Object Detectors,” arXiv, arXiv:2212.07784 (2022). [CrossRef]  

43. Z. Ge, S. Liu, F. Wang, et al., “YOLOX: Exceeding YOLO Series in 2021,” arXiv, arXiv:2107.08430 (2021). [CrossRef]  
