
Single-pixel imaging based on self-supervised conditional mask classifier-free guidance


Abstract

Reconstructing high-quality images at a low measurement rate is a pivotal objective of single-pixel imaging (SPI). Current deep learning methods achieve this by optimizing the loss between the reconstructed image and the original image, which limits how much information can be extracted from the low measurement values. We instead interpret reconstruction through conditional probability and introduce classifier-free guidance (CFG) for enhanced reconstruction, proposing a self-supervised conditional masked classifier-free guidance (SCM-CFG) method for single-pixel reconstruction. At a 10% measurement rate, SCM-CFG efficiently completed the training task, achieving an average peak signal-to-noise ratio (PSNR) of 26.17 dB on the MNIST dataset, surpassing other single-photon imaging and computational ghost imaging methods, and demonstrating remarkable generalization performance. Moreover, owing to the design of the conditional mask in this paper, reconstruction accuracy can be significantly enhanced through overlay: SCM-CFG achieved a notable average improvement of 7.3 dB in overlay processing, in contrast to only about 1 dB for computational ghost imaging. Subsequent physical experiments validated the effectiveness of SCM-CFG.

© 2024 Optica Publishing Group under the terms of the Optica Open Access Publishing Agreement

1. Introduction

In antecedent endeavors, the theory of compressive sensing posited that if data manifests sparsity in a certain transform domain, then the original signal can invariably be reconstructed [1-3]. This theory finds significant applications across various domains, and SPI stands as a noteworthy exemplar [4-8]. SPI features cost-effectiveness and high sensitivity, offering broad application prospects in fields such as spectral measurement and biomedical detection [9-12].

However, the computational complexity of compressive sensing is considerable, and the reconstruction time exhibits exponential growth with increasing image dimensions. Consequently, several deep learning reconstruction algorithms emerged, exemplified by methodologies such as MPIGAN and Bsr2Net [13,14]. These approaches adeptly address the impediment of sluggish reconstruction, thereby realizing the aspiration of reconstructing high-quality images at low measurement rates.

These deep learning methodologies fundamentally operate by optimizing a loss function, casting image reconstruction as a regression problem [13-20]: they fit the reconstructed image to the original image in an end-to-end manner. From a probabilistic perspective, however, it may be more reasonable to interpret the reconstruction process differently: low-rate measurements should determine a likelihood over possible original images rather than a single replica. From an optimization standpoint, a probabilistic generative model can generate multiple images, each approximating the original with a certain accuracy. These images can subsequently be made more deterministic and less diverse through overlay optimization; that is, through superimposition, one endeavors to extract the utmost utility from the low measurement values, harboring a superior reconstruction upper limit.

Hence, this encourages the utilization of generative models, where, grounded in low measurement values, end-to-end reconstruction transitions into producing original images with a specified degree of accuracy. Present generative models include generative adversarial networks (GANs) and variational autoencoders (VAEs) [14,17,21-23]. GANs entail the simultaneous training of two adversarial networks and are susceptible to issues such as gradient explosion or mode collapse. The generation process of a VAE tends to produce blurry samples: its stochastic generation cannot guarantee high-quality samples in every instance, and the strong prior in a VAE, often a strict Gaussian distribution, can produce an averaging effect that leaves reconstructed images indistinct.

In 2022, Ho, a creator of denoising diffusion probabilistic models (DDPM), a generative model built on Markov chains, introduced CFG on the foundation of DDPM [24]. CFG jointly trains a conditional DDPM and an unconditional DDPM; merging their score functions yields a fused denoising noise, which is applied iteratively in the reverse process to denoise and generate images. Its inception signifies an improved balance between mode coverage and sample fidelity, making it undoubtedly valuable for low-measurement-rate single-pixel reconstruction.

However, applying CFG to SPI in a rational manner presents certain challenges. What condition information should be provided for the reconstruction task? How can low-rate measurements be judiciously inserted as conditions? How can the features of CFG be cleverly leveraged for rapid reconstruction and further enhancement? All these aspects require careful consideration. To address this, we have devised a pixel-level SCM-CFG for the reconstruction of single-pixel images. We integrate the characteristics of CFG with the distinctive attributes of single-pixel reconstruction, resulting in three substantial contributions as follows.

1. In terms of condition acquisition, we introduce an intriguing pre-training method. This method dynamically employs the least squares method to fit, on the fly, the weight matrices of two fully connected layers within the autoencoder network, ensuring their mutual invertibility throughout the training process. Transforming the formulation of the weight matrices into a fitting problem not only alleviates the challenge of weight training but also brings simulation experiments closer to real-world scenarios.

2. Concerning the incorporation of conditions, in contrast to the condition-insertion approach of the recent DDPM-based ghost imaging (DDPMGI) method [25], this paper subjects conditions to convolutional processing across distinct semantic dimensions. Through ablation experiments, we devise an optimal condition-insertion method and, combining experimental results and analysis, distill a condition-insertion approach with strong generalization properties.

3. Combining the characteristics of CFG, and in order to further enhance reconstruction quality and optimize the upper bound, we drew inspiration from the principles of self-supervised learning and the image inpainting literature [26-29]. We introduce a self-supervised pixel-level improvement to CFG, crafting a self-supervised conditional mask for CFG. While demonstrating strong generalization, SCM-CFG achieves promising results within relatively few epochs, and further training leads to improved reconstruction, surpassing other network methods. SCM-CFG, combined with overlay optimization, better captures pixel relationships, significantly improving reconstruction quality and further validating the efficacy of the generative diffusion model in optimizing upper limits.

2. Methods

2.1 Simulation experiment optimization

Figure 1 portrays the single-photon compressive imaging system in our laboratory [8,13,30-32]. The complete optical setup consists of an LED, a collimating light pipe, attenuating elements, and an aperture, producing an exceedingly feeble light source at the single-photon level. The imaging subject is a transmissive pattern etched onto a glass substrate. When illuminated by the system's light source, the imaging subject is projected through a lens onto a digital micromirror device (DMD). The DMD (TI 0.7 XGA DDR) consists of 1024 × 768 individual micromirrors, each independently controllable and together serving as a spatial light modulator; it continuously loads measurement matrices, facilitating random modulation of the spatial light [33]. Each micromirror measures 13.68 μm × 13.68 μm and offers two reflective states, $+12^\circ$ and $-12^\circ$, symbolizing “on” and “off” modulation. The binary random matrix loaded onto the DMD simultaneously governs the orientation of every micromirror, achieving modulation of the input light. The modulated optical signal then enters a photomultiplier tube (PMT) operating in photon-counting mode (Hamamatsu Photonics H10682-110). This PMT serves as a point detector, allowing light intensity values from multiple DMD pixels to be collected in a single acquisition and output as discrete pulses to the receiver. A purpose-built FPGA control and counting circuit loads the binary random matrix into the DMD controller for each measurement and counts the PMT's single-photon pulse output. The resulting photon pulse count values, denoted $y_1, y_2, \ldots, y_n$, are fed into the SCM-CFG for target image reconstruction.

Fig. 1. The SPI system enhanced by SCM-CFG

This process can be modeled mathematically as:

$$Y = {W_1}X$$

Herein, X denotes the pristine image (flattened into a vector), $W_1$ the measurement matrix, whose number of rows equals the number of measurements (the number of image pixels multiplied by the measurement rate) and whose number of columns equals the number of image pixels, and Y the measurements, denoted $y_1, y_2, \ldots, y_n$.
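As a concrete illustration of Eq. (1), the following minimal numpy sketch simulates single-pixel measurements for a flattened 32 × 32 image at a 10% measurement rate; the binary matrix and the scene here are random stand-ins rather than measured data.

```python
# Minimal sketch of the measurement model Y = W1 X in Eq. (1).
# W1 and X below are random stand-ins, not experimental data.
import numpy as np

rng = np.random.default_rng(0)

n = 32 * 32                    # pixels in the flattened image X
mr = 0.10                      # measurement rate
m = int(n * mr)                # number of single-pixel measurements

W1 = rng.integers(0, 2, size=(m, n)).astype(float)  # binary DMD patterns
X = rng.random(n)                                   # stand-in scene

Y = W1 @ X                     # measurements y_1, ..., y_m
print(Y.shape)                 # (102,)
```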

Generally, it is customary to employ a pre-training methodology, involving paired fully connected layers with mutually invertible weight matrices, to align simulation experiments with their real-world counterparts.

$$W_1 X W_2 = X', \qquad W_2 = (W_1^T W_1)^{-1} W_1^T$$

In Eq. (2), X is flanked on either side by the weight matrices $W_1$ and $W_2$ of the two fully connected layers, yielding the initial reconstruction $X'$; Y is obtained during the training process of Eq. (2), as depicted in Fig. 2. Training the $W_2$ weights directly under this constraint is difficult and relies heavily on parameter tuning. We therefore turn to the method of least squares, that is, requiring Eq. (3) to be satisfied throughout training.

$${W_1}{W_2} = E$$

Fig. 2. Pretrained Two Fully Connected Layer Network

As illustrated in Fig. 2, given $W_1$ and E, we require the weight matrix $W_2$ of the second fully connected layer to satisfy Eq. (3) during training, where E in Eq. (3) represents the identity matrix. This ensures mutual invertibility while keeping training tractable. This way of learning carries a certain generality: within a neural network, one can extract the weights of a network layer, dynamically fit a counterpart through methods such as least squares, and assign the fitted weights to the corresponding second layer. This departs from the conventional approach of constraining weights between layers through mathematical transformations, reducing training complexity while achieving an approximately equivalent effect.
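The sketch below illustrates this fitting step under stated assumptions: two PyTorch linear layers stand in for the paper's fully connected pair, and numpy's least-squares solver fits $W_2$ so that Eq. (3) holds; the dimensions and layer names are illustrative.

```python
# Sketch of fitting W2 by least squares so that W1 W2 = E (Eq. (3)).
# `enc`/`dec` are illustrative stand-ins for the two fully connected layers.
import numpy as np
import torch

m, n = 102, 1024                          # measurements x pixels (10% of 32x32)
enc = torch.nn.Linear(n, m, bias=False)   # weight plays the role of W1 (m x n)
dec = torch.nn.Linear(m, n, bias=False)   # weight should act as W2 (n x m)

with torch.no_grad():
    W1 = enc.weight.detach().double().numpy()            # (m, n)
    # Solve W1 @ W2 = E in the least-squares sense; for m < n and a full
    # row-rank W1 this is exactly solvable (minimum-norm solution).
    W2, *_ = np.linalg.lstsq(W1, np.eye(m), rcond=None)  # (n, m)
    dec.weight.copy_(torch.from_numpy(W2))               # assign fitted weights
    print(np.linalg.norm(W1 @ W2 - np.eye(m)))           # ~0
```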

2.2 Design of conditional diffusion model

After pre-training, we insert the initial reconstructed image obtained at a low measurement rate as the condition in the conditional diffusion model portion of CFG [24].

We know that the reconstruction process of the conditional diffusion model is actually the construction of the generation process $p({{x_{t - 1}}\mathrm{\mid }{x_t},y} )$, as articulated in Eq. (4).

$$p({{x_{t - 1}}\mathrm{\mid }{x_t},y} )= \frac{{p({{x_{t - 1}}\mathrm{\mid }{x_t}} )p({y\mathrm{\mid }{x_{t - 1}},{x_t}} )}}{{p({y\mathrm{\mid }{x_t}} )}}$$

After the introduction of y, the forward diffusion process is not affected, and Eq. (5) is still satisfied.

$${x_t} = \sqrt {\overline {{\alpha _t}} } {x_0} + \sqrt {1 - \overline {{\alpha _t}} } \overline {{z_t}} $$

Given that the added noise contributes no discernible benefit to the classification, the inclusion of $x_t$ exerts negligible influence on it: $x_t$ is merely a noisier version of $x_{t-1}$. Hence $p(y \mid x_{t-1}, x_t) = p(y \mid x_{t-1})$, leading to the formulation of Eq. (6) as follows.

$$p(x_{t-1}\mid x_t, y) = \frac{p(x_{t-1}\mid x_t)\, p(y\mid x_{t-1})}{p(y\mid x_t)} = p(x_{t-1}\mid x_t)\, e^{\log p(y\mid x_{t-1}) - \log p(y\mid x_t)}$$

In Eq. (6), t runs up to a maximum of T. When T is sufficiently large, the variance of $p(x_t \mid x_{t-1})$ is small enough that the probability is significantly greater than zero only when $x_t$ and $x_{t-1}$ are in close proximity; conversely, $p(x_{t-1}\mid x_t, y)$ or $p(x_t \mid x_{t-1}, y)$ is likewise notably greater than zero only when $x_t$ and $x_{t-1}$ are close. Hence we may concentrate on the variation of probability within this neighborhood, and a first-order Taylor expansion yields Eq. (7).

$$\log p(y\mid x_{t-1}) - \log p(y\mid x_t) \approx (x_{t-1} - x_t)\cdot \nabla_{x_t} \log p(y\mid x_t)$$
$p(x_{t-1}\mid x_t, y)$ can be formulated as Eqs. (8) and (9).
$$p(x_{t-1}\mid x_t, y) \propto e^{-\|x_{t-1} - \mu(x_t)\|^2 / 2\sigma_t^2 + (x_{t-1} - x_t)\cdot \nabla_{x_t} \log p(y\mid x_t)}$$
$$p(x_{t-1}\mid x_t, y) \propto e^{-\|x_{t-1} - \mu(x_t) - \sigma_t^2 \nabla_{x_t} \log p(y\mid x_t)\|^2 / 2\sigma_t^2}$$

Thus, we arrive at Eq. (10).

$$x_{t-1} = \mu(x_t) + \sigma_t^2 \nabla_{x_t} \log p(y\mid x_t) + \sigma_t \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, I)$$

The corresponding scoring function can be expressed as Eq. (11).

$$\hat{s}_\theta(x_t, t, y) \approx \nabla_{x_t} \log p(x_t \mid y)$$

Evidently, we derived the conditional DDPM segment of CFG from the formulaic perspective of the scoring function. However, the judicious insertion of conditions into network training is also a facet that demands consideration.

In 2023, the DDPMGI work, by incorporating conditional information into the initial portion of the downsampling path of the Unet, implemented a conditional diffusion model for the first time in the realm of ghost imaging. Given its relatively simplistic insertion method and its lack of attention to the convolutional semantic dimensions of the condition, there is room for improvement in its effectiveness. This paper innovates upon that foundation by extracting conditions at various convolutional semantic dimensions and inserting them in a ResNet-like manner, leading to superior outcomes.

We understand that conditions for different tasks carry varying semantic information from low to high dimensions. To enable the conditional DDPM to utilize the measurement values and reconstruct the original image with the highest accuracy, it is crucial to investigate where conditions are inserted and to extract both high- and low-dimensional semantic information from them. To this end, we devised the Unet shown in Fig. 3, along with a condition-handling method. Eight positions (1-8) for condition insertion are designated within the Unet (Fig. 3(1)), and convolutional semantic information is extracted from the conditions (Fig. 3(2)), categorized into five semantic dimensions labeled a to e, spanning low to high. The step index t is inserted into the Unet so that the trained denoising noise adapts to the reverse diffusion step as t varies from T to 1. In the subsequent experiment in Section 3.1, we conduct ablation experiments with different insertion methods, identify the optimal condition-insertion approach, and distill condition-insertion experience, surpassing the methodology employed in DDPMGI [25].

Fig. 3. (1) Unet network structure and conditional insertion positions 1-8. (2) The conditions are extracted from the feature map dimension into five semantic dimensions a-e.
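To make the idea concrete, a hedged sketch of the condition-processing module follows: the condition image is convolved into a pyramid of feature maps (one per semantic dimension a-e) that can then be added, ResNet-style, to the matching downsampling stages of the Unet. Module names and channel sizes are illustrative assumptions, not the paper's exact architecture.

```python
# Sketch of Fig. 3(2): extract five semantic dimensions (a-e) from the
# condition y as a convolutional feature pyramid. Channel sizes are
# illustrative assumptions.
import torch
import torch.nn as nn

class ConditionPyramid(nn.Module):
    def __init__(self, chans=(16, 32, 64, 128, 256)):   # dimensions a-e
        super().__init__()
        self.stages = nn.ModuleList()
        c_in = 1
        for c_out in chans:
            self.stages.append(nn.Sequential(
                nn.Conv2d(c_in, c_out, 3, stride=2, padding=1), nn.SiLU()))
            c_in = c_out

    def forward(self, y):
        feats, h = [], y
        for stage in self.stages:
            h = stage(h)
            feats.append(h)      # one feature map per semantic dimension
        return feats

# Inside the Unet encoder, each down block can then receive its matching
# condition feature additively (ResNet-style), e.g.:
#   h = down_block(h, t_emb) + cond_feats[level]
```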

2.3 Classifier-free guidance at the pixel level of image

Having elucidated the method of condition insertion for the conditional DDPM segment of CFG, the subsequent consideration lies in how to harness the distinctive features of CFG for a more efficient reconstruction of images.

In 2021, OpenAI's team published a paper titled “Diffusion models beat GANs on image synthesis” [34], introducing Classifier Guidance. This improvement significantly enhanced the quality of image generation by diffusion models, surpassing GAN models in terms of IS and FID scores.

In 2022, within the Google Research Brain team, Ho proposed the concept of CFG on the foundation of DDPM [24]. Relative to Classifier Guidance, although the training cost increases, the efficacy is superior. During training, a conditional DDPM and an unconditional DDPM are trained jointly. During sampling, the denoising noises predicted by the two DDPMs are fused using a proposed adjustment coefficient $\omega$; the size of $\omega$ determines the relative influence of the two DDPMs on the denoised noise, with a larger $\omega$ indicating a stronger influence of the conditional DDPM. The fused denoising noise $\tilde{\epsilon}_\theta(x_t, y, t)$ is applied to Gaussian noise in the reverse diffusion process as t varies from T to 1, continually denoising it to generate the reconstructed image. This method achieves a better balance between mode coverage and sample fidelity.

It evolves from the conditioned diffusion model, incorporating the gradient score component with a hyperparameter $\gamma $, formulated as in Eq. (12).

$$p(x_{t-1}\mid x_t, y) = \frac{p(x_{t-1}\mid x_t)\, e^{\gamma\, \mathrm{sim}(x_{t-1}, y)}}{Z(x_t, y)}, \qquad Z(x_t, y) = \sum_{x_{t-1}} p(x_{t-1}\mid x_t)\, e^{\gamma\, \mathrm{sim}(x_{t-1}, y)}$$

Here, $\textrm{sim}({{x_{t - 1}},y} )$ represents a certain measure of similarity or correlation between the generated result ${x_{t - 1}}$ and the condition y.

$$p(x_{t-1}\mid x_t, y) \approx \mathcal{N}\big(x_{t-1};\; \mu(x_t) + \sigma_t^2 \gamma \nabla_{x_t} \mathrm{sim}(x_t, y),\; \sigma_t^2 I\big)$$

At this juncture, the mean of ${x_{t - 1}}$ can be expressed as:

$$\mu(x_t) + \sigma_t^2 \gamma \nabla_{x_t} \log p(y\mid x_t) = \gamma\big[\mu(x_t) + \sigma_t^2 \nabla_{x_t} \log p(y\mid x_t)\big] - (\gamma - 1)\mu(x_t)$$

Subsequently, substituting the parameter $\omega = \gamma - 1$ into the scheme yields Eq. (15). The CFG paper confirmed its capability to balance the diversity and quality of generated images by adjusting the $\omega$ coefficient: as $\omega$ increases, the conditional influence on sample generation gradually intensifies, with typical values being 0.0, 0.5, and 2.0.

$$\tilde{\epsilon}_\theta(x_t, y, t) = (1 + \omega)\,\epsilon_\theta(x_t, y, t) - \omega\,\epsilon_\theta(x_t, t)$$

Equation (15) is the fundamental equation of CFG.
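In code, Eq. (15) reduces to a single fusion step. The sketch below assumes a noise-prediction network `eps_model(x_t, y, t)` that accepts `None` when the condition is dropped; this interface is an assumption for illustration, not the paper's implementation.

```python
# Sketch of the CFG noise fusion in Eq. (15); `eps_model` is an assumed
# noise-prediction network, with None standing in for "no condition".
def cfg_noise(eps_model, x_t, y, t, omega):
    eps_cond = eps_model(x_t, y, t)       # eps_theta(x_t, y, t)
    eps_uncond = eps_model(x_t, None, t)  # eps_theta(x_t, t)
    return (1 + omega) * eps_cond - omega * eps_uncond
```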

However, experiments show that simply applying CFG does not achieve the desired reconstruction of low-measurement-rate images within a relatively short number of epochs. We attribute this to the intrinsic high- and low-dimensional feature information, such as texture, present in images: jointly training CFG only with conditional images ($\epsilon_\theta(x_t, y, t)$) and unconditional images ($\epsilon_\theta(x_t, t)$) is insufficient to reconstruct high-quality images from low measurement rates in a limited number of epochs.

To address this issue, we have reverted our focus to the characteristics of the image reconstruction task. Image reconstruction fundamentally involves the probabilistic restoration of relationships between pixels. Therefore, we aim to refine the presence or absence of conditions at the pixel level, transforming ${\epsilon _\theta }({{x_t},y,t} )$ into ${\epsilon _\theta }({{x_t},1 \times y,t} )$ and ${\epsilon _\theta }({{x_t},t} )$ into ${\epsilon _\theta }({{x_t},0 \times y,t} )$. Equation (15) is adjusted to a pixel-level formulation, as in Eq. (16).

$$\tilde{\epsilon}_\theta(x_t, y_{\textrm{single-pixel}}, t) = (1 + \omega)\,\epsilon_\theta(x_t, 1 \times y_{\textrm{single-pixel}}, t) - \omega\,\epsilon_\theta(x_t, 0 \times y_{\textrm{single-pixel}}, t)$$

In this way, the joint training of conditional and unconditional images in CFG can be viewed as the joint training of individual pixels with and without conditions.

We draw inspiration from the ideologies of self-supervised training and image inpainting [26-29]. Our intention is to transform ‘conditional images’ and ‘unconditional images’ into randomly generated ‘approximately conditional pixel images’ and ‘approximately unconditional pixel images.’ This transformation intervenes in training to achieve a self-supervised effect, aiding the network in grasping the inter-pixel relationships of reconstructed images more adeptly, guided by disparate yet proximate ‘conditional information’ and ‘unconditional information.’ Moreover, since each instance of the approximate ‘conditional information’ and ‘unconditional information’ differs, the images generated each time approximate the original with a certain accuracy yet remain distinct. This lays the groundwork for a substantial enhancement through subsequent superimposed optimization, elucidated further in the experiment of Section 3.4.

At the pixel level, the formula can be derived by replacing the previous conditional coefficients ‘1’ and ‘0’ with the given probabilities $k_1 \to 1$ and $k_2 \to 0$, as shown in Eq. (17).

$$\tilde{\epsilon}_\theta(x_t, y_{\textrm{single-pixel}}, t) = (1 + \omega)\,\epsilon_\theta(x_t, k_1 \times y_{\textrm{single-pixel}}, t) - \omega\,\epsilon_\theta(x_t, k_2 \times y_{\textrm{single-pixel}}, t)$$

Here, $k_1$ and $k_2$ are the probabilities used to generate the conditional and unconditional input masks, respectively. In this way, training integrates a probability-generated approximate conditional DDPM with an approximate unconditional DDPM, which further aids the model in discerning relationships between pixels and thereby accelerates training toward the desired results.

Uniting each pixel, we obtain masks ${m_1}$ and ${m_2}$, resulting in Eq. (18).

$$\tilde{\epsilon}_\theta(x_t, y, t) = (1 + \omega)\,\epsilon_\theta(x_t, m_1 \times y, t) - \omega\,\epsilon_\theta(x_t, m_2 \times y, t)$$
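A sketch of Eq. (18) under the same assumed interface: the two masks are drawn pixel-wise from Bernoulli distributions with probabilities $k_1$ and $k_2$, producing the ‘approximately conditional’ and ‘approximately unconditional’ branches.

```python
# Sketch of the SCM-CFG fusion in Eq. (18): Bernoulli pixel masks m1, m2
# drawn with probabilities k1 and k2 gate the condition y.
import torch

def scm_cfg_noise(eps_model, x_t, y, t, omega, k1=0.7, k2=0.1):
    m1 = torch.bernoulli(torch.full_like(y, k1))  # approx. conditional mask
    m2 = torch.bernoulli(torch.full_like(y, k2))  # approx. unconditional mask
    eps_c = eps_model(x_t, m1 * y, t)
    eps_u = eps_model(x_t, m2 * y, t)
    return (1 + omega) * eps_c - omega * eps_u
```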

Equation (18) constitutes the SCM-CFG method, shown in Fig. 4. During the training process, the preliminary reconstructed condition y is obtained through the well-trained fully connected layer network. The conditional processing module (Fig. 3(2)) is then applied to y, yielding conditional information at various convolutional dimensions (blue box in conditional processing). Subsequently, the mask generated with probability $k_1$ is used to mask the conditional information, which is input into the Unet together with $x_t$ for training, producing the output $\tilde{\epsilon}_\theta(x_t, t, y)$. An approximate conditional DDPM is thus trained under the constraint of the noise loss $\nabla_\theta \|\epsilon - \tilde{\epsilon}_\theta(x_t, t, y)\|^2$.

Fig. 4. SCM-CFG flowchart
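For concreteness, one iteration of Algorithm 1 might look like the sketch below (PyTorch); `eps_model`, the precomputed $\bar{\alpha}$ schedule, and the data tensors are assumptions for illustration rather than the authors' code.

```python
# Sketch of one SCM-CFG training step (Algorithm 1): noise x0 to step t via
# Eq. (5), mask the condition with probability k1, regress the injected noise.
import torch

def train_step(eps_model, optimizer, x0, y, alpha_bar, T, k1=0.7):
    b = x0.shape[0]
    t = torch.randint(1, T + 1, (b,), device=x0.device)
    a_bar = alpha_bar[t - 1].view(b, 1, 1, 1)
    eps = torch.randn_like(x0)
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps   # Eq. (5)
    m1 = torch.bernoulli(torch.full_like(y, k1))         # conditional mask
    loss = ((eps - eps_model(x_t, m1 * y, t)) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```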

During the sampling process, the samples to be reconstructed undergo the same conditional processing to obtain conditional information of different dimensions, which is then duplicated into two copies (blue and yellow boxes in conditional processing). Following this, as in training, masks $m_1$ and $m_2$ generated with probabilities $k_1$ and $k_2$, respectively, are used to mask the blue-box and yellow-box conditional information. This yields an approximate conditional DDPM with probability $k_1$ and an approximate unconditional DDPM with probability $k_2$. The two DDPMs are merged via $\tilde{\epsilon}_\theta(x_t, y, t) = (1 + \omega)\epsilon_\theta(x_t, m_1 \times y, t) - \omega\epsilon_\theta(x_t, m_2 \times y, t)$ to obtain the denoised noise $\tilde{\epsilon}_\theta(x_t, y, t)$. Finally, in the reverse diffusion process, the update $x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}}\tilde{\epsilon}_\theta(x_t, y, t)\right) + \sigma_t z$ progressively denoises the Gaussian noise $x_t$ (as t runs from T to 1) to generate reconstructed images.
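Algorithm 2 then amounts to iterating this update from t = T down to 1; a sketch follows, reusing `scm_cfg_noise` from above and assuming precomputed schedule tensors `alpha`, `alpha_bar`, and `sigma`.

```python
# Sketch of SCM-CFG sampling (Algorithm 2): fuse masked noise predictions and
# apply the reverse-diffusion update quoted above, from t = T down to 1.
import torch

@torch.no_grad()
def sample(eps_model, y, shape, alpha, alpha_bar, sigma, T, omega,
           k1=0.7, k2=0.1):
    x_t = torch.randn(shape, device=y.device)       # start from Gaussian noise
    for t in range(T, 0, -1):
        t_b = torch.full((shape[0],), t, device=y.device)
        eps = scm_cfg_noise(eps_model, x_t, y, t_b, omega, k1, k2)
        mean = (x_t - (1 - alpha[t - 1]) / (1 - alpha_bar[t - 1]).sqrt() * eps) \
               / alpha[t - 1].sqrt()
        z = torch.randn_like(x_t) if t > 1 else torch.zeros_like(x_t)
        x_t = mean + sigma[t - 1] * z
    return x_t
```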

The pseudocode for the training and sampling processes is presented in Algorithms 1 and 2, respectively. The validation of this design, along with the optimal conditional probability $k_1$ and unconditional probability $k_2$, will be discussed and demonstrated in Experiment 3.2.


Algorithm 1. Joint training a diffusion model with SCM-CFG


Algorithm 2. Conditional sampling with SCM-CFG

3. Experiment and application

3.1 Design of condition introduction

From the diagram in Section 2.2, we have devised the network insertion points (Fig. 3(1)) and five semantic dimensions (Fig. 3(2)) for condition processing. In this subsection, through the design of eight iterative experiments, we aim to explore the optimal condition insertion method. The assessment criterion employs the average PSNR after reconstructing 40 MNIST images [35], yielding the results depicted in Fig. 5.

Fig. 5. Comparative ablation experiments with condition insertion at different positions

In Fig. 5, the curved arrows in the ‘Network Structure’ section represent the Unet, and the alphanumeric combinations within the boxes denote the insertion of feature dimension information into the corresponding positions of the Unet.

Comparing network structures 1-3 with structure 4 in Fig. 5, we observe that the downsampling path of the Unet preserves the integrity and authenticity of information better than the upsampling path; for image reconstruction tasks, it is therefore reasonable to insert conditions along the downsampling path. Structure 4 shows that conditions inserted in a low-dimensional way already yield a reconstructed image, as confirmed by the findings in DDPMGI's article. However, a comparative analysis of structures 4-8 reveals the imperfections of structure 4: extracting and inserting condition information from low to high semantic dimensions significantly enhances reconstruction effectiveness. The comparison between structures 7 and 8 further illustrates that information insertion can become redundant. This observation, from another perspective, underscores the importance of judiciously and moderately incorporating feature information through ResNet-like methods, tailored to the requirements of the task.

Through ablation experiments, it is validated that inserting conditions from network downsampling into different semantic dimensions can better adapt to image reconstruction tasks. This also indicates that, with evolving task requirements, the handling of conditions needs to be adjusted reasonably.

3.2 Verification and optimization of SCM-CFG mask

In Section 2.3, we unveil the formula derivation and theoretical exposition of SCM-CFG. In this section, we shall substantiate the efficacy and eminence of this design by modulating the mask coefficients, ${k_1}$ and ${k_2}$, as delineated in Table 1. Subsequently, we shall ascertain the most judicious values for ${k_1}$ and ${k_2}$. The experimental findings are encapsulated in Table 2.


Table 1. The comparison of different models confirms


Table 2. Optimal mask factor selection

It should be noted that, inspired by the original CFG text and incorporating Eq. (15), we selected typical values for $\omega $, namely, $\omega = 0.0$, $\omega = 0.5$ and $\omega = 2.0$. These values represent the effect of conditions on the reconstruction process from small to large. With epoch set to 20, we averaged the Peak Signal-to-Noise Ratio (PSNR) as the evaluation metric for the reconstruction effects, validating the method's superiority and efficiency in shorter training epochs.
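For reference, the evaluation metric can be computed as in the short helper below, assuming images normalized to [0, 1].

```python
# PSNR helper matching the evaluation metric used here (images in [0, 1]).
import numpy as np

def psnr(ref, rec, peak=1.0):
    mse = np.mean((np.asarray(ref) - np.asarray(rec)) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)
```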

It is important to note that in Table 1, according to Eq. (17), due to the design of our mask coefficients, Eq. (17) can be transformed into CFG, conditional DDPM, and DDPM. In other words, when ${k_1}$=0.0, ${k_2}$=0.0, the denoising noise derived from Eq. (17) is ${\epsilon _\theta }({{x_t},t} )$, representing the unconditional diffusion model. Similarly, the derivation of Eq. (17) results in the masked conditional diffusion model for case 2, the conditional diffusion model for case 3, CFG for case 4, and our designed CFG with ${k_1}$=0.7, ${k_2}$=0.1 for case 5. Additionally, since the denoising noises derived for 1, 2, and 3 are independent of $\omega $, the PSNR should be essentially the same for different $\omega $ values.

By comparing entries 1, 2, and 3 in Table 1, we observe that the unconditional DDPM fails to reconstruct the corresponding images, and the conditional DDPM does not achieve satisfactory results in a short number of epochs. It is noteworthy that entry 2 indicates our mask design can also aid in the training of the conditional DDPM. Comparing entries 2, 4, and 5 in Table 1, we find that the original CFG method fails to achieve reconstruction results in a short number of epochs. However, the results obtained with our SCM-CFG method show good performance in a short epoch duration, and the effect of entry 5, using the designed CFG with ${k_1}$=0.7, ${k_2}$=0.1, surpasses the masked conditional DDPM effect in entry 2. This strongly illustrates the efficiency and superiority of our proposed method.

Further comparing entries 1, 2, and 3 with entries 1, 4, and 5 in Table 2, we can determine the most stable mask coefficients $k_1$ and $k_2$. Comparing the average PSNR, it is evident that $k_1 = 0.7$ and $k_2 = 0.1$ achieve a better and more stable reconstruction effect. During training, SCM-CFG uses a mask generated with $k_1 = 0.7$ to mask the conditional variable y, and an approximate conditional DDPM is trained under the noise loss $\nabla_\theta \|\epsilon - \tilde{\epsilon}_\theta(x_t, t, y)\|^2$. During sampling, the n samples to be reconstructed are duplicated into two sets, $n_1$ and $n_2$ (with $n_1 = n_2$), and the conditions for the $n_1$ and $n_2$ samples are masked using $m_1$ generated with probability $k_1 = 0.7$ and $m_2$ generated with probability $k_2 = 0.1$, respectively. This yields an approximate conditional DDPM with probability $k_1 = 0.7$ and an approximate unconditional DDPM with probability $k_2 = 0.1$. The two are fused through $\tilde{\epsilon}_\theta(x_t, y, t) = (1 + \omega)\epsilon_\theta(x_t, m_1 \times y, t) - \omega\epsilon_\theta(x_t, m_2 \times y, t)$, producing the denoising noise $\tilde{\epsilon}_\theta(x_t, y, t)$. Finally, applying this denoising noise stepwise in the reverse diffusion process to the Gaussian noise $X_T$ generates the denoised reconstructed images.

This section validates the rationality and superiority of the SCM-CFG method, selecting the optimal mask adjustment coefficients and providing theoretical support for subsequent experiments.

3.3 Single pixel reconstruction experiment

3.3.1 Reconstruction experiment at low measurement rate

In this section, we train SCM-CFG on the MNIST dataset to explore the reconstruction effects at low measurement rates across different epochs. During reconstruction, the average peak signal-to-noise ratio (PSNR) over 40 test images is calculated, with results presented in Table 3. We then showcase the best training results for different measurement rates with the optimal masks in Fig. 6.

Fig. 6. Reconstruction results for the MNIST dataset, Epoch = 80, with mask adjustment coefficients k1 = 0.7, k2 = 0.1.


Table 3. SCM-CFG reconstruction of different epochs at low measurement rates in the MNIST dataset

Firstly, from the PSNR values at different measurement rates in Table 3, it can be observed that the reconstruction PSNR gradually improves as the measurement rate increases; the model performs exceptionally well at low measurement rates, especially below 0.1. Additionally, analyzing the different epochs in Table 3, even with only 20 epochs the model achieves satisfactory reconstruction, efficiently completing the task of rapidly reconstructing accurate images from low measurement values. Comparing epochs 60, 70, and 80, the PSNR stabilizes between 70 and 80 epochs, indicating that the performance gain has converged. We therefore use epoch 80 when comparing against different networks in Section 3.3.2 to validate its superiority.

In conclusion, our method effectively accomplishes the task of rapidly reconstructing high-quality images under low measurement rates. Furthermore, with the increase in epochs, the fully trained network exhibits even more outstanding performance.

3.3.2 Reconstructed comparison under different datasets

We select a measurement rate of 0.1 and conduct training on the CelebA, MNIST, and Flower datasets [36], exploring the reconstruction capabilities of the SCM-CFG model for images of different sizes and datasets.

The link to the CelebA dataset is: https://mmlab.ie.cuhk.edu.hk/projects/CelebA.html. The flower dataset originates from a project on Kaggle, and the dataset source is: https://www.kaggle.com/datasets/alxmamaev/flowers-recognition. It includes images of five different types of flowers: daisies, dandelions, roses, sunflowers, and tulips, totaling 3539 images.

It should be noted that Tables 4 to 6 present experimental results comparing the CelebA, Flower, and MNIST datasets, respectively. In Table 4, the CelebA training set comprises 20,000 images of size 64 × 64, with a batch size of 64 and 20 epochs (epochs are less than or equal to those of other network training). In Table 5, the Flower training set consists of 3,539 images, also with a size of 64 × 64, a batch size of 64, SCM-CFG's T set to 1,000 (consistent with the DDPMGI method), and 20 epochs (epochs are less than or equal to those of other network training). In Table 6, the MNIST dataset uses the original training set, resized to 32 × 32, with a batch size of 64 and 80 epochs (epochs are less than or equal to those of other network training). Figure 7 displays the optimal results for CelebA, Flower, and MNIST for visual comparison.

Fig. 7. Optimal reconstruction results of the CelebA, Flower, and MNIST datasets at MR = 0.1.


Table 4. Different network reconstruction effects of CelebA dataset with MR = 0.1


Table 5. Different network reconstruction effects of Flower dataset with MR = 0.1


Table 6. Different network reconstruction effects of MNIST dataset with MR = 0.1

From the comparison of CelebA test data in Table 4 (entries 1-6), it is evident that our model continues to perform well in reconstructing complex textured images, surpassing many current methods. Moreover, in comparison with the state-of-the-art single-pixel reconstruction method MPIGAN+, SCM-CFG achieves an improvement of 0.201 dB. It is worth noting that for images of size 64 × 64, the mask coefficients need slight adjustments to ${k_1}$=0.6, ${k_2}$=0.1 to achieve the optimal mask for this specific size.

Analyzing entries 1-5 in Table 5, it is evident that our model achieves significant improvement in reconstructing complex Flower images compared to conventional ghost imaging methods. When compared to the recent DDPMGI method, maintaining consistent training parameters and averaging results for test images F1-F4, SCM-CFG demonstrates a notable average PSNR increase of 1.09 dB. Furthermore, the fine-tuned coefficients of ${k_1}$=0.6 and ${k_2}$=0.1 once again achieved favorable results. This indicates that there is a certain relationship between mask coefficients and image size. However, once adjusted appropriately, these coefficients exhibit good generalization across different datasets with the same size.

Analyzing entries 1-5 in Table 6, it is evident that our method significantly improves the reconstruction performance for small-sized images with simple textures. Compared to the latest MPIGAN network, which achieves 25.79 dB, our method demonstrates an additional improvement of 0.38 dB. This reaffirms the superiority of our approach.

In conclusion, this section conducts experimental comparisons on images from different datasets, demonstrating the dataset generalizability of the SCM-CFG method and its superior generalization and reconstruction capabilities for images of varying sizes and texture complexities.

3.4 Overlay optimization based on mask design

In Section 2.3, we derived CFG and, building upon its foundation, refined it at the pixel level. This refinement empowers the model to grasp the intricacies between pixels more effectively during training. Since each probabilistically generated ‘approximate conditional information’ and ‘approximate unconditional information’ differs each time, each generated image aligns with the original with varying accuracy, reflecting from a different perspective the potential of the conditional information to reconstruct the original image. The superimposition of these images therefore enhances determinism, leading to superior outcomes.

The formulaic expression of this approach is given in Eq. (19): averaging over samples reduces diversity and augments determinism through the amalgamation process.

$$\bar{x} = \int p(x_i \mid y)\, x_i \, dx_i$$
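In practice, Eq. (19) is approximated by a Monte Carlo average over independently sampled reconstructions; a sketch follows, reusing the `sample` function assumed in the Section 2.3 sketch (with `sched` bundling its schedule tensors).

```python
# Sketch of overlay optimization (Eq. (19)): average several independent
# samples from p(x | y). `sample` and `sched` follow the Section 2.3 sketch.
import torch

def overlay_reconstruct(eps_model, y, shape, sched, T, omega, n_overlay=8):
    recs = [sample(eps_model, y, shape, *sched, T=T, omega=omega)
            for _ in range(n_overlay)]
    return torch.stack(recs).mean(dim=0)   # Monte Carlo estimate of Eq. (19)
```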

The contrast between the two sets of eight images in Fig. 8 offers a vivid demonstration. For the dandelion image, DDPMGI rises from 19.26 dB for a single image to 19.68 dB with two superimpositions, 20.08 dB with four, and 20.19 dB with eight. SCM-CFG, by contrast, rises from 21.82 dB for a single image to 27.89 dB with two superimpositions, 28.16 dB with four, and 28.52 dB with eight. To verify that this phenomenon is not incidental, we introduce a sunflower image, which rises from 21.62 dB to 28.88 dB with two superimpositions, 29.40 dB with four, and 29.57 dB with eight. The average PSNR improvement for the two images is an impressive 7.3 dB.

Fig. 8. Demonstration of enhanced PSNR performance through superposition

In conclusion, due to our meticulous consideration of pixel relationships during the initial mask design, our approach has achieved more pronounced effects during superimposition.

4. Real experiment

To further validate the effectiveness of the SCM-CFG method, we conducted a physical experiment based on SPI, as shown in Fig. 1. Dim collimated light from LEDs served as the light source, passed through a mask to create a pattern, underwent DMD modulation, and was then input to a PMT. The measured values were subsequently processed by an FPGA to obtain reconstructed images using a reconstruction algorithm. It's important to note that the reconstructed image size in real experiments is 64 × 64 pixels. To achieve clearer imaging, the frame rate of the DMD is set to 2 frames per second, with the total frames being the measurement rate × 64 × 64. As the measurement rate increases from 0.03 to 0.1, the data acquisition time increases from 61 seconds to 205 seconds. After obtaining the conditional data, we input it into the network for training, and after approximately 15 epochs, a good reconstruction effect can be achieved. Based on the denoised noise obtained from the network training, we conduct the reverse diffusion process of SCM-CFG, continuously denoising to generate images. The parameter T for the reverse diffusion process is 1000, so the time to generate reconstructed images from Gaussian noise is approximately 15 seconds.
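The quoted acquisition times follow directly from the frame counts; a quick check:

```python
# Acquisition-time check: total frames = MR x 64 x 64, at 2 DMD frames/s.
for mr in (0.03, 0.1):
    frames = mr * 64 * 64
    print(f"MR={mr}: {frames:.0f} frames -> {frames / 2:.0f} s")
# MR=0.03: 123 frames -> 61 s
# MR=0.1: 410 frames -> 205 s
```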

Figure 9 demonstrates the performance of SCM-CFG at measurement rates of 0.03-0.1 and compares it with other traditional methods. Despite the presence of environmental noise and other disturbances during the actual experiment, SCM-CFG leveraged the advantages of generative models, resulting in reconstructed images with fewer artifacts and clearer outlines compared to other methods. The practical experiment showed significant improvement with SCM-CFG, highlighting its efficacy in real-world scenarios.

Fig. 9. Physical experimental results of SCM-CFG at different measurement rates

5. Conclusion

In this paper, we introduced a diffusion model into single-pixel reconstruction and devised a self-supervised conditional mask for CFG, termed SCM-CFG. This lightweight application of the CFG model in the SPI domain addresses the end-to-end mechanical replication of original images during single-pixel image reconstruction.

Through ablation experiments, we identified a reasonable condition-insertion approach for image reconstruction tasks. We improved CFG at the pixel level to enhance the model's ability to capture relationships between pixels, and validated the effectiveness and superiority of this method. Simulations and physical experiments further demonstrated that this approach enables rapid, high-quality single-pixel image reconstruction at low measurement rates, with excellent dataset generalization. At a 10% sampling rate, SCM-CFG achieved an average PSNR of 26.17 dB on the MNIST dataset, surpassing existing methods, with similar advantages on other datasets. Moreover, by refining CFG at the pixel level, the self-supervised mask design significantly accelerated the training process, and the combination of masks and the generative paradigm substantially improved later-stage overlay optimization, leading to an average PSNR improvement of 7.3 dB. At this stage, the preprocessing part of this work awaits tighter integration with the reconstruction algorithm, and the design of the mask coefficients could be more diverse. We acknowledge these limitations and plan to address them in future research, aiming to enhance the accuracy and robustness of this method.

Funding

This research was funded by the West Light Foundation of the Chinese Academy of Sciences, grant No. XAB2021YN15.

Disclosures

The authors declare no conflicts of interest.

Data availability

Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request.

References

1. B. S. Kashin, “Diameters of some finite-dimensional sets and classes of smooth functions,” Izvestiya Rossiiskoi Akademii Nauk, Seriya Matematicheskaya 11(2), 317–333 (1977).

2. E. J. Candes, Justin K. Romberg, Terence Tao, et al., “Stable signal recovery from incomplete and inaccurate measurements,” Commun. Pure Appl. Math. 59(8), 1207–1223 (2006).

3. D. L. Donoho, “Compressed sensing,” IEEE Trans. Inf. Theory 52(4), 1289–1306 (2006).

4. A. Gatti, E. Brambilla, M. Bache, et al., “Ghost imaging with thermal light: comparing entanglement and classical correlation,” Phys. Rev. Lett. 93(9), 093602 (2004).

5. M. F. Duarte, Mark A. Davenport, Dharmpal Takhar, et al., “Single-pixel imaging via compressive sampling,” IEEE Signal Process. Mag. 25(2), 83–91 (2008).

6. J. Romberg, “Imaging via compressive sampling,” IEEE Signal Process. Mag. 25(2), 14–20 (2008).

7. F. Ferri, D. Magatti, L. A. Lugiato, et al., “Differential ghost imaging,” Phys. Rev. Lett. 104(25), 253603 (2010).

8. W.-K. Yu, Xue-Feng Liu, Xu-Ri Yao, et al., “Single photon counting imaging system via compressive sensing,” arXiv, arXiv:1202.5866 (2012).

9. W. Becker, A. Bergmann, M. A. Hink, et al., “Fluorescence lifetime imaging by time-correlated single-photon counting,” Microsc. Res. Tech. 63(1), 58–66 (2004).

10. V. Studer, Jérome Bobin, Makhlad Chahid, et al., “Compressive fluorescence microscopy for biological and hyperspectral imaging,” Proc. Natl. Acad. Sci. 109(26), E1679–E1687 (2012).

11. X.-F. Liu, Wen-Kai Yu, Xu-Ri Yao, et al., “Measurement dimensions compressed spectral imaging with a single point detector,” Opt. Commun. 365, 173–179 (2016).

12. J. D. Usala, Adrian Maag, Thomas Nelis, et al., “Compressed sensing spectral imaging for plasma optical emission spectroscopy,” J. Anal. At. Spectrom. 31(11), 2198–2206 (2016).

13. B. Li, Qiu-Rong Yan, Yi-Fan Wang, et al., “A binary sampling Res2net reconstruction network for single-pixel imaging,” Rev. Sci. Instrum. 91(3), 1 (2020).

14. S. Sun, Qiurong Yan, Yongjian Zheng, et al., “Single pixel imaging based on generative adversarial network optimized with multiple prior information,” IEEE Photonics J. 14(4), 1–10 (2022).

15. K. Kulkarni, Suhas Lohit, Pavan Turaga, et al., “ReconNet: Non-iterative reconstruction of images from compressively sensed measurements,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (IEEE, 2016), pp. 449–458.

16. X. Xie, Yuxiang Wang, Guangming Shi, et al., “Adaptive measurement network for CS image reconstruction,” in Computer Vision: Second CCF Chinese Conference, CCCV 2017, Tianjin, China, October 11–14, 2017, Proceedings, Part II (Springer, 2017), pp. 407–417.

17. A. Creswell, Tom White, Vincent Dumoulin, et al., “Generative adversarial networks: An overview,” IEEE Signal Process. Mag. 35(1), 53–65 (2018).

18. F. Wang, Hao Wang, Haichao Wang, et al., “Learning from simulation: An end-to-end deep-learning approach for computational ghost imaging,” Opt. Express 27(18), 25560–25572 (2019).

19. H. Yao, Feng Dai, Shiliang Zhang, et al., “DR2-Net: Deep residual reconstruction network for image compressive sensing,” Neurocomputing 359, 483–493 (2019).

20. R. Zhu, Hong Yu, Zhijie Tan, et al., “Ghost imaging based on Y-net: A dynamic coding and decoding approach,” Opt. Express 28(12), 17556–17569 (2020).

21. J. An and S. Cho, “Variational autoencoder based anomaly detection using reconstruction probability,” Special Lecture on IE 2, 1–18 (2015).

22. A. Brock, Jeff Donahue, Karen Simonyan, et al., “Large scale GAN training for high fidelity natural image synthesis,” arXiv, arXiv:1809.11096 (2018).

23. T. Miyato, Toshiki Kataoka, Masanori Koyama, et al., “Spectral normalization for generative adversarial networks,” arXiv, arXiv:1802.05957 (2018).

24. J. Ho and T. Salimans, “Classifier-free diffusion guidance,” arXiv, arXiv:2207.12598 (2022).

25. S. Mao, Yuchen He, Hui Chen, et al., “High-quality and high-diversity conditionally generative ghost imaging based on denoising diffusion probabilistic model,” Opt. Express 31(15), 25104–25116 (2023).

26. A. Jaiswal, Ashwin Ramesh Babu, Mohammad Zaki Zadeh, et al., “A survey on contrastive self-supervised learning,” Technologies 9(1), 2 (2020).

27. R. Balestriero, Mark Ibrahim, Vlad Sobal, et al., “A cookbook of self-supervised learning,” arXiv, arXiv:2304.12210 (2023).

28. J. Gui, Tuo Chen, Jing Zhang, et al., “A survey of self-supervised learning from multiple perspectives: Algorithms, theory, applications and future trends,” arXiv, arXiv:2301.05712 (2023).

29. A. Lugmayr, Martin Danelljan, Andres Romero, et al., “RePaint: Inpainting using denoising diffusion probabilistic models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (IEEE, 2022), pp. 11461–11471.

30. B. Sun, M. P. Edgar, R. Bowman, et al., “3D computational imaging with single-pixel detectors,” Science 340(6134), 844–847 (2013).

31. W.-K. Yu, Xu-Ri Yao, Xue-Feng Liu, et al., “Three-dimensional single-pixel compressive reflectivity imaging based on complementary modulation,” Appl. Opt. 54(3), 363–367 (2015).

32. O. S. Magaña-Loaiza and R. W. Boyd, “Quantum imaging and information,” Rep. Prog. Phys. 82(12), 124401 (2019).

33. Q.-R. Yan, Hui Wang, Cheng-Long Yuan, et al., “Large-area single photon compressive imaging based on multiple micro-mirrors combination imaging method,” Opt. Express 26(15), 19080–19090 (2018).

34. P. Dhariwal and A. Nichol, “Diffusion models beat GANs on image synthesis,” Advances in Neural Information Processing Systems 34, 8780–8794 (2021).

35. Y. LeCun, L. Bottou, Y. Bengio, et al., “Gradient-based learning applied to document recognition,” Proc. IEEE 86(11), 2278–2324 (1998).

36. Z. Liu, Ping Luo, Xiaogang Wang, et al., “Deep learning face attributes in the wild,” in Proceedings of the IEEE International Conference on Computer Vision (IEEE, 2015), pp. 3730–3738.

37. X. Hu, Jinli Suo, Tao Yue, et al., “Patch-primitive driven compressive ghost imaging,” Opt. Express 23(9), 11092–11104 (2015).

38. C. Saharia, Jonathan Ho, William Chan, et al., “Image super-resolution via iterative refinement,” IEEE Trans. Pattern Anal. Mach. Intell. 45, 1–14 (2022).

39. C. Zhang, Jiaxuan Zhou, Jun Tang, et al., “Deep unfolding for singular value decomposition compressed ghost imaging,” Appl. Phys. B 128(10), 185 (2022).

40. M. Lyu, Wei Wang, Hao Wang, et al., “Deep-learning-based ghost imaging,” Sci. Rep. 7(1), 17865 (2017).
