
CoCoCs: co-optimized compressive imaging driven by high-level vision

Open Access

Abstract

Compressive imaging senses optically encoded high-dimensional scene data with far fewer measurements and then performs reconstruction via appropriate algorithms. In this paper, we present a novel noniterative end-to-end deep learning-based framework for compressive imaging, dubbed CoCoCs. In comparison to existing approaches, we extend the pipeline by co-optimizing the recovery algorithm with optical coding as well as cascaded high-level computer vision tasks to boost the quality of the reconstruction. We demonstrate the proposed framework on two typical compressive imaging systems, i.e., single pixel imaging and snapshot video compressive imaging. Extensive results, including conventional image quality criteria, mean opinion scores, and accuracy in image classification and motion recognition, confirm that CoCoCs can yield realistic images and videos that are friendly to both human viewing and computer vision. We hope CoCoCs will give impetus to bridging the gap between compressive imagers, computer vision, and human perception.

© 2022 Optica Publishing Group under the terms of the Optica Open Access Publishing Agreement

1. Introduction

Inspired by compressive sensing (CS) [1,2], CS imaging optically encodes high-dimensional scene data into low-dimensional optoelectronic measurements and then computationally recovers the desired information. Recent years have witnessed spectacular developments in CS imaging systems, such as hyperspectral imaging [3–5], holography [6,7], tomography [8], ultrafast imaging [9–11], single pixel imaging (SPI) [12–14], and snapshot video compressive imaging (SCI) [15,16]. In this work, we focus on co-optimizing the whole pipeline of CS imaging, aiming to rebuild photorealistic scenes and bridge the gap between CS imaging and high-level vision tasks, as well as human perception.

Although previous methods for CS imaging recovery have achieved decent results, only a small part of the visual perception pipeline from target scene to perception (Fig. 1) is taken into account. First, the destination of the reconstructed images and videos must be considered. In essence, the purpose of an imaging system is to obtain visual information for human viewing or for computer vision (CV) tasks. Existing methods pursue higher numerical similarity between the reconstruction and the original scene, measured mainly by the peak signal-to-noise ratio (PSNR) [17] and structural similarity (SSIM) [17]. Yet whether the reconstruction is vision friendly, that is, perceptually convincing for human viewing, and whether it retains adequate information for CV tasks, remains to be explored. Second, in CS imaging the optical encoding strategy determines the information sampled from the scene, which can significantly affect the reconstruction step. Previous approaches usually rely on suboptimal pseudo-random coding and the corresponding measurements, which limits further improvement.

Fig. 1. In (a), the optimization range of different types of CS imaging system is illustrated. The exemplar single-pixel imaging results in (b) show that the proposed framework is capable of achieving high-quality reconstruction.

In this paper, keeping the above analysis in mind, we propose an end-to-end framework for CS imaging, involving co-optimized optical coding and co-optimization with high-level CV tasks, which we refer to as CoCoCs. To this end, we design a framework comprising a trainable sampling matrix, a trainable inversion, and a flexible reconstruction network. We introduce a perceptual loss and train the network in an adversarial setup to enhance both the performance in CV tasks and the visual quality. We demonstrate the proposed framework on SPI and SCI, two of the most typical CS imaging systems. To summarize, this paper makes the following contributions.

We propose CoCoCs, a novel noniterative end-to-end deep-learning-based framework for CS imaging, which aims to address two main challenges: recovering high-quality photorealistic images and bridging the gap between CS imaging, CV tasks, and human perception.

Extensive results have shown that CoCoCs compares favorably to baselines on PSNR and SSIM. More importantly, it not only generates more visually appealing results, but also substantially improves the accuracy of visual recognition tasks with reconstructions as inputs. To the best of our knowledge, this is the first attempt to evaluate CS imaging algorithms from the perspective of the mean opinion score (MOS) test and CV tasks.

We demonstrate the proposed framework on two typical CS imaging systems, i.e., single-pixel imaging and snapshot video compressive imaging, to display its flexibility and adaptivity. We built an SCI prototype to conduct hardware experiments to evaluate the performance of CoCoCs on real data.

2. Related works

Single pixel imaging and snapshot video compressive imaging. In SPI [9,10], a static 2D target scene is modulated by dynamic masks and spatially integrated onto a single detector without spatial resolution, generating a 1D temporal waveform that is further processed to retrieve the image. SCI [15,16] makes use of a 2D snapshot to acquire a 3D video cube (2D spatial coordinates and time): the dynamic scene is encoded by a series of time-varying masks and then temporally integrated by an image sensor to form a snapshot measurement, from which the original scene can be recovered by CS reconstruction algorithms.

Coding strategies and CS reconstruction algorithms. Typical CS imaging systems employ a pseudo-random coding strategy, so corresponding algorithms are needed for reconstruction. For SPI, conventional iteration-based algorithms such as AMP [18,19] and TVAL3 [20] are well established but time-consuming and rarely achieve outstanding results. Inspired by the powerful learning capability of CNNs, deep-learning-based algorithms such as DCAN [21] and ReconNet [22] have been proposed. Algorithms for SCI share a similar taxonomy, i.e., iteration-based ones like TwIST [23], GAP-TV [24], and DeSCI [25], and deep-learning-based ones [26–28]. In addition to pseudo-random coding, valuable attempts have been made to find better coding strategies, including orthogonal bases and learned codings. For orthogonal bases, such as Fourier [29–32] and Hadamard [32–34], multiplexing is needed to acquire the different components in the transform domain, which reduces efficiency. There are also learned codings such as dictionary learning [35] and learned patterns [21,36,37]. However, the co-optimization of coding, reconstruction, and high-level CV tasks, as well as human vision, has not been explored.

Co-optimization in the vision pipeline. Recent years have witnessed exciting progress in high-level computer vision, including image classification, object detection, and motion recognition. To further improve performance, low-level image processing steps, such as denoising [38], image enhancement [39], and resizing [40], have also been reconfigured toward CV tasks. However, CS imaging has not yet enjoyed the benefits of these developments.

Vision perceptual quality and evaluation. In most works on CS imaging, the quality of the results is mainly evaluated by measuring the pixel-wise distortion between the reconstruction and the ground truth, e.g., PSNR and SSIM [17]. However, recent studies show that these metrics are not sufficient to reveal perceptual quality and may even conflict with it [41]. Compared to these criteria, the MOS quantitatively measures the human-perceived overall quality of media such as images and videos [42–44]. In an MOS test, raters are invited to watch the media and assign a score from 1 (bad) to 5 (excellent), and the scores are then averaged to form an assessment of the quality. However, to the best of our knowledge, MOS has not yet been applied to CS imaging evaluation.

3. Mathematical forward models

Mathematically, the general vectorized forward model for CS imaging is defined by

$$\boldsymbol{y}=\boldsymbol{\Phi}\boldsymbol{x}+\boldsymbol{g},$$
where measurement vector $\boldsymbol {y} \in \mathbb {R} ^ {N \times 1}$, measurement matrix $\boldsymbol {\Phi } \in \mathbb {R} ^ {N \times M}$, target $\boldsymbol {x} \in \mathbb {R} ^ {M \times 1}$ and noise $\boldsymbol {g} \in \mathbb {R} ^ {N \times 1}$. In CS imaging, we have $N < M$, which means the high-dimensional target $\boldsymbol {x}$ is to be recovered from the low-dimensional measurement $\boldsymbol {y}$ and the compressive sampling rate is $CSR \triangleq N/M$.
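For intuition, the forward model of Eq. (1) and the CSR can be simulated in a few lines. The NumPy sketch below uses arbitrary illustrative sizes and a generic random measurement matrix, not the paper's learned coding.

```python
import numpy as np

# Minimal sketch of the general forward model y = Phi x + g in Eq. (1).
# The sizes and the Gaussian noise level are illustrative, not values from the paper.
M, N = 4096, 256                              # signal length and number of measurements (N < M)
Phi = np.random.randn(N, M) / np.sqrt(M)      # generic random measurement matrix
x = np.random.rand(M)                         # vectorized target scene
g = 0.01 * np.random.randn(N)                 # additive measurement noise
y = Phi @ x + g                               # compressive measurement

csr = N / M                                   # compressive sampling rate, CSR = N / M
print(f"CSR = {csr:.4f}")                     # 0.0625 for this choice of N and M
```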

3.1 Forward model of SPI

In SPI, the 2D static scene is encoded by time-variant patterns, then spatially integrated and measured by a single-pixel detector to form a 1D time-serial measurement $\boldsymbol {Y}^{s}$. Therefore, $\boldsymbol {y}^{s} = \boldsymbol {Y}^{s} \in \mathbb {R} ^ {n_t \times 1}$ with $n_t$ temporal measurements, and

$$\boldsymbol{\Phi}^{s} = \left[ vec(\boldsymbol{C}_1),\ldots,vec(\boldsymbol{C}_{n_t}) \right]^{\top},$$
$$\boldsymbol{x}^{s} = vec(\boldsymbol{X}_0),$$
where $\boldsymbol {X}_0 \in \mathbb {R} ^ {n_x \times n_y}$ is the static target image of the scene and $\boldsymbol {C}_k \in \mathbb {R} ^ {n_x \times n_y}$ is the $k$th one of the series of time-variant masks $\boldsymbol {C} \in \mathbb {R} ^ {n_x \times n_y \times n_t}$ with $n_t$ patterns. In this case, $M = n_x n_y$ and $N = n_t$, thus $CSR = n_t/{n_x n_y}$.
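As a concrete, hypothetical illustration of Eqs. (2)–(3), the sketch below builds $\boldsymbol{\Phi}^{s}$ by stacking vectorized random binary masks and simulates a noiseless single-pixel measurement; the sizes are placeholders, not values from the paper.

```python
import numpy as np

# Hypothetical sizes for illustration only.
nx, ny, nt = 128, 128, 1024                   # image size and number of coded patterns
C = np.random.randint(0, 2, size=(nx, ny, nt)).astype(float)  # binary time-variant masks
X0 = np.random.rand(nx, ny)                   # static target image

# Phi^s stacks each vectorized mask as a row, shape (nt, nx*ny), as in Eq. (2).
Phi_s = C.reshape(nx * ny, nt).T
x_s = X0.reshape(-1)                          # vec(X0), Eq. (3)

# Each single-pixel reading is the spatial integral of the masked scene.
y_s = Phi_s @ x_s                             # same as [np.sum(C[..., k] * X0) for k in range(nt)]

print(y_s.shape, nt / (nx * ny))              # (1024,) and CSR = nt / (nx*ny) = 0.0625
```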

3.2 Forward model of SCI

In SCI, the scene video cube is encoded frame by frame and then temporally summed to form a single-shot measurement $\boldsymbol {Y}^{v} \in \mathbb {R} ^ {n_x \times n_y}$ and $\boldsymbol {y}^{v} = vec(\boldsymbol {Y}^{v})$ is its vectorized form, and

$$\boldsymbol{\Phi}^{v} = \left[ diag(vec(\boldsymbol{C}_1)),\ldots,diag(vec(\boldsymbol{C}_{n_t})) \right],$$
$$\boldsymbol{x}^{v} = \left[ {\boldsymbol{x}_1^{v}}^{\top},\ldots,{\boldsymbol{x}_{n_t}^{v}}^{\top} \right]^{\top},$$
where $\boldsymbol {x}_k^{v} = vec(\boldsymbol {X}_k)$, $\boldsymbol {X}_k,\boldsymbol {C}_k \in \mathbb {R} ^ {n_x \times n_y}$ are frames in the 3D cube of the scene video $\boldsymbol {X}^{v} \in \mathbb {R} ^ {n_x \times n_y \times n_t}$ and coding $\boldsymbol {C} \in \mathbb {R} ^ {n_x \times n_y \times n_t}$. Therefore, $M = n_x n_y n_t$, $N = n_x n_y$, and then $CSR = 1/n_t$.
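Because $\boldsymbol{\Phi}^{v}$ in Eq. (4) is a row of diagonal blocks, the SCI measurement never needs to be formed as an explicit matrix product. A small NumPy sketch (with illustrative sizes) of the equivalent mask-and-sum operation:

```python
import numpy as np

# Hypothetical sizes; nt frames are compressed into a single snapshot.
nx, ny, nt = 256, 256, 8
C = np.random.randint(0, 2, size=(nx, ny, nt)).astype(float)  # per-frame coding masks
X = np.random.rand(nx, ny, nt)                # scene video cube

# Since Phi^v concatenates diagonal blocks, Phi^v x^v reduces to an element-wise
# modulation followed by a temporal sum.
Y_v = np.sum(C * X, axis=-1)                  # snapshot measurement, shape (nx, ny)
y_v = Y_v.reshape(-1)                         # its vectorized form

print(Y_v.shape, 1 / nt)                      # (256, 256) and CSR = 1/nt = 0.125
```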

4. CoCoCs framework

In this section, the CoCoCs framework is described. The overall frameworks for SPI and SCI are mostly the same, with minor differences in detail due to the two hardware systems. Therefore, CoCoCs for SPI and SCI are introduced in sub-sections 4.1 and 4.2, respectively. Each consists of a description of the basic model, the distance loss, the perceptual loss, the adversarial loss, and the total losses for the generator and the discriminator. For SCI, a temporal consistency loss is additionally included to account for the characteristics of video data. The symbols for SPI and SCI are distinguished by the superscripts $s$ and $v$; for example, $\boldsymbol {y}^{s}$ represents the measurement in SPI and $\boldsymbol {y}^{v}$ the measurement in SCI.

4.1 CoCoCs for SPI

As illustrated in Fig. 2(a), we model the sampling, inversion, and reconstruction in an end-to-end framework. The measurement $\boldsymbol {y}^{s}$ is mapped to an intermediate result via the trainable inversion $\boldsymbol {\Psi }^{s}$, which is then fed into a fully convolutional reconstruction network $R$ to produce the final result $\boldsymbol {\hat {x}}^{s}$. The model is jointly trained in an adversarial setup. To this end, we regard the process from the target scene $\boldsymbol {x}^{s}$ to the final output $\boldsymbol {\hat {x}}^{s}$ as a generator $G$ with $\boldsymbol {\Phi }^{s}$, $\boldsymbol {\Psi }^{s}$ and the weights in $R$ as its trainable parameters:

$$\boldsymbol{\hat{x}}^{s} = R(\boldsymbol{\Psi}^{s}\boldsymbol{y}^{s}) = R(\boldsymbol{\Psi}^{s}\boldsymbol{\Phi}^{s}\boldsymbol{x}^{s}) = G(\boldsymbol{x}^{s}).$$

Here we adopt a U-Net [45] as the reconstruction network $R$ (please see Supplement 1 for details), and the trainable inversion $\boldsymbol {\Psi }^{s}$ is initialized by the Moore-Penrose pseudoinverse [46] of $\boldsymbol {\Phi }^{s}$. A weighted combination of the distance loss, perceptual loss, and adversarial loss is used for the generator.
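A minimal PyTorch sketch of this sampling-inversion-reconstruction chain is given below. It is an assumption-laden illustration: `SPIGenerator` and its default sizes are placeholders of our own, and `recon_net` merely stands in for the U-Net $R$ detailed in Supplement 1.

```python
import torch
import torch.nn as nn

class SPIGenerator(nn.Module):
    """Sketch of the SPI generator G: trainable coding Phi, trainable inversion Psi
    (initialized by the Moore-Penrose pseudoinverse of Phi), and a reconstruction
    network R. `recon_net` stands in for the U-Net of Supplement 1; sizes are
    illustrative placeholders."""
    def __init__(self, nx=128, ny=128, nt=1024, recon_net=None):
        super().__init__()
        M, N = nx * ny, nt
        phi0 = torch.randn(N, M) / M ** 0.5               # pseudo-random initialization
        self.Phi = nn.Parameter(phi0)                     # trainable sampling matrix
        self.Psi = nn.Parameter(torch.linalg.pinv(phi0))  # trainable inversion, pinv init
        self.recon_net = recon_net or nn.Identity()       # placeholder for the U-Net R
        self.nx, self.ny = nx, ny

    def forward(self, x):                                 # x: (batch, nx*ny)
        y = x @ self.Phi.T                                # simulated measurement y = Phi x
        x_init = y @ self.Psi.T                           # intermediate result Psi y
        x_init = x_init.view(-1, 1, self.nx, self.ny)
        return self.recon_net(x_init)                     # refined reconstruction x_hat
```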

Fig. 2. An overview of the end-to-end deep-learning-based framework for compressive imaging.

Distance loss. The L1 norm is used to measure the distortion between the reconstruction and the ground truth:

$$\mathcal{L}_{L1} = \| \boldsymbol{x}^{s} - \boldsymbol{\hat{x}}^{s} \|_1.$$
Perceptual loss. For the perceptual loss, we use a fixed pretrained VGG16 [47], denoted as $P$, to extract high-level features of the reconstruction and the ground truth and then measure the distance between their feature maps. The activation before the $j$th max-pooling layer of $P$ is denoted as $P_j (\cdot )$; then the perceptual loss is
$$\mathcal{L}_{perc} = \sum_{j=1}^{n_{perc}}\| P_j(\boldsymbol{x}^{s}) - P_j(\boldsymbol{\hat{x}}^{s}) \|_1,$$
where the first 4 activations are used and $n_{perc} = 4$.
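A possible PyTorch realization of this perceptual loss is sketched below, slicing torchvision's VGG16 `features` module just before the first four max-pooling layers. The exact cut points, the ImageNet weights, and the grayscale-to-RGB replication are our assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

class VGGPerceptualLoss(nn.Module):
    """Sketch of Eq. (8): L1 distance between VGG16 activations taken just before
    the first four max-pooling layers. The layer cut points and ImageNet weights
    follow torchvision's vgg16().features and are assumptions, not details from
    the paper."""
    def __init__(self):
        super().__init__()
        feats = vgg16(weights="IMAGENET1K_V1").features.eval()
        for p in feats.parameters():
            p.requires_grad = False                       # P is kept fixed
        cuts = [4, 9, 16, 23]                             # indices of the first four max-pools
        self.blocks = nn.ModuleList(
            [feats[a:b] for a, b in zip([0] + cuts[:-1], cuts)])

    def forward(self, x, x_hat):                          # grayscale inputs, (B, 1, H, W)
        x, x_hat = x.repeat(1, 3, 1, 1), x_hat.repeat(1, 3, 1, 1)
        loss = 0.0
        for block in self.blocks:
            x, x_hat = block(x), block(x_hat)
            loss = loss + torch.mean(torch.abs(x - x_hat))
        return loss
```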

Adversarial loss. A discriminator $D$, consisting of 5 convolutional layers, each followed by a Leaky ReLU with slope 0.2, is used for adversarial training (see Supplement 1). The least-squares loss of LSGAN [48] is used, defined as

$$\mathcal{L}_{adv} = \| D(\boldsymbol{\hat{x}}^{s}) - 1 \|_2^{2} .$$
Total loss for generator. The above losses are weighted and summed up for adversarial training of $G$
$$\mathcal{L}_{G} = \lambda_1\mathcal{L}_{L1}+\lambda_2\mathcal{L}_{perc}+\lambda_3\mathcal{L}_{adv},$$
where we set $\lambda _1$ and $\lambda _2$ to 1.0 and $\lambda _3$ to 0.3.

Total loss for discriminator. The discriminator $D$ is trained in the LSGAN [48] manner to recognize the output as real or fake, updated with the following loss function

$$\mathcal{L}_{D} = \frac{1}{2} \| D(\boldsymbol{x}^{s}) - 1 \|_2^{2} + \frac{1}{2} \| D(\boldsymbol{\hat{x}}^{s}) - 0 \|_2^{2}.$$
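The three objectives of Eqs. (9)–(11) can be written compactly as follows. This is a hedged sketch: it uses means over the discriminator's score map rather than explicit squared norms, and the weights simply follow the values quoted above.

```python
import torch

def lsgan_generator_adv_loss(D, x_hat):
    """Adversarial term of Eq. (9): push D(x_hat) toward the 'real' label 1."""
    return torch.mean((D(x_hat) - 1.0) ** 2)

def lsgan_discriminator_loss(D, x, x_hat):
    """Eq. (11): D should score ground truth as 1 and reconstructions as 0."""
    return 0.5 * torch.mean((D(x) - 1.0) ** 2) + \
           0.5 * torch.mean((D(x_hat.detach()) - 0.0) ** 2)

def generator_total_loss(l1, perc, adv, lambdas=(1.0, 1.0, 0.3)):
    """Eq. (10) with the SPI weights quoted above."""
    w1, w2, w3 = lambdas
    return w1 * l1 + w2 * perc + w3 * adv
```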

4.2 CoCoCs for SCI

Basically, the architecture is similar to that for SPI, but several details are modified to adapt it to SCI, as illustrated in Fig. 2(b). The reconstructed output is obtained by a generator as

$$\boldsymbol{\hat{x}}^{v} = R(\boldsymbol{\Psi}^{v}\boldsymbol{y}^{v}) = R(\boldsymbol{\Psi}^{v}\boldsymbol{\Phi}^{v}\boldsymbol{x}^{v}) = G(\boldsymbol{x}^{v}),$$
where the trainable inversion $\boldsymbol {\Psi }^{v}$ is initialized as ${\boldsymbol {\Phi }^{v}}^{\top }(\boldsymbol {\Phi }^{v}{\boldsymbol {\Phi }^{v}}^{\top })^{-1}$, which is a simplified form of Moore-Penrose pseudoinverse [46] of $\boldsymbol {\Phi }^{v}$ by taking its special structure in Eq. (4) into consideration. In this case, $R$ is a fully convolutional network with Res-Blocks [49] (detailed in Supplement 1). The loss functions for SCI training are introduced below.
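Since $\boldsymbol{\Phi}^{v}{\boldsymbol{\Phi}^{v}}^{\top} = \mathrm{diag}\big(\sum_k vec(\boldsymbol{C}_k)^2\big)$ for the block-diagonal structure of Eq. (4), this initialization of $\boldsymbol{\Psi}^{v}$ reduces to a per-pixel normalization. A NumPy sketch of the initialization is given below; the sizes and the small stabilizing epsilon are our own choices.

```python
import numpy as np

# Illustrative sizes; the small epsilon that stabilizes the division is our own addition.
nx, ny, nt = 256, 256, 8
C = np.random.rand(nx, ny, nt)                # coding masks C_1, ..., C_nt
Y = np.random.rand(nx, ny)                    # snapshot measurement (placeholder)

# Phi^v (Phi^v)^T = diag(sum_k C_k^2), so Psi^v y = Phi^T (Phi Phi^T)^{-1} y becomes,
# for each frame k, C_k * Y / sum_k C_k^2 (all operations element-wise).
norm = np.sum(C ** 2, axis=-1) + 1e-8         # (nx, ny)
X_init = C * (Y / norm)[..., None]            # (nx, ny, nt) initial estimate fed into R
```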

Distance loss. The distance loss follows the definition of that in SPI, thus

$$\mathcal{L}_{L1} = \| \boldsymbol{x}^{v} - \boldsymbol{\hat{x}}^{v} \|_1.$$
Perceptual loss. Different from SPI, temporal information needs to be considered in SCI. Therefore, a pretrained 3D-ResNet with spatio-temporal convolutional kernels [50] is adopted as $P$ to extract the features of the video cube. Instead of feeding the entire video cube into $P$, we randomly sample 3 consecutive frames:
$$\boldsymbol{x}^{v}_{rs} = [ \boldsymbol{x}_{r-1}^{v},\boldsymbol{x}_{r}^{v},\boldsymbol{x}_{r+1}^{v} ],$$
$$\boldsymbol{\hat{x}}^{v}_{rs} = [\boldsymbol{\hat{x}}_{r-1}^{v},\boldsymbol{\hat{x}}_{r}^{v},\boldsymbol{\hat{x}}_{r+1}^{v} ],$$
where $r \in \{2,\ldots,n_t-1\}$ is a randomly chosen index. Then the perceptual loss is
$$\mathcal{L}_{perc} = \sum_{j=1}^{n_{perc}}\lambda_{perc,j}\| P_j(\boldsymbol{x}_{rs}^{v}) - P_j(\boldsymbol{\hat{x}}_{rs}^{v}) \|_1,$$
where $P_j (\cdot )$ denotes the activation of the $j$th block of the 3D-ResNet and $\lambda _{perc,j}$ is the corresponding weight, set to 1/32, 1/16, 1/8, 1/4, and 1 for $j$ from 1 to 5.

Adversarial loss. Similar to that in SPI, we design a discriminator $D$ with 6 layers and extend the convolutions to 3D kernels (see Supplement 1). In this case, unlike SPI, we use the activations in $D$ to calculate the adversarial loss:

$$\mathcal{L}_{adv} = \sum_{j=1}^{n_{adv}}\lambda_{adv,j}\| D_j(\boldsymbol{x}^{v}_{rs}) - D_j(\boldsymbol{\hat{x}}^{v}_{rs}) \|_1,$$
where $D_j (\cdot )$ is the activation of the $j$th layer of $D$ and $\lambda _{adv,j}$ is also set to 1/32, 1/16, 1/8, 1/4, and 1 for $j$ from 1 to 5, corresponding to the first 5 layers of $D$.

Temporal consistency loss. This loss aims to maintain the temporal consistency of the reconstructed video. For the 3 sampled frames $\boldsymbol {\hat {x}}^{v}_{rs}$ from the reconstruction, we extract the feature maps corresponding to the first and the second frame by $D_{j,1}(\cdot )$ and $D_{j,2}(\cdot )$, respectively, and calculate their difference. The same operation is applied to the sampled frames $\boldsymbol {x}^{v}_{rs}$ from the ground truth. The temporal consistency loss is then defined as

$$\mathcal{L}_{tc} = \sum_{j=1}^{n_{adv}-1}\lambda_{adv,j} \| ( D_{j,2}(\boldsymbol{x}^{v}_{rs}) - D_{j,1}(\boldsymbol{x}^{v}_{rs})) - ( D_{j,2}(\boldsymbol{\hat{x}}^{v}_{rs}) - D_{j,1}(\boldsymbol{\hat{x}}^{v}_{rs}) ) \|_1.$$
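A hedged sketch of this loss is given below. It assumes the 3D discriminator activations retain an explicit temporal axis so that frame-wise feature maps can be sliced out; the tensor layout is an assumption of ours, not the paper's implementation.

```python
import torch

def temporal_consistency_loss(feats_real, feats_fake, weights):
    """Hedged sketch of Eq. (18). `feats_*[j]` is the j-th discriminator activation with
    an explicit temporal axis, shape (B, C, T, H, W); the frame-to-frame feature
    differences of the reconstruction are matched to those of the ground truth.
    The tensor layout is an assumption about the 3D discriminator."""
    loss = 0.0
    for w, fr, ff in zip(weights, feats_real, feats_fake):
        diff_real = fr[:, :, 1:] - fr[:, :, :-1]          # temporal differences (ground truth)
        diff_fake = ff[:, :, 1:] - ff[:, :, :-1]          # temporal differences (reconstruction)
        loss = loss + w * torch.mean(torch.abs(diff_real - diff_fake))
    return loss
```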
Total loss for generator. We assign weights to each loss to form the total loss for $G$
$$\mathcal{L}_{G} = \lambda_1\mathcal{L}_{L1}+ \lambda_2\mathcal{L}_{perc}+ \lambda_3\mathcal{L}_{adv}+ \lambda_4\mathcal{L}_{tc},$$
where we set $\lambda _1$ to 1.0, $\lambda _2$ to 0.2, and both $\lambda _3$ and $\lambda _4$ to 0.01.

Total loss for discriminator. The loss for the discriminator takes the same form as that for SPI, but the 3 sampled frames serve as the input:

$$\mathcal{L}_{D} = \frac{1}{2} \| D(\boldsymbol{x}^{v}_{rs}) - 1 \|_2^{2} + \frac{1}{2} \| D(\boldsymbol{\hat{x}}^{v}_{rs}) - 0 \|_2^{2}.$$

5. Results

5.1 Single-pixel imaging

Performance of SPI reconstruction. Our model is trained on the Caltech-101 dataset [51]. We randomly set aside 475 images for testing and use the rest for training. All images are converted to grayscale and cropped to 128 $\times$ 128. We use the Adam optimizer [52] with $\beta _1=0.5$ and a batch size of 64 to train the model for 100 epochs in total. The initial learning rate is $4\times 10^{-4}$, which decreases to $4\times 10^{-5}$ after the 50th epoch. We compare our models, including the full model (CoCoCs) and the one trained only with $\mathcal {L}_{L1}$ (CoCoCs-L1), to the iteration-based TVAL3 [20] and the deep-learning-based ReconNet [22] and its variant with a BM3D denoiser [53] (denoted ReconNet-B) at three CSRs: 25%, 12.5%, and 6.25%.
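The optimizer and learning-rate schedule described above can be reproduced with a few lines of PyTorch; `G` below is only a placeholder module standing in for the actual generator, and $\beta_2$ is left at its default value since the paper only specifies $\beta_1$.

```python
import torch
import torch.nn as nn

# Sketch of the SPI training schedule described above; `G` is only a placeholder module.
G = nn.Linear(128 * 128, 128 * 128)           # stands in for the actual generator
optimizer = torch.optim.Adam(G.parameters(), lr=4e-4, betas=(0.5, 0.999))  # beta2 at its default
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[50], gamma=0.1)

for epoch in range(100):
    # ... one pass over the Caltech-101 training split with batch size 64 ...
    scheduler.step()                          # lr drops from 4e-4 to 4e-5 after the 50th epoch
```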

We evaluated the average PSNR and SSIM on the Caltech-101 test set, summarized in Tab.1, which confirms that CoCoCs surpasses the reference models on these metrics. As shown in Fig. 3, CoCoCs provides more visually appealing images than the reference models, and the improvement is more pronounced at a low CSR such as 6.25%. For the full CoCoCs model, although its PSNR and SSIM drop slightly compared to the L1-based variant, it yields conspicuous visual enhancement, such as finer details and clearer edges.

Fig. 3. Sample SPI reconstructions from the test dataset. Inset images in red boxes show the preservation of finer details in our approach.

Table 1. The results of PSNR in dB and SSIM of SPI reconstruction. Red and blue indicate the overall best and the best among the reference methods at each CSR, respectively. The ${\blacktriangle }$ denotes the improvement of our methods over the best reference.

Performance in image classification. As a typical pattern recognition task, image classification is used to validate the advantage of our algorithm for CV tasks. We feed the reconstructed images into the pretrained VGG16 [47] and report the accuracy (Tab.2). CoCoCs outperforms all reference methods by a large margin, and the advantage becomes more significant at lower CSRs; at 6.25%, the accuracy of CoCoCs is more than twice that of TVAL3. Similar to the visual quality in Fig. 3, CoCoCs performs better than the L1-based variant with the aid of the high-level losses.

Table 2. Image classification accuracy based on the reconstruction of SPI. The accuracy of the original image cropped to 128 $\times$ 128 is 88.0%. Red and blue indicate the overall best and the best among the reference methods at each CSR, respectively. The ${\blacktriangle }$ denotes the improvement of our methods over the best reference.

Ablation study for SPI. The goal of this study is to validate the contribution of optimizing the coding $\boldsymbol {\Phi }^{s}$ and the inversion $\boldsymbol {\Psi }^{s}$. When $\boldsymbol {\Phi }^{s}$ and $\boldsymbol {\Psi }^{s}$ are not optimized, significant degradation can be seen on all metrics in Tab.3, which is also evident in Fig. 4.

Fig. 4. Sample reconstructions in the SPI ablation study.

Table 3. Ablation study of SPI. The ${\blacktriangledown }$ denotes deterioration of each setup compared to the full CoCoCs model. Red denotes the best results.

MOS testing for SPI. We performed an MOS test to quantify the perceptual visual quality for human viewers. In this test, 20 raters were invited to assign an integer score from 1 (bad) to 5 (excellent) to 8 versions at the aforementioned 3 CSRs: ReconNet, ReconNet-B, TVAL3, CoCoCs-L1, CoCoCs, and its ablation variants.

From the MOS results in Tab.4, CoCoCs achieves the highest score, followed by its L1-based variant, both with a noticeable margin over the other approaches. In the MOS of the SPI ablation study, shown in Tab.5, the contribution of optimizing $\boldsymbol {\Phi }^{s}$ and the inversion $\boldsymbol {\Psi }^{s}$ can be clearly recognized by the raters. The distribution of the MOS is visualized in Fig. 5.

Fig. 5. MOS distribution of (a) SPI methods and (b) the ablation study. The bars indicate the variance of the scores.

Table 4. MOS results of SPI. Red and blue indicate the overall best and the best among the reference methods at each CSR, respectively. The ${\blacktriangle }$ denotes the improvement of our methods over the best reference.

Table 5. MOS results of SPI ablation study. The ${\blacktriangledown }$ denotes deterioration of each setup compared to the full CoCoCs model. Red denotes the best results.

5.2 Snapshot compressive imaging

Performance of SCI reconstruction. For SCI network training, we use a random sample of 5000 video clips, with $n_t$ frames per clip, from the ImageNet VID training set [54], spatially cropped to 256 $\times$ 256 pixels. We randomly sample 100 video clips from the ImageNet VID validation set for testing, which are distinct from the training videos. We explore cases with $n_t$ equal to 8, 16, and 32, corresponding to 12.5%, 6.25%, and 3.125% CSR. The model is trained for 100 epochs with a learning rate of $10^{-4}$ and another 100 epochs at $10^{-5}$ using the Adam optimizer [52] with $\beta _1=0.5$. For our L1-based model, all losses except $\mathcal {L}_{L1}$ are omitted. We compare our method with two representative approaches, i.e., GAP-TV [24] and PnP-FFDNet [26].

The quantitative comparison is shown in Tab.6, from which it can be clearly seen that the proposed CoCoCs and CoCoCs-L1 outperform the other algorithms on both PSNR (by $\sim$7 dB) and SSIM (by $\sim$0.1), with the L1-based model performing best on these two criteria. From the exemplar SCI reconstructions in Fig. 6, we observe that CoCoCs resolves fine details in the zoom-in view that GAP-TV and PnP-FFDNet struggle to produce, and CoCoCs also generates clearer and sharper edges than CoCoCs-L1. In Fig. 7, we plot the frame-wise PSNR and SSIM at 6.25% CSR, averaged over the test set. The two CoCoCs models reconstruct the frames smoothly and with higher quality than the reference algorithms.

Fig. 6. Sample SCI reconstructed video frames.

Fig. 7. Frame-wise numerical metrics at 6.25% CSR, averaged over the test dataset. (a) PSNR, (b) SSIM.

Table 6. The results of PSNR in dB and SSIM of SCI reconstruction. Red and blue indicate the overall best and the best among the reference methods at each CSR, respectively. The ${\blacktriangle }$ denotes the improvement of our methods over the best reference.

Performance in action recognition. We evaluate the performance of the reconstructed videos on the action recognition task. The pretrained 3D-ResNet [50] is adopted for action recognition. Here, the UCF-101 dataset [55] is used, which includes video instances of 101 human action classes. We sample 5000 and 1000 video clips for training and testing, respectively, and pre-process the video clips in the same way as the ImageNet VID dataset. The top-1 accuracy is reported in Tab.7. CoCoCs-L1 and CoCoCs both show significant gains over the other algorithms. For action recognition, as revealed by the results with the original video input, more frames (corresponding to a lower CSR in SCI) provide more context and thus higher accuracy; however, for typical SCI, the reconstruction quality drops as the CSR decreases. These two factors lead to a decrease in accuracy as the CSR changes from 6.25% to 3.125% for GAP-TV, PnP-FFDNet, and CoCoCs-L1. With the guidance of the high-level losses, the full CoCoCs preserves sufficient information and reaches the highest accuracy with the smallest data size (3.125% CSR) among these SCI setups. We also notice that CoCoCs at 12.5% surpasses the original video, which can be regarded as a form of video enhancement under the guidance of the high-level perceptual loss.

Table 7. Accuracy of motion recognition with SCI reconstructions. Red and blue indicate the overall best and the best among the reference methods at each CSR, respectively. The ${\blacktriangle }$ denotes the improvement of our methods over the best reference.

Ablation study for SCI. Similar to Sec. 5.1, we conduct an ablation study for SCI; the results are shown in Fig. 8 and Tab.8. We observe that optimizing $\boldsymbol {\Phi }^{v}$ and $\boldsymbol {\Psi }^{v}$ improves the reconstruction quality on all three metrics. Applying $\mathcal {L}_{tc}$ also contributes, except for a slight decrease in recognition accuracy at 3.125% CSR.

Fig. 8. Sample reconstructions in the SCI ablation study.

Table 8. Ablation study of SCI. The ${\blacktriangledown }$ denotes deterioration and ${\blacktriangle }$ denotes improvement of each setup compared to the full CoCoCs model. Red denotes the best results.

MOS testing for SCI. The aforementioned 20 raters scored the SCI reconstructions in independent tests. The test samples were generated by GAP-TV, PnP-FFDNet, CoCoCs-L1, CoCoCs, and its ablation variants (7 versions) at three CSRs of 12.5%, 6.25%, and 3.125%. The MOS scores in Tab.9 confirm the advantages of CoCoCs over the reference algorithms, by a greater margin at a lower CSR such as 3.125%. Tab.10 shows that $\boldsymbol {\Phi }^{v}$, $\boldsymbol {\Psi }^{v}$, and $\mathcal {L}_{tc}$ all contribute to the visual quality. The MOS distribution is displayed in Fig. 9.

Fig. 9. MOS distribution of (a) SCI methods and (b) the ablation study. The bars indicate the variance of the scores.

Table 9. MOS results of SCI. Red and blue indicate the overall best and the best among the reference methods at each CSR, respectively. The ${\blacktriangle }$ denotes the improvement of our methods over the best reference.

Table 10. MOS results of SCI ablation study. The ${\blacktriangledown }$ denotes deterioration and ${\blacktriangle }$ denotes improvement of each setup compared to the full CoCoCs model. Red denotes the best results.

5.3 Results on real SCI hardware

The CoCoCs framework can be adapted to SPI, SCI, and other compressive imaging systems. Taking the available hardware resources into account, we demonstrate CoCoCs on an SCI hardware system. We perform hardware experiments using an SCI prototype we built, as illustrated in Fig. 10. The dynamic scene is imaged onto a Digital Micromirror Device (DMD, ViALUX V-9001 toolkit, equipped with a Texas Instruments DLP9000X chip with 2560 $\times$ 1600 resolution and 7.6 um pixel size) by a camera lens, and the coded image is then relayed to an image sensor (Sony IMX253, packaged in a FLIR GS3-U3-123S6M-C camera, with 4096 $\times$ 3000 resolution and 3.45 um pixel size) to be temporally integrated into a measurement. To fit the resolution of the trained model, each 6 $\times$ 6 area is combined into one element, so a 1536 $\times$ 1536 region of the measurement is resized to 256 $\times$ 256 before being fed into the model. In this hardware experiment, we set the CSR to 2%, i.e., we decompress a 50-frame video from a single shot. The frame rate of the image sensor is 20 frames per second (fps), so the equivalent frame rate of the reconstructed video is 1000 fps.
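The resolution matching and the equivalent frame rate follow directly from the numbers above; a quick sanity check in Python:

```python
# Back-of-the-envelope check of the hardware parameters quoted above.
coded_region = 1536        # side length of the coded region on the sensor (pixels)
bin_factor = 6             # 6 x 6 sensor pixels are combined into one model element
assert coded_region // bin_factor == 256   # matches the 256 x 256 model input

nt = 50                    # frames recovered from one snapshot (CSR = 1/50 = 2%)
sensor_fps = 20            # image sensor frame rate
print(nt * sensor_fps)     # equivalent frame rate of the reconstructed video: 1000 fps
```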

Fig. 10. Illustration of the SCI prototype for hardware experiments. DMD: Digital Micromirror Device.

We used the prototype to capture a high-speed dynamic scene in which a water-filled balloon was punctured. The results are shown in Fig. 11. GAP-TV shows a strong staircase effect, and PnP-FFDNet exhibits visible artifacts. Compared to these baselines, CoCoCs achieves better quality. The L1-based results are relatively blurry and miss some details, while the full model provides finer details. These results show that CoCoCs generalizes well to natural scenes.

Fig. 11. Experimental results of SCI. (a) and (b) show two measurements, as well as their corresponding reconstructed frames with GAP-TV, PnP-FFDNet, CoCoCs-L1, and CoCoCs. The numbers after # denote the frame index of the reconstructed videos. Please see Visualization 1 for the full videos.

6. Conclusion

A novel noniterative end-to-end deep-learning-based framework for compressive imaging is proposed in this paper. The framework consists of a trainable sampling matrix, a trainable inversion layer, and a flexibly modified reconstruction network. The model is trained end-to-end under the supervision of a hybrid of distance, perceptual, and adversarial losses to realize co-optimization of optical coding, reconstruction, and high-level CV tasks. The proposed method is demonstrated on SPI and SCI, showing that it not only leads in conventional pixel-wise metrics such as PSNR (up to +1.381 dB for SPI, +7.414 dB for SCI) and SSIM (up to +0.102 for SPI, +0.159 for SCI, out of 1.000), but also achieves a distinct improvement in CV tasks and visual quality, revealed by classification accuracy (up to +42.5% for SPI, +12.5% for SCI) and MOS (up to +1.921 for SPI, +1.602 for SCI, out of 5.000), respectively. The experimental results on our SCI prototype validate the performance of the proposed framework.

Although demonstrated with network architectures based on U-Net, CoCoCs is a flexible framework that allows different choices. With the development of deep learning architectures, other advanced networks can also be adopted, such as R2U-Net [56] and Vision Transformers [57]. CoCoCs can easily be adapted to other compressive imaging systems and connected with more high-level vision tasks. We hope it will promote the development of novel high-efficiency, high-performance visual perception systems.

Funding

National Natural Science Foundation of China (62135009); National Key Research and Development Program of China (2019YFB1803500).

Disclosures

The authors declare no conflicts of interest.

Data availability

Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request.

Supplemental document

See Supplement 1 for supporting content.

References

1. E. J. Candès, J. Romberg, and T. Tao, “Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information,” IEEE Trans. Inf. Theory 52(2), 489–509 (2006). [CrossRef]  

2. D. L. Donoho, “Compressed sensing,” IEEE Trans. Inf. Theory 52(4), 1289–1306 (2006). [CrossRef]  

3. M. E. Gehm, R. John, D. J. Brady, R. M. Willett, and T. J. Schulz, “Single-shot compressive spectral imaging with a dual-disperser architecture,” Opt. Express 15(21), 14013–14027 (2007). [CrossRef]  

4. X. Lin, Y. Liu, J. Wu, and Q. Dai, “Spatial-spectral encoded compressive hyperspectral imaging,” ACM Trans. Graph. 33(6), 1–11 (2014). [CrossRef]  

5. X. Yuan, T.-H. Tsai, R. Zhu, P. Llull, D. Brady, and L. Carin, “Compressive hyperspectral imaging with side information,” IEEE J. Sel. Top. Signal Process. 9(6), 964–976 (2015). [CrossRef]  

6. W. Zhang, L. Cao, D. J. Brady, H. Zhang, J. Cang, H. Zhang, and G. Jin, “Twin-image-free holography: a compressive sensing approach,” Phys. Rev. Lett. 121(9), 093902 (2018). [CrossRef]  

7. Z. Wang, L. Spinoulas, K. He, L. Tian, O. Cossairt, A. K. Katsaggelos, and H. Chen, “Compressive holographic video,” Opt. Express 25(1), 250–262 (2017). [CrossRef]  

8. D. J. Brady, A. Mrozack, K. MacCabe, and P. Llull, “Compressive tomography,” Adv. Opt. Photonics 7(4), 756–813 (2015). [CrossRef]  

9. L. Gao, J. Liang, C. Li, and L. V. Wang, “Single-shot compressed ultrafast photography at one hundred billion frames per second,” Nature 516(7529), 74–77 (2014). [CrossRef]  

10. J. Liang, C. Ma, L. Zhu, Y. Chen, L. Gao, and L. V. Wang, “Single-shot real-time video recording of a photonic mach cone induced by a scattered light pulse,” Sci. Adv. 3(1), e1601814 (2017). [CrossRef]  

11. Q. Guo, H. Chen, Z. Weng, M. Chen, S. Yang, and S. Xie, “Compressive sensing based high-speed time-stretch optical microscopy for two-dimensional image acquisition,” Opt. Express 23(23), 29639–29646 (2015). [CrossRef]  

12. M. P. Edgar, G. M. Gibson, and M. J. Padgett, “Principles and prospects for single-pixel imaging,” Nat. Photonics 13(1), 13–20 (2019). [CrossRef]  

13. G. M. Gibson, S. D. Johnson, and M. J. Padgett, “Single-pixel imaging 12 years on: a review,” Opt. Express 28(19), 28190–28208 (2020). [CrossRef]  

14. O. Katz, Y. Bromberg, and Y. Silberberg, “Compressive ghost imaging,” Appl. Phys. Lett. 95(13), 131110 (2009). [CrossRef]  

15. R. G. Baraniuk, T. Goldstein, A. C. Sankaranarayanan, C. Studer, A. Veeraraghavan, and M. B. Wakin, “Compressive video sensing: algorithms, architectures, and applications,” IEEE Signal Process. Mag. 34(1), 52–66 (2017). [CrossRef]  

16. X. Yuan, D. J. Brady, and A. K. Katsaggelos, “Snapshot compressive imaging: Theory, algorithms, and applications,” IEEE Signal Process. Mag. 38(2), 65–88 (2021). [CrossRef]  

17. Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE transactions on image processing 13(4), 600–612 (2004). [CrossRef]  

18. D. L. Donoho, A. Maleki, and A. Montanari, “Message-passing algorithms for compressed sensing,” Proc. Natl. Acad. Sci. 106(45), 18914–18919 (2009). [CrossRef]  

19. J. Tan, Y. Ma, and D. Baron, “Compressive imaging via approximate message passing with image denoising,” IEEE Trans. Signal Process. 63(8), 2085–2092 (2015). [CrossRef]  

20. C. Li, W. Yin, H. Jiang, and Y. Zhang, “An efficient augmented lagrangian method with applications to total variation minimization,” Comput. Optim. Appl. 56(3), 507–530 (2013). [CrossRef]  

21. C. F. Higham, R. Murray-Smith, M. J. Padgett, and M. P. Edgar, “Deep learning for real-time single-pixel video,” Sci. Rep. 8(1), 2369 (2018). [CrossRef]  

22. K. Kulkarni, S. Lohit, P. Turaga, R. Kerviche, and A. Ashok, “Reconnet: Non-iterative reconstruction of images from compressively sensed measurements,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (IEEE, 2016), pp. 449–458.

23. K. Koh, S.-J. Kim, and S. Boyd, “An interior-point method for large-scale l1-regularized logistic regression,” J. Mach. learning research 8, 1519–1555 (2007).

24. X. Yuan, “Generalized alternating projection based total variation minimization for compressive sensing,” in IEEE International Conference on Image Processing, (IEEE, 2016), pp. 2539–2543.

25. Y. Liu, X. Yuan, J. Suo, D. J. Brady, and Q. Dai, “Rank minimization for snapshot compressive imaging,” IEEE Trans. Pattern Anal. Mach. Intell. 41(12), 2990–3006 (2018). [CrossRef]  

26. X. Yuan, Y. Liu, J. Suo, and Q. Dai, “Plug-and-play algorithms for large-scale snapshot compressive imaging,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (IEEE, 2020), pp. 1447–1457.

27. M. Qiao, Z. Meng, J. Ma, and X. Yuan, “Deep learning for video compressive sensing,” APL Photonics 5(3), 030801 (2020). [CrossRef]  

28. Z. Wang, H. Zhang, Z. Cheng, B. Chen, and X. Yuan, “Metasci: Scalable and adaptive reconstruction for video compressive sensing,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (IEEE, 2021), pp. 2083–2092.

29. Z. Zhang, X. Ma, and J. Zhong, “Single-pixel imaging by means of fourier spectrum acquisition,” Nat. Commun. 6, 1–6 (2015). [CrossRef]  

30. C. Hu, H. Huang, M. Chen, S. Yang, and H. Chen, “Fouriercam: a camera for video spectrum acquisition in a single shot,” Photonics Res. 9(5), 701–713 (2021). [CrossRef]  

31. H. Huang, C. Hu, S. Yang, M. Chen, and H. Chen, “Temporal ghost imaging by means of fourier spectrum acquisition,” IEEE Photonics J. 12, 1–12 (2020). [CrossRef]  

32. Z. Zhang, X. Wang, G. Zheng, and J. Zhong, “Hadamard single-pixel imaging versus fourier single-pixel imaging,” Opt. Express 25(16), 19619–19639 (2017). [CrossRef]  

33. C. M. Watts, D. Shrekenhamer, J. Montoya, G. Lipworth, J. Hunt, T. Sleasman, S. Krishna, D. R. Smith, and W. J. Padilla, “Terahertz compressive imaging with metamaterial spatial light modulators,” Nat. Photonics 8(8), 605–609 (2014). [CrossRef]  

34. N. Huynh, E. Zhang, M. Betcke, S. Arridge, P. Beard, and B. Cox, “Single-pixel optical camera for video rate ultrasonic imaging,” Optica 3(1), 26–29 (2016). [CrossRef]  

35. Y. Hitomi, J. Gu, M. Gupta, T. Mitsunaga, and S. K. Nayar, “Video from a single coded exposure photograph using a learned over-complete dictionary,” in International Conference on Computer Vision, (IEEE, 2011), pp. 287–294.

36. J. N. Martel, L. K. Mueller, S. J. Carey, P. Dudek, and G. Wetzstein, “Neural sensors: Learning pixel exposures for hdr imaging and video compressive sensing with programmable sensors,” IEEE Trans. Pattern Anal. Mach. Intell. 42(7), 1642–1653 (2020). [CrossRef]  

37. C. Hu, H. Huang, M. Chen, S. Yang, and H. Chen, “Video object detection from one single image through opto-electronic neural network,” APL Photonics 6(4), 046104 (2021). [CrossRef]  

38. D. Liu, B. Wen, J. Jiao, X. Liu, Z. Wang, and T. S. Huang, “Connecting image denoising and high-level vision tasks via deep learning,” IEEE Trans. on Image Process. 29, 3695–3706 (2020). [CrossRef]  

39. V. Sharma, A. Diba, D. Neven, M. S. Brown, L. Van Gool, and R. Stiefelhagen, “Classification-driven dynamic image enhancement,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (IEEE, 2018), pp. 4033–4041.

40. H. Talebi and P. Milanfar, “Learning to resize images for computer vision tasks,” arXiv preprint arXiv:2103.09950 (2021).

41. Y. Blau and T. Michaeli, “The perception-distortion tradeoff,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (IEEE, 2018), pp. 6228–6237.

42. R. C. Streijl, S. Winkler, and D. S. Hands, “Mean opinion score (mos) revisited: methods and applications, limitations and alternatives,” Multimedia Systems 22(2), 213–227 (2016). [CrossRef]  

43. A. Mittal, A. K. Moorthy, and A. C. Bovik, “No-reference image quality assessment in the spatial domain,” IEEE Trans. on Image Process. 21(12), 4695–4708 (2012). [CrossRef]  

44. A. K. Moorthy and A. C. Bovik, “Blind image quality assessment: From natural scene statistics to perceptual quality,” IEEE Trans. on Image Process. 20(12), 3350–3364 (2011). [CrossRef]  

45. O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical image computing and computer-assisted intervention, (Springer, 2015), pp. 234–241.

46. J. C. A. Barata and M. S. Hussein, “The moore–penrose pseudoinverse: A tutorial review of the theory,” Braz. J. Phys. 42(1-2), 146–165 (2012). [CrossRef]  

47. K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556 (2014).

48. X. Mao, Q. Li, H. Xie, R. Y. Lau, Z. Wang, and S. Paul Smolley, “Least squares generative adversarial networks,” in Proceedings of the IEEE international conference on computer vision, (IEEE, 2017), pp. 2794–2802.

49. K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, (IEEE, 2016), pp. 770–778.

50. K. Hara, H. Kataoka, and Y. Satoh, “Can spatiotemporal 3d cnns retrace the history of 2d cnns and imagenet?” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition), (IEEE, 2018), pp. 6546–6555.

51. L. Fei-Fei, R. Fergus, and P. Perona, “Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories,” in Conference on Computer Vision and Pattern Recognition Workshop, (IEEE, 2004), p. 178.

52. D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980 (2014).

53. K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian, “Image denoising by sparse 3-d transform-domain collaborative filtering,” IEEE Trans. on Image Process. 16(8), 2080–2095 (2007). [CrossRef]  

54. O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, “ImageNet Large Scale Visual Recognition Challenge,” Int. J Comput. Vis. 115(3), 211–252 (2015). [CrossRef]  

55. K. Soomro, A. R. Zamir, and M. Shah, “Ucf101: A dataset of 101 human actions classes from videos in the wild,” arXiv preprint arXiv:1212.0402 (2012).

56. M. Z. Alom, C. Yakopcic, T. M. Taha, and V. K. Asari, “Nuclei segmentation with recurrent residual convolutional neural networks based u-net (r2u-net),” in National Aerospace and Electronics Conference, (IEEE, 2018), pp. 228–233.

57. A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An image is worth 16x16 words: Transformers for image recognition at scale,” in International Conference on Learning Representations, (ICLR, 2021).

Supplementary Material (2)

Supplement 1: Supplement 1
Visualization 1: Experimental results of SCI
