Abstract
Compressed ultrafast photography (CUP) is a computational imaging technology capable of capturing transient scenes on the picosecond scale with a sequence depth of hundreds of frames. Since the inverse problem of CUP is ill-posed, it is challenging to further improve the reconstruction quality under high noise levels and compression ratios. In addition, many works add an external charge-coupled device (CCD) camera to the CUP system to form a time-unsheared view, because the added constraint can improve the reconstruction quality of the images. However, since the images are collected by different cameras, a slight affine transformation between them can severely degrade the reconstruction quality. Here, we propose an algorithm that combines the time-unsheared image constraint CUP system with unsupervised neural networks. An image registration network is also introduced into the framework to learn the affine transformation parameters of the input images. The proposed algorithm effectively exploits both the implicit image prior of the neural network and the extra hardware prior brought by the time-unsheared view. Combined with the image registration network, this joint learning model further improves the quality of the reconstructed images without any training datasets. Simulation and experimental results demonstrate the application prospects of our algorithm in ultrafast event capture.
© 2024 Optica Publishing Group under the terms of the Optica Open Access Publishing Agreement
1. Introduction
Capturing transient scenes at high speed is crucial for understanding the physical phenomena of various ultrafast processes. The streak camera is an ultrafast imaging tool with high spatio-temporal resolution. However, because it converts time information into spatial information, its imaging field of view (FOV) is limited to a single line to avoid signal superposition. To view the space-time evolution of an entire 2D dynamic scene, the event must follow the same spatio-temporal pattern in every shot: the whole spatial distribution can only be obtained through repeated shooting, so the dynamic scene needs to be a repeatable phenomenon. For non-repeatable phenomena, such as supernova explosions and synchrotron radiation, this imaging method is not appropriate. Compressed ultrafast photography (CUP) [1–3] completely opens the entrance slit of the streak camera and places pseudo-random binary masks in front of the slit to superimpose the 2D dynamic scene at different moments; it is a passive computational imaging technology with a frame rate of up to 10 trillion frames per second (fps) and a sequence depth of hundreds of frames [4]. Based on compressed sensing (CS) [5,6], CUP collects the data cube in a single shot and restores the dynamic scene through a reconstruction algorithm, which gives it great advantages in recording non-repeatable and self-luminous phenomena. However, the high compression ratio induced by the large sequence depth makes restoring the dynamic scene an ill-posed problem. For CUP reconstruction, the accuracy of the two-step iterative shrinkage/thresholding (TwIST) [1,7] algorithm is unsatisfactory, which limits the practical application of CUP.
Currently, optimization efforts fall into two categories: hardware-based and algorithm-based. In terms of hardware, increasing the sampling rate by adding channels is an effective way to improve the reconstruction quality, as in the complementary dual-channel lossless CUP proposed by Liang et al. [8], the dual-channel CUP proposed by Yao et al. [9] and the multi-channel coupled CUP proposed by Yao et al. [10]. The genetic algorithm proposed by Yang et al. [11] optimizes the coding masks and can also be regarded as a means of hardware optimization. In current recovery approaches, exploiting prior information is crucial to solving ill-posed problems, whether through model-based optimization or data-driven deep learning (DL) algorithms. Model-based optimization narrows the range of potential solutions by designing various hand-crafted regularizers, so as to obtain better results. For example, TV [12] assumes the sparsity of the data gradients, BM3D [13] uses the non-local similarity of natural images, weighted nuclear norm minimization (WNNM) [14] models the image as a low-rank structure to promote its low-dimensionality, and the plug-and-play (PnP) structure [15,16] combines the ADMM framework with a trained deep denoising network, implicitly embedding the nonlinear prior of an advanced image denoiser into the model. There is also a reconstruction algorithm that constrains images with space and intensity information [17]. However, these model-based optimizations are inadequate for the wide variety of CUP applications. Data-driven deep learning algorithms [18–23] usually build a suitable network structure and learn the nonlinear mapping between input and output, acquiring prior information by training on large datasets. However, this approach is still limited in actual applications.
First, the datasets required for training are very expensive to acquire. Second, data-driven deep learning algorithms have a narrow field of applicability and lack sufficient generalizability in the absence of transfer learning. For example, DUN-3DUnet and EfficientSCI proposed in [20,21] are designed for the coding pattern of CACTI [24] and are not suitable for the coding pattern of CUP. Changes in the masks or in the number of image frames also require retraining, which is expensive, especially for large pixel sizes and high compression ratios. In addition, data-driven DL algorithms are difficult to apply in situations where training datasets are hard to obtain, such as medical imaging and laser inertial confinement fusion. These schemes can improve image fidelity to a certain extent, but measuring complex dynamic scenes remains a great challenge.
It is worth noting that for methods which increase the sampling rate by adding channels, the use of multiple channels presupposes that the images of all channels match each other exactly. However, taking [19] as an example, the paper adopts an additional external charge-coupled device (CCD) camera to form the time-unsheared view. When imaging with different cameras, this matching premise is not easy to satisfy. First, the pixel sizes of images taken by different cameras are generally not identical. Second, the image may be rotated when passing through optical components such as an improperly positioned reflector. Finally, since the image size of the cameras is typically larger than the scene size, the images generally need to be cropped manually before reconstruction, and the corresponding pixels of different cameras may not match and may be offset during the cropping procedure. Such a slight affine transformation means that the CCD camera constraint not only fails to improve the reconstruction accuracy but noticeably degrades the performance of the reconstruction algorithm (Section 3.4). Recent studies have revealed that even without datasets, the convolutional neural network (CNN) structure itself has some regularization ability to capture a large number of low-level image statistical priors, which is called the deep image prior (DIP) [25]. DIP employs random noise as the input and learns appropriate network parameters from degraded images; it has proven to be an effective tool for reconstruction problems in spectral imaging [26], SIM [27], low-light imaging [28], coherent phase imaging [29] and other computational imaging technologies. Different from data-driven DL algorithms, and inspired by non-data-driven DL approaches [30] such as DIP [25] and MMES [31], we propose a new approach to solve the above image mismatch problem.
In this paper, an untrained neural network is combined with the CS model to present a time-unsheared image constraint unsupervised (TUICU) learning algorithm based on image registration (IR) for CUP, called TUICU-IR. The proposed algorithm adopts an unsupervised deep learning framework that utilizes an autoencoder network to learn the encoding and decoding of the underlying image. A spatial transformer network (STN) [32] is introduced to jointly learn, in an unsupervised manner, the affine transformation of the CCD image, thereby registering the images of the different views. Meanwhile, the reconstruction accuracy is further improved by more effectively incorporating the hardware improvement [19] (the CCD camera) into the algorithm. In addition, the proposed unsupervised DL algorithm is suitable for scenes where training datasets are difficult to obtain. Simulation and experimental results demonstrate that the proposed algorithm outperforms existing CUP reconstruction algorithms and is highly robust to noise, achieving state-of-the-art CUP reconstruction results.
2. Principles
The equipment diagram of the CCD camera constraint CUP system is shown in Fig. 1. The dynamic scene first passes through a beam splitter, which divides the light into two beams: one is directly imaged by the external CCD camera, and the other is spatially encoded by the coding plate in front of the streak camera, which can also be replaced by a digital micromirror device (DMD). The encoded dynamic scene enters the streak camera and is converted into an electron beam by the photocathode. When passing through the scanning electric field, electrons with different times of flight (ToF) are sheared by the scanning voltage, resulting in different deflections according to their time of arrival. After being amplified by the micro-channel plate (MCP), the electrons reach the phosphor screen and are converted into an optical signal, which is collected by the internal CCD camera to form a single 2D snapshot. Mathematically, the process of the streak camera can be expressed as:
$$\boldsymbol{E} = \textbf{TSC}\,\boldsymbol{I}(x,y,t) + \boldsymbol{n}, \tag{1}$$
where ${\boldsymbol{I}}$ denotes the original dynamic scene, $x$ and $y$ are the space dimensions and $t$ is the time dimension, ${\boldsymbol{E}}$ denotes the snapshot finally collected by the internal CCD camera of the streak camera after this series of processes, $\textbf{C}$ denotes the spatial encoding process, $\textbf {S}$ denotes the temporal shearing step of the streak camera, $\textbf {T}$ denotes the spatio-temporal integration in the internal CCD camera, and ${\boldsymbol{n}}$ denotes the noise in the collection process. Expressing $\textbf {TSC}$ as $\textbf {O}$, the equation becomes
$$\boldsymbol{E} = \textbf{O}\boldsymbol{I} + \boldsymbol{n}, \tag{2}$$
where ${\boldsymbol{E}}\in \mathbb {R}^{N_{xy}}$, $\textbf {O}\in \mathbb {R}^{N_{xy}\times N_{xyt}}$, ${\boldsymbol{I}}\in \mathbb {R}^{N_{xyt}}$ and $\textbf {n}\in \mathbb {R}^{N_{xy}}$; $N_{x}$, $N_{y}$ and $N_{t}$ denote the numbers of discretized pixels in the $x$, $y$ and $t$ coordinates, and $N_{xy}$ and $N_{xyt}$ denote $N_{x} N_{y}$ and $N_{x} N_{y} N_{t}$, respectively. Correspondingly, the projection of the dynamic scene onto the external CCD camera is parallel to the $t$ coordinate, which can be expressed as
$$\boldsymbol{E}_{ccd} = \textbf{T}\,\boldsymbol{I}(x,y,t) + \boldsymbol{n}_{ccd}. \tag{3}$$
It can be seen from Eqs. (2) and (3) that the number of elements in ${\boldsymbol{I}}$ is much larger than the number of elements in ${\boldsymbol{E}}$ and ${\boldsymbol{E}}_{ccd}$; solving the inverse problem of the streak camera's signal acquisition process is therefore ill-posed. According to CS theory [5], the original dynamic scene can be obtained by solving the following least squares optimization problem:
$$\hat{\boldsymbol{I}} = \mathop{\arg\min}_{\boldsymbol{I}} \frac{1}{2}\left\| \boldsymbol{E} - \textbf{O}\boldsymbol{I} \right\|_2^2 + \lambda \Phi(\boldsymbol{I}),$$
where $\Phi(\cdot)$ is a regularization term with weight $\lambda$.
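To make the measurement model concrete, the operators $\textbf{C}$, $\textbf{S}$ and $\textbf{T}$ can be sketched matrix-free in a few lines of Python. This is a simplified illustration, not the paper's code; the function names, the unit shear step, and the mask layout are our assumptions.

```python
import numpy as np

def cup_forward(I, mask, shear_step=1):
    """Matrix-free sketch of E = TSC*I: encode each frame with the mask (C),
    shear frame t by t*shear_step pixels along x (S), then integrate over
    time (T)."""
    Nt, Ny, Nx = I.shape
    # Shearing widens the snapshot along x by (Nt - 1) * shear_step pixels.
    E = np.zeros((Ny, Nx + (Nt - 1) * shear_step))
    for t in range(Nt):
        coded = I[t] * mask                 # C: spatial encoding
        s = t * shear_step                  # S: temporal shearing offset
        E[:, s:s + Nx] += coded             # T: spatio-temporal integration
    return E

def ccd_forward(I):
    """Time-unsheared view E_ccd = T*I: projection along t only."""
    return I.sum(axis=0)
```

Applying `cup_forward` to a data cube yields the single 2D streak-camera snapshot, while `ccd_forward` yields the external CCD view of the same scene.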
For the regularization term, total variation (TV) [33] is a widely used image prior that promotes sparsity in the gradient domain. Many applications [1,33] demonstrate the effectiveness of TV. A common (anisotropic) TV form for 3D images is as follows:
$$\text{TV}(\boldsymbol{I}) = \sum_{i,j,k} \left( \left| I_{i+1,j,k} - I_{i,j,k} \right| + \left| I_{i,j+1,k} - I_{i,j,k} \right| + \left| I_{i,j,k+1} - I_{i,j,k} \right| \right),$$
where $i$, $j$ and $k$ index the $x$, $y$ and $t$ coordinates.
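As an illustration, an anisotropic 3D TV of this kind can be computed with finite differences; `tv3d` is a hypothetical helper, and the paper's exact TV discretization may differ.

```python
import numpy as np

def tv3d(I):
    """Anisotropic 3D total variation: sum of absolute finite differences
    along the t, y and x axes of a (Nt, Ny, Nx) data cube."""
    dt = np.abs(np.diff(I, axis=0)).sum()   # differences along t
    dy = np.abs(np.diff(I, axis=1)).sum()   # differences along y
    dx = np.abs(np.diff(I, axis=2)).sum()   # differences along x
    return dx + dy + dt
```

A constant volume has zero TV, while any spatial or temporal variation increases it, which is what drives the reconstruction toward piecewise-smooth solutions.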
Currently, supervised DL performs well because neural networks learn their parameters from a large number of datasets, aiming to solve the following problem through extensive training:
$$\Theta^* = \mathop{\arg\min}_{\Theta} \sum\nolimits_i \left\| f_{\Theta}(\boldsymbol{E}_i) - \boldsymbol{I}_i \right\|_2^2,$$
where $\Theta$ denotes the parameters in the neural network and ${f_\Theta }(\cdot )$ represents the CNN parameterized by $\Theta$. Ulyanov et al. utilized CNN-based networks in a different way [25]. They discovered that the signal component of an image is subjected to a low impedance when passing through the CNN structure while the noise component is subjected to a high impedance, which can be employed for image restoration. In this case, the optimization for DIP can be expressed as:
$$\Theta^* = \mathop{\arg\min}_{\Theta} \left\| \boldsymbol{E} - \textbf{O}{f_\Theta }( {\boldsymbol{Z}}) \right\|_2^2, \qquad \hat{\boldsymbol{I}} = f_{\Theta^*}(\boldsymbol{Z}). \tag{9}$$
DIP does not require any training datasets; it adopts random noise ${\boldsymbol{Z}}$ as the input and initializes the network with random parameters. Through iterative optimization, the output approaches the measured value and $\left \| { {\boldsymbol{E}} - \textbf {O}{f_\Theta }( {\boldsymbol{Z}})} \right \|_2^2$ declines. However, if we apply Eq. (9) directly to solve the CUP problem, we find that it is prone to overfitting and the optimal solution cannot be obtained. By comparing the DIP objective function with our CS optimization problem, we can find that $\left \| { {\boldsymbol{E}} - \textbf {O}{f_\Theta }( {\boldsymbol{Z}})} \right \|_2^2$ in Eq. (9) actually corresponds to the data fidelity term in Eq. (5). Therefore, by combining the hardware system (the CCD camera) with our algorithm, the DIP form of our optimization task can be represented as:
$$\Theta^* = \mathop{\arg\min}_{\Theta} \left\| \boldsymbol{E} - \textbf{O}{f_\Theta }( {\boldsymbol{Z}}) \right\|_2^2 + \rho \left\| \boldsymbol{E}_{ccd} - \textbf{T}{f_\Theta }( {\boldsymbol{Z}}) \right\|_2^2. \tag{10}$$
The relationship between the compressed image captured by the streak camera and the spatiotemporal information of the dynamic scene is established through the implicit prior of DIP. Additionally, the time-unsheared image taken by the external CCD camera supervises the neural network, and the image fidelity can be improved under the external CCD camera constraint. The weight of the external CCD camera term, $\rho$ in Eq. (10), is generally set to 0.1. In addition, we combine the DIP optimization function with the traditional TV regularizer [34,35] to promote the sparsity of the image gradient on top of DIP, which yields a smoother final image. Therefore, Eq. (10) becomes:
$$\Theta^* = \mathop{\arg\min}_{\Theta} \left\| \boldsymbol{E} - \textbf{O}{f_\Theta }( {\boldsymbol{Z}}) \right\|_2^2 + \rho \left\| \boldsymbol{E}_{ccd} - \textbf{T}{f_\Theta }( {\boldsymbol{Z}}) \right\|_2^2 + \lambda\,\text{TV}\!\left( {f_\Theta }( {\boldsymbol{Z}}) \right). \tag{11}$$
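This composite objective (streak-camera fidelity, a $\rho$-weighted CCD term, and TV) can be sketched as a single loss routine. The names here are hypothetical: `op_streak` and `op_ccd` stand in for the operators $\textbf{O}$ and $\textbf{T}$, $\rho$ = 0.1 follows the text, and $\lambda$ is an assumed default.

```python
import numpy as np

def tuicu_loss(I_hat, E, E_ccd, op_streak, op_ccd, rho=0.1, lam=0.01):
    """Sketch of the composite objective: streak-camera data fidelity,
    time-unsheared (CCD) constraint weighted by rho, and an anisotropic
    TV regularizer on the estimated data cube I_hat."""
    fidelity = np.sum((E - op_streak(I_hat)) ** 2)
    ccd_term = rho * np.sum((E_ccd - op_ccd(I_hat)) ** 2)
    tv = lam * sum(np.abs(np.diff(I_hat, axis=a)).sum() for a in range(3))
    return fidelity + ccd_term + tv
```

In the actual algorithm `I_hat` would be the network output $f_\Theta(\boldsymbol{Z})$ and the loss would be minimized over $\Theta$ by a gradient-based optimizer.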
This optimization problem can be realized with various autoencoder structures. As shown in Fig. 2, the autoencoder adopted in this paper is composed of multiple fully connected layers. By feeding the vectorized image patches into the autoencoder, they are mapped to a low-dimensional space and then restored, which suppresses the noise and thereby achieves image restoration. It is worth mentioning that the difference between the input and output of the autoencoder is adopted as an additional loss term with a tradeoff coefficient $\tau$ for balancing the losses.
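The patch-based fully connected autoencoder can be sketched as follows. This is an untrained toy with hypothetical layer sizes, shown only to illustrate patch vectorization, the low-dimensional mapping, and the $\tau$-weighted input-output penalty; the paper's actual architecture is in Fig. 2.

```python
import numpy as np

rng = np.random.default_rng(0)

def patchify(img, p):
    """Split an image into non-overlapping p x p patches and vectorize them."""
    H, W = img.shape
    patches = img[:H - H % p, :W - W % p].reshape(H // p, p, W // p, p)
    return patches.transpose(0, 2, 1, 3).reshape(-1, p * p)

class TinyAutoencoder:
    """Fully connected encoder/decoder pair that maps vectorized patches
    to a low-dimensional code and back (illustrative, untrained)."""
    def __init__(self, d_in, d_code):
        self.We = rng.normal(0, 0.1, (d_in, d_code))   # encoder weights
        self.Wd = rng.normal(0, 0.1, (d_code, d_in))   # decoder weights

    def forward(self, x):
        code = np.tanh(x @ self.We)   # encode to the low-dimensional space
        return code @ self.Wd         # decode back to patch space

def ae_penalty(ae, x, tau=0.1):
    """Input-output difference term with tradeoff coefficient tau."""
    return tau * np.mean((ae.forward(x) - x) ** 2)
```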
The schematic diagram of the streak camera in the CUP system is shown in Fig. 3. Assuming the image is sheared in the $x$ direction, the shearing is independent of the $y$ coordinate and merely redistributes signal along $x$, so the $\textbf {S}$-process (shearing) has no effect on the result of integrating the measurement along the $x$-axis. Using $\textbf {T}_x$ to denote integration along the $x$-axis, the above process can be expressed as
$$\textbf{T}_x \textbf{TSC}\,\boldsymbol{I}(x,y,t) = \textbf{T}_x \textbf{TC}\,\boldsymbol{I}(x,y,t). \tag{12}$$
It can be noted that the $\textbf {C}$ and $\textbf {T}$ processes project parallel to the $t$-axis and have no effect along the $x$- and $y$-axes. Therefore, without the $\textbf {S}$-process the order of the two operations does not affect the final result of Fig. 3, and it is easy to obtain the following equation:
$$\textbf{TC}\,\boldsymbol{I}(x,y,t) = \textbf{CT}\,\boldsymbol{I}(x,y,t). \tag{13}$$
By combining Eq. (12) and Eq. (13), we can get:
$$\textbf{T}_x \textbf{TSC}\,\boldsymbol{I}(x,y,t) = \textbf{T}_x \textbf{CT}\,\boldsymbol{I}(x,y,t). \tag{14}$$
As shown in Fig. 4, the results of the two processes are the same. $\textbf {TSC}\boldsymbol {I}(x,y,t)$ can be obtained from the compressed image collected by the streak camera, i.e., ${\boldsymbol{E}}$ in Eq. (1), and $\textbf {T}\boldsymbol {I}(x,y,t)$ can be obtained from the external CCD camera, i.e., ${\boldsymbol{E}}_{ccd}$ in Eq. (3). $\textbf {C}$ in Fig. 4(b) encodes the image using the encoding mask of the streak camera. The energy deposition along the $x$-axis of the compressed image captured by the streak camera is therefore the same as that of the encoded time-unsheared image captured by the external CCD camera. Equation (14) can be rewritten as
$$\textbf{T}_x \boldsymbol{E} = \textbf{T}_x \textbf{C}\,\boldsymbol{E}_{ccd}. \tag{15}$$
Equation (15) applies when there is no noise and the two images match exactly, but in practice it is difficult for the two images to match perfectly due to rotations and shifts as well as inconsistent pixel sizes between the cameras. We can perform an affine transformation on the time-unsheared image ${\boldsymbol{E}}_{ccd}$ collected by the external CCD camera, so as to achieve the following purpose:
$$\textbf{T}_x \boldsymbol{E} = \textbf{T}_x \textbf{C}\,\textbf{A}\!\left( \boldsymbol{E}_{ccd} \right), \tag{16}$$
where $\textbf{A}(\cdot)$ denotes the affine transformation to be learned.
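The column-sum identity behind Eqs. (14) and (15) is easy to verify numerically in the noiseless, perfectly matched case. The following is a self-contained check with a random scene and mask; the unit shear step is an assumption.

```python
import numpy as np

rng = np.random.default_rng(1)
Nt, Ny, Nx, shift = 5, 6, 8, 1
I = rng.random((Nt, Ny, Nx))
mask = (rng.random((Ny, Nx)) > 0.5).astype(float)

# Left side: build the coded, sheared, time-integrated snapshot E,
# then apply T_x (integration along x).
E = np.zeros((Ny, Nx + (Nt - 1) * shift))
for t in range(Nt):
    E[:, t * shift:t * shift + Nx] += I[t] * mask
lhs = E.sum(axis=1)

# Right side: code the time-unsheared CCD view T*I, then apply T_x.
E_ccd = I.sum(axis=0)
rhs = (E_ccd * mask).sum(axis=1)

# Shearing only redistributes energy along x, so the row sums agree.
assert np.allclose(lhs, rhs)
```

The check relies on the shear not clipping any pixels, which matches the widened-snapshot geometry of the streak camera measurement.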
However, transforming the images without damaging the image features is itself a challenging problem. Fortunately, the STN network [32,36] offers a solution.
In linear algebra, operations such as translation, scaling, and rotation of images can be represented by matrix operations:
$$\begin{pmatrix} x' \\ y' \end{pmatrix} = \begin{pmatrix} a & b & c \\ d & e & f \end{pmatrix} \begin{pmatrix} x \\ y \\ 1 \end{pmatrix}, \tag{17}$$
where $(x, y)$ and $(x', y')$ are the pixel coordinates before and after the transformation and $a$ to $f$ are the affine transformation parameters.
STN [32] is a spatial transformer network that enables explicit spatial transformation operations on images. The STN is divided into three parts. The first part is the localization network, which takes the input feature map and generates the spatial transformation parameters, i.e., $a$, $b$, $c$, $d$, $e$, $f$, through its hidden layers. The second part is the grid generator, which constructs a sampling grid from the predicted affine transformation parameters and computes, via Eq. (17), the source position in the input image for each output pixel; this is essentially a coordinate mapping. The pixel coordinates obtained in this part, i.e., $x^{\prime }$ and $y^{\prime }$, are not necessarily integers. The third part is the sampler, which fills each pixel of the output image according to the sampling grid obtained in the second part, using bilinear interpolation to take the values of neighbouring pixels into account.
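The grid generator and bilinear sampler can be reproduced in a few lines of plain numpy. This is a sketch of the second and third STN parts only; `affine_warp` is a hypothetical helper, and a practical implementation would use a framework's differentiable equivalents (e.g. PyTorch's `affine_grid`/`grid_sample`).

```python
import numpy as np

def affine_warp(img, theta):
    """Grid generator + bilinear sampler of an STN, in plain numpy.
    theta = [[a, b, c], [d, e, f]] maps each output pixel to a (possibly
    fractional) source location in the input image."""
    H, W = img.shape
    ys, xs = np.mgrid[0:H, 0:W].astype(float)
    # Grid generator: Eq. (17)-style coordinate mapping.
    xsrc = theta[0, 0] * xs + theta[0, 1] * ys + theta[0, 2]
    ysrc = theta[1, 0] * xs + theta[1, 1] * ys + theta[1, 2]
    # Bilinear sampler: interpolate between the four neighbouring pixels.
    x0 = np.clip(np.floor(xsrc).astype(int), 0, W - 2)
    y0 = np.clip(np.floor(ysrc).astype(int), 0, H - 2)
    wx = np.clip(xsrc - x0, 0, 1)
    wy = np.clip(ysrc - y0, 0, 1)
    return (img[y0, x0] * (1 - wx) * (1 - wy)
            + img[y0, x0 + 1] * wx * (1 - wy)
            + img[y0 + 1, x0] * (1 - wx) * wy
            + img[y0 + 1, x0 + 1] * wx * wy)
```

Because the sampling weights are continuous in `theta`, the warp is differentiable with respect to the affine parameters, which is what lets the localization network learn them by gradient descent.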
The STN was originally designed as a component of supervised learning to improve model generalization by increasing the spatial invariance of CNNs. With an STN, differences in image location have little effect on subsequent operations, so tasks such as image classification can be better realized. In this work, we instead make it a part of the untrained neural network, utilizing its ability to spatially transform images without changing image features, using ${\boldsymbol{E}}_{ccd}$ as the input of the network and Eq. (16) as the loss function. The network learns the affine transformation parameters between the time-unsheared image and the compressed image in an unsupervised manner and performs the corresponding affine transformations on the image to achieve image registration. Our algorithm is called the time-unsheared image constraint unsupervised image registration (TUICU-IR) algorithm, and its flowchart is shown in Fig. 5.
In general, the loss function is expressed as the mean square error (MSE), which takes the form of an L2-norm. This form assumes that the data errors are Gaussian and linear, under which the MSE can be derived by incorporating the probability density function into the maximum likelihood estimation (MLE). When there are large outliers, however, the effectiveness of an MSE-based algorithm deteriorates significantly. Correntropy is a nonlinear measure of local similarity [37], and here we employ a loss function that is robust to large outliers, called the correntropy-induced loss function (CLF) [38]. The empirical CLF between two samples $A$ and $B$ can be calculated as
$$L_{CLF}(A, B) = \beta \left[ 1 - \frac{1}{N} \sum_{i=1}^{N} \exp\!\left( -\frac{(a_i - b_i)^2}{2\sigma^2} \right) \right],$$
where $a_i$ and $b_i$ are the elements of $A$ and $B$, $N$ is their number, $\sigma$ is the Gaussian kernel bandwidth and $\beta$ is a scaling coefficient.
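A minimal implementation of such a correntropy-induced loss might look as follows; this is a sketch, and the kernel bandwidth `sigma` and the exact normalization are our assumptions.

```python
import numpy as np

def clf_loss(a, b, beta=1.0, sigma=1.0):
    """Correntropy-induced loss: a Gaussian kernel measures local
    similarity, so large outliers saturate instead of dominating."""
    e = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return beta * np.mean(1.0 - np.exp(-e ** 2 / (2.0 * sigma ** 2)))
```

Unlike MSE, each residual's contribution is bounded by `beta`, so a single large outlier cannot dominate the loss.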
In addition, it can be seen from Fig. 6(a) that the number of frames superimposed on each column of the compressed image collected by the streak camera varies: fewer frames are superimposed on regions a and c than on region b. Therefore, when calculating the first term of the loss function in Eq. (11), we assign different weights by column, proportional to the number of superimposed frames. The relationship between the relative weights and the column index is shown in Fig. 6(b).
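These column-dependent weights follow directly from the shearing geometry; the sketch below counts how many frames overlap each snapshot column, with normalization by the maximum count as our assumption.

```python
import numpy as np

def column_weights(Nx, Nt, shift=1):
    """Relative weight per snapshot column, proportional to how many
    frames overlap that column after shearing by `shift` pixels/frame."""
    W = Nx + (Nt - 1) * shift
    counts = np.zeros(W)
    for t in range(Nt):
        counts[t * shift:t * shift + Nx] += 1   # frame t covers these columns
    return counts / counts.max()
```

The resulting profile rises linearly at the edges and plateaus in the middle, matching the trapezoidal shape sketched in Fig. 6(b).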
3. Results and discussion
3.1 Comparison with state-of-the-art methods
The proposed TUICU-IR algorithm is implemented in Python, employing the Adam optimizer [40] with a learning rate of 0.015 and 2000 iterations, on a computer with an Intel Core i5-12400K CPU and an NVIDIA GeForce RTX 3090 GPU. We adopt CLF as the loss function with $\beta$ = 1. We chose four motion scenes from public datasets: Drop and Aerial, each containing 16 consecutive 256×256 images, and Runner and Crash, each containing 24 consecutive 256×256 images. During the simulation, we set the transmittance of the coding plate to 0.8 and the transmittance of the opaque region to 0.2 in order to be closer to the actual situation. Inspired by [41], we designed a coding plate with a sampling rate of 30%, indicating that 30% of the plate area is transparent and 70% is opaque. The compressed image is obtained through the forward process illustrated in Fig. 3. Specifically, each image is encoded by the same pseudo-random binary mask with elements {0.2, 0.8}. Subsequently, every encoded image of the dataset is horizontally displaced by 1 pixel relative to the previous image, emulating the shearing operation of the streak camera. Finally, projection integration is performed along the $t$-axis and the 2D measurement of the streak camera is acquired. The external CCD measurement is acquired with only the projection integration step. In the simulations, we also use GAP-TV [12], FFDNet [42], FastDVDnet [15] and DeSCI [14] for comparison. It is worth mentioning that only our proposed algorithm uses two images, while all other algorithms use the streak camera image only. For the parameters of these algorithms, we take the default values in the literature. Peak signal-to-noise ratio (PSNR) and structural similarity index measure (SSIM) [43] are used as metrics to evaluate the quality of the reconstructed 3D data cubes.
Table 1 presents the reconstruction results on the four datasets without any noise, and the best results are highlighted in bold font. It can be seen that our algorithm outperforms all other CUP reconstruction algorithms in terms of PSNR and SSIM.
We selected a single frame from each of the four datasets to evaluate the performance of the different algorithms. Figure 7 displays the reconstruction results of the different algorithms, together with the ground truths for comparison. From the presented results, GAP-TV, FFDNet, and DeSCI lose many image details and generate various degrees of artifacts and blurriness; FFDNet even recovers spatial structures that do not exist. For FastDVDnet, although the overall reconstruction quality appears satisfactory, the enlarged image blocks in the red box show that this is because FastDVDnet over-smooths the images, resulting in subpar performance on fine details. Our proposed algorithm recovers most of the spatial details, and the slight blurring in some local areas is due to the large number of reconstructed frames.
3.2 Image registration analysis
We construct four different modes of unregistered situations, named M1, M2, M3 and M4, considering pixel size mismatch, pixel size mismatch + rotation, pixel size mismatch + pixel shifting and pixel size mismatch + rotation + pixel shifting, respectively, as in real scenarios. The unregistered operations are all performed on the time-unsheared image collected by the external CCD camera. M1 reduces the time-unsheared image to 0.9375 times its original size using the "imresize" function in MATLAB. M2 adds a 3$^{\circ }$ rotation on the basis of M1, realized with the "imrotate" function in MATLAB. M3 adds a 3-pixel offset to M1. M4 adds a 1$^{\circ }$ rotation on the basis of M3. Taking the Drop dataset as an example, Fig. 8 displays the effect of these mismatching modes.
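The four modes can be reproduced with `scipy.ndimage` in place of MATLAB's `imresize`/`imrotate`; this is a sketch, and the interpolation order and boundary handling differ slightly between the two toolboxes.

```python
import numpy as np
from scipy import ndimage

def make_mismatch(img, mode):
    """Reproduce the four unregistered modes M1-M4 applied to the
    time-unsheared image: M1 = 0.9375x resize; M2 = M1 + 3 deg rotation;
    M3 = M1 + 3-pixel offset; M4 = M3 + 1 deg rotation."""
    out = ndimage.zoom(img, 0.9375, order=1)          # M1: pixel-size mismatch
    if mode == "M2":
        out = ndimage.rotate(out, 3, reshape=False, order=1)
    elif mode in ("M3", "M4"):
        out = ndimage.shift(out, (3, 3), order=1)     # 3-pixel offset
        if mode == "M4":
            out = ndimage.rotate(out, 1, reshape=False, order=1)
    return out
```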
Figure 9 displays the PSNR between the original time-unsheared image and the results registered by our image registration algorithm in these four modes. The comparison is conducted under several noise conditions: no noise; additive white Gaussian noise (AWGN) with $\sigma _0$ = 5, 10 and 20; Poisson noise; and Poisson noise with $\sigma _0$ = 5 AWGN. It is worth noting that the same noise level is added to the time-unsheared image and the compressed image in each simulation. The horizontal lines in each graph indicate the PSNR between the original time-unsheared image and the noisy time-unsheared image under the corresponding noise conditions. When $\sigma _0$ = 10 or 20, the PSNR of most images registered by the algorithm is higher than that of images to which AWGN of the same level is added directly, in all modes. Under the other noise conditions, the PSNR of some registered images is still higher than that of images noised directly at the same level. This demonstrates the effectiveness of the image registration algorithm. Figure 10 compares the Drop dataset before and after registration with $\sigma _0$ = 10 AWGN in mismatching mode M2, together with the original time-unsheared image and the time-unsheared image with $\sigma _0$ = 10 AWGN added directly.
3.3 Noise robustness analysis
In order to be closer to the actual situation, we also take into account the presence of noise in the image reconstruction process. AWGN with noise levels of $\sigma _0$ = 5, 10 and 20, Poisson noise, and Poisson noise with $\sigma _0$ = 5 AWGN are added simultaneously to both the compressed image obtained by the streak camera and the time-unsheared image obtained by the external CCD camera. The noise level of the two images is identical: if noise with $\sigma _0$ = 10 is added to ${\boldsymbol{E}}$ in Eq. (1), noise of the same level is added to ${\boldsymbol{E}}_{ccd}$ in Eq. (3). To demonstrate the effectiveness of our image registration algorithm in the reconstruction process, we also take the four unregistered modes into account. The iteration count of our algorithm is set to 1500. Table 2 presents the results of GAP-TV, FFDNet, FastDVDnet, DeSCI and our proposed TUICU (undamaged time-unsheared image) and TUICU-IR (four unregistered modes) algorithms under different noise conditions on the four datasets. The reconstruction results of TUICU-IR differ little from, and are sometimes even slightly superior to, those of TUICU with the undamaged time-unsheared image. In addition, the proposed algorithms are more robust to noise, while the reconstruction accuracy of the other algorithms decreases rapidly with increasing noise level.
It is worth mentioning that the results of FFDNet and FastDVDnet are obtained with carefully adjusted parameters. We select two frames each from the reconstruction results of the Crash and Aerial datasets with noise level $\sigma _0$ = 5 and compare the performance of the different algorithms, including the TUICU-IR algorithm with image registration, together with the ground truth for comparison. Figure 11 shows the reconstruction results of the different algorithms with noise level $\sigma _0$ = 5. Even with an unregistered ${\boldsymbol{E}}_{ccd}$, the reconstruction results of the TUICU-IR algorithm with the registration step show little difference from those of the TUICU algorithm using the undamaged ${\boldsymbol{E}}_{ccd}$, and may even recover more details (Crash#19 and Aerial#12). In summary, our proposed algorithm has higher reconstruction accuracy and stronger robustness to various kinds of noise than the traditional algorithms.
3.4 Ablation study
We perform several ablation studies. First, we evaluate the impact of the image registration algorithm on the CS reconstruction results. Specifically, we test what happens when the unregistered time-unsheared image collected by the external CCD is applied directly to our TUICU algorithm without registration, even when the only distortion is a rotation of 1$^{\circ }$. As shown in Fig. 12, a 1$^{\circ }$ rotation is barely distinguishable visually. The results of applying the TUICU algorithm directly to the unregistered time-unsheared image are shown in Table 3, together with the results of the TUICU algorithm with $\rho$ = 0 in Eq. (11), i.e., the unsupervised CUP reconstruction algorithm without the external CCD image constraint.
It can be seen from Table 3 that directly applying the TUICU algorithm to the unregistered time-unsheared image can be even worse than reconstructing without the time-unsheared image constraint at all. As shown in Fig. 13, we select a single frame from the noiseless reconstruction results of the Drop and Runner datasets, together with the ground truths and the error maps. Without the registration step, the reconstruction results are prone to larger errors in the areas where the time-unsheared image does not match the actual data, leading to a sharp reduction of the SSIM value.
We also evaluate the reconstruction results using MSE instead of CLF as the loss function. Table 4 displays the reconstruction results of the TUICU-IR algorithm using MSE as the loss function with AWGN $\sigma _0$ = 10 and 20 in M1 mode. It is worth noting that CLF is still used as the loss function in the image registration step; otherwise the registration operation may fail.
Compared with Table 2, we can find that the PSNR (SSIM) of the reconstruction results using CLF as the loss function with noise levels $\sigma _0$ = 10 and 20 shows an improvement of 0.17 dB (0.0239) and 0.24 dB (0.0578), respectively, relative to the reconstruction results using MSE as the loss function.
In addition, as shown in Table 5, we evaluate the reconstruction results using the general compressed-image loss function instead of our proposed weighted compressed-image loss function of Fig. 6(b), at noise levels $\sigma _0$ = 10 and 20 in M1 mode.
Compared with Table 2, we can find that the PSNR (SSIM) of the reconstruction results using our proposed weighted compressed-image loss function with noise levels $\sigma _0$ = 10 and 20 shows an improvement of 0.17 dB (0.0239) and 0.24 dB (0.0578), respectively, relative to the reconstruction results using the general compressed-image loss function.
3.5 Experiment
In order to verify the reliability of the proposed algorithm in practical experiments, a nanosecond laser pulse illuminating a manually created "E" pattern is imaged. The experimental system is shown in Fig. 1. The laser pulse is divided into two beams by a beam splitter, with one beam directed toward the streak camera and the other toward the external CCD camera. We successfully measure the spatiotemporal intensity evolution of the nanosecond laser pulse by combining our proposed TUICU-IR algorithm with the time-unsheared image constraint CUP system.
The images collected by the two cameras are shown in Fig. 14; there is an obvious mismatch between them. The correlation coefficient between the two vectors obtained by integrating the images along the $x$-axis is 0.987.
A laser pulse with a full width at half maximum (FWHM) of 15 ns is employed, and 43 frames are reconstructed with an interval of 0.89 ns per frame. A movie of the results reconstructed by our proposed algorithm is available in Visualization 1. We select one representative frame every 5 frames, corresponding to a spacing of 4.45 ns. Figure 15(a) compares the reconstruction results of the different methods. Our algorithm clearly reveals the spatio-temporal evolution of the laser pulse, while the other algorithms exhibit noticeable offset traces and lack an apparent spatio-temporal intensity evolution. Figure 15(b) displays the normalized intensity as a function of time for the images reconstructed by each method. The FWHM values obtained from the curves of GAP-TV, FFDNet, FastDVDnet and our proposed TUICU-IR algorithm are 22.36 ns, 21.46 ns, 22.73 ns and 15.03 ns, respectively, indicating that the FWHM recovered by our algorithm is closest to the real situation.
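Extracting the FWHM from a sampled normalized-intensity curve can be done by locating the half-maximum crossings and interpolating linearly between samples. A minimal sketch (the interpolation scheme is an assumption; the paper does not specify how its FWHM values were computed):

```python
import numpy as np

def fwhm(t, intensity):
    """Full width at half maximum of a sampled intensity curve,
    with linear interpolation at the half-maximum crossings."""
    y = np.asarray(intensity, dtype=float)
    half = y.max() / 2.0
    above = np.where(y >= half)[0]
    i0, i1 = above[0], above[-1]
    # rising edge: y increases through the half-maximum
    if i0 > 0:
        t_rise = np.interp(half, [y[i0 - 1], y[i0]], [t[i0 - 1], t[i0]])
    else:
        t_rise = t[i0]
    # falling edge: y decreases, so flip the pair for np.interp
    if i1 < len(y) - 1:
        t_fall = np.interp(half, [y[i1 + 1], y[i1]], [t[i1 + 1], t[i1]])
    else:
        t_fall = t[i1]
    return t_fall - t_rise
```

Applied to the curves of Fig. 15(b), such a measurement yields the widths quoted above, with only the TUICU-IR curve recovering a width close to the true 15 ns pulse.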
4. Conclusion
We propose a time-unsheared image constraint unsupervised DL image registration algorithm for CUP. The proposed method registers the mismatched time-unsheared image with the compressed image; the registered time-unsheared image and the compressed image are then integrated into an end-to-end unsupervised DL framework that reconstructs the dynamic scene from a single compressed image by using the CNN model as the image prior. Extensive simulations and experiments demonstrate the superior performance and strong noise robustness of the proposed method compared with widely used CUP reconstruction algorithms. Furthermore, more prior information can be integrated into the proposed unsupervised DL framework to improve the reconstruction accuracy, providing a better platform for CUP reconstruction of ultrafast events.
Funding
National Natural Science Foundation of China (11975184).
Disclosures
The authors declare no conflict of interest.
Data availability
Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request.
References
1. L. Gao, J. Liang, C. Li, et al., “Single-shot compressed ultrafast photography at one hundred billion frames per second,” Nature 516(7529), 74–77 (2014). [CrossRef]
2. D. Qi, S. Zhang, and C. Yang, “Single-shot compressed ultrafast photography: a review,” Adv. Photonics 2(01), 1 (2020). [CrossRef]
3. P. Wang, J. Liang, and L. V. Wang, “Single-shot ultrafast imaging attaining 70 trillion frames per second,” Nat. Commun. 11(1), 2091 (2020). [CrossRef]
4. J. Liang, L. Zhu, and L. V. Wang, “Single-shot real-time femtosecond imaging of temporal focusing,” Light: Sci. Appl. 7(1), 42 (2018). [CrossRef]
5. I. Orovic, V. Papic, and C. Ioana, “Compressive sensing in signal processing: Algorithms and transform domain formulations,” Math Probl. Eng. 2016, 1–16 (2016). [CrossRef]
6. E. J. Candes, J. Romberg, and T. Tao, “Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information,” IEEE Trans. Inform. Theory 52(2), 489–509 (2006). [CrossRef]
7. J. M. Bioucas-Dias and M. A. T. Figueiredo, “A new twist: Two-step iterative shrinkage/thresholding algorithms for image restoration,” IEEE Trans. on Image Process. 16(12), 2992–3004 (2007). [CrossRef]
8. J. Liang, C. Ma, and L. Zhu, “Single-shot real-time video recording of a photonic mach cone induced by a scattered light pulse,” Sci. Adv. 3(1), e1601814 (2017). [CrossRef]
9. Z. M. Yao, L. Sheng, and Y. Song, “Dual-channel compressed ultrafast photography for z-pinch dynamic imaging,” Rev. Sci. Instrum. 94(3), 035106 (2023). [CrossRef]
10. J. Yao, D. Qi, and C. Yang, “Multichannel-coupled compressed ultrafast photography,” J. Opt. 22(8), 085701 (2020). [CrossRef]
11. C. Yang, D. Qi, and X. Wang, “Optimizing codes for compressed ultrafast photography by the genetic algorithm,” Optica 5(2), 147–151 (2018). [CrossRef]
12. X. Yuan, “Generalized alternating projection based total variation minimization for compressive sensing,” in IEEE International Conference on Image Processing, (2016), pp. 2539–2543.
13. K. Dabov, A. Foi, V. Katkovnik, et al., “Image denoising by sparse 3-d transform-domain collaborative filtering,” IEEE Trans. on Image Process. 16(8), 2080–2095 (2007). [CrossRef]
14. Y. Liu, X. Yuan, and J. Suo, “Rank minimization for snapshot compressive imaging,” IEEE Trans. Pattern Anal. Mach. Intell. 41(12), 2990–3006 (2019). [CrossRef]
15. X. Yuan, Y. Liu, and J. Suo, “Plug-and-play algorithms for video snapshot compressive imaging,” IEEE Trans. Pattern Anal. Mach. Intell. 44(10), 7093–7111 (2022). [CrossRef]
16. C. Z. Jin, D. Qi, and J. Yao, “Weighted multi-scale denoising via adaptive multi-channel fusion for compressed ultrafast photography,” Opt. Express 30(17), 31157–31170 (2022). [CrossRef]
17. L. Zhu, Y. Chen, and J. Liang, “Space- and intensity-constrained reconstruction for compressed ultrafast photography,” Optica 3(7), 694–697 (2016). [CrossRef]
18. A. Zhang, J. Wu, and J. Suo, “Single-shot compressed ultrafast photography based on u-net network,” Opt. Express 28(26), 39299–39310 (2020). [CrossRef]
19. Y. Ma, X. Feng, and L. Gao, “Deep-learning-based image reconstruction for compressed ultrafast photography,” Opt. Lett. 45(16), 4400–4403 (2020). [CrossRef]
20. Z. Wu, J. Zhang, and C. Mou, “Dense deep unfolding network with 3d-cnn prior for snapshot compressive imaging,” in IEEE International Conference on Computer Vision (ICCV), (2021).
21. L. Wang, M. Cao, and X. Yuan, “Efficientsci: Densely connected network with space-time factorization for large-scale video snapshot compressive imaging,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (2023), pp. 18477–18486.
22. C. Yang, S. Zhang, and X. Yuan, “Ensemble learning priors driven deep unfolding for scalable video snapshot compressive imaging,” in IEEE European Conference on Computer Vision (ECCV), (2022).
23. Y. Li, M. Qi, R. Gulve, et al., “End-to-end video compressive sensing using anderson-accelerated unrolled networks,” in 2020 IEEE international conference on computational photography (ICCP), (2020), pp. 1–12.
24. P. Llull, X. Liao, and X. Yuan, “Coded aperture compressive temporal imaging,” Opt. Express 21(9), 10526–10545 (2013). [CrossRef]
25. D. Ulyanov, A. Vedaldi, and V. S. Lempitsky, “Deep image prior,” Int. J. Comput. Vis. 128(7), 1867–1888 (2020). [CrossRef]
26. Z. Meng, Z. Yu, K. Xu, et al., “Self-supervised neural networks for spectral snapshot compressive imaging,” in IEEE/CVF International Conference on Computer Vision, (2021), pp. 2602–2611.
27. Y. He, Y. Yao, and Y. He, “Untrained neural network enhances the resolution of structured illumination microscopy under strong background and noise levels,” Adv. Photonics Nexus 2(04), 046005 (2023). [CrossRef]
28. H. Lee, K. Sohn, and D. Min, “Unsupervised low-light image enhancement using bright channel prior,” IEEE Signal Process. Lett. 27, 251–255 (2020). [CrossRef]
29. F. Wang, Y. Bian, and H. Wang, “Phase imaging with an untrained neural network,” Light, Science & Applications 9(1), 77 (2020). [CrossRef]
30. A. Qayyum, I. Ilahi, F. Shamshad, et al., “Untrained neural network priors for inverse imaging problems: A survey,” IEEE Trans. Pattern Anal. Mach. Intell. 45(5), 6511–6536 (2022). [CrossRef]
31. T. Yokota, H. Hontani, Q. Zhao, et al., “Manifold modeling in embedded space: An interpretable alternative to deep image prior,” IEEE Trans. Neural Netw. Learning Syst. 33(3), 1022–1036 (2022). [CrossRef]
32. M. Jaderberg, K. Simonyan, A. Zisserman, et al., “Spatial transformer networks,” arXiv, arXiv:1506.02025 (2015). [CrossRef]
33. L. I. Rudin, S. Osher, and E. Fatemi, “Nonlinear total variation based noise removal algorithms,” Phys. D 60(1-4), 259–268 (1992). [CrossRef]
34. J. Liu, Y. Sun, X. Xu, et al., “Image restoration using total variation regularized deep image prior,” in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), (2019), pp. 7715–7719.
35. M. Qiao, X. Liu, and X. Yuan, “Snapshot temporal compressive microscopy using an iterative algorithm with untrained neural networks,” Opt. Lett. 46(8), 1888–1891 (2021). [CrossRef]
36. J. Nie, L. Zhang, W. Wei, et al., “Unsupervised deep hyperspectral super-resolution with unregistered images,” in 2020 IEEE International Conference on Multimedia and Expo (ICME), (2020), pp. 1–6.
37. W. Liu, P. P. Pokharel, and J. C. Príncipe, “Correntropy: Properties and applications in non-gaussian signal processing,” IEEE Trans. Signal Process. 55(11), 5286–5298 (2007). [CrossRef]
38. L. Chen, H. Qu, and J. Zhao, “Efficient and robust deep learning with correntropy-induced loss function,” Neural Comput. Appl. 27(4), 1019–1031 (2016). [CrossRef]
39. V. N. Vapnik, The Nature of Statistical Learning Theory (Springer, 1995).
40. D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv, arXiv:1412.6980 (2014). [CrossRef]
41. M. Zhao and S. Jalali, “Theoretical analysis of binary masks in snapshot compressive imaging systems,” in 2023 59th Annual Allerton Conference on Communication, Control, and Computing (Allerton), (2023), pp. 1–8.
42. X. Yuan, Y. Liu, J. Suo, et al., “Plug-and-play algorithms for large-scale snapshot compressive imaging,” in 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), (2020), pp. 1444–1454.
43. Z. Wang, A. C. Bovik, H. R. Sheikh, et al., “Image quality assessment: from error visibility to structural similarity,” IEEE Trans. on Image Process. 13(4), 600–612 (2004). [CrossRef]