Abstract
Coherent diffraction imaging (CDI), as a lensless imaging technique, can achieve a high-resolution image with intensity and phase information from a diffraction pattern. To capture high-speed and high-spatial-resolution scenes, we propose a temporal compressive CDI system. A two-step algorithm using physics-driven deep-learning networks is developed for multi-frame spectra reconstruction and phase retrieval. Experimental results demonstrate that our system can reconstruct up to eight frames from a snapshot measurement. Our results offer the potential to visualize the dynamic process of molecules with large fields of view and high spatial and temporal resolutions.
© 2022 Optica Publishing Group under the terms of the Optica Open Access Publishing Agreement
Coherent diffraction imaging (CDI) [1] has attracted significant interest in various fields such as material science [2] and biology [3], benefiting from the fact that a high-resolution image with intensity and phase information can be generated from a far-field coherent diffraction pattern. Although originally proposed for x rays, CDI is a general technique that can achieve much higher spatial resolution than its direct imaging counterparts. Existing CDI heavily relies on the prevalent phase retrieval (PR) approach [1,4], such as the “hybrid input–output” (HIO) [5] method, which recovers the intensity and phase information of an object from its power spectral density. Due to the ill-posed nature of PR, it is challenging to recover complicated objects without the support constraint, especially in high-detection-noise and low-dynamic-range scenarios [6–8]. Towards this end, various modifications [9,10] of CDI have been proposed to relax the support constraint while retaining the convergence of the reconstruction. Recently, coherent modulated imaging (CMI) [11] has placed a modulator behind the imaged object to obtain dynamic-range-modulated diffraction patterns and thereby recover complicated images.
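For reference, the HIO update can be sketched in a few lines of NumPy. The function below is a simplified, illustrative version (support constraint only, no positivity or noise handling); the function name and parameters are our own, not the exact implementation used in this Letter:

```python
import numpy as np

def hio(magnitude, support, beta=0.9, n_iter=200, seed=0):
    """Simplified hybrid input-output phase retrieval (after Fienup [5]).

    magnitude : measured Fourier magnitudes (square root of the diffraction intensity)
    support   : boolean mask of pixels where the object may be nonzero
    beta      : feedback parameter of the object-domain update
    """
    rng = np.random.default_rng(seed)
    g = rng.random(magnitude.shape)  # random initial object estimate
    for _ in range(n_iter):
        G = np.fft.fft2(g)
        # Fourier-domain constraint: keep the phase, impose the measured magnitude.
        G = magnitude * np.exp(1j * np.angle(G))
        g_prime = np.real(np.fft.ifft2(G))
        # Object-domain constraint: accept g' on the support,
        # damp the estimate with feedback parameter beta elsewhere.
        g = np.where(support, g_prime, g - beta * g_prime)
    return g
```

In practice the support is often unknown, which is exactly the difficulty the modulation-based variants discussed above aim to remove.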
Compressive sensing (CS) [12] provides an effective solution for recovering a high-dimensional signal from its low-dimensional measurements, offering a route to large field-of-view (FoV) and high-spatial-resolution imaging. Specifically, high-dimensional image information is randomly modulated onto a two-dimensional (2D) detection plane through a group of coded patterns and then reconstructed from the measurements using CS algorithms, as in temporal CS [13] and spectral CS [14]. Recently, deep learning [15] has flourished and been successfully applied to CS reconstruction. Inspired by CS and deep learning, the video snapshot compressive imaging (SCI) technique recovers high-speed frames from a single-shot encoded measurement and can enhance the frame rate of existing cameras by one to two orders of magnitude [13,16]. While video SCI can capture high-speed dynamic processes, its modulation has so far mainly been performed in the image domain (plane).
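The video-SCI forward model described above (modulate each high-speed frame with a coding pattern, then sum onto the detector) can be sketched as follows; all sizes and the random masks are illustrative:

```python
import numpy as np

# Minimal sketch of the video-SCI forward model:
# T high-speed frames are multiplied element-wise by binary coding
# patterns and summed into a single 2D snapshot on the detector.
rng = np.random.default_rng(1)
T, H, W = 8, 64, 64
frames = rng.random((T, H, W))             # high-speed scene, one frame per t
masks = rng.integers(0, 2, (T, H, W))      # random {0,1} coding patterns
snapshot = np.sum(masks * frames, axis=0)  # single coded measurement
assert snapshot.shape == (H, W)
```

In conventional video SCI the masks modulate the image plane; the method of this Letter instead applies them on the Fourier plane.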
This Letter proposes a temporal compressive CDI (TC-CDI) method to achieve multi-frame images from single-shot encoded measurement in CDI, where modulation happens in the frequency domain. The proposed method is demonstrated with near-infrared light, where the spatial spectrum of the moving object is modulated by a digital micromirror device (DMD) that projects a group of coding patterns on the Fourier plane. Experimental results show that a compression ratio of up to eight can be achieved, which demonstrates the effectiveness of our proposed method, and potentially advances CDI to visualize the dynamic process of molecules with large FoVs and high spatial and temporal resolutions simultaneously.
Although compressive diffraction imaging has been reported before [17] to investigate the material properties of objects along the longitudinal location using a single-pixel energy-sensitive detector, our work differs in that the compression of the proposed TC-CDI takes place in the temporal domain while the modulation happens in the frequency domain. Furthermore, our reconstruction algorithm is based on a physics-driven deep-learning framework, which combines the flexibility of optimization-based approaches with the high reconstruction quality of learning-based approaches.
The contributions of this work are two-fold: (1) the video SCI technique is generalized to frequency-domain modulation, and (2) the proposed physics-driven deep-learning algorithm enables TC-CDI to reconstruct high-quality video frames. Our work provides a novel paradigm to integrate CDI, deep learning, and PR to capture high-resolution high-dynamic scenes.
Figure 1 depicts a schematic diagram of TC-CDI. Without loss of generality, the moving object $\textbf{O}(\textbf{x},t)$ is illuminated by monochromatic light ${\textbf{U}_0}(\textbf{x})$, where $\textbf{x}$ denotes the spatial coordinate, and $t$ refers to the temporal coordinate. According to Fraunhofer diffraction, the light field ${\textbf{U}_d}(\textbf{x})$ on the detection plane is, up to a constant factor, the Fourier transform of the exit wave:
$${\textbf{U}_d}(\textbf{x},t) = {\cal F}\{{\textbf{U}_0}(\textbf{x})\,\textbf{O}(\textbf{x},t)\}.$$
Correspondingly, the detected intensity $\textbf{I}(\textbf{x})$ is
$$\textbf{I}(\textbf{x}) = \parallel {\textbf{U}_d}(\textbf{x}){\parallel ^2}.$$
We define $\textbf{U}_d^\prime (:,:,t) = \parallel {\textbf{U}_d}(:,:,t){\parallel ^2}$, so that the single-shot measurement in Eq. (4) is $\textbf{Y} = \sum\nolimits_{t = 1}^T \textbf{M}(:,:,t) \odot \textbf{U}_d^\prime(:,:,t) + \textbf{G}$, with $\textbf{M}$ the modulation patterns, $\odot$ the Hadamard (element-wise) product, and $\textbf{G}$ the noise. Vectorizing ${\boldsymbol Y}$, $\textbf{U}_d^\prime $, and $\textbf{G}$, namely, $\textbf{y} = {\rm vec}({\boldsymbol Y}) \in {{\mathbb R}^{\textit{WH}}}$, $\textbf{u} = {\rm vec}(\textbf{U}_d^\prime) \in {{\mathbb R}^{\textit{WHT}}}$, and $\textbf{g} = {\rm vec}(\textbf{G}) \in {{\mathbb R}^{\textit{WH}}}$, Eq. (4) turns into
$$\textbf{y} = {\boldsymbol \Phi} \textbf{u} + \textbf{g},$$
where ${\boldsymbol \Phi} \in {{\mathbb R}^{WH \times WHT}}$ denotes the sensing matrix, which is a concatenation of diagonal matrices. Specifically, ${\boldsymbol \Phi} = [{{\boldsymbol \Phi}_1}, \ldots ,{{\boldsymbol \Phi}_T}]$, where ${{\boldsymbol \Phi}_t} = {\rm Diag}({\rm vec}(\textbf{M}(:,:,t)))$ is a diagonal matrix whose diagonal elements are composed of ${\rm vec}(\textbf{M}(:,:,t))$.

Next, a two-stage reconstruction model is proposed to recover the image information from the measurement $\textbf{y}$. Specifically, as shown in Fig. 2, in the first stage, the frequency-domain frames are reconstructed through a physics-driven deep-unfolding structure based on a deep neural network (DNN). Concretely, the reconstruction can be modeled as the optimization problem
$$\hat{\textbf{u}} = \arg \mathop {\min}\limits_{\textbf{u}} \frac{1}{2}\parallel \textbf{y} - {\boldsymbol \Phi}\textbf{u}{\parallel ^2} + \lambda R(\textbf{u}),$$
where $R(\textbf{u})$ is a regularization term imposing image priors, and $\lambda$ balances data fidelity against regularization.
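As a sanity check on the vectorized model, the following sketch (with illustrative sizes and row-major vectorization) builds ${\boldsymbol \Phi}$ as the concatenation of diagonal matrices and verifies that ${\boldsymbol \Phi}\textbf{u}$ reproduces the mask-and-sum measurement:

```python
import numpy as np

# Verify that the concatenated-diagonal sensing matrix Phi = [Phi_1, ..., Phi_T]
# reproduces the element-wise modulation and temporal summation.
rng = np.random.default_rng(2)
T, H, W = 4, 8, 8
U = rng.random((H, W, T))            # frequency-domain frames U_d'
M = rng.integers(0, 2, (H, W, T))    # modulation patterns M

# Direct measurement: element-wise modulation, then sum over time.
Y = np.sum(M * U, axis=2)

# Vectorized form: y = Phi u with Phi_t = Diag(vec(M(:,:,t))).
Phi = np.hstack([np.diag(M[:, :, t].ravel()) for t in range(T)])
u = np.concatenate([U[:, :, t].ravel() for t in range(T)])
y = Phi @ u

assert np.allclose(y, Y.ravel())
```

Because each ${\boldsymbol \Phi}_t$ is diagonal, ${\boldsymbol \Phi}$ is never formed explicitly in practice; the forward and adjoint operators reduce to cheap element-wise products, which is what makes deep-unfolding iterations over this model tractable.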
The optical setup of our TC-CDI system is shown in Fig. 3. A laser source with 780 nm center wavelength and 50 kHz spectral linewidth is coupled into a single-mode fiber. The output light from the fiber is then collimated by two achromatic doublet lenses, AL1 ($f = 50\,\,{\rm mm}$) and AL2 ($f = 100\,\,{\rm mm}$), to a beam diameter of approximately $10\,\,{\rm mm}$. We adopt a classical convex lens to realize the Fourier transform of the object. Specifically, the moving object is located at the front focal plane of the achromatic doublet lens AL3 ($f = 100\,\,{\rm mm}$), a slit diaphragm ($0.56\,\,{\rm mm}$) behind the object controls the imaging FoV, and the detection plane is at the back focal plane of AL3. A 4F system, consisting of two biconvex lenses, AL4 ($f = 50\,\,{\rm mm}$) and AL5 ($f = 100\,\,{\rm mm}$), is used to magnify the spatial spectrum distribution so that it matches the DMD modulator (TI, $1024 \times 768\;{\rm pixels}$, $13.68\,\,{\unicode{x00B5}{\rm m}}$ pixel pitch). The modulation patterns are pre-stored in the DMD and change periodically over time, following a random binary $\{0,1\}$ distribution. The encoded measurement is projected onto the camera (MV-CA013-A0UM, $1024 \times 1280$ pixels, $4.8\,\,{\unicode{x00B5}{\rm m}}$ pixel pitch) through an imaging system consisting of AL6 ($f = 100\,\,{\rm mm}$) and an objective lens (OL, $4 \times$, ${\rm NA} = 0.2$).
To verify the advantage of the CDI technique in high-spatial-resolution imaging, we compare the imaging resolution of the CDI technique using the DNN-HIO algorithm with a direct imaging technique. As shown in Fig. 4(a), two selected sets of line pairs in the USAF 1951 resolution target are chosen as objects. The spatial resolutions of the two selected sets are 57.02 line pairs per millimeter (lp/mm) and $14.25\;{\rm lp/mm}$. As can be observed from Figs. 4(c) and 4(d), CDI through the proposed DNN-HIO algorithm has much higher spatial resolution than direct imaging. This is because the spatial resolution of direct imaging is limited by the pixel size of the detector, whereas the spatial resolution of CDI is limited by the detection area of the detector. Therefore, for the fine structures of the object in a small FoV, CDI generally has higher spatial resolution than direct imaging.
Next, we consider the dynamic scenario. As shown in Fig. 5(b), a fraction of the resolution target containing the numbers “2” and “3” is selected as the object. The object moves at high speed and is captured by the imaging system within a single snapshot (50 ms). The measurement is $512 \times 512$ pixels. Using the proposed network, $T = 8$ spatial-spectrum images are recovered, from which the spatial images are then reconstructed. As shown in Fig. 5(d), only two frames are displayed because the object is simple. Overall, the reconstruction results demonstrate that the proposed TC-CDI technique can recover dynamic video frames of moving targets from a single shot.
Last, we adopt TC-CDI to visualize the motion of complicated objects. The moving logo of Westlake University is observed (see Visualization 1). Figure 6 shows the reconstruction results at compression ratio $T = 8$, namely, eight frames are recovered from single-shot measurement, where each frame is $512 \times 512\;{\rm pixels}$ in size, cropped to $100 \times 100\;{\rm pixels}$ for better visualization. Clearly, different parts of the logo can be observed entering and leaving the FoV over time, which illustrates that the complete motion of the object can be visualized with a single exposure. In addition, thanks to deep learning, the proposed DNN-HIO algorithm achieves higher reconstruction quality than HIO. A detailed comparison with different PR algorithms is shown in Supplement 1, in which we also show $T = 20$ high-quality frames reconstructed from a snapshot measurement.
In summary, we have proposed an imaging approach integrating SCI into CDI. A system has been built, and we have developed a two-stage deep-learning model for the reconstruction. In the first stage, we estimate frequency-domain frames by a physics-driven network, and then feed these frames into the second stage for PR. Experimental results demonstrate the efficacy of the proposed system and algorithm.
Existing CDI techniques can achieve the finest spatial resolution of about 2 nm using x-ray beams [20], but the exposure time remains on the order of seconds. For example, the CMI technique demands an exposure time of 3 s [11], while 3D reconstruction through a partial CDI technique [21] requires 16 s. Our proposed TC-CDI technique achieved a spatial resolution of 17.5 µm using a 780 nm laser, with a frame rate of 160 fps. Although the spatial resolution of our TC-CDI is at the $\unicode{x00B5}{\rm m}$ level due to the near-infrared source, its temporal resolution is significantly improved by the frequency-domain modulation method. Our future work will employ other sources, such as x rays, to improve the spatial resolution.
For detectors with a fixed dynamic range, such as eight bits, it is challenging for TC-CDI to recover high-dynamic-range scenes, especially in the presence of noise. In general, the quality of images reconstructed by TC-CDI declines as the dynamic range of the detector decreases. Although our TC-CDI technique currently adopts random patterns, its usable dynamic range can be further improved by optimizing the modulation patterns, e.g., so that they block the bright reflected light in the central area.
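The dynamic-range issue can be illustrated numerically: a far-field diffraction pattern concentrates most of its energy in the central order, so an 8-bit quantizer pushes many weak high-frequency pixels below the quantization floor. The sketch below (object, sizes, and seed are arbitrary) compares 8-bit and 16-bit quantization of a simulated diffraction intensity:

```python
import numpy as np

# Simulate a far-field diffraction intensity of a small bright object;
# the DC order dominates, creating a huge intensity dynamic range.
rng = np.random.default_rng(3)
obj = np.zeros((128, 128))
obj[48:80, 48:80] = rng.random((32, 32))
intensity = np.abs(np.fft.fftshift(np.fft.fft2(obj))) ** 2

for bits in (8, 16):
    levels = 2 ** bits - 1
    # Uniform quantization relative to the brightest (central) pixel.
    q = np.round(intensity / intensity.max() * levels) / levels
    kept = np.count_nonzero(q) / q.size  # fraction surviving quantization
    print(f"{bits}-bit: {kept:.1%} of pixels above the quantization floor")
```

The weak high-frequency pixels lost at low bit depth carry the fine spatial detail, which is why reconstruction quality degrades as the detector dynamic range shrinks.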
Compared with existing techniques, which can only image objects that stay within the FoV and cannot localize them, our system can restore the motion of high-speed objects moving into and out of the FoV from a single exposure. Our method can also be used to improve the insufficient resolution of direct imaging for microscopic objects. The proposed method has broad applications in microscopic imaging and x-ray intensity CDI. In the future, more effort is needed to improve the acquisition range and compression ratio. A joint network combining temporal reconstruction and PR could also be built.
Funding
Lochn Optics.
Acknowledgment
We acknowledge the Research Center for Industries of the Future (RCIF) at Westlake University for supporting this work, and the funding from Lochn Optics. We thank Dr. Youzhen Gui at the Shanghai Institute of Optics and Fine Mechanics, Chinese Academy of Sciences, for providing the near-infrared, single-frequency laser in our experiments.
Disclosures
The authors declare no conflicts of interest.
Data availability
Data underlying the results presented in this Letter may be obtained at [22].
Supplemental document
See Supplement 1 for supporting content.
REFERENCES
1. J. Miao, P. Charalambous, J. Kirz, and D. Sayre, Nature 400, 342 (1999).
2. J. Miao, Y. Nishino, Y. Kohmura, B. Johnson, C. Song, S. H. Risbud, and T. Ishikawa, Phys. Rev. Lett. 95, 085503 (2005).
3. H. Jiang, C. Song, C.-C. Chen, R. Xu, K. S. Raines, B. P. Fahimian, C.-H. Lu, T.-K. Lee, A. Nakashima, J. Urano, T. Ishikawa, F. Tamanoi, and J. Miao, Proc. Natl. Acad. Sci. USA 107, 11234 (2010).
4. M. A. Pfeifer, G. J. Williams, I. A. Vartanyants, R. Harder, and I. K. Robinson, Nature 442, 63 (2006).
5. J. R. Fienup, Appl. Opt. 21, 2758 (1982).
6. M. M. Seibert, T. Ekeberg, F. R. Maia, et al., Nature 470, 78 (2011).
7. X. Huang, J. Nelson, J. Steinbrener, J. Kirz, J. J. Turner, and C. Jacobsen, Opt. Express 18, 26441 (2010).
8. A. Barty, J. Küpper, and H. N. Chapman, Annu. Rev. Phys. Chem. 64, 415 (2013).
9. R. Horisaki, R. Egami, and J. Tanida, Opt. Express 24, 3765 (2016).
10. I. Johnson, K. Jefimovs, O. Bunk, C. David, M. Dierolf, J. Gray, D. Renker, and F. Pfeiffer, Phys. Rev. Lett. 100, 155503 (2008).
11. F. Zhang, B. Chen, G. R. Morrison, J. Vila-Comamala, M. Guizar-Sicairos, and I. K. Robinson, Nat. Commun. 7, 13367 (2016).
12. D. L. Donoho, IEEE Trans. Inf. Theory 52, 1289 (2006).
13. P. Llull, X. Liao, X. Yuan, J. Yang, D. Kittle, L. Carin, G. Sapiro, and D. J. Brady, Opt. Express 21, 10526 (2013).
14. A. A. Wagadarikar, N. P. Pitsianis, X. Sun, and D. J. Brady, Opt. Express 17, 6368 (2009).
15. G. Barbastathis, A. Ozcan, and G. Situ, Optica 6, 921 (2019).
16. X. Yuan, D. J. Brady, and A. K. Katsaggelos, IEEE Signal Process. Mag. 38(2), 65 (2021).
17. J. Greenberg, K. Krishnamurthy, and D. Brady, Opt. Lett. 39, 111 (2014).
18. Z. Cheng, B. Chen, G. Liu, H. Zhang, R. Lu, Z. Wang, and X. Yuan, in IEEE/CVF Conference on Computer Vision and Pattern Recognition (2021), pp. 16246–16255.
19. K. Zhang, W. Zuo, and L. Zhang, IEEE Trans. Image Process. 27, 4608 (2018).
20. Y. Takahashi, Y. Nishino, R. Tsutsumi, N. Zettsu, E. Matsubara, K. Yamauchi, and T. Ishikawa, Phys. Rev. B 82, 214102 (2010).
21. J. Clark, X. Huang, R. Harder, and I. Robinson, Nat. Commun. 3, 993 (2012).
22. Z. Chen, S. Zheng, Z. Tong, and X. Yuan, “Code for temporal compressive coherent diffraction imaging,” GitHub (2022), https://github.com/zsm1211.