
Temporal structured illumination and vision-transformer enables large field-of-view binary snapshot ptychography

Open Access

Abstract

Ptychography, a widely used computational imaging method, generates images by processing coherent interference patterns scattered from an object of interest. To capture scenes with a large field-of-view (FoV) and high spatial resolution simultaneously in a single shot, we propose a temporal-compressive structured-light Ptychography system. A novel three-step reconstruction algorithm composed of multi-frame spectra reconstruction, phase retrieval, and multi-frame image stitching is developed, where we employ an emerging Transformer-based network in the first step. Experimental results demonstrate that our system can expand the FoV by 20× without losing spatial resolution. Our results offer great potential for lensless imaging of molecules with a large FoV as well as high spatial-temporal resolution. We also note that, due to the loss of low-intensity information in the compressed sensing process, our method is so far only applicable to binary targets.

© 2024 Optica Publishing Group under the terms of the Optica Open Access Publishing Agreement

1. Introduction

Ptychography [1] has received extensive attention in the fields of materials science [2] and biology [3] because of its high resolution and large field-of-view (FoV). In recent years, there has been significant research on both applications of Ptychography and improvements to it. For instance, single-shot Ptychography superimposed partially coherent beams on each other [4], while fast lensless imaging of crystals used a fourth-generation synchrotron source [5]. Fourier Ptychography is implemented through a combination of synthetic aperture and phase retrieval concepts [6]. Similar to coherent diffraction imaging [7], Ptychography has also been used at short wavelengths where light cannot be focused by a lens; instead, the scene is reconstructed from far-field coherent diffraction patterns by phase retrieval (PR) algorithms [7,8], among which the "hybrid input-output" (HIO) method [9] is one of the most commonly used. Due to the ill-posed nature of the PR problem, it is challenging to recover complicated objects without a support constraint, especially in high detection-noise and low dynamic-range scenarios [10-12]. Decomposing complex targets into multiple simple small regions is an effective way to achieve PR for complex targets. Towards this end, Ptychography is able to image more complicated objects than coherent diffraction imaging by scanning a stack of illuminated regions. However, the repeated cycle of moving the structured-light illumination and exposing the camera inevitably prolongs the time required to image the whole target.

Recently, meta-imaging sensors have been validated for large field-of-view 3D imaging in the visible band [13], and snapshot Ptychography realized by array cameras avoids the complex issues of illumination-area movement and image composition [14]. While this method is novel, the array camera is challenging to build. This paper aims to answer the question: is it possible to use a single camera to implement Ptychography of a large FoV with complicated scenes by a snapshot measurement?

To answer this question, we hereby employ the idea of compressive sensing (CS) [15], which provides an effective way to recover a high-dimensional signal from its low-dimensional measurements and thus opens a path toward large-FoV, high-spatial-resolution imaging. Specifically, we utilize the snapshot compressive imaging (SCI) [16-18] technique, which can reconstruct high-speed frames from a single-shot encoded measurement and can enhance the frame rate of existing cameras by 1-2 orders of magnitude. In this manner, we can retrieve a number of frames, each with a different small FoV, and stitch them into a final image with a large FoV that can include complicated scenes and fine details. Our system is thus denoted FoV-compressive Ptychography, dubbed VC-Ptychography for short.

Specifically, we propose a FoV-compressive Ptychography method that recovers multi-frame images from a single-shot encoded measurement, where the modulation happens in the frequency domain. The proposed method is demonstrated with near-infrared light. The object is illuminated by sequential structured light generated by a digital micro-mirror device (DMD), while a second DMD projects a group of coding patterns on the Fourier plane within one exposure period. Experimental results show that a compression ratio of up to 20 can be achieved, which demonstrates the effectiveness of our proposed method and potentially advances Ptychography toward visualizing the dynamic processes of molecules with a large FoV and high spatial and temporal resolution simultaneously.

To tackle this challenging inverse problem, a novel three-step reconstruction algorithm composed of multi-frame spectra reconstruction, phase retrieval, and multi-frame image stitching is developed, where we employ an emerging deep learning structure, namely a Transformer-based network, in the first step. The phase-retrieval step employs a deep neural network (DNN) based HIO, dubbed DNN-HIO.

2. Methods

2.1 Forward model

Fig. 1 depicts the schematic diagram of Ptychography. Without loss of generality, the object ${\bf O}({\bf x})$ is illuminated by monochromatic light ${\bf U}_{0}({\bf x})$, with a movable diaphragm limiting its spatial extent, where ${\bf x}$ denotes the spatial coordinate. According to Fraunhofer diffraction, the light field ${\bf U}_{d}({\bf x})$ is given by:

$$\begin{aligned} \mathbf{U}_{d}(\mathbf{x}) &\propto \displaystyle\int{ \mathbf{O}\left(\mathbf{x}'\right)\mathbf{U}_{0}\left(\mathbf{x'}\right) \mathbf{P}\left(\mathbf{x}',t\right) \exp\left\{{-}i\frac{2\pi}{\lambda z}{\mathbf{x}'\mathbf{x}}\right\} \mathrm{d}\mathbf{x}'}\\ & \propto \mathcal{F}\left\{ \mathbf{O}\left(\mathbf{x}'\right)\mathbf{U}_{0}\left(\mathbf{x'}\right) \mathbf{P}\left(\mathbf{x}',t\right)\right\}_{\frac{2\pi}{\lambda z}\mathbf{x}'}, \end{aligned}$$
where ${\bf P}(\cdot )$ is the imaging FoV, $\mathcal {F}\{\cdot \}$ is the Fourier transformation, ${\bf x}'$ is the spatial coordinate at the object plane, and $t$ refers to the temporal coordinate.
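As a minimal numerical illustration of Eq. (1), the far field can be approximated by a 2D FFT of the illuminated, FoV-limited object. The grid size, the square window helper, and the random test object below are hypothetical choices for illustration, not parameters of the actual system.

```python
import numpy as np

# Discrete illustration of Eq. (1): the far field U_d is (up to a scale factor)
# the 2D Fourier transform of O(x') * U0(x') * P(x', t).
N = 512
rng = np.random.default_rng(0)
O = rng.random((N, N))                    # toy amplitude object
U0 = np.ones((N, N))                      # plane-wave illumination (constant amplitude)

def fov_window(n, width, top_left):
    """Square FoV window P(x', t): 1 inside the illuminated patch, 0 elsewhere."""
    P = np.zeros((n, n))
    r0, c0 = top_left
    P[r0:r0 + width, c0:c0 + width] = 1.0
    return P

P = fov_window(N, width=64, top_left=(224, 224))

U_d = np.fft.fftshift(np.fft.fft2(O * U0 * P))   # far-field (Fraunhofer) amplitude
intensity = np.abs(U_d) ** 2                      # what the detector records before coding
```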


Fig. 1. Schematic of Ptychography, where ${\bf U}_0({\bf x})$ represents the incident light, ${\bf O}({\bf x})$ is the object, ${\bf P}({\bf x},t)$ denotes the imaging field of view (FoV), ${\bf M}({\bf x},t)$ is the dynamic random modulation function, and ${\bf I}({\bf x})$ represents the intensity signal on a pixelated detector. A movable diaphragm is used to limit the illumination range.


Considering plane-wave illumination, namely ${\bf U}_{0}\left ({\bf x'}\right ) = {\tt constant}$, the light field ${\bf U}_{t}({\bf x},t)$ after random modulation ${\bf M}({\bf x},t)$ can be expressed as:

$$\mathbf{U}_{t}(\mathbf{x},t) = \mathbf{U}_{d}(\mathbf{x},t)\mathbf{M}(\mathbf{x},t).$$

Correspondingly, the detected intensity ${\bf I}({\bf x})$ is defined as:

$$\begin{aligned} \mathbf{I}(\mathbf{x}) & = \displaystyle\int_{\Delta_t } \left\| \mathbf{U}_{t}(\mathbf{x},t) \right\|^2 \mathrm{d}t\\ & \propto \displaystyle\int_{\Delta_t} \left\| \mathcal{F} {\left\{ \mathbf{O}\left(\mathbf{x}'\right) \mathbf{P}\left(\mathbf{x}',t\right)\right\}}_{\frac{2\pi}{\lambda z}\mathbf{x}'} \right\|^2 \left\| \mathbf{M}(\mathbf{x},t) \right\|^2 \mathrm{d}t, \end{aligned}$$
where $\Delta _t$ denotes the integration (exposure) time of the detector. Writing Eq. (3) in discrete form and accounting for the measurement noise, we obtain
$${{\boldsymbol Y}} = \sum_{t=1}^{T} \left\| \mathbf{U}_{d}(:,:,t)\odot\mathbf{M}(:,:,t) \right\|^2+\mathbf{G},$$
where ${{\boldsymbol Y}}$, ${\bf U}_{d}(:,:,t)$ and ${\bf M}(:,:,t)$ are the discrete forms of ${\bf I}({\bf x})$, ${\bf U}_{d}({\bf x},t)$, and ${\bf M}({\bf x},t)$, each of shape $\mathbb {R}^{W\times H}$ ($H$ rows and $W$ columns); $T$ denotes the number of frequency-domain (compressed) frames, ${\bf G}$ signifies the measurement noise, and $\odot$ represents the Hadamard (element-wise) product.

Note that since ${\bf M}(:,:,t)$ is binary, i.e., composed of {0, 1}, where ‘0’ blocks the light and ‘1’ transmits the light, the detected intensity can be modeled as:

$${{\boldsymbol Y}} = \sum_{t=1}^{T} \left\| \mathbf{U}_{d}(:,:,t)\right\|^2\odot\mathbf{M}(:,:,t)+\mathbf{G}.$$
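A minimal sketch of this discrete forward model follows: $T$ Fourier-domain intensity frames are element-wise masked by binary patterns and summed into one snapshot. The array sizes, the Gaussian noise level, and the random masks are assumptions chosen purely for illustration.

```python
import numpy as np

# Sketch of Eq. (5): T frequency-domain intensity frames, masked and summed.
rng = np.random.default_rng(0)
H, W, T = 256, 256, 20                          # detector size and compression ratio (assumed)

U_d_sq = rng.random((H, W, T))                  # |U_d(:,:,t)|^2 (toy data)
M = (rng.random((H, W, T)) > 0.5).astype(float) # binary {0,1} masks on the Fourier-plane DMD
G = 0.01 * rng.standard_normal((H, W))          # measurement noise (assumed level)

Y = (U_d_sq * M).sum(axis=2) + G                # Eq. (5): single snapshot measurement
```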

We define ${\bf U}_{d}^{'}(:,:,t) = \|{\bf U}_{d}(:,:,t)\|^2$, and vectorize ${{\boldsymbol Y}}$, ${\bf U}_{d}^{'}$ and ${\bf G}$, namely, ${\bf y}={\tt vec}({{\boldsymbol Y}})\in \mathbb {R}^\textit{WH}$, ${\bf u}={\tt vec}({\bf U}_{d}^{'})\in \mathbb {R}^\textit{WHT}$ and ${\bf g}={\tt vec}({\bf G})\in \mathbb {R}^\textit{WH}$. Eq. (5) then becomes:

$$\mathbf{y}= \boldsymbol{\Phi} \mathbf{u}+ \mathbf{g}, \quad \boldsymbol{\Phi}\in\mathbb{R}^{\textit{WH}\times\textit{WHT}},$$
where $\boldsymbol {\Phi } = [\boldsymbol {\Phi }_1,\dots,\boldsymbol {\Phi }_T]$ denotes the sensing matrix and is a concatenation of diagonal matrices. Specifically, $\boldsymbol {\Phi }_t= {\tt Diag}({\tt vec}({\bf M}(:, :, t)))$ is a diagonal matrix with its diagonal elements composed of ${\tt vec}({\bf M}(:, :, t))$, $\forall t=1,\dots,T$.
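The equivalence between Eq. (5) and the vectorized form of Eq. (6) is easy to verify numerically; since each $\boldsymbol {\Phi }_t$ is diagonal, $\boldsymbol {\Phi }{\bf u}$ reduces to the same masked sum. The sketch below uses deliberately small array sizes so that the sparse sensing matrix stays tiny.

```python
import numpy as np
from scipy.sparse import diags, hstack

# Numerical check that the vectorized model of Eq. (6) reproduces Eq. (5).
rng = np.random.default_rng(1)
H, W, T = 32, 32, 4

U_d_sq = rng.random((H, W, T))
M = (rng.random((H, W, T)) > 0.5).astype(float)

# Phi = [Phi_1, ..., Phi_T] with Phi_t = Diag(vec(M(:,:,t))), shape (WH, WHT).
Phi = hstack([diags(M[:, :, t].ravel()) for t in range(T)])
u = np.concatenate([U_d_sq[:, :, t].ravel() for t in range(T)])  # vec of all frames

y = Phi @ u                                     # Eq. (6), noiseless
y_direct = (U_d_sq * M).sum(axis=2).ravel()     # Eq. (5), computed directly
assert np.allclose(y, y_direct)
```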

2.2 Reconstruction algorithm

In this section, we design a three-step reconstruction algorithm to restore the desired large-FoV image from a single 2D measurement of small FoV. With knowledge of the modulation, in the first stage the captured 2D measurement can be reconstructed into sequential frames by existing algorithms [19-23]. However, existing optimization-based algorithms are time-consuming, while learning-based algorithms are not robust to scenarios with different scales and are weak at modeling long-range dependencies. To ameliorate these issues, we combine a traditional optimization algorithm, i.e., generalized alternating projection [24,25], with deep learning, in which Transformer blocks [26] are plugged in. The first reconstruction step is an ill-posed problem that can be modeled as

$$\hat{\mathbf{u}} = \arg\min_\mathbf{u} \left\|\mathbf{y} - \boldsymbol{\Phi} \mathbf{u}\right\|_2^2 + \tau R(\mathbf{u}),$$
where $R({\bf u})$ represents the regularization part and $\tau$ is a balance parameter.

To eliminate the effect of the mask and improve the model’s robustness, we solve Eq. (7) under a deep unfolding framework [27] composed of 3D CNNs and Transformer blocks. To streamline the whole framework, we only use two projections, each followed by our proposed network. Specifically,

$$\begin{aligned} \mathbf{u}^{(j)} =\mathbf{v}^{(j-1)} + \boldsymbol{\Phi}^{\top}(\boldsymbol{\Phi}\boldsymbol{\Phi}^{\top} )^{{-}1} \left(\mathbf{y} - \boldsymbol{\Phi} \mathbf{v}^{(j-1)}\right), \end{aligned}$$
$$\begin{aligned}\mathbf{v}^{(j)} = {\rm Network} (\mathbf{u}^{(j)}), \end{aligned}$$
where ${\bf v}^{(j)}$ denotes the estimated signal at the $j$-th phase.
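Since $\boldsymbol {\Phi }$ is a concatenation of diagonal binary matrices, $\boldsymbol {\Phi }\boldsymbol {\Phi }^{\top }$ is itself diagonal with entries $\sum _{t}{\tt vec}({\bf M}(:,:,t))$, so the projection of Eq. (8) can be evaluated frame-wise without forming any matrix. The sketch below applies two such unfolded phases; the toy denoiser is only a placeholder for the Transformer/3D-CNN network of Eq. (9), and the array sizes are arbitrary.

```python
import numpy as np

def projection_step(v, Y, M, eps=1e-8):
    """Eq. (8): u = v + Phi^T (Phi Phi^T)^{-1} (y - Phi v), evaluated frame-wise.

    v, M : (H, W, T) current estimate and binary masks; Y : (H, W) measurement.
    Phi Phi^T is diagonal with entries sum_t M(:,:,t), so its inverse is an
    element-wise division (eps guards pixels that no mask ever transmits).
    """
    Phi_v = (M * v).sum(axis=2)                      # Phi v
    r = (Y - Phi_v) / (M.sum(axis=2) + eps)          # (Phi Phi^T)^{-1} (y - Phi v)
    return v + M * r[:, :, None]                     # Phi^T lifts the residual to each frame

def toy_denoiser(u):
    """Placeholder for the Transformer/3D-CNN network of Eq. (9), not the real model."""
    return np.clip(u, 0.0, None)

# Two unfolded phases, mirroring the streamlined framework described above.
rng = np.random.default_rng(2)
H, W, T = 64, 64, 10
M = (rng.random((H, W, T)) > 0.5).astype(float)
Y = (M * rng.random((H, W, T))).sum(axis=2)          # synthetic measurement

v = np.zeros((H, W, T))
for _ in range(2):
    u = projection_step(v, Y, M)
    v = toy_denoiser(u)
```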

In Fig. 2, there are eight blocks after each projection; all blocks share the same architecture, each composed of two Transformer modules followed by 3D-CNNs. Multi-head self-attention (MSA) modules are widely utilized in Transformers. Most traditional MSAs of video Transformers perform global spatial interactions by using all tokens extracted from the whole feature map. In this paper, the first Transformer module shares the same setting as the block attention in the Swin Transformer [26], which focuses on the spatial domain, but does not follow its design of shifting the window partition between consecutive self-attention layers. Let $X_f \in \mathbb {R}^{B \times C \times W \times H \times T}$ denote the input feature map. Layer normalization is conducted first. Then, given the 2D window size $S \times S$, the input tokens are partitioned into 2D windows, yielding a tensor of shape $\mathbb {R}^{B \times \frac {W}{S} \times \frac {H}{S} \times T \times C \times S \times S}$, and we conduct MSA on the windows for each head:

$${\rm MSA}(Q, K, V) = {\rm Softmax} \left( \frac{QK^T}{\sqrt{d}} + B\right)V,$$
where $Q$, $K$, and $V$ denote the query, key, and value matrices, respectively. Each head has channel dimension $d=C/N$, where $N$ is the number of heads, and $B$ represents the 3D relative position bias. In the second Transformer module, we impose self-attention on the temporal dimension. To further exploit the different FoVs and spatial correlations, we feed the output of the Transformer modules into 3D CNNs. Although 2D CNNs capture spatial features well for image-based tasks, for our FoV task it is essential to model FoV information and motion patterns; performing 3D convolutions over the spatio-temporal volume therefore outperforms approaches that neglect temporal and FoV information. To let the network structure inherit the advantages of both CNNs and Transformers, we combine them in the reconstruction process. In the second stage, the reconstructed frequency-domain frames are fed into the DNN-HIO algorithm [18], which follows the classical HIO workflow and employs a tunable denoising deep neural network (DNN) in each iteration to constrain the signal domain.
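As a concrete reference for the first Transformer module, a compact PyTorch sketch of the windowed attention in Eq. (10) is given below: tokens inside each non-overlapping $S \times S$ spatial window attend to one another, with a learnable relative-position bias added to the attention logits. The tensor layout (here $B \times T \times H \times W \times C$), window size, head count, and bias parameterization are illustrative simplifications, not the exact configuration of the trained network.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WindowMSA(nn.Module):
    """Spatial window multi-head self-attention, Eq. (10): Softmax(QK^T/sqrt(d) + B)V."""

    def __init__(self, dim, window_size=8, num_heads=4):
        super().__init__()
        self.S, self.heads, self.d = window_size, num_heads, dim // num_heads
        self.norm = nn.LayerNorm(dim)
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.proj = nn.Linear(dim, dim)
        # Learnable bias, one logit per (head, query, key) pair within a window
        # (a simplified parameterization of the relative position bias B).
        self.bias = nn.Parameter(torch.zeros(num_heads, window_size**2, window_size**2))

    def forward(self, x):                     # x: (B, T, H, W, C), H and W divisible by S
        B, T, H, W, C = x.shape
        S = self.S
        x = self.norm(x)
        # Partition into non-overlapping S x S windows -> (num_windows, S*S, C).
        x = x.view(B, T, H // S, S, W // S, S, C)
        x = x.permute(0, 1, 2, 4, 3, 5, 6).reshape(-1, S * S, C)
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        def split_heads(t):
            return t.view(t.shape[0], -1, self.heads, self.d).transpose(1, 2)

        q, k, v = map(split_heads, (q, k, v))                  # (nW, heads, S*S, d)
        attn = (q @ k.transpose(-2, -1)) / self.d**0.5 + self.bias
        out = (F.softmax(attn, dim=-1) @ v).transpose(1, 2).reshape(-1, S * S, C)
        out = self.proj(out)
        # Reverse the window partition back to (B, T, H, W, C).
        out = out.view(B, T, H // S, W // S, S, S, C)
        return out.permute(0, 1, 2, 4, 3, 5, 6).reshape(B, T, H, W, C)

# Example: a feature map with 16 channels, attended within 8 x 8 spatial windows.
feat = torch.randn(1, 4, 32, 32, 16)
msa = WindowMSA(dim=16, window_size=8, num_heads=4)
out = msa(feat)        # same shape as feat
```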


Fig. 2. Illustration of the proposed three-step reconstruction algorithm. Left: the frequency-domain frames are modulated by dynamic masks, and integrated into the compressed measurement along temporal dimension, which is then fed into our proposed algorithm with reference frames (RF). Middle: details of components of our proposed DNN with Transformer and 3D CNNs (Step 1). Top-right: Reconstructed frequency-domain frames are fed into the DNN-HIO algorithm for phase retrieval (Step 2). Bottom: image stitching method (Step 3) is conducted.
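For Step 2 in Fig. 2, the classical HIO core on which DNN-HIO [18] builds is sketched below; it alternates a Fourier-magnitude constraint with an object-domain support constraint. This is only the bare Fienup-style HIO [9] under an assumed known support, without the learned denoiser of DNN-HIO; the toy object in the usage example is hypothetical.

```python
import numpy as np

def hio(fourier_magnitude, support, n_iter=500, beta=0.9, seed=0):
    """Classical hybrid input-output phase retrieval (Fienup [9]).

    fourier_magnitude : measured |F{object}| (e.g., sqrt of a reconstructed spectrum frame).
    support           : boolean mask of the illuminated region (object-domain constraint).
    DNN-HIO [18] additionally applies a learned denoiser to the object-domain
    estimate in every iteration; that step is omitted here.
    """
    rng = np.random.default_rng(seed)
    # Start from the measured magnitude with a random phase.
    g = np.fft.ifft2(fourier_magnitude * np.exp(2j * np.pi * rng.random(fourier_magnitude.shape)))
    for _ in range(n_iter):
        G = np.fft.fft2(g)
        # Fourier-domain constraint: keep the phase, impose the measured magnitude.
        G = fourier_magnitude * np.exp(1j * np.angle(G))
        g_prime = np.fft.ifft2(G)
        # Object-domain constraint: keep g' inside the support, damp it outside.
        violated = ~support | (g_prime.real < 0)
        g = np.where(violated, g - beta * g_prime, g_prime)
    return np.abs(g)

# Example: retrieve a toy binary object from its Fourier magnitude.
obj = np.zeros((128, 128))
obj[48:80, 40:88] = 1.0
supp = np.zeros((128, 128), dtype=bool)
supp[40:88, 32:96] = True
rec = hio(np.abs(np.fft.fft2(obj)), supp)
```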


After the reconstructed small-FoV images are obtained, a simple yet effective image stitching method is applied. Since the width of the reference image in Fig. 5(c) can be gauged, the width of the stitched image $w_s$ is easy to calculate. Given $w_s$, the number of reconstructed spatial images $n_{\rm image}$, and the width of each reconstructed spatial image $w_r$, the average width of the overlapping region between every two adjacent reconstructed spatial images, $w_{avg}$, can be roughly calculated as:

$$w_{\mathtt{avg}} = \frac{n_{\mathtt{image}} \times w_r - w_s}{n_{\mathtt{image}} - 1}.$$

For every two adjacent reconstructed (small-FoV) spatial images to be stitched, we set the initial width of the overlapping region between the two images to $w_{{\tt avg}}$ and the initial height of the region to 0, and binarize the two images provisionally. We then stitch them with this initial width and height, and count the number of pixels with a value no less than 1 in the stitched image, denoted $n_{{\tt pixel}}$. Next, we repeatedly move the position of the right reconstructed spatial image in four directions to find the lowest value of $n_{{\tt pixel}}$. The position of the right image with the lowest value of $n_{{\tt pixel}}$ is taken as the correct position for stitching, and the width and height of the overlapping region in this situation are defined as $w_{{\tt correct}}$ and $h_{{\tt correct}}$.

After acquiring $w_{{\tt correct}}$ and $h_{{\tt correct}}$ for every two adjacent reconstructed spatial images, we stitch the reconstructed spatial images from left to right seriatim and refine the background region in the stitched image.
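A minimal sketch of the pairwise search described above follows: starting from the average overlap of Eq. (11), the right image is shifted over a small neighborhood and the placement that minimizes the number of lit pixels in the binarized composite is kept. The search range and binarization threshold are illustrative choices, and the "height" of the overlap is interpreted here as a vertical offset of the right image, which is our assumption.

```python
import numpy as np

def stitch_pair(left, right, w_avg, search=10, thresh=0.5):
    """Find the overlap width and vertical offset minimizing lit pixels in the composite."""
    lh, lw = left.shape
    rh, rw = right.shape
    left_b = (left >= thresh).astype(np.int32)
    right_b = (right >= thresh).astype(np.int32)
    w_avg = int(round(w_avg))
    best = None
    # Scan overlap widths and vertical offsets around the initial guess (w_avg, 0).
    for w_ov in range(max(1, w_avg - search), min(rw, w_avg + search) + 1):
        for dy in range(-search, search + 1):
            width = lw + rw - w_ov
            height = max(lh, rh) + abs(dy)
            canvas = np.zeros((height, width), dtype=np.int32)
            ly, ry = (0, dy) if dy >= 0 else (-dy, 0)
            canvas[ly:ly + lh, :lw] += left_b
            canvas[ry:ry + rh, lw - w_ov:lw - w_ov + rw] += right_b
            n_pixel = int((canvas >= 1).sum())           # pixels with value >= 1
            if best is None or n_pixel < best[0]:
                best = (n_pixel, w_ov, dy)
    _, w_correct, h_correct = best
    return w_correct, h_correct
```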

2.3 Optical setup

The optical setup of our VC-Ptychography system is shown in Fig. 3. A laser source with $780 nm$ center wavelength and $50 kHz$ spectral linewidth is coupled into a single-mode fiber. The output light from the fiber is collimated and expanded by two achromatic doublet lenses BL1 ($f=25 mm$) and BL2 ($f=100 mm$) to a beam diameter of approximately $20 mm$, and irradiated onto the surface of DMD1 (TI, $1920 \times 1280$ pixels, $9.8 \mu m$ pixel pitch). Structured illumination images are pre-stored in DMD1. We choose two square light spots, distributed longitudinally and shifted synchronously over time, as the illumination structure. A 4f system consisting of BL3 ($f=100 mm$) and BL4 ($f=25 mm$) images the structure loaded on DMD1 onto the sample surface. We adopt a classical convex lens to realize the Fourier transformation of the sample. Specifically, the object is located at the front focal plane of the achromatic doublet lens FL ($f=100 mm$), and DMD2 (TI, $1024 \times 768$ pixels, $13.68 \mu m$ pixel pitch) is at the back focal plane of FL. Replacing FL with the imaging lens IL ($f=50 mm$) enables direct imaging of the sample. The modulation patterns are pre-stored in DMD2; their time-varying period is consistent with DMD1, and they follow a random binary distribution. The encoded measurement is projected onto the camera (MV-CA013-A0UM, $1024 \times 1280$ pixels, $4.8 \mu m$ pixel pitch) through an imaging system consisting of BL5 ($f=100 mm$) and the objective lens (OL, $4\times$, NA = 0.2). The structured lighting and coding in the system are realized by the $\{0, 1\}$ binary coding of the DMDs, where ’0’ means the light is not reflected into the optical path and ’1’ means it is reflected.


Fig. 3. Our VC-Ptychography system set-up. BL: Biconvex lens. FL: Fourier lens. IL: Imaging lens. OL: Objective lens. DMD: Digital micro-mirror device.


3. Results

To verify the advantages of the VC-Ptychography technique in high-spatial-resolution imaging, we conduct experiments comparing the imaging resolution of VC-Ptychography, using our proposed algorithm, with that of a direct imaging technique. We also conduct simulations to verify the performance of the proposed algorithm.

3.1 Experimental datasets

We select the fifth group of a USAF 1951 resolution target as the test target, set the system compression ratio to $T=10$ and the illumination width to $78.4 \mu m$ ($8$ pixels on DMD1), and overlap $50{\%}$ of the neighbouring frames. The single-shot compressed measurement is shown in Fig. 4(a), and the reconstruction result using our proposed algorithm is shown in Fig. 4(c). As a comparison, we replace FL with IL and set all pixels of DMD1 and DMD2 to "1" to obtain the direct image of the target, shown in Fig. 4(b); the selected area of the USAF 1951 resolution target cannot be resolved by direct imaging without magnification, whereas it is clearly recovered by our method in Fig. 4(c). The spatial frequencies of the selected element groups range from $32$ line pairs per millimeter (lp/mm) to $57.02$ lp/mm. Fig. 4(d) plots the intensity profiles extracted from the corresponding cross-section color lines in Fig. 4(b) and (c), from which it is evident that VC-Ptychography exhibits significantly higher resolution than its direct-imaging counterpart. This is because the spatial resolution of direct imaging is limited by the pixel size of the detector, whereas the spatial resolution of our VC-Ptychography is limited by the detection area of the detector. Therefore, for the fine structures of an object within a small FoV, VC-Ptychography generally has a higher spatial resolution than direct imaging.
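The scaling behind this comparison can be made explicit with the standard far-field sampling relation, quoted here as a consistency argument rather than a value derived from the measured data. With wavelength $\lambda$, Fourier-lens focal length $f$, detector pixel pitch $p$, and $N$ pixels across the detection area, the object-plane pixel of the Fraunhofer reconstruction is approximately
$$\delta_{\rm ptych} \approx \frac{\lambda f}{N p},$$
i.e., it is set by the full detector extent $Np$, whereas direct imaging cannot resolve features finer than the pixel pitch $p$ divided by the imaging magnification. Enlarging the detection area, rather than shrinking the pixel, therefore improves the recoverable resolution.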


Fig. 4. Experimental data: Comparison of imaging resolution using our proposed VC-Ptychography against direct imaging. (a) Single-shot coded measurement. (b) Direct imaging of the USAF 1951 resolution target. (c) Imaging results using our proposed VC-Ptychography and DNN-HIO algorithm. (d) Intensity profiles extracted from the corresponding cross-section color lines in (b) and (c).


Next, we capture a large-FoV object with our VC-Ptychography, where the reference scene, the large and complex logo of Westlake University, is shown in Fig. 5(c) (bottom-left). We simultaneously illuminate two areas with a width of $313.6 \mu m$ (32 pixels on DMD1) each and a spacing of $313.6 \mu m$ between them. Adjacent frames overlap by $803.6 \mu m$, and the compression ratio is set to $T=20$. Fig. 5(a) is the captured single-shot measurement; Fig. 5(b) shows the reconstruction results at compression ratio 20, i.e., 20 frames recovered from a single-shot measurement, where each frame is of $512 \times 512$ pixels and $200 \times 200$ pixels are cropped for better visualization. The stitched image with a large FoV is shown in Fig. 5(d). Clearly, different parts of the logo can be illuminated separately by structured light over time, which illustrates that the complete high-resolution structure of an object can be visualized within a single exposure.


Fig. 5. Experimental data: Reconstruction results of a complicated large field-of-view object. (a) Single-shot coded measurement. (b) 20 corresponding reconstructed spatial images by the proposed DNN-HIO algorithm. (c) Reference image of the object. (d) Stitched large FoV image obtained by our proposed algorithm.


We compare the impact of the illumination area and compression ratio on the reconstruction quality of the target, acquiring two measurements with parameters set according to $W=(R+1) \times MW$, where $W$ is the target size, $MW$ is the single-frame illumination area, and $R$ is the compression ratio (the number of FoV frames compressed into one measurement). This setting yields a $50{\%}$ overlapping area between frames for the subsequent stitched images. The compression ratios of the two measurements are 10 and 20, and the illumination areas are $1058.4 \mu m$ and $715.4 \mu m$, respectively. The corresponding reconstruction results are shown in Figs. 6 and 7, respectively. We can clearly see that the higher compression ratio appears to achieve better reconstruction quality, which is analyzed in more detail in the following subsection. It is worth noting that this differs from the usual negative correlation between compression ratio and reconstruction quality observed in other compressive sensing systems. The reason is that, in the VC-Ptychography system, a smaller illumination area is equivalent to a larger magnification in direct imaging.


Fig. 6. Experimental data: (a) Coded measurement. (b) 10 corresponding reconstructed spatial spectra of Fourier-plane images after the first step in the proposed algorithm. (c) 10 corresponding reconstructed spatial images using our VC-Ptychography system after the second step in the proposed algorithm. (d) Reference image of the target. (e) Combined large FoV image obtained by our proposed full three-step algorithm.


Fig. 7. Experimental data: (a) Coded measurement. (b) 20 corresponding reconstructed spatial spectra of Fourier-plane images after the first step in the proposed algorithm. (c) 20 corresponding reconstructed spatial images using our VC-Ptychography system after the second step in the proposed algorithm. (d) Reference image of the target. (e) Combined large FoV image obtained by our proposed full three-step algorithm.


3.2 Simulation datasets

To verify that the proposed three-step reconstruction algorithm captures scenes with large FoV and high spatial resolution simultaneously, we show simulation results of our VC-Ptychography system in Fig. 8. As can be seen, satisfactory results are achieved by feeding the single captured measurement into the three-step reconstruction algorithm: the first step reconstructs the spectra, the second step performs phase retrieval to obtain the images, and in the final step a simple yet effective image stitching method reconstructs the original scene with a large FoV. We quantitatively evaluate the simulation results using the peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) [28], which are 20.49 dB and 0.91, respectively. After adding random noise with intensity 0.01 to the compressed measurement, the image can still be reconstructed, with a PSNR of 15.02 dB and an SSIM of 0.6895 relative to the original image. We also compare the reconstruction under different compression ratios, setting the inter-frame overlap to $50{\%}$ and the compression ratio to $\{5,10,15,20,25\}$, and calculate the PSNR and SSIM, shown at the bottom-right of Fig. 8. The best reconstruction is achieved at a compression ratio of 10 in the noise-free case. The reconstruction quality and compression ratio are thus not simply positively correlated, but depend on the complexity of the image. In general, VC-Ptychography can be regarded as a lossy compression technique, and excessively high compression ratios cause a decrease in reconstruction quality.
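For reference, the two metrics used above can be computed with scikit-image as in the following minimal sketch; the file names are placeholders and the images are assumed to be normalized to $[0, 1]$.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

# Placeholder arrays for the ground-truth scene and the stitched reconstruction;
# "ground_truth.npy" and "reconstruction.npy" are hypothetical file names.
ground_truth = np.clip(np.load("ground_truth.npy"), 0.0, 1.0)
reconstruction = np.clip(np.load("reconstruction.npy"), 0.0, 1.0)

psnr = peak_signal_noise_ratio(ground_truth, reconstruction, data_range=1.0)
ssim = structural_similarity(ground_truth, reconstruction, data_range=1.0)
print(f"PSNR = {psnr:.2f} dB, SSIM = {ssim:.4f}")
```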


Fig. 8. Simulation results with compression of 10. Top: the ground truth of the simulation data. Middle: the three stages’ reconstruction results of our proposed algorithm. Bottom: the three stages’ reconstruction results of our proposed algorithm with zero-mean Gaussian noise following ${\cal N}(0,0.01)$. Bottom right: PSNR and SSIM at different compression ratios.


We use more complex targets to verify the possibility of segmenting the target in both the $x$ and $y$ directions, as shown in Fig. 9. An electron-microscope image containing multiple cells is cut into 20 overlapping sub-images. After encoding and compressing the spatial spectra of the sub-images and feeding the measurement into the three-step algorithm, good reconstruction results are obtained; the PSNR and SSIM of the reconstructed image are 16.82 dB and 0.7479. Higher compression ratios can also be achieved: the numerical simulation results for a compression ratio of 30 are shown in Fig. 10, with a PSNR and SSIM of the reconstructed image of 16.82 dB and 0.7566.
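A minimal sketch of cutting a large image into overlapping sub-images on a 2D grid, in the spirit of the 20-patch layout used here, is shown below; the $4 \times 5$ grid and $50{\%}$ overlap are illustrative assumptions, not the exact layout of Fig. 9.

```python
import numpy as np

def cut_overlapping(image, rows, cols, overlap=0.5):
    """Cut an image into rows x cols sub-images with a given fractional overlap."""
    H, W = image.shape
    # Patch size such that patches with the requested overlap tile the image.
    ph = int(H / (rows - (rows - 1) * overlap))
    pw = int(W / (cols - (cols - 1) * overlap))
    step_y = int(ph * (1 - overlap))
    step_x = int(pw * (1 - overlap))
    patches = []
    for r in range(rows):
        for c in range(cols):
            y = min(r * step_y, H - ph)
            x = min(c * step_x, W - pw)
            patches.append(image[y:y + ph, x:x + pw])
    return patches

# Example: 20 overlapping sub-images from a 4 x 5 grid (illustrative layout).
img = np.random.rand(512, 640)
subs = cut_overlapping(img, rows=4, cols=5)
assert len(subs) == 20
```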


Fig. 9. Simulation results with compression of 20. Top: the ground truth of the simulation data. Bottom: the three stages’ reconstruction results of our proposed algorithm.


Fig. 10. Simulation results with compression of 30. Top: the ground truth of the simulation data. Bottom: the three stages’ reconstruction results of our proposed algorithm.


4. Discussion

In summary, we have built a novel snapshot Ptychography system that can capture complex targets with a large FoV from only a single snapshot measurement. In addition, to restore the target images we have developed a three-step algorithm, i.e., vision-Transformer-based sequential-frame reconstruction, DNN-promoted phase retrieval, and multi-frame image stitching. Experimental results demonstrate the efficacy of the proposed system and algorithm.

Existing Ptychography techniques can achieve the finest spatial resolution using an X-ray beam, but the exposure time remains on the order of seconds. For example, the coherent modulation imaging (CMI) [29] technique demands an exposure time of 3 seconds, while 3D reconstruction through partially coherent diffraction imaging (CDI) requires 16 seconds. Our proposed VC-Ptychography technique achieves a spatial resolution of $8.77 \mu m$ using a $780 nm$ laser; its frame rate reaches 20 fps, and its FoV is expanded by combining the 20 compressed frames contained in each captured measurement. Although the spatial resolution of our VC-Ptychography is currently at the $\mu m$ level due to the near-infrared source, its FoV and temporal resolution are significantly improved by the frequency-domain modulation method. Our future work will concentrate on using other sources, such as X-rays, to improve the spatial resolution.

For detectors with a fixed dynamic range (e.g., 8 bits), it is challenging for CDI to recover high dynamic-range scenes [18], especially in the noisy case. By contrast, VC-Ptychography handles the dynamic-range problem well: by illuminating the target block by block with discrete structured light, the high intensity of the low-frequency part can be avoided, and at the same time the structure of the high-frequency part is kept simple and easy to collect.

Compared with existing Ptychography technology, which requires multiple exposures and a stable mechanical structure to control the illumination range, our VC-Ptychography achieves a simpler optical path and a much larger field-of-view in a single exposure. Our method can also be used to mitigate the insufficient resolution encountered when imaging micro-objects. The proposed method has broad applications in microscopic imaging and X-ray intensity CDI. In the future, more effort will be required to improve the compression ratio, achieve Ptychography imaging of moving objects, and develop better stitching algorithms.

In addition, grayscale targets have more complex, non-negligible weak high-order spectra in the Fourier domain, with intensities only on the order of one hundredth to one thousandth of the low-order frequencies. Because such weak, low-intensity information is lost in the compressed sensing process, the proposed VC-Ptychography is so far only applicable to binary targets. We are trying to solve this problem using adaptive or non-linear encoding methods.

Funding

National Natural Science Foundation of China (62271414); Science Fund for Distinguished Young Scholars of Zhejiang Province (LR23F010001); Westlake Institute for Optoelectronics (2023GD007).

Disclosures

The authors declare no conflicts of interest.

Data availability

Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request.

References

1. F. Pfeiffer, “X-ray ptychography,” Nat. Photonics 12(1), 9–17 (2018). [CrossRef]  

2. Y. Jiang, Z. Chen, Y. Han, et al., “Electron ptychography of 2D materials to deep sub-ångström resolution,” Nature 559(7714), 343–349 (2018). [CrossRef]  

3. L. Zhou, J. Song, J. S. Kim, et al., “Low-dose phase retrieval of biological specimens using cryo-electron ptychography,” Nat. Commun. 11(1), 2773 (2020). [CrossRef]  

4. P. Sidorenko and O. Cohen, “Single-shot ptychography,” Optica 3(1), 9–14 (2016). [CrossRef]  

5. P. Li, M. Allain, T. Grünewald, et al., “4th generation synchrotron source boosts crystalline imaging at the nanoscale,” Light: Sci. Appl. 11(1), 73 (2022). [CrossRef]  

6. G. Zheng, C. Shen, S. Jiang, et al., “Concept, implementations and applications of Fourier ptychography,” Nat. Rev. Phys. 3(3), 207–223 (2021). [CrossRef]

7. J. Miao, P. Charalambous, J. Kirz, et al., “Extending the methodology of X-ray crystallography to allow imaging of micrometre-sized non-crystalline specimens,” Nature 400(6742), 342–344 (1999). [CrossRef]  

8. M. A. Pfeifer, G. J. Williams, I. A. Vartanyants, et al., “Three-dimensional mapping of a deformation field inside a nanocrystal,” Nature 442(7098), 63–66 (2006). [CrossRef]  

9. J. R. Fienup, “Phase retrieval algorithms: A comparison,” Appl. Opt. 21(15), 2758–2769 (1982). [CrossRef]  

10. M. M. Seibert, T. Ekeberg, F. R. Maia, et al., “Single mimivirus particles intercepted and imaged with an X-ray laser,” Nature 470(7332), 78–81 (2011). [CrossRef]  

11. X. Huang, J. Nelson, J. Steinbrener, et al., “Incorrect support and missing center tolerances of phasing algorithms,” Opt. Express 18(25), 26441–26449 (2010). [CrossRef]  

12. A. Barty, J. Küpper, and H. N. Chapman, “Molecular imaging using X-ray free-electron lasers,” Ann. Rev. Phys. Chem. 64, 415–435 (2013). [CrossRef]  

13. J. Wu, Y. Guo, C. Deng, et al., “An integrated imaging sensor for aberration-corrected 3d photography,” Nature 612(7938), 62–71 (2022). [CrossRef]  

14. C. Wang, M. Hu, Y. Takashima, et al., “Snapshot ptychography on array cameras,” Opt. Express 30(2), 2585–2598 (2022). [CrossRef]  

15. D. L. Donoho, “Compressed sensing,” IEEE Trans. Inform. Theory 52(4), 1289–1306 (2006). [CrossRef]  

16. P. Llull, X. Liao, X. Yuan, et al., “Coded aperture compressive temporal imaging,” Opt. Express 21(9), 10526–10545 (2013). [CrossRef]  

17. X. Yuan, D. J. Brady, and A. K. Katsaggelos, “Snapshot compressive imaging: Theory, algorithms, and applications,” IEEE Signal Process. Mag. 38(2), 65–88 (2021). [CrossRef]  

18. Z. Chen, S. Zheng, Z. Tong, et al., “Physics-driven deep learning enables temporal compressive coherent diffraction imaging,” Optica 9(6), 677–680 (2022). [CrossRef]  

19. S. Zheng, Y. Liu, Z. Meng, et al., “Deep plug-and-play priors for spectral snapshot compressive imaging,” Photonics Res. 9(2), B18–B29 (2021). [CrossRef]  

20. S. Zheng, C. Wang, X. Yuan, et al., “Super-compression of large electron microscopy time series by deep compressive sensing learning,” Patterns 2(7), 100292 (2021). [CrossRef]  

21. Z. Cheng, B. Chen, G. Liu, et al., “Memory-efficient network for large-scale video compressive sensing,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (2021), pp. 16246–16255.

22. Y. Li, M. Qi, R. Gulve, et al., “End-to-end video compressive sensing using anderson-accelerated unrolled networks,” in IEEE International Conference on Computational Photography, (2020), pp. 1–12.

23. Z. Wu, J. Zhang, and C. Mou, “Dense deep unfolding network with 3D-CNN prior for snapshot compressive imaging,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, (2021), pp. 4892–4901.

24. X. Liao, H. Li, and L. Carin, “Generalized alternating projection for weighted-ℓ2,1 minimization with applications to model-based compressive sensing,” SIAM J. Imaging Sci. 7(2), 797–823 (2014). [CrossRef]  

25. X. Yuan, “Generalized alternating projection based total variation minimization for compressive sensing,” in 2016 IEEE International Conference on Image Processing, (IEEE, 2016), pp. 2539–2543.

26. Z. Liu, Y. Lin, Y. Cao, et al., “Swin transformer: Hierarchical vision transformer using shifted windows,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, (2021), pp. 10012–10022.

27. Z. Meng, X. Yuan, and S. Jalali, “Deep unfolding for snapshot compressive imaging,” Int. J. Comput. Vis. 131(11), 2933–2958 (2023). [CrossRef]  

28. Z. Wang, A. C. Bovik, H. R. Sheikh, et al., “Image quality assessment: from error visibility to structural similarity,” IEEE Trans. Image Process. 13(4), 600–612 (2004). [CrossRef]  

29. F. Zhang, B. Chen, G. R. Morrison, et al., “Phase retrieval by coherent modulation imaging,” Nat. Commun. 7(1), 13367 (2016). [CrossRef]  



