## Abstract

Imaging with low-dose light is important in various fields, especially when minimizing radiation-induced damage to samples is desirable. Owing to the quantum nature of photo-electric conversion, the raw image captured at the detector plane is then predominantly a Poisson random process with additive Gaussian noise. Under such noisy conditions, highly ill-posed problems such as phase retrieval from raw intensity measurements become prone to strong artifacts in the reconstructions, a situation in which deep neural networks (DNNs) have already been shown to help. Here, we demonstrate that random phase modulation on the optical field, also known as coherent modulation imaging (CMI), in conjunction with the phase extraction neural network (PhENN) and a Gerchberg-Saxton-Fienup (GSF) approximant, further improves the resilience to noise of the phase-from-intensity imaging problem. We offer design guidelines for implementing the CMI hardware with the proposed computational reconstruction scheme and quantify the reconstruction improvement as a function of photon count.

© 2020 Optical Society of America under the terms of the OSA Open Access Publishing Agreement

## 1. Introduction

Retrieving phase information from intensity images captured by a detector is important in several practical applications where photon-limited illumination of samples is desired [1]. For imaging biological specimens, high-dose light may induce phototoxicity, which can compromise cell viability [2], as the cost of a larger signal-to-noise ratio (SNR) in detection. For particle imaging, this concern is often even more severe: for example, imaging integrated circuits requires reduced beam power to avoid destructive side-effects, such as heat-induced deformation [3,4].

There are various ways to retrieve phase information from intensity measurements, for example holography [5–9], ptychography [10–12] or through the transport-of-intensity equation (TIE) [13–17]. Coherent diffraction imaging (CDI) [18–20] is popular as a lensless method relying on a single diffraction pattern and without need for a reference beam.

Under ample illumination, iterative reconstruction algorithms, e.g. Gerchberg-Saxton-Fienup (GSF) [21–23], tend to work quite well for CDI when the objects are isolated and have no strong phase variations. Using a neural network as inverse operator in CDI also works well but does not confer appreciable performance advantage over GSF [24]. When the noise becomes significant, using a stronger regularization prior is generally recommended; and in that case a machine learning algorithm, such as a deep neural network [25] or a combination of networks [26] with a restricted training dataset becomes an effective way to learn this regularization prior. Performance improves further if the raw image is first processed by an approximate inverse, which we call the “Approximant.” The Approximant partially incorporates our knowledge of the physics of the optical system into the subsequent machine learning based inverse algorithm. Alternatively, the Approximant may be thought of as reducing the learning burden, so that the neural network needs to learn less about the physics and more about the prior—even though that distinction is not clearly delineated [27].

Coherent modulation imaging (CMI) introduces a physical constraint by phase modulating the optical field diffracted from the object at some intermediate distance between the object and the camera [28–33]. The modulation may be random or of a well-defined form, e.g. using a quadratic phase pattern [34–37]. In this paper, we adopt the random modulation approach. The phase information from the object is encoded in the speckle-like diffraction pattern that is recorded as the raw image at the detector plane. The CMI scheme effectively reduces the ill-posedness and eliminates ambiguous solutions, such as the twin image, in the inverse estimate of the phase.

Nevertheless, when the number of illuminating photons is limited, iterative reconstruction algorithms are prone to fail to converge or to produce strong artifacts, even with the random phase modulation of the CMI scheme. The purpose of this paper is to investigate, for the first time to our knowledge, the use of deep neural networks in combination with CMI to obtain several improvements: guaranteeing convergence, improving image fidelity through removal of the artifacts, and resolving the ambiguities in the phase reconstructions. The general theory of CMI, as implemented here with a neural network-assisted inverse, is developed in Section 2. The CMI design is optimized for use with a neural network-assisted inverse in Section 3. According to this optimization, we constructed the CMI experimental apparatus, trained the algorithms, and conducted extensive qualitative and quantitative tests and comparisons with CDI and the GSF algorithm. The apparatus description and results are in Section 4, and concluding thoughts are in Section 5.

## 2. Methods

#### 2.1 Coherent modulation imaging scheme in a general sense

The CMI principle is shown in Fig. 1. The phase object is illuminated by a localized wave at normal incidence. A spatial light modulator (SLM) or a fabricated mask with random, binary-phase transitions is placed along the path between the phase object and the measurement plane. The purpose of the mask is to randomly encode the phase information. Fabricated phase masks, in particular, enable sharper phase transitions than SLMs and can be used with X-rays and particle radiation sources such as electrons, but they have some limitations: (1) a binary design leads to sampling artifacts; (2) multi-level masks are generally expensive to fabricate; and (3) the randomness encoded on the mask cannot be altered once the mask has been made. By contrast, multi-bit SLMs, which are mostly available for visible and near-infrared light, can approximate continuous phase modulation, and the displayed patterns may be altered at will. Therefore, in this work we chose the SLM method to implement the random phase mask.

Let ${\psi }_{\textrm {obj}}(x,y)=\textrm {e}^{i\varphi (x,y)}$ denote the field at the exit plane of the phase object. In the simplest case, when the phase object is well approximated as thin, $\varphi$ directly maps the index of refraction modulation or topography of the object. If the thin film approximation is not satisfied, then more elaborate models [38–41] relate the object structure to $\varphi$, and coupled amplitude modulation may occur in addition to phase modulation. For now, we neglect these effects and consider $\varphi (x,y)$ as a pure phase signal which we wish to reconstruct.

Let $\Phi (x',y')$ represent the (also assumed pure) phase modulation imposed by the CMI mask. Under the paraxial and scalar approximations, the field at the detector plane is expressed as [42]

Forward Eqs. (1–4) also apply to CDI with the choice $\Phi \left (x',y'\right )=\mathbf {0}$. The difference between CDI and CMI measurements is illustrated in Figs. 2 and 3 in the space and spatial-frequency domains, respectively. The diffraction pattern is discernible in the spatial CDI raw intensity with $10^{3}$ photons/pixel, whereas the CMI raw intensity is completely diffuse. Spreading of the spectrum is also evident in the corresponding power spectral density (PSD) of the CMI raw intensity. These trends are, of course, not discernible in either the highly noisy raw images or their PSDs at $1$ photon/pixel.
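The CMI forward model described by Eqs. (1)–(4) can be sketched numerically as follows. This is a minimal illustration using the angular spectrum propagator (the method also used in Section 2.3); the function names, grid size, and sampling parameters are our own illustrative choices, not the paper's implementation. Setting the mask to zero recovers the CDI forward model.

```python
import numpy as np

def angular_spectrum_propagate(field, wavelength, dx, z):
    """Propagate a sampled complex field by distance z (angular spectrum method)."""
    n = field.shape[0]
    fx = np.fft.fftfreq(n, d=dx)                      # spatial frequencies [1/m]
    FX, FY = np.meshgrid(fx, fx, indexing="ij")
    arg = 1.0 - (wavelength * FX) ** 2 - (wavelength * FY) ** 2
    kz = 2 * np.pi / wavelength * np.sqrt(np.maximum(arg, 0.0))  # clip evanescent part
    H = np.exp(1j * kz * z)                           # unit-modulus transfer function
    return np.fft.ifft2(np.fft.fft2(field) * H)

def cmi_forward(phi_obj, Phi_mask, wavelength, dx, z1, z2):
    """CMI forward model: object -> z1 -> phase mask -> z2 -> detector intensity.
    With Phi_mask = 0 this reduces to the CDI forward model."""
    psi = np.exp(1j * phi_obj)                        # pure-phase object field
    psi = angular_spectrum_propagate(psi, wavelength, dx, z1)
    psi = psi * np.exp(1j * Phi_mask)                 # random phase modulation
    psi = angular_spectrum_propagate(psi, wavelength, dx, z2)
    return np.abs(psi) ** 2                           # raw intensity at the detector
```

Because the transfer function has unit modulus, the sketch conserves total intensity between planes, which is a quick sanity check on any implementation.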

To retrieve the phase $\varphi (x,y)$ from the intensity ${I}_{\textrm {det}}\left (x'',y''\right )$, we construct a two-step inverse algorithm. The first step is to define the forward operator, which essentially is a discretization of the Fresnel Eqs. (1) and (3). This is done in Section 2.2. The forward operator is then used in the computation of the Approximant, and the output is used as input to a neural network that finalizes the computation of the inverse, as described in Section 2.3. The training of the DNN is described in Section 2.4.

#### 2.2 Definition of the forward operator

Our computational window consists of $N\times N$ pixels. Let $\psi _{\textrm {obj},mn},\ m, n=1, \ldots , N$, denote the object field at discrete location $\left (x_m, y_n\right )$. We rasterize ${\psi }_{\textrm {obj}}$ to the $N^{2}\times 1$ vector ${\Psi }_{\textrm {obj}}$ and define the Fresnel kernel $N\times N$ matrices $A$, $B$, $C$, $D$, such that

#### 2.3 Inverse algorithm

In Fig. 4, the algorithm receives an intensity-only measurement ${I}_{\textrm {det}}\left (x'',y''\right )$ and derives the phase inverse estimate $\hat {\varphi }(x,y)$ based on two types of reconstruction methods: the GSF algorithm and a deep neural network (DNN). (The hat over $\hat {\varphi }$ indicates the phase estimate *vis-à-vis* the true phase $\varphi$.) Reconstructing the phase solely with the GSF algorithm is similar to the conventional CMI technique [30] and leads to the GSF reconstructions ${\hat {\varphi }}_{\textrm {GSF}}$. The DNN-based algorithm instead first computes the intermediate, or Approximant, reconstruction ${\hat {\varphi }}_{\textrm {approx}}$, which forms a training pair with its corresponding ground truth image $\varphi$; training then produces the final reconstruction ${\hat {\varphi }}_{\textrm {DNN}}$. The performance is tested by comparing ${\hat {\varphi }}_{\textrm {DNN}}$ to $\varphi$ for “test” pairs, excluded from the training set.

We denote the phase estimate produced by our DNN algorithm as

where $\textrm {DNN}(\cdot )$ is the input-output relationship of the trained DNN. Training, *i.e.* specifying the weights ${\mathbf {w}}_{\textrm {DNN}}$, is the nonlinear minimization procedure

The Approximant ${\hat {\varphi }}_{\textrm {approx}}$ is also based on the GSF algorithm with only a single backward step, *i.e.* half an iteration. This strategy was also followed in [25]. Other Approximant implementations are possible, but we did not investigate them in this paper. We also compute the full TV-denoised version of the GSF algorithm to generate ${\hat {\varphi }}_{\textrm {GSF}}$ for comparison with ${\hat {\varphi }}_{\textrm {DNN}}$.

The combined GSF- and DNN-based inverse algorithms shown in Fig. 4 proceed as follows. First, the intensity measurement ${I}_{\textrm {det}}$ is pre-processed as described in Appendix A. The GSF module is initialized with a plane wave with a truncated Airy pattern and $t=0$. This goes through the forward operator $H_{z_1,z_2}$ to reach the detector plane. The forward operator is realized using the angular spectrum method [45], because this eases the sampling requirements given the physical parameters of our system, whereas [30] used the method of [46]. The pre-processed measurement is imposed as the modulus constraint to obtain the Approximant ${\hat {\varphi }}_{\textrm {approx}}$ with $t=0$, as shown in Fig. 4. The TV denoising step on the phase estimate at this stage, for the Approximant, is optional (see Table 1).

Continuing toward the GSF estimate ${\hat {\varphi }}_{\textrm {GSF}}$, the backpropagation operation is applied to the field estimate as $H^{*}_{z_1,z_2}=H^{-1}_{z_1,z_2}$. A TV denoising process is also applied to the phase estimate [19,47], followed by an update according to the hybrid input-output scheme [23]; this follows the convention in [48]. A subsequent support constraint leads to a detector field estimate, which replaces the previous iterate of the phase estimate, and with $T>0$ the iteration repeats for $t=1,\ldots , T$. GSF reconstructions in Sections 3.1 and 3.2 were obtained with $T = 30$ iterates.
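The loop just described can be condensed into the following sketch. To keep it self-contained, the propagators are passed in as callables (`fwd` maps the object-plane field to the detector plane, including the mask; `inv` is its inverse), and the support constraint and hybrid input-output update are omitted for brevity; all variable names here are ours, not the paper's.

```python
import numpy as np

def gsf_reconstruct(I_det, fwd, inv, T=30, denoise=lambda p: p):
    """Condensed sketch of the GSF loop of Fig. 4.

    fwd/inv: forward and inverse propagators through the system,
    with inv playing the role of H*_{z1,z2} = H^{-1}_{z1,z2}.
    The Approximant is the phase after the first backward pass
    (t = 0, i.e. half an iteration); the loop then continues to T."""
    sqrt_I = np.sqrt(I_det)
    psi = np.ones_like(I_det, dtype=complex)          # stand-in for the initialization
    phi_approx = None
    for t in range(T + 1):
        det = fwd(psi)                                # propagate to the detector plane
        det = sqrt_I * np.exp(1j * np.angle(det))     # impose the modulus constraint
        est = inv(det)                                # backpropagate to the object plane
        phase = denoise(np.angle(est))                # optional TV denoising of the phase
        if t == 0:
            phi_approx = phase                        # the Approximant: half an iteration
        psi = np.exp(1j * phase)                      # pure-phase object constraint
    return phi_approx, np.angle(psi)
```

In this sketch a pure-phase object constraint replaces the support/HIO updates of the full algorithm; any unit-modulus-preserving propagator pair can be plugged in for testing.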

#### 2.4 Training the deep neural network

For all our DNN results, we used the PhENN (Phase Extraction Neural Network) architecture [24], as shown in Fig. 5. This includes encoder-decoder structures and skip connections according to the U-Net [49] principle with residuals [50]. PhENN is known to work for reconstructing phase information from intensity measurements. For training and testing, images were randomly picked from ImageNet [51] and a segmented IC layout [52]. ImageNet, in particular, is a reasonable choice for both training and testing, as it is known to be a highly generic dataset with cross-domain generalization ability, whereas the IC layout is a good example of a highly restricted prior [27]. A comparison of the cross-domain generalization ability of the neural networks trained with the two different databases can be found in Appendix B.

From each database we drew randomly $5000$ training examples, $450$ validation examples, and $50$ test examples. For training, we used the stochastic gradient descent scheme with the *Adam* [53] optimizer over $100$ epochs, with the initial learning rate set to $0.001$. All computations ran on a desktop with an Intel i$9$-$9900$K CPU ($3.60\ \textrm {GHz}$, $16$ $\textrm {MB}$ cache), $64$ $\textrm {GB}$ of RAM, and an NVIDIA GeForce RTX $2080$ GPU with $8$ $\textrm {GB}$ of VRAM.

The training loss function (TLF) was chosen as either the structural similarity index metric (SSIM) [54] or the negative Pearson correlation coefficient (NPCC). The respective TLFs are defined as
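In code, the NPCC term amounts to the negated normalized spatial cross-covariance (consistent with the Eq. (12) definition referenced in Section 3.1); the following minimal NumPy rendering is ours.

```python
import numpy as np

def npcc(estimate, truth):
    """Negative Pearson correlation coefficient between two images.
    Returns -1 for a perfect (affine-equivalent) match, +1 for anti-correlation."""
    a = estimate - estimate.mean()
    b = truth - truth.mean()
    return -np.sum(a * b) / np.sqrt(np.sum(a ** 2) * np.sum(b ** 2))
```

Note that the NPCC is invariant to affine intensity rescaling of the estimate, which is one reason it is a popular training loss for phase retrieval.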

## 3. Simulations and design considerations

In the simulation, the signals were assumed to be Poisson random variables with additive Gaussian noise. The mean rate of the Poisson statistics was set to either $1$ or $10$ photons per pixel, depending on the noise level of interest. Additionally, in the case of a mean photon arrival level of $1$ per pixel, the Poisson random variables were multiplied by a factor of $50$ to mimic the EM gain of our EM-CCD. The Gaussian noise is uncorrelated with the Poisson statistics and was assumed to have zero mean and a standard deviation of $10$.
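The noise model just described can be sketched as follows; the function and parameter names are ours, and the EM-gain factor of $50$ applies only at the $1$ photon/pixel level (gain $1$ otherwise).

```python
import numpy as np

def simulate_detection(I_ideal, photons_per_pixel, em_gain=1.0,
                       read_sigma=10.0, rng=None):
    """Photon-limited measurement model of Section 3 (sketch):
    Poisson shot noise at a prescribed mean photon level, an optional
    EM-gain multiplier, and zero-mean additive Gaussian noise."""
    rng = np.random.default_rng() if rng is None else rng
    rate = photons_per_pixel * I_ideal / I_ideal.mean()   # set the mean photon rate
    shot = rng.poisson(rate).astype(float) * em_gain      # quantized arrivals + EM gain
    return shot + rng.normal(0.0, read_sigma, I_ideal.shape)  # readout/dark noise
```

A usage example matching the low-photon case in the paper would be `simulate_detection(I, 1, em_gain=50.0)`.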

Simulations were conducted under two different scenarios to explain when random phase modulation is favorable for low-photon phase retrieval (Section 3.1) and why deep neural networks are needed for reconstructing phase objects (Section 3.2). Each scenario suggests a criterion that both $z_1$ and $z_2$ should meet. This guided our choice of $(z_1, z_2)$ in the experiments of Section 4.

#### 3.1 When is the random phase modulation favorable?

We swept the two design parameters, *i.e.* $z_1$ and $z_2$, and for each combination computed the values of the perceptual loss [56] and the Pearson correlation coefficient (PCC) between reconstructions and their corresponding ground truth images. These two metrics represent radically different image aspects and, thus, we expect to reduce bias in our conclusions by taking both into consideration. Perceptual loss [56] is a feature loss devised to quantify visual quality using a VGG network pre-trained on the ImageNet database. As in [57], we used PhENN to generate reconstructions and the VGG network to compute the corresponding perceptual losses. Following [57], under photon-limited conditions we extracted the perceptual loss from the ReLU $1$-$2$ layer of the VGG network, whereas under ample illumination we used the ReLU $2$-$2$ layer, as recommended by [56]. On the other hand, the PCC, defined as in Eq. (12) but without the minus sign, essentially computes the normalized spatial cross-covariance.

In the CDI scheme, *i.e.* without the random phase modulation, the performance of the DNN in general decreases as $z_1+z_2$ increases, as expected since the numerical aperture (NA) of the system then decreases. This is shown in Figs. 6(a) and (d). In CMI, *i.e.* with the random phase modulation $\Phi \left (x',y'\right )$ incorporated into the system as in (2), this trend changes, showing an intermediate region where CMI yields better results according to both metrics, as seen in Figs. 6(b) and (e). With larger $z_2$, the improvement becomes smaller as the spatial signal becomes too diffuse; see Figs. 2 and 3. This indicates that overdoing the modulation can make the raw intensity more prone to corruption by the Poisson statistics of the signal and by the readout and dark noise.

To quantify performance difference with and without phase mask, the merit ratio ${\gamma }_{\textrm {metric}}$ was defined as

#### 3.2 Why is the deep neural network needed under photon-limited conditions?

Figures 7 and 8 show the comparison of the GSF and DNN reconstructions according to the perceptual loss and PCC, respectively, for various combinations of $\left (z_1,z_2\right )$. The perceptual loss was computed on the ReLU $1$-$2$ layer of the VGG$16$ architecture when the photon arrival level is $1$ photon per pixel [57] and on the ReLU $2$-$2$ layer when it is $10$ photons per pixel [56]. In the same manner as Eq. (14), we take the results from both the GSF and DNN reconstructions into consideration by defining the ratio

The general trend in Figs. 7 and 8 is that the noisier the raw images, the more improvement one may expect in the reconstructions by using the DNN over GSF. This is especially true in the region $z_2<50\textrm {~mm}$.

According to Figs. 6, 7 and 8, the choice ($z_1 = 490\textrm {~mm},\ z_2 = 48.5\textrm {~mm}$) satisfies a reasonable compromise for performance under noisy conditions according to all combinations of reconstruction algorithms and image quality metrics. This choice is indicated with yellow asterisks in the figures and was used for the experimental apparatus and results in the next section. For a quantitative comparison between the results from this simulation and the experiments, see Appendix C.

## 4. Experiments and analysis of results

#### 4.1 Optical apparatus

The optical apparatus is schematically depicted in Fig. 9. There are two SLMs: one transmissive and one reflective. The transmissive SLM1 (Holoeye LC2012, pixel pitch: $36\ \mu \textrm {m}$, $1024\times 768$ pixels) displays the phase objects, and the reflective SLM2 (Thorlabs EXULUS-HD2, pixel pitch: $8\ \mu \textrm {m}$, $1920\times 1080$ pixels) implements a random phase pattern in 8-bit grayscale values. The calibration process for the two SLMs is described in Appendix D. Two linear polarizers, $\textrm {POL}1$ and $\textrm {POL}2$, properly modulate the optical field for $\textrm {SLM}1$ to display the objects with a maximum phase depth of $\sim 4.6\textrm {~rad}$. $\textrm {HWP}2$ rotates the polarization angle to $45^{\circ }$ from the vertical axis, so that the maximum phase depth of $\textrm {SLM}2$ is $2\pi$.

We use a coherent light source (Thorlabs HNL210L, power: $20\ \textrm {mW}$, $\lambda = 633\ \textrm {nm}$) followed by a variable neutral density (VND) filter to control the photon flux. The beam is expanded with a collimating lens L$1$ and cropped to $24\textrm {~mm}$ by an aperture A$1$. A second aperture A$2$ is placed right before POL$1$ to limit the spatial extent of the beam to $12\textrm {~mm}$ diameter. Using the design results from Sections 3.1 and 3.2, we propagate the optical field by $z_1=490\textrm {~mm}$ from SLM1 to SLM2 and by $z_2=48.5\textrm {~mm}$ from SLM2 to the EM-CCD (QImaging Rolera EM-C2, pixel pitch: $8\ \mu \textrm {m}$, $1002\times 1004$ pixels). The EM gain settings on the CCD and photon counts for the experiments reported here are in Table 1.

To implement CDI, SLM2 is set to zero phase delay for all pixels and essentially acts as a mirror. For CMI, the random phase modulation $\Phi$ is imposed by SLM2, according to the following procedure: first, a low-resolution version of the random pattern $\Phi$ is designed according to fair coin tosses, independently for each pixel. Unfortunately, pixel-to-pixel crosstalk in SLM2 [44,58] effectively introduces a spurious correlation between the phase values at neighboring pixels when they are displayed on the physical SLM2. In Appendix E we describe a process to effectively decorrelate them so that they represent as accurately as possible the results of the original independent fair coins sampling.
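The mask-design step (fair coin tosses at low resolution, mapped onto the finer SLM grid) can be sketched as below. The `{0, π}` binary phase levels, the parameters `n_coarse` and `upsample`, and the nearest-neighbor replication are our illustrative assumptions; the decorrelation procedure of Appendix E is omitted here.

```python
import numpy as np

def random_binary_phase_mask(n_coarse, upsample, rng=None):
    """Design a CMI mask: independent fair-coin binary phase {0, pi} on a
    coarse grid, replicated onto the SLM grid (upsample x upsample SLM
    pixels per coin)."""
    rng = np.random.default_rng() if rng is None else rng
    coins = rng.integers(0, 2, size=(n_coarse, n_coarse))   # independent fair coins
    mask = np.pi * coins                                    # 0 or pi phase delay
    return np.kron(mask, np.ones((upsample, upsample)))     # map coins to SLM pixels
```

Displaying this pattern with all-zero values instead reduces SLM2 to a mirror, i.e. the CDI configuration.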

#### 4.2 Experimental Results

For training and testing, images were randomly drawn from the ImageNet and IC layout databases and displayed as the true phases $\varphi$ on the transmissive SLM1. Photon arrival levels were controlled to either $1$ or $1000$ photons per pixel, and both the CDI and CMI schemes were used (*i.e.* with SLM2 unmodulated or modulated, respectively). Results presented in this section are for test examples only.

Figure 10 displays intensity measurements, their Approximants, GSF reconstructions, and DNN reconstructions of two test images, labelled as 1 and 2, randomly selected from the ImageNet database. With CDI, and regardless of the photon arrival level, some unwanted artifacts, e.g. ripples, are found in the results (Image $1$). The artifacts are more prominent at the low photon count. Also, the results exhibit ambiguities in phase (visible in Image $2$). With the modulation, many artifacts are removed as shown in the DNN reconstructions (Image $1$), and the ambiguities in phase are also largely resolved (Image $2$) for both photon counts. Figure 11 similarly shows results from two images randomly selected from the IC layout database and displays similar trends. Thus, in terms of visual appearance it can be said that the CMI scheme in conjunction with deep learning in general improves reconstruction quality at low photon counts.

Figures 12 and 13 show our two chosen metrics from Section 3, PCC and SSIM, and, in addition, the standard metrics PSNR (peak signal-to-noise ratio) and NRMSE (normalized root mean square error) [55] on the reconstructions produced by the CDI and CMI schemes with deep learning. Generally, the improvement is less noticeable in the IC layout cases. This is not surprising, since these represent a stronger prior that the DNN can learn to regularize effectively even with the more ill-posed CDI; whereas for the less restrictive ImageNet case the priors are weak and the improved conditioning of the CMI forward operator is more helpful.

We also analyzed the azimuthally averaged power spectral densities (PSDs) of the reconstructions relative to the ground truth. This kind of spectral analysis has proven useful before for understanding how the deep learning inverse algorithm behaves in different spectral bands [26,60]. All cases, with or without CMI, are seen to approach the PSD of the ground truth in Fig. 14(a). A comparison of the ratio of the PSD for CMI to the PSD for CDI, despite some oscillations due to the small values involved, seems to indicate that CMI tends to perform slightly better at low and high frequencies, whereas CDI tends to perform slightly better at intermediate frequencies. These results merit further investigation.
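An azimuthally averaged PSD of the kind used in Fig. 14 can be computed as in the following sketch (a standard radial-binning approach; the function name and binning choices are ours).

```python
import numpy as np

def azimuthal_psd(image):
    """Azimuthally averaged power spectral density of a square image:
    2-D PSD via FFT, then mean power over integer radial-frequency bins."""
    psd2d = np.abs(np.fft.fftshift(np.fft.fft2(image))) ** 2
    n = image.shape[0]
    y, x = np.indices(psd2d.shape)
    r = np.hypot(x - n // 2, y - n // 2).astype(int)   # integer radial bin index
    sums = np.bincount(r.ravel(), weights=psd2d.ravel())
    counts = np.bincount(r.ravel())
    return sums / counts                               # mean power per radius
```

Comparing `azimuthal_psd(reconstruction) / azimuthal_psd(ground_truth)` band by band is one way to reproduce the kind of spectral comparison discussed above.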

## 5. Conclusions and discussion

Phase retrieval from intensity is a highly ill-posed problem and, therefore, CDI reconstruction performance is extremely sensitive to noise. In this paper we found that using CMI together with a DNN inverse results in certain desirable effects. CMI reduces ill-posedness while the DNN can be an effective regularizer, especially with the Approximant to reduce the learning burden. The combination of CMI plus DNN leads generally to improved reconstructions, both qualitatively and quantitatively. Not surprisingly, when the noise is severe, CMI aids the DNN more effectively for data with weak priors, e.g. ImageNet. Conversely, when the priors are strong, as in the IC layout database, the CMI effect is smaller. It would be interesting to investigate, though not within the scope of the present work, how these results apply to more realistic phase objects, e.g. biological cells rather than phase objects implemented on SLMs; and to different bands of the electromagnetic spectrum or to particle imaging.

## Appendix A. More details on the inverse algorithm of Fig. 4

Pre-processing in Fig. 4 consists of two steps. First, an affine transformation is applied to the raw intensity measurement ${I}_{\textrm {det}}$. This is because the coordinates of the three planes in Fig. 1 should match each other as closely as possible; otherwise the decoding process fails and severe artifacts are introduced. This optimization step corrects a mismatch of the center axes, a rotation of the pattern or detector, imperfect alignment of the optical system, and divergence of the beam due to imperfect collimation as $z = z_1 + z_2$ gets larger. The affine transform matrix is determined by optimization with the Nelder-Mead method [61,62]. The negative normalized mutual information (NMI) was chosen as the loss function for the optimization [63]. The method tries to find an optimal matrix

where $T$ is a translation, $R$ a rotation, $\mathit {Sh}$ a shear, and $\mathit {Sc}$ a scaling matrix. The second pre-processing step involves tuning the parameters that impose a nonlinearity on the raw intensity measurement as $\left ({I}_{\textrm {det}}\right )^{p}$ and that control the degree of a Tukey window $K_r$ applied as a smoothing kernel on the measurement. Changing $p$ away from $1$ may either accentuate or dim the phase contrast in the Approximants: values of $p$ that are too small overly obscure the information, and at the other extreme they excessively emphasize the details. In addition, the Tukey window, for some small value $r>0$, eliminates ripple-like artifacts that otherwise appear at the edges of the Approximants; however, too large a value of $r$ degrades the quality of the Approximants. Therefore, both $p$ and $r$ should be determined interactively, depending on the photon arrival level and the type of dataset of interest. Typically, $p$ is chosen between $0.8$ and $1.2$, and $r$ between $0.1$ and $0.4$.
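The second pre-processing step can be sketched as below. We interpret $K_r$ as a separable 2-D Tukey (tapered-cosine) window applied multiplicatively to the measurement; this reading, the function name, and the default parameter values are our assumptions.

```python
import numpy as np
from scipy.signal.windows import tukey

def preprocess_measurement(I_det, p=1.0, r=0.2):
    """Sketch of the second pre-processing step of Appendix A:
    pointwise nonlinearity I^p followed by a 2-D Tukey window with
    taper fraction r to suppress edge ripple in the Approximants."""
    n = I_det.shape[0]
    w = tukey(n, alpha=r)            # 1-D tapered-cosine window
    K = np.outer(w, w)               # separable 2-D window K_r
    return (I_det ** p) * K
```

With $r=0$ the Tukey window degenerates to a rectangular (all-ones) window, so the step reduces to the bare nonlinearity $I^{p}$.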

Total-variation (TV) denoising is applied to a phase estimate either to guarantee the convergence of the GSF algorithm or to ease the computational burden of training the DNN architecture [19,47]. For the GSF reconstructions, $30$ iterations were sufficient to reach a plateau, and the algorithm did not converge without the TV denoising. In the case of the Approximants, the TV denoising is optional; the number of iterations of the process depends on the photon arrival level of the measurements and the type of dataset, as illustrated in Table 1. Here, the phase estimates are wrapped after every iteration.
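A minimal sketch of this TV-denoise-and-wrap step follows, using scikit-image's Chambolle solver as a stand-in for the TV denoiser of [19,47]; the weight and iteration count are illustrative, not the Table 1 settings.

```python
import numpy as np
from skimage.restoration import denoise_tv_chambolle

def denoise_phase(phase, weight=0.1, n_iter=1):
    """TV-denoise a phase estimate, wrapping it back to (-pi, pi]
    after every pass, as described for the Approximant/GSF steps."""
    for _ in range(n_iter):
        phase = denoise_tv_chambolle(phase, weight=weight)
        phase = np.angle(np.exp(1j * phase))   # wrap to (-pi, pi]
    return phase
```

By construction the denoiser reduces the total variation of the estimate, which is what suppresses the speckle-like noise between GSF iterations.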

## Appendix B. Cross-domain generalization

The ImageNet and IC layout databases were chosen in this paper because they are radically different priors. Neural networks trained with the ImageNet database are known to have better generalization ability than those trained with the IC layout database, which is highly restricted and thus acts as a strong regularizer on the neural networks [27]. Therefore, cross-domain generalization performance is better with the ImageNet-trained network than with the IC layout-trained network, as shown in Table 2 and Fig. 15.

## Appendix C. Additional tables of quantitative comparison

Tables 3 and 4 provide additional quantitative results supplementing Figs. 6, 7, and 8, respectively. It is noticeable that the perceptual loss metric seems to improve for the experimental results over the simulation results. This is somewhat unexpected; slight discrepancies such as this between image evaluations by different numerical metrics are well known and documented in the literature [64–67].

## Appendix D. Spatial light modulator (SLM) calibration

Spatial light modulators assign a phase delay to each pixel according to the grayscale value of the displayed pattern. In this work, a transmissive SLM1 (Holoeye LC$2012$) and a reflective SLM2 (Thorlabs EXULUS-HD$2$), see Fig. 9, were used to display the phase objects and to apply the random phase modulation on the optical field, respectively. The transmissive SLM was calibrated beforehand to establish the optimum configuration of the two linear polarizers, POL$1$ and POL$2$. The polarization angles were set to modulate the phase up to $\sim 4.6\ \textrm {rad}$, which, however, involves coupled amplitude modulation. The reflective SLM has negligible coupled amplitude modulation, and its calibration curve is close to linear. A Mach-Zehnder interferometer was used for the calibration of both SLMs. Calibration results are presented in Fig. 16.

## Appendix E. Random phase pattern optimization

Unlike fabricated phase masks, patterns displayed on a spatial light modulator may be affected by crosstalk among adjacent pixels [58], especially if they have abrupt changes in phase. If not compensated correctly, the crosstalk introduces unwanted artifacts into the Approximants because the encoding and decoding phase patterns then differ. Smoothing the phase profile of the patterns eases the problem. We implemented this as image interpolation with an appropriately sized kernel. To decide which kernel size enables the SLM to display a phase pattern as close to its original design as possible, we performed a parameter sweep over the size of the interpolation kernel. Figure 17 shows the quantitative comparison using several metrics. The best compensation is achieved for kernel size $=6$, and we found that this maximizes the performance gap between CDI and CMI as well. In Fig. 18, we cross-checked this result in the spatial frequency domain using the PSDs of the ground truth and reconstructions and found, consistently, kernel size $=6$ to perform best.
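As an illustration of the kernel sweep, the smoothing can be mimicked with a simple moving-average filter over the binary pattern; the paper's exact interpolation scheme may differ, so this is only a hedged stand-in with our own function name.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def smooth_mask(mask, kernel_size=6):
    """Smooth the phase profile of the displayed pattern to mitigate
    pixel-to-pixel SLM crosstalk (stand-in for the interpolation step;
    kernel size 6 was found best in Fig. 17)."""
    return uniform_filter(mask, size=kernel_size, mode="nearest")
```

Sweeping `kernel_size` and scoring the resulting reconstructions, as in Figs. 17 and 18, selects the compensation strength.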

## Funding

Southern University of Science and Technology (6941806); Intelligence Advanced Research Projects Activity (FA8650-17-C-9113); Korea Foundation for Advanced Studies; National Natural Science Foundation of China (11775105).

## Acknowledgments

Thanks to Mo Deng, Subeen Pang, Zhenfei He, and Prof. Jiaming Bai for helpful discussions. I. Kang acknowledges partial support from a Korea Foundation for Advanced Studies (KFAS) scholarship, and F. Zhang acknowledges funding from the National Natural Science Foundation of China.

## Disclosures

The authors declare no conflicts of interest.

## References

**1. **P. A. Morris, R. S. Aspden, J. E. Bell, R. W. Boyd, and M. J. Padgett, “Imaging with a small number of photons,” Nat. Commun. **6**(1), 5913 (2015). [CrossRef]

**2. **P. P. Laissue, R. A. Alghamdi, P. Tomancak, E. G. Reynaud, and H. Shroff, “Assessing phototoxicity in live fluorescence imaging,” Nat. Methods **14**(7), 657–661 (2017). [CrossRef]

**3. **L. Gignac, C. Beslin, J. Gonsalves, F. Stellari, and C.-C. Lin, “High energy bse/se/stem imaging of 8 um thick semiconductor interconnects,” Microsc. Microanal. **20**(S3), 8–9 (2014). [CrossRef]

**4. **I. Utke, S. Moshkalev, and P. Russell, *Nanofabrication using focused ion and electron beams: principles and applications* (Oxford University Press, 2012).

**5. **D. Gabor, “A new microscopic principle,” Nature **161**(4098), 777–778 (1948). [CrossRef]

**6. **D. J. Brady, K. Choi, D. L. Marks, R. Horisaki, and S. Lim, “Compressive holography,” Opt. Express **17**(15), 13040–13049 (2009). [CrossRef]

**7. **Y. Rivenson, Y. Zhang, H. Günaydın, D. Teng, and A. Ozcan, “Phase recovery and holographic image reconstruction using deep learning in neural networks,” Light: Sci. Appl. **7**(2), 17141 (2018). [CrossRef]

**8. **Y. Wu, Y. Rivenson, Y. Zhang, Z. Wei, H. Günaydin, X. Lin, and A. Ozcan, “Extended depth-of-field in holographic imaging using deep-learning-based autofocusing and phase recovery,” Optica **5**(6), 704–710 (2018). [CrossRef]

**9. **Y. Wu, Y. Luo, G. Chaudhari, Y. Rivenson, A. Calis, K. de Haan, and A. Ozcan, “Bright-field holography: cross-modality deep learning enables snapshot 3d imaging with bright-field contrast using a single hologram,” Light: Sci. Appl. **8**(1), 25 (2019). [CrossRef]

**10. **M. Holler, A. Díaz, M. Guizar-Sicairos, P. Karvinen, E. Färm, E. Härkönen, M. Ritala, A. Menzel, J. Raabe, and O. Bunk, “X-ray ptychographic computed tomography at 16 nm isotropic 3d resolution,” Sci. Rep. **4**(1), 3857 (2015). [CrossRef]

**11. **L. Tian, X. Li, K. Ramchandran, and L. Waller, “Multiplexed coded illumination for fourier ptychography with an led array microscope,” Biomed. Opt. Express **5**(7), 2376–2389 (2014). [CrossRef]

**12. **A. M. Maiden and J. M. Rodenburg, “An improved ptychographical phase retrieval algorithm for diffractive imaging,” Ultramicroscopy **109**(10), 1256–1262 (2009). [CrossRef]

**13. **N. Streibl, “Phase imaging by the transport equation of intensity,” Opt. Commun. **49**(1), 6–10 (1984). [CrossRef]

**14. **L. Waller, L. Tian, and G. Barbastathis, “Transport of intensity phase-amplitude imaging with higher order intensity derivatives,” Opt. Express **18**(12), 12552–12561 (2010). [CrossRef]

**15. **L. Waller, S. S. Kou, C. J. Sheppard, and G. Barbastathis, “Phase from chromatic aberrations,” Opt. Express **18**(22), 22817–22825 (2010). [CrossRef]

**16. **L. Waller, M. Tsang, S. Ponda, S. Y. Yang, and G. Barbastathis, “Phase and amplitude imaging from noisy images by Kalman filtering,” Opt. Express **19**(3), 2805–2815 (2011). [CrossRef]

**17. **Y. Zhu, A. Shanker, L. Tian, L. Waller, and G. Barbastathis, “Low-noise phase imaging by hybrid uniform and structured illumination transport of intensity equation,” Opt. Express **22**(22), 26696–26711 (2014). [CrossRef]

**18. **K. A. Nugent, “Coherent methods in the X-ray sciences,” Adv. Phys. **59**(1), 1–99 (2010). [CrossRef]

**19. **R. Horisaki, R. Takagi, and J. Tanida, “Learning-based imaging through scattering media,” Opt. Express **24**(13), 13738–13743 (2016). [CrossRef]

**20. **J. Miao, R. L. Sandberg, and C. Song, “Coherent X-ray diffraction imaging,” IEEE J. Sel. Top. Quantum Electron. **18**(1), 399–410 (2012). [CrossRef]

**21. **R. W. Gerchberg, “A practical algorithm for the determination of phase from image and diffraction plane pictures,” Optik **35**, 237–246 (1972).

**22. **W. Saxton, *Computer techniques for image processing in electron microscopy*, vol. 10 (Academic Press, 2013).

**23. **J. R. Fienup, “Phase retrieval algorithms: a comparison,” Appl. Opt. **21**(15), 2758–2769 (1982). [CrossRef]

**24. **A. Sinha, J. Lee, S. Li, and G. Barbastathis, “Lensless computational imaging through deep learning,” Optica **4**(9), 1117–1125 (2017). [CrossRef]

**25. **A. Goy, K. Arthur, S. Li, and G. Barbastathis, “Low photon count phase retrieval using deep learning,” Phys. Rev. Lett. **121**(24), 243902 (2018). [CrossRef]

**26. **M. Deng, S. Li, A. Goy, I. Kang, and G. Barbastathis, “Learning to synthesize: Robust phase retrieval at low photon counts,” Light: Sci. Appl. **9**(1), 36 (2020). [CrossRef]

**27. **M. Deng, S. Li, I. Kang, N. X. Fang, and G. Barbastathis, “On the interplay between physical and content priors in deep learning for computational imaging,” arXiv preprint arXiv:2004.06355 (2020).

**28. **F. Zhang, G. Pedrini, and W. Osten, “Phase retrieval of arbitrary complex-valued fields through aperture-plane modulation,” Phys. Rev. A **75**(4), 043805 (2007). [CrossRef]

**29. **F. Zhang and J. Rodenburg, “Phase retrieval based on wave-front relay and modulation,” Phys. Rev. B **82**(12), 121104 (2010). [CrossRef]

**30. **F. Zhang, B. Chen, G. R. Morrison, J. Vila-Comamala, M. Guizar-Sicairos, and I. K. Robinson, “Phase retrieval by coherent modulation imaging,” Nat. Commun. **7**(1), 13367 (2016). [CrossRef]

**31. **X. Dong, X. Pan, C. Liu, and J. Zhu, “Single shot multi-wavelength phase retrieval with coherent modulation imaging,” Opt. Lett. **43**(8), 1762–1765 (2018). [CrossRef]

**32. **A. Ulvestad, W. Cha, I. Calvo-Almazan, S. Maddali, S. Wild, E. Maxey, M. Duparaz, and S. Hruszkewycz, “Bragg coherent modulation imaging: Strain- and defect-sensitive single views of extended samples,” arXiv preprint arXiv:1808.00115 (2018).

**33. **W. Tang, J. Yang, W. Yi, Q. Nie, J. Zhu, M. Zhu, Y. Guo, M. Li, X. Li, and W. Wang, “Single-shot coherent power-spectrum imaging of objects hidden by opaque scattering media,” Appl. Opt. **58**(4), 1033–1039 (2019). [CrossRef]

**34. **J. Wu, H. Zhang, W. Zhang, G. Jin, L. Cao, and G. Barbastathis, “Single-shot lensless imaging with Fresnel zone aperture and incoherent illumination,” Light: Sci. Appl. **9**(1), 53 (2020). [CrossRef]

**35. **G. Williams, H. Quiney, B. Dhal, C. Tran, K. A. Nugent, A. Peele, D. Paterson, and M. De Jonge, “Fresnel coherent diffractive imaging,” Phys. Rev. Lett. **97**(2), 025506 (2006). [CrossRef]

**36. **G. Williams, H. Quiney, A. Peele, and K. Nugent, “Fresnel coherent diffractive imaging: treatment and analysis of data,” New J. Phys. **12**(3), 035020 (2010). [CrossRef]

**37. **B. Abbey, K. A. Nugent, G. J. Williams, J. N. Clark, A. G. Peele, M. A. Pfeifer, M. De Jonge, and I. McNulty, “Keyhole coherent diffractive imaging,” Nat. Phys. **4**(5), 394–398 (2008). [CrossRef]

**38. **B. Chen and J. J. Stamnes, “Validity of diffraction tomography based on the first Born and the first Rytov approximations,” Appl. Opt. **37**(14), 2996–3006 (1998). [CrossRef]

**39. **A. Devaney, “Inverse-scattering theory within the Rytov approximation,” Opt. Lett. **6**(8), 374–376 (1981). [CrossRef]

**40. **J. Lim, A. B. Ayoub, E. E. Antoine, and D. Psaltis, “High-fidelity optical diffraction tomography of multiple scattering samples,” Light: Sci. Appl. **8**(1), 1–12 (2019). [CrossRef]

**41. **T.-A. Pham, E. Soubies, A. Ayoub, J. Lim, D. Psaltis, and M. Unser, “Three-dimensional optical diffraction tomography with Lippmann-Schwinger model,” IEEE Trans. Comput. Imaging **6**, 727–738 (2020). [CrossRef]

**42. **J. W. Goodman, *Introduction to Fourier optics* (Roberts and Company Publishers, 2005).

**43. **D. Torrieri, *Principles of spread-spectrum communication systems*, vol. 1 (Springer, 2005).

**44. **C. Kohler, F. Zhang, and W. Osten, “Characterization of a spatial light modulator and its application in phase retrieval,” Appl. Opt. **48**(20), 4003–4008 (2009). [CrossRef]

**45. **J. Schmidt, *Numerical simulation of optical wave propagation with examples in MATLAB* (Society of Photo-Optical Instrumentation Engineers (SPIE), 2010).

**46. **F. Zhang, I. Yamaguchi, and L. Yaroslavsky, “Algorithm for reconstruction of digital holograms with adjustable magnification,” Opt. Lett. **29**(14), 1668–1670 (2004). [CrossRef]

**47. **L. I. Rudin, S. Osher, and E. Fatemi, “Nonlinear total variation based noise removal algorithms,” Phys. D **60**(1-4), 259–268 (1992). [CrossRef]

**48. **A. Beck and M. Teboulle, “A fast iterative shrinkage-thresholding algorithm for linear inverse problems,” SIAM J. Imaging Sci. **2**(1), 183–202 (2009). [CrossRef]

**49. **O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional networks for biomedical image segmentation,” in Medical Image Computing and Computer-Assisted Intervention (MICCAI), (Springer, 2015), pp. 234–241.

**50. **K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. (CVPR), (2016), pp. 770–778.

**51. **J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. (CVPR), (2009), pp. 248–255.

**52. **A. Goy, G. Rughoobur, S. Li, K. Arthur, A. I. Akinwande, and G. Barbastathis, “High-resolution limited-angle phase tomography of dense layered objects using deep neural networks,” Proc. Natl. Acad. Sci. U. S. A. **116**(40), 19848–19856 (2019). [CrossRef]

**53. **D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980 (2014).

**54. **Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE Trans. Image Process. **13**(4), 600–612 (2004). [CrossRef]

**55. **J. R. Fienup, “Invariant error metrics for image reconstruction,” Appl. Opt. **36**(32), 8352–8357 (1997). [CrossRef]

**56. **J. Johnson, A. Alahi, and L. Fei-Fei, “Perceptual losses for real-time style transfer and super-resolution,” in European Conference on Computer Vision (ECCV), (Springer, 2016), pp. 694–711.

**57. **M. Deng, A. Goy, S. Li, K. Arthur, and G. Barbastathis, “Probing shallower: perceptual loss trained phase extraction neural network (PLT-PhENN) for artifact-free reconstruction at low photon budget,” Opt. Express **28**(2), 2511–2535 (2020). [CrossRef]

**58. **P. Gemayel, B. Colicchio, A. Dieterlen, and P. Ambs, “Cross-talk compensation of a spatial light modulator for iterative phase retrieval applications,” Appl. Opt. **55**(4), 802–810 (2016). [CrossRef]

**59. **A. van der Schaaf and J. H. van Hateren, “Modelling the power spectra of natural images: statistics and information,” Vision Res. **36**(17), 2759–2770 (1996). [CrossRef]

**60. **S. Li and G. Barbastathis, “Spectral pre-modulation of training examples enhances the spatial resolution of the phase extraction neural network (PhENN),” Opt. Express **26**(22), 29340–29352 (2018). [CrossRef]

**61. **G. K. Matsopoulos, N. A. Mouravliansky, K. K. Delibasis, and K. S. Nikita, “Automatic retinal image registration scheme using global optimization techniques,” IEEE Trans. Inf. Technol. Biomed. **3**(1), 47–60 (1999). [CrossRef]

**62. **J. A. Nelder and R. Mead, “A simplex method for function minimization,” Comput. J. **7**(4), 308–313 (1965). [CrossRef]

**63. **A. Strehl and J. Ghosh, “Cluster ensembles—a knowledge reuse framework for combining multiple partitions,” J. Mach. Learn. Res. **3**, 583–617 (2002).

**64. **G. Barbastathis, A. Ozcan, and G. Situ, “On the use of deep learning for computational imaging,” Optica **6**(8), 921–943 (2019). [CrossRef]

**65. **R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, “The unreasonable effectiveness of deep features as a perceptual metric,” in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., (2018), pp. 586–595.

**66. **M. Bertero and P. Boccacci, *Introduction to inverse problems in imaging* (CRC Press, 1998).

**67. **A. M. Eskicioglu and P. S. Fisher, “Image quality measures and their performance,” IEEE Trans. Commun. **43**(12), 2959–2965 (1995). [CrossRef]