## Abstract

Imaging through scattering is an important yet challenging problem. Tremendous progress has been made by exploiting the deterministic input–output “transmission matrix” for a fixed medium. However, this “one-to-one” mapping is highly susceptible to speckle decorrelations – small perturbations to the scattering medium lead to model errors and severe degradation of the imaging performance. Our goal here is to develop a new framework that is highly scalable with respect to both medium perturbations and measurement requirements. To do so, we propose a statistical “one-to-all” deep learning (DL) technique that encapsulates a wide range of statistical variations for the model to be resilient to speckle decorrelations. Specifically, we develop a convolutional neural network (CNN) that is able to learn the statistical information contained in the speckle intensity patterns captured on a set of diffusers having the same macroscopic parameter. We then show for the first time, to the best of our knowledge, that the trained CNN is able to generalize and make high-quality object predictions through an entirely different set of diffusers of the same class. Our work paves the way to a highly scalable DL approach for imaging through scattering media.

© 2018 Optical Society of America under the terms of the OSA Open Access Publishing Agreement

## 1. INTRODUCTION

Light scattering in complex media is a pervasive problem across many areas, such as deep tissue imaging [1], imaging in degraded environments [2], and wavefront shaping [3–5]. To date, there is no simple solution for inverting scattering because of the many possible optical paths between the object and the detector. Coherent light scattered by a complex medium exhibits a seemingly random speckle pattern [6]. The speckle’s spatial distribution is a complex function of both the microscopic arrangement of the scatterers and the wavefront of the incident field. Thus, a comprehensive *deterministic* characterization of the scattering process is often difficult, requiring large-scale measurements.

Major progress has been made by using the transmission matrix (TM) framework [3,7,8] that characterizes the “one-to-one” input–output relation of a fixed scattering medium as a linear shift-*variant* matrix. Owing to the many underlying degrees of freedom, the TM is inevitably large; its size generally grows quadratically with the number of transferred pixels, i.e., the system’s space-bandwidth product (SBP). This makes the approach highly measurement- and data-demanding for high-SBP applications. Under special conditions, simplification can be made using the memory effect [9], which approximates the system to be shift-*invariant*. However, the SBP of this method is still small due to the limited memory effect range [9,10], finite sensor dynamic range [11], imaging geometry [12–14], and trade-offs between illumination coherence, speckle contrast, and measurement requirement [11,15–17].

A major limitation of these existing approaches is their high susceptibility to model errors. The phase-sensitive TM is inherently intolerant to speckle decorrelations [18–21]. Slight changes of the medium can lead to much reduced correlations between the speckles measured before and after the perturbation. This indicates the breakdown of the previous input–output relation and results in rapid degradation of the transferred images. In other words, a new TM is needed once the speckle patterns become decorrelated, e.g., Pearson correlation coefficient (PCC) $<1/e$, making these methods challenging to scale for applications involving dynamic scatterers. Current solutions focus on developing hardware with higher speed than the medium’s decorrelation time [20,22–25]; still, they are often limited by the memory effect.
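The PCC decorrelation criterion above can be checked numerically. A minimal stdlib-only sketch (the function name is ours) computing the PCC between two flattened speckle intensity images:

```python
import math

def pearson_cc(a, b):
    """Pearson correlation coefficient between two equal-length
    intensity sequences (e.g., flattened speckle images)."""
    n = len(a)
    ma = sum(a) / n
    mb = sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (sa * sb)

# Two speckle patterns are conventionally considered decorrelated
# once their PCC falls below 1/e ~ 0.368.
DECORRELATION_THRESHOLD = 1 / math.e
```

In practice a library routine (e.g., `scipy.stats.pearsonr`) would be used on real camera frames; this sketch only makes the criterion concrete.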

Our goal here is to develop a highly *scalable* imaging through scattering framework by overcoming the existing limitations in susceptibility to speckle decorrelation and SBP. The main approach is to build a “one-to-all” model that possesses two essential *statistical* properties. First, “one” model sufficiently encompasses the statistical *variations* across “all” scattering media with different scatterer microstructures but within the same class. Second, the model can distill the statistically *invariant* information encoded in the speckle patterns (correlated or decorrelated). Together, they allow the single model to be generalizable to various objects/media having the same statistical characteristics.

The proposed model is built on a deep learning (DL) framework. To satisfy the desired statistical properties, we do *not* train a convolutional neural network (CNN) to learn the TM of *a single* scattering medium. Instead, we build *a CNN to learn a “one-to-all” mapping by training on multiple scattering media with different microstructures while having the same macroscopic parameter*. Specifically, we show that our CNN model trained on a few diffusers can sufficiently support the statistical information of all diffusers having the same mean characteristics (e.g., “grits” [26]). We then experimentally demonstrate that the CNN is able to “invert” speckles captured from entirely different diffusers to make high-quality object predictions, as outlined in Fig. 1.

DL is shown to be powerful in solving complex imaging problems, providing state-of-the-art performance in super-resolution [27,28], holography [29,30], and phase recovery [31,32]. Instead of building an explicit model, DL takes a data-driven approach that seeks solutions by learning from a large-scale dataset. The major benefits include flexibility and adaptability in solving complex problems for which a parametric model is hard to derive and/or prone to errors. Closely related to our work are the learning-based techniques for imaging/focusing through diffusers [33–37]. Unfortunately, all existing networks are trained and tested only on the *same* diffuser, so the model may still be susceptible to speckle decorrelation. Indeed, as tested in our experiment, a single-diffuser-trained CNN does not capture sufficient statistical variations to interpret speckle patterns from other diffusers. Another closely related line of work is using DL to conduct imaging through multimode fibers (MMFs) [38,39]. Image transfer through an MMF also results in speckle patterns due to spatial mode mixing. CNNs have been designed to capture sufficient statistical variations of the setup so as to provide superior robustness against random variations.

We demonstrate our technique under shift-*variant* scattering by placing a diffuser at a defocused plane [13,14,36]. This geometry provides a limited isoplanatic region (approximately one speckle size) [13,14], as verified experimentally in Fig. 2. The objects extend well beyond the isoplanatic region ($\sim 300\times 300$ speckle sizes). Our task is further complicated by the intensity-only measurement under coherent illumination; the mapping between the object and speckle intensity is nonlinear [6]. The training step in our DL method is conceptually similar to the TM calibration, in which a series of patterns are input to the diffuser and the output is measured. In TM calibration, interferometric measurements are often required [7,8]; additional phase-retrieval procedures are needed when intensity-only data are used [40]. Here, the proposed CNN learns to interpret the “phaseless” measurements using its nonlinear, multilayer structure.

We experimentally achieve $\sim 256\times 256$ pixel SBP using up to 2400 training pairs. Importantly, our training data were collected on *multiple* diffusers. Distinct from the TM approach, our trained CNN is able to predict objects through “unseen diffusers” that were *never used during training*. We experimentally quantify the CNN performance trained with one, two, or four diffusers and demonstrate the superior robustness over speckle decorrelation of our technique. We further demonstrate that the trained CNN is able to generalize over new object types through unseen diffusers.

Although it is hard to give an explicit expression of our CNN model (a common challenge in DL), we attempt to provide some insights by performing both CNN visualization and statistical analysis on our data across multiple objects and diffusers. The basic mechanism of DL is to identify statistical invariance across large datasets [41]. We first visualize the activation maps of our CNN when inputting speckle patterns obtained from the same object but through different diffusers. By quantifying the correlations between the corresponding activation maps, we show that our CNN indeed gradually learns the invariance across these speckle patterns. Next, we visualize speckle intensity correlations and show that physical invariance does exist across seemingly decorrelated speckle patterns taken through different diffusers. Such information would be hard to utilize directly with existing models. Our CNN model is able to discover and exploit these “hidden” invariant features owing to its higher representation power.

We demonstrate a promising DL framework toward highly scalable imaging through scattering media. Our method significantly improves the system’s information throughput and adaptability compared to existing approaches, by improving both the SBP and the robustness to speckle decorrelations.

## 2. METHOD

#### A. Experimental Setup

We use a spatial light modulator (SLM) (Holoeye NIR-011, pixel size 8 μm) as a programmable *amplitude*-only object with two orthogonally oriented polarizers before and after [Fig. 2(a)], similar to [36]. It is coherently illuminated by a collimated beam from a HeNe laser (632 nm, Thorlabs HNL210L). The SLM is relayed onto the camera (Thorlabs Quantalux, pixel size 5.04 μm) by a 4F system. Two lenses with focal lengths 200 mm (L1) and 125 mm (L2) are used to provide a 0.625 magnification. This design approximately produces the same effective pixel size for the object and the image, convenient for the CNN implementation since the same number of pixels can be used for the input and output without resizing [36]. Precise pixel-wise alignment was *not* performed or needed. A $\sim 9$ mm iris is placed at the pupil plane of the 4F system to control the speckle size. The theoretical average speckle size is $\sim 8.8\ \mathrm{\mu m}$, or equivalently $\sim 14\ \mathrm{\mu m}$ on the object plane, as set by $\lambda /2\mathrm{NA}$ (NA denotes the numerical aperture of the 4F system) [6]. This is experimentally verified by taking the autocorrelation of a speckle pattern through a diffuser and measuring the full width at half-maximum [6], which reads $\sim 16\ \mathrm{\mu m}$, as shown in Fig. 2(b) (for ease of comparison, all length measurements are converted to the object side).
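The quoted speckle sizes follow directly from $\lambda /2\mathrm{NA}$ and the stated geometry; a quick numerical check (variable names ours, values taken from the text):

```python
# Theoretical speckle size lambda/(2*NA) for the 4F geometry in the text.
wavelength = 632e-9          # HeNe laser, m
iris_diameter = 9e-3         # pupil iris at the Fourier plane, m
f_obj = 200e-3               # L1 focal length (object side), m
magnification = 125 / 200    # f_L2 / f_L1 = 0.625

na = (iris_diameter / 2) / f_obj           # object-side NA ~ 0.0225
speckle_obj = wavelength / (2 * na)        # ~14 um at the object plane
speckle_img = speckle_obj * magnification  # ~8.8 um at the camera
```

This reproduces the $\sim 14\ \mathrm{\mu m}$ (object side) and $\sim 8.8\ \mathrm{\mu m}$ (camera side) figures quoted above.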

The spatially *variant* scattering is generated by placing a thin glass diffuser (Thorlabs, 220 grit, DG10-220) between the SLM and the 4F system’s first lens. This system theoretically provides a small isoplanatic region that is limited to a single speckle since the diffuser is placed at a defocused position [13]. We quantify the isoplanatism by measuring the intensity speckle correlations [11]. A $3\times 3$ pixel “point object” is scanned linearly across the SLM pixel by pixel (8 μm). The isoplanatic range is then found by calculating the PCC between the speckle pattern from the central point and the one from each shifted point. Rapid speckle decorrelation beyond a single speckle range is observed in Fig. 2(c). The correlation coefficient plateaus around 0.3, close to the value in the speckle intensity autocorrelation curve [Fig. 2(b)]. The smallest usable object size (24 μm) was limited by the signal-to-background ratio of the experiment: imperfect polarizer extinction produces a non-negligible background at low light levels. The same procedure was repeated for different object sizes; nearly identical curves are obtained (see Supplement 1). The same behavior was numerically predicted in [13,36].

Speckle measurements are repeated on several diffusers having the same macroscopic parameter (220 grit). All glass (BK-7) diffusers are manufactured by the same process (Thorlabs): the top surface is first polished, and the bottom surface is then ground with the specified (220) grit. The 220-grit finish provides an average 63 μm feature size on the glass surface. When imaged in our setup, speckles generated by all diffusers possess similar statistical properties, including the average speckle size and the background correlation (0.3) (see Figs. 2 and 11).

#### B. Data Acquisition

The central $512\times 512$ SLM pixels are used as the object; the corresponding central $512\times 512$ camera pixels are used as the speckle intensity for CNN training and testing. Considering the system’s resolution (measured by the speckle size), the SBP is $\sim 300\times 300$ pixels with a field-of-view (FOV) of $\sim 4\times 4\ {\mathrm{mm}}^{2}$, which is well beyond the isoplanatic patch. The objects displayed on the SLM are 8-bit grayscale images from the MNIST handwritten digit [42], NIST handwritten letter [43], and Quickdraw object [44] databases.
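The quoted SBP follows from dividing the FOV by the measured object-plane speckle size; a quick check (variable names ours):

```python
# SBP estimate: FOV in object-plane pixels divided by the speckle size.
slm_pixels = 512
slm_pitch_um = 8.0    # SLM pixel pitch, um
speckle_um = 14.0     # measured average speckle size at the object plane, um

fov_mm = slm_pixels * slm_pitch_um / 1000         # ~4.1 mm per side
sbp_per_side = int(fov_mm * 1000 / speckle_um)    # ~292, i.e. ~300x300
```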

In total we take speckle patterns using nine different diffusers. We use data from up to four diffusers to train our CNN; *the data from the other five diffusers are never seen by the CNN during training and are only used for testing*. The *training* objects are taken only from the handwritten digit and letter databases. The Quickdraw objects are used only for *testing*. The data were collected over a span of $\sim 8$ weeks, demonstrating the robustness of our approach to possible random variations during the experiment.

To collect the *training* data, we use in total 600 objects (300 digits and 300 letters). For each *training* diffuser, we take 600 speckle images, giving in total up to 2400 training pairs.

Our *testing* data are purposely designed to have four groups for characterizing our CNN’s generalization capability from different perspectives, as follows:

**Group 1** tests the CNN generalization over *speckle decorrelation due to the change of diffusers*. It consists of 3000 *“seen objects through unseen diffusers”* collected from the *same* 600 objects used in the training, but through the five *unseen testing* diffusers.

**Group 2** tests the CNN over *the change of diffusers and unseen objects of the same type* (as the training objects). It consists of 200 *“unseen objects of the same type through unseen diffusers”*: 200 objects of the same classes (100 digits + 100 letters) that were never used during training, imaged through a randomly selected *unseen testing* diffuser.

**Group 3** tests the CNN over *the change of diffusers and new object types*. It consists of 800 *“unseen objects of new types through unseen diffusers,”* imaged through the five *unseen testing* diffusers. The objects are taken from the Quickdraw database.

**Group 4** benchmarks the CNN performance *trained on a single diffuser*. It consists of 28 *“unseen objects through the same diffuser”*: 28 objects of the same type (9 digits + 19 letters) that were never used during training, imaged through a randomly selected *seen training* diffuser.

#### C. Data Preprocessing

Owing to computational limitations, all input and output images are first downsampled from $512\times 512$ to $256\times 256$ pixels by averaging within each $2\times 2$ pixel neighborhood (i.e., $2\times 2$ binning). The downsampling reduces both the number of network parameters (this number grows with the input size) and the data size required for training without overfitting (this size grows with the number of network parameters). However, two artifacts may result. First, our system images each speckle with approximately two pixels; after downsampling, each binned *image* pixel contains intensities from several speckle grains, effectively reducing the contrast of the input patterns [6]. Second, each binned *object* pixel may combine pixels from both the object and background regions, introducing incorrect (noisy) ground truth. Robust training with noisy ground truth has been demonstrated in other CNN tasks [45]. In essence, the CNN learns the invariants and filters out the random noise. Our results suggest that the downsampling has little effect on the final results. Next, for both training and testing, the input speckles are normalized between 0 and 1 by dividing each image by its maximum.
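The two preprocessing steps above amount to $2\times 2$ mean binning followed by per-image max normalization; a minimal stdlib sketch on nested-list images (function names ours):

```python
def bin2x2(img):
    """Downsample a 2D intensity image (list of rows) by averaging
    each non-overlapping 2x2 pixel neighborhood."""
    h, w = len(img), len(img[0])
    return [[(img[r][c] + img[r][c + 1] + img[r + 1][c] + img[r + 1][c + 1]) / 4.0
             for c in range(0, w, 2)]
            for r in range(0, h, 2)]

def normalize(img):
    """Scale intensities to [0, 1] by dividing by the image maximum."""
    peak = max(max(row) for row in img)
    return [[v / peak for v in row] for row in img]
```

On real data these operations would be vectorized (e.g., NumPy reshaping), but the arithmetic is the same.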

Our CNN is designed to perform two types of tasks. First, the *binary detection* task outputs a two-channel binary estimate of the object and background. Accordingly, during the training, each grayscale object is thresholded by setting all nonzero valued pixels to 1 to give the ground-truth object; the ground-truth background is the complement. Second, the *grayscale object reconstruction* task outputs a two-channel grayscale estimate of the object and background. The ground-truth object is the grayscale image displayed on the SLM, processed with $2\times 2$ pixel binning and normalized between 0 and 1; the ground-truth background is defined by subtracting the ground-truth object from 1.
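The two ground-truth constructions described above can be sketched as follows (function names ours; 8-bit input assumed for the grayscale case):

```python
def make_binary_target(gray):
    """Binary-detection ground truth: object channel is the nonzero
    mask of the grayscale object; background is its complement."""
    obj = [[1.0 if v > 0 else 0.0 for v in row] for row in gray]
    bkg = [[1.0 - v for v in row] for row in obj]
    return obj, bkg

def make_grayscale_target(gray, peak=255.0):
    """Grayscale-reconstruction ground truth: object normalized to
    [0, 1]; background defined as one minus the object."""
    obj = [[v / peak for v in row] for row in gray]
    bkg = [[1.0 - v for v in row] for row in obj]
    return obj, bkg
```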

#### D. CNN Implementation

We build a CNN to learn a statistical model relating the speckle patterns to the unscattered objects. Importantly, the goal is to make predictions through previously unseen diffusers.

The overall structure of the proposed CNN (Fig. 3) follows the encoder–decoder “U-net” architecture [46] with the modification of replacing each convolutional layer with a dense block [47] to improve the training efficiency [36]. The input to the CNN is a preprocessed $256\times 256$ speckle pattern. Next, the input goes through the “encoder” path, which consists of four dense blocks connected by max pooling layers for downsampling. The intermediate output from the encoder has small lateral dimensions $(16\times 16)$ but encodes rich information along the “depth” (having 1088 activation maps). Each dense block contains multiple layers, in which each layer consists of batch normalization (BN), the rectified linear unit (ReLU) nonlinear activation, and convolution (conv) with 16 filters. Next, the low-resolution activation maps go through the “decoder” path, which consists of four additional dense blocks connected by upsampling convolutional (up-conv) layers. The information across different spatial scales is tunneled through the encoder–decoder paths by skip connections to preserve high-frequency information. After the decoder path, an additional convolutional layer followed by the last layer produces the network output. The design of this last layer requires careful consideration of *the desired imaging task*.

Our CNN is designed to *image sparse objects*. Widely used loss functions including mean squared error (MSE) and mean absolute error (MAE) cannot promote sparsity since they assume the underlying signals follow Gaussian and Laplace statistics, respectively [48]. In a recent work [36], the negative PCC is shown to promote sparse predictions. Here, we propose an alternative method. We use a softmax layer to produce a pair of mutually complementary channels presenting the object and the background, respectively. We then use the averaged cross-entropy [46] as the loss function $L$, which has been shown to promote sparsity [49].
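The averaged cross-entropy over the two complementary softmax channels can be written in the following standard form (notation ours, consistent with the description above):

$$L=-\frac{1}{N}\sum_{i=1}^{N}\left[{g}_{i}\,\mathrm{log}\,{p}_{i}+(1-{g}_{i})\,\mathrm{log}(1-{p}_{i})\right],$$

where ${p}_{i}\in [0,1]$ is the softmax output of the object channel at pixel $i$, ${g}_{i}$ is the corresponding ground-truth value, and $N$ is the number of pixels; the complementary background channel contributes the $(1-{g}_{i})\,\mathrm{log}(1-{p}_{i})$ term.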

Importantly, our design allows making both binary and grayscale predictions. First, we consider *the pixel-wise binary detection* problem—the CNN predicts if the object is present or not pixel by pixel. In this case, both the ground truth and predictions take binary values. The intermediate output from the softmax layer is often interpreted as the probabilities of each pixel belonging to the object and background classes. Second, we consider *the grayscale object reconstruction* problem—the CNN predicts continuous-valued intensity in each object pixel. In this case, both the ground truth and predictions take grayscale values. The predictions are taken directly from the softmax layer. Since our objects are generated with an 8-bit SLM, the CNN predictions are set to the same bit level.

The CNN training was performed on the Boston University shared computing cluster with one graphics processing unit (NVIDIA Tesla P100) using Keras/TensorFlow. Each CNN is trained for 500 epochs by the Adam optimizer for up to 44 h. A learning rate of ${10}^{-4}$ is used for the first 300 epochs, ${10}^{-5}$ for the next 100 epochs, and ${10}^{-6}$ for the final 100 epochs. Once the CNN is trained, each prediction is made in real time. More details of the CNN architecture, parameter optimization, and training procedures are provided in Supplement 1. We also provide open source code of our CNN model along with pretrained weights and sample data in [50].
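The stepped learning-rate schedule above can be expressed as a simple epoch-indexed function, suitable for use with Keras’s `LearningRateScheduler` callback (a minimal sketch; the function name is ours):

```python
def learning_rate(epoch):
    """Piecewise-constant schedule described in the text:
    1e-4 for epochs 0-299, 1e-5 for 300-399, 1e-6 for 400-499."""
    if epoch < 300:
        return 1e-4
    if epoch < 400:
        return 1e-5
    return 1e-6

# With Keras, this would be passed to model.fit via
# callbacks=[tf.keras.callbacks.LearningRateScheduler(learning_rate)].
```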

## 3. RESULTS

We present our results from four types of experiments, in line with the acquired data described in Section 2.B. The results from the first three experiments are all from the CNN trained with four training diffusers and tested on five testing diffusers. The last experiment compares the four-training-diffuser results to those from the CNN trained on a single diffuser. Although our CNN is able to make both binary and grayscale predictions, we here only show binary images. The grayscale network provides similar performance, as detailed in Supplement 1. This is probably because our CNN is designed to image sparse objects. Imaging nonsparse objects becomes more challenging [36] and will be considered in our future work.

In the first experiment, we test our CNN to *predict “seen objects through unseen diffusers”* (**Task 1**). Notably, our CNN demonstrates superior generalization in predicting objects through previously unseen diffusers. Representative examples of the speckle and prediction pairs are shown in Fig. 4. More results are given in Supplement 1. For the same object, although the speckle patterns through different diffusers appear notably different, the CNN consistently makes high-quality predictions. Later, we quantify the differences between these speckle patterns by speckle decorrelation analysis in Section 4. The prediction results present slight variations since our CNN makes *pixel-wise predictions*, rather than whole-image classification [51]. Our pixel-wise prediction task is considerably more difficult since the network needs to effectively learn the per-pixel input–output relation. In addition, since our CNN adapts to *all diffusers of the same class*, the learned relation must also be adaptable to all possible statistical variations. The variations of the predictions for this task using our binary CNN are quantified later in Fig. 8. Representative examples and statistical analysis on the grayscale CNN predictions are provided in Supplement 1.

In the second experiment, we test our CNN on a more difficult task of *predicting “unseen objects of the same type through unseen diffusers”* (**Task 2**). The set of objects has never been used in the training. These objects, however, belong to the same object class as the training data, i.e., handwritten digits and letters. A quantitative comparison between Task 1 and Task 2 measured by the speckle decorrelation is presented in Section 4. Representative examples are shown in Fig. 5, demonstrating that the CNN is able to make high-quality binary predictions of these *unseen objects from the same class*, while through unseen diffusers. The corresponding grayscale predictions are shown in Supplement 1.

In the third experiment, we further test our CNN on *predicting “unseen objects of new types through unseen diffusers”* (**Task 3**). The set of objects has never been used in the training and belongs to a *different object class* (Quickdraw). Representative examples are shown in Fig. 6, demonstrating that our CNN is still able to make high-quality predictions of these *unseen new types of objects* through unseen diffusers. The quality of the binary predictions for this task are quantified in Fig. 9 across different object types. The corresponding grayscale predictions are evaluated in Supplement 1.

In the fourth experiment, we compare our “four-training-diffuser” results against those from *the CNN trained on a single diffuser*. The results are presented in Fig. 7, which consists of two tasks. **Task 4** makes predictions on *unseen objects by the CNN that is trained and tested on the same diffuser*. Successful demonstrations of accomplishing this task via machine learning have been reported [33,34,36]. **Task 5** makes predictions on *unseen objects through a different unseen diffuser by the CNN trained on a single diffuser*. The goals of this experiment are twofold. First, owing to the different choices of CNN architectures and loss functions, here we validate that our design can indeed reliably perform Task 4, as shown in Fig. 7. The results from our CNN are further quantified in Fig. 8, matching the state-of-the-art performance with an average PCC of 0.626 [36]. Second, we verify that a CNN trained on only a single diffuser can*not* be reliably generalized to other diffusers (shown in Fig. 7), since the CNN is tuned to fit only the model of a specific diffuser.

Next, we quantify the performance on the “seen objects through unseen diffusers” task. We expand the comparisons across six CNNs trained on one, two, or four diffusers with three training dataset sizes (in total: 800, 1600, and 2400 pairs). We use two metrics: the Jaccard index (JI) and the PCC. Both metrics are useful for measuring the similarity between image pairs [52]; they provide slightly different scores due to the differences in error counting. Each CNN is tested under the same condition, using the same 1000 speckle patterns (the same as Fig. 4).
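For binary predictions, the JI reduces to the ratio of intersection to union of the predicted and ground-truth object masks (true positives divided by true positives plus false positives plus false negatives); a minimal sketch (function name ours):

```python
def jaccard_index(pred, truth):
    """Jaccard index between two binary masks given as flattened
    0/1 sequences: |intersection| / |union|."""
    inter = sum(1 for p, t in zip(pred, truth) if p and t)
    union = sum(1 for p, t in zip(pred, truth) if p or t)
    return inter / union if union else 1.0
```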

We first present the JI scores. In the top figure of Fig. 8, the JI of each CNN tested on each individual testing diffuser is shown as a circle. Results from all five unseen diffusers are clustered together, regardless of the CNN being used, demonstrating *the consistency of the CNN prediction against object and diffuser variations*. In addition, we make two observations. First, *the performance improves as more training diffusers are used*. This is evident when comparing the results from the same total of 800 training pairs while increasing the number of training diffusers (similarly for the 1600-pair case). Second, *the performance further improves by increasing the size of the training dataset*. This is seen by fixing the number of training diffusers at four while increasing the training dataset size (similarly for the two-diffuser case). To provide an intuitive visualization of the JI score, the bottom figure of Fig. 8 shows a few representative examples. In the first row, the result is further broken down to the true positive (white), the false positive (green), and the false negative (purple).

Next, we provide the alternative evaluation using the PCC score. The mean PCC of each CNN is given in Table 1. The general observations remain the same as in the JI evaluation. In addition, we observe that the performance from “four diffusers, 800 data” is slightly better than that from “two diffusers, 1600 data” (i.e., more diffusers and fewer data), further demonstrating the effectiveness of training using multiple diffusers.

Finally, we quantify the performance on the “unseen objects of new types through unseen diffusers” task in Fig. 9. The results are from the CNN trained with four training diffusers and 2400 training data (the condition for Fig. 6). In general, our trained CNN is able to make high-quality predictions albeit with reduced JI scores compared to the “seen object” case. The performance also varies with the specific object types. In total, we tested six different types, whose performance is quantified by the mean and standard deviation of the JI. These results suggest that the quality of the CNN model is also influenced by the object types used during training. A larger training dataset covering additional object types may further improve our results.

## 4. ANALYSIS

To provide some insight into our CNN model, we perform analysis on both the network and the speckle patterns. The main principle of DL is to learn statistically invariant information across a large dataset [41]. Thus, our goal is to look for *any meaningful invariant features* among speckles taken through different diffusers. Any such features would suggest that it is plausible for the CNN model to establish a statistical mapping relating these speckles.

First, we visualize the intermediate activation maps [53] from each layer of our CNN when inputting speckle patterns from *the same object* but through *different testing diffusers*. Starting with a pair of visually distinct speckle patterns, the activation maps gradually converge to similar patterns as the data flow through the encoder–decoder paths, as shown in Fig. 10(a). To quantify the learned invariance, we compute the pairwise PCCs of each corresponding layer (across all channels) from the same object for all possible combinations of the five testing diffusers. The PCC generally grows with network depth; PCC curves from different objects follow a similar trend, as shown in Fig. 10(b).

Next, we perform speckle correlation analysis. Our findings are summarized in Fig. 11. First, we quantify speckle decorrelation in our measurement using the classical PCC metric [18–20]. Figure 11(a) presents histograms of the PCC under various tasks (defined in Section 3), each from 400 randomly chosen speckle patterns. We describe the results in order of decreasing correlation (hence increasing task difficulty). First, Task 4 (Fig. 7) is evaluated by ${A}_{\mathrm{D}1}*{B}_{\mathrm{D}1}$, which correlates speckles from *different objects through the same diffuser*. Most of the speckle patterns are decorrelated and the mean coefficient is 0.307, which is consistent with the values found in both the isoplanatism and speckle size characterization plots in Fig. 2. Second, Task 1 (Fig. 4) is evaluated by ${A}_{\mathrm{D}1}*{A}_{\mathrm{D}2}$, which is for *the same object through different diffusers*. The speckle patterns are further decorrelated, to a mean value of 0.221. Third, Tasks 2, 3, and 5 (Figs. 5–7) are evaluated by ${A}_{\mathrm{D}1}*{B}_{\mathrm{D}2}$, which is for *different objects through different diffusers*. This gives the lowest correlation, around 0.207.

A single-valued metric does not sufficiently capture the rich information encoded in the speckle patterns. Inspired by speckle correlography [15] and its variants [11,16,17], we next investigate the speckle intensity correlation function for different speckle pairs. Representative examples from our main findings are presented in Fig. 11(b). Importantly, taking the speckle intensity autocorrelation as the reference, *the speckle intensity cross-correlation from the same object but through two different diffusers (e.g., the first for training and the second for testing) closely resembles the reference pattern.* These correlation patterns do not follow the simple relation exploited in [11,15–17]. Nevertheless, the *invariance* maintained across speckle patterns *from training and testing diffusers* does suggest that there exist *learnable* and *generalizable* features. This suggests that if the CNN is *trained and tested with the same object but through different diffusers* (e.g., in Fig. 4), a physically meaningful invariance exists in these speckle intensity correlation patterns. Our CNN model is able to discover and exploit this “hidden” information although these speckle pairs are considered “decorrelated” based on the PCC. Next, correlation patterns from visually similar objects are shown to present notable differences, demonstrating the sensitivity of these features. Overall, we speculate that these invariant correlation patterns/features could contribute to the scalability of our CNN with respect to speckle decorrelations. Furthermore, our results on *unseen objects* through unseen diffusers (Figs. 5 and 6) suggest that these learned invariances are generalizable to a broader range of speckle measurements.

## 5. CONCLUSION AND DISCUSSION

We have demonstrated a DL framework to significantly improve the scalability of imaging through scattering. Traditional techniques suffer from the *“one-to-one”* limitation, in which one model only works for one fixed scattering medium. Here, we take an entirely different *“one-to-all”* strategy, in which one model fits all scattering media within the same class. In practice, this leads to significantly improved resilience to speckle decorrelations and an improved SBP. Our approach promises highly scalable, large information-throughput imaging through complex scattering media.

We envision that our technique can be useful for imaging biological samples. Several macroscopic parameters [54], such as the absorption and scattering coefficients and the (transport) mean free path, are routinely used to characterize a sample’s scattering properties, as well as to make phantoms with controlled optical properties. By adapting our technique, one may train on, classify, and image through biological samples characterized by these parameters.

We have demonstrated our technique by imaging through the shift-variant scattering induced by a thin diffuser. This condition closely resembles imaging through aberrations induced by a single scattering layer [14,55]. Our technique opens up the opportunity to compensate for such aberrations in real time without expensive hardware, providing expanded FOVs and improved tolerance to changes in the aberrations. The ultimate challenge for imaging through scattering is to handle volumetric multiple scattering. Several learning-based approaches have been reported recently [56–61]. Future work could adapt our approach to these more challenging scenarios.

## Funding

National Science Foundation (NSF) (1711156); Directorate for Engineering (ENG).

## Acknowledgment

We thank Xiaojun Cheng for discussions on correlation analysis.

See Supplement 1 for supporting content.

## REFERENCES

**1. **V. Ntziachristos, “Going deeper than microscopy: the optical imaging frontier in biology,” Nat. Methods **7**, 603–614 (2010). [CrossRef]

**2. **M. C. Roggemann, B. M. Welsh, and B. R. Hunt, *Imaging Through Turbulence* (CRC Press, 2018).

**3. **I. Vellekoop and A. Mosk, “Focusing coherent light through opaque strongly scattering media,” Opt. Lett. **32**, 2309–2311 (2007). [CrossRef]

**4. **A. P. Mosk, A. Lagendijk, G. Lerosey, and M. Fink, “Controlling waves in space and time for imaging and focusing in complex media,” Nat. Photonics **6**, 283–292 (2012). [CrossRef]

**5. **S. Rotter and S. Gigan, “Light fields in complex media: mesoscopic scattering meets wave control,” Rev. Mod. Phys. **89**, 015005 (2017). [CrossRef]

**6. **J. W. Goodman, *Speckle Phenomena in Optics: Theory and Applications* (Roberts & Company, 2007).

**7. **S. Popoff, G. Lerosey, R. Carminati, M. Fink, A. Boccara, and S. Gigan, “Measuring the transmission matrix in optics: an approach to the study and control of light propagation in disordered media,” Phys. Rev. Lett. **104**, 100601 (2010). [CrossRef]

**8. **M. Kim, W. Choi, Y. Choi, C. Yoon, and W. Choi, “Transmission matrix of a scattering medium and its applications in biophotonics,” Opt. Express **23**, 12648–12668 (2015). [CrossRef]

**9. **I. Freund, “Looking through walls and around corners,” Physica A **168**, 49–65 (1990). [CrossRef]

**10. **S. Schott, J. Bertolotti, J.-F. Léger, L. Bourdieu, and S. Gigan, “Characterization of the angular memory effect of scattered light in biological tissues,” Opt. Express **23**, 13505–13516 (2015). [CrossRef]

**11. **O. Katz, P. Heidmann, M. Fink, and S. Gigan, “Non-invasive single-shot imaging through scattering layers and around corners via speckle correlations,” Nat. Photonics **8**, 784–790 (2014). [CrossRef]

**12. **A. Tokovinin, M. Le Louarn, and M. Sarazin, “Isoplanatism in a multiconjugate adaptive optics system,” J. Opt. Soc. Am. A **17**, 1819–1827 (2000). [CrossRef]

**13. **J. Mertz, H. Paudel, and T. G. Bifano, “Field of view advantage of conjugate adaptive optics in microscopy applications,” Appl. Opt. **54**, 3498–3506 (2015). [CrossRef]

**14. **J. Li, D. R. Beaulieu, H. Paudel, R. Barankov, T. G. Bifano, and J. Mertz, “Conjugate adaptive optics in widefield microscopy with an extended-source wavefront sensor,” Optica **2**, 682–688 (2015). [CrossRef]

**15. **A. Labeyrie, “Attainment of diffraction limited resolution in large telescopes by Fourier analysing speckle patterns in star images,” Astron. Astrophys. **6**, 85–87 (1970).

**16. **J. Bertolotti, E. G. van Putten, C. Blum, A. Lagendijk, W. L. Vos, and A. P. Mosk, “Non-invasive imaging through opaque scattering layers,” Nature **491**, 232–234 (2012). [CrossRef]

**17. **E. Edrei and G. Scarcelli, “Optical imaging through dynamic turbid media using the Fourier-domain shower-curtain effect,” Optica **3**, 71–74 (2016). [CrossRef]

**18. **T. R. Hillman, T. Yamauchi, W. Choi, R. R. Dasari, M. S. Feld, Y. Park, and Z. Yaqoob, “Digital optical phase conjugation for delivering two-dimensional images through turbid media,” Sci. Rep. **3**, 1909 (2013). [CrossRef]

**19. **M. Jang, H. Ruan, I. M. Vellekoop, B. Judkewitz, E. Chung, and C. Yang, “Relation between speckle decorrelation and optical phase conjugation (OPC)-based turbidity suppression through dynamic scattering media: a study on *in vivo* mouse skin,” Biomed. Opt. Express **6**, 72–85 (2015). [CrossRef]

**20. **Y. Liu, P. Lai, C. Ma, X. Xu, A. A. Grabar, and L. V. Wang, “Optical focusing deep inside dynamic scattering media with near-infrared time-reversed ultrasonically encoded (TRUE) light,” Nat. Commun. **6**, 5904 (2015). [CrossRef]

**21. **M. M. Qureshi, J. Brake, H.-J. Jeon, H. Ruan, Y. Liu, A. M. Safi, T. J. Eom, C. Yang, and E. Chung, “*In vivo* study of optical speckle decorrelation time across depths in the mouse brain,” Biomed. Opt. Express **8**, 4855–4864 (2017). [CrossRef]

**22. **D. Conkey, A. Caravaca-Aguirre, and R. Piestun, “High-speed scattering medium characterization with application to focusing light through turbid media,” Opt. Express **20**, 1733–1740 (2012). [CrossRef]

**23. **D. Wang, E. H. Zhou, J. Brake, H. Ruan, M. Jang, and C. Yang, “Focusing through dynamic tissue with millisecond digital optical phase conjugation,” Optica **2**, 728–735 (2015). [CrossRef]

**24. **Y. Liu, C. Ma, Y. Shen, J. Shi, and L. V. Wang, “Focusing light inside dynamic scattering media with millisecond digital optical phase conjugation,” Optica **4**, 280–288 (2017). [CrossRef]

**25. **B. Blochet, L. Bourdieu, and S. Gigan, “Focusing light through dynamical samples using fast continuous wavefront optimization,” Opt. Lett. **42**, 4994–4997 (2017). [CrossRef]

**26. **https://www.unc.edu/rowlett/units/scales/grit.html.

**27. **Y. Rivenson, Z. Göröcs, H. Günaydin, Y. Zhang, H. Wang, and A. Ozcan, “Deep learning microscopy,” Optica **4**, 1437–1443 (2017). [CrossRef]

**28. **H. Wang, Y. Rivenson, Y. Jin, Z. Wei, R. Gao, H. Gunaydin, L. Bentolila, and A. Ozcan, “Deep learning achieves super-resolution in fluorescence microscopy,” bioRxiv (2018), p. 309641.

**29. **Y. Rivenson, Y. Zhang, H. Günaydn, D. Teng, and A. Ozcan, “Phase recovery and holographic image reconstruction using deep learning in neural networks,” Light Sci. Appl. **7**, 17141 (2018). [CrossRef]

**30. **Z. Ren, Z. Xu, and E. Y. Lam, “Learning-based nonparametric autofocusing for digital holography,” Optica **5**, 337–344 (2018). [CrossRef]

**31. **A. Sinha, J. Lee, S. Li, and G. Barbastathis, “Lensless computational imaging through deep learning,” Optica **4**, 1117–1125 (2017). [CrossRef]

**32. **T. Nguyen, Y. Xue, Y. Li, L. Tian, and G. Nehmetallah, “Convolutional neural network for Fourier ptychography video reconstruction: learning temporal dynamics from spatial ensembles,” arXiv: 1805.00334 (2018).

**33. **R. Horisaki, R. Takagi, and J. Tanida, “Learning-based imaging through scattering media,” Opt. Express **24**, 13738–13743 (2016). [CrossRef]

**34. **M. Lyu, H. Wang, G. Li, and G. Situ, “Exploit imaging through opaque wall via deep learning,” arXiv: 1708.07881 (2017).

**35. **R. Horisaki, R. Takagi, and J. Tanida, “Learning-based focusing through scattering media,” Appl. Opt. **56**, 4358–4362 (2017). [CrossRef]

**36. **S. Li, M. Deng, J. Lee, A. Sinha, and G. Barbastathis, “Imaging through glass diffusers using densely connected convolutional networks,” Optica **5**, 803–813 (2018). [CrossRef]

**37. **A. Turpin, I. Vishniakou, and J. D. Seelig, “Light scattering control with neural networks in transmission and reflection,” arXiv: 1805.05602 (2018).

**38. **N. Borhani, E. Kakkava, C. Moser, and D. Psaltis, “Learning to see through multimode fibers,” Optica **5**, 960–966 (2018). [CrossRef]

**39. **P. Fan, T. Zhao, and L. Su, “Deep learning the high variability and randomness inside multimode fibres,” arXiv: 1807.09351 (2018).

**40. **A. Drémeau, A. Liutkus, D. Martina, O. Katz, C. Schülke, F. Krzakala, S. Gigan, and L. Daudet, “Reference-less measurement of the transmission matrix of a highly scattering material using a DMD and phase retrieval techniques,” Opt. Express **23**, 11898–11911 (2015). [CrossRef]

**41. **Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature **521**, 436–444 (2015). [CrossRef]

**42. **http://yann.lecun.com/exdb/mnist/.

**43. **https://www.nist.gov/srd/nist-special-database-19.

**44. **https://quickdraw.withgoogle.com/data.

**45. **T. Xiao, T. Xia, Y. Yang, C. Huang, and X. Wang, “Learning from massive noisy labeled data for image classification,” in *Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition* (2015), pp. 2691–2699.

**46. **O. Ronneberger, P. Fischer, and T. Brox, “U-net: convolutional networks for biomedical image segmentation,” in *International Conference on Medical Image Computing and Computer-Assisted Intervention* (Springer, 2015), pp. 234–241.

**47. **G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in *IEEE Conference on Computer Vision and Pattern Recognition (CVPR)* (2017), pp. 2261–2269.

**48. **A. Kendall and Y. Gal, “What uncertainties do we need in Bayesian deep learning for computer vision?” in *Advances in Neural Information Processing Systems* (2017), pp. 5580–5590.

**49. **S. Suresh, N. Sundararajan, and P. Saratchandran, “Risk-sensitive loss functions for sparse multi-category classification problems,” Inf. Sci. **178**, 2621–2638 (2008). [CrossRef]

**50. **https://github.com/bu-cisl/deep-speckle-correlation.

**51. **A. Krizhevsky, I. Sutskever, and G. Hinton, “ImageNet classification with deep convolutional neural networks,” in *Advances in Neural Information Processing Systems* (2012), pp. 1097–1105.

**52. **K. H. Zou, S. K. Warfield, A. Bharatha, C. M. Tempany, M. R. Kaus, S. J. Haker, W. M. Wells, F. A. Jolesz, and R. Kikinis, “Statistical validation of image segmentation quality based on a spatial overlap index,” Acad. Radiol. **11**, 178–189 (2004). [CrossRef]

**53. **M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional networks,” in *European Conference on Computer Vision* (Springer, 2014), pp. 818–833.

**54. **L. V. Wang and H.-I. Wu, *Biomedical Optics: Principles and Imaging* (Wiley, 2012).

**55. **N. Ji, D. Milkie, and E. Betzig, “Adaptive optics via pupil segmentation for high-resolution imaging in biological tissues,” Nat. Methods **7**, 141–147 (2010). [CrossRef]

**56. **L. Tian and L. Waller, “3D intensity and phase imaging from light field measurements in an LED array microscope,” Optica **2**, 104–111 (2015). [CrossRef]

**57. **L. Waller and L. Tian, “Machine learning for 3D microscopy,” Nature **523**, 416–417 (2015). [CrossRef]

**58. **U. S. Kamilov, I. N. Papadopoulos, M. H. Shoreh, A. Goy, C. Vonesch, M. Unser, and D. Psaltis, “Learning approach to optical tomography,” Optica **2**, 517–522 (2015). [CrossRef]

**59. **H.-Y. Liu, D. Liu, H. Mansour, P. T. Boufounos, L. Waller, and U. S. Kamilov, “SEAGLE: sparsity-driven image reconstruction under multiple scattering,” IEEE Trans. Comput. Imaging **4**, 73–86 (2018). [CrossRef]

**60. **E. Soubies, T.-A. Pham, and M. Unser, “Efficient inversion of multiple-scattering model for optical diffraction tomography,” Opt. Express **25**, 21786–21800 (2017). [CrossRef]

**61. **Y. Sun, Z. Xia, and U. S. Kamilov, “Efficient and accurate inversion of multiple scattering with deep learning,” Opt. Express **26**, 14678–14688 (2018). [CrossRef]