## Abstract

We present a deep neural network to reduce coherent noise in three-dimensional quantitative phase imaging. Inspired by the cycle generative adversarial network, the denoising network was trained to learn a transform between two image domains: clean and noisy refractive index tomograms. The unique feature of this network, distinct from previous machine learning approaches employed in optical imaging problems, is that it uses *unpaired* images. The learned network quantitatively demonstrated its performance and generalization capability through denoising experiments on various samples. We conclude by applying our technique to reduce the temporally changing noise emerging from focal drift in time-lapse imaging of biological cells, a reduction that cannot be performed using other optical methods for denoising.

© 2019 Optical Society of America under the terms of the OSA Open Access Publishing Agreement

## 1. Introduction

Recent advances in quantitative phase imaging (QPI) offer an extended opportunity for the label-free, non-destructive, and quantitative study of biological specimens [1]. As a scheme for three-dimensional (3D) QPI, optical diffraction tomography (ODT) is an imaging method that uses angularly varying illumination to reconstruct the 3D refractive index (RI) distribution of a microscopic sample. Since RI, an intrinsic optical property of a sample, provides morphological and biochemical information, ODT has been successfully applied to various fields, including histopathology [2,3], hematology [4–7], microbiology [8], cell biology [8–10], and nanotechnology [11]. The image quality of a reconstructed tomogram can be degraded by noise originating from the use of coherent illumination (Fig. 1). Unwanted interference of the coherent light generates this noise in the form of fringe patterns and speckle grains [12], mainly caused by multiple reflections from optical elements and dust particles. Misalignment of the optical system can also deteriorate the reconstructed tomogram. We refer to this category of noise as “coherent noise” throughout this paper.

To remedy the coherent noise, numerous studies involving modifications of experimental setups or additional data capturing have been conducted [13–18]. Unfortunately, this class of methods works only when the imaging system has sufficient stability during measurement. That is, it is challenging to remove the time-varying noises emerging from light source spectrum fluctuations, electro/mechanical vibrations, or focal drifts caused by thermal or gravitational effects. Moreover, incoherent ODT has also been proposed to sidestep the coherence issue, but a short coherence length brings practical drawbacks, including dispersion effects and limited angles of illumination [19,20]. Alternatively, numerical methods that exploit statistical knowledge can aid in the suppression of the coherent noise, and may address the time-varying noise via post-processing. However, these approaches must assume specific statistics (e.g., Gaussian [21], Poisson [22], or a zero-mean probability distribution [23,24]) or need prior knowledge (e.g., sparsity [25,26]) to enforce the denoising process, which limits their direct application to noise of unknown statistics.

In recent years, data-driven approaches, such as deep learning and machine learning, have been a powerful workhorse for various optical imaging problems [27], including resolution enhancement [28,29], classification [30–36], in silico fluorescence imaging [37,38], light scattering [39–41], phase recovery [42], optical system design [43], and noise reduction [44]. With a sufficiently large data set, a deep neural network embracing non-linear activation functions can approximate any continuous function in the real domain, as first proved by Cybenko’s work [45]. Hence, deep neural networks have the potential to design image-to-image transformation models for specific purposes.

A primary disadvantage of the existing networks is the requirement of “image pairs” (e.g., <low-resolution image, high-resolution image>, <brightfield image, fluorescent image>, <brightfield image, phase image>, <speckle pattern, sample image>, and <noisy image, denoised image>). However, in practice, it is often demanding or impossible to obtain such paired training data to use deep learning for denoising tomograms. Obtaining a clean tomogram and the paired coherent noisy tomogram, caused by the thermal focal drift or the inherent system instability, can be difficult and labor-intensive. In addition, preparing such input-output pairs may result in image registration issues.

Our approach for denoising tomograms employs a deep learning framework that takes “unpaired” tomogram sets for training. The deep neural network, inspired by the cycle-generative adversarial network (cycle-GAN), statistically learns a transform between two different image domains (i.e., clean and noisy tomograms), rather than relating images one to one in pairs. The trained network was tested on removing coherent noise in tomograms of silica microbeads for quantitative validation. The performance of the network was also confirmed through several experiments on biological cells never seen by the network during training, demonstrating its generality and potential applicability. Lastly, but most importantly, the denoising network successfully removed the coherent noise in a time-lapse HeLa cell imaging experiment.

## 2. Deep neural network for denoising tomograms

The goal of the proposed deep neural network is to learn a high-dimensional function translating between a noisy image domain *X* and a clean image domain *Y*, and to denoise a 2D laterally sliced tomogram via a trained generator, *G_{XY}*: *X* → *Y*. Two functions, *G_{XY}*: *X* → *Y* and *G_{YX}*: *Y* → *X*, were trained using two discrimination losses, ${L}^{{D}_{Y}}$ and ${L}^{{D}_{X}}$, where *D_{Y}* and *D_{X}* are discriminator functions that attempt to discriminate between a real image and an image generated by *G*. For instance, ${L}^{{D}_{Y}}$ computes how closely the denoised image *G_{XY}*(*x*) follows the true data distribution *y* ~ *P*_{data(y)}. To enhance optimization convergence, two reconstruction losses, called cycle-consistency losses, were introduced. We minimized the losses comparing an input and a generated image passing through the two mirrored functions, *G_{XY}* and *G_{YX}*. Finally, the trained *G_{XY}* outputs the denoised tomogram image.

#### 2.1 Data acquisition using optical diffraction tomography

We first summarize the ODT reconstruction procedure and the coherent noise for a better understanding of our data set. Every sample of interest was scanned at various illumination angles to obtain holograms, using a commercialized ODT imaging setup (HT-2H, Tomocube Inc., Republic of Korea), as detailed in Appendix A. Then, 2D optical fields at the sample plane were retrieved from each captured hologram using a field retrieval algorithm exploiting spatial filtering [46]. Following the Fourier diffraction theorem, formulated by Wolf [47], the 3D RI tomogram of the sample was reconstructed. Missing information owing to the limited bandwidth of the system was computationally regularized using a non-negativity constraint [48]. In theory, the spatial resolutions in the lateral and axial directions were 110 nm and 360 nm, respectively.
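The field retrieval step can be illustrated with a minimal numpy sketch (our own toy example, not the authors' code): a synthetic off-axis hologram is formed from a known object field and a tilted reference, and the complex field is recovered by isolating one sideband in the Fourier domain [46]. The carrier frequency, filter window size, and test phase below are all assumptions chosen for illustration.

```python
import numpy as np

N = 256
y, x = np.mgrid[0:N, 0:N]

# Synthetic object: unit amplitude with a smooth, cell-like phase bump.
phase = 2.0 * np.exp(-((x - N / 2) ** 2 + (y - N / 2) ** 2) / (2 * 40.0 ** 2))
obj = np.exp(1j * phase)

# Off-axis reference with an integer carrier frequency (60 cycles per FOV).
fc = 60
ref = np.exp(1j * 2 * np.pi * fc * (x + y) / N)

hologram = np.abs(obj + ref) ** 2          # intensity recorded by the camera

# Spatial filtering: the o*conj(r) sideband sits at (-fc, -fc) in the spectrum.
F = np.fft.fftshift(np.fft.fft2(hologram))
fy, fx = np.mgrid[0:N, 0:N]
mask = (fy - (N // 2 - fc)) ** 2 + (fx - (N // 2 - fc)) ** 2 < 25 ** 2
field = np.fft.ifft2(np.fft.ifftshift(np.roll(F * mask, (fc, fc), axis=(0, 1))))

# Up to a constant offset, the retrieved phase matches the ground truth.
residual = np.angle(field * np.conj(obj))
```

In the actual pipeline, the 2D fields retrieved at many illumination angles feed the Fourier diffraction theorem reconstruction [47].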

Two examples of 2D tomograms at different axial depths are shown in Figs. 1(d)–1(e). The coherent noise disturbs both the cell features and the background in the tomogram at $\Delta z=0\text{\hspace{0.17em}}\mu m$, while the tomogram slice at $\Delta z=3.9\text{\hspace{0.17em}}\mu m$ clearly visualizes the subcellular organelles. A 3D isosurface image of the reconstructed NIH3T3 tomogram is displayed in Fig. 1(f).

To train the proposed network, we prepared a data set containing 2D tomograms of NIH3T3 cells using the ODT protocol explained above. The reconstructed 3D tomograms were center-cropped to a dimension of 256 × 256 × 100 voxels. Because the coherent noises, as well as the cells, were not uniformly spread along the axial direction, we first located the focal plane using one of the widely used focus measures, the Brenner gradient [49], and extracted 25 sliced images centered at the determined focus from each 3D tomogram. Then, the 2D sliced tomograms were annotated by two experienced ODT users (refer to Appendix D) and categorized into two sets: (1) *x_{i}* ∈ *X*: noisy tomograms, and (2) *y_{i}* ∈ *Y*: clean tomograms (see Fig. 2(a)). We denoted the data distributions as *x* ~ *P*_{data(x)} and *y* ~ *P*_{data(y)}; here, *x* ~ *P* means that the random variable *x* has the probability density distribution *P*. Again, we emphasize that, in contrast with conventional deep learning frameworks that benefit from a one-to-one paired set, two different data sets (*X* and *Y*) were prepared to train our denoising network. The data sets *X* and *Y* (denoted as (1) and (2) in Fig. 2(a)) finally contained 455 and 5057 tomograms, respectively; there are fewer noisy images because they are harder to obtain. Note that we did not use any particular technique to overcome this data imbalance; we randomly sampled batches from the two groups so that the algorithm could use the whole set. The test data, examined below, comprised tomograms of silica microbeads and of HeLa and MDA231 cells. Details about sample preparation can be found in Appendix B.
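The focus search described above can be sketched as follows: the Brenner gradient rewards strong intensity differences between pixels two apart, so the sharpest axial slice scores highest. The pixel shift of 2 and the helper names are our own assumptions, not the authors' implementation.

```python
import numpy as np

def brenner(img, shift=2):
    """Brenner focus measure: sum of squared differences `shift` pixels apart."""
    return float(np.sum((img[:, shift:] - img[:, :-shift]) ** 2))

def slices_around_focus(vol, n=25):
    """Extract n axial slices centred on the sharpest plane of a (Z, Y, X) volume."""
    scores = [brenner(s) for s in vol]
    z0 = int(np.argmax(scores))
    lo = max(0, min(z0 - n // 2, vol.shape[0] - n))
    return vol[lo:lo + n]

# Toy volume: mostly flat slices, with high-frequency content at z = 50.
rng = np.random.default_rng(0)
vol = np.full((100, 64, 64), 1.337)
vol[50] += rng.normal(0, 0.01, (64, 64))
focused = slices_around_focus(vol)          # 25 slices centred near z = 50
```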

#### 2.2 Architecture

To train the data sets, we employed a deep neural network, motivated by cycle-GAN [50], composed of two generator functions, *G_{XY}* and *G_{YX}*, and two discriminators, *D_{Y}* and *D_{X}*. First, the two generators, built on the U-net architecture, learn a statistical model that maps between domain *X* and domain *Y* by performing pixel-wise regression. *G_{XY}* attempts to make a clean image from every input image annotated as a noisy tomogram, ${x}_{i}$, while *G_{YX}* performs exactly the inverse task. Second, the two discriminators, built on the PatchGAN [51] architecture, aim to differentiate a real image from an artificial image generated via *G*. *D_{Y}* and *D_{X}* discriminate between the translated images, *G_{XY}*(*x*) and *G_{YX}*(*y*), and the real images, *y* and *x*, respectively. This architecture is illustrated in more detail in Appendix C.

#### 2.3 Loss function

To train the proposed network, we solved a min-max optimization problem with four loss functions. That is, the network was trained such that *G* and *D* competed against each other to minimize or maximize the loss functions. These functions were designed to capture, on the one hand, how well *G* maps one domain into the other and, on the other hand, how well *D* discriminates between generated and real images. Again, our goal was to train the network through direct competition between *G* and *D* so that *G* could perform the denoising task with enough accuracy after training.

First of all, adversarial losses were applied to both discriminator functions, *D_{Y}* and *D_{X}*. For the function *G_{XY}*: *X* → *Y* and the corresponding discriminator *D_{Y}*, the loss function can be formulated as

$${L}^{{D}_{Y}}({G}_{XY},{D}_{Y})={E}_{y\sim {P}_{data(y)}}[\log {D}_{Y}(y)]+{E}_{x\sim {P}_{data(x)}}[\log (1-{D}_{Y}({G}_{XY}(x)))],$$

where *G_{XY}* aims to minimize this function while the adversarial function *D_{Y}* aims to maximize it. For *G_{YX}*: *Y* → *X* and the corresponding discriminator *D_{X}*, the loss function ${L}^{{D}_{X}}({G}_{YX},{D}_{X})$ can be formulated in the same manner. The two generators, *G_{YX}* and *G_{XY}*, should perform exactly inverse operations in theory, which means that *G_{YX}*(*G_{XY}*(*x*)) should return *x*. Hence, we formulated one cycle-consistency loss as

$${L}_{cyc}({G}_{XY},{G}_{YX})={E}_{x\sim {P}_{data(x)}}[\Vert {G}_{YX}({G}_{XY}(x))-x\Vert ],$$

and the mirrored loss comparing *G_{XY}*(*G_{YX}*(*y*)) with *y* in the same manner.
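As a concrete (toy) illustration of these losses, the sketch below uses stand-in functions in place of the real U-net generators and PatchGAN discriminator: the "denoiser" removes a background offset, the "noiser" adds one, and the "discriminator" outputs a probability that an image is clean. All of these stand-ins are our own assumptions, for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for G_XY, G_YX and D_Y (the real ones are deep networks).
G_xy = lambda img: img - img.mean()                  # "denoise": remove offset
G_yx = lambda img: img + 0.1                         # "noise": add an offset
D_y = lambda img: 1.0 / (1.0 + np.exp(5.0 * abs(img.mean())))  # "clean" score

x = rng.normal(0.1, 0.02, (8, 8))                    # sample from noisy domain X
y = rng.normal(0.0, 0.02, (8, 8))                    # sample from clean domain Y

# Adversarial loss for D_Y (standard GAN form, single-sample estimate).
L_Dy = -np.log(D_y(y)) - np.log(1.0 - D_y(G_xy(x)))

# Cycle-consistency loss: G_YX(G_XY(x)) should return x.
L_cyc = np.mean(np.abs(G_yx(G_xy(x)) - x))
```

In training, *D_{Y}* ascends this adversarial loss while *G_{XY}* descends it, and the cycle-consistency term is small exactly when the two generators invert each other.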

## 3. Results and discussions

Here, we optimized our deep neural network and experimentally verified the performance of the optimized denoising network using the ODT imaging system (HT-1H, Tomocube Inc., Republic of Korea). The samples were imaged and reconstructed according to the optical setup and procedure detailed in Fig. 1 and Fig. 8. All the reconstructed tomograms had 256 × 256 × 100 voxels, a lateral resolution of 110 nm, and an axial resolution of 360 nm. Thus, the field of view (FOV) of all the 2D sliced tomogram images is 28.16 μm × 28.16 μm.

#### 3.1 Model optimization

To optimize the proposed method, we implemented and compared three networks that, as summarized in Fig. 3, differed in the loss function and up-sampling block of the generator. The sample consisted of NIH3T3 cells, which were also used in the training stage. Figures 3(a1)–3(a3) display the original tomogram and two representative subcellular features for comparison. We also examined the Fourier transform of the tomograms.

First, we used the ${l}_{1}$ loss function, along with a naïve U-net architecture for the generator (adopted from [52]), to optimize our neural network; the ${l}_{1}$ loss function is known to perform well even in the presence of outliers. Figure 3(b) displays the denoised tomogram. However, Figs. 3(b2)–3(b3) show that the subcellular structure is not resolved and, as marked by the arrows, the clear round feature is missing. Moreover, checkerboard artifacts are widely dispersed across the denoised tomogram, appearing as a grid in both the image (Fig. 3(b1)) and the Fourier spectrum (Fig. 3(b4)). These artifacts arise from the overlapping computations of sliding convolution operations, a phenomenon studied intensively in [53].

Next, to address the checkerboard artifacts, we used a resize-convolution U-net [53] with the *l*_{1} loss function, as illustrated in Fig. 3(c). This change led to an improved reconstruction result, displaying a better resolution of the indicated features, although not as clear as in the original image.
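The origin of the checkerboard artifact, and why resize-convolution avoids it, can be seen in a 1D toy computation (our own illustration): a stride-2 transposed convolution of a constant signal with a constant kernel produces unevenly overlapped outputs, whereas nearest-neighbor upsampling followed by an ordinary convolution keeps the overlap uniform.

```python
import numpy as np

def transposed_conv1d(x, k, stride=2):
    """Stride-2 transposed convolution: scatter-add shifted kernel copies."""
    out = np.zeros(len(x) * stride + len(k) - stride)
    for i, v in enumerate(x):
        out[i * stride:i * stride + len(k)] += v * k
    return out

x, k = np.ones(8), np.ones(3)

# Uneven kernel overlap -> alternating values (the 1D "checkerboard").
deconv = transposed_conv1d(x, k)

# Resize-convolution: nearest-neighbor upsample, then a plain convolution.
up = np.repeat(x, 2)
resize_conv = np.convolve(up, k, mode="valid")      # uniform interior values
```

The interior of `deconv` alternates between two values, while `resize_conv` is constant; in 2D the same mismatch produces the grid seen in Fig. 3(b).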

Finally, we employed a structural similarity index (SSIM) loss [54], instead of the *l*_{1} function, along with the same resize-convolution U-net. The result conserves the detailed features of the cell, noted by the arrows, while the coherent noise in the tomogram is eliminated. In addition, the corresponding Fourier spectrum shows reduced artifacts in comparison with the other models. We adopted this final model as our denoising network, which was used to obtain all the results reported below.
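The SSIM used in the loss compares two images through their luminance, contrast, and structure; a minimal global (single-window) version is sketched below. Real SSIM implementations use local sliding windows [54]; the constants follow the common defaults, and the simplification to one global window is ours.

```python
import numpy as np

def ssim_global(a, b, data_range=1.0, k1=0.01, k2=0.03):
    """Single-window SSIM between two images (1.0 means identical)."""
    c1, c2 = (k1 * data_range) ** 2, (k2 * data_range) ** 2
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / \
           ((mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2))

rng = np.random.default_rng(0)
img = rng.random((64, 64))
noisy = img + rng.normal(0, 0.2, img.shape)
```

Identical images score 1.0, and added noise lowers the score, which is why maximizing SSIM as a cycle-consistency term preserves structural detail better than a pixel-wise *l*_{1} penalty.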

#### 3.2 Quantitative validation

To quantitatively validate the proposed method, we measured the tomograms of silica microbeads (5-μm-diameter, 44054-5ML-F, Sigma-Aldrich Inc., USA). A different imaging setup (HT-1H, Tomocube Inc., Republic of Korea), which was not used to obtain the NIH3T3 training data, was utilized for the acquisition of the microbead tomograms to test the generalizability of the present method. As shown in Fig. 4(a), the captured tomograms had unwanted coherent noise, which led to the deterioration of the image quality. The cropped region at the top left corner displays the noise in the form of fringe patterns and speckle grains. The mean value (MV) and the standard deviation (SD) of the RI in this region are 1.3378 and 0.0019, respectively.

Furthermore, as displayed in Fig. 4(b), the proposed method reduces the coherent noise in the tomograms without a significant loss of the sample region. Specifically, the generator ${G}_{XY}$, trained within the whole network, translated the noisy input tomogram into a clear tomogram. Regarding the background region, the MV and the SD in the denoised tomogram are 1.3369 and 0.0003, respectively, which indicates that, compared to the initial tomogram, the RI values show better agreement with the theoretical value (n_{medium} = 1.337).

Moreover, the histograms and line profiles of the RI values of the original and denoised tomograms further validate the present method. First, Fig. 4(c) displays the data distributions of the background regions enclosed by the blue and orange boxes; a sharper RI distribution can be observed in the denoised tomogram around 1.337. Second, the profiles along the center of the bead demonstrate the consistency of the RI values in the sample region after the application of the present method, as illustrated in Fig. 4(d). Not only is the background region significantly denoised, but the original and denoised RI maps also show a close match. It is also noteworthy that the fabrication error of the microbeads makes it challenging to assess the bead region accurately.
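The MV/SD background check used above amounts to the following short sketch; the tomogram, box coordinates, and noise level are hypothetical stand-ins for the measured data.

```python
import numpy as np

N_MEDIUM = 1.337  # DPBS at 532 nm, as stated in the text

def background_stats(tomo, box):
    """Mean value (MV) and standard deviation (SD) of RI in a background box."""
    y0, y1, x0, x1 = box
    region = tomo[y0:y1, x0:x1]
    return float(region.mean()), float(region.std())

# Hypothetical denoised tomogram: background fluctuating mildly around n_medium.
rng = np.random.default_rng(1)
tomo = N_MEDIUM + rng.normal(0.0, 3e-4, (256, 256))
mv, sd = background_stats(tomo, (0, 64, 0, 64))     # e.g. a corner box
```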

#### 3.3 Comparison to non-data-driven approaches

Next, we benchmarked our data-driven approach against widely used non-data-driven denoising approaches, including block matching and 3D filtering (BM3D), total variation (TV) minimization, and Haar wavelet shrinkage (refer to Appendix G for more details). While our optimized and trained network denoised the HeLa tomogram without further parameter tuning, we ran each non-data-driven algorithm with three different values of its governing parameter: *σ*, the standard deviation of the Gaussian distribution for BM3D (1, 5, and 40); *λ*, the regularization weight between the data-fidelity term and the TV term for TV denoising (0.1, 0.5, and 10); and *τ*, the thresholding value for wavelet shrinkage (0.0001, 0.0028, and 0.1).

The comparison on denoising the 2D HeLa tomogram is depicted in Fig. 5. The non-data-driven algorithms, across the tested parameters, either remove the coherent noise by blurring the entire image (BM3D and TV) or perform no effective denoising (wavelet), whereas our data-driven approach simultaneously removes the noise and conserves the sample features (e.g., the subcellular structures inside the HeLa cell in this case). BM3D requires accurate prior knowledge of the standard deviation of Gaussian noise, which is not appropriate in our case: fundamentally, the noise statistics of our tomograms do not follow a Gaussian distribution; rather, they are dominated by fringe-like patterns that are difficult to model analytically. TV minimization, which obtains a denoised image by minimizing the total variation, has a similar limitation because it assumes that high total variation indicates noise, which is not necessarily the case here. Lastly, the wavelet-based method performs no effective denoising on tomograms with coherent noise; in our experiment, the wavelet basis representation does not separate the image signal from the fringe noise, so shrinkage (i.e., thresholding) of small wavelet coefficients fails to denoise the tomograms.

#### 3.4 Demonstration in various biological samples

To further validate our method, we tested the trained network on various tomograms of eukaryotic cells, including HeLa, NIH3T3, and MDA231 cells. Figure 6 displays the degraded tomograms (first row) and the tomograms denoised using our method (second row). We first applied our network to the pre-split NIH3T3 data set, which was not used for training. The degraded tomograms, where the coherent noise is clearly visible as fringe patterns, were significantly denoised. As displayed in the second and third columns of Fig. 6, we then reduced the coherent noise in the HeLa and MDA231 tomograms to test the generalization capability of our model. As shown in the denoised tomograms, the coherent noise was effectively reduced and the cellular characteristics were conserved in both cases.

Finally, we validated the denoising network through time-lapse imaging of HeLa cells, a principal application of our method. We tomographically imaged the HeLa cells for 30 min at 10-min intervals. Either a thermal focal drift or a slight change of the specimen could generate a path length difference in the beam path of the ODT imaging system. Hence, as indicated by the orange arrows in Fig. 7 (top row), although we did not modify the imaging system at all, unwanted noise appeared in the same axial plane and intensified during this time-lapse experiment. Figure 7(b) shows the tomograms denoised using our trained network, corresponding to the images in the first row. Though the observed sample information changes slightly over the lapse because of the focal drift (e.g., some subcellular or cellular compartments, indicated by black arrows, become faint), the fringe artifacts are effectively eliminated. To compare the original and denoised images more quantitatively, we focused on a noisy, sample-free region near the cell (Fig. 7(c)). Again, we confirmed that the noise level escalating over time, quantified by the standard deviation, diminished from 0.7–0.9 × 10⁻³ (original) to around 0.3 × 10⁻³ (denoised) (Fig. 7(d)).

## 4. Conclusions

We have proposed and experimentally validated a deep learning algorithm that suppresses the coherent noise in refractive index tomograms. The deep neural network learns a statistical transformation between two “unpaired” tomogram data sets, without any prior knowledge or additional constraints which are enforced in most numerical algorithms. We demonstrated its quantitative denoising performance and generalization capability through various biological experiments. Furthermore, in contrast with other optical methods for denoising, the presented method effectively eliminated time-varying noise, generated by thermal focal drift, in time-lapse imaging.

One of the primary questions that “data-driven” approaches should answer concerns their generalization ability: can we really suppress the coherent noise in a wide variety of tomograms using a model trained on a specific data set? We qualitatively observed that the denoising performance on NIH3T3 cells somewhat exceeds that on other cells, such as HeLa and MDA231, which have fundamentally different data distributions. We anticipate that greater diversity in the training data set will improve the generalization ability of the algorithm; this is clearly something to consider when building a better data-oriented model. Secondly, transfer learning may enhance the model for a specific purpose: instead of retraining the model on every existing data set, it would be time-efficient to adapt the already trained deep learning model using an additional data set.

There are several further directions for future work. First, one could extend the current 2D deep learning framework to a network aimed at denoising whole 3D tomograms using volumetric convolutions. The 3D network would exploit additional information from the 3D tomograms (e.g., correlations between different sliced tomograms) to help improve the denoising performance. Second, though we employed a naïve grid search for parameter optimization here, the large number of hyperparameters and layer designs involved in training the deep learning algorithm can be tuned using other cutting-edge deep learning technologies, such as reinforcement learning for architecture/parameter search, to improve the denoising performance. The algorithmic parameters include the number and shape of the filters, the batch size, the optimizer, the learning rate, the regularization constant, and the initialization of the convolutional filters. Lastly, we envision that the present approach could be leveraged for the removal of noises belonging to other categories, such as shot noise and Gaussian noise, which prevail in many imaging modalities, including fluorescence imaging, computerized tomography, and X-ray imaging.

## Appendix A Optical system of ODT

The optical diffraction tomography setup, based on Mach-Zehnder interferometry, is depicted in Fig. 8. A diode-pumped solid-state laser beam (532 nm wavelength, 10 mW, MSL-S-532-10 mW, CNI laser, China) is coupled into a 1 × 2 fiber coupler (OZ Optics, Ltd., Canada) and split into a sample beam and a reference beam. The sample beam, passing through lens 1, is diffracted into many orders by a digital micromirror device (DLP6500FYE, Texas Instruments, USA), which provides angularly varying illumination onto the sample.

The first-order diffracted beam, conveyed by a condenser lens (numerical aperture (NA) = 0.7, ×60), impinges on the sample and is collected by an objective lens (×60, NA = 0.8). The light scattered by the sample then interferes with the reference beam to form a spatially modulated hologram at the CMOS camera (FL3-U3-13Y3M-C, FLIR Systems, USA). The captured holograms are processed to reconstruct a 3D tomogram, as explained in the main text.

## Appendix B Sample preparation

The HeLa cells (CCL-2, ATCC, Manassas, VA, USA), MDA231 cells (HTB-26, ATCC), and NIH3T3 cells (CRL-1658, ATCC) were maintained in Dulbecco’s modified Eagle’s medium (DMEM; Life Technologies) containing 10% (vol/vol) fetal bovine serum (FBS; Gibco, Gaithersburg, MD, USA), 100 U/mL penicillin, and 100 U/mL streptomycin in humidified air (10% CO_{2}) at 37°C.

Silica microbeads (5-μm diameter, 44054-5ML-F, Sigma-Aldrich Inc., USA) were diluted in DPBS (*n* = 1.337 at 532 nm) and placed between a microscope slide and a coverslip before imaging.

## Appendix C Architecture and implementation of deep neural network

The proposed deep neural network is composed of two generators (G_{XY} and G_{YX}) and two discriminators (D_{Y} and D_{X}), as illustrated in Fig. 2(b) of the main text. Each generator is trained to make outputs that are indistinguishable to its discriminator, which is in turn trained to distinguish the fake outputs; G_{XY} and G_{YX} correspond to D_{Y} and D_{X}, respectively. The generator is based on an encoder-decoder architecture consisting of a down-sampling encoder and a symmetric up-sampling decoder, inspired by U-net [52], which has been successfully utilized in biomedical imaging tasks. Five different types of blocks, colored differently for highlighting, constitute the generator, as depicted in Fig. 9.

The encoder learns the overall features of the input image at different scales through a chain of down-sampling operations. The input image is first convolved with 4 × 4 filters, yielding 64 feature maps. Then, the feature maps are extracted via successive applications of 4 × 4 convolutions and a non-linear leaky rectified linear unit (LeakyReLU) [55], followed by batch normalization [56], typically used to reduce internal covariate shift and boost the training speed. The number of 4 × 4 convolutional filters (stride 2, padding 1) doubles until it reaches 512, and the image size halves at each stride-2 convolution layer. Hence, at the end of the encoder, right before the decoder, the number of feature maps has increased from 1 to 512 and the image dimension has been squeezed to 1 × 1.
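Under the assumption of a 256 × 256 input (the lateral tomogram size used here), the encoder's feature-map bookkeeping can be verified in a few lines:

```python
# Channel/size progression of the encoder: each 4x4 stride-2 convolution halves
# the spatial size; the channel count goes 1 -> 64, then doubles, capped at 512.
size, channels = 256, 1
trace = []
while size > 1:
    channels = 64 if channels == 1 else min(channels * 2, 512)
    size //= 2
    trace.append((size, channels))
```

Eight stride-2 layers take the map from 256 × 256 × 1 down to 1 × 1 × 512, matching the text.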

The decoder attempts to upsample the acquired features to obtain a single image with the same dimensions as the input. The up-convolution block is composed of a ReLU, upsampling by a factor of 2, a 3 × 3 2D convolution (stride 1, padding 1), and a 2D batch-normalization layer. In contrast to the contracting encoder, at every layer on the decoding path the number of feature maps halves and the image size doubles. To preserve spatial information, the feature maps at each layer on the encoding path are concatenated to the feature maps at the corresponding layer on the decoding path, as described by the gray arrows in Fig. 9. Finally, a 4 × 4 convolutional filter transforms the features back to an image of the original dimensions. A generator can be either the de-noising network or the noising network, depending on the input/output data composition and its respective training: *G_{XY}* and *G_{YX}* are the de-noising and noising networks for 2D sliced tomograms, respectively, as diagrammed in Fig. 2(b).

The patchGAN [57] based discriminator aims to classify whether an image is real or fake, as illustrated in Fig. 10. Unlike the original GAN [58,59] discriminator, which classifies whether the whole image is realistic, the patchGAN discriminator determines whether patches of the image, 70 × 70 in our model, are realistic; that is, each pixel of the L8 feature map corresponds to a 70 × 70 patch of the input image. Note that the patch size can be chosen flexibly depending on the network architecture. By averaging the outputs over all patches, we obtain a scalar value indicating whether the whole image is realistic. Since the patchGAN discriminator analyzes specific areas of an image, it leads the generator to produce more detailed images compared to the original GAN discriminator [60].
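The 70 × 70 patch size quoted above can be recomputed from the usual PatchGAN layer stack; the specific stack (three stride-2 and two stride-1 convolutions, all with 4 × 4 kernels) is the standard 70 × 70 configuration from [57], assumed here rather than read off Fig. 10.

```python
def receptive_field(layers):
    """Receptive field of a conv stack; layers = [(kernel, stride), ...], first to last."""
    r = 1
    for k, s in reversed(layers):
        r = (r - 1) * s + k
    return r

# Assumed 70x70 PatchGAN stack: three stride-2 convs, then two stride-1 convs.
patchgan = [(4, 2), (4, 2), (4, 2), (4, 1), (4, 1)]
```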

We adopted an open-source code package for the cycle-consistent GAN from [61] and adapted the model to our tomogram de-noising problem. We kept the hyperparameters already tuned in the open-source package, except that we grid-searched the batch size (16), considering our computational resources. The model was implemented with PyTorch 0.3.1 and Python 3.6. With 40 Intel Xeon E5-2630 v4 CPU cores and 8 GeForce GTX 1080 Ti GPUs (11 GB RAM each), training took 14 hours, with early stopping at epoch 500. After training, denoising a single 2D tomogram takes only 0.016 s.

To quantify and statistically analyze the results, we used MATLAB 2017b on an Intel Core i5-7500 personal computer. For the iso-surface rendering of the tomogram in Fig. 1(f) of the main text, we used commercial visualization software (TomoStudio, Republic of Korea).

## Appendix D Data annotation

The primary issue of unsupervised learning with an unpaired data set is a weak mapping function between the image domains. To mitigate this, we excluded tomograms with “weak” fringe patterns (Fig. 11), which are ambiguous to annotate, from the training set, so that the set consisted only of perceptually very clean and very noisy tomograms.

## Appendix E Failure cases for denoising

Figure 12 depicts three examples of 2D tomograms for which an early version of the algorithm (a naïve model adopted from Ref. [61]) failed to correctly learn the mapping function between clean and noisy tomograms. In Fig. 12(a), the network could not fully remove a strong fringe pattern: the noise disturbing both the sample and the background remains after applying the algorithm to the noisy tomogram. In Fig. 12(b), the network, having learned certain specific features of the training set, generates unwanted artifacts in the denoised tomogram: an unexpected hole appears near the image center, presumably a particular feature that the algorithm learned. Lastly, in Fig. 12(c), because the network is trained to remove fringe-like noise, ambiguous noises are not removed, as marked by the arrows.

## Appendix F Identity mapping

To further demonstrate the feasibility of our method, we present tomograms passed through the denoising network, G_{XY}, as well as the noising network, G_{YX}. The first row of Fig. 13 shows, from left to right, a noisy tomogram, the denoised tomogram, the re-noised tomogram, and the error map between the original and the identity-mapped tomogram. Similarly, the processed tomograms starting from a clean image are displayed in the second row. The small values in the two error maps, related to the cycle-consistency losses, verify that our network was correctly trained.

## Appendix G Non-data-driven approaches

**Block matching and 3D filtering (BM3D).** The key idea of BM3D is to exploit sparsity, enhanced by block matching that stacks similar data patches from different locations into a 3D transform domain, and to attenuate the noise via shrinkage of the transform spectrum. Here we used the discrete cosine transform for the sparse representation and achieved the shrinkage via collaborative Wiener filtering, which requires an estimate of the standard deviation of the Gaussian noise. In Fig. 5(c), three different values are used for the standard deviation: 1, 5, and 40. We adapted the released MATLAB code package [62], which implements the BM3D proposed in [63].

**Total variation (TV) minimization.** We performed ${l}_{1}$-norm minimization with a total variation constraint, solving the optimization problem with a primal-dual algorithm adapted from the released MATLAB code [64]. In Fig. 5(d), the number of iterations is 100 and three regularization weights (λ = 0.1, 0.5, and 10) were used to control the amount of denoising. Note that a larger weight implies stronger denoising and hence more smoothing.
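For intuition, a crude TV denoiser can be sketched in a few lines of numpy using gradient descent on a smoothed TV penalty; the primal-dual solver used in the paper [64] is more sophisticated, and the step size, smoothing constant, and weight below are our own choices.

```python
import numpy as np

def tv_denoise(f, lam=0.2, n_iter=200, tau=0.2, eps=1e-3):
    """Gradient descent on 0.5*||u - f||^2 + lam * sum sqrt(|grad u|^2 + eps)."""
    u = f.copy()
    for _ in range(n_iter):
        ux = np.roll(u, -1, axis=1) - u
        uy = np.roll(u, -1, axis=0) - u
        mag = np.sqrt(ux ** 2 + uy ** 2 + eps)
        px, py = ux / mag, uy / mag
        # divergence of the normalized gradient field
        div = px - np.roll(px, 1, axis=1) + py - np.roll(py, 1, axis=0)
        u -= tau * ((u - f) - lam * div)
    return u

# Noisy piecewise-constant test image; larger lam gives stronger smoothing.
rng = np.random.default_rng(0)
clean = np.zeros((64, 64))
clean[20:44, 20:44] = 1.0
noisy = clean + rng.normal(0, 0.1, clean.shape)
denoised = tv_denoise(noisy)
```

On piecewise-constant images with white noise this works well; on fringe-dominated tomograms, as argued in the main text, the high-gradient fringes themselves are penalized and the image is blurred.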

**Haar wavelet shrinkage.** As numerous existing denoising techniques operate in the wavelet transform domain [65,66], we also attempted to denoise our tomogram using the Haar wavelet transform (i.e., Daubechies 1). We used MATLAB built-in functions for the sparse representation of the HeLa tomogram in Fig. 5. Then, three different threshold values were used for the shrinkage of weak coefficients in a two-level wavelet decomposition: τ = 0.0001, 0.0028, and 0.1.
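One level of the orthonormal Haar transform and the soft-thresholding shrinkage can be written out in a few lines of numpy. This is a simplified one-level sketch rather than the two-level MATLAB pipeline used for Fig. 5, and the threshold `tau` is illustrative:

```python
import numpy as np

def haar2(x):
    """One-level orthonormal 2D Haar transform of an even-sized image."""
    a, b = x[0::2, 0::2], x[0::2, 1::2]
    c, d = x[1::2, 0::2], x[1::2, 1::2]
    ll = (a + b + c + d) / 2  # approximation band
    lh = (a - b + c - d) / 2  # horizontal detail
    hl = (a + b - c - d) / 2  # vertical detail
    hh = (a - b - c + d) / 2  # diagonal detail
    return ll, lh, hl, hh

def ihaar2(ll, lh, hl, hh):
    """Inverse of haar2 (the 2x2 Haar block matrix is self-inverse)."""
    x = np.empty((2 * ll.shape[0], 2 * ll.shape[1]))
    x[0::2, 0::2] = (ll + lh + hl + hh) / 2
    x[0::2, 1::2] = (ll - lh + hl - hh) / 2
    x[1::2, 0::2] = (ll + lh - hl - hh) / 2
    x[1::2, 1::2] = (ll - lh - hl + hh) / 2
    return x

def soft(w, tau):
    """Soft-threshold: shrink coefficients toward zero by tau."""
    return np.sign(w) * np.maximum(np.abs(w) - tau, 0.0)

def haar_denoise(img, tau):
    """Keep the approximation band, shrink the three detail bands."""
    ll, lh, hl, hh = haar2(img)
    return ihaar2(ll, soft(lh, tau), soft(hl, tau), soft(hh, tau))
```

A multi-level decomposition, as used in our experiments, simply applies `haar2` recursively to the `ll` band before thresholding.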

## Funding

KAIST, Tomocube, and National Research Foundation of Korea (2015R1A32066550, 2017M3C1A3013923, 2018K000396).

## Acknowledgments

The authors thank Soomin Lee and Jiwon Kim for providing the test data set of the HeLa cells and microbeads. The authors also thank Yoonseok Baek for his constructive review on the manuscript. Y. Jo acknowledges support from KAIST Presidential Fellowship and Asan Foundation Biomedical Science Scholarship.

## References

**1. **Y. Park, C. Depeursinge, and G. Popescu, “Quantitative phase imaging in biomedicine,” Nat. Photonics **12**(10), 578–589 (2018). [CrossRef]

**2. **S. A. Yang, J. Yoon, K. Kim, and Y. Park, “Measurements of morphological and biophysical alterations in individual neuron cells associated with early neurotoxic effects in Parkinson’s disease,” Cytometry A **91**(5), 510–518 (2017). [CrossRef] [PubMed]

**3. **K. Kim, J. Yoon, S. Shin, S. Lee, S.-A. Yang, and Y. Park, “Optical diffraction tomography techniques for the study of cell pathophysiology,” J. Biomed. Photonics Eng. **2**(2), 2994 (2016). [CrossRef]

**4. **M. Lee, E. Lee, J. Jung, H. Yu, K. Kim, J. Yoon, S. Lee, Y. Jeong, and Y. Park, “Label-free optical quantification of structural alterations in Alzheimer’s disease,” Sci. Rep. **6**(1), 31034 (2016). [CrossRef] [PubMed]

**5. **J. Jung, L. E. Matemba, K. Lee, P. E. Kazyoba, J. Yoon, J. J. Massaga, K. Kim, D.-J. Kim, and Y. Park, “Optical characterization of red blood cells from individuals with sickle cell trait and disease in Tanzania using quantitative phase imaging,” Sci. Rep. **6**(1), 31698 (2016). [CrossRef] [PubMed]

**6. **J. Yoon, K. Kim, H. Park, C. Choi, S. Jang, and Y. Park, “Label-free characterization of white blood cells by measuring 3D refractive index maps,” Biomed. Opt. Express **6**(10), 3865–3875 (2015). [CrossRef] [PubMed]

**7. **H. Park, S.-H. Hong, K. Kim, S.-H. Cho, W.-J. Lee, Y. Kim, S.-E. Lee, and Y. Park, “Characterizations of individual mouse red blood cells parasitized by Babesia microti using 3-D holographic microscopy,” Sci. Rep. **5**(1), 10827 (2015). [CrossRef] [PubMed]

**8. **J. Jung, S.-J. Hong, H.-B. Kim, G. Kim, M. Lee, S. Shin, S. Lee, D.-J. Kim, C.-G. Lee, and Y. Park, “Label-free non-invasive quantitative measurement of lipid contents in individual microalgal cells using refractive index tomography,” Sci. Rep. **8**(1), 6524 (2018). [CrossRef] [PubMed]

**9. **K. Kim, K. Choe, I. Park, P. Kim, and Y. Park, “Holographic intravital microscopy for 2-D and 3-D imaging intact circulating blood cells in microcapillaries of live mice,” Sci. Rep. **6**(1), 33084 (2016). [CrossRef] [PubMed]

**10. **F. Dubois, C. Yourassowsky, O. Monnom, J.-C. Legros, O. Debeir, P. Van Ham, R. Kiss, and C. Decaestecker, “Digital holographic microscopy for the three-dimensional dynamic analysis of in vitro cancer cell migration,” J. Biomed. Opt. **11**(5), 054032 (2006). [CrossRef] [PubMed]

**11. **G. Kim, S. Lee, S. Shin, and Y. Park, “Three-dimensional label-free imaging and analysis of Pinus pollen grains using optical diffraction tomography,” Sci. Rep. **8**(1), 1782 (2018). [CrossRef] [PubMed]

**12. **S. Shin, K. Kim, K. Lee, S. Lee, and Y. Park, “Effects of spatiotemporal coherence on interferometric microscopy,” Opt. Express **25**(7), 8085–8097 (2017). [CrossRef] [PubMed]

**13. **H. Farrokhi, J. Boonruangkan, B. J. Chun, T. M. Rohith, A. Mishra, H. T. Toh, H. S. Yoon, and Y.-J. Kim, “Speckle reduction in quantitative phase imaging by generating spatially incoherent laser field at electroactive optical diffusers,” Opt. Express **25**(10), 10791–10800 (2017). [CrossRef] [PubMed]

**14. **I. Choi, K. Lee, and Y. Park, “Compensation of aberration in quantitative phase imaging using lateral shifting and spiral phase integration,” Opt. Express **25**(24), 30771–30779 (2017). [CrossRef] [PubMed]

**15. **J. W. Cui, Z. Tao, Z. B. Liu, and J. B. Tan, “Reducing coherent noise in interference systems using the phase modulation technique,” Appl. Opt. **54**(24), 7308–7315 (2015). [CrossRef] [PubMed]

**16. **Y. Kim, H. Shim, K. Kim, H. Park, S. Jang, and Y. Park, “Profiling individual human red blood cells using common-path diffraction optical tomography,” Sci. Rep. **4**(1), 6659 (2014). [CrossRef] [PubMed]

**17. **Y. Park, W. Choi, Z. Yaqoob, R. Dasari, K. Badizadegan, and M. S. Feld, “Speckle-field digital holographic microscopy,” Opt. Express **17**(15), 12285–12292 (2009). [CrossRef] [PubMed]

**18. **F. Dubois, M. L. Requena, C. Minetti, O. Monnom, and E. Istasse, “Partial spatial coherence effects in digital holographic microscopy with a laser source,” Appl. Opt. **43**(5), 1131–1139 (2004). [CrossRef] [PubMed]

**19. **K. Lee, S. Shin, Z. Yaqoob, P. T. So, and Y. Park, “Low-coherent optical diffraction tomography by angle-scanning illumination,” arXiv preprint, arXiv:1807.05677 (2018).

**20. **X. Li, T. Yamauchi, H. Iwai, Y. Yamashita, H. Zhang, and T. Hiruma, “Full-field quantitative phase imaging by white-light interferometry with active phase stabilization and its application to biological samples,” Opt. Lett. **31**(12), 1830–1832 (2006). [CrossRef] [PubMed]

**21. **E. Y. Lam, X. Zhang, H. Vo, T.-C. Poon, and G. Indebetouw, “Three-dimensional microscopy and sectional image reconstruction using optical scanning holography,” Appl. Opt. **48**(34), H113–H119 (2009). [CrossRef] [PubMed]

**22. **S. Sotthivirat and J. A. Fessler, “Penalized-likelihood image reconstruction for digital holography,” J. Opt. Soc. Am. A **21**(5), 737–750 (2004). [CrossRef] [PubMed]

**23. **J. Lehtinen, J. Munkberg, J. Hasselgren, S. Laine, T. Karras, M. Aittala, and T. Aila, “Noise2Noise: Learning Image Restoration without Clean Data,” arXiv preprint, arXiv:1803.04189 (2018).

**24. **A. Krull, T.-O. Buchholz, and F. Jug, “Noise2Void - Learning Denoising from Single Noisy Images,” arXiv preprint, arXiv:1811.10980 (2018).

**25. **V. Bianco, P. Memmolo, M. Paturzo, A. Finizio, B. Javidi, and P. Ferraro, “Quasi noise-free digital holography,” Light Sci. Appl. **5**(9), e16142 (2016). [CrossRef] [PubMed]

**26. **P. Memmolo, I. Esnaola, A. Finizio, M. Paturzo, P. Ferraro, and A. M. Tulino, “SPADEDH: a sparsity-based denoising method of digital holograms without knowing the noise statistics,” Opt. Express **20**(15), 17250–17257 (2012). [CrossRef]

**27. **Y. Jo, H. Cho, S. Y. Lee, G. Choi, G. Kim, H. Min, and Y. Park, “Quantitative Phase Imaging and Artificial Intelligence: A Review,” IEEE J. Sel. Top. Quantum Electron. **25**(1), 1–14 (2019). [CrossRef]

**28. **T. Nguyen, Y. Xue, Y. Li, L. Tian, and G. Nehmetallah, “Deep learning approach for Fourier ptychography microscopy,” Opt. Express **26**(20), 26470–26484 (2018). [CrossRef] [PubMed]

**29. **Y. Rivenson, Z. Göröcs, H. Günaydin, Y. Zhang, H. Wang, and A. Ozcan, “Deep learning microscopy,” Optica **4**(11), 1437–1443 (2017). [CrossRef]

**30. **J. Yoon, Y. Jo, Y. S. Kim, Y. Yu, J. Park, S. Lee, W. S. Park, and Y. Park, “Label-Free Identification of Lymphocyte Subtypes Using Three-Dimensional Quantitative Phase Imaging and Machine Learning,” JoVE, e58305 (2018). [CrossRef]

**31. **Y. Jo, J. Jung, M. H. Kim, H. Park, S.-J. Kang, and Y. Park, “Label-free identification of individual bacteria using Fourier transform light scattering,” Opt. Express **23**(12), 15792–15805 (2015). [CrossRef] [PubMed]

**32. **G. Kim, Y. Jo, H. Cho, H. S. Min, and Y. Park, “Learning-based screening of hematologic disorders using quantitative phase imaging of individual red blood cells,” Biosens. Bioelectron. **123**, 69–76 (2019). [CrossRef] [PubMed]

**33. **S. Rawat, S. Komatsu, A. Markman, A. Anand, and B. Javidi, “Compact and field-portable 3D printed shearing digital holographic microscope for automated cell identification,” Appl. Opt. **56**(9), D127–D133 (2017). [CrossRef] [PubMed]

**34. **T. H. Nguyen, S. Sridharan, V. Macias, A. Kajdacsy-Balla, J. Melamed, M. N. Do, and G. Popescu, “Automatic Gleason grading of prostate cancer using quantitative phase imaging and machine learning,” J. Biomed. Opt. **22**(3), 36015 (2017). [CrossRef] [PubMed]

**35. **Y. Jo, S. Park, J. Jung, J. Yoon, H. Joo, M. H. Kim, S.-J. Kang, M. C. Choi, S. Y. Lee, and Y. Park, “Holographic deep learning for rapid optical screening of anthrax spores,” Sci. Adv. **3**(8), e1700606 (2017). [CrossRef] [PubMed]

**36. **K. Mahmood, P. L. Carmona, S. Shahbazmohamadi, F. Pla, and B. Javidi, “Real-time automated counterfeit integrated circuit detection using x-ray microscopy,” Appl. Opt. **54**(13), D25–D32 (2015). [CrossRef]

**37. **C. Ounkomol, S. Seshamani, M. M. Maleckar, F. Collman, and G. R. Johnson, “Label-free prediction of three-dimensional fluorescence images from transmitted-light microscopy,” Nat. Methods **15**(11), 917–920 (2018). [CrossRef] [PubMed]

**38. **E. M. Christiansen, S. J. Yang, D. M. Ando, A. Javaherian, G. Skibinski, S. Lipnick, E. Mount, A. O’Neil, K. Shah, A. K. Lee, P. Goyal, W. Fedus, R. Poplin, A. Esteva, M. Berndl, L. L. Rubin, P. Nelson, and S. Finkbeiner, “In silico labeling: Predicting fluorescent labels in unlabeled images,” Cell **173**(3), 792–803 (2018). [CrossRef] [PubMed]

**39. **B. Rahmani, D. Loterie, G. Konstantinou, D. Psaltis, and C. Moser, “Multimode optical fiber transmission with a deep learning network,” Light Sci. Appl. **7**(1), 69 (2018). [CrossRef] [PubMed]

**40. **Y. Li, Y. Xue, and L. Tian, “Deep speckle correlation: a deep learning approach toward scalable imaging through scattering media,” Optica **5**(10), 1181–1190 (2018). [CrossRef]

**41. **S. Li, M. Deng, J. Lee, A. Sinha, and G. Barbastathis, “Imaging through glass diffusers using densely connected convolutional networks,” Optica **5**(7), 803–813 (2018). [CrossRef]

**42. **Y. Rivenson, Y. Zhang, H. Günaydın, D. Teng, and A. Ozcan, “Phase recovery and holographic image reconstruction using deep learning in neural networks,” Light Sci. Appl. **7**(2), 17141 (2018). [CrossRef]

**43. **R. Horstmeyer, R. Y. Chen, B. Kappes, and B. Judkewitz, “Convolutional neural networks that teach microscopes how to image,” arXiv preprint, arXiv:1709.07223 (2017).

**44. **W. Jeon, W. Jeong, K. Son, and H. Yang, “Speckle noise reduction for digital holographic images using multi-scale convolutional neural networks,” Opt. Lett. **43**(17), 4240–4243 (2018). [CrossRef] [PubMed]

**45. **G. Cybenko, “Approximation by superpositions of a sigmoidal function,” Math. Contr. Signals Syst. **2**(4), 303–314 (1989). [CrossRef]

**46. **E. Cuche, P. Marquet, and C. Depeursinge, “Spatial filtering for zero-order and twin-image elimination in digital off-axis holography,” Appl. Opt. **39**(23), 4070–4075 (2000). [CrossRef] [PubMed]

**47. **E. Wolf, “Three-dimensional structure determination of semi-transparent objects from holographic data,” Opt. Commun. **1**(4), 153–156 (1969). [CrossRef]

**48. **J. Lim, K. Lee, K. H. Jin, S. Shin, S. Lee, Y. Park, and J. C. Ye, “Comparative study of iterative reconstruction algorithms for missing cone problems in optical diffraction tomography,” Opt. Express **23**(13), 16933–16948 (2015). [CrossRef] [PubMed]

**49. **Y. Sun, S. Duthaler, and B. J. Nelson, “Autofocusing in computer microscopy: selecting the optimal focus algorithm,” Microsc. Res. Tech. **65**(3), 139–149 (2004). [CrossRef] [PubMed]

**50. **J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in *Proceedings of the IEEE International Conference on Computer Vision* (IEEE, 2017), pp. 2242–2251. [CrossRef]

**51. **P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-image translation with conditional adversarial networks,” arXiv preprint, arXiv:1611.07004 (2017).

**52. **O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional networks for biomedical image segmentation,” in *Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015* (Springer, Cham, 2015), Vol. 9351, pp. 234–241.

**53. **A. Odena, V. Dumoulin, and C. Olah, “Deconvolution and checkerboard artifacts,” Distill (2016).

**54. **Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE Trans. Image Process. **13**(4), 600–612 (2004). [CrossRef] [PubMed]

**55. **A. L. Maas, A. Y. Hannun, and A. Y. Ng, “Rectifier nonlinearities improve neural network acoustic models,” in *Proceedings of the International Conference on Machine Learning* (2013), p. 3.

**56. **S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” arXiv preprint, arXiv:1502.03167 (2015).

**57. **C. Li and M. Wand, “Precomputed real-time texture synthesis with Markovian generative adversarial networks,” arXiv preprint, arXiv:1604.04382 (2016).

**58. **A. Radford, L. Metz, and S. Chintala, “Unsupervised representation learning with deep convolutional generative adversarial networks,” arXiv preprint, arXiv:1511.06434 (2015).

**59. **I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in *Proceedings of Advances in Neural Information Processing Systems* (NIPS, 2014), pp. 2672–2680.

**60. **P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-image translation with conditional adversarial networks,” in *Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition* (IEEE, 2017).

**61. **J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “CycleGAN,” https://github.com/junyanz/CycleGAN (2017).

**62. **M. Maggioni, E. Sánchez-Monge, A. Foi, A. Danielyan, K. Dabov, V. Katkovnik, and K. Egiazarian, “Image and video denoising by sparse 3D transform-domain collaborative filtering,” https://www.cs.tut.fi/~foi/GCF-BM3D/BM3D_TIP_2007.pdf.

**63. **K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian, “Image denoising by sparse 3-D transform-domain collaborative filtering,” IEEE Trans. Image Process. **16**(8), 2080–2095 (2007). [CrossRef] [PubMed]

**64. **A. Mordvintsev, “ROF and TV-L1 denoising with Primal-Dual algorithm,” http://www.webcitation.org/6rEjLnF1F, (2017).

**65. **D. L. Donoho and I. M. Johnstone, “Adapting to unknown smoothness via wavelet shrinkage,” J. Am. Stat. Assoc. **90**(432), 1200–1224 (1995). [CrossRef]

**66. **D. L. Donoho, “De-noising by soft-thresholding,” IEEE Trans. Inf. Theory **41**(3), 613–627 (1995). [CrossRef]