## Abstract

Single molecule localization microscopy (SMLM) is one of the fastest evolving and most broadly used super-resolving imaging techniques in the biosciences. While image recordings could take up to hours only ten years ago, scientists are now reaching for real-time imaging in order to follow the dynamics of biology. To this end, it is crucial to have data processing strategies available that are capable of handling the vast amounts of data produced by the microscope. In this article, we report on the use of a deep convolutional neural network (CNN) for localizing particles in three dimensions on the basis of single images. In test experiments conducted on fluorescent microbeads, we show that the precision obtained with a CNN can be comparable to that of maximum likelihood estimation (MLE), which is the accepted gold standard. Regarding speed, the CNN performs, at about 22k localizations per second, more than three orders of magnitude faster than the MLE algorithm of ThunderSTORM. If only five parameters are estimated (3D position, signal and background), our CNN implementation is currently slower than the fastest, recently published GPU-based MLE algorithm. However, in this comparison the CNN catches up with every additional parameter, requiring only a few percent extra time per additional dimension. Thus it may become feasible to estimate further variables such as molecule orientation, aberration functions or color. We experimentally demonstrate that jointly estimating Zernike mode magnitudes for aberration modeling can significantly improve the accuracy of the estimates.

Published by The Optical Society under the terms of the Creative Commons Attribution 4.0 License. Further distribution of this work must maintain attribution to the author(s) and the published article's title, journal citation, and DOI.

## 1. Introduction

Since it became technologically feasible to image single fluorescent molecules, researchers have strived to learn from their three-dimensional localizations and trajectories [1, 2]. In biological light microscopy, the localization of individual emitters became particularly important with the advent of superresolution techniques such as PALM or STORM [3–5], which also increased the demand for fast algorithms, because tens of thousands of molecules need to be localized. For this reason, several single-iteration algorithms have been developed [6–12], which usually trade precision for speed, although remarkable precisions can be obtained with some methods.

Regarding the PSF model used, it has been shown that Gaussian fitting is sufficiently accurate for localizing molecules within a 2D plane [13], although the photon numbers tend to be systematically underestimated [14]. However, for more advanced localization tasks, which for instance take into account the axial coordinate [15], dipole orientations [16] or engineered point spread functions (PSF) [17], the use of maximum likelihood estimation (MLE) in conjunction with an accurate PSF model has become the accepted gold standard. MLE searches for the maximum of a multidimensional likelihood function, which expresses the probability of the measured data given a certain combination of parameter values. Usually, a set of five parameters is considered, each spanning an individual dimension in the search space: the x,y,z-position as well as the background and signal photon numbers. Although algorithmic implementations of MLE can be very fast [18], it is nevertheless required to find the global maximum of a multi-dimensional scalar function for every localization event. The time required for finding the maximum inevitably increases with the number of involved dimensions. Furthermore, MLE is prone to wrong estimates caused by ill-chosen start parameters, which will be further discussed in section 3.

Deep learning (DL) (see [19] for a review) is a powerful variant of machine learning. Convolutional neural networks (CNN), a subcategory of DL, have been shown to perform remarkably well on various image processing tasks such as object recognition [20, 21] or single-image superresolution [22–24]. Lately, CNNs have found increasing use in the field of optical microscopy, again for obtaining higher resolution [25], but also for the more general task of image restoration [26,27].

Very recently, CNNs have been applied to particle tracking [28] and SMLM [29–32]. The presented implementations partly focus on optimizing different aspects, such as the localization in dense environments [29] or object reconstructions based on sparse localization data [30]. The potential speed advantage of machine learning algorithms originates from their basic architecture: In contrast to MLE, localizing molecules only requires applying a single “forward pass” and includes no search for a maximum, although this forward pass involves a high number of convolutions. CNNs are also parameter-free, such that potential errors caused by ill-chosen initial values are avoided.

Here we report on the application of a CNN for the 3D localization of fluorescent microparticles and provide a quantitative comparison of its performance to that of MLE as well as the theoretical optimum in the form of calculated Cramer-Rao Lower Bounds (CRLB). Although 3D localization using CNNs has been demonstrated earlier this year [28, 31], a quantitative comparison to theoretical limits has – to the best of our knowledge – not been provided to date.

Furthermore, we provide experimental data from the joint estimation of aberrations together with the commonly estimated five parameters signal, background and 3D position. To this end, precisely known aberrations are applied to the wavefront using a spatial light modulator (SLM) and afterwards successfully detected by the CNN.

In contrast to related work [29–31], where neural networks have been applied to whole frames containing many molecules, our CNN and MLE implementations both operate on individual small images containing single particles that have already been coarsely localized and cropped with a rather low precision of ±500 nm. We employ ThunderSTORM [33] for this task, because a convenient plugin for ImageJ is readily available. However, it can also be performed by much simpler and faster algorithms. Any time consumed by this preprocessing step must be added to the times we state for our algorithms, which focus on sub-pixel localization.

While our pursued single-particle approach might not necessarily be the optimal strategy for a CNN in terms of speed, it nonetheless already competes with the fastest iterative methods and represents a direct and straightforward implementation which – due to the absence of the additional difficulty of finding many emitters – is able to obtain high precisions with little training effort (about 10 minutes training time). Thus we believe it to be well suited for precision benchmarking as presented in this paper. Furthermore, as will be discussed later, our approach can be straightforwardly expanded to include the estimation of additional parameters such as color [32], molecule orientation and aberrations.

Both methods, MLE and CNN, draw their information from those single particle images and rely on a calculated PSF model, which is refined using phase retrieval [17, 34–37]. We find that both methods achieve variances which are similarly close to the CRLB. The CNN, however, performs significantly faster than our own MLE implementation and the MLE algorithm of ThunderSTORM. Furthermore, it proves to be more robust in axial ranges where the MLE approach fails, probably due to the inverse problem being less well-posed and a high susceptibility to ill-chosen initial values. Compared to a recently published GPU-based MLE algorithm [18] our implementation is still slower if only the usual five parameters are estimated, but already performs within the same order of magnitude. However, while MLE slows down significantly with every further variable, the CNN computation time increases only marginally. Therefore the benefit of using CNNs grows with an increasing size of the estimate vector. We believe that including more parameters such as Zernike mode magnitudes for aberration modeling is viable: Even if the exact PSF is known, small shape variations happen over time and occur over the field of view. Such variations can already have a significant negative impact on the accuracy of the estimates.

## 2. Experimental setup and procedure

Figure 1 shows the experimental setup. Microparticles (170 nm PS-Speck beads from Thermo-Fisher, dye: deep red) are immobilized on the surface of a sapphire coverslip by air-drying drops of diluted bead suspension. The beads were subsequently covered with distilled water and imaged onto an EMCCD camera (Andor iXon 888 Life) using a high NA objective (Olympus NA 1.7 APON100xHOTIRF) and high quality tube lens (Nikon 200 mm tube lens). The objective is mounted on an axial piezo positioner to enable precise movements in the nanometer range (NV 40/1 CL E from Piezosystem Jena). The light of a 640 nm laser diode (Toptica iBeam smart) is used for fluorescence excitation in an epi-geometry. The following filters from Chroma were used to separate the excitation and emission wavelengths: excitation = ZET642/20×, dichroic = ZT532/640rpc-UF2, emission = ZET532/640m-TRF.

Experiments to compare the performances of MLE and CNN were conducted by taking many widefield images of a single bead at various axial (z−) positions of the objective. The z-positions were set from −1 μm to −0.3 μm with respect to the in-focus plane and changed in increments of 100 nm. Thus only one side of the focus cone is covered, which, however, has been shown to be a viable localization strategy [38], especially if particle locations on the other focus side can be excluded such as in TIRF experiments. At each z-step, 25 images were taken.

After recording, individual cropped (39×39 pixels) raw images were fed into both algorithms, which estimate five parameters (x, y, z, signal, background) from single images on the basis of a calculated PSF model *h*(*x*, *y*, *z*). Since the ground truths of the z-positions are known from the feedback sensor of the piezo actuator, the accuracies (i.e. systematic errors) and precisions of the estimates can be effectively evaluated.

#### 2.1. Maximum-likelihood estimation

We assumed a Poissonian probability for the number of photons $n_k$ detected in pixel $k$:

$$P(n_k) = \frac{\mu_{\theta,k}^{\,n_k}}{n_k!}\, e^{-\mu_{\theta,k}}$$

The expected photon numbers $\mu_{\theta,k}$ depend on the parameter vector $\theta = (\beta, s, x_0, y_0, z_0)$, where $\beta$ and $s$ denote the mean photon numbers for background level and signal and $x_0, y_0, z_0$ the 3D particle position, and further on the PSF model $h$:

$$\mu_{\theta,k} = \beta + s \cdot h_k(x_0, y_0, z_0).$$

For each recorded image, the parameter vector $\hat{\theta}$ is estimated by finding a minimum of the total negative log-likelihood function using the MATLAB tool *fminunc*:

$$\hat{\theta} = \arg\min_{\theta} \sum_{k} \left( \mu_{\theta,k} - n_k \ln \mu_{\theta,k} \right),$$

where the parameter-independent terms $\ln n_k!$ have been omitted.
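As an illustration of this estimator (not the authors' MATLAB implementation), the following Python sketch minimizes the Poisson negative log-likelihood for a toy one-dimensional Gaussian "PSF"; the PSF shape, pixel grid and parameter values are all illustrative assumptions, and a brute-force grid search stands in for *fminunc*'s iterative minimization:

```python
import numpy as np

def nll(n, mu):
    """Poisson negative log-likelihood, constant ln(n!) terms dropped."""
    return np.sum(mu - n * np.log(mu))

def expected_counts(z0, beta, s, pix):
    """Toy model: mu_k = beta + s * h_k(z0) with a 1D Gaussian 'PSF' h."""
    h = np.exp(-0.5 * ((pix - z0) / 2.0) ** 2)
    h /= h.sum()                      # normalize PSF to unit integral
    return beta + s * h

pix = np.arange(39, dtype=float)      # 39-pixel toy image, as in the paper
true_z0, beta, s = 21.3, 130.0, 3500.0
data = expected_counts(true_z0, beta, s, pix)   # noise-free expected counts

# brute-force search over z0 only (signal/background held at known values)
grid = np.linspace(15, 25, 2001)
nlls = [nll(data, expected_counts(z, beta, s, pix)) for z in grid]
z_hat = grid[int(np.argmin(nlls))]
```

With noise-free data the likelihood is maximal exactly at the true position, which makes the sketch easy to verify; with Poisson noise added, the scatter of `z_hat` around `true_z0` would approach the CRLB discussed below.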

#### 2.2. Neural network

The architecture of our CNN is based on the VGG16 [39] and depicted in Fig. 2. One or two convolutional layers are followed by pooling layers (max pooling). The first convolutional layer features 128 4×4 kernels and is followed by a second layer with 128 3×3 kernels. Then the kernel size is fixed at 2×2 and the number of kernels increased to 1024. Each layer is followed by a ReLU activation up to the last fully connected layer, which has a linear activation function with a slope of 0.5. The network was implemented in Keras [40] using the Theano [41] backend. It was trained with the Adam [42] optimizer with an initial learning rate of 5 · 10^{−3} and a decay of 10^{−4} for 100 epochs.

The training data needs to cover variations of all parameters that are to be estimated. A priori information about sensible parameter intervals can be used to reduce the training effort, as no training data has to be generated for parameter combinations that will never occur. The training was performed on a NVidia GTX 1080 graphics card and took around 5 minutes to converge. The generation of the training data took another 5 minutes. It comprises 10000 simulated images of emitters at random 3D positions, varying between −1.3 and 0.1 μm for the z-position and ±500 nm for the x and y positions. Signal and background levels were randomly varied between 1000 and 9000 photons for the signal and between 75 and 200 photons for the background. Here we assume that an initial position estimator (e.g. a center of mass calculation or a Gaussian fit) provides a sufficiently accurate estimate, which allows one to center the molecule within this defined area. For the estimation of aberrations (shown in Fig. 7) the magnitudes of Zernike modes *Z*_{5} and *Z*_{6} (astigmatism in x and y) were randomly varied within the interval of ±0.5 rad (RMS value over the entire pupil). All CNN estimates presented in this paper are based on training sets comprising 10000 images.
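Sampling the training parameters over the intervals above is straightforward; the following sketch shows one possible way to draw such a parameter set (an illustrative assumption, not the authors' generator):

```python
import numpy as np

rng = np.random.default_rng(0)        # fixed seed for reproducibility
n_images = 10000

# intervals taken from the text; units: micrometres for z, nanometres for x/y
params = {
    "z":          rng.uniform(-1.3, 0.1,  n_images),   # axial position (um)
    "x":          rng.uniform(-500, 500,  n_images),   # lateral position (nm)
    "y":          rng.uniform(-500, 500,  n_images),
    "signal":     rng.uniform(1000, 9000, n_images),   # photons
    "background": rng.uniform(75,   200,  n_images),   # photons
}
```

Each parameter vector would then be passed to the PSF model to render one simulated training image.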

The error function used for training is the mean of the errors calculated for 1000 individual simulated images. For each image, the error is the *L*_{1} norm of the estimate vector minus the target vector. Each entry of the target vector is divided by the maximal respective training value, to equalize the importance of all dimensions.
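The normalization described above can be written compactly; a minimal sketch (the batch values and variable names are illustrative, not taken from the paper):

```python
import numpy as np

def normalized_l1(estimates, targets, max_train_values):
    """Mean L1 error with each component scaled by its maximal training value,
    so that all parameter dimensions contribute on a comparable scale."""
    scaled_err = np.abs(estimates - targets) / max_train_values
    return scaled_err.sum(axis=1).mean()   # L1 norm per image, mean over batch

# illustrative batch of 2 images; columns: (z, x, y, signal, background)
max_vals = np.array([1.3, 500.0, 500.0, 9000.0, 200.0])
est = np.array([[0.5, 100.0, -50.0, 4000.0, 100.0],
                [0.0,   0.0,   0.0, 2000.0, 150.0]])
tgt = np.array([[0.4, 120.0, -40.0, 4500.0, 110.0],
                [0.1,  10.0, -10.0, 2100.0, 140.0]])
loss = normalized_l1(est, tgt, max_vals)
```

Without the per-dimension scaling, the photon-number errors (thousands of photons) would dominate the position errors (fractions of a micrometre) during training.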

Additionally, the RMS error of the estimated z-position was monitored. Comparing this value with the corresponding CRLB allows for judging the network’s training status: The error of a well-trained CNN should be close to the CRLB. On the other hand, undercutting the CRLB is a sign of overfitting. In this case, the network is too well adapted to the training data, which usually leads to poor performance on new images.

The CNN does not need to be trained from scratch for every new dataset. If the system aberrations do not change significantly (i.e. less than ≈ 0.5 rad), transfer learning can be applied, which means that only the fully connected layer has to be retrained to handle the new dataset. Such a retraining of the network can be done with less training data (between 1000 and 5000 images, depending on the magnitude of the change of the aberrations) and is therefore more efficient than training the CNN from scratch.

#### 2.3. Refining the PSF model

The PSF model needs to be accurate for optimal localization performance. We found that it is not sufficient to assume aberration-free imaging, even if a vectorial imaging theory is employed [43]. Therefore, we refine the PSF model by taking phase aberrations in the objective pupil into account, which we express in a Zernike mode basis using Noll coefficients [44] ranging from 2 (tip) to 37 (3rd order spherical) plus the 4th order spherical mode *Z*_{56}:

$$\Phi(\rho, \varphi) = \sum_i a_i \, Z_i(\rho, \varphi)$$

The coefficients $a_i$ represent the magnitudes of the Zernike modes $Z_i$.

Our phase retrieval method is related to previous work on this topic [17,34–37]. In the first step, a focus stack *I _{exp}* of a single 170 nm bead is recorded. The stack consists of 17 images in a focus range from −1.6 μm to 1.6 μm with a step size of 200 nm. The outer images (typically from −1.6 μm to −0.8 μm and from 0.8 μm to 1.6 μm) are imaged with an increased exposure time to ensure high signal to noise ratios even in strongly defocused regions. All images are laterally centered according to the centroid of the in-focus image. Subsequently, all images are cropped to a size of 40×40 pixels, with one pixel measuring 115 nm in the focal plane. Finally, constant background values are subtracted.
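A pupil phase of this form can be evaluated as a weighted sum of Zernike polynomials. The sketch below implements only a handful of low-order modes in standard Noll indexing and normalization (a small, illustrative subset of the modes 2–37 and 56 used in the paper):

```python
import numpy as np

# A few low-order Zernike polynomials, Noll indexing, standard normalization.
def zernike(i, rho, phi):
    table = {
        2:  lambda r, p: 2.0 * r * np.cos(p),                    # tip
        3:  lambda r, p: 2.0 * r * np.sin(p),                    # tilt
        4:  lambda r, p: np.sqrt(3.0) * (2.0 * r**2 - 1.0),      # defocus
        5:  lambda r, p: np.sqrt(6.0) * r**2 * np.sin(2.0 * p),  # astigmatism
        6:  lambda r, p: np.sqrt(6.0) * r**2 * np.cos(2.0 * p),  # astigmatism
        11: lambda r, p: np.sqrt(5.0) * (6.0 * r**4 - 6.0 * r**2 + 1.0),  # spherical
    }
    return table[i](rho, phi)

def pupil_phase(coeffs, rho, phi):
    """Phi(rho, phi) = sum_i a_i * Z_i(rho, phi) over the {i: a_i} dict."""
    return sum(a * zernike(i, rho, phi) for i, a in coeffs.items())

# example: 0.3 rad RMS of Z5 astigmatism plus a little spherical aberration
rho, phi = np.meshgrid(np.linspace(0, 1, 64), np.linspace(0, 2 * np.pi, 64))
phase = pupil_phase({5: 0.3, 11: 0.05}, rho, phi)
```

With this normalization, a coefficient $a_i$ directly equals the RMS wavefront contribution of mode $i$ over the unit pupil, matching the RMS convention used in the paper.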

A matching image stack *I _{sim}* is simulated using a vectorial model [43]. This simulated point response is convolved with the shape of the bead [34] as well as with an additional Gaussian kernel ($\sigma$ = 0.5 pixels, empirically determined), to more closely match the experimental images. For the monochromatic simulation we assume the maximum emission wavelength (measured to be 680 nm) as well as unpolarized emission, except when the stack was recorded via the SLM path (see Appendix for details), in which case we altered the model to contain a linear polarizer in the pupil plane.

The phase retrieval algorithm compares the simulated 3D model stack $I_{sim}(i, j, k)$ with the experimental data $I_{exp}(i, j, k)$ by calculating the differences and summing up their absolute values:

$$E = \sum_{i=1}^{n_x} \sum_{j=1}^{n_y} \sum_{k=1}^{n_z} \left| I_{sim}(i,j,k) - I_{exp}(i,j,k) \right|$$

where $n_x$, $n_y$ and $n_z$ denote the respective dimensions. We empirically found this metric to yield robust results in conjunction with the MATLAB function *fminunc*. By varying the Zernike mode coefficients of the simulation, the algorithm fits the model to the experimental stack.
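The stack-comparison metric itself is a one-liner; in this Python sketch the arrays are illustrative stand-ins for the simulated and experimental stacks (sized to match the 40×40-pixel crops and 17 focus steps described above):

```python
import numpy as np

def stack_error(i_sim, i_exp):
    """E: sum over all voxels of |I_sim - I_exp| (sum of absolute differences)."""
    assert i_sim.shape == i_exp.shape
    return np.abs(i_sim - i_exp).sum()

# illustrative 40 x 40 x 17 stacks, matching the crop and step count in the text
rng = np.random.default_rng(1)
i_exp = rng.poisson(50.0, size=(40, 40, 17)).astype(float)
i_sim = i_exp + 1.0          # a model that is off by one count everywhere
err = stack_error(i_sim, i_exp)
```

In the actual phase retrieval, `i_sim` would be re-rendered from the current Zernike coefficients at every iteration and `err` handed to the optimizer as the objective value.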

After those Zernike magnitudes which minimize the error *E* have been found, the coefficients *a*_{2}, *a*_{3} (corresponding to tip and tilt) are set to zero and high-NA defocus [45] is subtracted in order to center the PSF model on the computational grid. The final PSF model *h* is obtained by calculating one last focus stack using the updated Zernike coefficients and normalizing it to an integral value of one.

Figure 3 shows x-z-cross sections of measured and simulated bead images. The images are in logarithmic scale to highlight differences in dim regions. Since the bead is much smaller than the wavelength (170 nm vs. 680 nm), the images practically represent the PSF itself. The measured image appears slightly stretched along the z-direction compared to the simulated aberration-free PSF and shows deviations especially in defocused regions. Note that the refined PSF model shows similar characteristics, albeit generally appearing sharper. The model has been obtained by calculating the bead image under consideration of the retrieved pupil phase aberrations shown in the bar plot at the bottom. Especially the spherical coefficients (11, 22, 37, 56) show non-negligible magnitudes.

In order to obtain accurate phase retrieval results, we found it important to take a polarization-dependent objective transmission function into account. Further details about its measurement and impact on the phase retrieval are provided in the Appendix.

## 3. Results

Figure 4 summarizes the main results from estimating the particle position from single images using MLE and the CNN. Only data from one half of the focusing cone is shown (the half which is closer to the objective lens, where the bead is located below the focal plane). This kind of defocused imaging is also used in practical applications. Using both sides of the focus would be ill-conditioned due to similar looking images for positive and negative defoci of equal magnitude. The 8 steps shown span a range of 700 nm. The numbers of signal and background photons in each image are around 3500 and 130, respectively. These values have been used to calculate the shown CRLBs.

The axial position estimates for each recorded frame are shown in Fig. 4(a). Note that the true absolute position is not precisely known as the bead is manually put into focus before the measurement, therefore it is impossible to say which prediction is more accurate.

Both methods make systematic errors in step heights, which are due to remaining inaccuracies of the PSF model. Figure 4(b) shows these deviations with respect to the central z-position. They may have several reasons: Firstly, not all aberrations can be truthfully modeled by a phase function in the objective pupil. Tiny bubbles in the immersion medium for instance are located in a different optical plane and cause phase as well as amplitude changes. Secondly, we noticed that aberrations can vary over time. This is presumably caused by changes in temperature or the refractive index of the immersion medium, which is partly volatile and develops crystallized sulfur during longer measurement periods (within 2–3 hours). Finally, we find that the accuracy of the phase retrieval method strongly depends on the implemented method for image formation. Only slightly false assumptions on critical parameters such as the NA can lead to Zernike magnitude changes on the order of 0.1 rad. That being said, we would like to emphasize that a careful fine-tuning of phase-retrieval parameters was performed using a separate optical path, where the quality of retrieved phase functions can be directly tested by applying its negative on a spatial light modulator (SLM). More information on this test setup is provided in the Appendix.

Figures 4(c) and 4(d) visualize localization precisions in x–y and z-directions. The graphs show the experimentally obtained standard deviations for MLE and the CNN, as well as the square roots of the relevant CRLBs (orange lines), which mark the theoretical optima for any unbiased estimator. Both methods perform close to these thresholds. Most importantly, however, we find that the performances of MLE and the CNN are comparable.

The time required for calculating a single estimate is 0.045 ms for our CNN using a NVidia GTX 1080 graphics card, thus enabling 22k estimates per second. For comparison, our own MLE implementation takes 480 ms on an Intel Core i7-5700 processor. The popular ImageJ-based software ThunderSTORM, which uses a simpler, Gaussian PSF model, requires 150 ms for single estimates from images of equal size using the MLE-based 3D astigmatism method.

Probably the fastest MLE algorithm to date uses a spline-fitted PSF model and GPU-based calculations [18], achieving about 100k localizations per second. Compared to this, the CNN is still slower by about a factor of 5, but already operates within the same order of magnitude. Furthermore, it has to be taken into account that the CNN we use represents a standard architecture and has not yet been optimized for the specific task of fast particle localization. There are several methods which can provide a speed improvement such as pruning [46], fine-tuned CP-decomposition [47] and tensor decompositions [48]. The listed methods promise a speedup by several factors (around 2 to 4.5×) each. It is to be expected that future developments will provide the community with even faster and more powerful machine learning algorithms.

A further advantage of CNNs compared to MLE is that they don’t rely on any initial values for parameter estimation. In this regard, adequately trained CNNs can be more robust. To support this argument, Fig. 5(a) shows the data of Fig. 4(a) with an extended z-range, now reaching from −1 μm up to the in-focus position. The extension is located to the right of the dashed vertical line and represents a rather ill-suited region for performing axial localizations. Nevertheless, while the CNN delivers robust estimates throughout the entire z-range, the MLE estimates flip to the opposite side of the focal cone at the last two piezo steps. The situation further deteriorates if 200 nm are added to the optimal initial z-estimates (their retrieval is explained in section 2.1). An error of this magnitude is not unrealistic for an initial z-estimator. Then, the flipping already occurs at z=−400 nm, which is depicted in Fig. 5(b). We note that such a “flipping-effect” can only occur if the positive side of the focusing cone (i.e. the PSF model) is provided to the MLE. If this is not the case, the algorithm runs into the upper boundaries of the cone, returning “in-focus”-localizations, i.e. *z* = 0.

#### 3.1. Importance of using phase retrieval

We investigate the benefit of using a refined PSF model for position estimates using MLE and the CNN. These investigations provide an impression of the error magnitudes which are to be expected if such a refinement is not undertaken. For this sake we evaluate the data of Fig. 4 once more, this time wrongly assuming the aberration-free case. The results are shown in Fig. 6. The z-position estimates presented in Fig. 6(a) can be directly compared to those of Fig. 4(a), where the refined PSF model of Fig. 3 has been used. From the comparison it is evident that falsely assuming the absence of aberrations generally leads to lower accuracies for both methods, especially towards small defocus values where the 100 nm steps appear to be smaller. This “compression” also affects the variation of z-estimates and misleadingly suggests a smaller variance. Note that the CRLB for the z-position is comparably high around the focus, indicating that the spreading should indeed be larger.

To facilitate a direct comparison of accuracies, systematic z-deviations with respect to the mid-position (z=−550 nm) are shown as bar plots in Fig. 6(b). The bars belonging to the refined cases are identical to those in Fig. 4(b).

#### 3.2. Jointly estimating aberrations with a CNN

The previous section highlighted the importance of using a correct PSF model. Obtaining such a model, however, requires additional experimental and computational steps, such as measuring a z-stack of fluorescent beads and estimating aberrations using a dedicated algorithm. Even then, field- and time-dependent aberrational drifts are typically ignored. Therefore, it would be valuable if aberrations were directly estimated from the sample data, jointly with 3D position, signal and background. On the other hand, significantly increasing the number of unknown parameters is challenging, because every additional variable adds another dimension to the search space and significantly slows down iterative search algorithms (about 25% computation time increase per additional parameter for our MLE implementation).

In such situations, machine learning becomes increasingly attractive. In contrast to MLE, the CNN computation time increases by merely a few percent per additional parameter. This is because the number of convolutions – which consume the majority of time – remains the same; only the last fully connected layer becomes more complex. While it is to be expected that the obtainable precision suffers from estimating a high number of parameters, we find only a small impact when including three Zernike terms to model aberrations. Furthermore, systematic errors are made if the potential presence of aberrations is ignored.

To investigate this further, we conducted an experiment where first order astigmatism (*Z*_{5}) with a magnitude of 0.3 rad RMS was intentionally added to the wavefront using a spatial light modulator (details to the setup are provided in the Appendix). Subsequently, images of a single bead at eight different z-steps were recorded in analogy to the previously described measurement.

A CNN was trained to perceive the usual five parameters (signal, background, 3D position) as well as aberrations composed from first order astigmatism (*Z*_{5}, *Z*_{6}) and primary spherical aberration (*Z*_{11}). Astigmatism is relevant because it is a typical problem at off-axis field points.

The results presented in Fig. 7 clearly show that the aberration-sensitive network does not only reliably estimate astigmatism (0.29(8)), but also delivers more accurate values for the z-position. The orange line in Fig. 7 marks the piezo set positions. The magnitude of *Z*_{11} is close to zero as expected. A small systematic deviation of about −0.1 rad appears at larger defocus values. This is probably due to a crosstalk with defocus, which technically also represents a spherical aberration.

On the other hand, the “rigid” CNN, which is insensitive to aberrations (red plot in Fig. 7(a)) delivers erroneous z-position estimates which are up to 200 nm off from the piezo position.

## 4. Summary and discussion

We have investigated the use of a CNN for 3D localization microscopy and found that it can achieve a similar precision level to that of MLE. This can be explained by the fact that the CNN learns the noise statistics, which is an integral part of the MLE definition, sufficiently well from the training data set.

A significant advantage of CNNs is that they are parameter-free, i.e. no initial starting parameters are required as in MLE. All parameters are determined in the supervised learning process. The second potential advantage of using a CNN is a speed gain. While the time required for training the CNN outweighs any performance gain for small data sets, the speed advantage will eventually become significant if millions of data points are evaluated and many measurements need to be made. While our current CNN implementation still lags behind the fastest MLE algorithm if only five parameters are estimated, we find that the CNN becomes increasingly attractive if additional parameters are taken into account. To this end we experimentally demonstrated the feasibility of jointly estimating aberrations with a CNN. We would like to emphasize, however, that aberration estimation can be further improved. Field-dependent aberrations for instance should be similar for spatially close molecules, while temporally changing aberrations should vary slowly, such that recurrent networks with a “memory” of past events could be helpful.

## Appendix

#### 4.1. SLM-aided optical path

Our setup allows one to insert an optical relay system containing a liquid crystal SLM (Holoeye Pluto). By flipping mirrors, this extra path can be inserted between tube lens and camera. The SLM path is used to verify and optimize the phase retrieval routine as well as to apply precisely defined aberration functions in order to investigate the estimation of aberrations with a CNN. Figure 8 shows results from two subsequent phase-retrieval procedures: An initial run and a second one, after *Z*_{11}=−0.5 rad has been applied with the SLM. The bar plot in Figure 8(c) shows the difference between the two phase retrieval results, indicating a difference of −0.54 rad for the respective mode.

#### 4.2. Measurement of objective transmission

The transmission of the objective lens was measured by applying a thin layer of fluorescent marker to the lower side of the sapphire coverslip, which is then exposed to a weak laser focus. A drop of index-matching oil is put on the top side of the coverslip in order to suppress undesired reflections. A rotatable polarizer is introduced in a plane conjugated to the objective exit pupil. This plane is imaged twice using a Bertrand lens. The polarizer is turned by 90 degrees in between the two recordings. In the next step, the coverslip is cleaned and put back on, to collect the background fluorescence from the immersion medium and sapphire glass. This background is subtracted from both images. The final back focal plane images are fitted using the first four spherical Zernike polynomials. Finally, the transmission functions for both linear polarization states are obtained by multiplying the back focal plane images with the factor cos(arcsin(NA/RI)), to compensate for the irradiance increase towards the pupil edge which is already included in the imaging model. In this expression, NA denotes the numerical aperture of the objective lens and RI the refractive index of the immersion oil. Depending on whether the polarizing SLM-path or the unpolarized light path of Fig. 1 is used for taking phase retrieval measurements, a polarized or unpolarized transmission function is used for calculating the PSF, the latter being calculated by taking the average of both measured polarized transmission functions.
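The apodization factor above follows from the identity cos(arcsin(x)) = √(1 − x²); a minimal sketch, where the immersion-oil refractive index of 1.78 is an illustrative assumption (the paper states only the NA 1.7 objective):

```python
import numpy as np

def apodization_factor(na, ri):
    """cos(arcsin(NA/RI)): compensates the irradiance increase towards the
    pupil edge that is already included in the imaging model."""
    x = na / ri
    return np.sqrt(1.0 - x**2)       # identical to cos(arcsin(x)) for |x| <= 1

# illustrative values: NA 1.7 objective, assumed immersion-oil RI of 1.78
factor = apodization_factor(1.7, 1.78)
```

Because NA/RI approaches 1 for such a high-NA objective, the factor becomes small near the pupil edge, which is why neglecting the apodization noticeably biases the retrieved spherical aberration coefficients.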

Figure 9 shows phase retrieval results obtained with and without taking the measured objective transmission into account. The main differences concern the coefficients for spherical aberrations: In the absence of a correctly apodized pupil, the fitting procedure models its effect (the effectively smaller NA) by increased spherical aberrations. A further noticeable difference exists for astigmatism (predominantly *Z*_{6}). This is because the measurements have been performed using the SLM path and thus polarized light. Models which don’t consider a polarization-dependent objective transmission predict a distinct elliptical focus shape, while in reality this ellipticity is less pronounced. This shape mismatch is misinterpreted as an effect caused by astigmatism.

## 4.3. Calculation of Cramér-Rao lower bounds

The CRLBs for the parameter uncertainties are given by the corresponding diagonal elements of the inverse Fisher information matrix, which for a Poisson pixel noise model is calculated as follows [49]:

$$\mathrm{FI}_{ij} = \sum_{k=1}^{K} \frac{1}{\mu_k(\boldsymbol{\theta})} \, \frac{\partial \mu_k(\boldsymbol{\theta})}{\partial \theta_i} \, \frac{\partial \mu_k(\boldsymbol{\theta})}{\partial \theta_j},$$

where *K* is the number of pixels, μ_{k}(θ) denotes the expected photon count in pixel *k*, and θ is the vector of estimated parameters. The lower bound for the variance of the *i*-th parameter is then given by [FI^{−1}]_{ii}.

The signal and background photon numbers required for CRLB calculation were provided by the MLE estimates.
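The Fisher information matrix above can be evaluated numerically from any forward image model. The sketch below uses central finite differences for the derivatives and, as a stand-in for the vectorial PSF model of the paper, an illustrative 2D Gaussian spot; all parameter values and the function names are assumptions for the example only.

```python
import numpy as np

def crlb_poisson(mu_fn, theta, dtheta):
    """CRLBs for a Poisson pixel model:
    FI_ij = sum_k (1/mu_k) * dmu_k/dtheta_i * dmu_k/dtheta_j.
    mu_fn(theta) returns expected photon counts per pixel; derivatives are
    taken by central differences with step sizes dtheta."""
    theta = np.asarray(theta, float)
    mu = mu_fn(theta)
    grads = []
    for i, h in enumerate(dtheta):
        tp, tm = theta.copy(), theta.copy()
        tp[i] += h
        tm[i] -= h
        grads.append(((mu_fn(tp) - mu_fn(tm)) / (2 * h)).ravel())
    grads = np.stack(grads)                        # shape (P, K)
    fisher = grads @ (grads / mu.ravel()).T        # Fisher information matrix
    return np.sqrt(np.diag(np.linalg.inv(fisher))) # per-parameter CRLBs (std dev)

# Illustrative stand-in model: Gaussian spot with signal N and flat background bg
def gaussian_image(theta, n=15, sigma=1.3):
    x0, y0, N, bg = theta
    y, x = np.mgrid[0:n, 0:n]
    psf = np.exp(-((x - x0) ** 2 + (y - y0) ** 2) / (2 * sigma ** 2))
    return N * psf / psf.sum() + bg

bounds = crlb_poisson(gaussian_image, [7.0, 7.0, 2000.0, 5.0],
                      [1e-3, 1e-3, 1.0, 1e-3])
# bounds[0], bounds[1]: lateral localization precision limits in pixels
```

In the paper, the signal and background entries of θ are taken from the MLE estimates, as stated above; here they are simply fixed example values.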

## Funding

Austrian Science Fund (FWF) (P 30214-N36); European Union Horizon 2020 programme under Marie Skłodowska-Curie grant agreement No. 721358.

## References

**1. **C. Manzo and M. F. Garcia-Parajo, “A review of progress in single particle tracking: from methods to biophysical insights,” Reports on Prog. Phys. **78**, 124601 (2015). [CrossRef]

**2. **G. J. Schütz, V. P. Pastushenko, H. J. Gruber, H.-G. Knaus, B. Pragl, and H. Schindler, “3d imaging of individual ion channels in live cells at 40nm resolution,” Single Mol. **1**, 25–31 (2000). [CrossRef]

**3. **E. Betzig, G. H. Patterson, R. Sougrat, O. W. Lindwasser, S. Olenych, J. S. Bonifacino, M. W. Davidson, J. Lippincott-Schwartz, and H. F. Hess, “Imaging intracellular fluorescent proteins at nanometer resolution,” Science **313**, 1642–1645 (2006). [CrossRef] [PubMed]

**4. **S. T. Hess, T. P. Girirajan, and M. D. Mason, “Ultra-high resolution imaging by fluorescence photoactivation localization microscopy,” Biophys. J. **91**, 4258–4272 (2006). [CrossRef] [PubMed]

**5. **M. J. Rust, M. Bates, and X. Zhuang, “Sub-diffraction-limit imaging by stochastic optical reconstruction microscopy (storm),” Nat. Methods **3**, 793 (2006). [CrossRef] [PubMed]

**6. **P. N. Hedde, J. Fuchs, F. Oswald, J. Wiedenmann, and G. U. Nienhaus, “Online image analysis software for photoactivation localization microscopy,” Nat. Methods **6**, 689 (2009). [CrossRef] [PubMed]

**7. **R. Henriques, M. Lelek, E. F. Fornasiero, F. Valtorta, C. Zimmer, and M. M. Mhlanga, “Quickpalm: 3d real-time photoactivation nanoscopy image processing in imagej,” Nat. Methods **7**, 339 (2010). [CrossRef] [PubMed]

**8. **B. Yu, D. Chen, J. Qu, and H. Niu, “Fast fourier domain localization algorithm of a single molecule with nanometer precision,” Opt. Lett. **36**, 4317–4319 (2011). [CrossRef] [PubMed]

**9. **H. Ma, F. Long, S. Zeng, and Z.-L. Huang, “Fast and precise algorithm based on maximum radial symmetry for single molecule localization,” Opt. Lett. **37**, 2481–2483 (2012). [CrossRef] [PubMed]

**10. **R. Parthasarathy, “Rapid, accurate particle tracking by calculation of radial symmetry centers,” Nat. Methods **9**, 724 (2012). [CrossRef] [PubMed]

**11. **H. Ma, J. Xu, J. Jin, Y. Gao, L. Lan, and Y. Liu, “Fast and precise 3d fluorophore localization based on gradient fitting,” Sci. Reports **5**, 14335 (2015). [CrossRef]

**12. **K. J. Martens, A. N. Bader, S. Baas, B. Rieger, and J. Hohlbein, “Phasor based single-molecule localization microscopy in 3d (psmlm-3d): an algorithm for mhz localization rates using standard cpus,” J. Chem. Phys. **148**, 123311 (2018). [CrossRef] [PubMed]

**13. **S. Stallinga and B. Rieger, “Accuracy of the gaussian point spread function model in 2d localization microscopy,” Opt. Express **18**, 24461–24476 (2010). [CrossRef] [PubMed]

**14. **C. Franke, M. Sauer, and S. van de Linde, “Photometry unlocks 3d information from 2d localization microscopy data,” Nat. Methods **14**, 41 (2016). [CrossRef] [PubMed]

**15. **F. Aguet, D. Van De Ville, and M. Unser, “A maximum-likelihood formalism for sub-resolution axial localization of fluorescent nanoparticles,” Opt. Express **13**, 10503–10522 (2005). [CrossRef] [PubMed]

**16. **K. I. Mortensen, L. S. Churchman, J. A. Spudich, and H. Flyvbjerg, “Optimized localization analysis for single-molecule tracking and super-resolution microscopy,” Nat. Methods **7**, 377 (2010). [CrossRef] [PubMed]

**17. **S. Quirin, S. R. P. Pavani, and R. Piestun, “Optimal 3d single-molecule localization for superresolution microscopy with aberrations and engineered point spread functions,” Proc. Natl. Acad. Sci. **109**, 675–679 (2012). [CrossRef] [PubMed]

**18. **Y. Li, M. Mund, P. Hoess, J. Deschamps, U. Matti, B. Nijmeijer, V. J. Sabinina, J. Ellenberg, I. Schoen, and J. Ries, “Real-time 3d single-molecule localization using experimental point spread functions,” Nat. Methods **15**, 367 (2018). [CrossRef] [PubMed]

**19. **Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature **521**, 436 (2015). [CrossRef] [PubMed]

**20. **K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on imagenet classification,” in Proceedings of the IEEE International Conference on Computer Vision, (2015), pp. 1026–1034.

**21. **A. Sinha, J. Lee, S. Li, and G. Barbastathis, “Lensless computational imaging through deep learning,” Optica **4**, 1117–1125 (2017). [CrossRef]

**22. **C. Dong, C. C. Loy, K. He, and X. Tang, “Learning a deep convolutional network for image super-resolution,” in *European Conference on Computer Vision*, (Springer, 2014), pp. 184–199.

**23. **C. Dong, C. C. Loy, K. He, and X. Tang, “Image super-resolution using deep convolutional networks,” IEEE Transactions on Pattern Analysis Mach. Intell. **38**, 295–307 (2016). [CrossRef]

**24. **K. Hayat, “Super-resolution via deep learning,” CoRR **abs/1706.09077** (2017).

**25. **Y. Rivenson, Z. Göröcs, H. Günaydin, Y. Zhang, H. Wang, and A. Ozcan, “Deep learning microscopy,” Optica **4**, 1437–1443 (2017). [CrossRef]

**26. **M. Weigert, U. Schmidt, T. Boothe, A. Müller, A. Dibrov, A. Jain, B. Wilhelm, D. Schmidt, C. Broaddus, S. Culley, M. Rocha-Martins, F. Segovia-Miranda, C. Norden, R. Henriques, M. Zerial, M. Solimena, J. Rink, P. Tomancak, L. Royer, F. Jug, and E. W. Myers, “Content-aware image restoration: Pushing the limits of fluorescence microscopy,” bioRxiv (2017).

**27. **M. Weigert, L. Royer, F. Jug, and G. Myers, “Isotropic reconstruction of 3d fluorescence microscopy images using convolutional neural networks,” in *International Conference on Medical Image Computing and Computer-Assisted Intervention*, (Springer, 2017), pp. 126–134.

**28. **J. M. Newby, A. M. Schaefer, P. T. Lee, M. G. Forest, and S. K. Lai, “Convolutional neural networks automate detection for tracking of submicron-scale particles in 2d and 3d,” Proc. Natl. Acad. Sci. **115**, 9026–9031 (2018). [CrossRef] [PubMed]

**29. **E. Nehme, L. E. Weiss, T. Michaeli, and Y. Shechtman, “Deep-storm: super-resolution single-molecule microscopy by deep learning,” Optica **5**, 458–464 (2018). [CrossRef]

**30. **W. Ouyang, A. Aristov, M. Lelek, X. Hao, and C. Zimmer, “Deep learning massively accelerates super-resolution localization microscopy,” Nat. Biotechnol. **36**, 460–468 (2018). [CrossRef] [PubMed]

**31. **N. Boyd, E. Jonas, H. P. Babcock, and B. Recht, “Deeploco: Fast 3d localization microscopy using neural networks,” BioRxiv p. 267096 (2018).

**32. **T. Kim, S. Moon, and K. Xu, “Information-rich localization microscopy through machine learning,” BioRxiv p. 373878 (2018).

**33. **M. Ovesný, P. Křížek, J. Borkovec, Z. Švindrych, and G. M. Hagen, “Thunderstorm: a comprehensive imagej plug-in for palm and storm data analysis and super-resolution imaging,” Bioinformatics **30**, 2389–2390 (2014). [CrossRef] [PubMed]

**34. **B. M. Hanser, M. G. Gustafsson, D. Agard, and J. W. Sedat, “Phase-retrieved pupil functions in wide-field fluorescence microscopy,” J. Microsc. **216**, 32–48 (2004). [CrossRef] [PubMed]

**35. **S. Liu, E. B. Kromann, W. D. Krueger, J. Bewersdorf, and K. A. Lidke, “Three dimensional single molecule localization using a phase retrieved pupil function,” Opt. Express **21**, 29462–29487 (2013). [CrossRef]

**36. **P. N. Petrov, Y. Shechtman, and W. Moerner, “Measurement-based estimation of global pupil functions in 3d localization microscopy,” Opt. Express **25**, 7945–7959 (2017). [CrossRef] [PubMed]

**37. **M. Siemons, C. Hulleman, R. Thorsen, C. Smith, and S. Stallinga, “High precision wavefront control in point spread function engineering for single emitter localization,” Opt. Express **26**, 8397–8416 (2018). [CrossRef] [PubMed]

**38. **M. Speidel, A. Jonáš, and E.-L. Florin, “Three-dimensional tracking of fluorescent nanoparticles with subnanometer precision by use of off-focus imaging,” Opt. Lett. **28**, 69–71 (2003). [CrossRef] [PubMed]

**39. **K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” CoRR **abs/1409.1556** (2014).

**40. **F. Chollet, “Keras: Deep learning library for theano and tensorflow,” URL: https://keras.io/ (2015).

**41. **Theano Development Team, “Theano: A Python framework for fast computation of mathematical expressions,” arXiv e-prints **abs/1605.02688** (2016).

**42. **D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv e-prints **abs/1412.6980** (2014).

**43. **D. Axelrod, “Fluorescence excitation and imaging of single molecules near dielectric-coated and bare surfaces: a theoretical study,” J. Microsc. **247**, 147–160 (2012). [CrossRef] [PubMed]

**44. **R. J. Noll, “Zernike polynomials and atmospheric turbulence,” JOSA **66**, 207–211 (1976). [CrossRef]

**45. **E. J. Botcherby, R. Juškaitis, M. J. Booth, and T. Wilson, “Aberration-free optical refocusing in high numerical aperture microscopy,” Opt. Lett. **32**, 2007–2009 (2007). [CrossRef] [PubMed]

**46. **S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural network with pruning, trained quantization and huffman coding,” CoRR **abs/1510.00149** (2015).

**47. **V. Lebedev, Y. Ganin, M. Rakhuba, I. V. Oseledets, and V. S. Lempitsky, “Speeding-up convolutional neural networks using fine-tuned cp-decomposition,” CoRR **abs/1412.6553** (2014).

**48. **M. Jaderberg, A. Vedaldi, and A. Zisserman, “Speeding up convolutional neural networks with low rank expansions,” CoRR **abs/1405.3866** (2014).

**49. **J. Chao, E. S. Ward, and R. J. Ober, “Fisher information theory for parameter estimation in single molecule microscopy: tutorial,” JOSA A **33**, B36–B57 (2016). [CrossRef] [PubMed]