## Abstract

Fourier-based wavefront sensors, such as the Pyramid Wavefront Sensor (PWFS), are the current preference for high contrast imaging due to their high sensitivity. However, these wavefront sensors have intrinsic nonlinearities that constrain the range where conventional linear reconstruction methods can be used to accurately estimate the incoming wavefront aberrations. We propose to use Convolutional Neural Networks (CNNs) for the nonlinear reconstruction of the wavefront sensor measurements. It is demonstrated that a CNN can be used to accurately reconstruct the nonlinearities in both simulations and a lab implementation. We show that solely using a CNN for the reconstruction leads to suboptimal closed loop performance under simulated atmospheric turbulence. However, it is demonstrated that using a CNN to estimate the nonlinear error term on top of a linear model results in an improved effective dynamic range of a simulated adaptive optics system. The larger effective dynamic range results in a higher Strehl ratio under conditions where the nonlinear error is relevant. This will allow the current and future generation of large astronomical telescopes to work in a wider range of atmospheric conditions and therefore reduce costly downtime of such facilities.

© 2020 Optical Society of America under the terms of the OSA Open Access Publishing Agreement

## 1. Introduction

The past few decades have seen an increase in the use of adaptive optics (AO) to correct for wavefront errors in imaging systems. These wavefront errors are usually created when the light travels through turbulent media such as Earth’s atmosphere in astronomy [1], or intracellular fluids in microscopy [2]. Accurate measurements of the aberrations are necessary to reach the best image quality after compensation. In most systems a dedicated optical system, a wavefront sensor, is used to measure the aberrations because the phase of visible and near-infrared light can not be directly measured. A wavefront sensor transforms the incoming wavefront aberrations into measurable intensity modulations.

The reconstruction of the wavefront from these sensor measurements is not straightforward and almost all commonly used methods depend on a linearization of the response of the wavefront sensor around a non-aberrated wavefront [3–5]. However, this assumption of linearity is not always valid as most wavefront sensors have intrinsic nonlinearities when the input wavefront has large deviations [6–8]. To reconstruct the wavefront accurately it is necessary to use nonlinear reconstruction (NLR) methods.

A particular focus of NLR in astronomy has been the reconstruction of unmodulated Pyramid Wavefront Sensor (PWFS) [9,10] measurements. The PWFS is a promising wavefront sensor for astronomy due to its high sensitivity, which allows astronomers to use AO on fainter stars [11,12]. The linear range of the unmodulated PWFS is relatively small and depends on the wavefront aberration mode: for tip and tilt errors the linear range is 1 rad, while for higher-order modes it is even smaller [6,13]. Atmospheric turbulence usually generates wavefront errors that are outside the linear range of the unmodulated PWFS, which is why most, if not all, current AO systems add modulation to increase the linear range [14–16]. While the modulation does increase the linear range and allows the PWFS to be used in realistic conditions, it also decreases the sensitivity [12,13,17].

As the PWFS response depends on the shape of the input Point Spread Function (PSF), there is also strong nonlinear cross-talk between different modes which quickly leads to an underestimation of the actual wavefront [18,19]. Recent work has shown that the use of model based nonlinear reconstruction methods can improve the reconstruction accuracy and increase the effective dynamic range of the PWFS [20,21].

Other approaches to increase the linear range of the sensor focus on changing the hardware. As briefly discussed above, dynamic modulation of the input beam on top of the pyramid face can be used to increase the linear range, but this decreases the sensitivity [9]. Different modulation schemes and even other types of focal plane masks instead of the pyramidal prism can be described in the general formalism of Fourier-based wavefront sensing [13], where a two lens system is used with a focal plane mask in its intermediate focus to measure wavefront errors. This formalism shows that there is a linear trade-off between the sensitivity of the wavefront sensor and its linear range. The only way to get around this linear trade-off between sensitivity and dynamic range is to introduce nonlinearities in the wavefront sensor. This approach led to the development of the generalised Optical Differentiation Wavefront Sensor (g-ODWFS) [8].

The g-ODWFS is similar to Fourier-based sensors but uses two filters in the focal plane instead of a single mask. Each filter encodes either the x-gradient or the y-gradient of the wavefront aberrations. A schematic of this setup can be seen in Fig. 1. A Wollaston prism splits the collimated input beam into two polarized beams, which are focused onto the different focal plane filters. For the g-ODWFS the focal plane filters are half-wave plates with a spatially varying fast-axis orientation; the angle of polarization of an input beam therefore rotates by an amount that depends on its position in the focal plane. A second Wollaston prism, rotated by 45 degrees with respect to the first one, is used as a polarization analyzer that creates four output beams, two for each gradient. The intensity difference between the two output beams per gradient depends on the angle of polarization, and therefore on the orientation of the fast axis of the focal plane mask where the beam hits it. For each position in the pupil we can measure the angle of polarization and derive where the ray hit the focal plane mask. From this we can measure the local tilt of the wavefront and reconstruct the wavefront aberrations. The fast-axis orientation follows a step-wise profile parameterized by $\beta$, which determines how close the profile is to a step function. This profile can be seen in Fig. 1(b).

Mathematically the g-ODWFS can be described as a Fourier filter,

$$E_1 = \mathcal{F}^{-1}\left[ T \cdot \mathcal{F}\left[ E_0 \right] \right].$$

Here $E_0$ is the input electric field, $\mathcal{F}$ is the Fourier transform operator, $T$ is the focal plane transmission mask and $E_1$ is the output electric field. The g-ODWFS uses two transmission masks, one that measures wavefront variations in the x direction and one that measures variations in the y direction. For a single direction, here the x direction, the transmission mask creates two effective amplitude filters whose intensity transmissions sum to one.

The g-ODWFS has been demonstrated in the lab and recently at the telescope as a wavefront sensor for astronomical adaptive optics [22]. Through the introduction of an intentional nonlinearity we can increase the dynamic range without decreasing the sensitivity of the wavefront sensor. However, for large wavefront errors the response is nonlinear, which shows that the only way to actually increase the effective dynamic range of these wavefront sensors is through nonlinear reconstruction of the incoming wavefront.
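To make the filter behaviour concrete, the following one-dimensional toy model sketches the response of a single gradient channel. The filter shape $f(u) = (1-\beta)\,u/d + \beta\,\mathrm{sgn}(u)$ clipped to $[-1, 1]$, the amplitude filters $\sqrt{(1 \pm f)/2}$, and all grid sizes are our illustrative assumptions, not taken from the paper's code.

```python
import numpy as np

# 1-D toy model of a single g-ODWFS gradient channel (illustrative parameters).
N, d, beta = 256, 10.0, 1 / 3

x = np.arange(N) - N // 2                     # pupil-plane coordinate
pupil = (np.abs(x) < N // 4).astype(complex)  # top-hat aperture

def wfs_response(tilt):
    """Global normalized intensity difference for a pure tilt (radians)."""
    E0 = pupil * np.exp(1j * tilt * x / (N // 4))
    Ef = np.fft.fftshift(np.fft.fft(E0))      # focal-plane field
    u = np.arange(N) - N // 2                 # focal-plane coordinate
    f = np.clip((1 - beta) * u / d + beta * np.sign(u), -1, 1)
    I_plus = np.abs(np.fft.ifft(np.fft.ifftshift(np.sqrt((1 + f) / 2) * Ef))) ** 2
    I_minus = np.abs(np.fft.ifft(np.fft.ifftshift(np.sqrt((1 - f) / 2) * Ef))) ** 2
    return (I_plus.sum() - I_minus.sum()) / (I_plus.sum() + I_minus.sum())
```

For small tilts the response grows with the displacement of the focal spot along the ramp; as $\beta \rightarrow 1$ the filter approaches a pure step and the response saturates for smaller aberrations, mirroring the sensitivity versus dynamic-range trade-off discussed above.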

Instead of using a model-based approach to nonlinear wavefront reconstruction, as done by [20,21], it is also possible to use a data-driven approach. The advantage of a data-driven approach is that it does not require exact knowledge of the optical system and its environment. Model-based methods are very sensitive to model errors, which can even dominate the reconstruction error [21]. Deep learning is a promising approach for data-driven learning of complicated nonlinear relations between input-output pairs. The use of neural networks has been explored in the past, but that work focused mainly on comparing the precision of linear reconstructors and neural networks [23,24].

In this work we propose to use a Convolutional Neural Network (CNN) as reconstructor to extend the effective dynamic range of Fourier-based wavefront sensors well into the nonlinear regime where conventional linear reconstructors degrade substantially in quality. Section 2 describes the different reconstruction methods that are used in this paper. Section 3 provides a comparison between the reconstruction methods in simulations to show the potential gain in an idealised system. Section 4 discusses the results from applying the CNN in a lab setup of the Leiden EXoplanet Instrument (LEXI) [22] that has an implementation of the g-ODWFS as its main wavefront sensor.

## 2. Reconstruction methods

The phase profile of the deformable mirror $\phi_{DM}(x,y)$ can be described by a set of modal functions $f_i(x,y)$:

$$\phi_{DM}(x, y) = \sum_i c_i f_i(x, y).$$

Modal reconstruction involves the estimation of the modal coefficients $c_i$ from the slope measurements of the wavefront sensor. Throughout this paper, the actuator modes will be used as the modal basis. The actuator modes consist of the phase profiles induced when the individual actuators on the DM are poked. For the g-ODWFS, the wavefront slope along one direction is related to the normalized differences between the pupil intensities resulting from that polarization rotator, as explained in the previous section. The normalized differences will be used as the measurement vector throughout this work.

#### 2.1 Matrix Vector Multiplication (MVM)

The most common technique of wavefront reconstruction is to assume a linear relation between the measurement vector $\vec{s}$ and the modal coefficients $\vec{c}$. The forward model is then given by:

$$\vec{s} = A\vec{c} + \vec{\eta}.$$

Where $A$ is the interaction matrix and $\vec{\eta}$ some measurement noise from the WFS. The interaction matrix $A$ can be calibrated by measuring the actuator response within the linearity range of the wavefront sensor. Reconstructing the wavefront from the slope measurements amounts to computing the least-squares estimate $\hat{\vec{c}}$, which requires inversion of the interaction matrix. To avoid degradation of the reconstruction by small singular values, regularization can be added, for example by truncating the singular value decomposition, where small singular values are set to zero, or by Tikhonov regularization. We will use Tikhonov regularization in the inversion, which gives the following cost function:

$$L(\vec{c}) = \left\| A\vec{c} - \vec{s} \right\|^2 + \left\| \Gamma\vec{c} \right\|^2.$$

Where $\|\cdot\|$ denotes the $L^2$ norm and $\Gamma$ is the Tikhonov matrix. We choose $\Gamma = \sqrt{\alpha}\, I$, with $I$ the identity matrix and $\alpha$ the regularization strength. The modal coefficients that minimize this cost function are given by:

$$\hat{\vec{c}} = \left( A^TA + \alpha I \right)^{-1} A^T \vec{s}.$$

The reconstruction matrix $(A^TA + \alpha I)^{-1}A^T$ can be precomputed. The optimal value for the regularization strength $\alpha$ depends on the noise statistics and is chosen by iteratively trying different values and selecting the one that gives the lowest mean squared error on a training set for the noise level used.

#### 2.2 Convolutional neural networks (CNN)

Convolutional Neural Networks have proven to be excellent at various image processing and classification tasks [25,26]. They use the assumptions that features are local and translation invariant to drastically reduce the number of weights that have to be optimized compared to traditional Neural Networks. The main building blocks of CNNs are convolutional layers, in which kernels of weights are convolved with the image to produce feature maps. A CNN often consists of many stacked convolutional layers and a few Dense layers at the end, in which all inputs and outputs are connected. After each layer a nonlinear activation function is applied, allowing the network to model nonlinear relations. The model is learned by optimizing the weights of the CNN for a loss function on known input-output pairs, using the backpropagation algorithm in combination with a gradient-based optimizer.

For the reconstruction of the g-ODWFS measurements, the input of the CNN consists of the observed normalized differences in both directions, resulting in a 3D input array with a depth of 2. This input array is propagated through a number of convolutional layers. Finally, a single Dense layer produces the output of the model: the estimated modal coefficients present in the wavefront. Using the actuator modes as the modal basis fits the assumptions underlying the convolutional layers, since all actuators have the same local response. This is in contrast to, for example, the Zernike basis, which depends on various global features. As the number of mirror modes and the pupil dimensions differ between the simulations and the lab work, the architectures are slightly different; the details are discussed in their respective sections.

The calibration is done by applying known combinations of the mirror modes on the DM and measuring the response. Using the Adam optimizer [27] we minimize a loss function for the observed measurements and applied modal coefficients. Commonly, the mean squared error is used as the loss function for regression problems. This was also the first loss function we tried for optimizing the CNN, but it did not lead to satisfying results. For AO systems we need to provide gain over the full range of wavefronts that are observed in closed loop, ranging from Strehl ratios of $<0.1\%$ to Strehl ratios of $>80\%$. The mean squared error takes the square of the absolute error, which effectively causes the optimizer to focus more on wavefronts with relatively large input RMS. This would not be a problem for linear regression, because a better solution for the large coefficients would also imply a better solution for the small coefficients due to the linearity of the problem. For nonlinear regression with neural networks this does not hold, because neural networks can represent piecewise functions, which effectively means that the linear range and nonlinear range could be completely decoupled. This makes it easier for the optimizer to reduce the total cost by improving the fit in the large aberration regime while ignoring the small aberration regime. As this is not desirable for AO systems, we have opted to weigh the mean squared error by the square of the RMS of the true modal coefficients, leading to the following relative loss function:

$$L = \frac{\left\langle \left( \hat{c}_i - c_i \right)^2 \right\rangle}{\left\langle c_i^2 \right\rangle + \epsilon}.$$

Here, $\left\langle \right\rangle$ denotes the average along the modes and $\hat{c}_i$ denotes the predicted modal coefficients. The $\epsilon$ term avoids a diverging loss for small input RMS and is an additional hyperparameter that can be tuned. With this relative loss function we force the algorithm to put the same amount of weight on wavefronts with small RMS as on those with large RMS. Our models are implemented and trained using the Keras package [28] for Python.

#### 2.3 CNN+MVM

As will be shown in section 3.4, solely using a CNN results in suboptimal closed-loop performance under simulated atmospheric turbulence. We have solved this problem by decomposing the reconstruction into a linear term $\vec{c}_{l}$ and a nonlinear error term $\vec{c}_{nl}$:

$$\hat{\vec{c}} = \vec{c}_{l} + \vec{c}_{nl}.$$

The MVM method discussed in section 2.1 can be used to estimate the linear term, while the CNN is calibrated to only reconstruct the nonlinear error term:

$$\vec{c}_{l} = A^{+}\vec{s}, \qquad \vec{c}_{nl} = \vec{c} - A^{+}\vec{s}.$$

Where $A^+$ refers to the regularized inverted interaction matrix described in section 2.1. Figure 2 shows the pipeline for reconstructing the wavefront sensor measurements with this approach. The implementation of the CNN is the same as in the previous section.

## 3. Simulations

#### 3.1 Simulation setup

All simulations are done in Python with the hcipy [29] package, which contains an implementation of the g-ODWFS. The deformable mirror has 32 actuators across the pupil diameter with a total of 848 illuminated actuators over the full pupil. Each actuator has a Gaussian influence function with a standard deviation equal to the actuator pitch. Unless noted otherwise, we assume that the WFS measurements contain $10^6$ photons per frame. We test the performance of the reconstruction methods for three different step sizes $\beta$: 0, 1/3 and 1. The g-ODWFS with a step size of 1 is the equivalent of the unmodulated PWFS and gives a good estimate of the performance that could be obtained with that WFS. The physical parameters used in the simulations are shown in Table 1.

#### 3.2 Calibration

The input array for the CNN, containing the normalized differences in both directions, has a shape of 64x64x2. We apply five layers of 3x3 convolutions with 32 kernels to this input array. The third convolutional layer uses a stride of two, reducing the spatial dimensions of the image. After these layers we have a single 1x1 convolution with 4 kernels to reduce the depth, which results in a 32x32x4 array. After each convolutional layer we apply a batch normalization layer [30] and the ReLU activation. Finally, we flatten the array and have one fully connected layer with 848 neurons, equal to the number of mirror modes. This final layer introduces a lot of parameters, most of which are irrelevant because the actuator coefficients should mostly be determined from the extracted nonlinear features located close to the actuator. Therefore, we use $L_1$ regularization in this final layer with a strength of $10^{-6}$, which was found to give good results after tuning by hand. The architecture has a total of 3,512,308 trainable parameters.
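As a sanity check, the quoted number of trainable parameters can be reproduced with simple bookkeeping. The exact layer ordering (batch normalization after each 3x3 convolution but not after the 1x1 convolution) is our inference from the total, as the text does not spell it out.

```python
# Trainable-parameter bookkeeping for the simulation architecture above.
def conv_params(k, c_in, c_out):
    """k x k convolution: one k*k*c_in kernel plus one bias per output map."""
    return k * k * c_in * c_out + c_out

def bn_params(channels):
    """Trainable batch-norm parameters: one scale and one offset per channel."""
    return 2 * channels

total, c_in = 0, 2                  # input: 64x64x2 normalized differences
for _ in range(5):                  # five 3x3 convolutions with 32 kernels
    total += conv_params(3, c_in, 32) + bn_params(32)
    c_in = 32
total += conv_params(1, 32, 4)      # 1x1 convolution reducing the depth to 4
flat = 32 * 32 * 4                  # the stride-2 layer halved 64x64 to 32x32
total += flat * 848 + 848           # fully connected layer, one output per mode

print(total)  # 3512308, matching the quoted 3,512,308 trainable parameters
```

The fully connected layer dominates the count, which is why the $L_1$ regularization is applied there.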

The CNN needs to be calibrated with data that are representative of the distortions that will be observed on sky. One limitation is that the CNN can only be trained on wavefront profiles that are spanned by the DM modes. This means that higher spatial frequencies are not present in the training data, while they are observed on sky. One could model the expected closed-loop spatial power spectral density (PSD) and generate DM profiles approximating that distribution. However, this would bias the CNN: if the statistics of the on-sky residuals differ from those of the training data, the reconstruction may degrade. Furthermore, the closed-loop PSD itself depends on the reconstructor [31], and due to the nonlinearity of the CNN this relation is not trivial. To introduce the least amount of bias in the reconstructor we have opted to train the CNN with independent normally distributed modal coefficients with a given standard deviation. To achieve good performance over a large range of wavefront errors, the standard deviation of the amplitude distribution is sampled from a log-uniform distribution between 0.003 and 5 radians. In total, we simulate 80,000 input-output pairs. From this dataset, 80% is used for training and 20% for validation and model selection. The loss function from Eq. (9) is used with $\epsilon = 0.01$ and we use the Adam algorithm with the default initial hyperparameters from [27] and a batch size of 64 to optimize the weights. To improve the accuracy of the model, we decay the learning rate by a factor of two every 20 epochs. Furthermore, the loss on the validation set is calculated after every epoch and the model is only saved when the validation loss decreases. The total training time of the model is $\sim$2 hours on a 12GB Tesla K80 GPU.
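In code, the relative loss of Eq. (9) looks as follows as we read it; the exact placement of $\epsilon$ in the denominator is our assumption.

```python
import numpy as np

def relative_loss(c_true, c_pred, eps=0.01):
    """Mean squared error per sample, normalized by the squared RMS of the
    true modal coefficients; eps keeps the loss finite for small input RMS."""
    mse = np.mean((c_pred - c_true) ** 2, axis=-1)   # <.> along the modes
    norm = np.mean(c_true ** 2, axis=-1) + eps       # squared input RMS + floor
    return float(np.mean(mse / norm))                # average over the batch
```

Up to the floor $\epsilon$, this loss is scale invariant: a 10% error on a 5 rad wavefront costs the same as a 10% error on a 0.05 rad wavefront, which is exactly the weighting argued for above.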

The interaction matrix is calibrated by applying DM modes with amplitudes of 0.3 rad, to ensure we are within the linearity range of the WFS. The same dataset as in the calibration for the CNN is used to find the optimal regularization strength for the MVM method, except only wavefronts with RMS below 0.3 rad are included. We iteratively test 20 different values of the regularization strength, logarithmically distributed between $10^{-5}$ and $10^5$ and the model with the lowest mean squared error will be used. For the $\beta =0$ profile we have to include larger amplitudes in the optimization of the regularization strength as 0.3 rad is too close to the sensitivity limit of the WFS for an input beam with $10^6$ photons. Since the wavefront sensor response is still linear over the considered range we include all data in the optimization.
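A minimal NumPy sketch of this calibration pipeline, covering both the regularized reconstructor of section 2.1 and the residual target that the CNN of section 2.3 is trained on; the toy interaction matrix and all sizes are illustrative, not the paper's values.

```python
import numpy as np

rng = np.random.default_rng(0)
n_meas, n_modes, alpha = 50, 10, 1e-2   # toy sizes and regularization strength

A = rng.standard_normal((n_meas, n_modes))   # toy interaction matrix
# Precompute the Tikhonov-regularized reconstruction matrix (A^T A + a I)^-1 A^T.
R = np.linalg.solve(A.T @ A + alpha * np.eye(n_modes), A.T)

c_true = rng.standard_normal(n_modes)    # applied modal coefficients
s = A @ c_true                           # noiseless WFS measurements
c_lin = R @ s                            # linear (MVM) estimate
c_nl_target = c_true - c_lin             # nonlinear residual the CNN learns
```

In this linear, noiseless toy case the residual target is nearly zero; with a real, nonlinear sensor response it contains exactly the part of the wavefront the linear model misses.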

#### 3.3 Simulation results: reconstruction of DM modes

We compare the reconstruction of the DM modes for the different reconstruction methods. Normally distributed modal coefficients with a chosen standard deviation were applied on the DM and the calibrated models were used to reconstruct these coefficients. Figure 3 shows the residual root mean square (RMS) phase relative to the input RMS after reconstruction for the different step sizes of the rotation profile. For the profile without a step, we do not observe an increase in performance by using a CNN. This is as expected, since this profile has a close to linear response. The results for the $\beta =1/3$ and $\beta =1$ profiles show a significant improvement in the estimation of large phase aberrations when using a CNN. While the CNN-only reconstructor slightly degrades the estimation of small aberrations, the CNN+MVM approach does not have this problem while still providing a similar improvement for large aberrations. Comparing the different step sizes shows the high sensitivity of the $\beta =1$ profile. However, this filter profile leads to stronger nonlinearities and has a smaller change in response in the nonlinear regime. This is shown by the increased residual RMS for both the linear and nonlinear reconstruction of large aberrations as compared to the filter with a step of 1/3.

#### 3.4 Simulation results: closed loop performance

The closed loop performance of the reconstruction methods is evaluated on a simulated single-layer atmospheric phase screen as implemented in hcipy with given Fried parameter $r_0$ and wind velocity $v$. The simulated AO system corrects at a frequency of 1 kHz and all simulations use a leaky integrator with a gain of 0.3 and leakage of 0.003. Figure 4 shows the results for the simulations under a variety of different atmospheric conditions.
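The controller above is the standard leaky integrator; a sketch under our own variable names:

```python
import numpy as np

GAIN, LEAK = 0.3, 0.003   # loop gain and leakage used in all simulations

def update_command(command, residual_coeffs):
    """One 1 kHz control step: leak the previous DM command towards zero and
    subtract the gain times the freshly reconstructed residual coefficients."""
    return (1.0 - LEAK) * command - GAIN * np.asarray(residual_coeffs)
```

The small leakage slowly forgets the accumulated command, which prevents modes that the sensor senses poorly from winding up on the mirror.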

The simulations show that the CNN-only reconstructor does not perform well in closed loop. One explanation is that higher-order modes, which are not present in the training data, negatively influence the reconstruction. Another explanation may be that the CNN gets stuck on a combination of modes that it cannot reconstruct: due to the nonlinearity of the CNN there may be combinations of WFS measurements that exactly cancel out the output coefficients, which the loop may converge to. Using the CNN as an additional nonlinear term on top of the MVM does not have this issue. We see that this approach leads to an increased Strehl ratio in the simulations with relatively bad conditions. Under these conditions the wavefront sensor operates in its nonlinear regime, where the linear reconstructor underestimates the wavefront aberrations while the CNN correction improves the estimation. With the CNN the AO system is able to lock on to the highly variable turbulence, while the linear reconstructor cannot. For larger filter step sizes $\beta$ the nonlinearity error increases; the extreme case of $\beta =1$ is used to demonstrate this. The $\beta =1$ filter is very similar to the unmodulated pyramid wavefront sensor, which shows that while the CNN is trained for the g-ODWFS, a CNN could also be used to improve the reconstruction of unmodulated PWFS measurements. The resulting Strehl ratios from these simulations are shown in Table 2.

To test the effect of photon noise on the reconstruction, we have simulated the performance of the AO system for different photon fluxes. Figure 5 shows the results of these simulations under conditions of $r_0=0.2$, $v=5$ m/s and $r_0=0.1$, $v= 10$ m/s respectively, all using the $\beta =1/3$ filter profile. The nonlinear correction does not provide any improvement when there are too few photons, because the nonlinear structure is then well within the noise. On the other hand, the nonlinear correction also does not amplify the photon noise, as both methods have the same performance in this regime. When enough photons are available and the nonlinear residuals become measurable, we again see the advantages of the increased effective dynamic range. This means that using the CNN as a nonlinear correction only improves the reconstruction when enough photons are available to sense the nonlinear regime.

## 4. Lab experiments

#### 4.1 Experimental setup

The lab setup uses the AO system of the exoplanet pathfinder instrument LEXI [22], which uses a g-ODWFS as its main wavefront sensor. To generate a diffraction-limited input beam, a 633 nm HeNe laser is injected into a single-mode fiber. A relay system with two achromatic doublets changes the focal ratio from F/3 to F/11. A 140 mm achromatic doublet collimates the beam. The system uses an Alpao 97-15 DM that is placed in the pupil plane of the collimated beam. A second achromatic lens with an identical focal length of 140 mm is used for 1-to-1 imaging. This image plane is the input of the g-ODWFS.

The g-ODWFS uses a 60 mm achromatic doublet to collimate the beam. A Wollaston prism splits the beam into two orthogonal polarizations. A 300 mm achromat focuses these two beams onto the focal plane mask. The focal plane mask is made of a single layer of patterned liquid crystals [32,33] and consists of two adjacent patterns with spatially varying fast axes. The two patterns are used to measure the wavefront gradient in two directions. After the focal plane mask a second Wollaston prism, rotated by 45 degrees with respect to the first one, splits each beam into two again. This results in four output beams that are collimated by a third achromat with a 70 mm focal length onto an EM-CCD camera. The EM-CCD is conjugated to the DM surface, which is sampled with 45 pixels across the pupil.

The actuators of the DM are controlled in terms of their peak-to-valley amplitude. An amplitude of $\pm 1$ corresponds to the maximum actuator displacement of 30 $\mu$m in either direction. Data is taken for normally distributed actuator amplitudes with different variances. For each of the different amplitude variances used, 2000 random combinations of voltages are applied to the actuators and the images are saved. This gives a total of 16,000 measurements from which we use 60% for training, 20% for validation and 20% for comparing the reconstruction.

The CNN architecture is slightly modified for the reconstruction of the lab data. The input array has a shape of 45x45x2 and, since there are significantly fewer actuator modes, we need to further reduce the spatial dimensions of the image. This is done by changing the stride of the first convolutional layer to 2. The fully connected layer now has 97 neurons, the number of actuators of the DM. The altered architecture has a total of 94,029 trainable parameters. The training procedure is the same as in the simulations, but the loss function uses $\epsilon =10^{-5}$ due to the different units of the modal coefficients and the batch size is reduced to 8. The regularization strength for the MVM method is optimized using the data with mode RMS of up to 0.008. We have tested rotation profiles with steps of 0, 1/3 and 2/3. Since the simulations showed that the CNN-only approach is suboptimal, we do not consider it in this section.

#### 4.2 Lab results: reconstruction of DM modes

The performance of the reconstructors is evaluated on the test data and the results are shown in Fig. 6. This demonstrates that the CNN improves the reconstruction of larger modal coefficients. The reconstruction of these large aberrations is better for the $\beta =1/3$ rotation profile than for $\beta =2/3$. The limit is not set by the sensitivity of the WFS but by other error sources. Figure 7 shows the reconstructed coefficients as a function of the input coefficient for different input RMS with the $\beta =1/3$ rotation profile. This shows the underestimation of the coefficients by the linear model for larger aberrations, while the CNN is able to correct for the nonlinearities. These results confirm our findings from the simulations.

## 5. Conclusions & discussion

In this paper, we have demonstrated the use of Convolutional Neural Networks for nonlinear wavefront reconstruction. Both simulations of the g-ODWFS and a lab implementation have shown an improvement in the estimation of large aberrations. Furthermore, we have compared the closed-loop performance of the reconstructors for simulated atmospheric turbulence. This showed that solely using a CNN results in suboptimal performance. On the other hand, using the CNN as a nonlinear correction term on the MVM results in a higher Strehl ratio under conditions where the WFS operates in its nonlinear regime as compared to the standard MVM approach. It was demonstrated that the nonlinear correction only provides improved effective dynamic range when the number of photons is sufficiently high such that nonlinear structures can be measured. The results show that the CNN does not amplify noise, which is a concern for nonlinear reconstruction methods.

The nonlinear correction does come at additional computational cost, often measured in the number of floating point operations (FLOPs). For the architecture used in the simulations, propagating a single measurement through the CNN takes $\sim$73 MFLOPs compared to $\sim$7 MFLOPs for the MVM. For the lab architecture this reduces to $\sim$9 MFLOPs and $\sim$0.4 MFLOPs respectively. The cost of the CNN+MVM approach is the sum of the two. Astronomical AO systems run at roughly 1 kHz, so to reach these speeds the algorithm requires a computer that sustains on the order of 100 GFLOP/s, which is well within current computing capabilities. There are a few possibilities for decreasing the computational demand of our approach. First of all, a lot of research has been done on creating efficient Convolutional Neural Network architectures for low-latency predictions on mobile hardware [34,35]. We did not focus on efficiency and it should be relatively easy to decrease the computational load by optimizing the architecture. Secondly, the matrix multiplications in the prediction can be heavily parallelized on a GPU. Finally, one could consider reconstructing fewer nonlinear modes. However, the wavefront sensor measurements would still need to be propagated through the CNN, even if the output dimension is smaller. Applying software binning to the wavefront sensor measurements may relieve some of the computational demand but may also increase the effect of higher-order modes on the reconstruction.
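The compute budget works out as follows, using the per-frame costs quoted above for the simulation architecture:

```python
# Per-frame costs of the simulation architecture, in MFLOPs, as quoted above.
cnn_mflops, mvm_mflops = 73, 7
frame_rate_hz = 1000                     # typical astronomical AO loop rate

# The CNN+MVM cost is the sum of both terms, sustained at the loop rate.
gflops_per_second = (cnn_mflops + mvm_mflops) * frame_rate_hz / 1000
print(gflops_per_second)  # 80.0
```

At 80 GFLOP/s the CNN+MVM pipeline fits comfortably on a single modern GPU.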

Our work shows that there is still much to be gained by improving wavefront reconstruction algorithms, especially in the nonlinear regime. By applying nonlinear corrections we are able to reach equal or higher Strehl ratios under all atmospheric conditions; in some cases the gain in Strehl is more than 40 percentage points. Adding these nonlinear corrections could be important for the upcoming Extremely Large Telescopes that will need AO to operate their instruments. The precise impact is difficult to estimate because the nonlinearity error depends strongly on the number of turbulence cells across the pupil. All ELTs have a significantly larger aperture than the one considered in this work, and the instruments that will use a PWFS all sense the wavefront at different wavelengths. Both aspects influence the nonlinearity error and make the precise impact instrument specific. It is therefore important both to test the proposed reconstruction method at the telescope, to show that it can operate under realistic turbulence, and to apply the proposed technique with accurate models of the ELT-sized AO instruments.

## Acknowledgments

The authors would like to thank Prof. C.U. Keller for the supervision of the research and Dr. M. Kenworthy for the feedback on this work. We also thank the anonymous reviewers for their comments that helped improve this work.

## Disclosures

The authors declare no conflicts of interest.

## References

**1. **F. Roddier, “The effects of atmospheric turbulence in optical astronomy,” Prog. Opt. **19**, 281–376 (1981). [CrossRef]

**2. **M. Schwertner, M. J. Booth, M. A. Neil, and T. Wilson, “Measurement of specimen-induced aberrations of biological samples using phase stepping interferometry,” J. Microsc. **213**(1), 11–19 (2004). [CrossRef]

**3. **B. L. Ellerbroek and C. R. Vogel, “Inverse problems in astronomical adaptive optics,” Inverse Probl. **25**(6), 063001 (2009). [CrossRef]

**4. **V. Hutterer and R. Ramlau, “Nonlinear wavefront reconstruction methods for pyramid sensors using Landweber and Landweber-Kaczmarz iterations,” Appl. Opt. **57**(30), 8790 (2018). [CrossRef]

**5. **C. T. Heritier, S. Esposito, T. Fusco, B. Neichel, S. Oberti, R. Briguglio, G. Agapito, A. Puglisi, E. Pinna, and P.-Y. Madec, “A new calibration strategy for adaptive telescopes with pyramid WFS,” Mon. Not. R. Astron. Soc. **481**, 2829–2840 (2018). [CrossRef]

**6. **A. Burvall, E. Daly, S. R. Chamot, and C. Dainty, “Linearity of the pyramid wavefront sensor,” Opt. Express **14**(25), 11925–11934 (2006). [CrossRef]

**7. **O. Guyon, “High sensitivity wavefront sensing with a nonlinear curvature wavefront sensor,” Publ. Astron. Soc. Pac. **122**(887), 49–62 (2010). [CrossRef]

**8. **S. Y. Haffert, “Generalised optical differentiation wavefront sensor: a sensitive high dynamic range wavefront sensor,” Opt. Express **24**(17), 18986 (2016). [CrossRef]

**9. **R. Ragazzoni, “Pupil plane wavefront sensing with an oscillating prism,” J. Mod. Opt. **43**(2), 289–293 (1996). [CrossRef]

**10. **R. Ragazzoni, E. Diolaiti, and E. Vernet, “A pyramid wavefront sensor with no dynamic modulation,” Opt. Commun. **208**(1-3), 51–60 (2002). [CrossRef]

**11. **R. Ragazzoni and J. Farinato, “Sensitivity of a pyramidic wave front sensor in closed loop adaptive optics,” Astron. Astrophys. **350**, L23–L26 (1999).

**12. **C. Vérinaud, “On the nature of the measurements provided by a pyramid wave-front sensor,” Opt. Commun. **233**(1-3), 27–38 (2004). [CrossRef]

**13. **O. Fauvarque, B. Neichel, T. Fusco, J.-F. Sauvage, and O. Girault, “General formalism for Fourier-based wave front sensing,” Optica **3**(12), 1440 (2016). [CrossRef]

**14. **S. Esposito, A. Riccardi, E. Pinna, A. Puglisi, F. Quirós-Pacheco, C. Arcidiacono, M. Xompero, R. Briguglio, G. Agapito, L. Busoni, L. Fini, J. Argomedo, A. Gherardi, G. Brusa, D. Miller, J. C. Guerra, P. Stefanini, and P. Salinari, “Large Binocular Telescope Adaptive Optics System: new achievements and perspectives in adaptive optics,” Proc. SPIE **8149**, 814902 (2011). [CrossRef]

**15. **J. R. Males, L. M. Close, O. Guyon, K. M. Morzinski, P. Hinz, S. Esposito, E. Pinna, M. Xompero, R. Briguglio, A. Riccardi, A. Puglisi, B. Mazin, M. J. Ireland, A. Weinberger, A. Conrad, M. Kenworthy, F. Snik, G. Otten, N. Jovanovic, and J. Lozi, “The path to visible extreme adaptive optics with MagAO-2K and MagAO-X,” Proc. SPIE **9909**, 990952 (2016). [CrossRef]

**16. **C. Z. Bond, P. Wizinowich, M. Chun, D. Mawet, S. Lilley, S. Cetre, N. Jovanovic, J.-R. Delorme, E. Wetherell, S. Jacobson, C. Lockhart, E. Warmbier, J. K. Wallace, D. N. Hall, S. Goebel, O. Guyon, C. Plantet, G. Agapito, C. Giordano, S. Esposito, and B. Femenia-Castella, “Adaptive optics with an infrared pyramid wavefront sensor,” Proc. SPIE **10703**, 107031Z (2018). [CrossRef]

**17. **O. Guyon, “Limits of adaptive optics for high contrast imaging,” Astrophys. J. **629**(1), 592–614 (2005). [CrossRef]

**18. **S. Esposito and A. Riccardi, “Pyramid wavefront sensor behavior in partial correction adaptive optic systems,” Astron. Astrophys. **369**(2), L9–L12 (2001). [CrossRef]

**19. **V. Deo, E. Gendron, G. Rousset, F. Vidal, and T. Buey, “A modal approach to optical gain compensation for the pyramid wavefront sensor,” Proc. SPIE **10703**, 1070320 (2018). [CrossRef]

**20. **R. A. Frazin, “Efficient, nonlinear phase estimation with the nonmodulated pyramid wavefront sensor,” J. Opt. Soc. Am. A **35**(4), 594 (2018). [CrossRef]

**21. **V. Hutterer and R. Ramlau, “Nonlinear wavefront reconstruction methods for pyramid sensors using Landweber and Landweber-Kaczmarz iterations,” Appl. Opt. **57**(30), 8790 (2018). [CrossRef]

**22. **S. Y. Haffert, M. J. Wilby, C. U. Keller, I. A. G. Snellen, D. S. Doelman, E. H. Por, M. van Kooten, S. P. Bos, and J. Wardenier, “On-sky results of the Leiden EXoplanet Instrument (LEXI),” Proc. SPIE **10703**, 1070323 (2018). [CrossRef]

**23. **H. Guo, N. Korablinova, Q. Ren, and J. Bille, “Wavefront reconstruction with artificial neural networks,” Opt. Express **14**(14), 6456 (2006). [CrossRef]

**24. **C. Alvarez Diez, F. Shao, and J. Bille, “Pyramid and Hartmann-Shack wavefront sensor with artificial neural network for adaptive optics,” J. Mod. Opt. **55**(4-5), 683–689 (2008). [CrossRef]

**25. **A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1, (2012), pp. 1097–1105.

**26. **K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition, (2016), pp. 770–778.

**27. **D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv e-prints (2014).

**28. **F. Chollet, “Keras,” https://keras.io (2015).

**29. **E. H. Por, S. Y. Haffert, V. M. Radhakrishnan, D. S. Doelman, M. Van Kooten, and S. P. Bos, “High contrast imaging for python (hcipy): an open-source adaptive optics and coronagraph simulator,” Proc. SPIE **10703**, 1070342 (2018). [CrossRef]

**30. **S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37, (2015), pp. 448–456.

**31. **L. Jolissaint, J.-P. Véran, and R. Conan, “Analytical modeling of adaptive optics: foundations of the phase spatial power spectrum approach,” J. Opt. Soc. Am. A **23**(2), 382 (2006). [CrossRef]

**32. **M. N. Miskiewicz and M. J. Escuti, “Direct-writing of complex liquid crystal patterns,” Opt. Express **22**(10), 12691–12706 (2014). [CrossRef]

**33. **D. S. Doelman, F. Snik, N. Z. Warriner, and M. J. Escuti, “Patterned liquid-crystal optics for broadband coronagraphy and wavefront sensing,” Proc. SPIE **10400**, 104000U (2017). [CrossRef]

**34. **A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, “MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications,” arXiv e-prints (2017).

**35. **I. Freeman, L. Roese-Koerner, and A. Kummert, “Effnet: An Efficient Structure for Convolutional Neural Networks,” in Proceedings - International Conference on Image Processing, ICIP, (2018), pp. 6–10.