## Abstract

The modern consumer electronics market dictates the need for small-scale and high-performance cameras. Such designs involve trade-offs between various system parameters. In such trade-offs, Depth Of Field (DOF) is very often a significant issue. We propose a computational imaging-based technique to overcome DOF limitations. Our approach is based on the synergy between a simple phase aperture coding element and a convolutional neural network (CNN). The phase element, designed for DOF extension using color diversity in the imaging system response, causes chromatic variations by creating a different defocus blur for each color channel of the image. The phase-mask is designed such that the CNN model can easily restore an all-in-focus image from the coded image. This is achieved by joint end-to-end training of both the phase element and the CNN parameters using backpropagation. The proposed approach provides superior performance to other methods in simulations as well as in real-world scenes.

© 2018 Optical Society of America under the terms of the OSA Open Access Publishing Agreement

## 1. Introduction

Imaging system design has always been a challenge, due to the need of meeting many requirements with relatively few degrees of freedom. Since digital image processing has become an integral part of almost any imaging system, many optical issues can now be solved using signal processing. However, in most cases the design is done separately, i.e., the optical design is done in the traditional way, aiming at the best achievable optical image, and then the digital stage attempts to improve it even more.

Intuitively, one may argue that a joint design of the optical and signal processing stages may lead to better overall performance. Indeed, such effort is an active research area for many applications, e.g., extended depth of field (EDOF) [1–7], image deblurring both due to optical blur [8] and motion blur [9], high dynamic range [10], depth estimation [8,11–13], light field photography [14], and many more. This holistic design technique is generally known as ’computational imaging’: some changes are made to the image acquisition stage, resulting in an output that is not necessarily the best optical image for a human observer. The follow-up processing, however, takes advantage of the known changes in the acquisition process in order to generate an improved image or to extract additional information from it (such as depth, different viewpoints, motion data etc.) with a quality that is beyond the capabilities of the best achievable optical performance followed by a state-of-the-art image processing method. Still, in the vast majority of computational imaging processes, the optics and post-processing are designed separately, to be adapted to each other, and not in an end-to-end fashion.

In recent years, deep learning (DL) methods ignited a revolution across many domains including signal processing. Instead of attempts to explicitly model a signal, and utilize this model to process it, DL methods are used to model the signal implicitly, by learning its structure and features from labeled datasets of enormous size. Such methods have been successfully used for almost all image processing tasks including denoising [15–17], demosaicing [18], deblurring [19], super-resolution [20], high dynamic range [21], to name a few. The main innovation in the DL approach is that inverse problems are solved by an end-to-end learning of a function that performs the inversion operation, without any explicit signal model.

As mentioned above, DOF imposes limitations in many optical designs. To ease these limitations, several computational imaging approaches have been investigated (a comprehensive review is given in [22]). Among the first is the method of Dowski and Cathey [1], where a cubic phase-mask is incorporated in the imaging system exit pupil. This mask is designed to manipulate the lens point spread function (PSF) to be depth invariant for an extended DOF. The resulting PSF is relatively wide, and therefore the modulation transfer function (MTF) of the system is quite narrow. Images acquired with such a lens are uniformly blurred with the same PSF, and therefore can be easily restored using a non-blind deconvolution method.

Similar approaches avoid the use of the cubic phase-mask (which is not circularly symmetric and therefore requires complex fabrication), and achieve depth invariant PSF using a random diffuser [3, 23] or by enhancing chromatic aberrations (albeit still producing a monochrome image) [2]. The limitation of these methods is that the intermediate optical image quality is relatively poor (due to the narrow MTF), resulting in noise amplification in the deconvolution step.

Other approaches [4,24] have tried to create a PSF with a strong and controlled depth variance, using it as a prior for the image deblurring step. In [4], the PSF is encoded using an amplitude mask that blocks 50% of the input light, which makes it impractical for low-light applications. In [24], the depth dependent PSF is achieved by enhancing axial chromatic aberrations, and then ’transferring’ resolution from one color channel to another (using a RGB sensor). While this method is light efficient, its design imposes two limitations: (i) its production requires custom and non-standard optical design; and (ii) by enhancing axial chromatic aberrations, lateral chromatic aberrations are usually also enhanced.

To overcome these issues, Haim *et al.* [12] suggested to achieve a chromatic and depth dependent PSF using a simple diffractive binary phase-mask element having a concentric ring/s pattern. Such a mask changes the PSF differently for each color channel, thus achieving color diversity in the imaging system response. The all-in-focus image is restored by a sparse-coding based algorithm with dictionaries that incorporate the encoded PSFs response. This method achieves good results, but with relatively high computational cost (due to the sparse coding step).

Note that in all mentioned approaches, the optics and the processing algorithm are designed separately. Thus, the designer has to find a balance between many system parameters: aperture size, number of optical elements, exposure time, pixel size, sensor sensitivity and many other factors. This makes the pursuit of the “correct” parameters tradeoff, which leads to the desired EDOF, harder.

#### Contribution

This work proposes an end-to-end design approach for EDOF imaging. Following the work in [12], a method for DOF extension that can be added to an existing optical design, and as such provides an additional degree of freedom to the designer, is presented. The solution is based on a simple binary phase-mask, incorporated in the imaging system exit pupil (or any of its conjugate optical surfaces). The mask is composed of a ring/s pattern, whereby each ring introduces a different phase-shift to the wavefront emerging from the scene; the resultant image is aperture coded.

Unlike [12], where sparse coding is used, we feed the image to a CNN, which restores the all-in-focus image. Moreover, while in [12] the mask is manually designed, in this work the imaging step is modeled as a layer in the CNN, whose weights are the phase-mask parameters (ring radii and phase). This leads to end-to-end training of the whole system; both the optics and the computational CNN layers are trained together, for a true holistic design of the system. Such a design eliminates the need to determine an optical criterion for the mask design step. In the presented design approach, the optical imaging step and the reconstruction method are learned jointly. This improves the performance of the system as a whole (rather than of each part, as happens when optimizing each of them separately). Figure 1 presents a scheme of the system, and Fig. 4 demonstrates the advantage of training the mask.

Figure 2 presents the MTF curves of the system with the optimized phase-mask incorporated for various defocus conditions. The separation of the RGB channels is clearly visible. This separation serves as a prior for the all in-focus CNN described in the following section.

Different deep end-to-end designs of imaging systems using backpropagation have been presented before, for other image processing or computer vision tasks, such as demosaicking [25], depth estimation [11,13], object classification [26,27] and video compressed sensing [28]. This work differs from all these contributions as it presents a more general design approach, applied for DOF extension, with a possible extension to blind image deblurring and low-light imaging. In addition, we show that the improvement achieved by our mask design is not specific to post processing performed by a neural network; it can also be utilized with other restoration methods, such as sparse coding.

The rest of the paper is organized as follows. Section 2 presents the phase-mask structure and its design principle. Section 3 introduces the end-to-end system design by DL. Section 4 presents simulation results and Section 5 demonstrates the processing of real-world images acquired with our phase coded aperture camera. Section 6 concludes the paper.

## 2. Color diversity phase mask

The proposed method manipulates the PSF/MTF of the imaging system according to a desired joint color and depth variance. Generally, one may design a simple binary phase-mask to generate an MTF with color diversity such that at each depth in the desired DOF, at least one color channel provides a sharp image [12, 29]. This design can be used as-is (without a post-processing step) for simple computer vision applications such as barcode reading [29], and also for all-in-focus image recovery using a dedicated post-processing step [12].

The basic principle behind the mask operation is that a single-ring phase-mask exhibiting a *π*-phase shift for a certain illumination wavelength allows very good DOF extension capabilities [6,7,30,31]. This basic design is extended for the three RGB channels of a color image.

The defocus domain is quantified using the *ψ* defocus measure [30], defined as:

$$\psi = \frac{\pi R^2}{\lambda}\left(\frac{1}{z_o} + \frac{1}{z_{img}} - \frac{1}{f}\right) = \frac{\pi R^2}{\lambda}\left(\frac{1}{z_{img}} - \frac{1}{z_i}\right) = \frac{\pi R^2}{\lambda}\left(\frac{1}{z_o} - \frac{1}{z_n}\right), \qquad (1)$$

where *z*_{img} is the sensor plane location for an object in the nominal position (*z*_{n}); *z*_{i} is the ideal image plane for an object located at *z*_{o}; *f* and *R* are the imaging system focal length and exit pupil radius; and *λ* is the illumination wavelength. The phase shift *ϕ* applied by a phase ring is expressed as:

$$\varphi = \frac{2\pi}{\lambda}\left[n - 1\right]h, \qquad (2)$$

where *λ* is the illumination wavelength, *n* is the refractive index, and *h* is the ring height. Notice that the performance of such a mask is very sensitive to the illumination wavelength. Taking advantage of the nature of the diffractive optical element structure, such a mask can be designed for a significantly different response for each band in the illumination spectrum. For the common color RGB sensor, three ’separate’ system behaviors can be generated with a single mask, such that in each depth of the scene, a different channel is in focus while the others are not. Such mask designs have been proposed in [12,29] using optical considerations and intuition. An important property of our phase-mask (and therefore of our entire imaging system) is that it manipulates the PSF of the system as a function of the defocus condition (Eq. (1)). In other words, our method is not designed to handle a specific DOF (in meters), but a certain defocus domain in the vicinity of the original focus point (*ψ* = 0). Thus, a reconstruction algorithm based on such a phase-mask depends on the defocus range rather than on the actual depth of the scene.
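As a rough numerical illustration of the two quantities above, the following sketch evaluates the defocus measure and the ring phase shift. It assumes the forms *ψ* = (*πR*²/*λ*)(1/*z*_{o} − 1/*z*_{n}) and *ϕ* = 2*π*(*n* − 1)*h*/*λ*; the function names and the numeric values (pupil radius, distances) are hypothetical and chosen only for illustration.

```python
import math

def defocus_psi(R, lam, z_o, z_n):
    """Defocus measure psi = (pi R^2 / lam) * (1/z_o - 1/z_n); lengths in meters."""
    return math.pi * R**2 / lam * (1.0 / z_o - 1.0 / z_n)

def ring_phase(lam, n, h):
    """Phase shift of a ring of height h and refractive index n at wavelength lam."""
    return 2.0 * math.pi * (n - 1.0) * h / lam

# Hypothetical system: 1 mm exit pupil radius, blue light, focus set to 1.5 m.
R, lam, z_n = 1.0e-3, 455e-9, 1.5
for z_o in (1.5, 1.0, 0.7):
    print(f"z_o = {z_o:.1f} m -> psi = {defocus_psi(R, lam, z_o, z_n):.2f}")
```

An object at the focus distance gives *ψ* = 0, and *ψ* grows as the object moves closer to the lens, which is why the reconstruction depends on the defocus range rather than on metric depth.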

These days, the DL revolution has shown that an end-to-end learning process based on large datasets, which is devoid of previous designer intuitions, generally leads to improved performance. In view of this notion, the phase-mask is designed together with the CNN model using backpropagation. In order to make such a design, the optical imaging operation is modeled as the first layer of the CNN. In this case, the weights of the optical imaging layer are the phase-mask parameters: the phase ring/s radii *r*_{i}, and the phase shifts *ϕ*_{i}. Thus, the imaging operation is modeled as the ’forward step’ of the optical imaging layer, according to the imaging model presented in [32]. To design the phase-mask pattern in conjunction with the CNN using backpropagation [33], one requires computation of the relevant derivatives (*∂PSF*/*∂r*_{i}, *∂PSF*/*∂ϕ*_{i}) when the ’backward pass’ is carried out. Using backpropagation theory, the optical imaging layer is integrated in a DL model, and its weights (i.e. the phase-mask parameters) are learned together with the classic CNN model so that optimal end-to-end performance is achieved (a detailed description of the forward and backward steps of the optical imaging layer is provided in Appendix A).

As mentioned above, the optical imaging layer is *ψ*-dependent and not distance dependent. The *ψ* dependency enables an arbitrary setting of the focus point, which in turn ’spreads’ the defocus domain under consideration for a certain depth range, as determined by Eq. (1). This is advantageous since the CNN is trained in the *ψ* domain, and thereafter one can translate it to various scenes where actual distances appear. The range of *ψ* values on which we optimize our network is a hyper-parameter of the optical imaging layer. Its size trades off the depth range for which the network performs the all-in-focus operation against the reconstruction accuracy.
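To make the ’spreading’ of the defocus domain concrete, the sketch below inverts the defocus relation, assumed here to be *ψ* = (*πR*²/*λ*)(1/*z*_{o} − 1/*z*_{n}), and maps a fixed *ψ* range to a metric depth range for two focus settings. The pupil radius and the helper name `object_distance` are hypothetical.

```python
import math

def object_distance(psi, z_n, R, lam):
    """Object distance z_o (meters) that produces defocus psi when focused at z_n."""
    return 1.0 / (psi * lam / (math.pi * R**2) + 1.0 / z_n)

R, lam = 1.0e-3, 455e-9  # hypothetical pupil radius; blue wavelength
psi_max = 8.0            # upper end of the trained defocus domain

for z_n in (1.5, 2.2):
    near = object_distance(psi_max, z_n, R, lam)
    print(f"focus at {z_n} m: psi in [0, {psi_max:g}] covers {near:.2f} m to {z_n} m")
```

The same *ψ* interval covers a different metric range for each focus point, so a network trained in the *ψ* domain does not need retraining when the focus point is moved.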

In our analysis, the domain was set to *ψ* = [0, 8], as it provides a good balance between the reconstruction accuracy and depth of field size. For such setting, we have examined a circularly symmetric phase-ring/s pattern, having up to three rings. Such phase-mask patterns are trained along with the all in-focus CNN (described in the following section). We found out that a single phase-ring mask was sufficient to provide most of the required PSF coding, and the added-value of additional phase rings is negligible. Thus, in the performance vs. fabrication complexity tradeoff, a single-ring mask was selected.

The optimized parameters of the mask are **r** = [0.68, 1] and *ϕ* = 2.89*π* (both *ψ* and *ϕ* are defined for the blue wavelength, where the RGB wavelengths taken are the peak wavelengths of the camera color filter response: *λ*_{R,G,B} = [600, 535, 455]*nm*). Since the solved optimization problem is non-convex, a global/local minima analysis is required. Various initial guesses for the mask parameters were tried. For the domain of 0.6 < *r*_{1} < 0.8, 0.8 < *r*_{2} < 1 and 2*π* < *ϕ* < 4*π* the process converged to the same values mentioned above. However, for initial values outside this domain, the convergence was not always to the same minimum (which is probably the global one). Therefore, our process has some sensitivity to the initial values (as does almost any non-convex optimization), but this sensitivity is relatively low. It can be mitigated by trying several initialization points and then picking the one with the best minimum value.
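Because the phase shift is defined at the blue wavelength, the shift seen by the other channels can be estimated by scaling with wavelength. The sketch below assumes the phase is inversely proportional to *λ* (i.e. the refractive index is treated as wavelength independent, which is an approximation not stated in the text); the helper name `channel_phase` is hypothetical.

```python
import math

LAMBDA_NM = {"R": 600.0, "G": 535.0, "B": 455.0}  # peak wavelengths from the text
PHI_BLUE = 2.89 * math.pi                          # optimized phase shift at blue

def channel_phase(lam_nm, phi_ref=PHI_BLUE, lam_ref_nm=LAMBDA_NM["B"]):
    """Phase shift at lam_nm, assuming phi is proportional to 1/lambda (no dispersion)."""
    return phi_ref * lam_ref_nm / lam_nm

for name, lam_nm in LAMBDA_NM.items():
    phi = channel_phase(lam_nm)
    wrapped = phi % (2.0 * math.pi)
    print(f"{name}: phi = {phi / math.pi:.2f} pi (mod 2 pi: {wrapped / math.pi:.2f} pi)")
```

Under this assumption each channel sees a clearly different effective phase step, which is the source of the color diversity in the system response.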

## 3. All-in-focus CNN

As mentioned above, the first layer of our proposed CNN model is the optical imaging layer. It simulates the imaging operation of a lens with the color diversity phase-mask incorporated. Thereafter, the imaging output is fed to a conventional CNN model that restores the all-in-focus image. Here, we rely on the power of DL to jointly design the phase-mask and the network that restores the all-in-focus image.

Our EDOF scheme can be considered as a partially blind deblurring problem (partially, since we limit ourselves to blur kernels inside the required EDOF). Every deblurring problem is essentially an ill-posed problem. Yet, in our case the phase-mask operation makes this inverse problem more well-posed by manipulating the response between the different RGB channels, which makes the blur kernels coded in a known manner. Due to that fact, we claim that a relatively small CNN model can approximate this inversion function. One may consider that in some sense the optical imaging step (carried with a phase-mask incorporated in the pupil plane of the imaging system) performs part of the required CNN operation, with no conventional processing power needed. Moreover, the optical imaging layer ’has access’ to the object distance (or defocus condition), and as such it can use it for smart encoding of the image. A conventional CNN (operating on the resultant image) cannot perform such encoding. Therefore the phase coded aperture imaging leads to an overall better deblurring performance of the network.

The main challenge for developing such a model is the requirement for extensive training data. We train our model to restore all-in-focus natural scenes that have been blurred with the color diversity phase-mask PSFs. This task is generally considered as a local task (image-wise), and therefore we set the training patches size to 64 × 64 pixels. Following the locality assumption, if we inspect natural images in local neighborhoods, i.e., focus on small patches in them, almost all of these patches seem like a part of a generic collection of various textures. In view of this claim, we chose to train the CNN model with the Describable Textures Dataset (DTD) [34], which is a large dataset of various natural textures. We took 20*K* texture patches of size 64 × 64 pixels. Each patch is replicated a few times such that each replication corresponds to a different depth in the DOF under consideration. In addition, data augmentation by rotations of 90°, 180° and 270° was used, to achieve rotation-invariance in the CNN operation. 80% of the data is used for training, and the rest for validation.
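The dataset preparation described above can be sketched as follows. This is only an illustration under stated assumptions: the *ψ* sampling grid, the random stand-in patches, and the choice to split at the patch level before augmentation are hypothetical, and the blurring itself is left to the optical imaging layer.

```python
import numpy as np

def expand(patches, psi_values):
    """Replicate each patch once per psi value and per 90-degree rotation."""
    samples, psis = [], []
    for patch in patches:
        for psi in psi_values:
            for k in range(4):  # rotations by 0, 90, 180 and 270 degrees
                samples.append(np.rot90(patch, k, axes=(0, 1)))
                psis.append(psi)
    return np.stack(samples), np.asarray(psis, dtype=float)

def build_sets(patches, psi_values, train_fraction=0.8, seed=0):
    """Split patches 80/20, then expand each subset over psi and rotations."""
    order = np.random.default_rng(seed).permutation(len(patches))
    n_train = int(train_fraction * len(patches))
    train = expand(patches[order[:n_train]], psi_values)
    val = expand(patches[order[n_train:]], psi_values)
    return train, val

# Stand-in data: 10 random 64x64 RGB patches instead of real texture patches.
patches = np.random.default_rng(1).uniform(0, 255, size=(10, 64, 64, 3))
psi_values = np.linspace(0.0, 8.0, 5)  # hypothetical sampling of the defocus domain
(train_x, train_psi), (val_x, val_psi) = build_sets(patches, psi_values)
print(train_x.shape, val_x.shape)
```

Splitting before augmentation keeps rotated copies of the same patch out of both subsets at once; the text does not specify how the split was performed.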

Figure 3 presents the all-in-focus CNN model. It is based on consecutive layers composed of a convolution (CONV), Batch Normalization (BN) and the Rectified Linear Unit (ReLU). Each CONV layer contains 32 channels with 3 × 3 kernel size. In view of the model presented in [19], the convolution dilation parameter (denoted by *d* in Fig. 3) is increased and then decreased, for receptive field enhancement. Since the target of the network is to restore the all-in-focus image, it is much easier for the CNN model to estimate the required ’correction’ to the blurred image instead of the corrected image itself. Therefore, a skip connection is added from the imaging result directly to the output, in such a way that the consecutive convolutions estimate only the residual image. Note that the model does not contain any pooling layers and the CONV layers stride is always one, meaning that the CNN output size is equal to the input size.
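The effect of dilation on the receptive field can be checked with a few lines of arithmetic. The dilation schedule below is hypothetical (the actual values appear only in Fig. 3); the formula holds for any stack of stride-1 convolutions.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field (pixels per axis) of stacked stride-1 dilated convolutions."""
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d
    return rf

dilated = [1, 2, 3, 4, 3, 2, 1]  # hypothetical increase-then-decrease schedule
plain = [1] * len(dilated)       # same depth without dilation

print("dilated stack:", receptive_field(3, dilated))
print("plain stack:  ", receptive_field(3, plain))
```

With the same number of 3 × 3 layers, the dilated stack sees a considerably larger neighborhood, which helps when the defocus kernels are wide.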

We evaluate the restoration error using the L1 loss function. The L1 loss serves as a good error measure for image restoration since, unlike the L2 loss, it does not over-penalize large errors, which results in a better image restoration for a human observer [35]. The network was trained using an SGD+momentum solver (with *γ* = 0.9), with a batch size of 100, weight decay of 5e−4 and learning rate of 1e−4 for 2500 epochs. Both training and validation loss functions converged to *L*_{1} ≈ 6.9 (on a [0, 255] image intensity scale), indicating good reconstruction accuracy and negligible over-fitting.
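The difference between the two losses can be seen on a toy example. The sketch below compares a small uniform error with a single large error of the same total absolute magnitude; the arrays are synthetic and chosen only for illustration.

```python
import numpy as np

def l1_loss(pred, target):
    """Mean absolute error."""
    return np.mean(np.abs(pred - target))

def l2_loss(pred, target):
    """Mean squared error."""
    return np.mean((pred - target) ** 2)

target = np.zeros(100)
uniform = target + 2.0   # every pixel is off by 2 gray levels
outlier = target.copy()
outlier[0] = 200.0       # one pixel is off by 200 gray levels

for name, pred in (("uniform", uniform), ("outlier", outlier)):
    print(f"{name}: L1 = {l1_loss(pred, target):.1f}, L2 = {l2_loss(pred, target):.1f}")
```

Both predictions have the same L1 loss, while the L2 loss of the single-outlier case is far larger, so an L2-trained network is pushed much harder to suppress isolated large errors.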

Since the mask fabrication process has its inherent errors, sensitivity analysis is required. By fixing the CNN computational layers and perturbing the phase-mask parameters, it can be deduced that fabrication errors of 5% (either in **r** or *ϕ*) result in a performance degradation of 0.5%, which is tolerable. Moreover, to compensate for these errors one may fine-tune the CNN computational layers with respect to the fabricated phase-mask with its known errors, and then most of the lost quality is regained.

Due to the locality assumption and the training dataset generation process, the trained CNN both (i) encapsulates the inversion operation of all the PSFs in the required DOF; and (ii) performs a relatively local operation. Thus, a real-world image comprising an extensive depth can be processed ’blindly’ with our restoration model; each different depth (i.e. defocus kernel) in the image is restored appropriately, with no additional guidance on the scene structure.

## 4. Simulation results

To demonstrate the advantage of our end-to-end training of the mask and the reconstruction CNN, we first test it using simulated imaging. As an input, we take an image from the ’TAU-Agent’ dataset [11]. The Agent dataset includes synthetic realistic scenes created using ’Blender’ computer graphics software. Each scene consists of an all-in focus image with low-noise level, along with its corresponding pixel-wise accurate depth map. Such data enables an exact depth dependent imaging simulation, with the corresponding DOF effects.

For demonstration, we took a close-up photo image of a man’s face, with a wall in the background (see Fig. 4(a)). Such a scene serves as a ’stress-test’ for an EDOF camera, since focus on both the face and the wall cannot be maintained. For performance comparison, we took a smart-phone camera with a lens similar to the one presented in [36] (*f* = 4.5*mm*, *F*# = 2.5), and a sensor with pixel size of 1.2*μm* . We simulate the imaging process of a system with our learned phase coded aperture on this image, and then process it with the corresponding CNN.

For comparison, we performed the same process using the EDOF method of Dowski and Cathey [1] (with the mask parameter *α* = 40). Since [1] is based on a depth invariant (and therefore wide) PSF for the entire DOF under consideration, such comparison is partially valid for [3,23] too (since they also achieve depth invariant PSF as [1], but with a mask that is much easier to fabricate). We present two variants of the Dowski and Cathey method: with the original processing (simple Wiener filtering), and using one of the state-of-the-art non-blind image deblurring methods (Zhang *et al.* [19]). In both cases, a very moderate noise is added to the imaging result, simulating a high quality sensor noise in very good lighting conditions (AWGN with *σ* = 3). The intermediate images and the post processing results are presented in Fig. 4. One can see that the method of [1] is very sensitive to noise (in both processing methods), due to the narrow bandwidth MTF of the imaging system and the noise amplification of the post-processing stage. Ringing artifacts are also very dominant. In our method, where in each depth a different color channel provides good resolution, the deblurring operation is considerably more robust to noise and provides much better results.
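For readers unfamiliar with the Wiener filtering baseline mentioned above, a minimal frequency-domain sketch follows. It is not the implementation used in the comparison: the box-blur PSF, the constant noise-to-signal ratio `nsr`, and the circular boundary handling are all assumptions made for illustration.

```python
import numpy as np

def wiener_deconvolve(blurred, psf, nsr):
    """Wiener filter in the frequency domain.

    psf must have the image shape with its centre at index (0, 0);
    nsr is an assumed constant noise-to-signal power ratio.
    """
    H = np.fft.fft2(psf)
    G = np.conj(H) / (np.abs(H) ** 2 + nsr)
    return np.real(np.fft.ifft2(np.fft.fft2(blurred) * G))

rng = np.random.default_rng(0)
image = rng.uniform(0, 255, size=(64, 64))

# Hypothetical depth-invariant blur: a 5x5 box kernel centred at the origin.
psf = np.zeros((64, 64))
psf[:5, :5] = 1.0 / 25.0
psf = np.roll(psf, shift=(-2, -2), axis=(0, 1))

blurred = np.real(np.fft.ifft2(np.fft.fft2(image) * np.fft.fft2(psf)))
noisy = blurred + rng.normal(0.0, 3.0, size=image.shape)  # AWGN with sigma = 3

restored = wiener_deconvolve(noisy, psf, nsr=1e-2)
print("mean abs error after restoration:", np.mean(np.abs(restored - image)))
```

Where the transfer function `H` is small, the filter gain grows toward `1 / nsr`-limited values, which is the mechanism behind the noise amplification seen with narrow-MTF systems.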

In order to estimate the contribution of the phase-mask parameters training compared to a mask designed separately, we performed a similar simulation with the mask presented by Haim *et al.* [12] and a CNN model fine tuned for it (similar model to ours but without training the mask parameters). The results are presented in Figs. 4(g) and 4(j). It is evident that while using a separately designed mask based on optical considerations leads to good performance, a joint training of the phase-mask along with the CNN results in an improved overall performance. In addition, the phase-mask trained along with the CNN achieves improved performance even when using the sparse coding based processing presented in [12] (see Figs. 4(f) and 4(i)). Therefore, the design of optics related parameters using CNN and backpropagation seems to be effective also when other processing methods are used.

## 5. Experimental results

Following the satisfactory simulation results, we proceeded to an experimental demonstration of our method. We have fabricated our phase-mask (described in Section 2) and incorporated it in the aperture stop of a *f* = 16*mm* lens (see Fig. 1), mounted on a 18MP sensor with pixel size of 1.25*μm*. This phase coded aperture camera performs the learned optical imaging layer, and then the all-in-focus image can be restored using our trained CNN model. The lens equipped with the mask performs the phase-mask based imaging, simulated by the optical imaging layer described in Section 2.

We hereby present the all-in-focus camera performance for three indoor scenes and one outdoor scene. In the indoor scenes, the focus point is set to 1.5*m*, and therefore the EDOF domain covers the range between 0.5 − 1.5*m* . We have composed several scenes with such depth, each one containing several objects laid on a table, with a printed photo in the background (see Fig. 5, top two and bottom left images). In the outdoor scene (Fig. 5, bottom right), we set the focus point to 2.2*m*, spreading the EDOF to 0.7 − 2.2*m*. Since our model is trained on a defocus domain and not on a metric DOF, the same CNN is used for both scenarios.

Our performance is compared to two other methods (Figs. 6–9): Krishnan *et al.* blind deblurring method [37] (on the clear aperture image), and the phase coded aperture method of Haim *et al.* [12], implemented using our learned phase-mask. One can see in the zoom-in figures (Figs. 6–9) that our performance is better than Krishnan *et al.*, and slightly better than Haim *et al.* Note that we use our optimized mask with the method of Haim *et al.*, which leads to improved performance compared to the manually designed mask as discussed in Section 4. Besides the reconstruction performance, our method outperforms both methods also in runtime by 1–2 orders of magnitude as detailed in Table 5. For the comparison all timings were done on the same machine: Intel i7-2620 CPU and NVIDIA GTX 1080Ti GPU. All the algorithms have been implemented in MATLAB: [37] using the code published by the authors; [12] using the SPAMS toolbox [38]; and [19] and ours using MatConvNet [39]. The fast reconstruction is achieved due to the fact that using a learned phase-mask in the optical train enables reconstruction with a relatively small CNN model.

## 6. Summary and conclusions

We present an approach for DOF extension using joint processing by a phase coded aperture in the image acquisition process, followed by a corresponding CNN model. The phase-mask is designed to encode the imaging system response in a way that the PSF is both depth and color dependent. Such encoding enables an all-in-focus image restoration using a relatively simple and computationally efficient CNN.

In order to achieve a better over-all performance, the phase-mask and the CNN are optimized together and not separately as in the common practice. In view of the end-to-end learning approach of DL, we model the optical imaging as a layer in the CNN model, and its parameters are ’trained’ along with the CNN model. This joint design achieves two goals: (i) it leads to a true synergy between the optics and the post-processing step, to obtain optimal performance; and (ii) it frees the designer from formulating the optical optimization criterion in the phase-mask design step.

Improved performance compared to other competing methods, in both reconstruction accuracy and run-time, is achieved. An important advantage of our method is that the phase-mask can be easily added to an existing lens, and therefore our solution for EDOF can be used by any optical designer to compensate for other parameters. The fast run-time allows fast focusing, and in some cases may even spare the need for a mechanical focusing mechanism. The final all-in-focus image can be used both in computer vision applications, where EDOF is needed, and in ’artistic photography’ applications for applying refocusing/Bokeh effects after the image has been taken.

The proposed joint optical and computational processing scheme can be used for other image processing applications such as blind deblurring and low-light imaging. In blind deblurring, it would be possible to use a similar scheme for ’partial blind deblurring’ (i.e. having a closed set of blur kernels such as in the case of motion blur). In low-light imaging, it is desirable to increase the aperture size as larger apertures give more light. Our solution can overcome the DOF issue and allow more light throughput in such scenarios.

## Appendix A. Optical imaging as a CNN layer

As described in the paper, our all-in-focus imaging method is based on a lens equipped with a binary-rings phase-mask (incorporated in the lens aperture) that manipulates the lens Point Spread Function (PSF) to be both depth and color dependent. The resultant coded images are later processed by a conventional CNN in order to restore the all-in-focus image of the scene. The coded image processing is done using a Deep Learning (DL) model. In order to have an end-to-end DL based solution, we model the optical imaging step as the first layer in the deep network, and optimize its parameters using backpropagation, along with the network weights. In the following we present in detail the forward and backward models of the optical imaging layer.

## A.1. Forward model

Following the imaging system model presented in [32], the physical imaging process is modeled as a convolution of the aberration free geometrical image with the imaging system PSF. In other words, the final image is the scaled projection of the scene to the image plane, convolved with the system’s PSF, which contains all the system properties: wave aberrations, chromatic aberrations and diffraction effects. Note that in the presented model, the geometric image is a perfect reproduction of the scene (up to scaling), with no resolution limit. In this model, the PSF calculation contains all the optical properties of the system (both geometrical aberrations and diffraction effects). Following [32], the PSF of an incoherent imaging system is defined as:

$$PSF = \left|h_c\right|^2 = \left|\mathcal{F}\left\{P(\rho ,\theta )\right\}\right|^2, \qquad (3)$$

where *h*_{c} is the coherent system impulse response, and *P*(*ρ*, *θ*) is the system’s exit pupil function (the amplitude and phase profile in the imaging system exit pupil). The pupil function reference is a perfect quadratic phase function of proper curvature to focus at the ideal imaging point. Therefore, for an in-focus and aberration free (or diffraction limited) system, the pupil function is just the identity for the amplitude in the active area of the aperture, and zero for the phase.
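The relation between the pupil function and the PSF can be sketched numerically with a discrete Fourier transform. The example assumes the PSF is the squared magnitude of the Fourier transform of the pupil function; the grid size, zero-padding factor, and function names are hypothetical choices, and the MTF is taken as the normalized magnitude of the PSF spectrum.

```python
import numpy as np

def pupil_radius_grid(n=64, pad=4):
    """Normalized pupil radius on an (n*pad)^2 grid; the pupil edge is at rho = 1."""
    size = n * pad
    x = (np.arange(size) - size // 2) / (n / 2)
    xx, yy = np.meshgrid(x, x)
    return np.hypot(xx, yy)

def psf_from_pupil(pupil):
    """Incoherent PSF: squared magnitude of the pupil's Fourier transform."""
    h_c = np.fft.fftshift(np.fft.fft2(np.fft.ifftshift(pupil)))
    psf = np.abs(h_c) ** 2
    return psf / psf.sum()  # normalize to unit energy

rho = pupil_radius_grid()
pupil = (rho <= 1.0).astype(complex)  # in-focus, aberration-free clear aperture
psf = psf_from_pupil(pupil)

mtf = np.abs(np.fft.fft2(np.fft.ifftshift(psf)))
mtf /= mtf[0, 0]                      # unit response at zero frequency

print("PSF peak location:", np.unravel_index(np.argmax(psf), psf.shape))
```

For the clear aperture this yields the familiar Airy-like pattern centred on the grid; any amplitude or phase change in `pupil` changes the PSF and MTF accordingly.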

**Out-of-Focus (OOF) imaging:** An imaging system acquiring an object in OOF conditions suffers from blur that degrades the image quality. This results in low contrast, loss of sharpness and even loss of information. The OOF error is expressed analytically as a quadratic phase wave-front error in the pupil function. In order to quantify the defocus condition, we introduce the parameter *ψ*. For the case of a circular aperture, we define *ψ* as:

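$$\psi = \frac{\pi R^2}{\lambda}\left(\frac{1}{z_o} + \frac{1}{z_{img}} - \frac{1}{f}\right) = \frac{\pi R^2}{\lambda}\left(\frac{1}{z_{img}} - \frac{1}{z_i}\right) = \frac{\pi R^2}{\lambda}\left(\frac{1}{z_o} - \frac{1}{z_n}\right), \qquad (4)$$

where *z*_{img} is the image distance (or sensor plane location) of an object in the nominal position *z*_{n}, *z*_{i} is the ideal image plane for an object located at *z*_{o}, *f* is the focal length, *R* is the exit pupil radius and *λ* is the illumination wavelength. The defocus parameter *ψ* measures the maximum quadratic phase error at the exit pupil edge, i.e., for a circular pupil:

$$P_{OOF} = P(\rho ,\theta )\exp\left\{j\psi \rho^2\right\}, \qquad (5)$$

where *P*_{OOF} is the OOF pupil function, *P*(*ρ*, *θ*) is the in-focus pupil function, and *ρ* is the normalized pupil coordinate.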

**Aperture Coding:** As mentioned above, the pupil function represents the amplitude and phase profile in the imaging system exit pupil. Therefore, by adding a coded pattern (amplitude, phase or both) at the exit pupil, the PSF of the system can be manipulated by some pre-designed pattern. (The exit pupil is not always accessible. Therefore, the mask may be added also in the aperture stop, entrance pupil, or in any other surface conjugate to the exit pupil.) The coded pupil function can be expressed as:

$$P_{CA} = P(\rho ,\theta )\,CA(\rho ,\theta ), \qquad (6)$$

where *P*_{CA} is the coded aperture pupil function, *P*(*ρ*, *θ*) is the in-focus pupil function, and *CA*(*ρ*, *θ*) is the aperture coding function. In our case of phase-mask for aperture coding, *CA*(*ρ*, *θ*) is a circularly symmetric piece-wise constant function representing the phase rings pattern. For the sake of simplicity, we will consider a single ring phase-mask, applying a *ϕ* phase shift in a ring starting at *r*_{1} to *r*_{2}. Therefore, *CA*(*ρ*, *θ*) = *CA*(**r**, *ϕ*), where:

$$CA(\mathbf{r},\varphi ) = \begin{cases}\exp\left\{j\varphi \right\}, & r_1 < \rho < r_2\\ 1, & \text{otherwise}\end{cases} \qquad (7)$$

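**Depth dependent coded PSF:** By combining the various factors, we can formulate the complete term for the depth dependent coded pupil function:

$$P(\psi) = P(\rho, \theta)\,CA(\mathbf{r}, \varphi)\exp\left\{j\psi\rho^2\right\}. \tag{8}$$

From this pupil function, the depth dependent coded $PSF(\psi)$ can be easily calculated.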
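The chain from coded pupil to PSF can be sketched numerically as follows. This is a minimal NumPy illustration, assuming that the PSF is the squared magnitude of the Fourier transform of the pupil function and that the mask has a single phase ring. The function names, grid size, zero-padding factor, unit-sum normalization and mask parameters are all assumptions made for the example, not values from the paper.

```python
import numpy as np

def coded_pupil(psi, r1, r2, phi, n=128, extent=2.0):
    """Sample P(psi) = P(rho, theta) * CA(r, phi) * exp(j*psi*rho^2) on an n x n grid.

    The grid spans [-extent, extent) in normalized pupil units, so the
    unit-radius aperture is zero-padded and the PSF is sampled more finely.
    """
    x = np.linspace(-extent, extent, n, endpoint=False)
    xx, yy = np.meshgrid(x, x)
    rho = np.hypot(xx, yy)
    aperture = (rho <= 1.0).astype(float)         # in-focus pupil: unit amplitude, zero phase
    ring = (rho > r1) & (rho < r2)
    ca = np.where(ring, np.exp(1j * phi), 1.0)    # single-ring phase mask
    return aperture * ca * np.exp(1j * psi * rho**2)  # quadratic defocus phase

def psf_from_pupil(pupil):
    """Return |F{P}|^2, centred and normalized to unit sum (normalization is an assumption)."""
    field = np.fft.fftshift(np.fft.fft2(np.fft.ifftshift(pupil)))
    psf = np.abs(field) ** 2
    return psf / psf.sum()

# Hypothetical parameters, for illustration only
psf_in_focus = psf_from_pupil(coded_pupil(psi=0.0, r1=0.5, r2=0.8, phi=0.0))
psf_coded = psf_from_pupil(coded_pupil(psi=4.0, r1=0.5, r2=0.8, phi=np.pi / 2))
print(psf_in_focus.max(), psf_coded.max())
```

With `phi=0.0` the mask has no effect, so the first PSF is that of a clear in-focus aperture. The second shows how defocus and the phase ring together spread the energy away from the central peak.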

**Imaging Output:** Using the coded aperture PSF, the imaging output can be calculated simply by convolving the input image with the PSF:

$$I_{out} = I_{in} * PSF(\psi). \tag{9}$$
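The imaging output as a convolution of the input with the PSF can be sketched per color channel as below. The sketch assumes a scene at a single depth (so one PSF per channel applies to the whole image) and periodic boundary conditions, which make the convolution a product in the Fourier domain. The function name `image_through_psf` and the toy PSFs are hypothetical.

```python
import numpy as np

def image_through_psf(image, psfs):
    """Circularly convolve each color channel of `image` (H, W, C) with its own PSF.

    `psfs` has shape (C, H, W); each PSF is centred at (H // 2, W // 2).
    """
    out = np.empty(image.shape, dtype=float)
    for c in range(image.shape[2]):
        otf = np.fft.fft2(np.fft.ifftshift(psfs[c]))   # move the PSF centre to the origin
        out[..., c] = np.real(np.fft.ifft2(np.fft.fft2(image[..., c]) * otf))
    return out

# Toy data: a random "scene" and hypothetical PSFs
rng = np.random.default_rng(0)
scene = rng.random((32, 32, 3))
psfs = np.zeros((3, 32, 32))
psfs[:, 16, 16] = 1.0                  # delta PSF: the channel passes through unchanged
psfs[0] = 0.0
psfs[0, 15:18, 15:18] = 1.0 / 9.0      # 3x3 box blur on the first channel only
coded = image_through_psf(scene, psfs)
```

Here only the first channel is blurred, a crude stand-in for the situation in which each color channel sees a different defocus blur.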

## A.2. Backward model

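As described in the previous subsection, the forward model of the optical imaging layer is expressed as:

$$I_{out} = I_{in} * PSF(\psi). \tag{10}$$

The $PSF(\psi)$ varies with both color and depth (through *ψ*), but it also has a fixed dependence on the phase ring pattern parameters **r** and *ϕ*, as expressed in Eq. (8). In the network training process, we are interested in determining both **r** and *ϕ*. Therefore, we should evaluate three separate derivatives: $\partial I_{out}/\partial r_i$ for *i* = 1, 2 (the inner and outer radius of the phase ring, as detailed in Eq. (7)) and $\partial I_{out}/\partial \varphi$. All three can be derived in a similar fashion:

$$\frac{\partial I_{out}}{\partial r_i} = I_{in} * \frac{\partial}{\partial r_i}PSF(\psi, \mathbf{r}, \varphi), \qquad \frac{\partial I_{out}}{\partial \varphi} = I_{in} * \frac{\partial}{\partial \varphi}PSF(\psi, \mathbf{r}, \varphi). \tag{11}$$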

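Thus, we need to derive $\partial PSF/\partial r_i$ and $\partial PSF/\partial \varphi$. Since the two derivations are nearly identical, we start with $\partial PSF/\partial \varphi$ and describe the differences in the derivation of $\partial PSF/\partial r_i$ later. Using Eq. (3), we get

$$\frac{\partial}{\partial \varphi}PSF(\psi, \mathbf{r}, \varphi) = \frac{\partial}{\partial \varphi}\left[\mathcal{F}\{P\}\,\overline{\mathcal{F}\{P\}}\right] = \left[\frac{\partial}{\partial \varphi}\mathcal{F}\{P\}\right]\overline{\mathcal{F}\{P\}} + \mathcal{F}\{P\}\left[\frac{\partial}{\partial \varphi}\overline{\mathcal{F}\{P\}}\right], \tag{12}$$

where $P = P(\psi, \mathbf{r}, \varphi)$ and the overline denotes complex conjugation.

We may see that the main term in Eq. (12) is $\frac{\partial}{\partial \varphi}\left[\mathcal{F}\left\{P(\psi ,\mathbf{r},\varphi )\right\}\right]$ or its complex conjugate. Due to the linearity of the derivative and the Fourier transform, and since the dimensions being Fourier-transformed are orthogonal to the dimension being differentiated, we can change the order of operations and rewrite the term as $\mathcal{F}\left\{\frac{\partial}{\partial \varphi}P(\psi ,\mathbf{r},\varphi )\right\}$. Therefore, the last term remaining for calculating the PSF derivative is:

$$\frac{\partial}{\partial \varphi}P(\psi, \mathbf{r}, \varphi) = P(\rho, \theta)\exp\left\{j\psi\rho^2\right\}\frac{\partial}{\partial \varphi}CA(\mathbf{r}, \varphi) = \begin{cases} jP(\psi, \mathbf{r}, \varphi), & r_1 < \rho < r_2 \\ 0, & \text{otherwise.} \end{cases} \tag{13}$$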

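As in the derivation of $\partial PSF/\partial \varphi$, calculating $\partial PSF/\partial r_i$ requires $\frac{\partial}{\partial {r}_{i}}P(\psi ,\mathbf{r},\varphi )$. Similar to Eq. (13), we have

$$\frac{\partial}{\partial r_i}P(\psi, \mathbf{r}, \varphi) = P(\rho, \theta)\exp\left\{j\psi\rho^2\right\}\frac{\partial}{\partial r_i}CA(\mathbf{r}, \varphi). \tag{14}$$

Since $CA(\mathbf{r}, \varphi)$ is a step function of *ρ* at $r_i$, this derivative has to be approximated. A smooth function of *ρ* achieves good enough results for the phase step approximation.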

With the full forward and backward model, the phase coded aperture layer can be incorporated as a part of the CNN model, and the phase-mask parameters **r** and *ϕ* can be learned along with the network weights.
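The derivative with respect to *ϕ* can be checked numerically. The sketch below assumes $PSF = |\mathcal{F}\{P\}|^2$, so that the two terms of Eq. (12) combine into $2\,\mathrm{Re}\{\overline{\mathcal{F}\{P\}}\,\mathcal{F}\{\partial P/\partial \varphi\}\}$, with $\partial P/\partial \varphi = jP$ inside the ring and zero elsewhere. It compares this analytic gradient with a central finite difference. All function names, the grid and the parameter values are hypothetical, and the PSF is left unnormalized and uncentred because only the gradient is of interest here.

```python
import numpy as np

def pupil(psi, r1, r2, phi, n=64):
    """Coded, defocused pupil sampled on [-2, 2)^2 in normalized units, plus the ring indicator."""
    x = np.linspace(-2.0, 2.0, n, endpoint=False)
    xx, yy = np.meshgrid(x, x)
    rho = np.hypot(xx, yy)
    ring = (rho > r1) & (rho < r2)
    p = (rho <= 1.0) * np.where(ring, np.exp(1j * phi), 1.0) * np.exp(1j * psi * rho**2)
    return p, ring

def psf(p):
    """Unnormalized PSF: squared magnitude of the pupil's Fourier transform."""
    return np.abs(np.fft.fft2(p)) ** 2

def dpsf_dphi(psi, r1, r2, phi):
    """Analytic dPSF/dphi = 2 Re{ conj(F{P}) * F{dP/dphi} }."""
    p, ring = pupil(psi, r1, r2, phi)
    dp = 1j * p * ring                      # dP/dphi: j*P inside the ring, 0 elsewhere
    fp, fdp = np.fft.fft2(p), np.fft.fft2(dp)
    return 2.0 * np.real(np.conj(fp) * fdp)

# Compare with a central finite difference (hypothetical parameter values)
psi, r1, r2, phi, h = 3.0, 0.5, 0.8, np.pi / 2, 1e-5
analytic = dpsf_dphi(psi, r1, r2, phi)
numeric = (psf(pupil(psi, r1, r2, phi + h)[0]) - psf(pupil(psi, r1, r2, phi - h)[0])) / (2 * h)
rel_err = np.abs(analytic - numeric).max() / np.abs(analytic).max()
print(rel_err)
```

In a training framework this per-pixel gradient would then be convolved with the input image, as in the expression for $\partial I_{out}/\partial \varphi$, before being passed to the optimizer.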

## Funding

H2020 European Research Council (ERC) (757497).

## References and links

**1.** E. R. Dowski and W. T. Cathey, “Extended depth of field through wave-front coding,” Appl. Opt. **34**, 1859–1866 (1995).

**2.** O. Cossairt and S. Nayar, “Spectral focal sweep: Extended depth of field from chromatic aberrations,” in Proceedings of IEEE International Conference on Computational Photography (IEEE, 2010), pp. 1–8.

**3.** O. Cossairt, C. Zhou, and S. Nayar, “Diffusion coded photography for extended depth of field,” in Proceedings of ACM SIGGRAPH, (ACM, New York, NY, USA, 2010), pp. 31:1–31:10.

**4.** A. Levin, R. Fergus, F. Durand, and W. T. Freeman, “Image and depth from a conventional camera with a coded aperture,” in Proceedings of ACM SIGGRAPH, (ACM, New York, NY, USA, 2007), pp. 70:1–70:10.

**5.** F. Zhou, R. Ye, G. Li, H. Zhang, and D. Wang, “Optimized circularly symmetric phase mask to extend the depth of focus,” J. Opt. Soc. Am. A **26**, 1889–1895 (2009).

**6.** C. J. R. Sheppard, “Binary phase filters with a maximally-flat response,” Opt. Lett. **36**, 1386–1388 (2011).

**7.** C. J. Sheppard and S. Mehta, “Three-level filter for increased depth of focus and bessel beam generation,” Opt. Express **20**, 27212–27221 (2012).

**8.** C. Zhou, S. Lin, and S. K. Nayar, “Coded aperture pairs for depth from defocus and defocus deblurring,” Int. J. Comput. Vis. **93**, 53–72 (2011).

**9.** R. Raskar, A. Agrawal, and J. Tumblin, “Coded exposure photography: Motion deblurring using fluttered shutter,” in Proceedings of ACM SIGGRAPH, (ACM, New York, NY, USA, 2006), pp. 795–804.

**10.** G. Petschnigg, R. Szeliski, M. Agrawala, M. Cohen, H. Hoppe, and K. Toyama, “Digital photography with flash and no-flash image pairs,” in Proceedings of ACM SIGGRAPH, (ACM, New York, NY, USA, 2004), pp. 664–672.

**11.** H. Haim, S. Elmalem, R. Giryes, A. M. Bronstein, and E. Marom, “Depth estimation from a single image using deep learned phase coded mask,” IEEE Transactions on Comput. Imaging (to be published).

**12.** H. Haim, A. Bronstein, and E. Marom, “Computational multi-focus imaging combining sparse model with color dependent phase mask,” Opt. Express **23**, 24547–24556 (2015).

**13.** P. A. Shedligeri, S. Mohan, and K. Mitra, “Data driven coded aperture design for depth recovery,” in Proceedings of IEEE International Conference on Image Processing (IEEE, 2017), pp. 56–60.

**14.** R. Ng, M. Levoy, M. Brédif, G. Duval, M. Horowitz, and P. Hanrahan, “Light field photography with a hand-held plenoptic camera,” in “Computer Science Technical Report CSTR 2 (11),” (2005), pp. 1–11.

**15.** H. C. Burger, C. J. Schuler, and S. Harmeling, “Image denoising: Can plain neural networks compete with BM3D?” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (IEEE, 2012), pp. 2392–2399.

**16.** S. Lefkimmiatis, “Non-local color image denoising with convolutional neural networks,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (IEEE, 2017), pp. 5882–5891.

**17.** T. Remez, O. Litany, R. Giryes, and A. M. Bronstein, “Deep class-aware image denoising,” in Proceedings of IEEE International Conference on Image Processing (IEEE, 2017), pp. 138–142.

**18.** M. Gharbi, G. Chaurasia, S. Paris, and F. Durand, “Deep joint demosaicking and denoising,” ACM Trans. Graph. **35**, 191 (2016).

**19.** K. Zhang, W. Zuo, S. Gu, and L. Zhang, “Learning deep CNN denoiser prior for image restoration,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (IEEE, 2017), pp. 3929–3938.

**20.** C. Ledig, L. Theis, F. Huszar, J. Caballero, A. P. Aitken, A. Tejani, J. Totz, Z. Wang, and W. Shi, “Photo-realistic single image super-resolution using a generative adversarial network,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (IEEE, 2017), pp. 105–114.

**21.** N. K. Kalantari and R. Ramamoorthi, “Deep high dynamic range imaging of dynamic scenes,” ACM Trans. Graph. **36**, 144 (2017).

**22.** J. Ojeda-Castaneda and C. M. Gómez-Sarabia, “Tuning field depth at high resolution by pupil engineering,” Adv. Opt. Photon. **7**, 814–880 (2015).

**23.** E. E. García-Guerrero, E. R. Méndez, H. M. Escamilla, T. A. Leskova, and A. A. Maradudin, “Design and fabrication of random phase diffusers for extending the depth of focus,” Opt. Express **15**, 910–923 (2007).

**24.** F. Guichard, H. P. Nguyen, R. Tessières, M. Pyanet, I. Tarchouna, and F. Cao, “Extended depth-of-field using sharpness transport across color channels,” Proc. SPIE **7250**, 725012 (2009).

**25.** A. Chakrabarti, “Learning sensor multiplexing design through back-propagation,” in Proceedings of Advances in Neural Information Processing Systems 29, (Curran Associates, Inc., 2016), pp. 3081–3089.

**26.** H. G. Chen, S. Jayasuriya, J. Yang, J. Stephen, S. Sivaramakrishnan, A. Veeraraghavan, and A. C. Molnar, “ASP vision: Optically computing the first layer of convolutional neural networks using angle sensitive pixels,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (IEEE, 2016), pp. 903–912.

**27.** G. Satat, M. Tancik, O. Gupta, B. Heshmat, and R. Raskar, “Object classification through scattering media with deep learning on time resolved measurement,” Opt. Express **25**, 17466–17479 (2017).

**28.** M. Iliadis, L. Spinoulas, and A. K. Katsaggelos, “Deepbinarymask: Learning a binary mask for video compressive sensing,” CoRR **abs/1607.03343** (2016).

**29.** B. Milgrom, N. Konforti, M. A. Golub, and E. Marom, “Novel approach for extending the depth of field of barcode decoders by using rgb channels of information,” Opt. Express **18**, 17027–17039 (2010).

**30.** E. Ben-Eliezer, N. Konforti, B. Milgrom, and E. Marom, “An optimal binary amplitude-phase mask for hybrid imaging systems that exhibit high resolution and extended depth of field,” Opt. Express **16**, 20540–20561 (2008).

**31.** S. Ryu and C. Joo, “Design of binary phase filters for depth-of-focus extension via binarization of axisymmetric aberrations,” Opt. Express **25**, 30312–30326 (2017).

**32.** J. Goodman, *Introduction to Fourier Optics* (McGraw-Hill, 1996), 2nd ed.

**33.** D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning representations by back-propagating errors,” Nature **323**, 533–536 (1986).

**34.** M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, and A. Vedaldi, “Describing textures in the wild,” in Proceedings of IEEE Conf. on Computer Vision and Pattern Recognition (IEEE, 2014), pp. 3606–3613.

**35.** H. Zhao, O. Gallo, I. Frosio, and J. Kautz, “Loss functions for image restoration with neural networks,” IEEE Transactions on Comput. Imaging **3**, 47–57 (2017).

**36.** Y. Ma and V. N. Borovytsky, “Design of a 16.5 megapixel camera lens for a mobile phone,” OALib **2**, 1–9 (2015).

**37.** D. Krishnan, T. Tay, and R. Fergus, “Blind deconvolution using a normalized sparsity measure,” in Proceedings of IEEE Conf. on Computer Vision and Pattern Recognition (IEEE, 2011), pp. 233–240.

**38.** J. Mairal, F. Bach, J. Ponce, and G. Sapiro, “Online learning for matrix factorization and sparse coding,” J. Mach. Learn. Res. **11**, 19–60 (2010).

**39.** A. Vedaldi and K. Lenc, “MatConvNet – convolutional neural networks for MATLAB,” in Proceedings of the ACM Int. Conf. on Multimedia, (ACM, 2015), pp. 689–692.