
Parallel lensless compressive imaging via deep convolutional neural networks

Open Access

Abstract

We report a parallel lensless compressive imaging system that enjoys real-time reconstruction using deep convolutional neural networks. A prototype composed of a low-cost LCD, 16 photodiodes, and isolation chambers has been built. Each of the 16 channels captures a 16×16-pixel fraction of the scene, and all channels operate in parallel. An efficient inversion algorithm based on deep convolutional neural networks is developed to reconstruct the image. We have demonstrated encouraging results using only 2% measurements per sensor relative to the pixel count (i.e., 5 measurements for a 16×16-pixel block) for digit images and around 10% measurements per sensor for facial images.

© 2018 Optical Society of America under the terms of the OSA Open Access Publishing Agreement

1. Introduction

Inspired by compressive sensing (CS) [1, 2], diverse compressive cameras have been built, including compression in space [3, 4], depth [5–7], time [7–14], spectrum [15–18], polarization [19], and dynamic range [20]. The single-pixel camera [3], which pioneered compressive imaging in space, is an elegant architecture to prove the concept of CS. However, the hardware elements used in the single-pixel camera are more expensive than conventional CCD or CMOS cameras in the visible band, which limits the applications of this new imaging regime. The lensless compressive camera proposed in [4] enjoys low-cost, low-power properties and has demonstrated excellent results using advanced reconstruction algorithms [21]. Without a lens, the camera can be built with reduced size, weight, cost, and complexity. Furthermore, the same architecture can be used for imaging in the visible spectrum as well as in other bands such as infrared and millimeter waves. The architecture can also capture hyperspectral images [18] and polarized images [19] by integrating the related hardware.

1.1. Challenges in current lensless compressive cameras

Though it enjoys various advantages, e.g., directly implementing CS and avoiding the problems of lens-based cameras, the existing lensless compressive camera (Fig. 1(a)) suffers from a low capture rate. Specifically, the capture rate is around 50 Hz due to the limited refresh rate of the LCD and the integration time of the sensor [21]. For a 128 × 128 image, if we desire a high-resolution reconstruction, we need on the order of 10% measurements (relative to the pixel count), which requires about half a minute with the current aperture-switching technique. This is far from real time, as cameras are expected to deliver instant images and videos. In order to obtain an image in a shorter time, we have to sacrifice spatial resolution, i.e., provide a low-resolution image. An alternative solution is to increase the refresh rate of the aperture assembly (Fig. 1(a)). However, even if a higher refresh rate can be achieved with expensive hardware, the sensor still needs a relatively long time to integrate the light, or a more expensive sensor is required. Speed is thus a critical issue for practical applications of this lensless compressive imaging architecture. To mitigate this problem, in this paper we propose the block-wise lensless compressive camera (Fig. 1(b)). The blocks in our system operate in parallel, hence the term parallel lensless compressive imaging.


Fig. 1 Demonstration of the lensless camera: (a) single sensor lensless compressive camera [4]. A ray (black line) is starting from a point on the scene, passing through the point (x, y) on the aperture assembly, and ending at the sensor. (b) Proposed parallel (block-wise) lensless camera and its components (below). Four sensors are shown in this example. Each sensor will capture a fraction of the scene. These fractions can be overlapping. The image is reconstructed via first performing block-based inversion (reconstruction) and then stitching these blocks. Each part of the component can be built with off-the-shelf components.


Another common challenge of compressive cameras is the reconstruction time. Since iterative algorithms are usually employed, users need to wait minutes or even hours to see the reconstructed image. With speed in mind, in this paper we propose to use closed-form inversion algorithms to reconstruct the image. Thanks to the block-wise architecture of our parallel lensless camera, each block (fraction) of the scene can be recovered in parallel and results are available instantly. In addition to the Gaussian mixture model (GMM) closed-form inversion, and inspired by the recent success of deep learning in diverse image processing tasks [22, 23], we train a deep convolutional neural network (CNN) to recover each block of the scene using different numbers of measurements, leading to real-time, high-quality images. Both simulation and experimental results demonstrate the feasibility of our camera architecture and the performance of our real-time reconstruction algorithms.

1.2. Machine learning for computational imaging

There has been recent interest in using machine learning algorithms in computational imaging, with the most representative work in [24], where a deep CNN was trained to recover a phase object given a raw intensity image recorded some distance away. Research on neural networks dates back to the 1950s [25]. It has been advanced significantly by employing convolutions [22], nonlinearities [26], and efficient back-propagation [27]. These networks have been applied to various tasks, from playing complex games [28] to image processing. In addition, Horisaki et al. [29] used support vector machines to recover face images obscured by scattering media.

In the computational imaging applications considered in this work, neglecting noise, the capture process can be modeled as y = Ax, where x is the desired signal, A is the sensing matrix, and y is the measurement, i.e., the signal captured by the sensor. When A is a fat matrix, i.e., has more columns than rows, this is the compressive sensing problem [30]. Usually, this problem is solved by iterative optimization algorithms [2] to obtain the desired signal, i.e., solving the minimization problem $\hat{x} = \arg\min_x \|y - Ax\|_2^2 + \tau R(x)$, where R(x) is a regularizer, usually enforcing sparsity of the signal, and τ is a parameter balancing the two terms. As mentioned earlier, these optimization algorithms require many iterations to provide decent results and are thus time consuming. The inversion problem can be thought of as estimating the inverse operator of A, denoted by $A^{\rm inv}$. On the other hand, a CNN can be regarded as a generic function approximator: given a training set, a CNN attempts to learn a computational architecture that accurately maps all inputs in a test set (distinct from the training set) to their corresponding outputs [24]. Therefore, it is reasonable and feasible to use a CNN to learn $A^{\rm inv}$ in computational imaging, denoted by $A_{\rm CNN}^{\rm inv}$. More specifically, $A_{\rm CNN}^{\rm inv}$ is learned from training data, and during testing the desired signal is estimated via $\hat{x} = A_{\rm CNN}^{\rm inv}(y)$ given the measurement y.
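The following NumPy sketch (our own illustration, not the authors' code; all names and sizes are hypothetical) shows the forward model with a fat sensing matrix and the interface a learned inverse exposes, here stubbed with a pseudo-inverse.

```python
# Illustrative sketch of the compressive forward model y = A x and of the
# interface a learned inverse A_cnn_inv would expose. Not the paper's code.
import numpy as np

n = 16 * 16            # pixels per block
m = 26                 # measurements per block (CSr = 0.1)
rng = np.random.default_rng(0)

A = (rng.random((m, n)) < 0.5).astype(float)   # binary aperture patterns as rows
x = rng.random(n)                              # vectorized 16x16 image block
y = A @ x                                      # noiseless measurement, m << n

def A_cnn_inv(y):
    # placeholder for the trained CNN; here simply a least-squares pseudo-inverse
    return np.linalg.pinv(A) @ y

x_hat = A_cnn_inv(y)                           # estimated block, shape (256,)
```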

One common challenge in these learning-based systems is that sufficient training data are required to achieve high performance. In this paper, by contrast, we train the CNN at the patch level to mitigate this challenge. On one hand, patches are easy to obtain, since each image yields thousands of patches. On the other hand, the training is fast and matches our hardware, specifically the parallel block-wise structure. For example, training on CelebA [31] with 200,000 images took around 2 hours on an NVIDIA GeForce GTX TITAN X GPU with 12 GB memory in our face image application, whereas training a CNN on entire images usually takes several days [24]. Compared with the deep CNN developed in [32], various recent techniques, e.g., deconvolution [33], batch normalization [34], Leaky ReLU [35], and the all convolutional net [36], are used in our deep CNN.

1.3. Contributions

The contributions of this paper are twofold. (i) A parallel lensless camera is developed for fast capture, and (ii) a real-time inversion algorithm is proposed to reconstruct the image using deep learning techniques. The rest of this paper is organized as follows. Section 2 presents the theoretical background of the parallel lensless camera. Section 3 describes the hardware setup of our built camera. Section 4 introduces the reconstruction algorithms. Both simulation and experimental results are shown in Section 5 and Section 6 concludes the paper.

2. Theory

2.1. Image and pixel formulation

Consider the single-sensor lensless camera shown in Fig. 1(a); the analog scene I(x, y) can be defined on any plane between the scene and the sensor [4]. For convenience, we here define the image on the aperture assembly. Considering the optical path, there is a ray (the black line in Fig. 1(a)) starting from a point on the scene, passing through the point (x, y) on the aperture assembly, and ending at the sensor. Let r(x, y; t) denote the intensity of this ray at point (x, y) at time t. The image point I(x, y), defined by the intensity accumulated within the integration time Δt, can be formulated as

$$I(x, y) = \int_{0}^{\Delta t} r(x, y; t)\, dt. \tag{1}$$
Similarly, the continuous transmitting pattern on the aperture assembly is denoted by T(x, y).

A pixel of the lensless compressive camera is defined by the transmitting pattern on the aperture assembly. Assuming each pixel covers a region of size (Δx, Δy), the (i, j)th pixel of the discretized image (modulated by the pattern T(x, y)) can be described as

$$I(i, j) = \int_{(i-1)\Delta x}^{i\Delta x} \int_{(j-1)\Delta y}^{j\Delta y} I(x, y)\, T(x, y)\, dx\, dy. \tag{2}$$
Assuming there are $n_x \times n_y$ pixels in each block, the image fraction $\{I(i, j)\}_{i,j=1}^{n_x, n_y}$ is the desired image when T(x, y) = 1, which means light is transmitted through the pattern; conversely, T(x, y) = 0 signifies that the light is blocked. It is worth noting that Eq. (2) does not account for the cross-talk between sensors in the parallel lensless compressive camera, demonstrated in Fig. 3(a).
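As a purely numerical illustration of Eq. (2) (our own sketch; the scene function and sampling density are invented), the integral over each pixel region can be approximated by a Riemann sum, and one sensor reading is the sum of the pattern-modulated pixels.

```python
# Minimal numerical sketch of Eq. (2): each discretized pixel integrates the
# scene I(x, y) times the transmission T(x, y) over its (dx, dy) region.
import numpy as np

nx, ny = 16, 16                 # pixels per block
sub = 8                         # sub-samples per pixel for the integral
dx = dy = 0.657                 # coding-pixel pitch in mm (see Section 3)

def scene(x, y):                # hypothetical continuous scene I(x, y)
    return np.cos(0.3 * x) ** 2 + 0.5 * np.sin(0.2 * y) ** 2

rng = np.random.default_rng(1)
T = (rng.random((nx, ny)) < 0.5).astype(float)   # binary pattern, constant over each pixel

I = np.zeros((nx, ny))
for i in range(nx):
    for j in range(ny):
        xs = (i + (np.arange(sub) + 0.5) / sub) * dx    # sample points inside pixel (i, j)
        ys = (j + (np.arange(sub) + 0.5) / sub) * dy
        X, Y = np.meshgrid(xs, ys, indexing="ij")
        # Riemann-sum approximation of the double integral in Eq. (2)
        I[i, j] = np.sum(scene(X, Y) * T[i, j]) * (dx / sub) * (dy / sub)

measurement = I.sum()           # one sensor reading: total transmitted light for this pattern
```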


Fig. 2 Photo of our prototype. From left to right, (a) transparent LCD, (b) isolation chamber, and (c) sensor board.



Fig. 3 (a) Demonstration of the cross-talk issue introduced by the configuration in Fig. 1(b). Two adjacent sensors, {S1, S2}, are used, and two corresponding rays, {R1,1, R1,2} and {R2,1, R2,2}, are plotted for each sensor. When the sensor is not close to the aperture, there will be cross-talk between adjacent sensors (the red region). (b) Mitigating the cross-talk by putting the sensors together. Rays for adjacent sensors will not overlap, and thus the cross-talk issue can be mitigated. When the scene is infinitely far away, the sensing gap between adjacent sensors is negligible. (c–e) Cross-sectional view of different configurations of the “Concentration-Sensor Regime”, where the aperture assembly can be a plane (c–d) or a spherical surface (e). The sensors can be mounted on a plane (c) or a sphere (d–e). (f–g) Sensor layout of the “Concentration-Sensor Regime” in (e). Each sensor covers a hexagon-like area (f) and the sensor array forms a sphere (g); the isolation chamber in the “Concentration-Sensor Regime” becomes a “trumpet” shape (h).


2.2. Block-wise sensing and compressive imaging

Since we have multiple sensors capturing the scene independently and in parallel in Fig. 1(b), let $\{x_s\}_{s=1}^{S}$, with $x_s \in \mathbb{R}^{n_x n_y}$, denote the corresponding vectorized image fractions of the scene, composed of pixels defined in Eq. (2). Concatenating these S fractions of the image forms the matrix $X \in \mathbb{R}^{n_x n_y \times S}$. Though different patterns could be imposed on different fractions of the aperture assembly, we hereby consider the same pattern for every fraction (block), as this enjoys the following advantages: i) only a single set of patterns needs to be generated and stored, saving memory and bandwidth, and ii) the same pattern speeds up the reconstruction. This is important for the deep learning based inversion, as the CNN is trained for a specific sensing matrix, while different sensing matrices would require training different CNNs for inversion.

Let $A \in \mathbb{R}^{m \times n}$, where $n = n_x n_y$, denote the sensing matrix; each row is composed of a pattern defined by T(x, y) in Eq. (2). Different rows in A thus correspond to different patterns imposed on the aperture assembly. We consider m patterns and thus m measurements for each sensor. The measurement model for these S sensors can be jointly written as

$$Y = AX + \epsilon, \tag{3}$$
where $Y \in \mathbb{R}^{m \times S}$ contains the measurements of all S sensors, and ϵ denotes the measurement noise, which is generally modeled as zero-mean white Gaussian in conventional inversion algorithms. However, in our CNN based algorithm, ϵ is not limited to Gaussian. As mentioned earlier, each column of X corresponds to a fraction of the (vectorized) scene captured by one sensor.

Equation (3) is the forward model of our parallel lensless compressive imaging system, and we have built a prototype using a low-cost LCD, 16 sensors (a 4 × 4 array), and isolation chambers (described in Section 3). The remaining problem is, given A and the measurements Y, how to estimate X in real time, with high quality, using as few measurements as possible. This will be addressed in Section 4.
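A short sketch of Eq. (3) in NumPy (illustrative values only): because all sensors share the same sensing matrix, the measurements of every block are obtained with a single matrix product.

```python
# Hedged sketch of the joint block-wise model in Eq. (3): all S = 16 sensors
# share the same m x n sensing matrix A, so one product gives every block.
import numpy as np

n, m, S = 256, 26, 16                           # pixels/block, measurements/block, sensors
rng = np.random.default_rng(2)

A = (rng.random((m, n)) < 0.5).astype(float)    # shared binary patterns
X = rng.random((n, S))                          # columns: vectorized 16x16 blocks
noise = 0.01 * rng.standard_normal((m, S))      # not necessarily Gaussian in practice

Y = A @ X + noise                               # Eq. (3); Y has shape (m, S)
```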

3. Hardware setup

Figure 1(b) depicts the geometry of the proposed block-wise lensless compressive camera. It consists of three components, shown at the bottom of Fig. 1(b): a) the sensor board, which contains multiple sensors, each corresponding to one block; b) the isolation chamber, which blocks light from other blocks; and c) the aperture assembly, which can be the same as that used in the lensless compressive camera [4]. Note that the block sizes can differ, so that each block can reconstruct an image fraction at a different resolution, leading to a multi-scale compressive camera [37]. The pattern used for each block can also be different and can be adapted to the content of that image part, leading to adaptive compressive sensing [38]. However, in this case, since each block uses a different sensing matrix, a different inversion is required for each block, though they can still be performed in parallel. The disadvantage is that these different sensing matrices must all be stored. Based on this, in this work we let every block use the same pattern, as this not only saves the memory for storing the patterns but also enables fast reconstruction for each block.

We have built our prototype using a 3.5-inch programmable transmissive LCD and a 4 × 4 sensor board, shown in Fig. 2, where the isolation chamber and the camera box are 3D printed. The specific details are as follows. The 3.5-inch LCD measures 70.08 mm (width) × 52.56 mm (height) with 240 × 320 pixels (QVGA); each pixel is therefore 0.219 mm × 0.219 mm. In the results shown in Section 5, the reconstructed image is 64 × 64 pixels, where we merged 3 × 3 original LCD pixels; the coding pixel size on the LCD is therefore 0.657 mm × 0.657 mm. Since we perform compressive sensing in each block of 16 × 16 pixels, each sensor covers an area of 10.5 mm × 10.5 mm on the LCD. We placed the sensor board 50 mm (the depth of the camera box) away from the LCD. The walls of the isolation chamber are 0.5 mm thick. The sensor is a TSL25711 ambient light sensor with a 2 mm × 2 mm sensing area. We control the sensors via the I2C digital interface using an Arduino UNO Rev3 board and connect the control board to the computer via USB. The light collection efficiency of the lensless camera is proportional to the size of the sensor, and a smaller distance between the scene and the camera yields a higher light collection efficiency.
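The geometry above can be checked with a few lines of arithmetic (the numbers come from the text; the computation itself is ours).

```python
# Quick arithmetic check of the prototype geometry described above.
lcd_pixel = 0.219            # mm, native LCD pixel pitch
merge = 3                    # 3 x 3 LCD pixels merged into one coding pixel
block = 16                   # coding pixels per block side

coding_pixel = merge * lcd_pixel          # 0.657 mm
block_size = block * coding_pixel         # ~10.5 mm covered by each sensor
print(coding_pixel, block_size)           # 0.657  10.512
```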

As mentioned earlier, this new block-wise lensless compressive camera enjoys the following advantages compared to the existing lensless compressive camera. i) Since each block can be very small, e.g., 16 × 16 pixels, we only need a small number of measurements to achieve high-resolution reconstruction; therefore the capture time is short. ii) The coding patterns used in each block can be the same, so the sensing matrix is only of the block size. This reduces the memory requirement for the sensing matrix as well as speeds up the reconstruction. iii) Block-based image reconstruction is fast, and since real-time stitching algorithms exist, we can perform real-time reconstruction. iv) Any desired number of blocks can be integrated, leading to extra-high-resolution images [37] while retaining the fast capture rate and reconstruction. v) The sensor board can be very close to the aperture assembly, which leads to a small camera size; in particular, the thickness of the camera can be extremely small [39].

3.1. Overlapping regions and stitching

From the simple geometry shown in Fig. 1(b), we can see that if the scene is far from the sensor, there will be significant overlapping regions, which lead to the cross-talk issue between adjacent sensors demonstrated in Fig. 3(a). This will be addressed in the following subsection by proposing a new camera geometry. On the other hand, image stitching algorithms usually rely on features within the overlapping areas to perform registration. In addition, since the angular resolution is limited by the distance between the sensors and the aperture assembly, we can adapt this distance to different applications.

3.2. Concentration-sensor regime

In order to mitigate the problem mentioned in Section 3.1, i.e., the low angular resolution for far-field scenes (the cross-talk issue), we propose the Concentration-Sensor Regime, where the sensors are placed (closely) in a “cellular” layout. The aperture assembly can be a plane or a spherical surface, as shown in Fig. 3. Figure 3(b) depicts the idea of mitigating the cross-talk by putting the sensors together, where two adjacent sensors, {S1, S2}, are plotted, along with two corresponding rays, {R1,1, R1,2} and {R2,1, R2,2}, for each sensor. It can be seen that by placing the sensors in this concentrated manner, rays for adjacent sensors do not overlap, and thus the cross-talk issue can be mitigated. When the scene is far from the camera, the sensing-area gap between adjacent sensors can be ignored.

The planar aperture assembly in Figs. 3(c)–3(d) can be replaced by a spherical aperture, as shown in Fig. 3(e), since curved LCDs exist. The sensor layout of the spherical regime is detailed in Figs. 3(f)–3(g). In this configuration, the isolation chamber has a “trumpet” shape extending from the sensor to the aperture assembly, as demonstrated in Fig. 3(h). Note that the configuration in Fig. 3(g) is a wide-angle camera.

4. Reconstruction algorithms

Recalling the forward model in Eq. (3), we aim to estimate X given Y and A. In the CS scenario, m ≪ n, which leads to an ill-posed problem. Therefore, prior knowledge of the signal X plays a vital role in the reconstruction. Sparsity, a widely used prior, exploits the structure of the signal in a transform domain. In the following, we first review existing algorithms, including the closed-form inversion with Gaussian mixture models [40], and then develop a CNN to perform real-time inversion.

4.1. Existing algorithms

4.1.1. Dictionary learning based inversion

Introducing a basis or dictionary D for the blocks, we have

$$X = DS, \tag{4}$$
where we have assumed that D is shared across different blocks. $D \in \mathbb{R}^{n \times p}$ can be an orthonormal basis (n = p) or an over-complete dictionary [41] (n < p). This dictionary can be pre-learned for fast inversion or learned in situ [18, 20].

Eq. (3) can be reformulated as

$$Y = ADS + \epsilon, \tag{5}$$
where $S \in \mathbb{R}^{p \times S}$ is desired to be sparse, so that various $\ell_1$-based algorithms can be used to solve the following problem [42, 43]
$$\hat{S} = \arg\min_{S} \|Y - ADS\|_F^2 + \tau \|S\|_1, \tag{6}$$
or variants thereof [44], given A, Y, and D, where the $\ell_1$-norm is imposed on each column of S, $\|\cdot\|_F$ denotes the Frobenius norm, and τ is a parameter balancing the two terms in Eq. (6). After $\hat{S}$ is obtained, we recover X via $\hat{X} = D\hat{S}$. Diverse algorithms have been proposed to solve this problem; we will use the Gaussian mixture model (GMM) approach described below, since it requires no iterations, as an analytic solution exists [40].
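For concreteness, a minimal iterative-shrinkage (ISTA) sketch for Eq. (6) is given below; it is our own illustration of the kind of iterative $\ell_1$ solver the paper contrasts with, not the solver used in the paper, and A, D, Y are assumed given.

```python
# Minimal ISTA sketch for Eq. (6): min_S ||Y - A D S||_F^2 + tau ||S||_1.
import numpy as np

def soft_threshold(Z, t):
    """Element-wise soft thresholding, the proximal operator of t * ||.||_1."""
    return np.sign(Z) * np.maximum(np.abs(Z) - t, 0.0)

def ista(Y, A, D, tau=0.1, n_iter=200):
    B = A @ D                                  # effective operator in Eq. (5)
    L = 2.0 * np.linalg.norm(B, 2) ** 2        # Lipschitz constant of the gradient
    S = np.zeros((D.shape[1], Y.shape[1]))
    for _ in range(n_iter):
        grad = 2.0 * B.T @ (B @ S - Y)         # gradient of ||Y - B S||_F^2
        S = soft_threshold(S - grad / L, tau / L)
    return D @ S                               # X_hat = D S_hat
```

Such a solver needs hundreds of iterations per block, which is exactly the cost the closed-form GMM and CNN inversions below avoid.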

4.1.2. Closed-form inversion via Gaussian mixture models

The GMM has recently been re-recognized as an efficient dictionary learning approach [40, 45, 46]. Recall the image blocks (fractions corresponding to each sensor) $X \in \mathbb{R}^{p \times S}$ extracted from the image. The ith block $x_i$ (the ith column of X) is modeled by a GMM with K Gaussian components:

$$x_i \sim \sum_{k=1}^{K} \pi_k \mathcal{N}(\mu_k, \Sigma_k), \tag{7}$$
where $\{\mu_k, \Sigma_k\}_{k=1}^{K}$ are the means and covariance matrices of the K Gaussians, $\{\pi_k\}_{k=1}^{K}$ are the weights of these Gaussian components, and $\sum_k \pi_k = 1$.

Dropping the block index i, in the linear model $y = Ax + \epsilon$ with $\epsilon \sim \mathcal{N}(0, R)$, if $x \sim p(x)$ in Eq. (7), then $p(x|y)$ has the following analytical form

$$p(x|y) = \sum_{k=1}^{K} \tilde{\pi}_k \mathcal{N}(x | \tilde{\mu}_k, \tilde{\Sigma}_k), \tag{8}$$
where
$$\tilde{\pi}_k = \frac{\pi_k \mathcal{N}(y | A\mu_k, R^{-1} + A\Sigma_k A^T)}{\sum_{l=1}^{K} \pi_l \mathcal{N}(y | A\mu_l, R^{-1} + A\Sigma_l A^T)}, \tag{9}$$
$$\tilde{\Sigma}_k = \left(A^T R A + \Sigma_k^{-1}\right)^{-1}, \tag{10}$$
$$\tilde{\mu}_k = \tilde{\Sigma}_k \left(A^T R y + \Sigma_k^{-1} \mu_k\right). \tag{11}$$

While Eq. (8) provides a full posterior distribution for x, we obtain the point estimate $\hat{x}$ via the posterior mean

$$\hat{x} = \mathbb{E}[x|y] = \sum_{k=1}^{K} \tilde{\pi}_k \tilde{\mu}_k, \tag{12}$$
which is a closed-form solution.

Note that $\{\pi_k, \mu_k, \Sigma_k\}_{k=1}^{K}$ are pre-trained on other datasets, and given A, $\tilde{\Sigma}_k$ only needs to be computed once and saved; the same holds for the other A-dependent terms, such as $A\Sigma_k A^T$ in Eq. (9). The only computation left for each block is to calculate $\{\tilde{\mu}_k, \tilde{\pi}_k\}$, which can be obtained very efficiently. Most importantly, no iteration is required in the above GMM method, leading to efficient reconstruction for each block. Furthermore, each block can be reconstructed in parallel on a GPU. Since real-time stitching algorithms exist, after the blocks are recovered via the GMM, we can obtain the entire image instantly.
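The sketch below (our own, not the paper's implementation) turns Eqs. (8)–(12) into code for a single block. We treat R as the noise precision matrix, which matches the algebra in Eqs. (9)–(11); all variable names are ours.

```python
# Closed-form GMM inversion per Eqs. (8)-(12) for one measurement vector y.
import numpy as np
from scipy.stats import multivariate_normal

def gmm_invert(y, A, pis, mus, Sigmas, R):
    K = len(pis)
    # Block-independent quantities; in practice precomputed once per A (see text)
    Sig_t = [np.linalg.inv(A.T @ R @ A + np.linalg.inv(Sigmas[k])) for k in range(K)]
    log_w = np.empty(K)
    mu_t = []
    for k in range(K):
        cov_y = np.linalg.inv(R) + A @ Sigmas[k] @ A.T                      # Eq. (9) covariance
        log_w[k] = np.log(pis[k]) + multivariate_normal.logpdf(y, A @ mus[k], cov_y)
        mu_t.append(Sig_t[k] @ (A.T @ R @ y + np.linalg.inv(Sigmas[k]) @ mus[k]))  # Eq. (11)
    w = np.exp(log_w - log_w.max())
    w /= w.sum()                                                            # Eq. (9) weights
    return sum(w[k] * mu_t[k] for k in range(K))                            # Eq. (12): posterior mean
```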

Though it enjoys diverse advantages, the GMM approach still needs to compute matrix inversions and requires prior knowledge of the noise variance R in Eq. (9). Recent advances in CNNs provide an alternative route to fast inversion [32]. We have built a CNN to reconstruct the image blocks in our camera, and it leads to better performance than the GMM.

4.2. Parallel reconstruction via deep convolutional neural networks

CNN based algorithms have achieved state-of-the-art results in diverse image processing tasks, and several papers have tried to reconstruct images from compressed measurements via CNNs. A recent proposal trains a CNN whose input is image patches computed from the pseudo-inverse of the measurements and whose output is patches of the true image [47]. However, there is a mismatch, since the pseudo-inverse does not contain all the information in the measurements. Therefore, an end-to-end CNN whose input is the measurements and whose output is the true image is desired. Thanks to the block-wise lensless compressive camera proposed in this paper, we can train a CNN in this manner, as shown in Fig. 4. Different from the CNN used in [32], various recent techniques have been incorporated into our model, and we describe the architecture in full detail below.


Fig. 4 Architecture of the deep CNN used in our imaging system (a). The input is the measurement $y_i \in \mathbb{R}^{m}$. For example, if the block size is 16 × 16 and CSr = 0.1, then m = round(0.1 × 16 × 16) = 26. The “output” on the bottom of (a) lists the size of the data at each step. The final output on the right is the reconstructed image block of size 16 × 16. “Deconv.” denotes the deconvolutional (a.k.a. transposed convolutional) layers [33]. (b–d) The “Deconvolution + ReLU” units, where ×3 means the network in the dashed box is stacked three times. “BN” denotes batch normalization. A layer without “stride” uses stride 1.


In conventional image and computer vision applications, a CNN is used to extract multi-scale features from images, and a final fully-connected layer is used to classify the data. A CNN usually consists of a cascade of convolutional layers, nonlinear activations, and pooling, with a softmax classifier at the end. Generally, the dimension of the input data is larger than that of the output, e.g., labels. In our application, however, the input is the compressed data, an m-dimensional measurement, and we aim to recover image patches. In other words, we need to generate data of larger dimension than the input; therefore, a generative model is required. Specifically, we first use a fully-connected layer to convert the m-dimensional measurement to a 4096-dimensional vector and then reshape it to a 4 × 4 × 256 tensor. Following this, a series of deconvolutional (a.k.a. transposed convolutional) layers [33], along with batch normalization (BN) and leaky rectified activation (Leaky ReLU) [35] with different numbers of neurons, are used to generate the desired image patch of size 16 × 16 pixels, which fits our hardware implementation. All the deconvolutional layers use zero padding to make the dimensions of the input and output feature maps consistent.

As demonstrated in Fig. 4, we employ the all convolutional net [36] followed by one fully-connected layer. The all convolutional net [36] replaces deterministic pooling (e.g., max-pooling) with strided convolutions, which have been shown to be effective for training deep generative models [33, 48]. Batch normalization [34] is used to stabilize learning by normalizing the activations throughout the network and preventing relatively small parameter changes from being amplified into large but suboptimal activation changes in other layers. Residual connections [49] are incorporated to encourage gradient flow.

Taking the first “deconvolution + Leaky ReLU” unit in the dashed box in Fig. 4(b) as an example, the input is a 4 × 4 × 256 tensor, which is fed into a deconvolutional layer with 256 kernels, each of size 3 × 3. After batch normalization and the Leaky ReLU activation function, we get a 4 × 4 × 256 output tensor, which is fed into another deconvolutional layer with the same architecture, again yielding a 4 × 4 × 256 tensor. We then add this output to the input tensor to obtain the final output of the dashed box. This unit is repeated three times. After this, we apply the “3 × 3 deconv. 128 stride 2 BN Leaky ReLU” layer, and the output is now 8 × 8 × 128. Here “3 × 3” denotes the spatial size of the deconvolutional kernels, and “128” is the number of kernels, i.e., the number of channels of the output.
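A PyTorch re-sketch of this decoder is given below for readability (the paper's implementation is in Theano; the channel counts after the first upsampling stage and the final single-channel output layer are our assumptions).

```python
# Sketch of the Fig. 4 decoder: FC to 4x4x256, residual stride-1 deconvolution
# stacks, and stride-2 deconvolutions upsampling 4x4 -> 8x8 -> 16x16.
import torch
import torch.nn as nn

class ResDeconvBlock(nn.Module):
    """Two 3x3 stride-1 deconvolutions with BN + LeakyReLU and a skip connection."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.ConvTranspose2d(ch, ch, 3, stride=1, padding=1), nn.BatchNorm2d(ch), nn.LeakyReLU(0.2),
            nn.ConvTranspose2d(ch, ch, 3, stride=1, padding=1), nn.BatchNorm2d(ch), nn.LeakyReLU(0.2),
        )
    def forward(self, x):
        return x + self.body(x)              # residual connection

class BlockDecoder(nn.Module):
    def __init__(self, m):
        super().__init__()
        self.fc = nn.Linear(m, 4096)         # m measurements -> 4096, reshaped to 4x4x256
        self.net = nn.Sequential(
            *[ResDeconvBlock(256) for _ in range(3)],
            nn.ConvTranspose2d(256, 128, 3, stride=2, padding=1, output_padding=1),   # 4x4 -> 8x8
            nn.BatchNorm2d(128), nn.LeakyReLU(0.2),
            *[ResDeconvBlock(128) for _ in range(3)],
            nn.ConvTranspose2d(128, 1, 3, stride=2, padding=1, output_padding=1),     # 8x8 -> 16x16
        )
    def forward(self, y):
        t = self.fc(y).view(-1, 256, 4, 4)
        return self.net(t)                   # reconstructed 16x16 block

x_hat = BlockDecoder(m=26)(torch.randn(8, 26))   # output shape (8, 1, 16, 16)
```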

All parameters were initialized with Xavier [51]. Specifically,

$$w \sim \mathrm{Uniform}\left(-\sqrt{\tfrac{6}{n_{in} + n_{out}}},\ \sqrt{\tfrac{6}{n_{in} + n_{out}}}\right), \tag{13}$$
where w denotes the trainable parameters of each convolutional layer, and $n_{in}$ and $n_{out}$ are the numbers of input and output channels, respectively. The intuition is to keep the variance of the input and output signals of each layer the same. The testing process follows the graph in Fig. 4. In the training step, we synthesize pairs of measurements and ground-truth image patches. The parameters are learned via Adam [50], a first-order gradient-based optimization algorithm for stochastic objective functions with adaptive estimates of lower-order moments. It is designed for sparse gradients, on-line learning, and non-stationary settings. Another advantage of Adam is that the magnitudes of the parameter updates are invariant to rescaling of the gradient, allowing it to work well for large-scale deep neural networks. We summarize the Adam learning procedure in Algorithm 1.

Early stopping is employed based on the reconstruction loss on a validation set. We use mini-batches of size 64. Our models are implemented in Theano [52] and run on an NVIDIA GeForce GTX TITAN X GPU with 12 GB memory. As a reference, training on CelebA [31] with patches from 200,000 images took around 2 hours.
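To make the training recipe concrete, the following PyTorch sketch (our re-implementation of the procedure described above, not the Theano code; the `BlockDecoder` name refers to the sketch given earlier, and the MSE loss is our assumption) synthesizes (measurement, patch) pairs on the fly from ground-truth 16×16 patches using the known sensing matrix A, and trains with Adam on mini-batches of 64.

```python
# Hedged training-loop sketch: synthesize y = A x + noise from true patches x,
# train the decoder with Adam, batch size 64. `patches` and `A` are assumed given.
import torch
import torch.nn as nn

def train(decoder, patches, A, epochs=50, lr=1e-3, device="cpu"):
    # patches: (N, 256) float tensor of vectorized ground-truth 16x16 blocks
    # A:       (m, 256) float tensor of aperture patterns (the paper trained on GPU)
    decoder, A = decoder.to(device), A.to(device)
    opt = torch.optim.Adam(decoder.parameters(), lr=lr, betas=(0.9, 0.999))
    loader = torch.utils.data.DataLoader(patches, batch_size=64, shuffle=True)
    for epoch in range(epochs):
        for x in loader:
            x = x.to(device)
            y = x @ A.T + 0.01 * torch.randn(x.shape[0], A.shape[0], device=device)  # synthesized measurements
            x_hat = decoder(y).view(x.shape[0], -1)
            loss = nn.functional.mse_loss(x_hat, x)
            opt.zero_grad()
            loss.backward()
            opt.step()
        # early stopping on a held-out validation set would be checked here
```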


Algorithm 1. Train the CNN to learn $A_{\rm CNN}^{\rm inv}$ via Adam [50]. $A_{\rm CNN}^{\rm inv}$ is a function of the weights, i.e., $A_{\rm CNN}^{\rm inv} = f(\{w^{(i)}\}_{i=0}^{3})$, where $w^{(0)}$ denotes the weights of the fully-connected layer and $\{w^{(i)}\}_{i=1}^{3}$ denotes the weights of the ith deconvolutional unit in Fig. 4. $E = \|X_{\rm train} - A_{\rm CNN}^{\rm inv}(Y_{\rm train})\|$ is the loss and α is the learning rate. The hyperparameters are set to β1 = 0.9, β2 = 0.999, and η = 10−12. $\beta_1^t$ and $\beta_2^t$ denote the tth power of β1 and β2. ⊙ denotes the Hadamard (element-wise) product.
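For reference, a single Adam update as summarized in Algorithm 1 can be written in a few lines of NumPy (our sketch; the small constant in the denominator plays the role of η in the caption, everything else follows Kingma and Ba [50]).

```python
# One Adam parameter update per Algorithm 1 (illustrative only).
import numpy as np

def adam_step(w, grad, m, v, t, alpha=1e-3, beta1=0.9, beta2=0.999, eta=1e-12):
    m = beta1 * m + (1 - beta1) * grad            # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad * grad     # second-moment estimate (Hadamard product)
    m_hat = m / (1 - beta1 ** t)                  # bias correction with beta1^t
    v_hat = v / (1 - beta2 ** t)                  # bias correction with beta2^t
    w = w - alpha * m_hat / (np.sqrt(v_hat) + eta)
    return w, m, v
```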

5. Results

In this section, we first conduct simulations to verify the proposed imaging system and algorithms, and then capture real data using our prototype to demonstrate the performance of the system. The compressive sensing ratio (CSr) is defined as

$$\mathrm{CSr} = m / n, \tag{14}$$
where m is the number of compressed measurements for each block and n is the number of pixels in each block, set to n = 16 × 16 = 256 in the following experiments.
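Equation (14) directly gives the measurement counts quoted in the rest of the paper for a 16×16 block; the quick check below is ours.

```python
# Measurement counts implied by Eq. (14) for n = 256 pixels per block.
n = 16 * 16
for csr in (0.02, 0.05, 0.1, 0.3):
    print(csr, round(csr * n))    # 0.02 -> 5, 0.05 -> 13, 0.1 -> 26, 0.3 -> 77
```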

5.1. Simulation

We adopt the large-scale face dataset CelebA [31] to conduct our simulation. CelebA contains more than 200,000 celebrity images, each with 40 attribute annotations; the images cover large pose variations and background clutter. We use 10,000 images to train the GMM and the CNN and test on 1000 other images, which are resized to 64 × 64. We assume that an RGB sensor is used in this simulation, so compressive sensing is performed simultaneously in the three color channels. To be consistent with the hardware setting used for the real data, the block size is set to 16 × 16 pixels, and we reconstruct the images with different numbers of measurements. We assume there is no overlap among blocks and that each image fraction is ideally captured by its own sensor, i.e., ignoring the cross-talk issue demonstrated in Fig. 3(a).

The PSNR (peak signal-to-noise ratio) of the reconstructed image is used as the performance metric and is plotted in Fig. 5(d), along with 64 exemplar images reconstructed by the GMM and the CNN. We observe that decent results are already obtained at CSr = 0.05. The reconstruction time is less than 5 ms on an i7 CPU for the GMM and around 1 ms for the CNN. We also notice from the PSNR curves that with limited measurements (e.g., CSr < 0.1), the CNN leads to better results than the GMM; as the number of measurements increases, the GMM and the CNN perform similarly. From the reconstructed images in Fig. 5, we observe that the CNN provides more details in each block but has some boundary artifacts, while the GMM results are smoother. The boundary artifacts are generally caused by the convolutional operations used in the CNN.
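For completeness, the PSNR metric used throughout this section can be computed as below (our formulation; the assumption that images are scaled to [0, 1] is ours).

```python
# PSNR between a ground-truth image and its reconstruction, assuming [0, 1] scaling.
import numpy as np

def psnr(x_true, x_hat, peak=1.0):
    mse = np.mean((x_true - x_hat) ** 2)
    return 10 * np.log10(peak ** 2 / mse)
```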


Fig. 5 Simulation results of the proposed parallel lensless compressive imaging system using the “CelebA” face dataset, where 1000 face images (different from the training images) are resized to 64×64 and further divided into 16×16 blocks for parallel sampling. (a) Selected 64 exemplar truth images, (b) reconstructed images using GMM at CSr = 0.05, (c) reconstructed images using GMM at CSr = 0.1, (d) PSNR curves of the reconstructed images compared with the ground truth using GMM and CNN at various CSr, (e) reconstructed images using CNN at CSr = 0.05, (f) reconstructed images using CNN at CSr = 0.1.


5.2. Experimental results

We now show reconstruction results for data captured by the camera we built (Fig. 2). Different from the simulation, sixteen gray-scale sensors are used on the sensor board to capture more light and thus reduce the integration time. We further restricted the light to test the performance of the camera in a dark environment. We display the target images on an LCD monitor and then use the camera to capture the compressive measurements. The camera is placed close (about 1 cm) to the LCD monitor to avoid cross-talk between adjacent sensors. In this way, the object is at the same distance from the camera.

5.2.1. Digit data

We first test our camera on the MNIST digit dataset [22], which contains a training set of 60,000 examples and a test set of 10,000 examples. We train the CNN using the 60,000 training examples and use this trained CNN to reconstruct digits from the test set. Figure 6 plots 50 exemplar digit images reconstructed by the CNN using different numbers of measurements. It can be observed that even at CSr = 0.02 (since each block has 16 × 16 pixels, CSr = 0.02 means each sensor captured only 5 measurements), we still obtain good results in Fig. 6(c). That said, with the LCD refresh rate set to 50 Hz, we can capture a digit image in 100 ms, given the low-cost and simple architecture of our camera. When CSr = 0.05, the reconstruction results are almost perfect.


Fig. 6 Experimental results using CNN with measurements taken by the prototype with 16 sensors for the digit dataset. (a) Truth, (b) CSr = 0.01, (c) CSr = 0.02, and (d) CSr = 0.05.


5.2.2. Face data

Next, we test our camera on more complex images, namely the face images in the Caltech256 dataset [53], while still using the GMM and CNN trained on the CelebA dataset as in the simulation. This verifies the robustness of the learned CNN model. We reconstructed the 435 images in the face category of the dataset, and 64 selected images are shown in Fig. 7. To quantitatively evaluate the performance, despite some misalignment during capture, we compute the PSNR of the reconstructed images against the real images shown on the monitor; the curves are plotted in Fig. 7(d). It can be seen that the CNN always provides better results than the GMM, even in the high-CSr range. This differs from the simulation and indicates that the CNN based inversion is more robust to noise than the GMM. We can identify the faces at CSr = 0.1; at CSr = 0.3, the faces are clearer, with more details.


Fig. 7 Experimental results of our parallel lensless compressive imaging system by sampling the “face easy” category images in the “Caltech256” dataset, where 435 face images are used for testing. (a) Selected 64 exemplar truth images, (b) reconstructed images using GMM at CSr = 0.1, (c) reconstructed images using GMM at CSr = 0.3, (d) PSNR curves of reconstructed images compared with (misaligned) ground truth using GMM and CNN at various compressive sensing ratios, (e) reconstructed images using CNN at CSr = 0.1, (f) reconstructed images using CNN at CSr = 0.3.


5.2.3. Robustness of CNN to training data

It is well known that the training data affect the results of a CNN. We have shown that training the CNN on the CelebA dataset but testing on the Caltech256 dataset still gives decent results (Fig. 7), mainly because both datasets contain facial images. Different from conventional applications, which usually train the CNN on entire images, our CNN is trained on image patches. Since we performed the tasks on both digit and face data above, we would like to see how each performs when the other is used as the training data for the CNN.

In Fig. 8, we show the reconstruction results of digit data using the CNN trained on digits (a different set from the testing data) and on faces, respectively. It can be seen that the digits can also be reconstructed well by the CNN trained on face data, but the CNN trained on digit data provides better results. However, since the digit dataset is small and lacks details, the CNN trained on digits cannot reconstruct good faces, and those results are thus omitted here. Therefore, even though our CNN is patch based, a larger dataset with more details is preferred for training the CNN for general-purpose reconstruction, especially when the scene is unknown a priori.


Fig. 8 Experimental results of the digit dataset reconstructed by CNNs trained on different datasets. (a) PSNR of reconstructed digits compared with ground truth using the CNN trained on digit data (different from the testing data, red line) and on face data (blue line). (b) Reconstructed digits using the CNN trained on digit data at CSr = 0.09. (c) Reconstructed digits using the CNN trained on face data at CSr = 0.09.


6. Conclusions

We have proposed the parallel lensless compressive camera to mitigate the speed issue of the current lensless compressive camera. Real-time reconstruction employing deep learning techniques has been developed to provide instant images. Prototypes have been built to demonstrate the feasibility of the proposed imaging architecture, and multiple geometries of this block-wise parallel lensless compressive camera have been described. Encouraging results on real data verify the fast capture and real-time reconstruction of the proposed camera. Our imaging system can easily be modified to achieve a higher resolution by assigning more pixels to each block or by adding more blocks. Though the current prototype can only capture images at specific distances, we are working on a new camera with the concentration-sensor regime to overcome this challenge.

Acknowledgments

The authors would like to thank Robert Farah at Bell Labs for helping to build the sensor board, and Gang Huang, Paul Wilford, and Hong Jiang at Bell Labs for helpful discussions.

References and links

1. D. L. Donoho, “Compressed sensing,” IEEE Trans. Inf. Theory 52(4), 1289–1306 (2006). [CrossRef]  

2. E. J. Candès, J. Romberg, and T. Tao, “Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information,” IEEE Trans. Inf. Theory 52(2), 489–509 (2006). [CrossRef]  

3. M. F. Duarte, M. A. Davenport, D. Takhar, J. N. Laska, T. Sun, K. F. Kelly, and R. G. Baraniuk, “Single-pixel imaging via compressive sampling,” IEEE Sig. Process. Mag. 25(2), 83–91 (2008). [CrossRef]  

4. G. Huang, H. Jiang, K. Matthews, and P. Wilford, “Lensless imaging by compressive sensing,” in Proceedings of IEEE International Conference on Image Processing (IEEE, 2013), pp. 2101–2105.

5. P. Llull, X. Yuan, L. Carin, and D. Brady, “Image translation for single-shot focal tomography,” Optica 2(9), 822–825 (2015). [CrossRef]  

6. X. Yuan, X. Liao, P. Llull, D. Brady, and L. Carin, “Efficient patch-based approach for compressive depth imaging,” Appl. Opt. 55(27), 7556–7564 (2016). [CrossRef]   [PubMed]  

7. X. Yuan, P. Llull, X. Liao, J. Yang, G. Sapiro, D. J. Brady, and L. Carin, “Low-cost compressive sensing for color video and depth,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (IEEE, 2014), pp. 3318–3325.

8. P. Llull, X. Liao, X. Yuan, J. Yang, D. Kittle, L. Carin, G. Sapiro, and D. J. Brady, “Coded aperture compressive temporal imaging,” Opt. Express 21(9) 10526–10545 (2013). [CrossRef]   [PubMed]  

9. D. Reddy, A. Veeraraghavan, and R. Chellappa, “P2C2: Programmable pixel compressive camera for high speed imaging,” in Proceedings of IEEE Computer Vision and Pattern Recognition (IEEE, 2011), pp. 329–336.

10. Y. Hitomi, J. Gu, M. Gupta, T. Mitsunaga, and S. K. Nayar, “Video from a single coded exposure photograph using a learned over-complete dictionary,” in Proceedings of IEEE International Conference on Computer Vision (IEEE, 2011), pp. 287–294.

11. Y. Sun, X. Yuan, and S. Pang, “High-speed compressive range imaging based on active illumination,” Opt. Express 24(20), 22836–22846 (2016). [CrossRef]   [PubMed]  

12. X. Yuan and S. Pang, “Structured illumination temporal compressive microscopy,” Biomed. Opt. Express 7(3), 746–758 (2016). [CrossRef]   [PubMed]  

13. Y. Sun, X. Yuan, and S. Pang, “Compressive high-speed stereo imaging,” Opt. Express 25(15), 18182–18190 (2017). [CrossRef]   [PubMed]  

14. X. Yuan, Y. Sun, and S. Pang, “Compressive video sensing with side information,” Appl. Opt. 56(10), 2697–2704 (2017). [CrossRef]   [PubMed]  

15. A. Wagadarikar, R. John, R. Willett, and D. J. Brady, “Single disperser design for coded aperture snapshot spectral imaging,” Appl. Opt. 47(10), B44–B51 (2008). [CrossRef]   [PubMed]  

16. X. Cao, T. Yue, X. Lin, S. Lin, X. Yuan, Q. Dai, L. Carin, and D. J. Brady, “Computational snapshot multispectral cameras: Toward dynamic capture of the spectral world,” IEEE Sig. Process. Mag. 33(5), 95–108 (2016). [CrossRef]  

17. T.-H. Tsai, P. Llull, X. Yuan, D. J. Brady, and L. Carin, “Spectral-temporal compressive imaging,” Opt. Lett. 40(17), 4054–4057 (2015). [CrossRef]   [PubMed]  

18. X. Yuan, T.-H. Tsai, R. Zhu, P. Llull, D. J. Brady, and L. Carin, “Compressive hyperspectral imaging with side information,” IEEE J. Sel. Top. Sig. Process. 9(6), 964–976 (2015). [CrossRef]  

19. T.-H. Tsai, X. Yuan, and D. J. Brady, “Spatial light modulator based color polarization imaging,” Opt. Express 23(9), 11912–11926 (2015). [CrossRef]   [PubMed]  

20. X. Yuan, “Compressive dynamic range imaging via Bayesian shrinkage dictionary learning,” Opt. Eng. 55, 123110 (2016). [CrossRef]  

21. X. Yuan, H. Jiang, G. Huang, and P. Wilford, “SLOPE: Shrinkage of local overlapping patches estimator for lensless compressive imaging,” IEEE Sens. J. 16(22), 8091–8102 (2016). [CrossRef]  

22. Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proc. IEEE 86(11), 2278–2324 (1998). [CrossRef]  

23. Y. Pu, Z. Gan, R. Henao, X. Yuan, C. Li, A. Stevens, and L. Carin, “Variational autoencoder for deep learning of images, labels and captions,” in Proceedings of Advances in Neural Information Processing Systems (NIPS2016), pp. 2352–2360.

24. A. Sinha, J. Lee, S. Li, and G. Barbastathis, “Lensless computational imaging through deep learning,” Optica 4(9), 1117–1125 (2017). [CrossRef]  

25. M. Minsky and S. Papert, Perceptrons: An Introduction to Computational Geometry (MIT Press, 1969).

26. I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning (MIT Press, 2016).

27. D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning representations by back-propagating errors,” Nature 323, 533–536 (1986). [CrossRef]  

28. D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis, “Mastering the game of Go with deep neural networks and tree search,” Nature 529, 484–489 (2016). [CrossRef]   [PubMed]  

29. R. Horisaki, R. Takagi, and J. Tanida, “Learning-based imaging through scattering media,” Opt. Express 24(13), 13738–13743 (2016). [CrossRef]   [PubMed]  

30. D. L. Donoho, “For most large underdetermined systems of linear equations the minimal ℓ1-norm solution is also the sparsest solution,” Commun. Pure Appl. Math. 59(7), 907–934 (2006). [CrossRef]  

31. Z. Liu, P. Luo, X. Wang, and X. Tang, “Deep learning face attributes in the wild,” in Proceedings of IEEE International Conference on Computer Vision (IEEE, 2015), pp. 3730–3738.

32. K. Kulkarni, S. Lohit, P. Turaga, R. Kerviche, and A. Ashok, “Reconnet: Non-iterative reconstruction of images from compressively sensed random measurements,” in Proceedings of IEEE Computer Vision and Pattern Recognition (IEEE, 2016), pp. 449–458.

33. A. Radford, L. Metz, and S. Chintala, “Unsupervised representation learning with deep convolutional generative adversarial networks,” in Proceedings of International Conference on Learning Representations (ICLR, 2016), pp. 1–16.

34. S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in Proceedings of the International Conference on Machine Learning (ICML, 2015), pp. 448–456.

35. A. L. Maas, A. Y. Hannun, and A. Y. Ng, “Rectifier nonlinearities improve neural network acoustic models,” in Proceedings of the International Conference on Machine Learning (ICML, 2013), pp. 1–6.

36. J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller, “Striving for simplicity: The all convolutional net,” in Proceedings of International Conference on Learning Representations Workshop (ICLR, 2015), pp. 1–15.

37. D. J. Brady, M. E. Gehm, R. A. Stack, D. L. Marks, D. S. Kittle, D. R. Golish, E. M. Vera, and S. D. Feller, “Multiscale gigapixel photography,” Nature 486, 386–389 (2012). [CrossRef]   [PubMed]  

38. X. Yuan, J. Yang, P. Llull, X. Liao, G. Sapiro, D. J. Brady, and L. Carin, “Adaptive temporal compressive sensing for video,” in Proceedings of IEEE International Conference on Image Processing (IEEE, 2013), pp. 14–18.

39. M. S. Asif, A. Ayremlou, A. Sankaranarayanan, A. Veeraraghavan, and R. Baraniuk, “Flatcam: Thin, bare-sensor cameras using coded aperture and computation,” IEEE Trans. Comput. Imag. 3(3), 384–397 (2017). [CrossRef]  

40. M. Chen, J. Silva, J. Paisley, C. Wang, D. Dunson, and L. Carin, “Compressive sensing on manifolds using a nonparametric mixture of factor analyzers: Algorithm and performance bounds,” IEEE Trans. Sig. Process. 58(12), 6140–6155 (2010). [CrossRef]  

41. M. Aharon, M. Elad, and A. Bruckstein, “K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation,” IEEE Trans. Sig. Process. 54(11), 4311–4322 (2006). [CrossRef]  

42. X. Liao, H. Li, and L. Carin, “Generalized alternating projection for weighted-ℓ2,1 minimization with applications to model-based compressive sensing,” SIAM J. Imag. Sci. 7, 797–823 (2014). [CrossRef]  

43. X. Yuan, “Generalized alternating projection based total variation minimization for compressive sensing,” in Proceedings of IEEE International Conference on Image Processing (IEEE, 2016), pp. 2539–2543.

44. A. Beck and M. Teboulle, “A fast iterative shrinkage-thresholding algorithm for linear inverse problems,” SIAM Journal on Imaging Sciences 2, 183–202 (2009). [CrossRef]  

45. G. Yu, G. Sapiro, and S. Mallat, “Solving inverse problems with piecewise linear estimators: From Gaussian mixture models to structured sparsity,” IEEE Trans. Image Process. 21(5), 2481–2499 (2012). [CrossRef]  

46. J. Yang, X. Yuan, X. Liao, P. Llull, G. Sapiro, D. J. Brady, and L. Carin, “Video compressive sensing using Gaussian mixture models,” IEEE Trans. Image Process. 23(11), 4863–4878 (2014). [CrossRef]   [PubMed]  

47. A. Mousavi and R. G. Baraniuk, “Learning to invert: Signal recovery via deep convolutional networks,” in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (IEEE, 2017), pp. 2272–2276.

48. Y. Pu, Z. Gan, R. Henao, C. Li, S. Han, and L. Carin, “VAE Learning via Stein Variational Gradient Descent,” in Proceedings of Advances in Neural Information Processing Systems (NIPS, 2017), pp. 1–10.

49. K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of IEEE Computer Vision and Pattern Recognition (IEEE, 2016), pp. 770–778.

50. D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in Proceedings of International Conference on Learning Representations Workshop (ICLR, 2015), pp. 1–15.

51. X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” in Proceedings of International Conference on Artificial Intelligence and Statistics (AISTATS, 2010), pp. 249–256.

52. F. Bastien, P. Lamblin, R. Pascanu, J. Bergstra, I. Goodfellow, A. Bergeron, N. Bouchard, D. Warde-Farley, and Y. Bengio, “Theano: new features and speed improvements,” in Proceedings of Advances in Neural Information Processing Systems Workshop (NIPS, 2012), pp. 1–10.

53. G. Griffin, A. D. Holub, and P. Perona, “The Caltech 256,” in Caltech Technical Report (2006).
