Machine-learning enables image reconstruction and classification in a “see-through” camera

Abstract

We demonstrate that image reconstruction can be achieved via a convolutional neural network for a “see-through” computational camera composed of a transparent window and a CMOS image sensor. Furthermore, we compared classification results obtained by applying a classifier network to the raw sensor data against those obtained from the reconstructed images. The results suggest that similar classification accuracy is likely possible in both cases with appropriate network optimizations. All networks were trained and tested on the MNIST (6 classes), EMNIST, and Kanji49 datasets.

© 2020 Optical Society of America under the terms of the OSA Open Access Publishing Agreement

1. Introduction

Imaging is a form of information transfer from the object plane to the image plane. The traditional camera, comprising lenses and an image sensor, enables an approximately one-to-one mapping between these planes. This approach is widely successful primarily because of the high signal-to-noise ratio (SNR) that may be achieved at each image pixel. However, there are alternative one-to-many mappings that can achieve information transfer, albeit with constraints. A fully transparent mask placed in close proximity to an image sensor was used to perform this one-to-many mapping in the spectral domain in 2014 [1,2]. Following this work, mask-based spectral imaging was demonstrated in 2015 [3]. Such approaches were extended to angular-spectral imaging in 2017 [4], and to full field-of-view and video imaging in 2018 [5]. One-to-many mapping occurs naturally in the case of scattering media, and undoing this scattering is an important problem, one which has been addressed, for example, with coherent light [6], with machine learning [7], and with ultra-fast cameras [8]. More recently, for incoherent imaging, improved model-based optimization [9] and neural networks [10,11] have been applied to increase the computational speed of such mask-based imaging. Improved algorithms were also applied to mitigate the resolution trade-offs involved in these mask-based approaches [12,13]. Machine learning has been applied to mask-based sensors for 3D imaging [14], hyperspectral imaging [15], lightfield imaging [16], wavefront sensing [17], and high-resolution microscopy [18,19].

The one-to-many mapping can also be achieved with lightpipes, which enable imaging in restricted environments, such as microscopy within the brain of a mouse, first demonstrated ex-vivo in 2014 [20] and in-vivo (at depths of almost 2 mm inside the mouse brain) in 2017 [21–23]. Extensions to 3D imaging and to freely moving animals, albeit at much shallower depths, have also been demonstrated [24].

We emphasize that all the previous work used an element (mask, diffuser, lightpipe, etc.) in front of the image sensor. We showed that such an element is not necessary and a truly optics-less camera is feasible with only the bare image sensor [25]. We also showed that machine learning may be applied to this “optics-less” camera for image classification [26]. In order to prevent the sensor from blocking the field of view from an observer, we described a “transparent” camera that used a transparent window placed perpendicular to an image sensor [27]. In this paper, we explore two new aspects of this “see-through” or transparent camera: first, we show that a trained neural network is able to perform image reconstruction from such a camera; second, we explore the difference between image reconstruction and image classification in such a camera, a problem that was deemed to be most interesting and least studied in a recent review on machine-learning-based imaging [28].

As described before, our “see-through” camera is comprised of an image sensor placed at the edge of a transparent window. A schematic and photograph of our experimental setup are shown in Fig. 1. The object was a conventional LCD display, the window was made of transparent plexiglass, and the sensor was a color CMOS image sensor (MU300 from AmScope). As before, the plexiglass had reflective tape around its edges except the edge facing the sensor, which was roughened for efficient light extraction [27]. The distance between the window and the LCD was approximately 250 mm. The test images were displayed on the LCD and the corresponding sensor data were captured and stored. Ten frames were averaged for each stored data frame to reduce noise. A black box was used to cover the setup to minimize any ambient stray light.
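For clarity, the following is a minimal sketch of the frame-averaging step, assuming the captured frames are already available as NumPy arrays; the camera-capture API itself is hardware-specific and not shown, and the function name is illustrative.

```python
import numpy as np

def average_frames(frames):
    """Average identically shaped sensor frames to reduce noise.

    `frames` is assumed to be a list of H x W (or H x W x 3) uint8 arrays
    captured while the same test image is displayed on the LCD.
    """
    stack = np.stack([f.astype(np.float64) for f in frames], axis=0)
    return stack.mean(axis=0)

# Example with 10 simulated noisy frames of a 500 x 680 sensor.
rng = np.random.default_rng(0)
frames = [rng.integers(0, 256, size=(500, 680), dtype=np.uint8) for _ in range(10)]
averaged = average_frames(frames)   # float64 array, shape (500, 680)
```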

Fig. 1. (a) Schematic of our experimental setup. The object is an LCD display placed about 250 mm away from a transparent plexiglass window, to the edge of which is placed a color CMOS image sensor (with no optics). (b) Photograph of our experimental setup. The letter “o” from the MNIST dataset is displayed on the LCD.

In our previous work [27], we performed a calibration step to explicitly measure the space-variant point-spread functions of this imaging system, and then utilized a regularized singular-value decomposition (SVD) to invert the transfer function of the system and thereby compute the image for human consumption from the raw sensor data. Neural networks have been shown to be highly effective at solving such inverse problems and have the advantage of learning features from datasets, which can then be enhanced (via transfer learning, for example) with new data. The SVD-based approach also suffers from its dependence on the regularization parameter(s) used, which might need to be tuned for each image to obtain optimal results. As a result, the neural-network approach may be expected to provide a more general solution to this inverse problem.
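As context for this comparison, the snippet below sketches a Tikhonov-regularized SVD pseudo-inverse of a measured transfer matrix; the variables A, y, and lam are illustrative stand-ins, and this is not the exact calibration or regularization procedure of Ref. [27].

```python
import numpy as np

def regularized_svd_inverse(A, y, lam):
    """Recover an object vector x from sensor data y = A x + noise using a
    Tikhonov-regularized SVD pseudo-inverse:
        x = V diag(s / (s**2 + lam)) U^T y.
    The regularization weight `lam` typically needs per-image tuning, which
    is the drawback of the SVD approach noted in the text."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    filt = s / (s**2 + lam)                # damped inverse singular values
    return Vt.T @ (filt * (U.T @ y))

# Toy example: a random 200 x 100 transfer matrix and a noisy measurement.
rng = np.random.default_rng(1)
A = rng.standard_normal((200, 100))
x_true = rng.random(100)
y = A @ x_true + 0.01 * rng.standard_normal(200)
x_hat = regularized_svd_inverse(A, y, lam=1e-2)
```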

The time required for calibration in the prior work scales linearly with the number of independent pixels in the object, and the pre-processing for the singular-value decomposition scales as the square of the number of pixels. This computational cost is approximately the same as that for acquiring the dataset and training the neural network (NN), which scales with the number of independent pixels in the object. In the current paper, we captured ∼100,000 images at 128 × 128 pixel resolution. The total time to capture the data and train the NN was ∼2 days, which is similar to the time required for careful calibration (including adjusting for exposure times) in the prior work.

2. Network architecture and training methodologies

Network for Image Reconstruction: We built a convolutional neural network (CNN) to learn the inverse function that reconstructs images from their corresponding sensor images. The overall network structure follows the classic “U-net” architecture [29], modified with additional dense blocks inspired by “ResNet” [30]. U-net is an encoder-decoder architecture in which the input images first pass through a series of stages of convolutional and pooling layers. Each stage halves the spatial dimensions (height × width) of the images but doubles the number of channels. This phase is referred to as the encoding phase; after it, the input images are encoded into a lower-dimensional representation space. The encoded representations then pass through a decoder phase, which mirrors the encoder, except that each decoder stage doubles the spatial dimensions and halves the number of channels. The characteristic feature of U-net is the skip connection, which concatenates the corresponding encoder-stage outputs onto the decoder-stage inputs. In this way, the network can use as much of the original information as possible to reconstruct the images.
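The following PyTorch sketch illustrates the encoder-decoder-with-skip-connection idea described above with a deliberately tiny network; the layer counts, channel widths, and class name are illustrative and do not reproduce the exact architecture of Fig. 2.

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Illustrative two-stage U-net: the encoder stage halves the spatial
    dimensions and doubles the channels; the decoder reverses this, and the
    skip connection concatenates the encoder output onto the decoder input."""

    def __init__(self, in_ch=1, base=16):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(in_ch, base, 3, padding=1), nn.ReLU())
        self.enc2 = nn.Sequential(nn.Conv2d(base, 2 * base, 3, padding=1), nn.ReLU())
        self.pool = nn.MaxPool2d(2)
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        self.dec1 = nn.Sequential(nn.Conv2d(3 * base, base, 3, padding=1), nn.ReLU())
        self.out = nn.Conv2d(base, 1, 1)

    def forward(self, x):
        e1 = self.enc1(x)                       # full resolution, `base` channels
        e2 = self.enc2(self.pool(e1))           # half resolution, 2*base channels
        d1 = self.up(e2)                        # back to full resolution
        d1 = self.dec1(torch.cat([d1, e1], 1))  # skip connection from the encoder
        return torch.sigmoid(self.out(d1))      # output intensities in (0, 1)

x = torch.rand(1, 1, 128, 128)
print(TinyUNet()(x).shape)                      # torch.Size([1, 1, 128, 128])
```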

Our dense block consists of 3 individual layers: 2 convolutional layers with ReLU activation functions, followed by a batch-normalization layer. The advantage of the dense block is that it helps prevent vanishing gradients, so that very deep networks can be trained efficiently. Figure 2 shows the detailed architecture of the image-reconstruction CNN. Given this structure, the activation function of the last layer and the loss function are worth considering carefully. It is well known that the commonly used mean-square error (MSE) loss function does not work well in sparse-image reconstruction, as it tends to produce blurred images [31]. Instead, we use the pixel-wise cross-entropy as the loss function (L), which can impose sparsity [32],

$$L = \frac{1}{N}\sum\nolimits_i \left[ -g_i \log (p_i) - (1 - g_i)\log (1 - p_i) \right],$$

Fig. 2. CNN architecture for image reconstruction.

where the summation is over every pixel $i$, $N$ is the total number of pixels, and $g_i$ and $p_i$ represent the ground-truth and predicted pixel intensities, respectively. To keep the loss function well defined, the output of the last layer must lie within [0,1]; we therefore choose the sigmoid as the activation function of the output layer. Each dataset is split into training and testing sets in a 9:1 ratio. We use the Adam optimizer [33] with an initial learning rate of 0.001, and train for up to 50 epochs.
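A minimal sketch of this loss and training configuration, assuming PyTorch, is shown below; the data loader, batching, and device handling are placeholders rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

# Pixel-wise binary cross-entropy between the sigmoid output p and the
# ground-truth intensities g; nn.BCELoss averages
# -[g*log(p) + (1-g)*log(1-p)] over all pixels, matching the loss L above.
loss_fn = nn.BCELoss()

def train_reconstruction(model, loader, epochs=50, lr=1e-3, device="cpu"):
    """Training loop sketch: Adam with an initial learning rate of 0.001 and
    up to 50 epochs. `loader` is assumed to yield (sensor, ground_truth)
    tensor pairs with values scaled to [0, 1]."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    model.to(device).train()
    for _ in range(epochs):
        for sensor, target in loader:
            pred = model(sensor.to(device))          # sigmoid output in (0, 1)
            loss = loss_fn(pred, target.to(device))
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model

# Toy usage with random tensors standing in for (sensor, ground-truth) pairs:
# toy_loader = [(torch.rand(4, 1, 128, 128), torch.rand(4, 1, 128, 128))]
# train_reconstruction(my_reconstruction_model, toy_loader, epochs=1)
```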

Network for Image Classification: Reconstruction results can be evaluated with metrics such as mean-square error (MSE) or mean absolute error (MAE), or by visual comparison. We provide these metrics in the next section. In addition, we assess the quality of the reconstructed images by performing classification on them, and compare the results to classification with the original images and with the raw sensor images (without reconstruction). Since the main goal of the classification is to test the reconstruction ability of the network in Fig. 2, rather than to devise a state-of-the-art classification network, we used an off-the-shelf classifier network, SimpleNet [34]. Though simple, with fewer parameters than comparable architectures, SimpleNet has been shown to provide very competitive, and sometimes better, performance on classification tasks.

Table 1 summarizes the image sizes used in each experiment described in this paper. The rationale for these choices is described briefly below.

Table 1. Image size (in pixels) of images used in each case.

Experiment: Each image from the dataset was scaled to be slightly smaller than the physical size of the transparent window (200 mm × 225 mm). The resolution of the sensor is 500 × 680 pixels.

Reconstruction: For image reconstruction, we cropped the image such that its aspect ratio was 1:1 and then rescaled it to 128 × 128 pixels. This was done to keep training and computation times reasonable. The 1:1 aspect ratio was dictated by the symmetric architecture of the U-net, in which the input and output images must have the same aspect ratio, and by the fact that the reference dataset images (to which the outputs are compared) have a 1:1 aspect ratio. Gaussian white noise with mean = 0 and variance = 0.001 was added to the input images to improve the robustness of the network.

Classification: Since the dataset image size is 32 × 32 (or 28 × 28) pixels, and to keep the computation time small, we rescaled the output of the reconstruction NN from 128 × 128 to 32 × 32 (or 28 × 28) pixels, and used these images for training and testing. We checked that the impact of such down-sampling on classification accuracy was negligible.

In contrast, when using the raw sensor images for classification, it is important to keep as much information as possible, while still maintaining a reasonable computation time. Therefore, we decided to down-sample the sensor image by a factor of 4, from 500 × 680 to 125 × 170. This also has the advantage that its size is close to that used in the reconstruction NN (128 × 128).
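The sensor-side sizing steps above can be summarized in the following sketch, assuming PyTorch tensors scaled to [0, 1]; the centered crop location, interpolation mode, and clamping of the noisy images are assumptions not specified in the text.

```python
import torch
import torch.nn.functional as F

def preprocess_for_reconstruction(sensor):
    """Crop the 500 x 680 sensor frame to a 1:1 aspect ratio (here a centered
    crop), rescale to 128 x 128, and add Gaussian noise with mean 0 and
    variance 0.001. `sensor` is a float tensor in [0, 1] of shape (H, W)."""
    h, w = sensor.shape
    side = min(h, w)
    top, left = (h - side) // 2, (w - side) // 2
    square = sensor[top:top + side, left:left + side]
    resized = F.interpolate(square[None, None], size=(128, 128),
                            mode="bilinear", align_corners=False)
    noisy = resized + torch.randn_like(resized) * (0.001 ** 0.5)
    return noisy[0, 0].clamp(0.0, 1.0)

def downsample_for_classification(sensor):
    """Down-sample the raw 500 x 680 sensor frame by a factor of 4
    (to 125 x 170) for classification directly on the sensor data."""
    return F.interpolate(sensor[None, None], size=(125, 170),
                         mode="bilinear", align_corners=False)[0, 0]

raw = torch.rand(500, 680)
print(preprocess_for_reconstruction(raw).shape)    # torch.Size([128, 128])
print(downsample_for_classification(raw).shape)    # torch.Size([125, 170])
```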

3. Results and discussion

We trained and tested the network from Fig. 2 on 3 datasets: MNIST (6 classes) [35], EMNIST [36], and Kanji49 [37]. MNIST is the most widely used dataset for visual tasks, containing grayscale images of the 10 digits in various handwritten forms. As a proof of concept, we used only the first 6 classes (0 to 5) and randomly sub-sampled 10% of the images in each class. EMNIST is an extension of MNIST that additionally contains handwritten images of the 26 English alphabet characters in both upper and lower cases. Note that some characters (for example, x, y, z) have very similar forms in both cases and are thus merged into one class (see [36] for details); therefore, the total number of classes in EMNIST is 47 instead of 62. Kanji49 is similar to MNIST but instead contains 49 Japanese Hiragana characters. It has essentially the same number of classes as EMNIST, but the Hiragana shapes are more complicated than those of the English characters. We include this dataset to further verify the reconstruction efficacy of our network. Table 2 summarizes the datasets we used. We emphasize that all images are grayscale: even though the output of the image sensor has 3 color channels, we convert all images to grayscale first. Therefore, no color information is used in this paper.
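A minimal sketch of the MNIST sub-sampling described above (first 6 classes, 10% of images per class) is given below, assuming the images and labels are available as NumPy arrays; the grayscale conversion of the three-channel sensor output is not shown, and the function name is illustrative.

```python
import numpy as np

def subsample_first_classes(images, labels, num_classes=6, fraction=0.10, seed=0):
    """Keep only labels 0..num_classes-1 and randomly retain `fraction` of the
    images in each kept class, mirroring the MNIST (6 classes) subset used in
    the paper. `images` and `labels` are assumed to be NumPy arrays."""
    rng = np.random.default_rng(seed)
    kept = []
    for c in range(num_classes):
        idx = np.flatnonzero(labels == c)
        kept.append(rng.choice(idx, size=max(1, int(fraction * len(idx))), replace=False))
    kept = np.concatenate(kept)
    return images[kept], labels[kept]

# Toy example with random stand-ins for 28 x 28 MNIST images and their labels.
imgs = np.random.rand(1000, 28, 28)
labs = np.random.randint(0, 10, size=1000)
sub_imgs, sub_labs = subsample_first_classes(imgs, labs)
```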

For each dataset, we trained our model as described earlier. Table 2 lists the reconstruction performance for both the training and testing sets. Note that testing refers to evaluating the network on images that it has never seen during training. Figures 3–5 show the reconstruction results for the MNIST, EMNIST, and Kanji49 datasets, respectively; both training and testing results are included. Note that the input, ground truth, and output are all grayscale images. The reconstruction results for MNIST are the best, as expected, since it contains the fewest classes. The Kanji49 results are slightly worse than those for EMNIST because its characters have more variants, making it more difficult for the network to find a suitable inverse function. For all 3 datasets, the testing results are worse than the training results, as expected. This is a typical phenomenon in deep-learning-based algorithms, especially when the number of training samples is not large enough; in this case, the network tends to over-fit the training set so as to decrease the training loss. It is worth noting that one can always over-fit the training set with a large enough network and larger training batches. We can, to some extent, control the trade-off between a small generalization gap (which implies relatively high testing accuracy) and high training accuracy.

Fig. 3. Reconstruction results for MNIST data. Left shows example images from the training set and Right shows example images from the testing data set.

Fig. 4. Reconstruction results for EMNIST data. Left shows example images from the training set and Right shows example images from the testing data set.

Fig. 5. Reconstruction results for Kanji49 data. Left shows example images from the training set and Right shows example images from the testing data set.

Table 2. Details of datasets and summary of results.

For each dataset, we trained and tested 3 classification networks using 3 different sources of images: the original (ground truth) images, the raw sensor images, and the reconstructed images. The training-testing split is the same as before (9:1). For every network, we used the Adam optimizer and trained for up to 30 epochs. Figure 6 summarizes the classification accuracy for all 3 datasets. A schematic illustrating the two methods, classification directly from the raw sensor data and classification after reconstruction from the raw sensor data, is included in Fig. 6(a).
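This comparison can be summarized by the following sketch, in which the same training routine (Adam, up to 30 epochs) is applied separately to loaders of original, raw-sensor, and reconstructed images; build_classifier is a placeholder standing in for the SimpleNet architecture of Ref. [34], and the routine is illustrative rather than the authors' exact code.

```python
import torch
import torch.nn as nn

def train_and_evaluate(build_classifier, train_loader, test_loader,
                       epochs=30, lr=1e-3, device="cpu"):
    """Train one classifier (Adam, up to 30 epochs) and return its test
    accuracy. The same routine is run three times, once per image source:
    original images, raw sensor images, and reconstructed images."""
    model = build_classifier().to(device)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        model.train()
        for x, y in train_loader:
            loss = loss_fn(model(x.to(device)), y.to(device))
            opt.zero_grad()
            loss.backward()
            opt.step()
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for x, y in test_loader:
            pred = model(x.to(device)).argmax(dim=1).cpu()
            correct += (pred == y).sum().item()
            total += y.numel()
    return correct / total
```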

Fig. 6. (a) Schematic of the two methods of classification. (b) Classification accuracy for the two methods and the 3 data-sets.

For the original images, all 3 datasets show very high training and testing accuracy, but the network performs worse on the raw sensor images. Similar to the reconstruction results, MNIST (6 classes) has the highest training and testing accuracy in all 3 settings owing to its simplicity and its small number of classes. The blue bars in Fig. 6 indicate that the testing classification accuracy for MNIST is similar whether one uses the reconstructed images or the raw sensor data. Figure 7 shows the confusion matrices of classification for the MNIST dataset. For EMNIST (orange bars in Fig. 6), first reconstructing the images from the raw sensor data and then classifying improves the classification accuracy. However, for the Kanji49 dataset (gray bars in Fig. 6), the testing accuracy with reconstructed images is lower than that with the raw sensor images. We believe that further parameter tuning of the classifier networks will improve all of these accuracies.

Fig. 7. Confusion matrix of classification using (a) raw-sensor images and (b) the reconstructed images from the MNIST (6 classes) dataset.

In conclusion, we showed that a U-net-based convolutional neural network can be trained to reconstruct images from a “see-through” lensless camera with good fidelity for the MNIST dataset. The quality of the reconstructed images is worse for more complex images, as in the EMNIST and Kanji49 datasets; however, it may be possible to improve these results with optimized networks and more training data. Second, we compared the classification accuracy of a standard classifier network applied to the raw sensor images and to the images reconstructed with the U-net. Our conclusion from this preliminary comparison is that, for the MNIST dataset with 6 classes, good classification accuracy may be obtained in both cases. However, for more complex datasets like EMNIST and Kanji49, although good image reconstruction is possible, the classification accuracy needs further improvement, possibly through better network training. We attribute this to the increased complexity of the images as well as the larger number of classes. These results could likely be improved by optimizing the classifier network architecture for each case separately. Although our demonstration used a simple configuration and relatively simple images, the concept of a “transparent” computational camera could be applied to eye-tracking in smart glasses or to driver monitoring through automobile windshields. In general, this technology could be useful wherever it is desirable not to obscure a field of view with the image sensor.

Funding

National Science Foundation (1533611).

Acknowledgments

We would like to thank G. Kim, R. Palmer, A. Kachel and R. Guo for fruitful discussion, and assistance with experiments and software.

Disclosures

The authors declare no conflicts of interest.

References

1. P. Wang and R. Menon, “Computational spectroscopy via singular-value decomposition and regularization,” Opt. Express 22(18), 21541–21550 (2014). [CrossRef]  

2. P. Wang and R. Menon, “Computational spectroscopy based on a broadband diffractive optic,” Opt. Express 22(12), 14575–14587 (2014). [CrossRef]  

3. P. Wang and R. Menon, “Ultra-high sensitivity color imaging via a transparent diffractive-filter array and computational optics,” Optica 2(11), 933–939 (2015). [CrossRef]  

4. P. Wang and R. Menon, “Computational snapshot angular-spectral lensless imaging,” arXiv preprint arXiv:1707.08104 [physics.optics] (2017).

5. P. Wang and R. Menon, “Computational multi-spectral video imaging,” J. Opt. Soc. Am. A 35(1), 189–199 (2018). [CrossRef]  

6. O. Katz, P. Heidmann, M. Fink, and S. Gigan, “Non-invasive single-shot imaging through scattering layers and around corners via speckle correlations,” Nat. Photonics 8(10), 784–790 (2014). [CrossRef]  

7. R. Horisaki, R. Takagi, and J. Tanida, “Learning-based imaging through scattering media,” Opt. Express 24(13), 13738–13743 (2016). [CrossRef]  

8. G. Satat, M. Tancik, O. Gupta, B. Heshmat, and R. Raskar, “Object classification through scattering media with deep learning on time resolved measurement,” Opt. Express 25(15), 17466–17479 (2017). [CrossRef]  

9. K. Monakhova, J. Yurtsever, G. Kuo, N. Antipa, K. Yanny, and L. Waller, “Learned reconstructions for practical mask-based lensless imaging,” Opt. Express 27(20), 28075–28090 (2019). [CrossRef]  

10. S. Li, M. Deng, J. Lee, A. Sinha, and G. Barbastathis, “Imaging through glass diffusers using densely connected convolutional networks,” Optica 5(7), 803–813 (2018). [CrossRef]  

11. A. Sinha, J. Lee, S. Li, and G. Barbastathis, “Lensless computational imaging through deep learning,” Optica 4(9), 1117–1125 (2017). [CrossRef]  

12. M. S. Asif, A. Ayremlou, A. Veeraraghavan, R. Baraniuk, and A. Sankaranarayanan, “FlatCam: Replacing lenses with masks and computation,” in Proc. IEEE International Conference on Computer Vision Workshops (2015).

13. S. S. Khan, V. R. Adarsh, V. Boominathan, J. Tan, A. Veeraraghavan, and K. Mitra, “Towards photorealistic reconstruction of highly multiplexed lensless images,” in Proc. IEEE International Conference on Computer Vision (2019).

14. N. Antipa, G. Kuo, R. Heckel, B. Mildenhall, E. Bostan, R. Ng, and L. Waller, “Diffusercam: lensless single-exposure 3D imaging,” Optica 5(1), 1–9 (2018). [CrossRef]  

15. D. S. Jeon, S.-H. Baek, S. Yi, Q. Fu, X. Dun, and W. Heidrich, “Compact snapshot hyperspectral imaging with diffracted rotation,” ACM Trans. Graph. 38(4), 1–13 (2019). [CrossRef]  

16. K. Tajima, T. Shimano, Y. Nakamura, M. Sao, and T. Hoshizawa, “Lensless light-field imaging with multi-phased Fresnel zone aperture,” in Proc. IEEE International Conference on Computational Photography (2017).

17. P. Berto, H. Rigneault, and M. Guillon, “Wavefront sensing with a thin diffuser,” Opt. Lett. 42(24), 5117–5120 (2017). [CrossRef]  

18. J. K. Adams, V. Boominathan, B. W. Avants, D. G. Vercosa, F. Ye, and R. G. Baraniuk, “Single-frame 3D fluorescence microscopy with ultraminiature lensless Flatscope,” Sci. Adv. 3(12), e1701548 (2017). [CrossRef]  

19. A. K. Singh, G. Pedrini, M. Takeda, and W. Osten, “Scatter-plate microscope for lensless microscopy with diffraction-limited resolution,” Sci. Rep. 7(1), 10687 (2017). [CrossRef]  

20. G. Kim and R. Menon, “An ultra-small 3D computational microscope,” Appl. Phys. Lett. 105(6), 061114 (2014). [CrossRef]  

21. G. Kim, N. Nagarajan, E. Pastuzyn, K. Jenks, M. Capecchi, J. Sheperd, and R. Menon, “Deep-brain imaging via epi-fluorescence computational cannula microscopy,” Sci. Rep. 7(1), 44791 (2017). [CrossRef]  

22. G. Kim and R. Menon, “Numerical analysis of computational cannula microscopy,” Appl. Opt. 56(9), D1–D7 (2017). [CrossRef]  

23. G. Kim, N. Nagarajan, M. Capecchi, and R. Menon, “Cannula-based computational fluorescence microscopy,” Appl. Phys. Lett. 106(26), 261111 (2015). [CrossRef]  

24. O. Skocek, T. Nöbauer, L. Weilguny, F. M. Traub, C. N. Xia, M. I. Molodtsov, A. Grama, M. Yamagata, D. Aharoni, and D. D. Cox, “High-speed volumetric imaging of neuronal activity in freely moving rodents,” Nat. Methods 15(6), 429–432 (2018). [CrossRef]  

25. G. Kim, K. Isaacson, R. Palmer, and R. Menon, “Lensless photography with only an image sensor,” Appl. Opt. 56(23), 6450–6456 (2017). [CrossRef]  

26. G. Kim, S. Kapetanovic, R. Palmer, and R. Menon, “Lensless-camera based machine learning for image classification,” arXiv preprint arXiv:1709.00408 [cs.CV] (2017).

27. G. Kim and R. Menon, “Computational imaging enables a “see-through” lensless camera,” Opt. Express 26(18), 22826–22836 (2018). [CrossRef]  

28. G. Barbastathis, A. Ozcan, and G. Situ, “On the use of deep learning in computational imaging,” Optica 6(8), 921–943 (2019). [CrossRef]  

29. O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical Image Computing and Computer-Assisted Intervention (Springer, 2015), pp. 234–241.

30. K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016), pp. 770–778.

31. A. Kendall and Y. Gal, “What uncertainties do we need in Bayesian deep learning for computer vision?” in Advances in Neural Information Processing Systems (2017), pp. 5574–5584.

32. S. Suresh, N. Sundararajan, and P. Saratchandran, “Risk-sensitive loss functions for sparse multi-category classification problems,” Inf. Sci. 178(12), 2621–2638 (2008). [CrossRef]  

33. D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980 (2014).

34. S. H. Hasanpour, M. Rouhani, M. Fayyaz, and M. Sabokrou, “Lets keep it simple, using simple architectures to outperform deeper and more complex architectures,” arXiv preprint arXiv:1608.06037 (2016).

35. Y. LeCun and C. Cortes, “MNIST handwritten digit database,” (2010).

36. G. Cohen, S. Afshar, J. Tapson, and A. van Schaik, “EMNIST: an extension of MNIST to handwritten letters,” arXiv preprint arXiv:1702.05373 (2017).

37. T. Clanuwat, M. Bober-Irizar, A. Kitamoto, A. Lamb, K. Yamamoto, and D. Ha, “Deep learning for classical Japanese literature,” arXiv preprint arXiv:1812.01718 (2018).
