
Lensless inference camera: incoherent object recognition through a thin mask with LBP map generation

Open Access

Abstract

We propose a preliminary lensless inference camera (LLI camera) specialized for object recognition. Instead of performing computationally expensive image reconstruction before inference, the LLI camera applies computationally efficient data preprocessing directly to the optically encoded pattern formed by the mask, and thus achieves real-time inference. This work proposes a new preprocessing approach, named local binary patterns map generation, dedicated to the optically encoded pattern formed by the mask. This preprocessing greatly improves the encoded pattern's robustness to local disturbances in the scene, making practical application of the LLI camera possible. The performance of the LLI camera is analyzed through optical experiments on handwritten digit recognition and gender estimation under changing illumination and with a moving target.

© 2021 Optical Society of America under the terms of the OSA Open Access Publishing Agreement

1. Introduction

This decade has seen a rapidly growing demand for smaller, lighter and cheaper cameras that can be applied in extreme scenarios, such as Internet of Things (IoT) devices, where stringent constraints on size, weight and cost are imposed. Because the optical hardware of lensed cameras is dominated by lens components and the focusing distance, researchers have been studying mask-based lensless cameras, which simplify the optical hardware by replacing the lens with a thin mask placed on top of the image sensor [1–6]. However, mask-based lensless cameras require computational image reconstruction, which means the imaging burden is shifted from optics to computation.

With the proliferation of artificial intelligence, a large proportion of today's cameras no longer photograph but instead capture visual information for inference tasks such as object recognition, face identification, and gender estimation. In such cases, a sensor measurement that resembles the scene is a feasible way to collect visual information, but it may not be the best one. To pursue a high-quality image that resembles the scene as faithfully as possible, a lensed camera relies on complex optical hardware, while a lensless camera requires expensive computation.

This work proposes a preliminary lensless inference (LLI) camera specialized for inference tasks. Since the primary purpose of the LLI camera has shifted from producing scene-resembling images to inference, it can relieve the burden, in both optical hardware and computation, that resembling the scene imposes. In addition, because it never resembles the scene, the LLI camera can provide optical-level privacy protection for privacy-sensitive inference tasks such as secure optical sensing [7] or de-identified attribute recognition like gender or age estimation.

2. Related work

The optical hardware of the LLI camera is the same as that of mask-based lensless cameras. Performing inference with a mask-based lensless camera is not novel [8]. The procedure for applying a lensless camera to inference tasks includes pattern capture, image reconstruction and inference. For image reconstruction, iterative algorithms [9–11] based on the compressed sensing framework [12–14] are most commonly used because they are robust and generally produce better results. To ensure high predictive accuracy, however, iterative reconstruction requires a large number of iterations, which consumes considerable compute time and precludes real-time inference. Achieving high predictive accuracy and real-time inference at the same time is therefore challenging with iterative image reconstruction. Although fast image reconstruction algorithms exist [15–19], it is beneficial to bypass reconstruction entirely, since reconstruction is not intrinsically needed for inference tasks.

There has been much interest in reconstruction-free recognition [20–26]. The frameworks proposed in [20–22] are lens-based and therefore differ from the LLI camera in optical hardware. In particular, single-pixel sensing [20] and its variations, which employ a digital micromirror device, do not allow one-shot inference in principle. The frameworks proposed in [23–25] are lens-free but are dedicated to laser-illuminated targets. Wang et al. [26] use lensless optical hardware for privacy-preserving action recognition, but the predictive accuracy is still limited. The LLI camera proposed in this paper is a simple mask-based lensless camera and pursues incoherent reconstruction-free recognition with significantly improved accuracy by incorporating a preprocessing step suited to the optically encoded pattern formed by the mask.

3. Methods

The framework of the LLI camera is illustrated in Fig. 1. The optical part consists of a thin mask placed several millimeters in front of a standard image sensor. The mask can be an amplitude mask, a diffractive mask or any other optical encoding element; in this work, we apply a binary amplitude mask and consider an incoherent imaging system. The computational part includes data preprocessing and a classifier. In Section 3.1, we address the difficulty of performing inference directly on the optically encoded pattern formed by the mask. In Section 3.2, we propose a new feature extraction technique based on the local binary pattern (LBP) to derive a two-dimensional (2D) feature map. The classifier, a convolutional neural network (CNN), is explained in Section 3.3.

Fig. 1. The framework of LLI camera.

3.1 Imaging model and disturbance amplification issue

Consider a mask-based lensless imaging system consisting of a mask placed in front of the sensor, and assume an incoherent source point in the scene. Through the mask, the source point casts a specific pattern on the sensor. The pattern scales, shifts and varies in intensity in correspondence with changes in the position and intensity of the source point [27], so it can be roughly regarded as spatially shift-invariant [3,4,6]. This pattern, determined by the imaging system, is termed the point spread function (PSF) of the system. By treating an object O as a collection of incoherent points, the sensor measurement $I_{O}$ of the object can be approximately modeled as a linear combination of PSFs [28]. This imaging model can be expressed mathematically as

$$I_{O}(i,j) \approx \sum_{k} \sum_{l} O(k,l) H(i+k,j+l),$$
where $H$ is the PSF. More simply, the sensor measurement can be written as the convolution of the object with the PSF,
$$I_{O} =O \ast H,$$
where $\ast$ denotes discrete convolution. The PSF is at least as large as the mask and usually covers a large proportion of the sensor area. Since the PSF extends globally, local information in the scene is transformed into global information covering a large area of the sensor measurement, according to Eq. (1) or Eq. (2). Figure 2 illustrates an example in which the PSF is a 4 $\times$ 4 binary array; the left column shows an object O and its sensor measurement $I_{O}$. Similarly, a small local disturbance D becomes a global disturbance $I_{D}$ through the mask, formulated as
$$I_{D} =D \ast H.$$

The center column of Fig. 2 shows the local disturbance D and its sensor measurement $I_{D}$.

Fig. 2. An example of the disturbance amplification issue in the imaging model of the mask. Here the PSF is simply a 4 $\times$ 4 binary array. A small local disturbance becomes a global disturbance through the mask, overlaying the sensor measurement of the object and considerably hindering inference tasks.

From the top row of Fig. 2, we can observe that the local disturbance is small enough to cause little difficulty in recognizing the object. Through the mask, however, the local disturbance becomes a global disturbance $I_{D}$ overlaid on the sensor measurement of the object $I_{O}$, as shown in the bottom row of Fig. 2. This considerably hinders inference tasks.
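To make the disturbance amplification concrete, the following minimal NumPy/SciPy sketch (not taken from the paper; the object, disturbance and 4 $\times$ 4 binary PSF are made-up toy arrays) simulates Eqs. (2) and (3) and shows how a single-pixel disturbance is spread over the full extent of the PSF in the measurement.

```python
import numpy as np
from scipy.signal import convolve2d

rng = np.random.default_rng(0)

# A 4 x 4 binary PSF, as in the toy example of Fig. 2.
H = rng.integers(0, 2, size=(4, 4)).astype(float)

# A small object and a small local disturbance (both illustrative).
O = np.zeros((16, 16)); O[5:11, 7:9] = 1.0   # a thin bright bar
D = np.zeros((16, 16)); D[2, 2] = 0.5        # a single bright pixel

I_O = convolve2d(O, H)   # Eq. (2): sensor measurement of the object
I_D = convolve2d(D, H)   # Eq. (3): the disturbance after the mask
I_total = I_O + I_D      # what the sensor actually records

# The 1-pixel disturbance now occupies a 4 x 4 region of the measurement,
# i.e. it has been spread ("amplified") by the extent of the PSF.
print(np.count_nonzero(I_D), "nonzero measurement pixels from a 1-pixel disturbance")
```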

3.2 Preprocessing: LBP map generation

As mentioned in the previous subsection, the mask amplifies disturbances. The encoded pattern (for convenience, the "optically encoded pattern through the mask" will be called the "encoded pattern" in the following) is therefore much more susceptible to disturbances than a normal scene-resembling image, and for practical implementation its robustness to disturbances must be enhanced. To address this issue, this work proposes LBP map generation as a data preprocessing approach for the encoded pattern. We notice that the encoded pattern shares some similarities with a textured image: to characterize an object in an encoded pattern or a textured image, the way the pixels are arranged tends to be a more important characteristic than shape. The local binary patterns histogram (LBPH) algorithm is one of the most widely used texture descriptors [29,30]. It represents each image pixel with a binary code and describes the image by the distribution of these codes. LBPH is widely used in texture analysis because it has a low computation cost and is invariant to grayscale illumination changes. Inspired by this, we generate a 2D map based on the LBPH algorithm for the encoded pattern in order to improve its robustness to disturbances. Following the LBPH algorithm, the binary code of each pixel in the encoded pattern is first built from the differences between the pixel and its equally spaced neighbors. Taking the value of the center pixel as the threshold, each neighbor is assigned a binary value: 1 if its value is equal to or higher than the threshold and 0 if it is lower. These binary values are concatenated into a binary code. This process is depicted in Fig. 3, where the binary code is computed by thresholding a 3 $\times$ 3 neighborhood. To describe the texture feature, the LBPH algorithm then builds a histogram of the binary codes. In the LLI camera, however, the algorithm is employed for data preprocessing rather than feature extraction, so we do not build the histogram; instead, we generate an LBP map by converting the binary code of each pixel to its decimal value. Note that calculating the LBP map is computationally simple, which makes real-time analysis possible.
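For concreteness, a minimal NumPy sketch of the LBP map computation described above follows (plain 3 $\times$ 3 neighborhood as in Fig. 3; the bit ordering and the edge padding at the image border are illustrative choices, since the paper does not specify them).

```python
import numpy as np

def lbp_map(img: np.ndarray) -> np.ndarray:
    """Compute a 3 x 3 local-binary-pattern map of a grayscale image.

    Each pixel is compared with its 8 neighbors: neighbors >= the center
    pixel get bit 1, others get bit 0, and the 8 bits are packed into a
    decimal value in [0, 255].
    """
    img = img.astype(np.float32)
    padded = np.pad(img, 1, mode="edge")
    h, w = img.shape
    # Offsets of the 8 neighbors, ordered clockwise from the top-left.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    out = np.zeros((h, w), dtype=np.uint8)
    for bit, (dy, dx) in enumerate(offsets):
        neighbor = padded[1 + dy:1 + dy + h, 1 + dx:1 + dx + w]
        out |= (neighbor >= img).astype(np.uint8) << (7 - bit)
    return out

# The LBP map of the encoded pattern, not the pattern itself, is fed to the CNN:
# lbp = lbp_map(encoded_pattern)
```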

Fig. 3. The process of LBP map generation.

Since the binary code of a pixel is generated by comparison with its surrounding pixels, the LBP map is a more robust representation than the raw encoded pattern when dealing with disturbances. An example of an LBP map and its disturbance-suppression effect is shown in Fig. 4. The LBP map of the encoded pattern is used as the input of the classifier.

Fig. 4. An example of LBP map and its efficiency of disturbance suppression.

3.3 Classifier: convolutional neural network

The CNN is one of the representative deep learning structures in the field of image recognition [31–33]. With its local receptive fields, weight sharing and spatial subsampling, a CNN extracts features from an image automatically, without user intervention. Taking advantage of this, the LLI camera employs a CNN as the classifier to extract relevant features from the LBP map of the encoded pattern. In a CNN, low-level local features such as color, texture and edges are first learned from the raw input pixels by convolution with learnable filter kernels and are then developed into high-level abstractions. As illustrated in Fig. 4, the LBP map has sharp variations in pixel values that indicate texture regularity and localize edge contours better than the raw encoded pattern does. Therefore, when a CNN is applied as the classifier, the LBP map not only suppresses disturbances but also enhances the low-level local features.

4. Experiments

To analyze the performance of the LLI camera, we conduct a series of optical experiments. The optical setup is shown in Fig. 5. The monitor, mask and sensor are kept parallel, with their center points aligned. The target is displayed at the center of a monitor placed in front of the LLI camera. The displayed image has a resolution of 200 pixels $\times$ 200 pixels and a physical size of 25 cm $\times$ 25 cm. The binary amplitude mask, fabricated by chromium deposition on a synthetic-silica plate, presents a binary pseudorandom array. As shown in Fig. 6, the mask has an optical size of 9 mm $\times$ 9 mm and an aperture size of 30 µm $\times$ 30 µm. The transmission-area ratio is 25%, and each transmission spot occupies 6 µm $\times$ 6 µm. The sensor is a monochrome 12.3-megapixel CMOS (Sony IMX304) with a 14 mm $\times$ 10 mm optical size and a 3.45 µm $\times$ 3.45 µm pixel size. The image sensor and mask are separated by 2.5 mm, capturing an approximately 75$^\circ$ field of view. Considering spatial sampling by the image sensor, the encoded-pattern resolution of mask-based lensless imaging is given by [6]

$$\mathrm{Resolution} = \frac{\text{target-mask distance}}{\text{mask-sensor distance}}\times \text{pixel size of sensor}.$$

Calculated with Eq. (4), the pattern resolution of the built LLI camera is around 0.0014$d$, where $d$ is the distance between the target and the mask.
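As a quick check of that figure, substituting the reported mask-sensor distance (2.5 mm) and sensor pixel size (3.45 µm) into Eq. (4) gives
$$\mathrm{Resolution} = \frac{d}{2.5\ \mathrm{mm}} \times 3.45\ \mu\mathrm{m} = 1.38 \times 10^{-3}\, d \approx 0.0014\, d.$$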

Fig. 5. Experiment setup.

Fig. 6. Binary amplitude mask.

The definitions of the datasets are illustrated in Fig. 7. The dataset displayed on the monitor as the target is named the normal dataset (NDS); it contains a normal training dataset (NTDS) and a normal validation dataset (NVDS). The dataset consisting of encoded patterns recorded on the sensor is named the encoded dataset (EDS); it contains an encoded training dataset (ETDS) and an encoded validation dataset (EVDS), corresponding to NTDS and NVDS respectively. In order to compare the LLI camera with the mask-based lensless camera, we build a reconstructed dataset (RDS), containing a reconstructed training dataset (RTDS) and a reconstructed validation dataset (RVDS), by performing image reconstruction on EDS. The reconstruction algorithm is the widely used alternating direction method of multipliers (ADMM) [11], which iteratively solves the optimization problem with an $\ell _{1}$-norm in the cost function. Figure 8 shows reconstructed images for different iteration counts; images reconstructed with 1 iteration and with 10 iterations are used in RDS.
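For reference, the sketch below shows one standard NumPy formulation of ADMM with an $\ell_{1}$ prior for deconvolution under a circular-convolution model. It is a generic textbook form of the algorithm [11], not the authors' reconstruction code, and the regularization weight, penalty parameter and iteration count are placeholders.

```python
import numpy as np

def admm_l1_deconv(b, psf, lam=1e-2, rho=1.0, n_iter=10):
    """Estimate x from b ~= x (*) psf by solving
       min_x 0.5 * ||x (*) psf - b||^2 + lam * ||x||_1
    with ADMM, assuming circular convolution so that the quadratic
    subproblem diagonalizes in the Fourier domain."""
    P = np.fft.fft2(psf, s=b.shape)        # transfer function of the PSF
    B = np.fft.fft2(b)
    denom = np.abs(P) ** 2 + rho
    x = np.zeros(b.shape)
    z = np.zeros(b.shape)
    u = np.zeros(b.shape)                  # scaled dual variable
    for _ in range(n_iter):
        # x-update: least-squares subproblem, solved exactly via FFT
        rhs = np.conj(P) * B + rho * np.fft.fft2(z - u)
        x = np.real(np.fft.ifft2(rhs / denom))
        # z-update: soft-thresholding, the proximal operator of the l1 norm
        v = x + u
        z = np.sign(v) * np.maximum(np.abs(v) - lam / rho, 0.0)
        # dual update
        u = u + x - z
    return z
```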

Fig. 7. Definition and explanation of datasets. RDS is reconstructed from EDS by performing the iterative reconstruction algorithm ADMM with $\ell _{1}$-norm.

Fig. 8. Reconstructed images with different iteration counts. The applied image reconstruction algorithm is ADMM with $\ell _{1}$-norm. Reconstructed images with 1 iteration and 10 iterations are used in RDS.

We perform experiments on the popular computer vision tasks of handwritten digit recognition and gender estimation. For handwritten digit recognition, NDS uses the MNIST dataset [34], which contains 60,000 training images for NTDS and 10,000 validation images for NVDS. For gender estimation, around 60,000 cropped female faces and 60,000 cropped male faces are collected from the IMDB-WIKI dataset [35] for NTDS, and the LFW dataset [36], one of the benchmark datasets for gender estimation, is used for NVDS. Since face images from both IMDB-WIKI and LFW are captured under varying position, pose, lighting, expression, background, camera quality, occlusion and age, gender estimation with these datasets is a more difficult and advanced inference task than handwritten digit recognition with MNIST, where all digits are white and written on a noise-free black background.

The CNN architecture applied in all experiments below is ResNet-18 [37], modified in the input layer to accept a 224 $\times$ 224 grayscale image. All images from NDS, EDS and RDS are therefore resized and converted to 224 $\times$ 224 grayscale. For all tasks, ResNet-18 is trained with Adam [38] ($\beta _{1}$=0.9, $\beta _{2}$=0.999), a weight decay of 0.1, and a mini-batch size of 64. Training is implemented on two NVIDIA GeForce GTX TITAN X GPUs with Keras 2.3.1, TensorFlow 1.2.1 and Python 2.7. Training on NDS, EDS and RDS takes approximately the same time. For the handwritten digit datasets, the model converges within 5 epochs and each epoch takes approximately 4 minutes; for the gender datasets, the model converges within 8 epochs and each epoch takes approximately 12 minutes.
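The models in this paper were built with Keras/TensorFlow; because standard Keras does not ship a ResNet-18, the sketch below instead uses torchvision's ResNet-18 purely to illustrate the setup described above (single-channel 224 $\times$ 224 input, Adam with $\beta _{1}$=0.9, $\beta _{2}$=0.999, weight decay 0.1, mini-batch size 64). The learning rate is an assumed placeholder; it is not reported in the paper.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

def build_classifier(num_classes: int) -> nn.Module:
    """ResNet-18 with the first convolution swapped for a 1-channel input,
    matching the 224 x 224 grayscale LBP maps fed to the network."""
    model = resnet18(num_classes=num_classes)
    model.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
    return model

model = build_classifier(num_classes=10)      # 10 classes for MNIST digits
optimizer = torch.optim.Adam(model.parameters(),
                             lr=1e-4,         # assumed; not stated in the paper
                             betas=(0.9, 0.999),
                             weight_decay=0.1)
criterion = nn.CrossEntropyLoss()

# One training step on a mini-batch of 64 LBP maps (shape: 64 x 1 x 224 x 224).
x = torch.randn(64, 1, 224, 224)
y = torch.randint(0, 10, (64,))
optimizer.zero_grad()
loss = criterion(model(x), y)
loss.backward()
optimizer.step()
```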

We perform two experiments to verify the LLI camera's feasibility in a real environment: we test the LLI camera under uneven illumination and with changing target positions. We also compare the LLI camera with a lensed camera and the mask-based lensless camera in terms of both predictive accuracy and inference speed.

4.1 Experiment 1: test under uneven illumination; comparison with lensed camera and mask-based lensless camera

In the setup of experiment 1, the monitor is fixed 35 cm away from the LLI camera. Real environments contain many disturbances; in experiment 1 we test with uneven illumination, one of the most common. To simulate uneven illumination, we create in advance a normal dataset with local illumination variation (NDS-IV) by randomly adding local illumination variation to NDS. NDS-IV contains a normal training dataset with local illumination variation (NTDS-IV) and a normal validation dataset with local illumination variation (NVDS-IV). The effect of adding local illumination variation is shown in Fig. 9. Note that face images from the IMDB-WIKI and LFW datasets are captured in the wild, where changes in lighting conditions already exist; for gender estimation, therefore, both NDS/RDS/EDS and NDS-IV/RDS-IV/EDS-IV contain illumination variations, but the latter contain more.
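The paper does not specify how the local illumination variation is generated; the sketch below shows one plausible, purely illustrative way to produce such a variation (a randomly placed smooth brightness bump multiplied into the image), included only to make the construction of NDS-IV concrete.

```python
import numpy as np

def add_local_illumination_variation(img, strength=0.6, sigma_frac=0.15, rng=None):
    """Multiply the image by a smooth, randomly placed brightness bump.

    strength > 0 brightens and strength < 0 darkens the affected region.
    This is an illustrative stand-in for the (unspecified) procedure used
    to build NDS-IV from NDS."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = img.shape
    cy, cx = rng.uniform(0, h), rng.uniform(0, w)   # random patch center
    sigma = sigma_frac * max(h, w)
    yy, xx = np.mgrid[0:h, 0:w]
    bump = np.exp(-((yy - cy) ** 2 + (xx - cx) ** 2) / (2.0 * sigma ** 2))
    out = img.astype(np.float64) * (1.0 + strength * bump)
    return np.clip(out, 0, 255).astype(img.dtype)
```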

Fig. 9. The effect of randomly adding local illumination variation. Note that face images from the IMDB-WIKI and LFW datasets are captured in the wild, where changes in lighting conditions already exist; for gender estimation, both NDS/RDS/EDS and NDS-IV/RDS-IV/EDS-IV therefore contain illumination variations, but the latter contain more.

The experimental schemes of experiment 1 are illustrated in Fig. 10. The lensed-camera scheme uses NDS or NDS-IV, the mask-based lensless camera uses RDS or RDS-IV, and inference directly on the encoded pattern and the LLI camera use EDS or EDS-IV. Note that in all schemes, normalization is applied before images are fed to ResNet-18. For the mask-based lensless camera, image reconstruction with 1 iteration and with 10 iterations is used. The predictive accuracy results are listed in Table 1 (the false acceptance rate and true acceptance rate of each gender estimation result are listed in the Appendix). In Table 1, the lensed camera is abbreviated "lensed", the mask-based lensless camera is abbreviated "Rec. (1 iter.)" for reconstruction with 1 iteration and "Rec. (10 iter.)" for reconstruction with 10 iterations, and the LLI camera is abbreviated "LLI". "Digit" and "gender" denote the tasks of handwritten digit recognition and gender estimation respectively, and "IV" indicates illumination variation. A calculation speed comparison among LBP map generation, inference with ResNet-18 and iterative image reconstruction is listed in Table 2, where all calculations are timed on a 224 $\times$ 224 $\times$ 1 image under Windows 10, Keras 2.3.1, TensorFlow 1.2.1 and Python 2.7, on a machine with an Intel Core i7-7700 CPU.

Fig. 10. Experimental schemes of experiment 1. In Table 1, the lensed camera is abbreviated "lensed", the mask-based lensless camera is abbreviated "Rec. (1 iter.)" for reconstruction with 1 iteration and "Rec. (10 iter.)" for reconstruction with 10 iterations, and the LLI camera is abbreviated "LLI". "IV" indicates illumination variation.


Table 1. Predictive accuracy result of experiment 1


Table 2. Calculation speed comparison

Comparing inference directly on the encoded pattern with the LLI camera in Table 1, we make two observations. First, inference directly on the encoded pattern is possible for simple inference tasks such as handwritten digit recognition under controlled lighting (test D1, 98.83%), whereas it is infeasible (test D2, 11.35%) or performs poorly (test D3, 89.53%) when uneven illumination is added to the target. The reason is that the mask amplifies disturbances in the scene, as explained in the previous section. For an advanced task like gender estimation, inference directly on the encoded pattern becomes infeasible (tests G1, 50.52%; G2, 50.40%; G3, 61.45%). The second observation is that LBP map generation is an efficient preprocessing approach for suppressing disturbances and enhancing feature extraction from the encoded pattern. With LBP map generation, the LLI camera achieves high predictive accuracy for simple tasks like handwritten digit recognition, both without uneven illumination (test D1, 98.87%) and with it (test D3, 97.74%). For advanced tasks like gender estimation, LBP map generation makes inference feasible, although the result is still unsatisfactory when uneven illumination exists.

We now compare the predictive accuracy of the LLI camera, the lensed camera and the mask-based lensless camera in Table 1. The LLI camera surpasses the mask-based lensless camera and achieves an accuracy close to that of the lensed camera, both with and without uneven illumination. As for computational efficiency, Table 2 shows that LBP map generation takes 0.16 millisecond, which is negligible compared with inference with ResNet-18 or iterative image reconstruction with ADMM.

4.2 Experiment 2: test with changing positions of the target

To further verify the LLI camera's feasibility in a real environment, we test it with the target placed at changing three-dimensional (3D) positions. The experimental setup is shown in Fig. 11. The origin of the axes is set at the center of the monitor, 35 cm away from the LLI camera. Translation in the z direction is achieved by moving the monitor, and translation in the x or y direction by moving the display window on the monitor. By changing the target's 3D position (x,y,z), ETDS-(x,y,z) and EVDS-(x,y,z) are collected, where ETDS-(x,y,z) or EVDS-(x,y,z) denotes an ETDS or EVDS collected with the target at position (x,y,z). In total, 9 ETDS are collected, as listed in Table 3; the unit of measurement is centimeters. A model trained with only ETDS-(0,8,8) and a model trained with all 9 ETDS are used to test 24 EVDS. Results are shown in Fig. 12 for handwritten digit recognition and Fig. 13 for gender estimation (data for all tests are listed in the Appendix).

Fig. 11. Setup of experiment 2. The target moves in 3D positions. The origin of the axis is set in the center of the monitor, 35 cm away from the LLI camera.

Fig. 12. Result for the test on handwritten digit recognition with moving target. Data is listed in Appendix.

Fig. 13. Result for the test on gender estimation with moving target. Data is listed in Appendix.


Table 3. Training datasets collected in experiment 2

For handwritten digit recognition with the target moving in the z direction, shown in Fig. 12(a), the LLI camera maintains high accuracy even when only ETDS-(0,8,8) is used for training. For the target moving in the x and y directions, shown in Fig. 12(b), accuracy decreases dramatically as the target moves away from the training position (0,8,8) if only ETDS-(0,8,8) is used for training, whereas it remains high if all 9 ETDS are used. Figure 12 reveals that, for handwritten digit recognition, the LLI camera is robust to the target's translation in the z direction but not to translation in the x and y directions when the model is trained with the target at only a single position; this robustness can be greatly enhanced by training with the target at multiple positions.

For the more advanced task of gender estimation, shown in Fig. 13, the LLI camera is only weakly robust to target movement. However, we observe an accuracy improvement when the test position is included among the training positions. This means that, for gender estimation, the LLI camera can still remain robust to target movement if it is trained with a sufficient number of target positions.

5. Conclusion

We report a preliminary lensless inference camera specialized for object recognition. The proposed LLI camera, consisting of a thin mask placed in front of an image sensor, has ultra-simple optical hardware compared with a lensed camera. It shares the same optical hardware as a mask-based lensless camera but bypasses computationally expensive image reconstruction. We show that the mask amplifies local disturbances in the scene into global disturbances in the encoded pattern, making inference directly on encoded patterns challenging. To deal with this issue, we propose LBP map generation as a data preprocessing approach for the encoded pattern. Optical experiments verify that the LLI camera surpasses the mask-based lensless camera and comes close to the lensed camera in predictive accuracy. To test the LLI camera's feasibility in practical implementation, we perform experiments under uneven illumination and with a moving target. The LLI camera is highly robust to illumination variations and target movement for simple tasks such as handwritten digit recognition. For more advanced tasks such as gender estimation, it gives a decent result under uneven illumination but is not robust to a moving target; keeping a high accuracy with a moving target requires training at multiple positions beforehand.

The feasibility of the LLI camera has thus been preliminarily verified. In the future, we plan to build a training dataset of 3D objects for the LLI camera so that it can operate in real environments. Another direction is performance improvement through optical hardware optimization: by optimizing the mask design and the mask-sensor distance, the LLI camera's signal-to-noise ratio, diffraction limit and adaptability to specific targets are expected to improve.

Considering its ultra-simple hardware and its capability for real-time one-shot inference, the LLI camera could find applications in IoT devices, such as smart watches, where various inference tasks are needed while tiny size, light weight and low cost are also required. Another interesting property of the LLI camera is that it does not require knowledge of the mask beforehand and does not produce any human-interpretable image at any point in the process. It is therefore expected to offer privacy protection for privacy-sensitive inference tasks such as secure optical sensing and de-identified attribute recognition.

Appendix

The false acceptance rate (FAR) and true acceptance rate (TAR) of each gender estimation result in experiment 1 are listed in Table 4 for completeness. The complete results of experiment 2 are listed in Table 5.


Table 4. FAR and TAR of each gender estimation result in experiment 1


Table 5. Complete result of experiment 2

Disclosures

The authors declare no conflicts of interest.

References

1. D. G. Stork and P. R. Gill, “Optical, Mathematical, and Computational Foundations of Lensless Ultra-Miniature Diffractive Imagers and Sensors,” Int. J. Adv. Syst. Meas. 7, 201–208 (2014).

2. S. K. Sahoo, D. Tang, and C. Dang, “Single-shot multispectral imaging with a monochromatic camera,” Optica 4(10), 1209–1213 (2017). [CrossRef]  

3. N. Antipa, G. Kuo, R. Heckel, B. Mildenhall, E. Bostan, R. Ng, and L. Waller, “Diffusercam: lensless single-exposure 3d imaging,” Optica 5(1), 1–9 (2018). [CrossRef]  

4. M. J. DeWeert and B. P. Farm, “Lensless coded aperture imaging with separable doubly Toeplitz masks,” Opt. Eng. 54(2), 023102 (2015). [CrossRef]  

5. M. S. Asif, A. Ayremlou, A. Sankaranarayanan, A. Veeraraghavan, and R. G. Baraniuk, “Flatcam: Thin, lensless cameras using coded aperture and computation,” IEEE Transactions on Comput. Imaging 3(3), 384–397 (2017). [CrossRef]  

6. V. Boominathan, J. K. Adams, J. T. Robinson, and A. Veeraraghavan, “Phlatcam: Designed phase-mask based thin lensless camera,” IEEE Trans. Pattern Anal. Mach. Intell. 42(7), 1618–1629 (2020). [CrossRef]  

7. B. Javidi, A. Carnicer, M. Yamaguchi, T. Nomura, E. Pérez-Cabré, M. S. Millán, N. K. Nishchal, R. Torroba, J. F. Barrera, W. He, X. Peng, A. Stern, Y. Rivenson, A. Alfalou, C. Brosseau, C. Guo, J. T. Sheridan, G. Situ, M. Naruse, T. Matsumoto, I. Juvells, E. Tajahuerce, J. Lancis, W. Chen, X. Chen, P. W. H. Pinkse, A. P. Mosk, and A. Markman, “Roadmap on optical security,” J. Opt. 18(8), 083001 (2016). [CrossRef]  

8. J. Tan, L. Niu, J. K. Adams, V. Boominathan, J. T. Robinson, R. G. Baraniuk, and A. Veeraraghavan, “Face detection and verification using lensless cameras,” IEEE Transactions on Comput. Imaging 5(2), 180–194 (2019). [CrossRef]  

9. J. M. Bioucas-Dias and M. A. Figueiredo, “A new twist: Two-step iterative shrinkage/thresholding algorithms for image restoration,” IEEE Transactions on Image Process. 16(12), 2992–3004 (2007). [CrossRef]  

10. A. Beck and M. Teboulle, “Fast gradient-based algorithms for constrained total variation image denoising and deblurring problems,” IEEE Transactions on Image Process. 18(11), 2419–2434 (2009). [CrossRef]  

11. S. Boyd, N. Parikh, and E. Chu, Distributed optimization and statistical learning via the alternating direction method of multipliers (Now Publishers Inc, 2011).

12. D. L. Donoho, “Compressed sensing,” IEEE Trans. Inf. Theory 52(4), 1289–1306 (2006). [CrossRef]  

13. E. J. Candès and M. B. Wakin, “An introduction to compressive sampling,” IEEE signal processing magazine 25(2), 21–30 (2008). [CrossRef]  

14. A. Stern, Optical compressive imaging (CRC Press, 2016).

15. T. Shimano, Y. Nakamura, K. Tajima, M. Sao, and T. Hoshizawa, “Lensless light-field imaging with Fresnel zone aperture: quasi-coherent coding,” Appl. Opt. 57(11), 2841–2850 (2018). [CrossRef]  

16. Y. Li, Y. Xue, and L. Tian, “Deep speckle correlation: a deep learning approach toward scalable imaging through scattering media,” Optica 5(10), 1181–1190 (2018). [CrossRef]  

17. K. Monakhova, J. Yurtsever, G. Kuo, N. Antipa, K. Yanny, and L. Waller, “Learned reconstructions for practical mask-based lensless imaging,” Opt. Express 27(20), 28075–28090 (2019). [CrossRef]  

18. S. S. Khan, V. Adarsh, V. Boominathan, J. Tan, A. Veeraraghavan, and K. Mitra, “Towards photorealistic reconstruction of highly multiplexed lensless images,” in Proceedings of the IEEE International Conference on Computer Vision (2019), pp. 7860–7869.

19. T. Nakamura, T. Watanabe, S. Igarashi, X. Chen, K. Tajima, K. Yamaguchi, T. Shimano, and M. Yamaguchi, “Superresolved image reconstruction in fza lensless camera by color-channel synthesis,” Opt. Express 28(26), 39137–39155 (2020). [CrossRef]  

20. M. A. Davenport, M. F. Duarte, M. B. Wakin, J. N. Laska, D. Takhar, K. F. Kelly, and R. G. Baraniuk, “The smashed filter for compressive classification and target recognition,” Computational Imaging V, vol. 6498 (International Society for Optics and Photonics, 2007), p. 64980H.

21. T. Ando, R. Horisaki, and J. Tanida, “Speckle-learning-based object recognition through scattering media,” Opt. Express 23(26), 33902–33910 (2015). [CrossRef]  

22. T. Okawara, M. Yoshida, H. Nagahara, and Y. Yagi, “Action recognition from a single coded image,” in 2020 IEEE International Conference on Computational Photography (ICCP) (IEEE, 2020), pp. 1–11.

23. B. Javidi, S. Rawat, S. Komatsu, and A. Markman, “Cell identification using single beam lensless imaging with pseudo-random phase encoding,” Opt. Lett. 41(15), 3663–3666 (2016). [CrossRef]  

24. B. Javidi, A. Markman, and S. Rawat, “Automatic multicell identification using a compact lensless single and double random phase encoding system,” Appl. Opt. 57(7), B190–B196 (2018). [CrossRef]  

25. T. O’Connor, C. Hawxhurst, L. M. Shor, and B. Javidi, “Red blood cell classification in lensless single random phase encoding using convolutional neural networks,” Opt. Express 28(22), 33504–33515 (2020). [CrossRef]  

26. Z. W. Wang, V. Vineet, F. Pittaluga, S. N. Sinha, O. Cossairt, and S. Bing Kang, “Privacy-preserving action recognition using coded aperture videos,” in IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) (2019).

27. I. Freund, M. Rosenbluh, and S. Feng, “Memory effects in propagation of optical waves through disordered media,” Phys. Rev. Lett. 61(20), 2328–2331 (1988). [CrossRef]  

28. J. W. Goodman, Introduction to Fourier optics (Roberts and Company Publishers, 2005).

29. L. Wang and D.-C. He, “Texture classification using texture spectrum,” Pattern Recognit. 23(8), 905–910 (1990). [CrossRef]  

30. T. Ojala, M. Pietikäinen, and D. Harwood, “A comparative study of texture measures with classification based on featured distributions,” Pattern Recognit. 29(1), 51–59 (1996). [CrossRef]  

31. K. Fukushima, “Neocognitron: A hierarchical neural network capable of visual pattern recognition,” Neural networks 1(2), 119–130 (1988). [CrossRef]  

32. Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel, “Backpropagation applied to handwritten zip code recognition,” Neural computation 1(4), 541–551 (1989). [CrossRef]  

33. A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” Commun. ACM 60(6), 84–90 (2017). [CrossRef]  

34. Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proc. IEEE 86(11), 2278–2324 (1998). [CrossRef]  

35. R. Rothe, R. Timofte, and L. V. Gool, “Dex: Deep expectation of apparent age from a single image,” in IEEE International Conference on Computer Vision Workshops (ICCVW) (2015).

36. G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller, “Labeled faces in the wild: A database for studying face recognition in unconstrained environments,” Tech. Rep. 07-49, University of Massachusetts, Amherst 2007.

37. K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition (2016), pp. 770–778.

38. D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980.
