We propose a framework for three-dimensional (3D) object recognition and classification in very low illumination environments using convolutional neural networks (CNNs). 3D images are reconstructed using 3D integral imaging (InIm) with conventional visible spectrum image sensors. After imaging the low light scene using 3D InIm, the 3D reconstructed image has a higher signal-to-noise ratio than a single 2D image, which is a result of 3D InIm being optimal in the maximum likelihood sense for read-noise dominant images. Once 3D reconstruction has been performed, the 3D image is denoised and regions of interest are extracted to detect 3D objects in a scene. The extracted regions are then inputted into a CNN, which was trained under low illumination conditions using 3D InIm reconstructed images, to perform object recognition. To the best of our knowledge, this is the first report of utilizing 3D InIm and convolutional neural networks for 3D training and 3D object classification under very low illumination conditions.
© 2018 Optical Society of America under the terms of the OSA Open Access Publishing Agreement
Imaging a scene under low illumination with conventional image sensors operating in the visible spectrum is difficult, as the captured images become read-noise dominant. Thus, the signal-to-noise ratio (SNR) suffers, resulting in poor scene visualization and making object recognition difficult. There is much interest across a broad range of fields in imaging under low-light conditions, such as remote sensing [1], underwater imaging [2], night vision [3,4], etc. Image sensors designed for imaging in low-light conditions include electron-multiplying CCD (EM-CCD) cameras [3,4], scientific CMOS (sCMOS) cameras, and night vision cameras. However, both EM-CCD and sCMOS cameras are expensive and bulky. In particular, the EM-CCD must be cooled to around −55 °C prior to operation. Night vision operates by amplifying the number of photons in the scene; if too few photons are available, an active near-infrared source is required to illuminate the scene. Infrared cameras are effective in low-light conditions, but they have lower resolution than visible-range cameras and may require bulkier and more expensive optics.
Passive cameras for three-dimensional (3D) imaging using integral imaging (InIm) have been reported [6–10]. In 3D InIm, an array of cameras or a single moving camera may be used to capture a scene, with each camera obtaining a unique perspective of the scene known as an elemental image (EI). Using the acquired EIs, a 3D image can be reconstructed computationally or optically. Integral imaging has been investigated under low illumination conditions. In one study, a photon-counting model was used to simulate photon-limited images from EIs that captured a 3D scene under sufficient illumination, and computational 3D InIm reconstruction was performed using the photon-limited EIs. It was shown that the 3D InIm reconstruction produces the maximum likelihood estimate of objects that lie on the corresponding 3D reconstructed depth plane; thus, the 3D reconstructed image has a higher SNR than a single 2D image. In [12], a 16-bit cooled camera was used to obtain EIs of objects under photon-starved conditions; after 3D reconstruction and total-variation denoising, object visualization was achieved where it was not possible using a single 2D image. In [13], 3D InIm was used to obtain EIs of an outdoor scene containing an object behind occlusion under low illumination conditions. With a single 2D image, face detection was not possible in the experiments, whereas after computational 3D InIm reconstruction, object detection was successful. However, object classification at low light levels was not possible in that approach.
In this paper, we show for the first time that it is possible not only to detect, but also to classify, 3D objects in a scene under very low illumination conditions using 3D InIm. The novelty of the manuscript stems from the unique approach to 3D training of the CNN classifier: we train the CNN using denoised 3D reconstructed images computed from elemental images obtained under various low illumination conditions. By using 3D training data under these illumination conditions, the CNN is able to perform face recognition because it has been trained to recognize faces under non-optimal illumination. Thus, the novelty lies in enabling 3D object recognition at low light levels, which may not be possible using conventional 2D approaches.
We use low-cost passive image sensors that operate in the visible spectrum. The EIs are read-noise dominant, and 3D InIm is naturally optimal in the maximum likelihood sense for read-noise dominant images because read noise follows a Gaussian distribution. Upon 3D InIm reconstruction, the SNR increases, resulting in improved image visualization. The scene is then denoised with total-variation regularization using an augmented Lagrange approach (TV-denoising) [14]. Regions of interest are obtained to detect faces in the scene [15], which are then input into a pre-trained convolutional neural network [16,17] for facial recognition. We demonstrate by experiments that the 3D InIm system, trained in the dark with a CNN, was able to successfully perform face detection and classification at low light levels.
2. Three-dimensional integral imaging in low illumination conditions
3D InIm is a 3D imaging technique that uses a lenslet array, an array of cameras, or a moving camera to capture different perspectives of a scene, known as elemental images. 3D InIm captures both intensity and angular information. Figure 1(a) depicts the integral imaging pickup stage. Once the EIs have been acquired, the scene can be reconstructed, as shown in Fig. 1(b), by back-propagating the captured light rays through a virtual pinhole to a particular depth plane a distance z away. Figure 1(c) depicts the chief ray, Ri, from the object surface in 3D space (x,y,z) at location (x0,y0,z0), with azimuth angle θi and zenith angle ϕi, being imaged by the i-th lens located at (x1,y1,z1) and arriving at the sensor plane at (τ,ψ). Using the acquired elemental images, 3D InIm reconstruction can be performed optically or computationally. Figure 2 depicts the synthetic aperture integral imaging (SAII) pickup and reconstruction stages. Computational 3D InIm reconstruction is implemented as follows:
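The computational reconstruction described above is commonly implemented as a shift-and-sum over the elemental images: each EI is shifted according to the camera pitch, pixel size, and the magnification at depth z, then the overlapping shifted images are averaged. A minimal sketch is given below; the function and parameter names are my own, and edge wrap-around from `np.roll` is ignored for simplicity (a full implementation would zero-pad and track the per-pixel overlap count O(x, y)).

```python
import numpy as np

def reconstruct_depth_plane(elemental_images, z, f, pitch_x, pitch_y, pixel_size):
    """Shift-and-sum reconstruction of one depth plane from a K x L grid of EIs.

    elemental_images: array of shape (K, L, H, W); z: reconstruction depth;
    f: focal length; pitch_*: camera spacing; pixel_size: sensor pixel pitch.
    All lengths must share the same units (e.g. mm).
    """
    K, L, H, W = elemental_images.shape
    M = z / f                                  # magnification at depth z
    accum = np.zeros((H, W))
    overlap = np.zeros((H, W))                 # per-pixel overlap count
    for k in range(K):
        for l in range(L):
            # pixel shift of the (k, l)-th EI for the plane at depth z
            sx = int(round(k * pitch_x / (pixel_size * M)))
            sy = int(round(l * pitch_y / (pixel_size * M)))
            shifted = np.roll(elemental_images[k, l], (-sy, -sx), axis=(0, 1))
            accum += shifted
            overlap += 1                       # simplified: uniform overlap
    return accum / np.maximum(overlap, 1)      # average over overlapping EIs
```

Objects lying on the chosen depth plane add coherently under this averaging, while off-plane objects and read noise are smeared out, which is the source of the SNR gain discussed below.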
A captured image can be defined as E(x,y) = I(x,y)r(x,y), where I(x,y) > 0 is the illumination factor and r(x,y) is the reflection coefficient between 0 and 1. As the scene illumination decreases, the illumination factor diminishes, and read noise becomes greater than the scene signal, hindering adequate scene visualization; the image becomes read-noise dominant. Read noise results from on-chip sensor noise, is additive, and can be modeled as a zero-mean Gaussian distribution. Using Eq. (1), the 3D InIm reconstruction with read noise is:
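The additive read-noise model above can be illustrated with a small simulation. This is a sketch under my own assumed numbers (a reflectance map r, a read-noise standard deviation of 5), not the paper's experimental values: at high illumination the captured image closely tracks the scene, while at low illumination the zero-mean Gaussian read noise dominates.

```python
import numpy as np

rng = np.random.default_rng(0)
r = rng.uniform(0.2, 0.8, size=(256, 256))    # reflectance r(x, y) in (0, 1)
sigma_read = 5.0                              # assumed read-noise std (additive, zero mean)

def capture(illumination):
    """Model a captured image E = I * r + n, with n ~ N(0, sigma_read^2)."""
    return illumination * r + rng.normal(0.0, sigma_read, size=r.shape)

bright = capture(1000.0)   # signal far above the read noise
dark = capture(2.0)        # signal below the read noise: read-noise dominant
```

Correlating each capture with the true reflectance map shows the effect: the bright capture is nearly a copy of the scene, while the dark capture is essentially noise.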
Taking the variance of Eq. (2), the variance of the noise component for a fixed z, assuming that noise is wide sense stationary, is:
As the number of overlapping images increases, the variance, or noise power, of the read noise decreases. It has been shown that integral imaging reconstruction is naturally optimal in the maximum-likelihood sense for read-noise limited images, as the noise distribution is approximately Gaussian. Without photon-counting devices to measure the flux density, the SNR of the image is estimated as:
The number of photons per pixel (Nphotons) can be estimated, assuming dark current noise is negligible and the exposure time is sufficiently short, as:
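Since Eq. (5) is not reproduced here, the sketch below shows a common form of this estimate: bias-subtracted counts are converted to photoelectrons via the sensor's conversion gain, then divided by the quantum efficiency. The conversion-gain value is my own assumption for illustration; only the QE at 525 nm (0.44 electrons/photon) comes from the paper.

```python
def photons_per_pixel(mean_counts, conversion_gain_e_per_dn, qe):
    """Estimate photons/pixel from mean bias-subtracted counts (DN).

    Assumes negligible dark current and a sufficiently short exposure,
    so nearly all collected electrons are photoelectrons:
        electrons = counts * conversion_gain
        photons   = electrons / QE
    """
    electrons = mean_counts * conversion_gain_e_per_dn
    return electrons / qe

# QE at 525 nm from the paper; conversion gain of 0.18 e-/DN is assumed.
n_photons = photons_per_pixel(mean_counts=100.0,
                              conversion_gain_e_per_dn=0.18,
                              qe=0.44)
```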
3. Experimental results
A synthetic aperture integral imaging experiment was conducted using an Allied Vision Mako-192 camera with dimensions of 86.4 mm x 44 mm x 29 mm. The sensor is an e2v EV76C570 CMOS sensor. The lens is F/1.8 with a focal length of 50 mm; the pixel size is 4.5 um x 4.5 um, the sensor size is 7.2 mm (H) x 5.4 mm (V), and the image size is 1600 (H) x 1200 (V) pixels. The camera read noise is 20.47 electrons/pixel/sec, and the quantum efficiency at 525 nm is 0.44 electrons/photon. A gain of 0 dB was used in the experiments. The InIm setup consisted of 72 elemental images in a 3 x 24 array with a pitch of 10 mm (H) x 80 mm (V) and an exposure time of 0.015 s.
The experimental setup for low illumination conditions consisted of a 3D integral imaging setup with 6 subjects located a distance of 4.5 m from the camera array. Experiments were conducted for each subject under different illumination conditions, resulting in different SNR levels; the illumination was altered by adjusting the intensity of the light source. Figure 3(a) depicts the elemental image [reference image] with an SNR of 10.41 dB (i.e. good illumination), and Fig. 4(a) shows the 3D reconstructed image at z = 4.5 m with an SNR of 12.39 dB. Prior to 3D reconstruction, the elemental images were registered and aligned due to the experimental conditions (e.g. an unbalanced camera array). Fifty bias frames were taken and averaged for each camera and subtracted from the elemental images. SNR was computed by taking ⟨so²⟩ as the signal power over the object (i.e. the person's face) and ⟨N²⟩ as the noise power over an area of the scene that is completely dark. The elemental images acquired using 3D InIm under low illumination are shown in Figs. 3(b)–3(f), in order of decreasing illumination, and are dominated by read noise. Measuring the number of photons in the scene under these conditions requires sophisticated instruments, which are not easily field portable. In Fig. 3(b), the SNR was −1.20 dB with approximately 40.53 photons/pixel on the object; the person captured is still visible. In Fig. 3(c), the SNR decreases to −9.13 dB with 16.26 photons/pixel. The average power of the object is lower than the noise power for the images shown in Figs. 3(d)–3(f); as a result, the SNR cannot be computed, as Eq. (4) would yield an imaginary number.
3D reconstructed images at z = 4.5 m are shown in Fig. 4, corresponding to the elemental images shown in Fig. 3. In Figs. 4(b)–4(f), the SNR increases to 8.93 dB, 0.96 dB, −5.28 dB, −9.67 dB, and −12.38 dB, respectively, with corresponding photon counts of 130.02, 51.95, 25.34, 15.27, and 11.18 photons/pixel. Figure 5(a) depicts the SNR (see Eq. (4)) of the elemental images and the corresponding 3D reconstructed images at z = 4.5 m as a function of illumination. Illumination levels 1 to 17 correspond to the scene light levels used in the experiments, with 1 corresponding to the highest illumination level. The SNR of the 3D reconstructed images is higher than that of the 2D EIs. We note that the SNR could not be computed for EIs with SNRs below −21.35 dB, as the noise became greater than the signal. Figure 5(b) depicts SNR (in dB) as a function of the number of photons/pixel. Overall, the 3D reconstructed images have a higher number of photons/pixel than their corresponding 2D EIs.
Additional experiments were carried out to evaluate the advantages of 3D InIm in low-light conditions compared with (a) increasing the exposure time of a single camera and (b) recording multiple 2D elemental images from a single camera perspective and averaging them. To evaluate image quality, we define the following metric:
A 3D InIm experiment was conducted using the experimental parameters described above. Figure 6(a) depicts the reference image, while Fig. 6(b) depicts the 3D reconstructed image at z = 4.5 m, with corresponding SNRcontrast values of 31.5 dB and 33.76 dB, respectively. A low-light experiment was then conducted. First, the scene was captured in a single image using three different exposure times. Figures 7(a)–7(c) depict the captured images with exposure times of 0.010 s, 0.015 s, and 0.203 s under low-light conditions, respectively. The SNRcontrast of the images shown in Figs. 7(a) and 7(b) cannot be computed, as the object region intensity is less than that of the background; the SNRcontrast of Fig. 7(c) is 8.213 dB. Another set of experiments captured 72 images from a single perspective along with a 3D InIm experiment. As shown in Fig. 8(a), 72 images from a single perspective were taken with an exposure time of 0.015 s and averaged, while Fig. 8(b) shows the 3D reconstructed image at z = 4.5 m; the SNRcontrast values are 6.38 dB and 16.702 dB, respectively. The experiment was then repeated using an exposure time of 0.010 s. Figure 8(c) depicts the average of 72 images obtained from a single perspective, while Fig. 8(d) depicts the 3D reconstructed image at z = 4.5 m; the SNRcontrast values are 2.152 dB and 15.94 dB, respectively. Thus, by capturing both intensity and angular information, 3D InIm reconstruction improves image contrast and visualization compared with averaging multiple images captured from a single perspective or increasing the exposure time of a single capture. One reason is that 3D InIm segments the object of interest out of the background.
4. Object classification using convolutional neural networks
A convolutional neural network (CNN) [16,17] was then trained on low illumination data for face recognition. An advantage of deep learning over other machine learning algorithms (e.g. support vector machines or random forest classifiers) is that hand-crafted feature extraction is not needed; however, deep learning increases the computational complexity and requires a sufficiently large training set. The training images were 3D reconstructed images of faces after TV-denoising, obtained under different illumination conditions. The customized CNN employed in the experiments used larger filters in the convolutional layers, as these performed well on images obtained under low-light conditions. Training in photon-starved environments improves the classifier's ability to discern between subjects under low illumination. To illustrate the need for learning in the dark, normalized correlation was used to demonstrate the difficulty of discriminating faces under low illumination. Figure 9(a) shows a 3D reconstructed reference image at z = 4.5 m after TV-denoising, obtained using EIs with an SNR of 10.41 dB. This image was correlated with 3D reconstructed images after TV-denoising whose EIs were obtained at an SNR of −12.75 dB, shown in Fig. 9(b) (true class object) and Fig. 9(c) (false class object), with correlation values of 0.58 and 0.48, respectively. Note that 1 indicates the images are perfectly correlated, and 0 indicates no correlation. Thus, it is difficult to discriminate objects under low-light conditions without training the classifier on how objects appear in low light.
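The normalized correlation used above can be sketched as a zero-mean, unit-variance correlation coefficient (an assumption; the paper does not give its exact formula), which is 1 for identical images and near 0 for unrelated ones. The synthetic "face" images below are random stand-ins for illustration only.

```python
import numpy as np

def normalized_correlation(a, b):
    """Correlation coefficient of two images: 1 = identical up to gain/offset."""
    a = (a - a.mean()) / (a.std() + 1e-12)
    b = (b - b.mean()) / (b.std() + 1e-12)
    return float((a * b).mean())

rng = np.random.default_rng(2)
face = rng.uniform(size=(64, 64))                           # stand-in reference image
noisy_same = face + rng.normal(0.0, 1.0, size=face.shape)   # true class, heavy noise
other = rng.uniform(size=(64, 64))                          # false class object
```

As in Fig. 9, heavy noise drags the true-class correlation down toward the false-class value, which is why correlation alone is a weak discriminator in the dark and a classifier trained on low-light data is needed.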
A CNN was trained to perform facial recognition using the 3D InIm reconstructed data. A data set was collected consisting of 6 subjects under 17 illumination conditions acquired using 3D InIm. The images were computationally reconstructed over distances of 4 m to 5 m with a step size of 50 mm, where the true object distance was 4.5 m. The data set was then split into training and testing: 4 randomly chosen illumination conditions, with SNRs of approximately −1.41 dB, −8.322 dB, −8.971 dB, and −12.75 dB, were withheld from training (test set), and the other 13 illumination conditions were used for training. Thus, there were 24 test scenes. The training images were grayscale images of size 256 x 256 pixels and were perturbed by adding Gaussian noise with mean 0 and standard deviations of 0.01, 0.05, and 0.1, rotating by −1, −0.5, 0.5, and 1 degrees, and translating by −5, −3, 3, and 5 pixels in both the x- and y-directions, generating a total of 29,232 images. The data was then denoised using total-variation regularization with an augmented Lagrange approach and a regularization parameter of 20000 [14]. Figure 10 depicts examples of denoised 3D reconstructed training images acquired at various SNRs, reconstruction depths, additive noise levels, and rotations. The CNN consisted of: a convolution layer [13 x 13, 20 filters], a rectified linear unit (ReLU) layer, 2 x 2 max pooling, a convolution layer [11 x 11, 20 filters], ReLU, 2 x 2 max pooling, a fully connected layer, and a softmax layer [6 outputs]. For training, stochastic gradient descent was used with a learning rate of 0.0001 and a maximum of 10 epochs, along with the cross-entropy loss to evaluate model performance. In total, the model took approximately 4 hours to train on a high-performance computer with a Tesla K40m GPU running CUDA 8.0, implemented in MATLAB.
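The layer dimensions implied by the architecture above can be tracked explicitly. Assuming 'valid' convolutions with stride 1 and non-overlapping 2 x 2 pooling (the paper does not state its padding scheme, so this is an assumption), a 256 x 256 input shrinks as follows:

```python
def conv_out(n, k):
    """'valid' convolution output size (stride 1 assumed)."""
    return n - k + 1

def pool_out(n, p):
    """Non-overlapping max-pooling output size."""
    return n // p

n = 256                              # training images are 256 x 256 grayscale
n = pool_out(conv_out(n, 13), 2)     # conv [13x13, 20 filters] + ReLU + 2x2 pool
n = pool_out(conv_out(n, 11), 2)     # conv [11x11, 20 filters] + ReLU + 2x2 pool
fc_inputs = n * n * 20               # features entering the fully connected layer
```

Under these assumptions, the feature maps go 256 → 244 → 122 → 112 → 56, so the fully connected layer sees 56 x 56 x 20 = 62,720 features feeding the 6-way softmax.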
For classification, only regions of the 3D reconstructed image containing information from all 72 elemental images were considered, to reduce the size of the input image. The image was then denoised using total-variation regularization with an augmented Lagrange approach and a regularization parameter of 20000 [14]. Afterwards, the Viola-Jones face detector [15] was used to find regions of interest, which were then input into the CNN classifier. This process was repeated over all z. If the same face was detected in the same region over multiple depths, the estimated object reconstruction depth corresponded to the face with the highest mean intensity value. The rationale is that this reconstruction depth contains the most object information (i.e. the strongest signal); more specifically, the object region can be modeled as signal plus additive noise, whereas incorrect depths can be considered as noise. This is not the only approach, and other approaches may be considered in future work. Note that faces were not detected by the Viola-Jones classifier for EIs with SNRs below approximately −21.36 dB. Table 1 summarizes the results: the proposed 3D system achieved 100% accuracy. Figure 11 depicts an overview of the classification scheme.
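The depth-selection rule described above reduces to a one-line maximization once the per-depth face crops are collected. The sketch below is my own minimal formulation of that rule; the data-structure name and shape are assumptions.

```python
import numpy as np

def estimate_depth(face_crops_by_depth):
    """Pick the reconstruction depth whose detected face region has the
    highest mean intensity (i.e. the strongest signal), per the rule above.

    face_crops_by_depth: dict mapping depth z (m) -> face-region pixel array,
    containing only depths where the detector fired on the same region.
    """
    return max(face_crops_by_depth, key=lambda z: face_crops_by_depth[z].mean())
```

For example, if the same face is detected at z = 4.0, 4.5, and 5.0 m, the depth whose crop is brightest on average is reported as the object depth, and that crop is the one passed to the CNN.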
In conclusion, we have presented a 3D InIm system trained in the dark to classify 3D objects captured under low illumination conditions. Regions of interest are obtained by denoising the 3D reconstructed image using total-variation regularization with an augmented Lagrange approach, followed by face detection. The regions of interest are then input into a pre-trained convolutional neural network (CNN). The CNN was trained using 3D InIm reconstructed images after TV-denoising, whose EIs were obtained under various low illumination conditions with different SNRs. The CNN recognized the 3D reconstructed faces after TV-denoising with 100% accuracy, whereas with a single 2D elemental image, regions of interest could not even be extracted under low illumination. Future work includes more dynamic scenes, utilizing different algorithms to improve image quality and classification in different scene conditions, and increasing the data set size to create a more robust classifier.
Night Vision and Electronic Sensors Directorate, Communications-Electronics Research, Development and Engineering Center, US Army (W909MY-12-D-0008).
1. N. Levin and Q. Zhang, “A global analysis of factors controlling VIIRS nighttime light levels from densely populated areas,” Remote Sens. Environ. 190, 366–382 (2017). [CrossRef]
2. B. Phillips, D. Gruber, G. Vasan, C. Roman, V. Pieribone, and J. Sparks, “Observations of in situ deep-sea marine bioluminescence with a high-speed, high-resolution sCMOS camera,” Deep Sea Res. Part I Oceanogr. Res. Pap. 111, 102–109 (2016). [CrossRef]
4. Z. Petrášek and K. Suhling, “Photon arrival timing with sub-camera exposure time resolution in wide-field time-resolved photon counting imaging,” Opt. Express 18(24), 24888–24901 (2010). [CrossRef] [PubMed]
5. G. Lippmann, “Épreuves réversibles donnant la sensation du relief,” J. Phys. Theory Appl. 7(1), 821–825 (1908). [CrossRef]
7. A. Llavador, E. Sánchez-Ortiga, G. Saavedra, B. Javidi, and M. Martínez-Corral, “Free-depths reconstruction with synthetic impulse response in integral imaging,” Opt. Express 23(23), 30127–30135 (2015). [CrossRef] [PubMed]
9. H. Hoshino, F. Okano, H. Isono, and I. Yuyama, “Analysis of resolution limitation of integral photography,” J. Opt. Soc. Am. A 15(8), 2059–2065 (1998). [CrossRef]
12. A. Stern, D. Aloni, and B. Javidi, “Experiments with three-dimensional integral imaging under low light levels,” IEEE Photon. J. 4(4), 1188–1195 (2012). [CrossRef]
13. A. Markman, X. Shen, and B. Javidi, “Three-dimensional object visualization and detection in low light illumination using integral imaging,” Opt. Lett. 42(16), 3068–3071 (2017). [CrossRef] [PubMed]
14. S. H. Chan, R. Khoshabeh, K. B. Gibson, P. E. Gill, and T. Q. Nguyen, “An augmented Lagrangian method for total variation video restoration,” IEEE Trans. Image Process. 20(11), 3097–3111 (2011). [CrossRef] [PubMed]
15. P. Viola, M. Jones, and D. Snow, “Detecting pedestrians using patterns of motion and appearance,” Int. J. Comput. Vis. 63(2), 153–161 (2005). [CrossRef]
16. A. Krizhevsky, I. Sutskever, and G. Hinton, “ImageNet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems (2012), pp. 1097–1105.
18. R. Gonzalez and R. Woods, Digital Image Processing (Pearson, 2008).
19. F. Sadjadi and A. Mahalanobis, “Automatic target recognition,” Proc. SPIE 10648, 106480I (2018).