Imaging under ultra-weak lighting conditions with a single-photon camera (SPC) has wide-ranging applications, from remote sensing to night vision, but it suffers seriously from the under-sampling inherent in SPC detection. Existing approaches address under-sampling by detecting the objects many times to generate high-resolution images and then applying noise reduction to suppress the Poisson noise inherent in low-flux operation. To address the under-sampling problem more effectively, we develop a new approach that reconstructs high-resolution, lower-noise images by seamlessly integrating low-light-level imaging with deep learning. In our approach, each object is detected only once by the SPC, and a deep network is trained to reduce noise and reconstruct high-resolution images from the detected noisy under-sampled images. To demonstrate that our proposal is feasible, we first verify it experimentally on one special category, human faces. The trained network is able to recover high-resolution, lower-noise face images from new noisy under-sampled face images, with a resolution up-scaling factor of 4×. Our experimental results demonstrate that the proposed method can generate high-quality images from only ~0.2 detected signal photons per pixel.
© 2018 Optical Society of America under the terms of the OSA Open Access Publishing Agreement
Ultra-weak-light-level imaging techniques have many applications, such as night vision, remote sensing, and biological science [1–3]. A typical image taken with a conventional camera captures $10^{12}$ photons [4, 5]; a high-quality digital camera with a multi-megapixel array typically records an image by collecting of order $10^{5}$ photons per pixel. Under ultra-low light and limited exposure time, the number of detected photons falls far short of these levels. Imaging under very weak light intensity generally uses area array detectors whose sensitivity reaches the single-photon level, such as CCD, ICCD, and EMCCD. However, their detection sensitivity is limited by the photosensitivity of each unit pixel. For example, a common CCD has only a weak response when the photon rate per pixel is greater than 100 counts/s; the background noise of the ICCD is large; and the EMCCD is sensitive to low illumination but cannot discriminate the actual number of photons when the number of photons per pixel is very small.
The single-photon camera (SPC) is based on a 2-D imaging array of smart pixels. Each pixel is equivalent to a fully independent point detector whose photon detection accuracy and sensitivity can reach the shot-noise limit [6, 8]. Such an SPC can be used in low-light-level, photon-limited imaging. Images have been obtained with an SPC from ~1 detected signal photon per pixel on average [8,9].
Notably, the SPC records the number of photons rather than a gray scale image, and the random nature of photon emission and detection is often the dominant source of noise in such imaging systems. The data collected by the imaging system obey a spatial Poisson distribution, which describes the probability of photon emissions at different locations in space. Such cases are referred to as photon-limited imaging applications, since the relatively small number of detected photons limits the signal-to-noise ratio (SNR). In addition, because single-photon detector arrays are difficult to package, the spacing between pixels is so large that the SPC can only obtain under-sampled images. For high-resolution objects whose pixel count exceeds the number of SPC pixels, the detected images are heavily under-sampled.
To obtain a high-resolution image, the traditional method using an SPC needs to scan many times to form a subhigh resolution image [8,9]. It also needs additional image denoising to suppress the Poisson noise inherent in low-flux operation. Simple and effective methods for obtaining high-resolution, high-quality images are urgently needed.
Recently, deep learning has achieved outstanding performance in image processing. Many researchers have combined deep learning with traditional physical problems to achieve good performance [10–13], for example by combining imaging through scattering media with machine learning, or object classification through scattering media with deep learning.
In this paper, we seamlessly combine weak-light imaging with deep learning to reconstruct lower-noise, subhigh resolution images. A generative adversarial net (GAN) [14,15] is used to learn representative features from the noisy, under-sampled images detected by the SPC, and each object needs to be detected only once. To demonstrate that our proposal is feasible, we first verify it experimentally on one special category, human faces. We used a dataset of 9000 high-resolution original face images and their detected noisy under-sampled images to train a deep network, which is then used to predict lower-noise, high-resolution images from detected noisy under-sampled face images. The remainder of the paper is organized as follows. In Section 2, we describe our GAN-based scheme for detection under weak light. Section 3 and Section 4 demonstrate the effectiveness of the proposed method.
Compared with previous imaging methods under weak-light-level conditions, our scheme needs to detect each object only once, and the reconstructed images have lower noise and high resolution. In particular, we have verified that our scheme can still generate high-quality images from ~0.2 detected signal photons per pixel, with a resolution upscaling factor of 4×.
2.1. Imaging under ultra weak light
The experimental setup for ultra-low-light imaging is illustrated in Fig. 1. A digital micromirror device (DMD: DLC9500P24, pixel count: 1920 × 1080, pixel pitch: 10.8 µm), illuminated by a laser (KATANA-10XP), displays the object images. The light from the DMD is divided into two beams by a beam splitter. One beam is captured by a CCD, and the other is captured by a single-photon camera (SPC; pixel pitch: 150 µm, pixel count: 32 × 64). To confirm that the light intensity is weak enough, the CCD and the SPC detect the objects with the same 2.4 ms exposure time. The images captured by the CCD and the SPC are shown in Fig. 1. The light intensity is so weak that the common detector (CCD) cannot record the objects, but the SPC still works. Because the spacing between SPC pixels is large, the SPC can only obtain under-sampled images. The light intensity can be adjusted to different ultra-weak levels, and at each level the SPC records the number of photons, so we can obtain multiple sets of data. In our experiment, we demonstrated our method under three ultra-weak-light conditions in which the CCD cannot record the objects.
In our experimental design, the objects, each with $N_x \times N_y$ pixels, where $N_x$ and $N_y$ are the numbers of pixels along the x- and y-dimensions and i is the index of the object, are displayed on a DMD. When the light is ultra weak, we capture images with the SPC, which has 32 × 64 smart pixels, so the captured under-sampled images are $g_i \in \mathbb{C}^{32 \times 64}$. When the object is square, avoiding distortion while retaining the most effective information means that the effective part of the image can occupy only about 32 × 32 smart pixels. As a result, the resolution decreases by a factor of $N_x/32$ along each dimension. In our experiment, $N_x = N_y = 128$. Therefore, to obtain subhigh resolution images at the weak-light level, we must increase the resolution of the detected noisy under-sampled images by a 4× upscaling factor. Note that the SPC records the number of photons, not a gray scale image. A gray scale image can be obtained by normalizing the number of photons. Our reconstruction does not begin with a gray scale image; it starts from a photon count image containing Poisson noise.
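To make the detection model above concrete, the following is a minimal simulation sketch, not the experimental procedure. It assumes that each SPC pixel integrates a 4 × 4 block of the 128 × 128 object (the real under-sampling comes from the pixel spacing of the array) and that counts are Poisson distributed around a chosen mean. The function names `simulate_spc_frame` and `normalize_counts` and the parameter `mean_photons` are hypothetical.

```python
import numpy as np


def simulate_spc_frame(obj, factor=4, mean_photons=0.2, rng=None):
    """Simulate a Poisson photon-count frame from a high-resolution object.

    obj          : 2-D non-negative array, e.g. 128 x 128 (must not be all zero)
    factor       : under-sampling factor along each dimension
    mean_photons : target mean number of detected photons per SPC pixel
    """
    rng = np.random.default_rng() if rng is None else rng
    obj = np.asarray(obj, dtype=float)
    h, w = obj.shape
    # Assumption: each SPC pixel sums one factor x factor block of the object.
    binned = obj.reshape(h // factor, factor, w // factor, factor).sum(axis=(1, 3))
    # Scale the expected counts so their mean equals mean_photons.
    rate = binned * (mean_photons / binned.mean())
    # Photon detection is modeled as an independent Poisson draw per pixel.
    return rng.poisson(rate)


def normalize_counts(counts):
    """Map photon counts to a [0, 1] gray scale image by dividing by the peak."""
    counts = np.asarray(counts, dtype=float)
    peak = counts.max()
    return counts / peak if peak > 0 else counts


frame = simulate_spc_frame(np.ones((128, 128)), rng=np.random.default_rng(0))
gray = normalize_counts(frame)
```

In this sketch the network input would be `frame` (integer photon counts); `gray` is only the normalized view used for display.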
2.2. Improving image quality and resolution using a deep neural network
To improve the image quality, we use a GAN to reconstruct high-quality images from the noisy under-sampled images. The GAN consists of two paths: a Generator Network and a Discriminator Network. These networks include convolution layers, sub-pixel convolution layers, fully connected layers and activation layers, denoted as “Conv”, “pixelshuffler”, “Dense” and “Activation”, respectively. Figure 2 shows the proposed network structure.
The convolution layers generate feature maps for the input images by using convolution operations, which are frequently used in image processing. For example, in our Generator Network, the first convolution layer is denoted as “32 × 32 × 64”, which means it outputs 64 feature maps of 32 × 32 pixels each. The first convolution layer has a convolution filter with a kernel size of 3 × 3, and the kernel weights are learned from the training dataset. Generally, convolution reduces the number of pixels in the output (feature map). To avoid reducing the number of pixels, we set the convolution strides to 1 in the Generator Network. The sub-pixel convolution layer [16] increases the resolution of the feature maps by rearranging channels into spatial positions. Here $I^{LR}$ denotes the input low-resolution images and $I^{SR}$ the output high-resolution images; the weights w and bias b are learned from the training dataset. For example, in our Generator Network, we use two sub-pixel convolution layers. The input size of the first sub-pixel convolution layer is 32 × 32 × 256 and we set r (the upscale ratio) to 2, so the output size is 64 × 64 × 64. After two sub-pixel convolution layers, the feature map has an output size of 128 × 128 pixels.
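The shape arithmetic of the sub-pixel (pixel-shuffle) rearrangement can be checked with a short NumPy sketch. It assumes a height × width × channel layout with the r × r sub-pixel offsets stored ahead of the output channel index; deep-learning frameworks differ in this channel ordering, so the function below (`pixel_shuffle`, a hypothetical name) illustrates the operation rather than any specific library's implementation.

```python
import numpy as np


def pixel_shuffle(x, r):
    """Rearrange an (H, W, C*r*r) feature map into (H*r, W*r, C)."""
    h, w, c = x.shape
    if c % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    c_out = c // (r * r)
    # Split the channel axis into (row offset, column offset, output channel).
    x = x.reshape(h, w, r, r, c_out)
    # Interleave the offsets with the spatial axes: (h, r, w, r, c_out).
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(h * r, w * r, c_out)


features = np.zeros((32, 32, 256))
upscaled = pixel_shuffle(features, r=2)  # shape (64, 64, 64)
```

With r = 2, every group of four channels at one low-resolution position fills a 2 × 2 block of the output, which is why 256 input channels become 64 output channels while each spatial dimension doubles.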
The core operation of the fully connected layer is the matrix-vector product. It flattens the output of the previous layer into a one-dimensional vector and then multiplies it by the parameter matrix. The size of the parameter matrix is determined by the sizes of the input and output. Essentially, it is a linear transformation from one feature space to another. For example, in our Discriminator Network, the first fully connected layer is denoted as “Dense:1024”, which means the output is a one-dimensional vector with 1024 entries.
The activation layer is used after the fully connected layer. An activation function such as ReLU or sigmoid adds non-linearity, which allows the network to solve problems that a linear model cannot. For example, in our Discriminator Network, the last layer is an activation layer whose output is a probability value; it produces a non-linear decision boundary and classifies the input as the original image or the reconstructed image via non-linear combinations of the weighted inputs.
All the parameters of our network are learned from the training dataset. During training, the input of the Generator Network is the detected noisy low-resolution image, and the Generator Network reconstructs the subhigh resolution image. The reconstructed image and the original high resolution image are then sent to the Discriminator Network, which classifies its input as the original image or the reconstructed image.
We want the Generator Network to reconstruct subhigh resolution images that are as realistic as possible, meaning that the Discriminator Network cannot distinguish the reconstructed subhigh resolution images from the original high resolution images. The Discriminator Network, in contrast, is a binary classification network that aims to identify accurately whether its input is a reconstructed subhigh resolution image or an original high resolution image. Both parties try their best to optimize their networks, which forms a competitive confrontation.
During training, one network is held fixed while the weights of the other are updated, and the two are updated in alternating iterations until they reach a dynamic balance. At that balance, the generator G recovers the distribution of the training data (its samples match the original high resolution images), and the discriminator can no longer tell the two apart: its accuracy is 50%, which is approximately equal to random guessing.
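The fully connected and activation layers described above reduce to a few lines of NumPy. The sketch below assumes a ReLU after a 1024-unit dense layer and a sigmoid after a final single-unit dense layer; the input feature-map size, the random weights and the helper names (`dense`, `relu`, `sigmoid`) are illustrative assumptions, not the trained Discriminator Network.

```python
import numpy as np


def dense(x, weights, bias):
    """Fully connected layer: flatten the input, then matrix-vector product."""
    return weights @ np.ravel(x) + bias


def relu(z):
    """Rectified linear unit: keeps positive values, zeroes the rest."""
    return np.maximum(z, 0.0)


def sigmoid(z):
    """Map a real-valued score to a value in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


rng = np.random.default_rng(0)
feature_map = rng.standard_normal((4, 4, 8))        # hypothetical input, 128 values
w1 = 0.01 * rng.standard_normal((1024, feature_map.size))
w2 = 0.01 * rng.standard_normal((1, 1024))

hidden = relu(dense(feature_map, w1, np.zeros(1024)))   # "Dense:1024" + activation
prob_original = sigmoid(dense(hidden, w2, np.zeros(1)))[0]
```

The size of `w1` follows directly from the flattened input length (128) and the requested output length (1024), as stated in the text.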
To optimize the kernel weights and other network parameters, the Generator Network is trained by minimizing the g-GAN loss and the mean squared error (MSE). The g-GAN loss is incurred when the Discriminator Network correctly identifies the reconstructed subhigh resolution images. The MSE is the mean of the squared pixel-wise differences between the reconstructed subhigh resolution image and the original high resolution image.
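A minimal sketch of such a generator objective is given below. The MSE term is the standard pixel-wise mean squared error. The adversarial term is assumed here to take the common form −log D(G(x)), as in the super-resolution GAN of [15], and the weighting coefficient `gan_weight` and its default value are hypothetical; setting it to zero leaves a pure MSE objective.

```python
import numpy as np


def mse(reconstructed, original):
    """Mean squared pixel-wise error between two images of equal shape."""
    diff = np.asarray(reconstructed, dtype=float) - np.asarray(original, dtype=float)
    return float(np.mean(diff ** 2))


def generator_loss(reconstructed, original, d_fake, gan_weight=1e-3, eps=1e-12):
    """MSE plus a weighted adversarial term (assumed form: -log D(G(x))).

    d_fake : discriminator outputs in (0, 1] for the reconstructed images
    """
    adversarial = -float(np.mean(np.log(np.asarray(d_fake, dtype=float) + eps)))
    return mse(reconstructed, original) + gan_weight * adversarial


original = np.ones((128, 128))
reconstructed = np.zeros((128, 128))
loss_fooled = generator_loss(reconstructed, original, d_fake=[0.9])
loss_caught = generator_loss(reconstructed, original, d_fake=[0.1])
```

The adversarial term grows when the discriminator assigns a low "original" probability to the reconstruction, so `loss_caught` exceeds `loss_fooled` for the same pixel error.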
At the same time, the Discriminator Network is trained by minimizing the cross-entropy (CEH) [18] between the reconstructed subhigh resolution image and the original high resolution image. The cross-entropy is widely used in classification networks.
After training, all the parameters of our network are fixed. The Generator Network can then reconstruct subhigh resolution images from new noisy under-sampled images that were not included in the training set.
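For a binary real/reconstructed classifier, the cross-entropy takes the standard form sketched below. The labels (1 for original images, 0 for reconstructed images) and the function name `discriminator_loss` are assumptions for illustration; the exact normalization used in training may differ.

```python
import numpy as np


def discriminator_loss(d_real, d_fake, eps=1e-12):
    """Binary cross-entropy with label 1 for original and 0 for reconstructed.

    d_real : discriminator outputs for original high resolution images
    d_fake : discriminator outputs for reconstructed images
    """
    d_real = np.clip(np.asarray(d_real, dtype=float), eps, 1.0 - eps)
    d_fake = np.clip(np.asarray(d_fake, dtype=float), eps, 1.0 - eps)
    return float(-np.mean(np.log(d_real)) - np.mean(np.log(1.0 - d_fake)))


loss_at_balance = discriminator_loss([0.5, 0.5], [0.5, 0.5])
loss_confident = discriminator_loss([0.99], [0.01])
```

When the discriminator outputs 0.5 for every input, which corresponds to random guessing, this loss equals 2 ln 2.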
3. Experimental results
To demonstrate that our proposal is feasible, we first verified it experimentally on one special category, human faces. To train the network, we needed a large dataset comprising pairs of original high resolution images and noisy under-sampled images detected under weak-light conditions. A face database called celeba [19] was used as the source of object images for training and testing. The central 128 × 128 pixel region of each celeba image was cropped and displayed on the central 128 × 128 pixel region of the DMD. We used the SPC to detect the objects under weak-light conditions. To demonstrate our scheme, we prepared three sets of data detected under three different light intensities. The light intensity is largest in case 1 and smallest in case 3. Each object was detected only once at each weak-light level. The object images of the training and test datasets were randomly chosen from the celeba database without overlap. The training dataset contained 9000 images and the test dataset contained 1000 images. After training, the network could predict subhigh resolution images from new noisy under-sampled images (the test dataset) that were detected under the same low-light condition and were not included in the training set.
Six examples of the training samples are shown in Fig. 3. The original high resolution objects are shown in Fig. 3(a). The noisy under-sampled images detected under the three weak-light-level conditions are shown in Figs. 3(b), 3(c), and 3(d), respectively. The noisy under-sampled image resolution is 32 × 32 and the original object resolution is 128 × 128.
All detected images contain some bright spots. They are caused by fabrication imperfections of the SPC array, which give some pixels high dark-count rates, so that their detections are uninformative in our imaging experiments. Other low-light-level imaging methods need to process these hot pixels separately, but our deep-learning method handles their effects automatically during the learning process. No special processing is therefore needed, and the GAN can learn directly from the detected images.
Figure 4 shows the experimental reconstruction results for eight test samples. For each test sample, the first image is the original object image, followed by three noisy under-sampled images detected at the three light intensities; the last three images are the reconstructed images corresponding to the three noisy under-sampled images, respectively. The detected noisy under-sampled image resolution is 32 × 32 and the reconstructed image resolution is 128 × 128. That means the resolution upscaling factor is 4×.
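The two preparation steps described above, central cropping and a random split without overlap, can be sketched as follows. The helper names `center_crop` and `split_indices`, the fixed seed, and the example image size are assumptions for illustration.

```python
import numpy as np


def center_crop(image, size=128):
    """Return the central size x size region of an image (H x W or H x W x C)."""
    h, w = image.shape[:2]
    if h < size or w < size:
        raise ValueError("image is smaller than the requested crop")
    top, left = (h - size) // 2, (w - size) // 2
    return image[top:top + size, left:left + size]


def split_indices(n_total, n_train=9000, n_test=1000, seed=0):
    """Draw disjoint random training and test index sets from n_total images."""
    if n_train + n_test > n_total:
        raise ValueError("not enough images for the requested split")
    perm = np.random.default_rng(seed).permutation(n_total)
    return perm[:n_train], perm[n_train:n_train + n_test]


cropped = center_crop(np.zeros((218, 178, 3)))      # hypothetical source image size
train_idx, test_idx = split_indices(n_total=20000)  # hypothetical pool size
```

Taking both index sets from a single permutation guarantees that no image appears in both the training and the test dataset.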
4. Discussion and future work
In the traditional super-resolution simulation field [15,16], the performance of a method is measured by the PSNR (dB) of its reconstructed images. To evaluate and compare our reconstruction results, we calculated the PSNR for the three sets of test samples, as shown in Table 1. The PSNR of the eight test samples in Fig. 4 is shown in Table 2. The PSNR of test sample (a) is the smallest, and its reconstruction is not good enough. The light intensity is largest in case 1 and smallest in case 3. The PSNR is computed from the MSE between the reconstructed image and the original image and is expressed in dB.
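The standard PSNR computation is sketched below. The peak value depends on how the images are scaled; 255 (8-bit gray levels) is assumed here as the default and is exposed as the parameter `peak`.

```python
import numpy as np


def psnr(reconstructed, original, peak=255.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(peak^2 / MSE)."""
    diff = np.asarray(reconstructed, dtype=float) - np.asarray(original, dtype=float)
    err = np.mean(diff ** 2)
    if err == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(peak ** 2 / err))


original = np.full((128, 128), 100.0)
off_by_one = original + 1.0
value_db = psnr(off_by_one, original)
```

A higher PSNR indicates a reconstruction closer to the original; an error of one gray level at every pixel of an 8-bit image gives about 48.13 dB.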
To measure the weak-light level in our experiment, we calculated the average number of photons per pixel, that is, the total number of detected photons in an image divided by its number of pixels; see Table 3. Averaged over the 1000 test samples, it is 1.88, 1.02 and 0.18 in case 1, case 2 and case 3, respectively. The average number of photons clearly decreases as the light intensity decreases. In the weakest-light condition, our scheme can still generate high-quality images at the level of ~0.2 detected signal photons per pixel.
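This average can be computed directly from the photon-count frames. The sketch below assumes the frames are stacked in one array of shape (number of images, height, width); the function name `mean_photons_per_pixel` and the tiny example data are illustrative only.

```python
import numpy as np


def mean_photons_per_pixel(frames):
    """Total detected photons divided by the total number of pixels."""
    frames = np.asarray(frames, dtype=float)
    return float(frames.sum() / frames.size)


# Two hypothetical 2 x 2 photon-count frames with one detected photon each.
frames = np.array([[[0, 1], [0, 0]],
                   [[1, 0], [0, 0]]])
average = mean_photons_per_pixel(frames)  # 2 photons over 8 pixels
```

Applied to a full test set, the same function yields a single figure per light-intensity case, such as the per-case averages quoted above.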
In the field of image processing, GAN and CNN are two networks that are often considered. To find a more suitable model for the problem, we also tried a CNN. In fact, the generator network of the GAN used here is a CNN. By simply setting the coefficient of the g-GAN loss to zero, we obtained the performance of the CNN. Figure 5 shows the performance of the CNN for eight test samples.
For comparison, we also calculated the PSNR for the images reconstructed by the CNN, as shown in Table 4 and Table 5. The GAN performs better than the CNN. This may be because the GAN can focus on texture details through the adversarial loss. A GAN can produce more realistic samples than other models, although its interpretability is poor.
We also performed a specificity analysis, as shown in Fig. 6. We randomly chose fifty non-face images with 128 × 128 pixels from the Caltech computer vision database [20]. The noisy, under-sampled images of these objects were then detected under the case 1 weak-light condition to form a test dataset. The generator network trained on the face images of the case 1 training dataset was applied to reconstruct the non-face test images. The results for ten non-face test samples are shown in Fig. 6. Figure 6(a) shows the original high resolution non-face objects, Fig. 6(b) shows the detected noisy, under-sampled images of (a), and Fig. 6(c) shows the reconstructed subhigh resolution images. As the results indicate, the trained model cannot reconstruct high-quality images for non-face images.
In this paper, low-light-level imaging is seamlessly combined with deep learning to reconstruct lower-noise, high-resolution images: a GAN is trained to reconstruct lower-noise, subhigh resolution images from the noisy under-sampled images detected by the SPC under weak-light-level conditions. Instead of scanning images many times, our proposed method needs to detect each object only once. Notably, our reconstruction does not begin with a gray scale image; it starts from a photon count image containing substantial Poisson noise, which is inherent in low-flux operation. In the experiment, subhigh resolution face images were successfully reconstructed from the noisy under-sampled images detected by the SPC. The under-sampled image resolution is 32 × 32 and the reconstructed image resolution is 128 × 128, so our approach achieves a 4× upscaling factor. Our experimental results under three different weak-light-level conditions demonstrate the effectiveness of the proposed method; in particular, it can still generate high-quality images from ~0.2 detected signal photons per pixel.
National Natural Science Foundation of China (NSFC) (Grants No: 61471239, 61631014); Hi-Tech Research and Development Program of China (2013AA122901).
The authors declare that there are no conflicts of interest related to this article.
1. J. Salmon, Z. Harmany, C. A. Deledalle, and R. Willett, “Poisson Noise Reduction with NonLocal PCA,” J. Math. Imaging Vis. 48, 279–294 (2014).
4. W. Ruyten, “CCD arrays, cameras, and displays, by Gerald C. Holst,” Opt. Photonics News 8, 54 (1997).
6. Y. Wen-kai, Y. Xu-ri, L. Xue-feng, Z. Guang-jie, and Z. Qing, “Compressed sensing for ultra-weak light counting imaging,” Opt. Precis. Eng. 20, 2283–2292 (2012).
7. D. Bronzi, F. Villa, S. Tisa, A. Tosi, F. Zappa, and D. Durini, “100 000 Frames/s 64 × 32 Single-Photon Detector Array for 2-D Imaging and 3-D Ranging,” IEEE J. Sel. Top. Quantum Electron. 20, 354–363 (2014).
8. D. Shin, F. Xu, D. Venkatraman, R. Lussana, F. Villa, F. Zappa, V. K. Goyal, F. N. C. Wong, and J. H. Shapiro, “Photon-efficient imaging with a single-photon camera,” Nat. Commun. 7, 12046 (2016).
9. A. Kirmani, D. Venkatraman, D. Shin, A. Colao, F. N. Wong, J. H. Shapiro, and V. K. Goyal, “First-photon imaging,” Science 343, 58 (2014).
10. T. Ando, R. Horisaki, and J. Tanida, “Speckle-learning-based object recognition through scattering media,” Opt. Express 23, 33902 (2015).
12. T. Shimobaba, Y. Endo, T. Nishitsuji, T. Takahashi, Y. Nagahama, S. Hasegawa, M. Sano, R. Hirayama, T. Kakue, and A. Shiraki, “Computational ghost imaging using deep learning,” Opt. Commun. 413, 147–151 (2017).
13. B. Heshmat, G. Satat, M. Tancik, O. Gupta, and R. Raskar, “Object classification through scattering media with deep learning on time resolved measurement,” Opt. Express 25, 17466–17479 (2017).
14. I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Proceedings of International Conference on Neural Information Processing Systems (2014), pp. 2672–2680.
15. C. Ledig, L. Theis, F. Huszar, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, and Z. Wang, “Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (IEEE, 2016), pp. 105–114.
16. W. Shi, J. Caballero, F. Huszar, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang, “Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (IEEE, 2016), pp. 1874–1883.
17. J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (IEEE, 2015), pp. 3431–3440.
18. J. Shore and R. Johnson, “Axiomatic derivation of the principle of maximum entropy and the principle of minimum cross-entropy,” IEEE Trans. Inf. Theory 26, 26–37 (1980). [CrossRef]
19. Z. Liu, P. Luo, X. Wang, and X. Tang, “Deep learning face attributes in the wild,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (IEEE, 2015), pp. 3730–3738.
20. F. F. Li, R. Fergus, and P. Perona, “Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (IEEE, 2004), p. 178.