
CMOS computational camera with a two-tap coded exposure image sensor for single-shot spatial-temporal compressive sensing


Abstract

We present a CMOS computational camera with an on-chip compressive sensing technique. Through per-pixel programmable charge modulation, the camera exposure is spatial-temporally encoded by a CMOS image sensor without any additional optical modulators. Each sensor pixel incorporates a two-tap charge modulator and exposure code memory cells, and a proof-of-concept image sensor (128×128 pixels) is capable of per-frame spatial-temporal coded exposure in either full resolution or a designated region of interest. After reconstruction, high-speed videos at various temporal resolutions are recovered while the prototype camera operates at 10 fps. Compared to previous works, this camera design provides a power-efficient solution for compressive-sensing-related applications.

© 2019 Optical Society of America under the terms of the OSA Open Access Publishing Agreement

1. Introduction

Compressive sensing (CS) is one of the most widely applied theories in computational imaging paradigms [1]. By exploiting the intrinsic redundancy of scene information, CS-inspired computational cameras overcome the trade-off between spatial resolution and temporal resolution [2]. Unlike conventional cameras, which are based on non-intermittent exposure, computational cameras implement CS through exposure programming (coded exposure) followed by sparse reconstruction [3,4]. To date, several computational camera systems have been developed that demonstrate CS using pixel-wise coded exposure for applications such as high-speed imaging [5], high-dynamic-range (HDR) imaging [6], and depth sensing [7,8]. Because on-chip per-pixel exposure switching is not available in off-the-shelf image sensors, pixel-wise exposure encoding is usually realized by programming spatial light modulators (SLMs) placed in the camera light path [9]. As SLMs require driving power and additional relay lenses, they leave computational cameras with low light throughput, bulky size, and high power consumption.

To avoid employing SLMs as the optical modulation apparatus, several on-chip coded exposure camera systems have been proposed recently to bring CS to applications where size and power are constrained (e.g., portable electronics) [10,11]. Relying on advanced complementary metal-oxide-semiconductor (CMOS) processes, coded exposure can be implemented directly by the image sensor. In [12], Wan et al. proposed a CS camera with a multi-bucket CMOS imager. With each pixel consisting of a storage-gate-based charge modulator, the image sensor can apply temporal coded exposure. A similar multi-bucket CMOS imager reported by Mochizuki et al. extends the temporal coded exposure technique with 15 coded apertures to implement high-speed CS applications [13]. However, as those aperture cells are off-pixel designs, pixel-wise coded exposure is not achieved. Zhang et al. presented a CS video camera with on-sensor pixel-wise coded exposure [14]. The prototype CMOS image sensor includes in-pixel exposure code memory to accomplish spatial coded exposure. However, lacking a pixel charge modulator, the imager requires frequent reset and readout operations, making temporal coded exposure within a single frame impossible.

In this paper, we present a spatial-temporal CS camera system with on-chip coded exposure. Building on our prior research on exposure-programmable pixels [15–17], the CMOS image sensor in this work is designed to implement pixel-wise coded exposure in both the spatial and temporal domains. Each pixel consists of a two-tap charge modulator and two dynamic random-access memory (DRAM) cells to carry out both charge modulation and exposure code storage. During the camera exposure period, pre-defined exposure code masks are loaded in sequence to the CMOS image sensor for pixel-wise coded exposure. After all exposure code masks are applied, at the end of a frame, pixels are scanned to output a coded image for CS-based reconstruction. In contrast to other related works, all exposure code masks are applied solely within the sensor's exposure period, so noisy pixel reset and readout operations are not required during exposure. Thus, pixel-wise coded exposure integrates seamlessly into the reset-exposure-readout operation flow of existing CMOS image sensors. Overall, CS applications are naturally extended to sensor nodes within a single frame period, which results in a significant reduction in power consumption.

The rest of the paper is organized as follows: Section 2 describes CS using pixel-wise spatial-temporal coded exposure. Section 3 presents the pixel architecture and CMOS implementation. Test results from the prototype camera are summarized in Section 4 with discussions. Section 5 concludes the paper.

2. Spatial-temporal compressive sensing (CS)

2.1. Per-frame spatial-temporal coded exposure

Conventionally, cameras are exposed in a non-intermittent fashion: during the camera's exposure period, the pixel array is continuously exposed to light until the start of sensor readout. In coded exposure, by contrast, the scene light is encoded while the image sensor is exposed. Typically, such optical modulation is conducted in either the temporal or the spatial domain. For temporal coded exposure, the exposure period is chopped into sub-periods; in each sub-period, all pixels on the image sensor are exposed according to a designated exposure code. Spatial coded exposure modulates the scene light in space: during the exposure period, a pixel-wise exposure code mask is applied so that every pixel is exposed to uniquely encoded light. Spatial-temporal coded exposure, as the name suggests, combines the optical encoding of both domains. Figure 1 is a conceptual diagram depicting the process of spatial-temporal coded exposure and image reconstruction in a frame period (T). The exposure period (Texpo) consists of N sub-periods of uniform duration. In each sub-period, a pixel-wise exposure code mask is applied to guide pixel exposure according to the mask pattern. After Texpo, the pixel array is read out during a readout period (Tread) and the sensor then enters the next frame. Overall, the camera operates at a frame rate of 1/T frames per second (fps), with N(1/T) exposure code masks applied per second. As the image sensor outputs one image per frame, the scene information encoded in every coded image I(y) is described as:

$${\textbf I}(y) = \int\limits_N {{\textbf F}({y,n} )} {\textbf M}({y,n} )dn$$
where F(y,n) denotes the space-time volume under non-intermittent exposure in the nth sub-period, with 2D spatial coordinates y = {y1, y2}, and M(y,n) represents the exposure code mask applied in the nth sub-period. At a frame rate of 1/T fps, a total of t/T coded images I(y) are captured by the camera in a video of time length t.
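For concreteness, the discrete form of Eq. (1) can be sketched in a few lines of Python. This is a minimal numerical model, not the sensor's actual signal chain; the array names and sizes are illustrative.

import numpy as np

def coded_image(F, M):
    """Discrete Eq. (1): integrate the masked space-time volume over N sub-periods."""
    assert F.shape == M.shape          # both (N, H, W)
    return (F * M).sum(axis=0)         # I(y) = sum_n F(y, n) * M(y, n)

# Example: N = 64 sub-periods on a 128 x 128 array with pseudo-random
# binary masks M(y, n) in {0, 1}; F stands in for the true scene volume.
rng = np.random.default_rng(0)
F = rng.random((64, 128, 128))
M = rng.integers(0, 2, size=F.shape)
I = coded_image(F, M)                  # one coded frame read out per period T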

Fig. 1. Conceptual diagram of camera compressive sensing using spatial-temporal coded exposure.

2.2. Decompression and reconstruction

The goal of reconstruction is to recover the unknown space-time volume F from the coded image I. Since the amount of data in I is significantly less than that in F, the decompression is an underdetermined linear system that is challenging to solve. Prior work in this field exploits a sparse representation α to faithfully estimate F with a sparse linear combination model [18,19]. Given a dictionary D, the estimated space-time volume model F’(y,n) at the nth sub-period is defined as [9]:

$${\textbf F}^{\prime}(y,n) = {\textbf D}\alpha = {\alpha _1}{{\textbf D}_1} + {\alpha _2}{{\textbf D}_2} + {\alpha _3}{{\textbf D}_3} + \ldots + {\alpha _k}{{\textbf D}_k}$$
where α = [α1, …, αk] is the vector of sparse coefficients associated with the dictionary elements D1, …, Dk. Substituting Eq. (2) into Eq. (1), we obtain a corresponding coded image I’(y) based on F’(y). The target reconstructed space-time volume $\overline {\textbf F}$ for each I(y) is obtained by solving the following optimization problem:
$${\overline {\textbf F}} (y) = {\mathop{\arg \min}\nolimits_{{\textbf F^{\prime}}(y)}}({\textrm{E}_d}({\textbf F}^{\prime}(y)) + \beta {{\textbf E}_r}({\textbf F}^{\prime}(y)))$$
where β is a weighting factor. The data term Ed(F’(y)) and the regularization term Er(F’(y)) are defined as:
$${E_d}({\textbf F}^{\prime}(y)) = ||{{\textbf I}^{\prime}(y) - {\textbf I}(y)} ||_2^2$$
$${E_r}({\textbf F}^{\prime}(y)) = \int\limits_N {\textbf D} {||\alpha ||_1}dn$$
The size of $\overline {\textbf F}$(y) depends on that of D. For a dictionary consisting of k images per sub-period, the size of $\overline {\textbf F}$(y) is kN images; thus, the frame rate of the recovered video is kN(1/T) frames per second. It is also worth noting that the quality of the reconstructed $\overline {\textbf F}$(y) is mainly determined by D and α. In previous works, the discrete wavelet transform (DWT) and the discrete cosine transform (DCT) were employed as transform bases for the sparse coefficients [20,21]. Dictionaries trained from a diversity of videos, or generated from i.i.d. uniformly distributed entries, have also been reported [22]. The patch size of the dictionary is usually constrained to a certain range to limit computation time while still preserving detailed scene features in the reconstruction result.
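As an illustration of the per-patch recovery behind Eqs. (3)–(5), the sketch below estimates α with orthogonal matching pursuit (OMP), a common greedy surrogate for the l1-regularized objective; the paper does not prescribe a specific solver, and the flattened sensing operator A = ΦD (with Φ encoding the mask-weighted temporal integration of Eq. (1)) is an assumption of this sketch.

import numpy as np

def omp(A, y, sparsity):
    """Greedy OMP: repeatedly pick the atom most correlated with the residual."""
    residual, support = y.copy(), []
    alpha = np.zeros(A.shape[1])
    for _ in range(sparsity):
        support.append(int(np.argmax(np.abs(A.T @ residual))))
        coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ coef
    alpha[support] = coef
    return alpha

# Given a learned dictionary D and coded measurement I_patch, the recovered
# space-time patch is F_patch = D @ omp(Phi @ D, I_patch, sparsity=8).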

3. On-chip implementation of single-shot CS

As mentioned previously, camera compressive sensing is realized by on-chip coded exposure to gain benefits in optical efficiency, package size, and power consumption. This section introduces the camera design from the low-level pixel circuitry to the top-level hardware implementation.

3.1. Exposure-programmable pixel

The on-chip implementation of coded exposure starts with the pixel design. In previous research, a pixel could only accept one bit of exposure code during the entire exposure period [14]; thus, only one pixel-wise exposure code mask (N = 1) was applied per frame. As the exposure period is split into N sub-periods in single-shot CS, a pixel is required to accept N bits of exposure code. In [15–17], we proposed several pixel designs to achieve N-bit spatial-temporal coded exposure. Evaluation of those designs revealed shortcomings such as poor noise immunity [15] and layout bottlenecks [16,17]. In this work, we propose an optimized pixel design that overcomes those limitations.

Shown in Fig. 2(a) is a block diagram of the proposed pixel structure. The photodiode (PD) is connected to a charge modulator, which is controlled by the exposure code (φcode) stored in an exposure code memory unit. In each exposure sub-period (Tchop), the PD stays under exposure. Generated charges, instead of being trapped within the PD, are transferred out to the charge modulator. This charge “pull-out” mechanism enables exposure encoding even though the pixel is continuously exposed to light. For binary exposure codes (M(y,n)∈{0,1}), charges are preserved when the code is “1” and discarded when the code is “0”. As the exposure code changes across sub-periods, charges are selectively stored in the charge modulator. This is equivalent to exposure encoding using SLMs placed between the optical lenses and the image sensor. Once the exposure period has elapsed, at the end of the frame, the readout circuitry is enabled to read the charge modulator and generate an output signal (φout).

Figure 2(b) shows the timing diagram of voltage variations during pixel coded exposure in a frame period T. In the reset period Trst, both the photo-detector and the charge modulator are reset by φrst to an initial voltage level (Vrst). During the exposure period Texpo, the exposure code signal φcode toggles in each Tchop to update the code memory. Based on the updated exposure code, the charge modulator performs charge selection, resulting in either the holding or the descending of its voltage level (Vc). The final voltage level of Vc is read out as the pixel output in the readout period Tread before returning to Vrst in the next frame period.

The detailed pixel circuitry is shown in Fig. 3(a). Based on our previous studies of the voltage-mode active-pixel sensor (APS) pixel and lateral electric-field modulation [15], the charge modulator is a two-tap structure consisting of two charge-integration capacitors (C1 and C2) and two valve switches (M1 and M2). In coded exposure, charges generated by a pinned PD are integrated solely on either C1 or C2 by switching on/off M1 or M2, respectively. The switching of M1 and M2 is controlled by the binary signals φcode and $\overline {\varphi}_{\textrm{code}}$, where $\overline {\varphi}_{\textrm{code}}$ is complementary to φcode. Two DRAM cells are inserted in the paths between φcode/$\overline {\varphi}_{\textrm{code}}$ and M1/M2. When the DRAM control signal φctrl is low, M1 and M2 maintain their on/off status according to the φcode and $\overline {\varphi}_{\textrm{code}}$ values stored in DRAM1 and DRAM2. In this arrangement, generated charges are integrated on C1 when φcode is set to “1” and flow to C2 when φcode is “0”.

During the readout period Tread, charges accumulated on C1 are read out by an active-pixel sensor (APS) module. After φtran is triggered, charges on C1 are transferred to the floating diffusion node (FD), where they are converted to the output voltage φout by a source follower once φsel is pulled up. Charges on C2, on the contrary, remain untouched during the readout period until the initial reset of the next frame. By pulling up φrst, charges on C1, C2, and FD are drained and the pixel is reset for the incoming exposure period.
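A behavioral model of the pixel's coded-exposure operation described above is sketched below; it assumes ideal charge transfer and ignores noise, and the variable names are illustrative rather than taken from the chip.

def expose_pixel(charge_per_chop, codes):
    """Two-tap coded exposure over one frame: charge generated in each Tchop
    is steered to C1 when the stored code is 1 and to C2 when it is 0."""
    c1 = c2 = 0.0                      # both taps start from the reset level
    for q, code in zip(charge_per_chop, codes):
        if code:                       # phi_code = 1: preserve charge on C1
            c1 += q
        else:                          # phi_code = 0: discard charge onto C2
            c2 += q
    return c1, c2                      # c1 -> phi_out at Tread; c2 cleared at reset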

Fig. 2. (a) Block diagram of the proposed exposure-programmable pixel. (b) Timing diagram of pixel coded exposure in a frame period.

Fig. 3. (a) Detailed pixel circuitry. (b) Block diagram of the chip architecture.

3.2. Sensor architecture

The overall sensor architecture is shown in Fig. 3(b). The pixel array is supported by a variety of functional blocks. The row decoder performs row-by-row scanning to provide the signals φrst, φtran, and φsel to the pixel array. The DRAM controller is a row scanner that sequentially selects a row of pixels to enable refreshing of their exposure codes. When φctrl in the chosen row is set high, the column exposure code decoder accesses the pixels in the selected row and distributes φcode and $\overline {\varphi}_{\textrm{code}}$ to their DRAMs. In every Tchop, the DRAM controller scans through all rows so that the column exposure code decoder updates the exposure codes in every pixel. In Tread, the pixel array is read out using a correlated double-sampling (CDS) scheme realized by column-based CDS circuits. Before reaching the column scanner that outputs the final image data, the pixel output signals are digitized by an analog-to-digital converter (ADC) placed in each column.
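The per-Tchop mask refresh implied by this architecture can be mimicked in software, as in the sketch below; the write_row callback is a hypothetical stand-in for the column exposure code decoder, not a chip interface.

import numpy as np

def refresh_masks(mask, write_row):
    """Row-by-row refresh of an H x W binary exposure code mask, as the
    DRAM controller scans rows and the column decoder latches phi_code bits."""
    for row in range(mask.shape[0]):
        write_row(row, mask[row])

# Example: load a random 128 x 128 mask into a software model of the array.
pixel_codes = np.zeros((128, 128), dtype=np.uint8)
mask = np.random.default_rng(1).integers(0, 2, size=(128, 128), dtype=np.uint8)
refresh_masks(mask, lambda r, bits: pixel_codes.__setitem__(r, bits))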

3.3. Hardware implementation

For hardware implementation, a test image sensor containing 128×128 pixels was fabricated in a 0.13-μm CMOS image sensor (CIS) process. Figure 4(a) shows the chip die, with dimensions of 3 mm × 3 mm. The pixel pitch is 10.2 μm with a fill factor of 41.5%. All in-pixel capacitors are implemented as metal-insulator-metal (MIM) structures, which allows transistors and routing wires to be placed underneath to maximize space utilization. A total of 128 column-parallel single-slope (SS) ADCs convert the pixel output signals in every column into 8-bit digital values before they are scanned and sent off-chip. The chip is powered by two separately regulated power sources: a 3.3 V supply for the analog circuits and a 1.2 V line for all digital control modules. The development of the camera prototype starts from the chip packaging. As illustrated in Figs. 4(b) and 4(c), the test image sensor is bonded on a customized printed-circuit board (PCB) stacked on a second PCB housing the power management and microcontroller chips. A field-programmable gate array (FPGA) chip serves as the microcontroller to provide and process the signals to/from the test image sensor. The prototype camera communicates with a computer through a universal serial bus (USB) cable, which also supplies system power.

Fig. 4. (a) Chip micrograph. (b) Fabricated CMOS image sensor. (c) Prototype computational camera system.

4. Experimental results and discussion

4.1. Sensor characterization

Characterization of the test image sensor starts at the pixel level. Pixel performance is evaluated under illumination by a red light-emitting diode (LED) centered at 630 nm. In a dark environment, the pixel dark current is measured as 1.27 fA, with the pixel output φout decreasing from the reset level (Vrst = 2.56 V) to 2.48 V after 1.0 s of non-intermittent exposure. Sweeping the LED illumination intensity from 1.0 nW/cm2 to 2.0 μW/cm2, the lowest achievable detection limit of a pixel is 7 nW/cm2. Under an exposure time of 13.28 ms (at 60 fps), the peak signal-to-noise ratio (PSNR) at 2.0 μW/cm2 is 34.2 dB and the output dynamic range is 47.3 dB. To prevent code loss, the minimum refresh frequency of the in-pixel DRAM cell is 338.9 Hz, which bounds the maximum allowable length of every Tchop to 377.6 ms. Limited by the fabrication process, the maximum speed of the column exposure code decoder is 500 MHz; therefore, the minimum length of each Tchop is 128 × 128 × 1/(500 MHz) = 32.76 μs. By taking the mean of the standard deviation of all pixel outputs, the extracted pixel and column fixed-pattern noise (FPN) are 0.17% and 0.22%, respectively. The imaging capability of the pixel array is validated using a resolution chart (QA-71). With Texpo set to 10.49 ms, 128 pixel-wise exposure code masks (N = 128) are accommodated in a frame, with each Tchop equal to 81.92 μs. Figure 5(a) displays the output image when the exposure codes in all masks are “1” (M(y,n)∈{1}). As φcode in each Tchop was continuously pulled up, pixels preserved all collected charges and the image sensor implemented non-intermittent exposure. Spatial-temporal coded exposure, on the other hand, was verified using a variety of code patterns. Figure 5(b) shows an output image obtained with spatial grey-scale code masks. Each pixel experienced 128 different Tchop periods, and the resultant image shows a smooth grey-scale transition, indicating effective spatial-temporal exposure encoding across the pixel array.
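The timing figures quoted above follow directly from the array size and decoder speed; the short check below reproduces them, with all values taken from the text.

rows, cols, f_decoder = 128, 128, 500e6
t_chop_min = rows * cols / f_decoder       # = 32.768 us: one full mask refresh
texpo, t_chop = 10.49e-3, 81.92e-6         # settings used in the Fig. 5 test
n_masks = round(texpo / t_chop)            # = 128 masks accommodated per frame
print(f"min Tchop = {t_chop_min * 1e6:.2f} us, masks per frame = {n_masks}")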

Fig. 5. Camera output images of (a) a non-intermittent exposure test, (b) a spatial-temporal coded exposure test using 128 column-grey-scale masks.

4.2. Compressive sensing and space-time volume reconstruction

As described in previous sections, the principle of camera compressive sensing is to recover an uncompressed space-time volume $\overline {\textbf F}$ after spatial-temporal coded exposure. Shown in Fig. 6 is an example of image capture using single-shot spatial-temporal coded exposure. In this example, a food blender operating at 2000 rpm is captured by the prototype camera running at 10 fps, with Texpo and Tchop set to 96.5 ms and 1.51 ms, respectively; thus, the number of pixel-wise exposure code masks (N) applied in Texpo is 64. Depending on the exposure code pattern, the content of the captured image changes as a result of coded exposure. When a non-intermittent exposure code pattern (M(y,n)∈{1}) is applied, as in a conventional camera, the moving whisk is captured with severe motion blur. To achieve the best root-mean-squared-error and structural-similarity performance, as reported in [9], we select pseudo-random binary codes (M(y,n)∈{0,1}) as the code pattern for each mask. The resultant coded image exhibits mosaic-like patterns, which reveal the coded motion blur of the fast-rotating whisk and confirm the independent coded exposure of each pixel. For space-time volume recovery, we employed a learned over-complete dictionary to estimate the space-time volume model. Since the temporal resolution is 1.51 ms, the over-complete dictionary was trained with the K-SVD algorithm on a collection of videos of moving objects captured at 640 fps. Each training video has a patch size of 8 × 8, with rotations in 8 directions and circular replay forward and backward. The output coded image is divided into blocks of size 8 × 8; for each block, a space-time volume of 8 × 8 × 64 is reconstructed by optimizing Eq. (3). After performing block-wise reconstruction, we recover a space-time volume of 64 images from the coded image. Each reconstructed image depicts the scene captured in the corresponding Tchop period; the rotating whisk is clearly observable with more detail and less motion blur.
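A rough sketch of the dictionary-learning step is given below. K-SVD itself is not in scikit-learn, so MiniBatchDictionaryLearning is used here as a stand-in; the training patches are random placeholders, and the temporal depth is reduced from 64 to 8 to keep the example light.

import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

patches = np.random.rand(2000, 8 * 8 * 8)     # placeholder 8 x 8 x 8 patches
learner = MiniBatchDictionaryLearning(
    n_components=2 * patches.shape[1],        # over-complete: k > patch dimension
    transform_algorithm="omp",                # sparse coding step, as in K-SVD
    batch_size=256,
)
D = learner.fit(patches).components_.T        # columns D_1 ... D_k of Eq. (2)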

Fig. 6. Camera compressive sensing by single-shot spatial-temporal coded exposure. Compared to a blurry image generated from non-intermittent exposure, a space-time volume of images is reconstructed from a captured coded image.

Single-frame space-time volume recovery is useful for synthesizing high-frame-rate (e.g., 1000 fps) videos while the camera operates at low frame rates (e.g., 10 fps). As a demonstration, we captured four low-frame-rate videos of a 7-blade CPU fan at 10 fps (T = 100 ms) using the prototype CS camera. In this experiment, we applied 16, 32, 64, and 128 pixel-wise exposure code masks to the image sensor, so the temporal resolution of the reconstructed space-time volume varies from 6.25 ms down to 781 μs. Illustrated in Fig. 7 are coded frames from the spatial-temporal coded exposure with N = 16, 32, 64, and 128. Frame images from a reference 10 fps video using non-intermittent exposure are also shown in Fig. 7(a) for comparison. In all output frame images, the fan blades and the symbol “2” written on one of the blades are blurred out. For the frames from coded exposure, we used the same code mask patterns and reconstruction procedure. Since the scene frequency is unknown a priori, the clarity of high-frequency components in the reconstructed frames is determined by the minimum temporal resolution of the recovered space-time volume. Depicted in Fig. 7(b) are images from the four space-time volumes reconstructed from the four corresponding coded frames. By recovering space-time volumes from every coded frame generated by the prototype CS camera, high-speed videos are synthesized at frame rates of 160 fps, 320 fps, 640 fps, and 1280 fps, respectively. As expected, compared to the reference 10 fps video, severe motion blur is still observable in the 160 fps video but decreases progressively at higher frame rates.

Fig. 7. High-frame-rate video synthesis using per-frame coded exposure. The prototype CS camera operates at a steady frame rate while each output coded frame is reconstructed into a space-time volume that forms part of the final high-speed video.

Another benefit offered by on-sensor spatial-temporal coded exposure is the ability to apply CS to a region of interest (ROI). By defining an ROI in the pixel-wise code masks, coded exposure is applied only to pixels within the ROI. As pixels outside the ROI experience non-intermittent exposure, the space-time volume recovery concentrates on image data from the ROI. Cameras can therefore apply CS-related processing efficiently within the ROI and save decompression power on content outside the ROI (e.g., a static background). Figure 8 depicts an example of ROI-based CS on the prototype CS camera operating at 10 fps. Imaging the same scene (the 7-blade CPU fan), the ROI is defined in each pixel-wise exposure code mask with a size of 70 × 70 pixels. Note that the exposure codes outside the ROI are set to “1”, which ensures that pixels outside the ROI are excluded from coded exposure. After applying 128 exposure code masks in every frame, the resulting 10 fps video clearly exhibits coded blur in the ROI while areas outside the ROI show typical motion blur. After space-time volume recovery, the reconstructed 1280 fps video shows motion blur outside the ROI while the rotating fan blades are observable within the ROI, as expected.
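Building such ROI-restricted masks is straightforward; the sketch below places pseudo-random binary codes inside a 70 × 70 ROI and code “1” everywhere else. The ROI coordinates are illustrative.

import numpy as np

def roi_masks(n_masks=128, shape=(128, 128), roi=(29, 29, 70, 70), seed=0):
    """Masks of shape (N, H, W); roi = (top, left, height, width)."""
    rng = np.random.default_rng(seed)
    masks = np.ones((n_masks, *shape), dtype=np.uint8)   # code "1" outside ROI
    top, left, h, w = roi
    masks[:, top:top + h, left:left + w] = rng.integers(
        0, 2, size=(n_masks, h, w), dtype=np.uint8)      # coded exposure inside
    return masks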

Fig. 8. High-speed video synthesis using region of interest (ROI) based per-frame coded exposure. The size of the ROI is 70 pixels by 70 pixels.

4.3. Light throughput

Light efficiency characterizes how much light an imaging device receives and converts into image signal, and it directly impacts SNR. It is critical to an imaging system, especially when compressive sensing is involved in image capture. For the prototype spatial-temporal CS camera, since the off-sensor optical lens module is the same as in traditional cameras, the improvement in light throughput is mainly contributed by the image sensor. In an environment with constant light intensity, assume there is a luminance signal with a limited bandwidth [-fmax, fmax]. The minimum sampling period for such a signal is Δt = 1/(2fmax). In conventional photography, a camera needs to operate at 2fmax fps with P photons recorded in each frame. The required speed of the spatial-temporal CS camera, in contrast, is 2fmax/N fps. In a frame period, N × P × F photons are accumulated, where F is the fraction of code “1” entries in the N code masks. Therefore, for each frame, the SNR gain of the spatial-temporal CS camera relative to a conventional camera is given by:

$$SN{R_{gain}} = \frac{{SN{R_{CS}}}}{{SN{R_{Conventional}}}} = \frac{{(N \times \sqrt {P \times F} )/(\lambda \times {\eta _{sensor}})}}{{\sqrt P /{\eta _{sensor}}}} = \frac{{N \times \sqrt F }}{\lambda }$$
where ηsensor represents a chip-related noise level which is the sum of pixel dark noise, CDS sampling noise, and ADC noise. Similar to the approach used in [23], λ is the noise factor that captures the additional noise introduced during the reconstruction process and is defined as $\sqrt{\textrm{trace}\left(\left(({\textbf M}{\textbf D})^{T}{\textbf M}{\textbf D}\right)^{-1}\right)/N}$. Shown in Fig. 9 is an example of capturing a rotating quarter with light throughput enhanced by the spatial-temporal coded exposure of our prototype CS camera. First, to acquire reference high-speed footage, the camera is configured in the non-intermittent exposure mode at a high frame rate. Figure 9(a) shows an output video captured at a frame rate of 120 fps; as expected, due to the reduced exposure time, the measured SNR is only 5.72 dB. When applying spatial-temporal coded exposure, the camera operates at a lower speed with a longer Texpo. Imaging the same scene, the coded image shown in Fig. 9(b) is captured at 10 fps with N = 12. After reconstruction, the improvement in clarity of the recovered images is evident. The measured SNR is 29.3 dB, which gives an SNR gain of 5.14. As Texpo is 12× longer than in the 120 fps video, the resultant high-speed video exhibits enhanced light throughput while the camera operates at a lower frame rate.
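A numeric sketch of Eq. (6) and the noise factor λ is shown below; the sensing-matrix–dictionary product MD is a small random placeholder, purely for illustration.

import numpy as np

def noise_factor(MD, n):
    """lambda = sqrt(trace(((MD)^T MD)^(-1)) / N)."""
    return float(np.sqrt(np.trace(np.linalg.inv(MD.T @ MD)) / n))

def snr_gain(n_subperiods, fraction_ones, lam):
    """SNR_gain = N * sqrt(F) / lambda, per Eq. (6)."""
    return n_subperiods * np.sqrt(fraction_ones) / lam

N = 12                                   # masks used in the rotating-quarter test
MD = np.random.default_rng(0).standard_normal((256, 64))
print(snr_gain(N, fraction_ones=0.5, lam=noise_factor(MD, N)))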

Fig. 9. A comparison between high-speed videos generated by (a) non-intermittent exposure and (b) per-frame spatial-temporal coded exposure.

4.4. Power consumption

The power dissipation of the prototype camera system comes from the CMOS image sensor chip and its peripheral signal-processing devices. For the image sensor, which is the most power-hungry component in the system, the power consumption depends on the number of code masks (N) applied during coded exposure. For an X × Y pixel array in a frame period, the power consumption Pframe is estimated as:

$${P_{frame}} = \frac{{{P_{rst}}{T_{rst}} + XY({P_{code}} + {P_{dram}})N{T_{chop}} + {P_{read}}{T_{read}}}}{T}$$
where Prst and Pread are the reset and readout power, and Pcode and Pdram are the power dissipated by the exposure code decoder and the DRAM controller, respectively. Summarized in Fig. 10(a) is the power consumption of the pixel array and its peripheral blocks measured after applying N exposure code masks in a single shot (one frame period). When N = 1, the minimum number of exposure code masks applied in Texpo, the pixel array consumes 0.76 μW while its peripheral circuitry consumes 12.83 μW (mostly dissipated in the ADCs). As N increases, the power dissipation rises rapidly, with most of it spent on exposure code mask refreshment. While N is small, the power consumed by the pixel array is comparable to that of its peripheral modules. When N reaches 30,000, the maximum applicable number of code masks in one second (with Tchop set to 32.76 μs), the pixel array consumes 22.85 mW while ∼10 mW is dissipated in the peripheral modules. The overall power consumption of the image sensor, as shown in Fig. 10(b), also climbs rapidly as more exposure code masks are included in Texpo: the image sensor consumes 23.5 μW at N = 1 and surges to 32.85 mW at N = 30,000. Clearly, a smaller N helps reduce chip power consumption; however, it limits the temporal resolution of the space-time volume recovery, which then covers less of the high-frequency spectrum and leaves residual motion blur in the reconstructed images. There is therefore a trade-off between the power budget and the number of masks applied. Based on the type of scene, one should maximize spatial-temporal CS performance with a number of coded exposure masks that meets the power budget.
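Equation (7) can be evaluated directly; the sketch below uses placeholder per-block power values (not the measured figures) to show how the mask-refresh term scales with N.

def frame_power(p_rst, t_rst, p_code, p_dram, n_masks, t_chop,
                p_read, t_read, x=128, y=128):
    """P_frame per Eq. (7) for an X x Y array over a frame period T."""
    t_frame = t_rst + n_masks * t_chop + t_read          # T = Trst + N*Tchop + Tread
    energy = (p_rst * t_rst
              + x * y * (p_code + p_dram) * n_masks * t_chop
              + p_read * t_read)
    return energy / t_frame

for n in (1, 64, 128):                                   # placeholder power values
    print(n, frame_power(1e-6, 1e-3, 1e-12, 1e-12, n, 32.76e-6, 1e-5, 5e-3))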

Fig. 10. (a) Power consumption of the pixel array and peripheral modules during coded exposure. (b) The overall power consumption of the image sensor in a single shot.

4.5. Discussion

We have presented a functional computational camera with a CMOS image sensor capable of per-frame spatial-temporal CS. The on-sensor pixel-wise coded exposure design offers a versatile solution for exposure programming in either full resolution or ROI areas. Summarized in Table 1 is a comparison with other related works; the proposed camera has notable advantages in power saving.

Table 1. Performance Comparison to Related Works

The proposed camera saves power by excluding off-sensor optical programming modules, which are usually power hungry. In terms of sensor power dissipation, compared to other on-chip CS solutions, our image sensor consumes less power owing to the reduced frequency of sensor readout: since each pixel includes a two-tap charge modulator, sensor readout is required only once per frame, after all exposure code masks are applied. The temporal resolution determines the size of the reconstructed space-time volume. In previous works, the number of applicable exposure code masks was constrained by the speed of the sensor readout circuits (CDS and ADC); the temporal resolution was therefore determined by the maximum reachable frame rate of the image sensor. In this work, the proposed per-frame spatial-temporal coded exposure improves the temporal resolution, which is now independent of the frame rate. For the prototype CMOS image sensor, since the exposure code mask refresh rate is determined by the read/write speed of the DRAM cells, the limiting factor on temporal resolution is the speed of the peripheral circuits (the exposure code decoder and the DRAM controller).

5. Conclusion

In this paper, we proposed a computational camera design using a CMOS image sensor capable of pixel-wise exposure programming. The in-pixel two-tap charge modulation and exposure code memory enable single-frame spatial-temporal CS. A prototype camera equipped with a 128 × 128 proof-of-concept CMOS image sensor experimentally demonstrated both single-frame pixel-wise coded exposure and high-frame-rate video reconstruction. In applications such as high-speed imaging, by applying different numbers of exposure code masks, the temporal resolution of the recovered video can flexibly cover a variety of frequency content in the scene.

In comparison to the state of the art, the proposed on-sensor CS solution provides improved light throughput and lower power consumption. Operating at a low frame rate, the image sensor is exposed to the scene for a longer period, improving the SNR of the recovered images. In each frame, as pixel readout is not required until coded exposure is complete, CS is naturally applied in either full resolution or a designated ROI during the exposure period. In general, this camera solution extends the benefits of CMOS implementations of CS to emerging computational imaging applications.

Funding

Natural Sciences and Engineering Research Council of Canada (NSERC) (RGPIN-2017-06240, I2IPJ 516239-17).

Acknowledgment

The authors would like to thank Roozbeh Mehrabadi from the Department of Electrical and Computer Engineering at the University of British Columbia for his technical support with the CAD tools and measurements.

References

1. Z. Wang, L. Spinoulas, K. He, H. Chen, L. Tian, A. K. Katsaggelos, and O. Cossairt, “Compressive holographic video,” Opt. Express 25(1), 250–262 (2017). [CrossRef]

2. Y. Hitomi, J. Gu, M. Gupta, T. Mitsunaga, and S. K. Nayar, “Video from a Single Coded Exposure Photograph using a Learned Over-Complete Dictionary,” in IEEE ICCV (IEEE, 2011), pp. 287–294.

3. P. Llull, X. Liao, X. Yuan, J. Yang, D. Kittle, L. Carin, G. Sapiro, and D. J. Brady, “Coded aperture compressive temporal imaging,” Opt. Express 21(9), 10526–10545 (2013). [CrossRef]  

4. R. Koller, L. Schmid, N. Matsuda, T. Niederberger, L. Spinoulas, O. Cossairt, G. Schuster, and A. K. Katsaggelos, “High spatio-temporal resolution video with compressed sensing,” Opt. Express 23(12), 15992–16007 (2015). [CrossRef]  

5. Q. Zhou, J. Ke, and E. Y. Lam, “Near-Infrared Temporal Compressive Imaging for Video,” Opt. Lett. 44(7), 1702–1705 (2019). [CrossRef]  

6. T. Portz, L. Zhang, and H. Jiang, “Random Coded Sampling for High-Speed HDR Video,” in IEEE ICCP (IEEE, 2013), pp. 1–8.

7. Y. Sun, X. Yuan, and S. Pang, “Compressive High-Speed Stereo Imaging,” Opt. Express 25(15), 18182–18190 (2017). [CrossRef]  

8. F. Li, H. Chen, A. Pediredla, C. Yeh, K. He, A. Veeraraghavan, and O. Cossairt, “CS-ToF: High-resolution compressive time-of-flight imaging,” Opt. Express 25(25), 31096–31110 (2017). [CrossRef]  

9. D. Liu, J. Gu, Y. Hitomi, M. Gupta, T. Mitsunaga, and S. K. Nayar, “Efficient space-time sampling with pixel-wise coded exposure for high-speed imaging,” IEEE Trans. Pattern Anal. Mach. Intell. 36(2), 248–260 (2014). [CrossRef]  

10. N. Sarhangnejad, H. Lee, N. Katic, M. O’Toole, K. Kutulakos, and R. Genov, “CMOS Image Sensor Architecture for Primal-Dual Coding,” in IISW (IISS, 2017), pp. 356–359.

11. M. Wei, N. Sarhangnejad, Z. Xia, H. Ke, N. Gusev, R. Genov, and K. N. Kutulakos, “Coded Two-Bucket Cameras for Computer Vision,” in ECCV (Springer, 2018), pp. 1–8.

12. G. Wan, X. Li, G. Agranov, M. Levoy, and M. Horowitz, “CMOS Image Sensor with Multi-Bucket Pixels for Computational Photography,” IEEE J. Solid-State Circuits 47(4), 1031–1042 (2012). [CrossRef]  

13. F. Mochizuki, K. Kagawa, S. Okihara, M. W. Seo, B. Zhang, and T. Takasawa, “Single-Shot 200Mfps 5×3-Aperture Compressive CMOS Imager,” in IEEE ISSCC (IEEE, 2015), pp. 116–118.

14. J. Zhang, T. Xiong, T. Tran, S. Chin, and R. Etienne-Cummings, “Compact All-CMOS Spatiotemporal Compressive Sensing Video Camera with Pixel-Wise Coded Exposure,” Opt. Express 24(8), 9013–9024 (2016). [CrossRef]

15. Y. Luo and S. Mirabbasi, “A CMOS Pixel Design with Binary Space-Time Exposure Encoding for Computational Imaging,” in IEEE CICC (IEEE, 2017), pp. 1–4.

16. Y. Luo and S. Mirabbasi, “Always-On CMOS Image Sensor Pixel Design for Pixel-Wise Binary Coded Exposure,” in IEEE ISCAS (IEEE, 2017), pp. 1–4.

17. Y. Luo, D. Ho, and S. Mirabbasi, “Exposure-Programmable CMOS Pixel with Selective Charge Storage and Code Memory for Computational Imaging,” IEEE Trans. Circuits Syst. I 65(5), 1555–1566 (2018). [CrossRef]

18. D. L. Donoho, M. Elad, and V. N. Temlyakov, “Stable Recovery of Sparse Overcomplete Representations in the Presence of Noise,” IEEE Trans. Inf. Theory 52(1), 6–18 (2006). [CrossRef]  

19. E. J. Candes, J. Romberg, and T. Tao, “Stable Signal Recovery from Incomplete and Inaccurate Measurements,” Comm. Pure Appl. Math. 59(8), 1207–1223 (2006). [CrossRef]  

20. M. Elad and M. Aharon, “Image Denoising via Learned Dictionaries and Sparse Representation,” in IEEE CVPR (IEEE, 2006), pp. 895–900.

21. M. Wakin, J. Laska, M. Duarte, D. Baron, S. Sarvotham, D. Takhar, K. Kelly, and R. Baraniuk, “Compressive Imaging for Video Representation and Coding,” in IEEE PCS (IEEE, 2006), pp. 1–7.

22. M. Aharon, M. Elad, and A. Bruckstein, “K-SVD: An Algorithm for Designing Overcomplete Dictionaries for Sparse Representation,” IEEE Trans. Signal Processing 54(11), 4311–4322 (2006). [CrossRef]  

23. A. Veeraraghavan, D. Reddy, and R. Raskar, “Coded Strobing Photography: Compressive Sensing of High Speed Periodic Videos,” IEEE Trans. Pattern Anal. Mach. Intell. 33(4), 671–686 (2011). [CrossRef]
