
Compressive video via IR-pulsed illumination

Open Access

Abstract

We propose and demonstrate a compressive temporal imaging system based on pulsed illumination to encode temporal dynamics into the signal received by the imaging sensor during the exposure time. Our approach enables a >10x increase in effective frame rate without increasing camera complexity. To mitigate the complexity of the inverse problem during reconstruction, we introduce two keyframes: one before and one after the coded frame. We also craft what we believe to be a novel deep learning architecture, combining specialized convolutional and transformer modules, for improved reconstruction of the high-speed scenes. Simulation and experimental results demonstrate the reconstruction of high-quality, high-speed videos from the compressed data.

© 2023 Optica Publishing Group under the terms of the Optica Open Access Publishing Agreement

1. Introduction

Sensing high-speed scenes is highly desirable in various applications, including robotics, autonomous driving, and scientific topics such as fluid dynamics analysis [1], lucky imaging in astronomy [2], or clinical applications such as vocal fold digital high-speed imaging [3]. Conventional high-speed cameras capture images at high frame rates, leading to high hardware manufacturing costs and to massive data volumes that must be read out, transferred, processed, and stored. On the other hand, off-the-shelf cameras with low frame rates suffer from either poor temporal sampling between frames at short exposures or motion blur at long exposures when attempting to capture fast events, resulting in the loss of the dynamic details of the scene.

Unlike traditional imaging systems that rely on simple and exhaustive sampling techniques, computational imaging integrates optical and electronic hardware with sophisticated back-end algorithms, enabling the optimal utilization of limited measurements and extracting a wealth of valuable information from the scene. By cleverly combining hardware and software components, computational imaging [4] revolutionizes how we capture and process visual data, opening up new possibilities for advanced applications in various fields.

Coded aperture compressive temporal imaging (CACTI) [5], inspired by compressed sensing (CS) [6] theory, modulates multiple frames of a high-speed scene into a single frame, enabling snapshot compressive imaging (SCI) [7]. Recent advancements have focused on simultaneously optimizing both the hardware and software decoder to achieve optimal high-speed video quality.

Standard CACTI systems employ temporally varying masks to modulate dynamic scenes at different times while integrating during the sensor exposure. Early CACTI systems mechanically translated a static lithography mask to generate the temporally varying coding [5,8]. Nowadays, mainstream CACTI systems use spatial light modulators (SLMs) for encoding, such as digital micromirror devices (DMDs) or liquid crystal on silicon (LCoS) [9], which offer accurate and flexible pixel-wise exposure control. Other works relax the problem by allowing multishot acquisition using hybrid encoding [10,11] and side information to reconstruct the high-speed video [12].

Although calibration is a major challenge in computational imaging, the main caveat of most of these compressive temporal imaging techniques is the requirement for extra hardware to code at the pixel level, which extends the optical path and increases the cost of the computational imaging system. With this in mind, previous works have proposed simpler implementations, such as the flutter shutter compressive video (FSV) camera [13,14], which only modulates the opening and closing of the camera shutter. However, these methods suffer from poor reconstruction quality due to the ill-posed nature of the inverse problem. Using a different approach, other methods apply parallel sampling through camera arrays with multiple encoding options [15–17]. Yet other ideas replace the physical shutter with active, structured illumination to encode the scene [18,19].

On the decoder side, deep learning [20] has been remarkably successful in dealing with the inverse problem related to CACTI [7,21]. Most related publications train a feedforward convolutional neural network to directly extract the high-speed video from the compressed measurements [22–24], while other works use a hybrid approach such as deep unfolding networks [25], which merge neural networks with classical mathematical models.

In this article, we propose a new technique to capture and recover high-speed video from compressive measurements. We also believe that this method can be easily incorporated to enable temporal imaging in a variety of imaging systems and applications that may benefit from active illumination, including novel sensors [26], quantitative phase imaging [27], or optical diffraction tomography [28]. Firstly, we use an active illumination system that encodes the space-time datacube with temporally coded illumination, avoiding the need for complex hardware such as spatial light modulators, while also enabling imaging in low-light situations. Secondly, we use a hybrid temporal sampling scheme that employs keyframes to constrain the reconstruction of the datacube. Finally, we design and develop a novel convolutional neural network architecture that fuses the measurements to successfully reconstruct high-speed videos.

2. Forward model

We can express the high-speed scene as a three-dimensional function $f(x, y, t)$, where $x$ and $y$ represent spatial coordinates and $t$ represents time. Also, let us denote the dynamic illumination mask projected on the scene sensing channel at time $t$ as $h(x, y, t)$. Assuming that the camera integration time is $T$, we can write the first measurement as

$$g(x, y)=\int_0^T h(x, y, t) f(x, y, t) \mathrm{d} t.$$

For many compressive video methods [7], the function $h$ is a coded aperture with a binary random distribution that differs at every $t$. On the other hand, in flutter shutter video (FSV) [13], $h$ corresponds to the opening or closing of the mechanical shutter with random activations at every $t$. Similarly, in our proposed active illumination model, $h$ represents the activation of the light source, which is idealized as uniform at all spatial positions $(x,y)$.

The three-dimensional scene $f$ can be discretized as $\mathbf {f} \in \mathbb {R}^{N_x \times N_y \times N_F}$. Here, $N_x,N_y$ represent the numbers of pixels in a square active sensing area and $N_F$ represents the number of temporal channels. This means that $\mathbf {f}$ is a spatiotemporal data cube with dimensions ${N_x} \times N_y \times N_F$. The measurement at spatial indices $(i, j)$ is calculated using the following equation

$$g_{i, j}=\sum_{k=1}^{N_F} h_{i, j, k} f_{i, j, k}+n_{i, j},$$
where $n_{i, j}$ represents the noise at the detector pixel $(i, j)$.

To express the forward model as a linear transformation, one can flatten the discrete object $\mathbf {f}$ into the space $\mathbb {R}^{N_x N_y N_F\times 1}$, along with the image $\mathbf {g}$ and the noise $\mathbf {n}$ both into the space $\mathbb {R}^{N_x N_y\times 1}$. The linear transformation is then given by

$$\mathbf{g}=\mathbf{H} \mathbf{f}+\mathbf{n},$$
$$\mathbf{f}=\left[ \mathbf{f_1}^\top, \mathbf{f_2}^\top,\ldots, \mathbf{f_{N_F}}^\top \right]^\top,$$
where $\mathbf {f_i} \in \mathbb {R}^{N_xN_y \times 1}$ are the $N_F$ flattened illuminated scene "frames", or components, which together comprise the total exposure, and $\mathbf {H} \in \mathbb {R}^{N_xN_y \times N_xN_yN_F}$ represents the system’s discrete forward matrix, which incorporates all elements of the transformation such as the optical impulse response, pixel sampling function, intensity distribution, and the activation signal across time. $\mathbf {H}$ can be represented as
$$\mathbf{H}=\left[ \operatorname{Diag}(h_1), \operatorname{Diag}(h_2),\ldots, \operatorname{Diag}(h_{N_F}) \right],$$
where $\operatorname {Diag}(\cdot )$ builds a diagonal matrix from its vector argument.

In the context of CACTI [5], $\mathbf {H}$ is designed to modulate $\mathbf {f}$ dynamically with spatially random windows in the aperture that change during a single measurement $\mathbf {g}$. In contrast with CACTI, however, our pulsed-illumination system is designed such that $\mathbf {H}$ is constrained to a spatially uniform illumination pattern during each pulse. Therefore, Eq. (5) can only include either $\operatorname {Diag}(\mathbf {1})$ or $\operatorname {Diag}(\mathbf {0})$ for a given pulse over the duration of a given frame $\mathbf {f_i}$.

We refer to this illumination design as a frame mask ($F_M$), because each entire frame $\mathbf {f_i}$ is either illuminated or it is not. For example, if a given frame mask is $F_M(\mathbf {H}) = [0,1,0,1,1,0]$, this implies $\mathbf {H} = [\operatorname {Diag}(\mathbf {0}),\operatorname {Diag}(\mathbf {1}),\operatorname {Diag}(\mathbf {0}),\operatorname {Diag}(\mathbf {1}),\operatorname {Diag}(\mathbf {1}),\operatorname {Diag}(\mathbf {0})]$, with $\mathbf {0}$ and $\mathbf {1}$ both of size $N_x\times N_y$ ($N_x=N_y=N$ in this work). This frame-masking technique is in contrast to one using pixel masks, not implemented here, where a set of spatially varying illumination patterns is used such that each of the $N_F$ frames $\mathbf {f_i}$ is illuminated by a different pattern. A minimal numerical sketch of the frame-mask measurement is given below.
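
To make the frame-mask measurement concrete, the following minimal NumPy sketch (our illustration, not the released code) applies Eqs. (2) and (5) to a synthetic datacube: a spatially uniform binary pattern selects which temporal frames are accumulated into the coded snapshot. The array sizes and the additive Gaussian noise level are illustrative assumptions.

```python
# Sketch of the frame-mask forward model: g_{i,j} = sum_k h_k f_{i,j,k} + n_{i,j},
# where h_k is constant over (i, j) because the illumination is spatially uniform.
import numpy as np

def coded_snapshot(f, frame_mask, noise_std=0.01, rng=None):
    """f: datacube of shape (N, N, N_F); frame_mask: binary vector of length N_F."""
    rng = np.random.default_rng() if rng is None else rng
    h = np.asarray(frame_mask, dtype=f.dtype)
    g = np.tensordot(f, h, axes=([2], [0]))          # sum over the temporal axis
    return g + noise_std * rng.standard_normal(g.shape)

# Example with the frame mask F_M(H) = [0, 1, 0, 1, 1, 0] from the text
N, N_F = 128, 6
f = np.random.rand(N, N, N_F)                        # synthetic high-speed scene
g = coded_snapshot(f, [0, 1, 0, 1, 1, 0])
print(g.shape)                                       # (128, 128)
```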

Related works such as FSV [13,29] use the same acquisition model; however, the ill-posedness of $\mathbf {H}$ severely impacts the recovery of $\mathbf {f}$. Therefore, we propose to relax the acquisition by allowing multiple shots, since this strategy has already proven successful in other CS systems [10,12,17]. The multi-shot model for compressive temporal imaging can be represented as

$$\left[\begin{array}{l} \mathbf{g}_1 \\ \mathbf{g}_2 \\ \mathbf{g}_3 \end{array}\right]=\left[\begin{array}{l} \mathbf{H}_1 \\ \mathbf{H}_2 \\ \mathbf{H}_3 \end{array}\right] \mathbf{f}.$$

Inspired by frame interpolation (FI) methods [30], we can implement this system as a hybrid acquisition where $\mathbf {g}_1 = \mathbf {f}_1$ and $\mathbf {g}_3 = \mathbf {f}_{N_F}$ are key-frames and $\mathbf {g}_2$ contains a compressive measurement of all frames in between. Thus, we update Eq. (6) as

$$\left[\begin{array}{l} \mathbf{g}_1 \\ \mathbf{g}_2 \\ \mathbf{g}_3 \end{array}\right]=\left[\begin{array}{ccc} \operatorname{Diag}(\mathbf{1}) & \mathbf{0} & \mathbf{0} \\ \mathbf{0} & \mathbf{H}_2 & \mathbf{0} \\ \mathbf{0} & \mathbf{0} & \operatorname{Diag}(\mathbf{1}) \end{array}\right] \mathbf{f}.$$

The proposed forward model is illustrated in Fig. 1, and it indeed allows relaxing the inverse problem. With the key-frames $\mathbf {g}_1$ and $\mathbf {g}_3$ as extra information about the datacube, we can recover the video as a frame interpolation guided by the dynamics encoded in $\mathbf {g}_2$, as sketched in the example after Fig. 1.

Fig. 1. Example of matrix dimensions for the proposed acquisition matrix.
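
As a hedged illustration of the hybrid acquisition in Eq. (7), and not the authors' implementation, the sketch below produces the two uncoded keyframes and the coded middle measurement from a synthetic datacube; the inner frame mask and the noise handling are placeholder assumptions.

```python
# Hybrid acquisition: g1 and g3 are short-exposure keyframes at the first and
# last temporal indices, while g2 compresses the N_F - 2 frames in between
# with a spatially uniform frame mask.
import numpy as np

def hybrid_measurements(f, inner_mask, noise_std=0.0, seed=0):
    """f: (N, N, N_F) datacube; inner_mask: binary vector of length N_F - 2."""
    rng = np.random.default_rng(seed)
    g1 = f[..., 0]                                             # keyframe at t = 1
    g3 = f[..., -1]                                            # keyframe at t = N_F
    h = np.asarray(inner_mask, dtype=f.dtype)
    g2 = np.tensordot(f[..., 1:-1], h, axes=([2], [0]))        # coded exposure
    if noise_std > 0:
        g2 = g2 + noise_std * rng.standard_normal(g2.shape)
    return g1, g2, g3

f = np.random.rand(128, 128, 16)
g1, g2, g3 = hybrid_measurements(f, inner_mask=np.random.randint(0, 2, 14))
print(g1.shape, g2.shape, g3.shape)                            # three (128, 128) images
```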

3. System description

There are several factors to consider when implementing high-speed imaging systems. First of all, high-speed imaging often requires excellent illumination conditions to ensure a good signal-to-noise ratio (SNR). In our case, we use six high-power LED flashlights. Since we are temporally encoding with the flashlights, we propose to work at a wavelength of 850 nm so that the stroboscopic effect does not become harmful or annoying for people within the field of view of the system. This near-infrared wavelength provides a favorable trade-off between low human visual sensitivity and the quantum efficiency of CCD/CMOS cameras. In fact, these kinds of illumination strategies have already been used in other vision tasks such as driver monitoring [31].

Since the chosen illumination is a known pseudorandom pulse sequence, we must dynamically control the camera to emulate the proposed forward matrices. To achieve this, we capture one short exposure as the measurement $\mathbf {g}_1$ and one long exposure that integrates the coded illumination into $\mathbf {g}_2$. By recycling the first short-exposure frame of the following cycle as $\mathbf {g}_3$, we obtain three sequential measurements that are used to recover $\mathbf {f}$. We use an Arduino microcontroller as the synchronization device that governs the acquisition timing strategy shown in Fig. 2.

Fig. 2. Timing scheme of the acquisition system. The microcontroller generates different timing pulses for each device. The Timing system signal (green) is provided as a reference. The Illumination signal (red) powers the LEDs with a frame mask $F_M(H) = [1,1,0,1,0,0,1,0,0,1,0,0,1,0,1]$.

The Timing system signal (green) depicted in Fig. 2 represents the timing of the desired snapshots, or projections, of the datacube $\mathbf {f}$. The Illumination signal (red) controls the power of the IR LED system. The purple line represents the exposure control and trigger of the camera. It is important to note that, for the sake of simplicity, we employ the same camera to capture both the short exposure $\mathbf {g}_1$ and the subsequent long exposure until the end of the cycle. The long exposure integrates the pseudorandom coded illumination pulses to obtain $\mathbf {g}_2$.

Moreover, we have incorporated a reference camera that acquires two frames per cycle; however, this camera only captures short exposures. We utilize this camera to compare our method with traditional frame interpolation techniques, as outlined in section 5.

The pulse width (PW) of the acquisition system needs to be sufficiently short to prevent motion blur in the scene, yet long enough to achieve a good signal-to-noise ratio (SNR). To balance these requirements, both cameras and the illumination use the exact same PW, ensuring a consistent level of motion blur in the reconstruction.
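
The timing strategy of Fig. 2 can be prototyped in software before being ported to the microcontroller. The following sketch is our assumption rather than the authors' firmware: it lays out the LED on/off events for the frame mask of Fig. 2 using the pulse width and off time reported in Section 5.3.1, under the assumption that each temporal slot lasts the pulse width plus the off time.

```python
# Lay out LED firing events for a frame mask: slot k starts at k*(PW+OFF),
# and the LEDs are on for the first PW microseconds of every active slot.
PW_US = 200        # pulse width reported in Section 5.3.1
OFF_US = 4400      # off time between pulses reported in Section 5.3.1
FRAME_MASK = [1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1]  # mask from Fig. 2

def led_schedule(frame_mask, pw_us=PW_US, off_us=OFF_US):
    """Return (t_on_us, t_off_us) pairs for every illuminated slot."""
    events, slot_us = [], pw_us + off_us
    for k, active in enumerate(frame_mask):
        if active:
            t_on = k * slot_us
            events.append((t_on, t_on + pw_us))
    return events

for t_on, t_off in led_schedule(FRAME_MASK):
    print(f"LED on at {t_on:6d} us, off at {t_off:6d} us")
```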

4. Reconstruction method

Since $\mathbf {H}$ multiplexes the temporal illumination pattern of the continuous object into the discrete-time image $\mathbf {g}$, inverting Eq. (3) for $\mathbf {f}$ becomes more difficult as the dimension $N_F$ increases. As a compressive reconstruction problem, it is very difficult to obtain accurate reconstructions using linear or sparse inversion methods and similar techniques [7,32], which require the sampling matrix to be an independent and identically distributed Gaussian matrix [6]; this condition is unrealistic in the vast majority of real SCI systems, which rely on binary dynamic coded apertures.

Nevertheless, neural network-based techniques are proving very useful for solving ill-posed inverse problems [21]. We propose a customized version of a proven convolutional neural network architecture to perform the reconstruction, as summarized in Fig. 3. The proposed model includes a UNET convolutional neural network and a modified version of the spatio-temporal transformer (STT) recently proposed in [33], where the initialization module formerly presented in [33–35] is replaced with the module shown in Fig. 3(a). The UNET architecture is a good match for tensor fusion tasks due to its inherent memory efficiency. In our implementation, the output of each UNET module is subsequently upsampled and summed. This novel initialization module is crucial to the success of the reconstruction: in our preliminary tests, the original STT model only provided good estimates for random binary masks, but never for our proposed forward matrix, which has more structure.

Fig. 3. Diagram of the proposed neural network reconstruction scheme. (a) Initialization module that outputs a coarse video from the measurements; (b) estimation module consisting of a smaller version of the STT [33]. The FILM block uses pretrained weights [36] and remains frozen during training.

For the initialization module, we use the pretrained weights of the Frame Interpolation for Large Motions (FILM) model [36] (which had already been trained on the same DAVIS2017 [37] dataset we use) to obtain an interpolated estimate of the video $\mathbf {f}_F$ based on the initial and final positions sampled in $\mathbf {g}_1$ and $\mathbf {g}_3$. Since this model only outputs one frame, it must be applied multiple times to fill the datacube $\mathbf {f}_F$ at all required temporal indices $k \in \{1,\ldots,N_F\}$. Next, we use the keyframes $\mathbf {g}_1$ and $\mathbf {g}_3$, together with the dynamically encoded $\mathbf {g}_2$, to obtain a second estimate of the video $\mathbf {f}_c = \mathbf {f}_{s_1} + \mathbf {f}_{s_2} + \mathbf {f}_{s_3}$. For this we use two convolutional operations with a kernel size of 3, each followed by a LeakyReLU activation layer and applied at three pixel resolutions. Next, we form the difference between the two video estimates and concatenate it with $\mathbf {f}_c$,

$$\left[ \mathbf{f}_F-\mathbf{f}_c, \mathbf{f}_c\right]_{Ch},$$
where $[\cdot ]_{Ch}$ represents channel concatenation, resulting in a tensor of size $N\times N\times 2N_F$. This tensor serves as the input for three parallel UNET neural networks [38], with resolutions downscaled by a factor of 2 at each step within each UNET. The UNETs fuse the estimated features of the datacube coming from the two front-end inference processes (FILM interpolation and the convolutional branch) into a single datacube at three different scales, which are subsequently upsampled and summed, as sketched below.
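
The following hedged PyTorch sketch illustrates this fusion step. The channel widths, the depth of the stand-in UNET blocks, and the final projection head are our assumptions; only the channel concatenation $[\mathbf{f}_F-\mathbf{f}_c, \mathbf{f}_c]$ and the three-scale upsample-and-sum fusion follow the description above.

```python
# Illustrative fusion of the FILM estimate and the convolutional estimate at
# three resolutions, followed by upsampling and summation (Fig. 3(a)).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyUNet(nn.Module):
    """Small stand-in for the UNET blocks (one downscale-by-2 step with a skip)."""
    def __init__(self, ch):
        super().__init__()
        self.down = nn.Sequential(nn.Conv2d(ch, ch, 3, stride=2, padding=1), nn.LeakyReLU(0.1))
        self.up = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.LeakyReLU(0.1))

    def forward(self, x):
        skip = x
        x = self.down(x)
        x = F.interpolate(x, size=skip.shape[-2:], mode="bilinear", align_corners=False)
        return self.up(x) + skip

class InitFusion(nn.Module):
    def __init__(self, n_frames=16):
        super().__init__()
        ch = 2 * n_frames                         # channels of [f_F - f_c, f_c]
        self.unets = nn.ModuleList(TinyUNet(ch) for _ in range(3))
        self.head = nn.Conv2d(ch, n_frames, 3, padding=1)

    def forward(self, f_film, f_conv):
        x = torch.cat([f_film - f_conv, f_conv], dim=1)          # (B, 2*N_F, N, N)
        outs = []
        for s, unet in enumerate(self.unets):                    # three scales
            xs = F.avg_pool2d(x, 2 ** s) if s > 0 else x
            ys = unet(xs)
            outs.append(F.interpolate(ys, size=x.shape[-2:], mode="bilinear",
                                      align_corners=False))      # upsample
        return self.head(sum(outs))                              # coarse video estimate

f_film = torch.rand(1, 16, 128, 128)   # FILM-interpolated datacube
f_conv = torch.rand(1, 16, 128, 128)   # datacube from the convolutional branch
print(InitFusion()(f_film, f_conv).shape)   # torch.Size([1, 16, 128, 128])
```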

The output of the initialization module is used as the input to the STT, which consists of a token generator that maps the video from pixel space to feature space. In the STT, the features are related to each other through temporal and spatial self-attention maps and a Grouping ResNet feed-forward network. Finally, the decoder of the STT is a video reconstruction module that outputs the estimated video $\hat {f}$. To avoid memory saturation during training, we use the small version of the STT, referred to as STFormer-S in [33].

5. Results

We assess the performance of the proposed model and compare it with two state-of-the-art frame interpolation methods on a variety of simulated and real datasets. We quantitatively evaluate the effectiveness of the different video SCI reconstruction methods on the simulated datasets using the peak signal-to-noise ratio (PSNR) and the structural similarity index (SSIM) [39]. Since our model uses three inputs, for the interpolation methods we replace the coded frame with an additional short exposure taken by the reference camera with the same exposure time as the keyframes.
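
As an illustration of this evaluation, and not the authors' exact script, the sketch below computes PSNR and SSIM with scikit-image, assuming both metrics are averaged over the recovered frames; the data range and the synthetic inputs are assumptions.

```python
# Average per-frame PSNR and SSIM between a ground-truth and a recovered datacube.
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def video_metrics(gt, rec, data_range=1.0):
    """gt, rec: arrays of shape (N_F, N, N) with values in [0, data_range]."""
    psnr = [peak_signal_noise_ratio(g, r, data_range=data_range) for g, r in zip(gt, rec)]
    ssim = [structural_similarity(g, r, data_range=data_range) for g, r in zip(gt, rec)]
    return float(np.mean(psnr)), float(np.mean(ssim))

gt = np.random.rand(16, 128, 128)
rec = np.clip(gt + 0.05 * np.random.randn(16, 128, 128), 0, 1)
print(video_metrics(gt, rec))
```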

5.1 Training

We use the DAVIS2017 [37] dataset to train our proposed method. This dataset contains 90 different scenes at two resolutions, $894 \times 480$ and $1920 \times 1080$, in color. For our purposes, we train the system with grayscale versions of the dataset, working with $128 \times 128$ patches and 16 frames. After data augmentation, we split the dataset into 3800 tensors for training and 200 for validation, training for 150 epochs with a learning rate of 0.0001. The model was trained using an NVIDIA GeForce RTX 4070 Ti GPU. We train two models by minimizing the error between the real and estimated videos, using the loss function

$$\mathcal{L}= \|\mathbf{f}-\mathcal{N}(\mathbf{g})\|_2^2$$
where $\mathcal {N}$ represents the forward pass of the proposed model $\hat {\mathbf {f}} = \mathcal {N}(\mathbf {g})$.
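
As a minimal sketch of this training objective, assuming a PyTorch setup with a placeholder model standing in for the full reconstruction network, the following step applies the mean-squared-error form of Eq. (9) with the learning rate reported above; the Adam optimizer and the batch shapes are our assumptions.

```python
# One optimization step minimizing ||f - N(g)||_2^2 (here as a mean-squared error).
import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1))   # placeholder for N(g)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.MSELoss()

def training_step(g, f_true):
    """g: stacked measurements (B, 3, N, N); f_true: target video (B, N_F, N, N)."""
    optimizer.zero_grad()
    f_hat = model(g)                  # forward pass, f_hat = N(g)
    loss = criterion(f_hat, f_true)   # squared-error loss between video cubes
    loss.backward()
    optimizer.step()
    return loss.item()

g = torch.rand(2, 3, 128, 128)
f_true = torch.rand(2, 16, 128, 128)
print(training_step(g, f_true))
```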

5.2 Simulation results

We compare our simulation results with two state-of-the-art frame interpolation methods: multiple video frame interpolation via enhanced deformable separable convolution (EDSC) [40] and FILM [36]. Both methods offer two reconstruction options: (1) interpolating exactly the middle frame on the temporal axis, or (2) dynamic interpolation, where a specific timestamp can be chosen for interpolation. We used the second option to obtain the same video $\mathbf {f}$ from the three reference measurements without temporal coding, in order to compare against our three measurements that include one temporally coded frame.

As we can observe from the averaged quantitative results in Table 1, our proposed method vastly surpasses both traditional frame interpolation methods in terms of PSNR and SSIM. However, due to its higher number of parameters, our model incurs a longer inference time.

Table 1. Quantitative results on the DAVIS2017 dataset. The numbers in bold represent the best performance.

To present our results in qualitative terms, we showcase one of the reconstructions in Fig. 4. It is evident that our method exhibits lower intensity in the absolute-error regions of the frames. Additionally, for frame 15 the FILM method demonstrates superior spatial resolution; however, owing to the nature of frame interpolation techniques, inaccuracies in the motion estimation lead to incorrect structural positioning, resulting in greater error when compared with the ground truth. In contrast, the proposed model is slightly blurry in spatial resolution, but the structure on the right side of the frame preserves better positioning.

Fig. 4. Reconstruction of one of the testing datasets. (a) Ground truth; (b) interpolated frames with the FILM method; (c) interpolated frames with the EDSC method; (d) reconstructed frames from the proposed scheme. The upper triangular region of each frame shows the reconstruction, and the lower triangular region corresponds to the absolute difference between the ground truth and the recovered frame.

Additional results from simulated data are presented in Fig. 5 and the companion video (Visualization 1), which confirm that the FILM method yields better results in terms of the spatial quality of the frames. However, in terms of temporal consistency, the proposed method is superior relative to the ground truth. Notably, in triads b, d, g, i, and l, the positional mismatch is clearly more pronounced in the FILM reconstruction, while our reconstructions, albeit slightly blurry, still maintain the correct position of the moving parts. Note that the only difference between the methods' inputs is that FILM uses three snapshots of the video, while the proposed method replaces the middle snapshot with the compressive coded-illumination frame.

Fig. 5. Reconstruction results for different videos of the testing dataset. For each video (represented by a different letter) there are three frames: ground truth, reconstruction with FILM, and our proposed reconstruction.

5.3 Experimental results

5.3.1 Hardware used

The implementation of the proposed acquisition scheme can be seen in Fig. 6; it consists of the active illumination source, the control and synchronization hardware, and a pair of cameras. For the illumination source, we decided to use LEDs at 850 nm to avoid bothering viewers in the scene with high-frequency flashing lights. To compensate for the low quantum efficiency of the sensor at that wavelength, we built an array of six UniqueFire flashlights populated with 850 nm Osram LEDs. To control the LEDs, we designed a circuit using an IRF520 MOSFET as a power switch capable of handling the power required by the array. The MOSFET’s gate also serves as the input for the temporal mask coding strategy. We use an Arduino UNO to send the ON/OFF signals to the array, following the coded pulse sequence described in Fig. 2 (highlighted in red). The pulse width is set to 200 $\mu s$, while the off time is set to 4400 $\mu s$. In our tests, the PW duration is short enough to avoid blurring caused by fast-moving objects, yet long enough to ensure a good signal-to-noise ratio (SNR). The Arduino also controls the trigger and exposure time of the camera used for our method and of the camera used for the FILM and EDSC methods. The total acquisition cycle time of the system is 66 ms, capturing the first frame using a short exposure at the beginning and then compressing the equivalent of the next 14 frames into a longer coded exposure.

Fig. 6. The illumination system. (a) Each LED flashlight with a 75 mm focusing lens; (b) the pulse control module with the Arduino.

The cameras used for the coded and reference measurements (Basler acA1440-220um) are monochrome cameras with a frame rate of up to 227 fps and a resolution of $1440\times 1080$. Each camera is equipped with an Edmund 850 nm filter to avoid background light contamination during the acquisition process. As described in Fig. 2, we use the same camera to acquire both the short- and long-exposure measurements, while the other camera is solely used for capturing the short-exposure reference frames used for interpolation.

5.3.2 Reconstruction results

We tested our system with a series of scenes featuring different types of motion. The first scene, shown in Fig. 7, consists of a small black object rolling down a ramp and bouncing on the floor. We reconstruct 16 frames from the three measurements. In frames 10 to 15, we can observe that the FI methods tend to make the object disappear and reappear, whereas our method produces a slightly distorted reconstruction whose motion appears more natural and truly corresponds to that of a falling object. For a full reconstruction of all the measurements in this scene, please refer to Visualization 2.

Fig. 7. Experimental results of Scene 1: The measurements are stacked in the color channels, where cyan represents the first measurement, purple the second (the coded frame in our method), and yellow the third measurement. Row a shows the measurements and reconstructions for the FILM method. Row b shows the results for the EDSC method. Row c presents the outcomes of our proposed method.

6. Conclusion

We present a compressive temporal imaging system capable of effectively capturing and reconstructing high-speed motion from measurements modulated with active pulsed illumination that is constant in space but varies in time. We also developed a specially crafted deep neural network architecture for reconstruction. Simulation results demonstrate superior performance in recovering motion in the scenes when using active-illumination coded images instead of interpolating traditional frames. Moreover, experimental results demonstrate an effective compression ratio of more than 10x, while yielding more natural motion compared with the frame interpolation counterparts. Because the frame interpolation methods exclusively utilize key frames, they lack awareness of past and intermediate motion events; consequently, despite the non-linear nature of the neural networks used for reconstruction, they render the structure of the moving objects as continuous and smooth motion regardless of the actual dynamics. In contrast, the proposed deep learning reconstruction model is able to properly decode the full motion structure contained in the compressive coded-illumination frame, encompassing changes in velocity and direction of the moving parts within the scene.

By choosing to illuminate the scene rather than code the aperture, we can significantly reduce the complexity and cost of the acquisition hardware, while also being able to operate under low-light conditions. We believe the proposed technique can be implemented as a compressive video method for autonomous vehicles, offering an alternative to smart front-light illumination systems. Furthermore, employing multiple cameras would allow for the exploitation of different spectral bands, enabling diverse coding strategies for reconstructing high-speed color or multispectral videos.

Funding

Agencia Nacional de Investigación y Desarrollo (2022-21221399, ATE220022); Fondo Nacional de Desarrollo Científico y Tecnológico (1221883).

Disclosures

The authors declare no conflicts of interest.

Data availability

The model, training, and results can be replicated using the GitHub repository in Ref. [41].

References

1. M. Versluis, “High-speed imaging in fluids,” Exp. Fluids 54(2), 1458 (2013). [CrossRef]  

2. N. M. Law, C. D. Mackay, J. E. Baldwin, et al., “Lucky imaging: high angular resolution imaging in the visible from the ground,” Astron. Astrophys. 446(2), 739–745 (2006). [CrossRef]  

3. S. Hertegård, H. Larsson, T. Wittenberg, et al., “High-speed imaging: applications and development,” Logopedics Phoniatrics Vocology 28(3), 133–139 (2003). [CrossRef]  

4. J. N. Mait, G. W. Euliss, R. A. Athale, et al., “Computational imaging,” Adv. Opt. Photonics 10(2), 409–483 (2018). [CrossRef]  

5. P. Llull, X. Liao, X. Yuan, et al., “Coded aperture compressive temporal imaging,” Opt. Express 21(9), 10526–10545 (2013). [CrossRef]  

6. D. L. Donoho, “Compressed sensing,” IEEE Trans. Inf. Theory 52(4), 1289–1306 (2006). [CrossRef]  

7. X. Yuan, D. J. Brady, A. K. Katsaggelos, et al., “Snapshot compressive imaging: Theory, algorithms, and applications,” IEEE Signal Processing Magazine 38(2), 65–88 (2021). [CrossRef]  

8. R. Koller, L. Schmid, N. Matsuda, et al., “High spatio-temporal resolution video with compressed sensing,” Opt. Express 23(12), 15992–16007 (2015). [CrossRef]  

9. Z. Zhang, C. Deng, Y. Liu, et al., “Ten-mega-pixel snapshot compressive imaging with a hybrid coded aperture,” Photonics Res. 9(11), 2277–2287 (2021). [CrossRef]  

10. H. Huang, J. Teng, Y. Liang, et al., “Key frames assisted hybrid encoding for high-quality compressive video sensing,” Opt. Express 30(21), 39111–39128 (2022). [CrossRef]  

11. C. Yang, D. Qi, J. Liang, et al., “Compressed ultrafast photography by multi-encoding imaging,” Laser Phys. Lett. 15(11), 116202 (2018). [CrossRef]  

12. X. Yuan, Y. Sun, S. Pang, et al., “Compressive video sensing with side information,” Appl. Opt. 56(10), 2697–2704 (2017). [CrossRef]  

13. J. Holloway, A. C. Sankaranarayanan, A. Veeraraghavan, et al., “Flutter shutter video camera for compressive sensing of videos,” in 2012 IEEE International Conference on Computational Photography (ICCP), (IEEE, 2012), pp. 1–9.

14. A. Veeraraghavan, D. Reddy, R. Raskar, et al., “Coded strobing photography: Compressive sensing of high speed periodic videos,” IEEE Transactions on Pattern Analysis and Machine Intelligence 33(4), 671–686 (2010). [CrossRef]  

15. Y. Sun, X. Yuan, S. Pang, et al., “Compressive high-speed stereo imaging,” Opt. Express 25(15), 18182–18190 (2017). [CrossRef]  

16. Y. Ge, G. Qu, Y. Huang, et al., “Coded aperture compression temporal imaging based on a dual-mask and deep denoiser,” JOSA A 40(7), 1468–1477 (2023). [CrossRef]  

17. F. Guzmán, P. Meza, E. Vera, et al., “Compressive temporal imaging using a rolling shutter camera array,” Opt. Express 29(9), 12787–12800 (2021). [CrossRef]  

18. X. Yuan and S. Pang, “Compressive video microscope via structured illumination,” in 2016 IEEE International Conference on Image Processing (ICIP), (IEEE, 2016), pp. 1589–1593.

19. Y. Sun, X. Yuan, S. Pang, et al., “High-speed compressive range imaging based on active illumination,” Opt. Express 24(20), 22836–22846 (2016). [CrossRef]  

20. Y. LeCun, Y. Bengio, G. Hinton, et al., “Deep learning,” Nature 521(7553), 436–444 (2015). [CrossRef]  

21. W. Saideni, D. Helbert, F. Courreges, et al., “An overview on deep learning techniques for video compressive sensing,” Appl. Sci. 12(5), 2734 (2022). [CrossRef]  

22. M. Qiao, Z. Meng, J. Ma, et al., “Deep learning for video compressive sensing,” APL Photonics 5(3), 1 (2020). [CrossRef]  

23. M. Iliadis, L. Spinoulas, A. K. Katsaggelos, et al., “Deep fully-connected networks for video compressive sensing,” Digital Signal Processing 72, 9–18 (2018). [CrossRef]  

24. Z. Cheng, B. Chen, G. Liu, et al., “Memory-efficient network for large-scale video compressive sensing,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (2021), pp. 16246–16255.

25. J. Ma, X.-Y. Liu, Z. Shou, et al., “Deep tensor admm-net for snapshot compressive imaging,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, (2019), pp. 10223–10232.

26. J. Wu, Y. Guo, C. Deng, et al., “An integrated imaging sensor for aberration-corrected 3d photography,” Nature 612(7938), 62–71 (2022). [CrossRef]  

27. S. Shin, K. Kim, J. Yoon, et al., “Active illumination using a digital micromirror device for quantitative phase imaging,” Opt. Lett. 40(22), 5407–5410 (2015). [CrossRef]  

28. K. Lee, K. Kim, G. Kim, et al., “Time-multiplexed structured illumination using a dmd for optical diffraction tomography,” Opt. Lett. 42(5), 999–1002 (2017). [CrossRef]  

29. D. Reddy, A. Veeraraghavan, and R. Chellappa, “P2c2: Programmable pixel compressive camera for high speed imaging,” in CVPR 2011, (IEEE, 2011), pp. 329–336.

30. J. Dong, K. Ota, M. Dong, et al., “Video frame interpolation: A comprehensive survey,” ACM Transactions on Multimedia Computing, Communications and Applications 19, 1–31 (2023). [CrossRef]  

31. C. Cudalbu, B. Anastasiu, R. Radu, et al., “Driver monitoring with a single high-speed camera and ir illumination,” in International Symposium on Signals, Circuits and Systems, 2005. ISSCS 2005., vol. 1 (IEEE, 2005), pp. 219–222.

32. X. Yuan, Y. Liu, J. Suo, et al., “Plug-and-play algorithms for video snapshot compressive imaging,” IEEE Transactions on Pattern Analysis and Machine Intelligence 44(10), 7093–7111 (2021). [CrossRef]  

33. L. Wang, M. Cao, Y. Zhong, et al., “Spatial-temporal transformer for video snapshot compressive imaging,” IEEE Transactions on Pattern Analysis and Machine Intelligence 45, 9072–9089 (2022). [CrossRef]  

34. Z. Meng, S. Jalali, X. Yuan, et al., “Gap-net for snapshot compressive imaging,” arXiv, arXiv:2012.08364 (2020). [CrossRef]  

35. S. Zheng, X. Yang, X. Yuan, et al., “Two-stage is enough: a concise deep unfolding reconstruction network for flexible video compressive sensing,” arXiv, arXiv:2201.05810 (2022). [CrossRef]  

36. F. Reda, J. Kontkanen, E. Tabellion, et al., “Film: Frame interpolation for large motion,” in European Conference on Computer Vision, (Springer, 2022), pp. 250–266.

37. J. Pont-Tuset, F. Perazzi, S. Caelles, et al., “The 2017 davis challenge on video object segmentation,” arXiv, arXiv:1704.00675 (2017). [CrossRef]  

38. O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18, (Springer, 2015), pp. 234–241.

39. Z. Wang, A. C. Bovik, H. R. Sheikh, et al., “Image quality assessment: from error visibility to structural similarity,” IEEE transactions on image processing 13(4), 600–612 (2004). [CrossRef]  

40. X. Cheng and Z. Chen, “Multiple video frame interpolation via enhanced deformable separable convolution,” IEEE Transactions on Pattern Analysis and Machine Intelligence 44(10), 7029–7045 (2021). [CrossRef]  

41. F.O. Guzman, “Pulse Illumination Video,” GitHub (2023) [accessed 31 Oct 2023], https://github.com/FOGuzman/PulseIlluminationVideo

Supplementary Material (2)

Visualization 1: Reconstruction results for different datasets: on the left is the ground truth video, in the middle is the reconstruction using FILM, and on the right is the reconstruction of our proposed method.
Visualization 2: Experimental results for a continuous acquisition. The object is a small camera cap rolling down a ramp and bouncing on a platform. On the top are the measurements stacked on the color channels, and on the bottom, the reconstruction of that given sta



