
Ultra-high-speed four-dimensional hyperspectral imaging


Abstract

We propose, to the best of our knowledge, a novel deep learning–enabled four-dimensional spectral imaging system composed of a reflective coded aperture snapshot spectral imaging system and a panchromatic camera. The system simultaneously captures a compressively coded hyperspectral measurement and a panchromatic measurement. The hyperspectral data cube is recovered by the U-net-3D network. The depth information of the scene is then acquired by estimating a disparity map between the hyperspectral data cube and the panchromatic measurement through stereo matching. This disparity map is used to align the hyperspectral data cube and the panchromatic measurement. A designed fusion network is used to improve the spatial reconstruction of the hyperspectral data cube by fusing aligned panchromatic measurements. The hardware prototype of the proposed system demonstrates high-speed four-dimensional spectral imaging that allows for simultaneously acquiring depth and spectral images with an 8 nm spectral resolution between 450 and 700 nm, 2.5 mm depth accuracy, and a 1.83 s reconstruction time.

© 2024 Optica Publishing Group under the terms of the Optica Open Access Publishing Agreement

1. Introduction

Hyperspectral imaging (HSI) and depth imaging are distinct fields, each capturing essential information for interpreting a scene along a different dimension. Spectral information supports applications in remote sensing, agriculture, and medical imaging [1–3], while depth information provides critical functionality for autonomous driving, mobile photography, and robotics [4–6]. Traditionally, both HSI and depth imaging require scanning to acquire accurate measurements, but scanning acquisition is time-consuming and relies on cumbersome equipment. Owing to significant advances in image sensors and reconstruction algorithms, snapshot HSI [7–23] and non-scanning depth imaging [4,24–27] have developed rapidly and make it possible to acquire both depth and spectral information in a single exposure.

Researchers have developed many snapshot systems for fast HSI. Among these, snapshot imaging systems based on compressed sensing (CS) [28–30] exploit the sparsity of natural scenes and avoid the long acquisition times of scanning methods. A prominent CS-based snapshot imaging method is the coded aperture snapshot spectral imager (CASSI). In practice, dual-disperser CASSI (DD-CASSI) [7] and single-disperser CASSI (SD-CASSI) [9] have attracted extensive research interest. A newer CASSI system, reflective coded aperture snapshot spectral imaging (R-CASSI) [11], was recently proposed to balance system complexity and acquisition accuracy. For snapshot depth imaging, current research interest mainly focuses on passive vision [26,31,32] because of its low lighting requirements and simple implementation.

State-of-the-art spectral and depth imaging techniques can clearly be integrated into one system to capture spectral and depth information simultaneously. Because spectral and depth information are largely independent, recent approaches [33–36] typically acquire them separately and then register the two modalities. The most crucial challenge for these methods is therefore establishing a precise mapping between the different snapshot systems.

In this paper, we propose a high-speed multi-dimensional imaging system and provide an efficient snapshot acquisition process that obtains both spectral and depth information without complex scanning. The proposed four-dimensional (4D) acquisition system contains a compressive imaging branch, the R-CASSI system, and a panchromatic (PAN) acquisition branch. According to CS theory, the hyperspectral data cube can be reconstructed from a sparsely coded measurement. A disparity map between the hyperspectral cube and the PAN measurement is then estimated by a stereo-matching network. This disparity map contains the depth information of the scene and represents the mapping relationship between the two branches; it can therefore be used to warp the measurement from the PAN branch to the CASSI branch. The warped PAN measurement is applied to sharpen the blurred regions of the CASSI reconstruction caused by measurement noise and the limited generalization of the algorithm. To leverage the high-frequency details of the PAN measurement while maintaining hyperspectral accuracy, we implemented a fusion network containing several modules for feature extraction, fusion, and protecting the hyperspectral data from degradation. We validated the proposed method with an experimental hardware prototype and confirmed that the method can acquire 4D information with an 8 nm spectral resolution between 450 and 700 nm and with 2.5 mm depth accuracy. The reconstruction time per frame was 1.83 s. In addition, we reprojected several 4D models to show that the proposed system satisfies the needs of accurate 4D acquisition.

2. System principles

Figure 1 shows the structure of the proposed system. We first reconstructed the three-dimensional (3D) hyperspectral data cube from the R-CASSI measurement using the U-net-3D network and then synthesized a grayscale image from the reconstructed 3D hyperspectral data. We then leveraged the stereo-matching network to estimate a disparity map between the synthesized grayscale image and the PAN measurement. This disparity map was used to align the PAN measurement with the reconstructed 3D hyperspectral data. The aligned PAN image and the reconstructed 3D hyperspectral data were then used to generate the final hyperspectral data cube via the fusion network.

Fig. 1. Schematic illustration of the 4D spectral imaging system.

2.1 R-CASSI and hyperspectral reconstruction

As shown in Fig. 1, R-CASSI is composed of several optical elements (L1, BS, L2, prism, L3, reflective mask, and C1). According to CS theory, the modulation process of the R-CASSI branch can be written as

$$g = \Phi f + e.$$

Let $f \in \mathbb{R}^{n}$ represent the vectorized hyperspectral data cube, $g \in \mathbb{R}^{m}$ the measurement captured by the monochrome camera, and $e \in \mathbb{R}^{m}$ the observation noise of the R-CASSI system. The transfer matrix of the R-CASSI system is $\Phi \in \mathbb{R}^{m \times n}$. Because $m \ll n$, this equation describes a highly underdetermined system. Motivated by the strong performance of convolutional neural network (CNN)-based algorithms in image processing, we leveraged a U-net-3D network for R-CASSI reconstruction [11].
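To make the forward model concrete, the following minimal NumPy sketch simulates a coded, dispersed measurement from a hyperspectral cube under the simplifying assumption of a one-pixel spectral shear per band; the actual R-CASSI optical path and its calibration differ in detail.

```python
import numpy as np

def cassi_forward(cube, mask, shear=1, noise_std=0.0):
    """Simulate a coded-aperture snapshot measurement g = Phi f + e.

    cube : (H, W, L) hyperspectral data cube f
    mask : (H, W) binary coded aperture
    shear: assumed per-band dispersion in pixels
    """
    H, W, L = cube.shape
    g = np.zeros((H, W + shear * (L - 1)))
    for l in range(L):
        coded = cube[:, :, l] * mask              # aperture coding
        g[:, l * shear:l * shear + W] += coded    # disperse and integrate
    if noise_std > 0:
        g += np.random.normal(0, noise_std, g.shape)  # observation noise e
    return g
```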

After recovering the hyperspectral data cube with U-net-3D, we synthesized a grayscale image from the data cube using the spectral response function of the monochrome camera. The grayscale image was then fed to the stereo-matching network for disparity estimation.
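A minimal sketch of this synthesis step, assuming the camera response is available as a per-band weight vector sampled at the 27 band centers:

```python
import numpy as np

def synthesize_gray(cube, response):
    """Collapse a (H, W, L) cube to a grayscale image using the
    camera's spectral response given as a length-L weight vector."""
    response = np.asarray(response, dtype=float)
    gray = np.tensordot(cube, response, axes=([2], [0]))  # weighted band sum
    return gray / (gray.max() + 1e-8)                     # normalize for matching
```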

2.2 Stereo matching

We first established the precise mapping relation between the two branches by calibrating and rectifying the grayscale images of the two systems with a checkerboard [37]. After rectification, disparity exists only along the x-axis and can be calculated as

$$D(x,y) = \mathop{\arg\min}\limits_{d} C\big(p^{l}(x,y),\, p^{r}(x+d,y)\big),$$
where $(x,y)$ is the pixel position, $p^{l}$ and $p^{r}$ extract patches from the left and right images at the given positions, $C$ is the matching cost between the two patches, and $d$ is a candidate disparity for this pixel. We computed the matching cost and generated disparity maps with a CNN-based network [32] consisting of a feature pyramid encoder and a feature volume decoder with skip connections.
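For illustration, a brute-force version of this cost minimization with a sum-of-absolute-differences cost looks as follows; the actual system uses the CNN-based hierarchical matcher of [32] rather than this exhaustive search, and the patch size and disparity range below are assumptions.

```python
import numpy as np

def block_match_disparity(left, right, max_disp=64, patch=5):
    """Brute-force D(x,y) = argmin_d C(p_l(x,y), p_r(x+d,y)) with SAD cost."""
    H, W = left.shape
    r = patch // 2
    disp = np.zeros((H, W), dtype=np.int32)
    for y in range(r, H - r):
        for x in range(r, W - r - max_disp):
            pl = left[y - r:y + r + 1, x - r:x + r + 1]
            costs = [np.abs(pl - right[y - r:y + r + 1,
                                       x + d - r:x + d + r + 1]).sum()
                     for d in range(max_disp)]
            disp[y, x] = int(np.argmin(costs))  # cost-minimizing disparity
    return disp
```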

2.3 Hyperspectral and panchromatic fusion

In practice, the obtained disparity map encodes the depth information of the scene and can be converted into metric depth using the lens focal length and the baseline. In addition, the disparity map can warp the PAN measurement pixel by pixel because it contains a pixel-wise correspondence between the PAN measurement $y^{p}$ and the synthesized grayscale image $y^{s}$ of the CASSI reconstruction. The warped PAN measurement, denoted $y^{w}$, covers the same image field as the original hyperspectral data cube and can therefore be used to enhance the hyperspectral reconstruction.
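The disparity-to-depth conversion and the pixel-wise warp can be sketched as below; the focal length in pixels, the baseline, and the sign convention of the disparity are assumptions that depend on the actual calibration.

```python
import numpy as np

def disparity_to_depth(disp, focal_px, baseline_mm):
    """Convert disparity (pixels) to metric depth: Z = f * B / d."""
    return focal_px * baseline_mm / np.maximum(disp, 1e-6)

def warp_pan(pan, disp):
    """Warp the PAN measurement y^p into the CASSI view using the pixel-wise
    disparity map (sign follows the matching equation above; it may need to
    be flipped for a given setup)."""
    H, W = pan.shape
    xs = np.arange(W)[None, :] + disp                 # shifted sampling grid
    xs = np.clip(np.round(xs).astype(int), 0, W - 1)
    ys = np.arange(H)[:, None].repeat(W, axis=1)
    return pan[ys, xs]                                # nearest-neighbour warp
```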

We deployed a CNN-based network, the fusion network, to fuse the lower-clarity hyperspectral data cube with the texture-rich PAN measurement. The architecture of the network is shown in Fig. 2.

Fig. 2. Architecture of the fusion network.

In general, high-frequency details are regarded as indispensable features and contribute greatly to image quality because they carry most of the texture detail. Fusing with only the original PAN image, however, may lose crucial details. We therefore applied seven high-frequency feature extractors to the PAN image: first-order difference operators along the x-axis and y-axis, the Laplacian operator, the Sobel operator, the Canny operator, the Prewitt operator, and the Roberts operator. The high-frequency feature maps, concatenated with the original PAN image and the reconstructed hyperspectral data, form a preliminary feature block that is fed to five different feature extraction modules.
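A minimal OpenCV sketch of these seven extractors is given below; the kernel sizes and Canny thresholds are illustrative assumptions, not the values used in the paper.

```python
import cv2
import numpy as np

def high_freq_features(pan):
    """Stack of the seven high-frequency maps described above."""
    p = pan.astype(np.float32)
    dx = cv2.filter2D(p, -1, np.array([[-1, 1]], np.float32))     # x difference
    dy = cv2.filter2D(p, -1, np.array([[-1], [1]], np.float32))   # y difference
    lap = cv2.Laplacian(p, cv2.CV_32F)
    sobel = cv2.magnitude(cv2.Sobel(p, cv2.CV_32F, 1, 0),
                          cv2.Sobel(p, cv2.CV_32F, 0, 1))
    canny = cv2.Canny(cv2.convertScaleAbs(p), 50, 150).astype(np.float32)
    prewitt_kx = np.array([[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]], np.float32)
    prewitt = cv2.magnitude(cv2.filter2D(p, cv2.CV_32F, prewitt_kx),
                            cv2.filter2D(p, cv2.CV_32F, prewitt_kx.T))
    roberts = cv2.magnitude(
        cv2.filter2D(p, cv2.CV_32F, np.array([[1, 0], [0, -1]], np.float32)),
        cv2.filter2D(p, cv2.CV_32F, np.array([[0, 1], [-1, 0]], np.float32)))
    return np.stack([dx, dy, lap, sobel, canny, prewitt, roberts], axis=0)
```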

As shown in Fig. 2, we first deployed two simple modules, each with a two-dimensional (2D) convolutional block and a rectified linear unit. To obtain different receptive fields, we used 3 × 3 and 5 × 5 convolutional kernels, respectively, in these two modules. The other feature extraction branches are the MRF Module, the Unet Module, and the SPP Module. The MRF Module is a multilevel receptive field module: we stacked dilated convolutions in it to efficiently extract features at different levels of detail while retaining information at different scales. The Unet Module is a simple 2D U-net with an encoder–decoder structure; the encoder contains two 3 × 3 convolution layers and two max-pooling layers for downsampling, while the decoder contains two transposed convolution layers for upsampling, with skip connections added to mitigate vanishing gradients. The SPP Module is a spatial pyramid pooling (SPP) module: to capture context, it uses average pooling layers to compress features to four scales, transposed convolution layers to restore the output to the input size, and a final 3 × 3 convolution to reduce the feature dimensionality.
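The following PyTorch sketch illustrates two of these branches under assumed channel counts, dilation rates, and pooling scales; the paper's SPP module uses transposed convolutions for upsampling, which are replaced here by bilinear interpolation for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MRFModule(nn.Module):
    """Multilevel-receptive-field branch: parallel dilated 3x3 convolutions."""
    def __init__(self, in_ch, out_ch, dilations=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, 3, padding=d, dilation=d)
             for d in dilations])
        self.merge = nn.Conv2d(out_ch * len(dilations), out_ch, 1)

    def forward(self, x):
        feats = [F.relu(b(x)) for b in self.branches]
        return F.relu(self.merge(torch.cat(feats, dim=1)))

class SPPModule(nn.Module):
    """Spatial-pyramid-pooling branch: pool to four scales, upsample back,
    concatenate, then a 3x3 convolution to reduce the feature dimensions."""
    def __init__(self, in_ch, out_ch, scales=(1, 2, 4, 8)):
        super().__init__()
        self.scales = scales
        self.reduce = nn.Conv2d(in_ch * (len(scales) + 1), out_ch, 3, padding=1)

    def forward(self, x):
        h, w = x.shape[-2:]
        pooled = [F.interpolate(F.adaptive_avg_pool2d(x, s), size=(h, w),
                                mode='bilinear', align_corners=False)
                  for s in self.scales]
        return F.relu(self.reduce(torch.cat([x] + pooled, dim=1)))
```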

The feature extraction modules above provide feature maps at different scales that capture coarse-to-fine details and context information. We therefore designed a multiscale fusion convolution structure, the MSF Module, to distill the crucial features. In this structure, we deployed progressive convolutions to extract both shallow and deep information from the feature maps and stacked several residual blocks to build a deep network while preventing degradation and vanishing gradients. The MSF Module thus has larger receptive fields and captures both concrete and abstract features with shallow and deep information.
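A compact PyTorch sketch of such a residual fusion stage is shown below; the number of residual blocks and the channel widths are illustrative assumptions.

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Plain residual block used inside the fusion stage."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1))
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(x + self.body(x))   # identity shortcut

class MSFModule(nn.Module):
    """Multiscale fusion: progressive convolutions over the concatenated
    branch outputs followed by a chain of residual blocks."""
    def __init__(self, in_ch, ch=64, out_ch=27, n_res=4):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(in_ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True))
        self.res = nn.Sequential(*[ResidualBlock(ch) for _ in range(n_res)])
        self.tail = nn.Conv2d(ch, out_ch, 3, padding=1)

    def forward(self, x):
        return self.tail(self.res(self.head(x)))
```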

In addition, to preserve spectral fidelity during spatial fusion, we defined a spectral attention module, the SA Module. It uses adaptive average pooling and two fully connected layers to generate spectral attention weights, helping the network emphasize informative channels and suppress invalid information.
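The SA Module can be sketched as a squeeze-and-excitation-style layer; the reduction ratio below is an assumption.

```python
import torch.nn as nn

class SAModule(nn.Module):
    """Spectral attention: adaptive average pooling plus two fully connected
    layers produce per-band weights that re-weight the spectral channels."""
    def __init__(self, bands=27, reduction=4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Linear(bands, bands // reduction), nn.ReLU(inplace=True),
            nn.Linear(bands // reduction, bands), nn.Sigmoid())

    def forward(self, x):                       # x: (N, bands, H, W)
        n, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(n, c)).view(n, c, 1, 1)
        return x * w                            # channel-wise re-weighting
```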

3. Experiment setup

The experimental hardware is shown in Fig. 3(a). The objective lenses L1 and L4 are identical prime lenses (M0814-MP2, COMPUTAR). The baseline of the proposed system is 5 cm. The remaining optics are from Thorlabs: L2 and L3 (ACA254-060-A), a prism (PS812-A), and a beam splitter (BS013). The cameras (C1 and C2) in the CASSI and PAN branches are both Basler acA2000-165µm cameras with 2048 × 1088 resolution and a 5.5 µm pixel pitch. Because the mask pixel size should be an integer multiple of the camera pixel size, we designed the mask with an 11 µm pixel size. The mask uses a random binary pattern {0, 1}, where 0 means the light passes through and 1 means the light is reflected, with a 1:1 ratio; such patterns have proven reliable for computational imaging [38–40]. Figure 3(b) shows the fabricated mask. It consists of a 2 cm × 2 cm glass base plate with a silver (Ag) reflecting layer and a silicon dioxide coating that prevents oxidation of the silver. An experimental measurement of the mask is shown in Fig. 3(c). The scene is illuminated by a wide-spectrum, high-flatness LED light (ABSOLUTE SERIES LED D65).

Fig. 3. (a) Experiment hardware deployment of the 4D spectral imaging system. (b) Our fabricated mask. (c) The experiment measurement of our mask.

Because of the luminous flux loss caused by the beam splitter, we set the exposure time of the CASSI branch to 20 ms, while that of the PAN branch was set to 10 ms. The captured images were in 12-bit RAW format. Our system acquired 27 spectral bands: 454.1, 459.3, 464.6, 470.2, 476.0, 482.0, 488.2, 494.7, 501.4, 508.4, 515.8, 523.4, 531.4, 539.7, 548.5, 557.6, 567.2, 577.3, 587.9, 599.1, 610.8, 623.2, 636.3, 650.1, 664.7, 680.2, and 696.6 nm.

To train U-net-3D with a simulation model that corresponds to the real imaging process, we captured the real mask pattern of the R-CASSI system and generated simulated measurements from the CAVE dataset [41] as network input. The training patches were 256 × 256 × 27. The batch size was 4, and the learning rate was 0.0001. We trained the model for 500 epochs with the Adam optimizer; each epoch contained 3,000 data samples.
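Reusing the forward-model function sketched in Section 2.1, one simulated training pair could be generated as follows; the random-cropping strategy and the band selection are assumptions.

```python
import numpy as np

def make_training_pair(cube, mask, patch=256, bands=27, rng=np.random):
    """Crop a patch from a CAVE hyperspectral cube, code it with the real
    captured mask, and return (measurement, ground-truth patch)."""
    H, W, _ = cube.shape
    y0 = rng.randint(0, H - patch)
    x0 = rng.randint(0, W - patch)
    gt = cube[y0:y0 + patch, x0:x0 + patch, :bands]       # simplified band pick
    meas = cassi_forward(gt, mask[y0:y0 + patch, x0:x0 + patch])
    return meas, gt
```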

We used four public datasets—Sceneflow [42], KITTI-15 [43], Middlebury-v3 [44], and ETH3D [45]—to train the stereo-matching network for 10 epochs. The learning rate was set to 0.001, and the batch size was set to 24.

The training process for the fusion network was specially designed to match our system. We first trained the U-net-3D network described above. We then used the KAIST dataset [46] to generate simulated measurements in the same way as for the U-net-3D training. These simulated measurements were passed through the well-trained U-net-3D network with its weights frozen, and the output hyperspectral data cubes were used as the input for training the fusion network. In addition, high-frequency features were extracted from the grayscale image and fed to the fusion network at the same time. The loss function of the fusion network is

$$Loss_{fusion} = \frac{1}{m}\sum_{i = 1}^{m} \| f_{i} - GT_{i} \|^{2},$$
where $f_{i}$ is the output hyperspectral data of the fusion network for the $i$-th sample, $GT_{i}$ is the corresponding ground-truth hyperspectral data cube, $m$ is the number of training samples, and $\| \cdot \|^{2}$ is the squared $\ell_2$ norm, so the loss is the mean square error. We trained the fusion network for 500 epochs. The learning rate was set to 0.001 and then shrunk by 10% every 200 epochs. The batch size was set to 10.
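A sketch of this optimization setup is given below. The stand-in module, the choice of Adam, and the reading of "shrunk by 10%" as multiplying the learning rate by 0.9 are assumptions; the loss, learning rate, decay interval, epoch count, and batch size come from the text.

```python
import torch.nn as nn
import torch.optim as optim

# Stand-in for the fusion network of Fig. 2; the assumed 35 input channels are
# the 27 reconstructed bands + the warped PAN image + its 7 high-frequency maps.
fusion_net = nn.Conv2d(35, 27, 3, padding=1)

criterion = nn.MSELoss()                                   # the l2 loss above
optimizer = optim.Adam(fusion_net.parameters(), lr=1e-3)   # Adam assumed
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=200, gamma=0.9)

for epoch in range(500):
    # per-batch forward/backward passes with batch size 10 go here
    scheduler.step()
```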

We trained the aforementioned networks with PyTorch on a machine equipped with two NVIDIA RTX A6000 GPUs. Training the U-net-3D network, the stereo-matching network, and the fusion network took 20, 25, and 50 hours, respectively.

4. Evaluation and results

4.1 Ablation study

To investigate the effectiveness of each component of our fusion network, we conducted an extensive ablation study, summarized in Table 1.


Table 1. Ablation Study of Our Fusion Network

Table 1(a), which serves as the baseline, shows the result when all modules are removed. Table 1(b)–(f) show the effectiveness of each component compared with the baseline, and Table 1(g)–(l) show the results when one or more modules are removed. As seen in Table 1(b), (d), and (f), the SA, Unet, and MSF modules are especially beneficial to the fusion network compared with the baseline: the SA module weights features correctly and retains valid information from different channels, the Unet module fully exploits features at different levels through its encoder–decoder structure, and the MSF module fully extracts features at different scales through its multiscale convolution structure and residual blocks. Although all components are verified to benefit the fusion network, the MRF and SPP modules (Table 1(c) and (e)) play a lesser role when used alone. However, when we consider the effect of removing a single module, the loss of the MRF or SPP module (Table 1(h) and (j)) each leads to roughly 1 dB of PSNR degradation, verifying that both are complementary to the other modules and enhance performance by assisting them. Table 1(g), (i), and (k) illustrate the ability of the SA, Unet, and MSF modules to extract adequate features for improving network performance: removing any of them individually results in a performance decrease of more than 2 dB. In Table 1(l), the absence of the feature extraction modules leaves the MSF module with too few features to gather sufficient effective information, leading to suboptimal performance of the overall network.

4.2 Experiment results of hyperspectral fusion

Figure 4(a) shows a plastic cube. The measurement of the PAN branch is shown in Fig. 4(b), and the measurement of the CASSI branch is shown in Fig. 4(c). The two spectral curves of the chosen regions are shown in Fig. 4(d) and (e).

Fig. 4. Experiment scene results of a cube reconstructed using U-net-3D and a fusion network. (a) RGB image. (b) Measurement of the PAN branch. (c) Measurement of the CASSI branch. (d) Spectral curve of the selected green region. (e) Spectral curve of the selected orange region. (f) U-net-3D reconstruction results. (g) Fusion network reconstruction results.

Because the U-net-3D reconstruction is sufficiently precise in the peak bands, the fusion network retains spectral details similar to those of the U-net-3D results. The lower part of Fig. 4 shows the visualization of this scene in 27 channels reconstructed by U-net-3D and by the fusion network. As shown in Fig. 4(f) and (g), the spatial results of the fusion network have finer details and sharper edges than those of U-net-3D, while the U-net-3D results show more artifacts in large areas. For example, the zoomed areas of the 577.3 nm and 696.6 nm bands in the U-net-3D results show no details, whereas the same areas in the fusion results show legible edges and obvious cracks. We conclude that the fusion network clearly suppresses the artifacts of the CASSI reconstruction while preserving the spectral prior knowledge.

4.3 Comparison with other methods

For the second experiment, the target scene consisted of two stacked “Thorlabs snack” boxes, as shown in Fig. 5. The top of Fig. 5 shows the reference RGB image, the spectral curves of selected areas of the RGB image, and zoomed details of the images. The lower part of Fig. 5 visualizes this scene for seven algorithms [14,47–51], showing the 548.5 nm and 696.6 nm spectral images of each algorithm and the grayscale image synthesized with the camera's spectral response curve. As shown in Fig. 5, the results of the two iterative algorithms (TwIST [47] and the mutually beneficial iterative (MBI) algorithm [48]) are excessively smooth, losing detailed textures while retaining indistinct regions. As shown in the zoomed details, the results of HD-net [49] and MST [50] exhibit a significant amount of noise; their disparity maps also show that the stereo-matching algorithm could not obtain exact disparity estimates, which is attributed to the low-quality spatial results. The results of TSA-net [14] are comparable to ours in terms of resolution; however, minor imperfections are observed in the TSA-net reconstruction, attributed to incorrect disparity estimation within the red-boxed region of the disparity map. As for ACVNet [51], it demonstrates excellent spatial performance, but there is an irregular distribution at the edges of its disparity map, as indicated by the blue-boxed area in Fig. 5.

Fig. 5. HSI reconstruction comparisons of the experiment scene with 2 (out of 27) spectral channels and the synthesized grey image. We show the spectral curves (top middle) corresponding to the selected white boxes of the RGB image.

For the spectral results, both iterative algorithms perform well in the selected region (a) but are inferior to the other four deep learning algorithms overall, which is consistent with the simulation results. All of the deep learning-based algorithms show excellent reconstruction in the selected regions.

Table 2 shows the time complexity of the different algorithms. All reconstruction algorithms were deployed on a computer with an Intel Xeon Gold 5218 CPU, 128 GB RAM, and two NVIDIA RTX A6000 GPUs running Linux. The size of the reconstructed hyperspectral image was 768 × 768 × 27. Compared with the iterative algorithms, the deep learning-based algorithms show lower time complexity. Although our algorithm takes slightly longer than the other deep learning algorithms, its advantages in spatial quality and depth estimation justify its use.


Table 2. Reconstruction Time of the Experiment Scene with Various Algorithms

4.4 Depth accuracy

We first captured 26 pairs of checkerboard images for distortion correction and for calculating the rotation and translation matrices of our system [37]. The rotation matrix and translation vector between the two branches are

$$R = \begin{pmatrix} 0.9996 & 0.0249 & -0.0134 \\ -0.0248 & 0.9997 & 0.0077 \\ 0.0136 & -0.0073 & 0.9999 \end{pmatrix}, \quad T = \begin{pmatrix} -49.8301 \\ 0.2502 \\ -2.1794 \end{pmatrix}.$$

Subsequently, we positioned the checkerboard as a scene at a distance of 1 m. Ideally, all points of the reconstructed 3D point cloud of the checkerboard would lie in a single plane. However, because of inevitable reconstruction errors, some of the 3D points deviate from this plane. We therefore calculated the residuals between the reconstructed 3D point cloud and its fitted plane and used the standard deviation of these residuals to quantify the accuracy of the 3D reconstruction.
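A minimal NumPy sketch of this evaluation, assuming a least-squares plane fit and orthogonal residual distances (the paper does not specify the residual definition):

```python
import numpy as np

def plane_fit_std(points):
    """Fit z = ax + by + c to an (N, 3) point cloud and return the standard
    deviation of the orthogonal distances to the fitted plane."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    A = np.column_stack([x, y, np.ones_like(x)])
    coef, *_ = np.linalg.lstsq(A, z, rcond=None)     # [a, b, c]
    a, b, _ = coef
    dist = (z - A @ coef) / np.sqrt(a * a + b * b + 1.0)  # orthogonal residuals
    return dist.std()
```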

Figure 6(a) shows the selected zoomed region, and Fig. 6(b) shows the absolute values of the residuals between the 3D point cloud and the fitted plane. The standard deviation is about 2.5 mm, indicating that the depth accuracy of the proposed system is 2.5 mm. The depth accuracy experiments demonstrate the effectiveness of the proposed system for 3D reconstruction.

Fig. 6. (a) Measured planar board and selected region. (b) Absolute value of residuals between the 3D point cloud and the fitted plane.

4.5 Four-dimensional modeling

We deployed a complex scene to verify the precise 4D modeling ability of the proposed system. A toy with a complex texture is shown in Fig. 7(a). Figure 7(b) shows the depth map of the acquired scene, which accurately captures the toy’s depth levels. The reprojected model of the scene reconstructed by the proposed algorithm is shown in Fig. 7(c), and Fig. 7(d) shows the same model from a different viewpoint. Figure 7(e) shows the model obtained from the U-net-3D reconstruction alone; this model contains obvious artifacts. Figures 7(f), (g), and (h) show three selected bands (548.5 nm, 599.1 nm, and 696.6 nm) of the final reconstructed hyperspectral model. Videos of these models are provided in Visualization 1.

Fig. 7. Spectral 4D model of the toy. (a) RGB image of the scene. (b) Depth map of the scene. (c)–(d) Point cloud image of the PAN image reconstructed by the proposed algorithm from different viewpoints. (e) Point cloud image of the PAN image reconstructed by U-net-3D. (f) Point cloud image at 548.5 nm. (g) Point cloud image at 599.1 nm. (h) Point cloud image at 696.6 nm.

It is obvious that the shape of the toy was accurately recovered, which demonstrates that the proposed system and the reconstruction algorithm work effectively, even for depth discontinuities and occlusions.

However, despite the strong performance of our proposed method, it still has some limitations that could be addressed through hardware design. For instance, when the optics of our system have a shallow depth of field, the stereo-matching algorithm only receives blurry edges and textures of the targets. These blurry measurements may increase errors in the disparity map and thereby degrade the fusion results. Furthermore, since the proposed system is a dual-camera system, the effective field of view decreases. Because wide-angle lenses usually provide a large depth of field and a large field of view, the lenses of our system could be replaced with wide-angle lenses. However, deploying wide-field elements increases the calibration effort because of the large optical distortion at the edges; the system must therefore be carefully calibrated, especially in the edge regions, when a wide-field lens is used. In addition, since the system captures only two viewing angles, occlusions in the images (small features visible in one view but occluded in the other) also affect feature extraction. Introducing appropriate occlusion-handling algorithms to rectify occluded pixels could effectively resolve this issue.

5. Conclusion

In this paper, we presented an ultra-high-speed 4D spectral imaging system that simultaneously acquires reflectance spectra and 3D spatial data. The system has two parts: an R-CASSI branch for hyperspectral compressively coded measurement and a PAN branch for binocular stereo matching and spatial enhancement of the hyperspectral data. The disparity map estimated between the two branches contains the depth information and enables pixel-by-pixel warping of the PAN measurement. We then developed a novel fusion network with specially designed modules to fuse the warped PAN measurement with the hyperspectral data from the R-CASSI branch. A set of experiments on a plastic cube, a paper box, and a doll demonstrated the versatility and accuracy of the proposed system. The spectral resolution and depth accuracy of the system are 8 nm (450–700 nm) and 2.5 mm, respectively. Using the deep learning networks, the system reconstructs each frame in 1.83 s. Owing to its accurate reconstruction performance, which supports accurate 4D modeling, the system has substantial potential in application areas such as agriculture, biological analysis, and art and cultural authentication.

Funding

National Key Research and Development Program of China (No.2021YFF0901700); National Natural Science Foundation of China (No. 61821001, No. 62371056); State Key Laboratory of Information Photonics and Optical Communication (BUPT) (No. IPOC2021ZT18).

Disclosures

The authors declare no conflicts of interest.

Data availability

Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request.

References

1. R. Näsi, E. Honkavaara, P. Lyytikäinen-Saarenmaa, et al., “Using UAV-based photogrammetry and hyperspectral imaging for mapping bark beetle damage at tree-level,” Remote Sens. 7(11), 15467–15493 (2015). [CrossRef]  

2. T. Adão, J. Hruška, L. Pádua, et al., “Hyperspectral imaging: a review on UAV-based sensors, data processing and applications for agriculture and forestry,” Remote Sens. 9(11), 1110 (2017). [CrossRef]  

3. G. Lu and B. Fei, “Medical hyperspectral imaging: a review,” J. Biomed. Opt. 19(1), 010901 (2014). [CrossRef]  

4. M. Hansard, S. Lee, O. Choi, et al., Time-of-Flight Cameras: Principles, Methods and Applications, SpringerBriefs in Computer Science (Springer London, 2013).

5. D. Scharstein and R. Szeliski, “A taxonomy and evaluation of dense two-frame stereo correspondence algorithms,” Int. J. Comput. Vis. 47(1/3), 7–42 (2002). [CrossRef]  

6. T. Gruber, F. Julca-Aguilar, M. Bijelic, et al., “Gated2Depth: real-time dense lidar from gated images,” in 2019 IEEE/CVF International Conference on Computer Vision (ICCV) (IEEE, 2019), pp. 1506–1516.

7. M. E. Gehm, R. John, D. J. Brady, et al., “Single-shot compressive spectral imaging with a dual-disperser architecture,” Opt. Express 15(21), 14013 (2007). [CrossRef]  

8. C. Li, T. Sun, K. F. Kelly, et al., “A Compressive Sensing and Unmixing Scheme for Hyperspectral Data Processing,” IEEE Trans. Image Process. 21(3), 1200–1210 (2012). [CrossRef]  

9. A. Wagadarikar, R. John, R. Willett, et al., “Single disperser design for coded aperture snapshot spectral imaging,” Appl. Opt. 47(10), B44 (2008). [CrossRef]  

10. X. Lin, G. Wetzstein, Y. Liu, et al., “Dual-coded compressive hyperspectral imaging,” Opt. Lett. 39(7), 2044 (2014). [CrossRef]  

11. Z. Yu, D. Liu, L. Cheng, et al., “Deep learning enabled reflective coded aperture snapshot spectral imaging,” Opt. Express 30(26), 46822 (2022). [CrossRef]  

12. X. Lin, Y. Liu, J. Wu, et al., “Spatial-spectral encoded compressive hyperspectral imaging,” ACM Trans. Graph. 33(6), 1–11 (2014). [CrossRef]  

13. C. V. Correa, H. Arguello, and G. R. Arce, “Snapshot colored compressive spectral imager,” J. Opt. Soc. Am. A 32(10), 1754 (2015). [CrossRef]  

14. Z. Meng, J. Ma, and X. Yuan, “End-to-end low cost compressive spectral imaging with spatial-spectral self-attention,” in Computer Vision – ECCV 2020, A. Vedaldi, H. Bischof, T. Brox, and J.-M. Frahm, eds., Lecture Notes in Computer Science (Springer International Publishing, 2020), 12368, pp. 187–204.

15. Z. Zhao, Z. Meng, Z. Ju, et al., “A compact dual-dispersion architecture for snapshot compressive spectral imaging,” in Asia Communications and Photonics Conference 2021 (Optica Publishing Group, 2021), p. T4A.269.

16. R. Habel, M. Kudenov, and M. Wimmer, “Practical spectral photography,” Comput. Graph. Forum 31(2pt2), 449–458 (2012). [CrossRef]  

17. H. Arguello, S. Pinilla, Y. Peng, et al., “Shift-variant color-coded diffractive spectral imaging system,” Optica 8(11), 1424 (2021). [CrossRef]  

18. X. Cao, H. Du, X. Tong, et al., “A prism-mask system for multispectral video acquisition,” IEEE Trans. Pattern Anal. Mach. Intell. 33(12), 2423–2435 (2011). [CrossRef]  

19. Y. August, C. Vachman, Y. Rivenson, et al., “Compressive hyperspectral imaging by random separable projections in both the spatial and the spectral domains,” Appl. Opt. 52(10), D46 (2013). [CrossRef]  

20. L. Li, L. Wang, W. Song, et al., “Quantization-aware Deep Optics for Diffractive Snapshot Hyperspectral Imaging,” in 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (IEEE, 2022), pp. 19748–19757.

21. K. Ozawa, I. Sato, and M. Yamaguchi, “Hyperspectral photometric stereo for a single capture,” J. Opt. Soc. Am. A 34(3), 384 (2017). [CrossRef]  

22. L. Wang, Z. Xiong, D. Gao, et al., “Dual-camera design for coded aperture snapshot spectral imaging,” Appl. Opt. 54(4), 848 (2015). [CrossRef]  

23. S. K. Sahoo, D. Tang, and C. Dang, “Single-shot multispectral imaging with a monochromatic camera,” Optica 4(10), 1209 (2017). [CrossRef]  

24. S. Foix, G. Alenya, and C. Torras, “Lock-in time-of-flight (ToF) cameras: a survey,” IEEE Sens. J. 11(9), 1917–1926 (2011). [CrossRef]  

25. J.-R. Chang and Y.-S. Chen, “Pyramid stereo matching network,” in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (IEEE, 2018), pp. 5410–5418.

26. H. Fu, M. Gong, C. Wang, et al., “Deep ordinal regression network for monocular depth estimation,” in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (IEEE, 2018), pp. 2002–2011.

27. Y. Yao, Z. Luo, S. Li, et al., “MVSNet: depth inference for unstructured multi-view stereo,” in Computer Vision – ECCV 2018, V. Ferrari, M. Hebert, C. Sminchisescu, Y. Weiss, eds., Lecture Notes in Computer Science (Springer International Publishing, 2018), 11212, pp. 785–801.

28. D. L. Donoho, “Compressed sensing,” IEEE Trans. Inf. Theory 52(4), 1289–1306 (2006). [CrossRef]  

29. E. J. Candes and T. Tao, “Near-optimal signal recovery from random projections: universal encoding strategies?” IEEE Trans. Inf. Theory 52(12), 5406–5425 (2006). [CrossRef]  

30. E. J. Candès, “The restricted isometry property and its implications for compressed sensing,” Comptes Rendus Math. 346(9-10), 589–592 (2008). [CrossRef]  

31. Y. Xu, A. Giljum, and K. F. Kelly, “A hyperspectral projector for simultaneous 3D spatial and hyperspectral imaging via structured illumination,” Opt. Express 28(20), 29740 (2020). [CrossRef]  

32. G. Yang, J. Manela, M. Happold, et al., “Hierarchical deep stereo matching on high-resolution images,” in 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (IEEE, 2019), pp. 5510–5519.

33. J. Luo, E. Forsberg, S. Fu, et al., “4D dual-mode staring hyperspectral-depth imager for simultaneous spectral sensing and surface shape measurement,” Opt. Express 30(14), 24804 (2022). [CrossRef]  

34. M. H. Kim, T. A. Harvey, D. S. Kittle, et al., “3D imaging spectroscopy for measuring hyperspectral patterns on solid objects,” ACM Trans. Graph. 31(4), 1–11 (2012). [CrossRef]  

35. W. Feng, H. Rueda, C. Fu, et al., “3D compressive spectral integral imaging,” Opt. Express 24(22), 24859 (2016). [CrossRef]  

36. H. Rueda-Chacon, J. F. Florez-Ospina, D. L. Lau, et al., “Snapshot compressive ToF + spectral imaging via optimized color-coded apertures,” IEEE Trans. Pattern Anal. Mach. Intell. 42(10), 2346–2360 (2020). [CrossRef]  

37. Z. Zhang, “A flexible new technique for camera calibration,” IEEE Trans. Pattern Anal. Mach. Intell. 22(11), 1330–1334 (2000). [CrossRef]  

38. A. Goy, K. Arthur, S. Li, et al., “Low photon count phase retrieval using deep learning,” Phys. Rev. Lett. 121(24), 243902 (2018). [CrossRef]  

39. M. Lyu, W. Wang, H. Wang, et al., “Deep-learning-based ghost imaging,” Sci. Rep. 7(1), 17865 (2017). [CrossRef]  

40. M. Qiao, Z. Meng, J. Ma, et al., “Deep learning for video compressive sensing,” APL Photonics 5(3), 030801 (2020). [CrossRef]  

41. F. Yasuma, T. Mitsunaga, D. Iso, et al., “Generalized assorted pixel camera: postcapture control of resolution, dynamic range, and spectrum,” IEEE Trans. Image Process. 19(9), 2241–2253 (2010). [CrossRef]  

42. N. Mayer, E. Ilg, P. Hausser, et al., “A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (IEEE, 2016), pp. 4040–4048.

43. M. Menze and A. Geiger, “Object scene flow for autonomous vehicles,” in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (IEEE, 2015), pp. 3061–3070.

44. D. Scharstein, H. Hirschmüller, Y. Kitajima, et al., “High-resolution stereo datasets with subpixel-accurate ground truth,” in Pattern Recognition, X. Jiang, J. Hornegger, R. Koch, eds., Lecture Notes in Computer Science (Springer International Publishing, 2014), 8753, pp. 31–42.

45. T. Schops, J. L. Schonberger, S. Galliani, et al., “A Multi-view stereo benchmark with high-resolution images and multi-camera videos,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (IEEE, 2017), pp. 2538–2547.

46. I. Choi, D. S. Jeon, G. Nam, et al., “High-quality hyperspectral reconstruction using a spectral prior,” ACM Trans. Graph. 36(6), 1–13 (2017). [CrossRef]  

47. J. M. Bioucas-Dias and M. A. T. Figueiredo, “A new twist: two-step iterative shrinkage/thresholding algorithms for image restoration,” IEEE Trans. Image Process. 16(12), 2992–3004 (2007). [CrossRef]  

48. L. Wang, Z. Xiong, G. Shi, et al., “Simultaneous depth and spectral imaging with a cross-modal stereo system,” IEEE Trans. Circuits Syst. Video Technol. 28(3), 812–817 (2018). [CrossRef]  

49. X. Hu, Y. Cai, J. Lin, et al., “HDNet: high-resolution dual-domain learning for spectral compressive imaging,” in 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (IEEE, 2022), pp. 17521–17530.

50. Y. Cai, J. Lin, X. Hu, et al., “Mask-guided spectral-wise transformer for efficient hyperspectral image reconstruction,” in 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (IEEE, 2022), pp. 17481–17490.

51. G. Xu, J. Cheng, P. Guo, et al., “Attention concatenation volume for accurate and efficient stereo matching,” in 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (IEEE, 2022), pp. 12971–12980.

Supplementary Material (1)

Visualization 1: Video of the models reconstructed by different algorithms and of the models at different bands.

