Hyperspectral image reconstruction via patch attention driven network

Yechuan Qiu; Shengjie Zhao; Xu Ma; Tong Zhang; Gonzalo R. Arce

doi:10.1364/OE.479549

1. Introduction

Hyperspectral image data cubes contain abundant information in the spectral and spatial dimensions. In comparison to traditional RGB images, HSIs have higher spectral resolution thus representing more spectral characteristics. Although HSIs have lower spatial resolution due to the constrains of imaging hardware [1], the applications of super-resolution algorithms and other post-processing techniques help improve the image quality of HSIs [2]. HSIs are widely used in various fields, including computer vision [3], medical imaging [4] and remote sensing [5]. To obtain HSIs, conventional hyperspectral imaging systems use 2D detectors to scan the scenes, which is unsuitable for capturing the dynamic scenes. Based on compressive sensing theory [6], coded aperture snapshot spectral imaging (CASSI) [7–9] utilizes a coding mask and a prism to perform the spatial and spectral modulations so as to compress 3D cubes into 2D snapshots, as shown in Fig. 1. Inverse algorithms are used to reconstruct the HSIs from the measurements.

Fig. 1. The optical principle of the CASSI system. The hyperspectral scene is spatially modulated by the coded aperture and spectrally dispersed by the prism. The compressive measurement is then captured by the 2D detector.

Download Full Size | PDF

Conventional model-based HSI reconstruction methods employ different hand-crafted priors for regularization, such as sparse priors [10–13], low-rankness priors [14] and total variation priors [15,16]. However, these pre-defined priors are often inadequate and limit the reconstruction performance. Besides, the iterative optimization process of model-based methods results in a low reconstruction speed. These disadvantages restrict the practical application of the CASSI system and similar snapshot spectral imaging methods [17]. With the rapid advances of deep learning and its applications to computational imaging, many deep learning-based algorithms were introduced into HSI reconstruction with remarkable success [18–20].

There are some different deep learning approaches in the HSI reconstruction. The deep unrolling reconstruction algorithms transfer the iterative optimization process into deep neural networks [21,22]. The plug-and-play algorithms use the pre-trained denoising networks as the reconstruction priors [23]. But in most cases, the deep convolution neural networks (CNNs) are used to model the reconstruction mapping from the compressive measurement to the HSI. Due to the structure of local receptive fields, CNNs have strong ability to model local features but are limited in capturing long-range relationships. Increasing the depth of CNNs can improve the modelling capacity, but also introduces obstacles in network training [24]. Besides, not all features extracted from the measurements have equal contribution to the reconstruction, but CNNs treat them equally. The lack of discriminative learning ability weakens the learning ability of CNNs [25]. To compensate for these deficiencies, the attention mechanisms along with the elaborate network structures were introduced for HSI reconstruction. Miao et al. designed a two stage reconstruction network and proposed a pixel-wise self-attention to model the spatial correlation [26]. Meng et al. proposed an axis-wise self-attention module to model the spatial and spectral features and added it into the U-net [27]. Cai et al. presented a spectral-wise multi-head self-attention to capture the inter-spectra similarity and combined it with a symmetrical network structure [28].

Our work mainly focuses on using the attention mechanism to capture the uneven distribution of feature information. Recently, ClassSR proposed a general framework for super-resolution problems, where images were spatially decomposed into small sub-images and sorted into three categories according to their complexities [29]. The categories were then sent to the different super-resolution networks, which signified that distinct image regions had dissimilar restoration difficulties. This triggers us that the uneven distribution of the feature information in both spatial and spectral domains may serve as a heuristic clue to guide the HSI reconstruction network. In the previous low-level visual works, channel-wise attention has been used to help the networks concentrate on important channels [25,30]. Inspired by these works, we propose the patch attention to differentiate informative regions in both spectral and spatial dimensions. Specifically, we apply the patch attention to model the long-range correlations between the patches instead of catching all interactions in the local and non-local pixels to improve the network efficacy.

Furthermore, we propose the attention guided deep residual network (AGDRN) by combining the patch attention with an multilevel residual structure. We apply the local skip connections, group skip connections and global skip connections to construct hierarchical modules, which enable the network to transfer the feature information of different frequency levels between the modules. The patch attention modules are integrated into the structure to enhance the discriminative learning capability of the network.

In addition to the reconstruction network, we propose an effective information integrating method, namely complementary input (CI), for data pre-processing. At the beginning, the reconstruction models merely used the measurement as the input [26]. Since HSIs are modulated by the coding mask, the spatial encoding information plays an important role in the reconstruction [31]. Subsequently, the mask information was introduced into the measurement representation by conducting an element-wise multiplication between them [27,32,33], which unfortunately corrupts the spatial structure of the measurement to solve the ill-posed inverse image reconstruction problem. Different from previous works, we further concatenate the multiplication result with the measurement in the channel dimension to compensate for the corrupted representation. Experimental results demonstrate that our CI method can be easily combined with existing networks and obtains significant performance improvement in HSI reconstruction.

The main contributions of this work are listed as follows:

- The patch attention module is proposed to capture the uneven feature distribution by exploiting the global correlations in the spectral and spatial dimensions.
- The complementary input method is designed to effectively integrate the measurement with the coded aperture.
- Extensive experimental results in simulation illustrate that our method outperforms state-of-the-art (SOTA) methods.

Section 2 introduces the mathematical model of the CASSI system. In Section 3, we illustrate the proposed network AGDRN, patch attention mechanism and complementary input method. In Section 4, we perform simulation experiments to validate the proposed methods. Section 5 concludes this paper.

2. Measurement model

As depicted in Fig. 1 and Fig. 2, the scene is represented by the 3D HSI cube which is first spatially coded by a coding mask, and then spectrally dispersed by a prism. Finally, the data cube is captured by the detector plane as a 2D compressive measurement. Let $\mathbf {X} \in \mathbb {R}^{H \times W \times C}$ represent the HSI cube, and let $\mathbf {M} \in \mathbb {R}^{H \times W}$ denote the coding mask. After spatial modulation, we have

(1)$$\mathbf{X'}(i, j, k) = \mathbf{X}(i, j, k) \mathbf{M}(i, j),$$

where $\mathbf {X'}\in \mathbb {R}^{H \times W \times C}$ denotes the modulated HSI, $1 \le i\le H$, $1 \le j \le W$ and $1 \le k \le C$ index the voxel in the data cube.

Fig. 2. An illustration of optical flow in CASSI system

Download Full Size | PDF

After dispersion, the image in each spectrum channel of the cube shifts spatially [9]. Let $\mathbf {X''}\in \mathbb {R}^{H \times (W+d(C-1)) \times C}$ be the shifted HSI. Then, the $(i,j,k)$-th voxel of $\mathbf {X''}$ can be expressed as

(2)$$\mathbf{X''}(i, j, k) = \mathbf{X'}(i, j+d(k-1), k),$$

where $d$ denotes the shifting step.

Finally, the measurement captured by the detector can be written as

(3)$$\mathbf{Y}(i, j) = \sum_{k=1}^{C} \mathbf{X''}(i, j, k) + \mathbf{N}(i, j),$$

where $\mathbf {Y} \in \mathbb {R}^{H \times (W+d(C-1))}$ denotes the measurement and $\mathbf {N} \in \mathbb {R}^{H \times (W+d(C-1))}$ denotes the imaging noise. The measurement process can be described in matrix form as

(4)$$\mathbf{y} = \mathbf\Phi \mathbf{x} + \mathbf{n},$$

where $\mathbf {y},\mathbf {n} \in \mathbb {R}^{H(W+d(C-1))}$ are the vector forms of $\mathbf {Y}$ and $\mathbf {N}$, $\mathbf \Phi \in \mathbb {R}^{H(W+d(C-1))\times HWC}$ denotes the measurement matrix of the CASSI system, and $\mathbf {x} \in \mathbb {R}^{HWC}$ is the vector form of $\mathbf {X}$.

3. Reconstruction network

Different from previous reconstruction approaches of CASSI, we introduce a new network architecture for HSI reconstruction. As depicted in Fig. 3, it consists of two stages: data pre-processing and reconstruction model. In the data pre-processing stage, we use the complementary input method to aggregate the measurement data with the mask and generate the initialized input for the reconstruction model, as shown in Fig. 4. In the reconstruction model, we design the attention guided deep residual network to combine the patch attention module with a multilevel residual structure, as shown in Fig. 5 and Fig. 6. The details are presented as follows.

Fig. 3. The proposed reconstruction network architecture.

Download Full Size | PDF

Fig. 4. Illustration of the complementary input method. (a) The measurement contains spectral correlation information due to the spatial modulation and dispersive effect in the CASSI measurement process. (b) To initialize the network input, the measurement is first reshaped to a cube by cutting and shifting operations. Then, an element-wise multiplication between the coding mask and the cube is conducted to integrate them, which also corrupts the measurement representation. We further concatenate it with the measurement cube to complement the corrupted spectral information.

Download Full Size | PDF

Fig. 5. The structure of the attention guided deep residual network. It uses patch attention to guide residual blocks in the multilevel skip connections.

Download Full Size | PDF

Fig. 6. Details of the patch attention module. (a) Our patch attention calculates attention maps along the spectral and spatial dimensions separately. (b) The spectral patch attention uses pooling operations to aggregate spectral information from channels. (c) The spatial patch attention decomposes the feature map into patches and extract the spatial features from patches. (d) The components of feed forward network (FFN).

Download Full Size | PDF

3.1 Complementary input

The HSI reconstruction network recovers the HSI $\mathbf {X}$ from the corresponding compressive measurement $\mathbf {Y}$ and the coding mask $\mathbf {M}$. Due to the spatial modulation and spectral dispersion in the measurement process, the compressive measurement is an accumulation of light rays with multiple wavelengths from adjacent spatial locations, as shown in Fig. 4(a). Previous data pre-processing methods [27,32,33] cut and shifted the measurement $\mathbf {Y}$ to a cube $\mathbf {Y}' \in \mathbb {R}^{H \times W\times C}$ as

(5)$$\mathbf{Y}'(i, j, k) = \mathbf{Y}(i, j-d(k-1)),$$

and then conducted an element-wise multiplication between $\mathbf {Y}'$ and $\mathbf {M}$ to introduce spatial fidelity information as

(6)$$\mathbf{Y}''(i, j, k) = \mathbf{Y}'(i, j, k) \mathbf{M}(i, j).$$

The preprocessed data cube $\mathbf {Y}''\in \mathbb {R}^{H \times W\times C}$ was then fed into the reconstruction model as the input. Previous experiments [28] have shown that integrating mask information leads to better reconstruction than just using the measurement cube.

However, as shown in Fig. 4(b), although the imaging information of each wavelength is retained in the corresponding channel, the element-wise multiplication disrupts the measurement representation which characterize the spectral correlation features. It brings obstacles for the neural network to capture the interdependencies, and eventually limits the reconstruction performance.

Our method further concatenate $\mathbf {Y}''$ and $\mathbf {Y'}$ in the spectral dimension as

(7)$$\mathbf{Y}'''' = \mathrm{concatenate}(\mathbf{Y}',\mathbf{Y}''),$$

where $\mathbf {Y'''}\in \mathbb {R}^{H \times W\times 2C}$ denotes the final initialized input. By this simple operation, we complement the corrupted measurement information with additional channels while preserving the guidance effect of the mask. Since our method only operates in the data pre-processing stage, it can be easily combined with existing networks to improve the reconstruction performance. Note that no additional measurements are need in the complementary data pre-processing stage shown in Eq. (7).

As for the mask choice, following the setting of many previous HSI reconstruction methods, we apply a random binary coded aperture with 50% transmittance, which usually serves as a baseline in various CASSI systems [34]. Notably, some research has shown that an optimized coded aperture may bring better reconstructions, such as the blue noise coded aperture [31], the colored coded aperture [35], and the deep learning optimized coded aperture [18,36].

3.2 Attention guided deep residual network

Previous HSI reconstruction works usually use the U-net as the backbone network [26,33]. In the proposed AGDRN, we introduce a new multilevel residual structure and explore the relationship between the attention module and residual learning. The structure of AGDRN is depicted in Fig. 5. Our network mainly consists of three parts: a feature extraction layer, $K$ feature processing groups, and a reconstruction layer.

The feature processing groups are the main body of the network, which are constructed by the global skip connection, group skip connections and the local skip connections. In each group, we place the attention module first to emphasize or suppress different regions in the feature map, and then use $N$ residual blocks (ResBlocks) [24] to process the rescaled feature map.

Specifically, in the feature extraction layer, one convolution layer is used to extract a feature map $\mathbf {X}_e\in \mathbb {R}^{H \times W \times C_f}$ from the input data cube $\mathbf {Y}'''$ as

(8)$$\mathbf{X}_e = f^{5\times5}(\mathbf{Y}'''),$$

where $f^{5\times 5}$ denotes a convolution layer with a $5\times 5$ filter. To sufficiently extract feature information from the input data, a large size filter is applied in this layer. During the feature extraction, the spatial size of the data cube stays unchanged, while the number of its spectral channels is raised from $C$ to $C_f$.

The $K$ feature processing groups are combined by a global skip connection. A convolution layer is added after the last group for the feature fusion. The feature map $\mathbf {X}_g \in \mathbb {R}^{H \times W \times C_f}$ processed by the this part can be represented as

(9)$$\mathbf{X}_g = f^{5\times5}(G_{K}(G_{K-1}({\ldots}G_1(\mathbf{X}_e){\ldots})))+\mathbf{X}_e,$$

where $G(\cdot )$ denotes the operations of a feature processing group. Each feature processing group consists of an residual attention module, $N$ ResBlocks, and a group skip connection. Let $\mathbf {T}\in \mathbb {R}^{H \times W \times C_f}$ denote the input feature map of the module or block, the operation of the feature processing group can be expressed as

(10)$$G(\mathbf{T}) = f^{5\times5}(R_{N}(R_{N-1}({\ldots}R_{1}(A(\mathbf{T})+\mathbf{T}){\ldots})))+\mathbf{T},$$

where $R(\cdot )$ denotes the operations of a ResBlock, and $A(\cdot )$ denotes the attention module. A ResBlock contains two convolutions with the $3\times 3$ filters, a ReLU activation function and a local skip connection, which can be written as

(11)$$R(\mathbf{T}) = f^{3\times3}(\mathrm{ReLU}(f^{3\times3}(\mathbf{T})))+\mathbf{T}.$$

The combination of ReLU activation function and the small size kernels convolution performs well with low computational costs in the vision tasks. Due to the characteristics of the residual structure, the feature map maintains its shape in the feature processing groups.

In the reconstruction layer, we apply a convolution layer to generate the HSI from the processed feature map $\mathbf {X}_g$. The reconstructed HSI $\mathbf {X}_r \in \mathbb {R}^{H \times W \times C}$ can be represented as

(12)$$\mathbf{X}_r = f^{5\times5}(\mathbf{X}_g).$$

Finally, the feature map of the last reconstruction layer is reduced to the shape of the HSI cube, and its values are reorganized into the HSI voxels. Since the features are mapped to the HSIs in this layer, we use a large filter to better express details.

3.3 Patch attention

The features of HSIs are unevenly distributed in the spatial and spectral domains. Furthermore, the HSIs are spatially coded by the coding mask during the measurement process, which causes fidelity differences from region to region. By giving discriminative weights to the different positions of the HSIs, attention modules rescale the feature map and allow the network to focus on the informative and high-fidelity regions.

Past approaches to account for channel attention-based and self-attention-based methods have been proposed to generate the channel-wise spectral attention map [28,30]. As for the spatial attention, previous approaches calculated the attention map by modelling the correlation between each pixel and every other pixel, leading to large network sizes and high computational complexity [26]. To reduce the computation cost, some researchers employed the self-attention in a local area instead of globally [37], and some applied attention along individual axes to approximate the spatial attention [27]. Different from previous methods, our method decomposes the feature map into patches to utilize the sparsity of the HSIs. The patches are seen as tokens to calculate the global attention map. Benefiting from this design, the attention module could focus on catching the long-range correlations between the patches, thus improving the network efficiency. The local correlations inside each patch are left to the subsequent convolution module, since convolution layers have the strong ability in modelling local interactions.

Our attention module deduces the patch-wise attention maps in the spectral and spatial dimensions separately. As shown in Fig. 6(a), we calculate the spectral maps first, and then calculate the spatial maps. Given an input feature map $\mathbf {T}\in \mathbb {R}^{H \times W \times C_f}$, the overall process of the patch attention $A(\cdot )$ can be represented as

(13)$$A(\mathbf{T}) = A_s(A_c(\mathbf{T})),$$

where $A_s(\cdot )$ denotes the spatial attention, and $A_c(\cdot )$ denotes the spectral attention.

The spectral attention submodule is illustrated in Fig. 6(b). We divide the feature map into patches along the spectral channel. A common practice is to use the average pooling to aggregate the information from channels [30,38]. As the max-pooled features have also shown to be meaningful in the image classification tasks, and using the both types of pooling leads to finer attention inference than the single [39], we employ them simultaneously to compress the $C_{f}$ channels into $C_{f}/P_{c}$ descriptors, where $P_{c}$ denotes the channel patch size. Then, we apply the feed forward network to model the nonlinear interactions between the channels, and use the sigmoid activation to normalize the attention map. After generating the spectral attention map, we conduct a channel-wise multiplication to rescale the feature map. Let $\sigma$ denotes the sigmoid function, and the spectral attention submodule $A_{c}(\cdot )$ can be expressed as

(14)$$A_{c}(\mathbf{T}) = \mathbf{T}\odot_c\sigma(\mathrm{FFN}(\mathrm{Avg}(\mathbf{T}))+\mathrm{FFN}(\mathrm{Max}(\mathbf{T}))),$$

where $\mathrm {Avg}(\cdot )$ and $\mathrm {Max}(\cdot )$ denote the average pooling and the max pooling, and $\odot _c$ refers to the channel-wise multiplication.

The spatial attention submodule is shown in Fig. 6(c). As the information in HSI is spatially sparse, generating a pixel-wise attention map is costly and unnecessary. We decompose the feature map into patches and calculate the attention weight for each patch. Specifically, for an input feature map $\mathbf {T}\in \mathbb {R}^{H \times W \times C_f}$, we spatially split it into patches $\mathbf {T}_{p}\in \mathbb {R}^{H/P \times W/P \times P^2 \times C_f}$ and extract each patch in a descriptor, where $P$ denotes the spatial patch size. Then, we use the FFN and sigmoid function to model the global correlations between patches and generate the attention map. Finally, we conduct a patch-wise multiplication between it and the feature map. In practice, we apply a convolution layer to decompose the feature map into patches. The spatial attention submodule $A_{s}(\cdot )$ can be represented as

(15)$$A_{s}(\mathbf{T}) = \mathbf{T}\odot_p\sigma(\mathrm{FFN}(f^{P\times P}_P(\mathbf{T}))),$$

where $f^{P\times P}_P(\cdot )$ denotes the convolution layer with a $P\times P$ filter and a stride of $P$, and $\odot _p$ represents the patch-wise multiplication. With the $P\times P$ filter and a stride of $P$, the convolution layer can calculate in the non-overlapping patch regions, and extract the descriptors.

4. Experiments

4.1 Experiment setting

To evaluate the proposed HSI reconstruction network architecture, we conduct simulations on three public hyperspectral datasets ICVL [40], Harvard [41] and CAVE [42]. The ICVL dataset contains 201 HSIs at a size of $1300\times 1392\times 31$, and the wavelength range is from 400nm to 700nm with 10nm bandwidth for each channel. The Harvard dataset includes 50 HSIs at a size of $1040\times 1392\times 31$ in daylight illumination, and the wavelength range is from 420nm to 720nm. The CAVE dataset consists of 32 HSIs at a size of $512\times 512\times 31$, and the wavelength range is from 400nm to 700nm. We remove some HSIs with similar contents or backgrounds, and normalize the intensity of the rest HSIs. Then, we randomly pick 100 HSIs for training and 50 HSIs for testing from the ICVL dataset, 35 HSIs for training and 9 HSIs for testing from the Harvard dataset, and 26 HSIs for training and 6 HSIs for testing from the CAVE dataset. The training sets and testing sets are non-overlapping. For each dataset, we crop 5000 samples of size $256\times 256\times 31$ from the training set, and use 90% samples for training and the rest for validating.

To conduct experiments on simulation data, we use a random binary mask to modulate the samples and shift them with the step of 1 pixel to generate a synthetic measurement with dimension of $256\times 286$. The single shot measurement and the coded aperture are subsequently initialized as the input for networks to reconstruct the corresponding HSI. The training goal is to minimize the RMSE-based loss function as

(16)$$\mathcal{L}_{RMSE}(\mathbf{X}, \mathbf{X}_r) = \sqrt{\frac{1}{D}\sum^D_{d=1}\Vert \mathbf{X}-\mathbf{X}_r\Vert^2},$$

where $\mathbf {X}_r$ is the reconstructed HSI, $\mathbf {X}$ is the ground-truth, and $D$ denotes the number of training samples. We implement our network in Pytorch. The learning rate is initialized as 0.0002 and halved every 20 epochs during training. We apply AdamW optimizer with $\beta _1=0.9$, $\beta _2=0.999$ to train all models.

4.2 Comparison with state-of-the-art methods

In the implementation, we set channel number as $C_f=64$ the feature processing group number as $K=3$, and the Resblock number as $N=8$ to present our standard AGDRN model. To show the scalability and the efficiency of the design, we reduce the channel number and the Resblock number as $C_f=62$ and $N=5$ and limit all convolution kernel size to $3\times 3$ to establish the AGDRN-S as a small version model. More implementation details can be found in Code 1 (Ref. [43]). We compare our AGDRN with several SOTA HSI reconstruction methods, including two attention-based networks TSA-net [27] and SwinIR [37], a CNN-based methods SRN [32] and a deep prior-based DGSMP [44]. All methods are implemented based on their official source codes and evaluated with the same data, mask and setting.

To quantitatively compare the reconstruction performance, we use the peak signal to noise ratio (PSNR), the structured similarity index metrics (SSIM) and root mean square error (RMSE) as the evaluation indicators. For reconstructed HSIs on different datasets, we calculate the PSNR and SSIM value for each spectral channel and take the average as the final result. Higher PSNR and SSIM values and lower RMSE loss indicate better reconstruction performance. As shown in Table 1, the proposed networks significantly outperform other algorithms on the three datasets. Our AGDRN achieve the highest PSNR and SSIM of 38.19dB and 0.982 on the ICVL dataset, 37.73dB and 0.956 on the Harvard dataset, and 33.72dB and 0.953 on the CAVE dataset. In terms of RMSE loss, the AGDRN has the lowest value of 0.038 on the ICVL dataset, 0.054 on the Harvard dataset, and 0.117 on the CAVE dataset. The result of AGDRN-S decreased slightly, but still performs better than other compared methods. Since the test scenes have been normalized, the RMSE losses measured for different reconstruction methods are relatively low.

Table 1. The reconstruction results of different methods on three datasets

View Table | View all tables in this article

For visual comparison, we choose three representative test scenes from the CAVE, ICVL and Harvard datasets, and show the reconstructed HSIs in Figs. 7, 8 and 9. We select 4 of 31 spectral channels for exhibition, and provide the RGB images. The HSIs reconstructed by the AGDRN retain more details and introduce fewer artifacts and distortion than the other methods.

Fig. 7. Reconstructed HSIs obtained using different methods. This test scene is from the CAVE dataset. The RGB images and 4 out of 31 spectral channels are presented.

Download Full Size | PDF

Fig. 8. Reconstructed HSIs obtained using different methods. This test scene is from the ICVL dataset. The RGB images and 4 out of 31 spectral channels are presented.

Download Full Size | PDF

Fig. 9. Reconstructed HSIs obtained using different methods. This test scene is from the Harvard dataset. The RGB images and 4 out of 31 spectral channels are presented.

Download Full Size | PDF

Moreover, to compare the spectral accuracy, we plot the reconstructed spectral curves for two points on each above-mentioned test scene. The points are marked in the first sub-pictures of Figs. 7, 8 and 9. Figure 10 shows that our result is closest to the reference.

Fig. 10. Comparison of spectral accuracy for two points on each test scene. The points are indicated in Figs. 7, 8 and 9.

Download Full Size | PDF

We also compare parameter numbers and the computation time to analyze the model size and the time complexity of the different methods. The training time and evaluation time of all models in the ICVL dataset are listed in Table 2. Each model is trained on the GPU for 100 epochs, and reconstruct 50 test scenes for evaluation. Among the compared methods, our AGDRN-S has the fewest parameters. The AGDRN’s parameters are increased, but still at a low level compared with other attention-driven methods. As for the computation time, our AGDRN-S consumes more training time than the SRN which contains no attention computation, but less than the other methods. In terms of evaluation time (reconstruction time), it can be found that the deep network-based methods take a similar amount of time, which is less than the deep prior-based method (DGSMP).

Table 2. Model size and running times comparison for the different methods

View Table | View all tables in this article

4.3 Ablation study

We analyze the effects of complementary input method and patch attention module in this subsection.

We combine the CI method with three end-to-end reconstruction models on the ICVL dataset to evaluate its effectiveness and generality. To integrate CI, we only change the width of the first layer in these models, because the CI method doubles the channels of the input data. We compare the CI with the previous data pre-processing method, and list the results in Table 3(a). The reconstruction performance of all three models is significantly improved with the proposed CI method. TSA-Net boosts 1.38dB/0.009, SRN boosts 0.93dB/0.003 and AGDRN boosts 0.35dB/0 in terms of PSNR/SSIM.

Table 3. Comparison results of the ablation study

View Table | View all tables in this article

To investigate the effect of the PA modules, we further remove them from our network and conduct experiments on the ICVL, Harvard and CAVE datasets. As shown in Table 3(b), PA helps improve the network performance in all datasets, which demonstrates its stability. The attention weights in the first spatial and spectral patch attention module of the AGDRN are visualized to show its guiding effects. As plotted in Fig. 11, the network learns the long-range correlations information and discriminate the feature regions through the patch attention.

Fig. 11. Visualization of the attention weights on two test scenes

Download Full Size | PDF

Overall, the proposed CI method provides the original measurement while introducing the mask information, generally making it easier for the reconstruction models to learning the mapping from the snapshot to the HSI. The proposed PA module catches the global correlations in spatial and spectral dimensions, thus obtaining performance improvement in all datasets.

4.4 Experiment with the real mask

The masks in practical scenarios usually contain some noise due to the fabrication error, which has a significant impact toward the reconstruction performance. To thoroughly ensure the performance and robustness of the proposed method, we further conduct experiments on a benchmark dataset which includes a mask captured from the real CASSI system [27]. See [45] for a description of mask fabrication technology and limitations. The training set contains 28 channels of the CAVE dataset while the testing set employs 10 scenes from the KAIST dataset with the same channel count [20]. The reconstruction results of several start-of-the-art methods are reported in Table 4. Our method achieves the best reconstruction performance.

Table 4. Performance comparison with the real coded aperture

View Table | View all tables in this article

5. Conclusion

In this paper, we propose a novel, to our knowledge, network architecture for HSI reconstruction. To effectively integrate the compressive measurement and the coding mask during the data pre-processing stage, we propose the complementary input method to compensate for the corrupted measurement representation. To fully exploit the potential of the residual learning and explore the relationship between the attention mechanism and the residual structure, we introduce the multilevel residual structure and combine it with the patch attention module. To differentiate informative regions in spectral and spatial dimensions, we develop the patch-wise attention to rescale the feature map. Experiments of comparison illustrate that our method exceeds the state-of-the-art algorithms. Ablation experiments shows the effectiveness of the proposed methods. We hope the proposed network architecture will benefit future works in compressive hyperspectral image reconstruction.

Funding

National Key Research and Development Program of China (2019YFB2102300); National Natural Science Foundation of China (61936014U2241275); Shanghai Municipal Science and Technology Major Project (2021SHZDZX0100); Shanghai Science and Technology Innovation Action Plan Project (22511105300); Fundamental Research Funds for the Central Universities.

Disclosures

The authors declare no conflicts of interest.

Data availability

Data underlying the results presented in this paper are available in [27], [40], [41] and [42].

References

1. S. He, H. Zhou, Y. Wang, W. Cao, and Z. Han, “Super-resolution reconstruction of hyperspectral images via low rank tensor modeling and total variation regularization,” in 2016 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), (IEEE, 2016), pp. 6962–6965.

2. W. Dong, C. Zhou, F. Wu, J. Wu, G. Shi, and X. Li, “Model-guided deep hyperspectral image super-resolution,” IEEE Trans. on Image Process. 30, 5754–5768 (2021). [CrossRef]

3. M. H. Kim, T. A. Harvey, D. S. Kittle, H. Rushmeier, J. Dorsey, R. O. Prum, and D. J. Brady, “3d imaging spectroscopy for measuring hyperspectral patterns on solid objects,” ACM Trans. Graph. 31(4), 1–11 (2012). [CrossRef]

4. G. Lu and B. Fei, “Medical hyperspectral imaging: a review,” J. Biomed. Opt. 19(1), 010901 (2014). [CrossRef]

5. M. Borengasser, W. S. Hungate, and R. Watkins, Hyperspectral remote sensing: principles and applications (CRC press, 2007).

6. D. L. Donoho, “Compressed sensing,” IEEE Trans. Inf. Theory 52(4), 1289–1306 (2006). [CrossRef]

7. M. E. Gehm, R. John, D. J. Brady, R. M. Willett, and T. J. Schulz, “Single-shot compressive spectral imaging with a dual-disperser architecture,” Opt. Express 15(21), 14013–14027 (2007). [CrossRef]

8. A. Wagadarikar, R. John, R. Willett, and D. Brady, “Single disperser design for coded aperture snapshot spectral imaging,” Appl. Opt. 47(10), B44–B51 (2008). [CrossRef]

9. G. R. Arce, D. J. Brady, L. Carin, H. Arguello, and D. S. Kittle, “Compressive coded aperture spectral imaging: An introduction,” IEEE Signal Process. Mag. 31(1), 105–115 (2014). [CrossRef]

10. D. Kittle, K. Choi, A. Wagadarikar, and D. J. Brady, “Multiframe image estimation for coded aperture snapshot spectral imagers,” Appl. Opt. 49(36), 6824–6833 (2010). [CrossRef]

11. X. Lin, Y. Liu, J. Wu, and Q. Dai, “Spatial-spectral encoded compressive hyperspectral imaging,” ACM Trans. Graph. 33(6), 1–11 (2014). [CrossRef]

12. J. Tan, Y. Ma, H. Rueda, D. Baron, and G. R. Arce, “Compressive hyperspectral imaging via approximate message passing,” IEEE J. Sel. Top. Signal Process. 10(2), 389–401 (2016). [CrossRef]

13. Y. Wu, P. Ye, I. O. Mirza, G. R. Arce, and D. W. Prather, “Experimental demonstration of an optical-sectioning compressive sensing microscope (csm),” Opt. Express 18(24), 24565–24578 (2010). [CrossRef]

14. Y. Liu, X. Yuan, J. Suo, D. J. Brady, and Q. Dai, “Rank minimization for snapshot compressive imaging,” IEEE Trans. Pattern Anal. Mach. Intell. 41(12), 2990–3006 (2019). [CrossRef]

15. L. Wang, Z. Xiong, D. Gao, G. Shi, and F. Wu, “Dual-camera design for coded aperture snapshot spectral imaging,” Appl. Opt. 54(4), 848–858 (2015). [CrossRef]

16. X. Yuan, “Generalized alternating projection based total variation minimization for compressive sensing,” in 2016 IEEE International Conference on Image Processing (ICIP), (IEEE, 2016), pp. 2539–2543.

17. E. Salazar, A. Parada-Mayorga, and G. R. Arce, “Spectral zooming and resolution limits of spatial spectral compressive spectral imagers,” IEEE Trans. Comput. Imaging 5(2), 165–179 (2019). [CrossRef]

18. L. Wang, T. Zhang, Y. Fu, and H. Huang, “Hyperreconnet: Joint coded aperture optimization and image reconstruction for compressive hyperspectral imaging,” IEEE Trans. on Image Process. 28(5), 2257–2270 (2019). [CrossRef]

19. L. Wang, C. Sun, Y. Fu, M. H. Kim, and H. Huang, “Hyperspectral image reconstruction using a deep spatial-spectral prior,” in 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), (IEEE Computer Society, 2019), pp. 8024–8033.

20. I. Choi, D. S. Jeon, G. Nam, D. Gutierrez, and M. H. Kim, “High-quality hyperspectral reconstruction using a spectral prior,” ACM Trans. Graph. 36(6), 1–13 (2017). [CrossRef]

21. J. Ma, X.-Y. Liu, Z. Shou, and X. Yuan, “Deep tensor admm-net for snapshot compressive imaging,” in 2019 IEEE/CVF International Conference on Computer Vision (ICCV), (IEEE, 2019), pp. 10222–10231.

22. L. Wang, C. Sun, M. Zhang, Y. Fu, and H. Huang, “Dnu: Deep non-local unrolling for computational spectral imaging,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (IEEE, 2020), pp. 1661–1671.

23. X. Yuan, Y. Liu, J. Suo, and Q. Dai, “Plug-and-play algorithms for large-scale snapshot compressive imaging,” in 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), (IEEE, 2020), pp. 1444–1454.

24. K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, (IEEE, 2016), pp. 770–778.

25. Y. Zhang, K. Li, K. Li, L. Wang, B. Zhong, and Y. Fu, “Image super-resolution using very deep residual channel attention networks,” in Proceedings of the European conference on computer vision (ECCV), (Springer, 2018), pp. 294–310.

26. X. Miao, X. Yuan, Y. Pu, and V. Athitsos, “l-net: Reconstruct hyperspectral images from a snapshot measurement,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, (IEEE, 2019), pp. 4059–4069.

27. Z. Meng, J. Ma, and X. Yuan, “End-to-end low cost compressive spectral imaging with spatial-spectral self-attention,” in European Conference on Computer Vision, (Springer, 2020), pp. 187–204.

28. Y. Cai, J. Lin, X. Hu, H. Wang, X. Yuan, Y. Zhang, R. Timofte, and L. Van Gool, “Mask-guided spectral-wise transformer for efficient hyperspectral image reconstruction,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (IEEE, 2022), pp. 17502–17511.

29. X. Kong, H. Zhao, Y. Qiao, and C. Dong, “Classsr: A general framework to accelerate super-resolution networks by data characteristic,” 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) pp. 12011–12020 (2021).

30. Y. Fu, T. Zhang, L. Wang, and H. Huang, “Coded hyperspectral image reconstruction using deep external and internal learning,” IEEE Transactions on Pattern Analysis and Machine Intelligence (2021).

31. C. V. Correa, H. Arguello, and G. R. Arce, “Spatiotemporal blue noise coded aperture design for multi-shot compressive spectral imaging,” J. Opt. Soc. Am. A 33(12), 2312–2322 (2016). [CrossRef]

32. J. Wang, Y. Zhang, X. Yuan, Y. Fu, and Z. Tao, “A simple and efficient reconstruction backbone for snapshot compressive imaging,” arXiv preprint arXiv:2108.07739 (2021). [CrossRef]

33. Z. Meng, S. Jalali, and X. Yuan, “Gap-net for snapshot compressive imaging,” arXiv preprint arXiv:2012.08364 (2020). [CrossRef]

34. X. Yuan, D. J. Brady, and A. K. Katsaggelos, “Snapshot compressive imaging: Theory, algorithms, and applications,” IEEE Signal Process. Mag. 38(2), 65–88 (2021). [CrossRef]

35. H. Arguello and G. R. Arce, “Colored coded aperture design by concentration of measure in compressive spectral imaging,” IEEE Trans. on Image Process. 23(4), 1896–1908 (2014). [CrossRef]

36. M. Iliadis, L. Spinoulas, and A. K. Katsaggelos, “Deepbinarymask: Learning a binary mask for video compressive sensing,” IEEE Trans. on Image Process. 96, 102591 (2020). [CrossRef]

37. J. Liang, J. Cao, G. Sun, K. Zhang, L. Van Gool, and R. Timofte, “Swinir: Image restoration using swin transformer,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, (IEEE, 2021), pp. 1833–1844.

38. Y. Zhang, K. Li, K. Li, G. Sun, Y. Kong, and Y. Fu, “Accurate and fast image denoising via attention guided scaling,” IEEE Trans. on Image Process. 30, 6255–6265 (2021). [CrossRef]

39. S. Woo, J. Park, J.-Y. Lee, and I. S. Kweon, “Cbam: Convolutional block attention module,” in Proceedings of the European conference on computer vision (ECCV), (Springer, 2018), pp. 3–19.

40. B. Arad and O. Ben-Shahar, “Sparse recovery of hyperspectral signal from natural rgb images,” in European Conference on Computer Vision, (Springer, 2016), pp. 19–34.

41. A. Chakrabarti and T. Zickler, “Statistics of real-world hyperspectral images,” in CVPR 2011, (IEEE, 2011), pp. 193–200.

42. F. Yasuma, T. Mitsunaga, D. Iso, and S. K. Nayar, “Generalized assorted pixel camera: postcapture control of resolution, dynamic range, and spectrum,” IEEE Trans. on Image Process. 19(9), 2241–2253 (2010). [CrossRef]

43. Y. Qiu, “Agdrn,” figshare, (2023), https://doi.org/10.6084/m9.figshare.21971612.

44. T. Huang, W. Dong, X. Yuan, J. Wu, and G. Shi, “Deep gaussian scale mixture prior for spectral compressive imaging,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (IEEE, 2021), pp. 16216–16225.

45. H. F. Rueda, H. Arguello, and G. R. Arce, “Compressive spectral testbed imaging system based on thin-film color-patterned filter arrays,” Appl. Opt. 55(33), 9584–9593 (2016). [CrossRef]

Methods	Profile	Metrics	ICVL	Harvard	CAVE
AGDRN (proposed)	Complementary input and patch-wise spectral and spatial attention	PSNR	38.19	37.73	33.72
		SSIM	0.982	0.956	0.953
		RMSE	0.038	0.054	0.117
AGDRN-S (proposed)	Small version of the AGDRN	PSNR	37.51	37.20	33.07
		SSIM	0.980	0.953	0.946
		RMSE	0.040	0.055	0.127
SRN	Highly efficient residual network	PSNR	36.72	36.90	32.79
		SSIM	0.977	0.951	0.942
		RMSE	0.044	0.059	0.127
SwinIR	Self-attention within the shifted windows	PSNR	34.12	35.37	31.71
		SSIM	0.962	0.939	0.929
		RMSE	0.060	0.071	0.140
DGSMP	Gaussian scale mixture prior based on deep CNN	PSNR	33.81	35.11	29.94
		SSIM	0.966	0.937	0.908
		RMSE	0.060	0.072	0.187
TSA-net	Axis-wise spectral and spatial attention	PSNR	32.70	32.75	30.50
		SSIM	0.955	0.870	0.919
		RMSE	0.068	0.112	0.171

Methods	Parameters(M)	Training time(h)	Evaluation time(s)
AGDRN	2.32	20.6	3.9
AGDRN-S	1.23	16.7	3.3
SRN	1.25	12.5	3.1
SwinIR	1.45	33.4	4.4
DGSMP	3.76	19.7	9.4
TSA-net	44.25	20.3	3.9

(a) Comparison of different data pre-processing methods
Methods	Data pre-processing	PSNR	SSIM
TSA-net	$Y^{'} ⊙ M$	32.70	0.955
TSA-net	Complementary Input	34.08	0.964
SRN	$Y^{'} ⊙ M$	36.72	0.977
SRN	Complementary Input	37.65	0.980
AGDRN	$Y^{'} ⊙ M$	37.86	0.982
AGDRN	Complementary Input	38.19	0.982

(b) Ablation of the patch attention in the AGDRN
Dataset	Attention	PSNR	SSIM
ICVL	None	38.07	0.981
ICVL	PA	38.19	0.982
Harvard	None	37.51	0.955
Harvard	PA	37.73	0.956
CAVE	None	33.59	0.951
CAVE	PA	33.72	0.953

Methods	lambda-net [26]	TSA-Net [27]	DGSMP [44]	SRN [32]	MST-M [28]	AGDRN
PSRN	29.25	30.08	32.60	33.32	33.45	34.85
SSIM	0.886	0.898	0.923	0.945	0.949	0.959

Hyperspectral image reconstruction via patch attention driven network

Abstract

1. Introduction

2. Measurement model

3. Reconstruction network

3.1 Complementary input

3.2 Attention guided deep residual network

3.3 Patch attention

4. Experiments

4.1 Experiment setting

4.2 Comparison with state-of-the-art methods

4.3 Ablation study

4.4 Experiment with the real mask

5. Conclusion

Funding

Disclosures

Data availability

References

Supplementary Material (1)

Data availability

Cited By

Figures (11)

Tables (4)

Equations (16)

Optics Express