Hyperspectral face recognition based on sparse spectral attention deep neural networks

Zhihua Xie; Yi Li; Jieyi Niu; Ling Shi; Zhipeng Wang; Guoyu Lu

doi:10.1364/OE.404793

1. Introduction

For each pixel in an image, a hyperspectral camera captures the light intensity for a large number of contiguous spectral bands of the ultraviolet, visible, near-infrared, mid-infrared region and beyond [1]. Every pixel in the image has a continuous spectrum, which can precisely distinguish between various intrinsic material properties of the object [2]. Hence, in the remote sensing areas, hyperspectral images have been considered to be a critical datatype for object identification and classification tasks [3]. Since hyperspectral images have many spectral bands compared with visible images, we usually treat the hyperspectral imaging data as a cube. This cube data is structured with $x$ and $y$ coordinates making up the spatial image pixels and $z$ coordinate making up the spectrum wavelength (see Fig. 1).

Fig. 1. Hyperspectral face image cube of a single sample from UWA

With the decreasing cost of hyperspectral cameras, hyperspectral imaging offers new opportunities for different imaging tasks. In face recognition applications, hyperspectral imaging samples a face at many contiguous narrow spectral bands within the ultraviolet, visible, near-infrared and beyond, which yields more biometric information than conventional face images. This is because the epidermal and dermal layers of human skin are essentially a scattering medium that consists of several pigments such as hemoglobin, melanin, bilirubin and carotene [4]. Various biometric features can be generated when light passes through these scattering media. Small changes in the distribution of the pigments result in a large change in the spectral reflectance of skin [5]. It has been demonstrated that the spectral properties of facial tissues can markedly increase inter-person discrimination. Therefore, many limitations in face recognition, which arise when intra-person variation exceeds inter-person variation [6], can be overcome by adding the spectral dimension.

However, hyperspectral face images also pose challenges, such as the difficulty of data acquisition, low signal-to-noise ratios and high dimensionality [6]. In particular, a hyperspectral image is acquired by imaging successive spectral bands over a very short period of time. When facial organs move during acquisition, especially when the eyes blink, the bands become misaligned, which introduces intra-person variations that must be suppressed. Generally, natural and synthetic objects have low spectral intensity close to 400 nm. This effect, combined with the low transmittance and narrow bandwidth, produces band images with low signal-to-noise ratios [6]. Hyperspectral imaging cameras can simultaneously capture hundreds of spectral bands at a fine spectral resolution. Such a large number of spectral bands strongly influences the performance of face recognition and results in heavy computational cost. Furthermore, some bands may carry little discriminative information. Meanwhile, interference between different spectral bands lowers the accuracy of face recognition.

Although many works on hyperspectral image classification focus on these problems, research on hyperspectral face recognition is relatively rare. One of the earliest works on hyperspectral face recognition was conducted by Pan et al. [7], who manually extracted bands in the near-infrared spectrum for face classification. They used their own proprietary database to conduct experiments and obtained good results. However, their results are not repeatable on the public hyperspectral face database [8]. The same database was used by Pan et al. [9], who combined spatial and spectral information and converted the hyperspectral cube to a 2D image. However, they used an ad-hoc method for band integration and formed the 2D image by selecting one value from a specific band at each pixel, which discarded many pixel values from each hyperspectral cube. Robila et al. [10] compared spectral signatures from different face regions using spectral angle measurements. However, they were limited to a small hyperspectral face database, which contained eight subjects. Di et al. [11] projected hyperspectral images into a low-dimensional space for feature extraction using 2D principal component analysis (PCA). Shen and Zheng [12] applied Gabor wavelets to the hyperspectral cube, generating 52 new data cubes from each hyperspectral data cube, and then used an ad-hoc sub-sampling scheme to reduce the huge amount of data. Liang et al. [13] used 3D high-order texture pattern descriptors to extract micro-patterns that integrate spatial and spectral information into features. Uzair et al. [14] used the low-frequency coefficients of the 3D discrete cosine transform (DCT) as features for face recognition and used partial least squares (PLS) regression for classification. Furthermore, Uzair et al. [6] extracted spatio-spectral covariance features from hyperspectral 3D cubes, for which they converted the 3D cubes into 2D images. The converted image contains both spatial and spectral information, and PLS is used for face classification. Vivek Sharma et al. [15] regarded each band of a hyperspectral face image as a separate image. Thus, they moved classification from the pixel level to the image level. To extend their work, Vivek Sharma et al. [1] extracted deep hyperspectral face features with AdaBoost-based band selection and a convolutional neural network, named S-CNN; they applied the AdaBoost SVM method for band selection. Zhihua Xie et al. [16] extracted optimal bands from different face regions, on the premise that different face regions possess different optimal bands. The final result is obtained by maximum voting over the support vector machine (SVM) outputs of the different face regions. Taherkhani et al. [17] applied the group Lasso algorithm to the first convolution layer of VGG-19 [18]. Then, by training the VGG-19 network, they obtained the optimal sparse bands. Ashok Kumar Rai et al. [19] used the Firefly algorithm for band fusion and a convolutional neural network for hyperspectral face recognition.

Two issues have not been well addressed by the existing hyperspectral face recognition methods. The first is that all the previous methods assign equal weights to all spectral bands. The second is that the previous works select only one group of optimal bands for the whole face database after a learning process. Not all bands are beneficial to the recognition task, and only a selected range of bands needs to be used to obtain promising performance [20]. The main motivation of this paper is that each face sample should have its own optimal bands, rather than a fixed set of bands shared by all samples. These issues motivate us to design a CNN with an adaptive spectral selection mechanism.

Inspired by recent advances in attention mechanism [21] and its applications in remote sensing image classification [22,23], this paper proposes a novel deep network named sparse spectral channel-wise attention-based network (SSCANet), which is based on channel-wise attention algorithm and Lasso constraint. The main contributions of this work are summarized as follows:

(1) The SSCANet model can analyze the significance of different spectral bands and recalibrate them by learning global information, selectively emphasizing informative features and suppressing less useful ones. Thus, this model can generate different weights for different spectral bands, which are fed to the subsequent deep network.
(2) The SSCANet model can adaptively extract the distinctive bands learned from the task and the hyperspectral data in an end-to-end network, rather than selecting the bands manually or with a greedy algorithm. Compared with other methods that obtain fixed bands for the whole hyperspectral database, we can adaptively select bands for different subjects. Moreover, our method can select not only the traditional bands (green and red wavelength ranges) but also bands in the blue wavelength range, which had not been considered before.
(3) The SSCANet model zeroes out the redundant bands by using domain knowledge of spectral characteristics. During the training stage, the lasso constraint makes the model sparser, which can accelerate training.
(4) We applied SSCANet to three widely used databases, namely HK-PolyU [11], CMU [24] and UWA [14]. Comprehensive results demonstrate that the proposed SSCANet obtains better performance than other deep networks.

The remainder of this paper is organized as follows. In section 2, we explain the theory of our proposed method, SSCANet, and the related methods that we adopt, namely Multi-task Cascaded Convolutional Networks (MTCNN) [25] for image preprocessing and the Cyclical Learning Rate (CLR) [26] for learning rate adjustment. In section 3, we introduce the parameters of our method, report experiments on the three public databases, and compare with other state-of-the-art works. Finally, we summarize the paper and discuss future work in section 4.

2. Methodology

In this section, we introduce SSCANet, which mainly consists of an SSCA network unit and a VGG-19 network. The SSCA network unit is used for adaptive band selection, and the VGG-19 network is used for hyperspectral face classification. The related methods mentioned in section 1, MTCNN and CLR, are also briefly described below.

2.1 SSCANet

SSCANet is built by simply stacking an SSCA network unit and a VGG-19 network. The VGG-19 network achieved excellent results on the ILSVRC-2014 dataset (i.e., the ImageNet competition) and has been widely used for image classification. In our method, we place an SSCA network unit in front of the VGG-19 framework. When a hyperspectral face image passes through the SSCA network unit, we obtain its channel weights, whose dimension equals the number of spectral bands of the hyperspectral face data. The weights are learned by a self-gating mechanism based on channel dependence, which governs the excitation of each channel. Then, the original spectral face images are multiplied channel-wise by these weights to generate reweighted face images, which can be fed directly into the subsequent VGG-19 network. The entire network framework is shown in Fig. 2.

Fig. 2. The framework of proposed SSCANet for hyperspectral face recognition

Note that the number of input channels of the first convolution layer of the VGG-19 network must equal the number of spectral channels of the spectral face images. We therefore expand the input channels of the first convolution layer by duplicating the filters of that layer along the input-channel dimension. For example, suppose the filters of the first convolution layer have a depth of 3 but the data has 3k spectral bands. In this case, we duplicate the filters of the first convolution layer k times along the depth, so that each filter grows from 3 input channels to 3k input channels. Importantly, we duplicate not only the filter structure but also the weights and bias parameters of the filters.
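
The duplication step can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation: the function name `expand_first_conv`, the `(out_channels, in_channels, kH, kW)` weight layout and the random stand-in weights are assumptions, and the sketch only covers the case where the band count is a multiple of the original filter depth. The bias has one value per output filter, so it is assumed to be reused unchanged.

```python
import numpy as np

def expand_first_conv(weights, num_bands):
    """Tile first-layer filters along the input-channel axis.

    weights: array of shape (out_channels, in_channels, kH, kW).
    num_bands: target number of input channels (a multiple of in_channels).
    """
    in_channels = weights.shape[1]
    if num_bands % in_channels != 0:
        raise ValueError("num_bands must be a multiple of the original depth")
    k = num_bands // in_channels
    # Repeat the block of input-channel slices k times: [c0 c1 c2 c0 c1 c2 ...]
    return np.tile(weights, (1, k, 1, 1))

# Stand-in for pre-trained weights: 64 filters of depth 3 and size 3x3.
rng = np.random.default_rng(0)
w_rgb = rng.standard_normal((64, 3, 3, 3))
w_hsi = expand_first_conv(w_rgb, num_bands=33)  # 33 bands, so k = 11
print(w_hsi.shape)
```

Every group of three consecutive input channels in `w_hsi` is a copy of the original three-channel filter slices, so the expanded layer starts from the same learned spatial patterns for every band.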

2.1.1 SSCA unit

The channel-wise attention algorithm was proposed by J. Hu et al. [21]. Inspired by their squeeze-and-excitation block, which was evaluated on ILSVRC 2017, we propose the SSCA network unit, which extracts useful bands and suppresses redundant bands of hyperspectral face images and is learned adaptively in an end-to-end spectral-spatial classification network. In the SSCA network unit, the recalibrated hyperspectral face image $y$ is generated as shown in Eq. (1):

(1)$$y = {f_{SSCA}}(x),$$

where $x$ represents the original hyperspectral face image cube, $y$ is the recalibrated hyperspectral face image, and $x,y \in {R^{H \times W \times C}}$. Here $H$ and $W$ are the spatial dimensions of the image and $C$ is the number of channels.

To acquire the recalibrated hyperspectral face image $y$, we need a channel attention weight that is learned from the original hyperspectral face data, is sensitive to the dependencies between bands, and recalibrates the strength of the different spectral bands of the input. We therefore adopt the convolution operation, because it can shrink the channel dimension and its differentiability allows end-to-end learning. In general, however, a convolution has a local receptive field, such as 3×3, and therefore cannot capture contextual information outside this region. For this reason, we apply adaptive average pooling and adaptive max pooling to generate two channel-wise statistics ${g_{avg}}$ and ${g_{\max }}$. These statistics are generated by shrinking $x$ over its spatial dimensions $H \times W$, as given in Eq. (2) and Eq. (3):

(2)$${g_{avg}} = \frac{1}{{H \times W}}\sum\limits_{i = 1}^H {\sum\limits_{j = 1}^W {x(i,j)} } ,$$

(3)$${g_{\max }} = \max (x),$$

where ${g_{avg}}$, ${g_{\max }} \in {R^C}$. The statistics ${g_{avg}}$ and ${g_{\max }}$ therefore summarize the whole image and represent the global distribution of responses over the band channels. Moreover, they allow a layer close to the input to obtain a global receptive field, which is useful for many tasks.
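
As a small illustration of Eq. (2) and Eq. (3), the two statistics can be computed by reducing a cube over its spatial axes. The function name `channel_statistics` and the toy 2×2×3 cube are assumptions made for this sketch.

```python
import numpy as np

def channel_statistics(x):
    """Reduce an (H, W, C) cube to two length-C vectors.

    g_avg is the spatial mean of each band, Eq. (2);
    g_max is the spatial maximum of each band, Eq. (3).
    """
    g_avg = x.mean(axis=(0, 1))
    g_max = x.max(axis=(0, 1))
    return g_avg, g_max

# Toy cube with H = W = 2 and C = 3 bands.
cube = np.arange(12, dtype=float).reshape(2, 2, 3)
g_avg, g_max = channel_statistics(cube)
print(g_avg, g_max)
```

Each output vector has one entry per band, which is why both statistics can describe the whole image regardless of its spatial size.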

To fully capture channel-wise dependencies, we adopt a gating mechanism with sigmoid activation, forming a bottleneck with two fully connected (FC) layers to learn a nonlinear channel relationship while limiting the model complexity. We choose two fully connected layers because the additional nonlinearity helps fit the complex relationship between channels. We set the size of the convolution filters to 1×1, which is functionally equivalent to a fully connected layer. We further improve the capacity of the SSCA unit by adding the Rectified Linear Unit (ReLU) function [27] after FC-1. Finally, we apply the sigmoid activation to obtain a normalized weight between 0 and 1. The channel attention weight is given by Eq. (4):

(4)$$u = \sigma ({W_2} \ast \varphi ({W_1} \ast {g_{avg}}) + {W_2} \ast \varphi ({W_1} \ast {g_{\max }})),$$

where ∗ is the convolution operation, φ refers to the ReLU function, and σ refers to the sigmoid activation. ${W_\textrm{1}}$ and ${W_\textrm{2}}$ are the weights of FC-1 and FC-2, respectively, with ${W_1} \in {R^{\frac{C}{\textrm{r}} \times C}}$ and ${W_2} \in {R^{C \times \frac{C}{\textrm{r}}}}$. $u \in {R^C}$ is the channel attention weight, and $r$ is the reduction ratio, whose specific value is discussed in section 3. Finally, the recalibrated hyperspectral face image $y$ is obtained by Eq. (5):

(5)$$y = u \cdot x,$$

where $u$ is the channel attention weight corresponding to the training sample $x$. The symbol · denotes channel-wise multiplication, in which each band of the original data $x$ is scaled by the corresponding scalar element of $u$. The structure of the SSCA network unit is shown in Fig. 3. To show the result of band selection, we set the values of the channel attention weight that are close to 1 to 1 and all others to 0. Figure 4 shows the selected spectral bands for some subjects from UWA. With our SSCA unit, both the selected bands and their number differ from subject to subject.
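
Eq. (2) to Eq. (5) can be put together in a short NumPy sketch of the forward pass. This is an illustration under stated assumptions, not the authors' code: the 1×1 convolutions are written as matrix products (their fully connected equivalent), biases are omitted as in Eq. (4), and the function names, the toy sizes ($C = 8$, $r = 4$) and the random weights are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(z, 0.0)

def ssca_forward(x, w1, w2):
    """Forward pass of the channel attention described by Eq. (2)-(5).

    x:  (H, W, C) hyperspectral cube.
    w1: (C // r, C) weights of FC-1.
    w2: (C, C // r) weights of FC-2.
    Returns the channel weights u (length C) and the recalibrated cube y.
    """
    g_avg = x.mean(axis=(0, 1))                      # Eq. (2)
    g_max = x.max(axis=(0, 1))                       # Eq. (3)
    # Both statistics pass through the same bottleneck and are summed, Eq. (4).
    u = sigmoid(w2 @ relu(w1 @ g_avg) + w2 @ relu(w1 @ g_max))
    y = x * u                                        # Eq. (5), broadcast over H and W
    return u, y

# Toy example: 4x4 image, C = 8 bands, reduction ratio r = 4.
rng = np.random.default_rng(0)
C, r = 8, 4
x = rng.random((4, 4, C))
w1 = rng.standard_normal((C // r, C))
w2 = rng.standard_normal((C, C // r))
u, y = ssca_forward(x, w1, w2)
print(u.round(3))
```

Because `u` has one entry per band, the multiplication rescales each band image as a whole and leaves its spatial structure unchanged.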

Fig. 3. The structure of SSCA unit

Fig. 4. The selected spectral information of some subjects from UWA

2.1.2 Lasso loss function

Although the SSCA unit generates channel weights that are relatively sparse, which gives the SSCA unit its band selection function, we want to further push the channel attention weight close to 1 when a band contains useful information and close to 0 when a band contains interfering information. Meanwhile, to ensure that the network has fewer effective parameters and thus speed up training, we adopt the lasso regularization algorithm, which can effectively zero out the redundant bands whose weights are close to 0 [17,28]. This domain knowledge (the hyperspectral constraint) is applied to the convolution filters of FC-2. We therefore design a Lasso loss function for band selection based on lasso regularization, as defined in Eq. (6):

(6)$$G({W_2}) = {{\bigg \Vert}{\sum\limits_{i = 1}^C {\sum\limits_{j = 1}^{\frac{C}{r}} {{W_2}(i,j)} } } {\bigg \Vert}_1},$$

where ${W_2} \in {R^{C \times \frac{C}{\textrm{r}}}}$ is the weight of FC-2, and ${W_2}(i,j)$ denotes the weight linking the ${i^{th}}$ convolution filter, which corresponds to the ${i^{th}}$ band, to the ${j^{th}}$ input channel of FC-2. Note that the weights in ${W_2}(i,:)$ form one group belonging to the same band.
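
A minimal sketch of the penalty is given below. It assumes that Eq. (6) is meant as the usual lasso penalty, that is, the sum of the absolute values of all entries of ${W_2}$; this reading is an assumption of the sketch, and the function names and the toy matrix are hypothetical. The per-band helper shows how the rows of ${W_2}$ group the weights by band.

```python
import numpy as np

def lasso_penalty(w2):
    """Elementwise L1 norm of the FC-2 weights (assumed reading of Eq. (6))."""
    return float(np.abs(w2).sum())

def per_band_l1(w2):
    """L1 norm of each row W2(i, :), i.e. of the weight group of band i."""
    return np.abs(w2).sum(axis=1)

# Toy FC-2 weights for C = 3 bands and C / r = 2 bottleneck channels.
w2 = np.array([[1.0, -2.0],
               [0.0,  0.0],
               [0.5,  0.5]])
print(lasso_penalty(w2))   # total penalty
print(per_band_l1(w2))     # a zero row means the band's group is switched off
```

A row whose L1 norm is driven to zero contributes nothing to the pre-sigmoid activation of its band in Eq. (4), which is how the penalty discourages reliance on redundant bands.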

2.1.3 Total loss function

For the entire network, we design a total loss function that combines the cross entropy loss function, the center loss function [29], and the Lasso loss function. The cross entropy loss function is widely used in CNNs; it forces the features of different classes apart and enlarges the inter-class distance. However, the cross entropy loss function does not explicitly reduce the intra-class distance. We therefore adopt the center loss function to compensate for this weakness. The center loss function learns a center for the deep features of each class and penalizes the distance between each feature and its class center. This penalty term enhances the discriminative ability of the deep features. As a result, our total loss function not only enlarges the inter-class distance but also pulls samples of the same class close to each other. The total loss function is defined in Eq. (7):

(7)$$L(w) ={-} \sum\limits_{i = 1}^n {\sum\limits_{j = 1}^k {y_i^j} } \log (p_i^j) + \frac{\tau }{2}\sum\limits_{i = 1}^n {||{f(w,{x_i}) - {c_{yi}}} ||_2^2 + \lambda \sum\limits_{i = 1}^n {G(w_{f{c_2}}^i)} } ,$$

where $n$ is the number of training samples in a batch and $k$ is the number of classes. $f(w,{x_i})$ is the output of the CNN, and ${x_i}$ is the ${i^{th}}$ training sample in the batch. ${y_i}$ is the one-hot label corresponding to ${x_i}$, so $y_i^j$ is the ${j^{th}}$ element of the vector ${y_i}$. ${p_i}$ is the output of the CNN after softmax. ${c_{yi}}$ is the center of the features of the class to which the ${i^{th}}$ sample belongs. $w_{f{c_2}}^i$ is the weight of FC-2 for the ${i^{th}}$ training sample. $\tau $ and $\lambda$ are hyperparameters that balance the three terms; their specific values are discussed in section 3.
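
The three terms of Eq. (7) can be evaluated for a toy batch as follows. This is a sketch under assumptions: the FC-2 weights are taken to be shared by all samples in the batch, so the last sum reduces to $n$ times the penalty; the penalty is read as the sum of absolute values; the class centers are given as fixed inputs instead of being learned; and all names and numbers are illustrative only.

```python
import numpy as np

def total_loss(probs, labels_onehot, features, centers, w2, tau, lam):
    """Cross entropy + center loss + lasso penalty, following Eq. (7).

    probs:         (n, k) softmax outputs.
    labels_onehot: (n, k) one-hot labels.
    features:      (n, d) deep features f(w, x_i).
    centers:       (k, d) one feature center per class.
    w2:            FC-2 weights, assumed shared by all n samples.
    """
    n = probs.shape[0]
    eps = 1e-12  # avoids log(0)
    cross_entropy = -np.sum(labels_onehot * np.log(probs + eps))
    class_idx = labels_onehot.argmax(axis=1)
    center = 0.5 * tau * np.sum((features - centers[class_idx]) ** 2)
    lasso = lam * n * np.abs(w2).sum()
    return float(cross_entropy + center + lasso)

# Toy batch: n = 2 samples, k = 2 classes, 2-D features.
probs = np.array([[0.5, 0.5], [0.25, 0.75]])
labels = np.array([[1, 0], [0, 1]])
features = np.array([[1.0, 0.0], [0.0, 1.0]])
centers = np.array([[0.0, 0.0], [0.0, 1.0]])
w2 = np.array([[1.0, -1.0]])
loss = total_loss(probs, labels, features, centers, w2, tau=0.002, lam=0.01)
print(loss)
```

In this toy batch the second sample sits exactly on its class center, so only the first sample contributes to the center term.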

2.2 Supplementary methods

In this part, we briefly introduce MTCNN (multi-task cascaded convolutional networks) [25], used for face preprocessing, and CLR [26], used for learning rate adjustment. These two supplementary methods effectively help us improve recognition performance.

2.2.1 MTCNN for image preprocessing

MTCNN adopts a cascaded structure with three stages of carefully designed deep convolutional networks. It predicts face and landmark locations in a coarse-to-fine manner [25] and can be used for face detection and face alignment. In our method, we extract the landmarks of the hyperspectral face image and then crop a tighter face image using these landmarks. The cropped face images are shown in Fig. 5. It can be seen from Fig. 5 that MTCNN effectively extracts the face region from hyperspectral face images, which reduces the adverse effect of the background on classification.

Fig. 5. The cropped hyperspectral face images from UWA by MTCNN

2.2.2 CLR for training neural networks

The learning rate is one of the most important hyperparameters of the network training process [30]. It determines how much the current weights change in the direction that reduces the loss. If the learning rate is too large, the optimization may overshoot and fail to converge to a minimum of the loss. Conversely, if the learning rate is too small, the optimization may become trapped in a local minimum or at a saddle point and be unable to escape. For this reason, we adopt the cyclical learning rate adjustment to train our network. The method sets a maximum learning rate and a minimum learning rate, and then varies the learning rate periodically between the two. If the loss falls into a local minimum, CLR can help it jump out of the local minimum when the learning rate increases from small to large.
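
The cyclical schedule can be sketched with the triangular policy from the CLR paper [26]. The text does not state which CLR policy is used or whether the step size counts iterations, so both are assumptions of this sketch; the function name `triangular_clr` is hypothetical, and the numbers are the cross entropy settings reported in section 3.4.

```python
import math

def triangular_clr(iteration, base_lr, max_lr, step_size):
    """Triangular cyclical learning rate.

    The rate rises linearly from base_lr to max_lr over step_size
    iterations, then falls back to base_lr over the next step_size.
    """
    cycle = math.floor(1 + iteration / (2 * step_size))
    x = abs(iteration / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)

# Settings reported for the cross entropy loss in section 3.4.
base_lr, max_lr, step_size = 0.0001, 0.0005, 26
for it in (0, 13, 26, 39, 52):
    print(it, triangular_clr(it, base_lr, max_lr, step_size))
```

One full cycle takes two step sizes, so with a step size of 26 the learning rate returns to its base value every 52 iterations.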

3. Experimental results and analysis

In this section, we will introduce the network architecture, the databases, and experimental process in detail. Moreover, we compared the differences between the result of bands selected under different methods and analyzed the reasons for obtaining the best performance based on these bands selected by our proposed method. In particular, we discussed the influence of different hyperparameters and components of SSCANet.

3.1 CNN architecture

We adopt VGG-19 for classification; its architecture is shown in Fig. 2. VGG-19 has 19 weight layers, consisting of 16 convolution layers and 3 fully connected layers. In general, VGG-19 can be divided into 6 blocks. The first 5 blocks are each composed of several convolution layers, and the last block is composed of fully connected layers. Each of the first 5 blocks is followed by a max-pooling layer with a 2×2 pixel window and a stride of 2. The first block has 2 convolution layers with 64 kernels each, and the second block has 2 convolution layers with 128 kernels each. The third block has 4 convolution layers with 256 kernels each. The fourth and fifth blocks share the same configuration of 4 convolution layers with 512 kernels each. All kernels have a 3×3 receptive field with a stride of 1 and padding of 1. Note that each convolution layer is followed by batch normalization and a ReLU activation function. The last block has 3 fully connected layers: the first and the second have 4096 nodes, and the last has 1000 nodes. Finally, the output of the last fully connected layer is fed to the softmax layer. In our work, we modify the VGG-19 framework to fit our task: the number of input channels of the first convolution layer is set to the number of bands of the hyperspectral face image, and the number of nodes of the last fully connected layer is changed to $k$, where $k$ stands for the number of classes.
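
The block layout described above can be written down as a small configuration and checked for consistency. The sketch below is illustrative only: the constant and function names are hypothetical, and the pooled size assumes that a 2×2, stride-2 pooling rounds odd sizes down, which is a common default that the text does not state.

```python
# (number of convolution layers, kernels per layer) for the five conv blocks
VGG19_CONV_BLOCKS = [(2, 64), (2, 128), (4, 256), (4, 512), (4, 512)]
FC_NODES = [4096, 4096, 1000]

def count_weight_layers():
    """Convolution layers plus fully connected layers."""
    return sum(n for n, _ in VGG19_CONV_BLOCKS) + len(FC_NODES)

def spatial_size_after_blocks(input_size):
    """Side length of the feature map after the five conv blocks.

    3x3 convolutions with stride 1 and padding 1 keep the size;
    each 2x2 max pooling with stride 2 halves it (rounded down).
    """
    size = input_size
    for _ in VGG19_CONV_BLOCKS:
        size //= 2
    return size

print(count_weight_layers())           # 16 conv + 3 FC
print(spatial_size_after_blocks(224))  # standard ImageNet input size
print(spatial_size_after_blocks(264))  # crop size used in section 3.3
```

The configuration confirms that two, two, four, four and four convolution layers add up to the 16 convolution layers stated in the text.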

3.2 Hyperspectral database

In this paper, all experiments are performed on three standard, public hyperspectral face datasets, namely HK-Poly [11,31], UWA [14] and CMU [24].

HK-Poly HSFD: The face images in this database were acquired using the CRI’s VariSpec Liquid-Crystal-Tuneable-Filter (LCTF). Each hyperspectral cube in this database contains 33 bands with wavelengths ranging from 400 to 720 nm, with a step of 10 nm. HK-Poly HSFD contains 48 subjects, including 13 females and 35 males. There are 4–7 cubes for each of the first 26 subjects and 1–3 cubes for each of the last 22 subjects. Some samples from HK-Poly HSFD are shown in Fig. 6. In our experiments, we select the first 25 subjects, which contain 113 hyperspectral image cubes. For each subject, we randomly select two cubes for training and use the remaining 63 cubes for testing.

Fig. 6. Samples of HK-Poly HSFD

UWA-HSFD: The face images were acquired with the CRI’s LCTF integrated with a photon focus camera. Each hyperspectral image cube has 33 bands covering the spectral range of 400–720 nm with a step of 10 nm. This database consists of 147 cubes of 80 subjects. For each subject, we randomly select one cube as training data and use the remaining 67 cubes as testing data. Some samples from UWA-HSFD are shown in Fig. 7.

Fig. 7. Samples of UWA HSFD

CMU-HSFD: The face images were acquired with a prototype spectro-polarimetric camera. In this database, each hyperspectral face cube contains 65 bands covering the spectral range of 450–1100 nm with a step of 10 nm. CMU-HSFD contains 48 subjects, and each subject has 4–20 cubes acquired at different sessions and under different lighting combinations produced by 600 W halogen bulbs. In our experiments, we only choose the cubes obtained in different sessions with all lights turned on. Thus, our experiment uses 48 subjects with 151 hyperspectral image cubes. For each subject, we randomly choose one cube for training and use the remaining 103 cubes for testing.

3.3 Data preprocessing

In this paper, we use MTCNN to perform face detection and alignment and to obtain the landmarks of the hyperspectral faces. Then, based on the landmarks, the hyperspectral face is cropped and resized to 264×264. To validate the effectiveness of the MTCNN module, we conduct hyperspectral face recognition experiments on the original hyperspectral face images and on the hyperspectral face images preprocessed by MTCNN. The experimental results are listed in Table 1. As shown in Table 1, hyperspectral images cropped by MTCNN yield a higher recognition rate than the original hyperspectral images. The improvement is especially significant on the CMU and UWA databases. A reasonable explanation is that MTCNN eliminates background interference and highlights the face region in the original hyperspectral images. This explanation is also supported by Fig. 5 in section 2.2.1.

Table 1. The Recognition rates (%) of the methods with and without MTCNN.

3.4 SSCANet configurations for training

The Adam optimizer [32] with the default hyper-parameter values ($\varepsilon = 10^{-3}$, ${\beta _1} = 0.9$, ${\beta _2} = 0.999$) is adopted to minimize the total loss function. In the CLR method, the max learning rate is set to 0.0005 and the base learning rate to 0.0001 for the cross entropy loss function, to 0.18 and 0.06 for the center loss function, and to 0.03 and 0.01 for the lasso loss function; the step size is 26 in all three cases. We train SSCANet for 100 epochs and set the batch size to 4 in all experiments. The hyperparameter $\tau $, which controls the balance between the center loss term and the other loss terms, is set to 0.002.

Moreover, the weights of the SSCA unit are initialized with Xavier uniform initialization. The weights of the VGG-19 network are initialized from a VGG-19 network pre-trained on the ImageNet database and then fine-tuned on the CASIA-WebFace database [33]. The CASIA-WebFace database contains 10,575 subjects and 494,414 images, of which we use only 5000 images for training and 1000 images for testing. The whole network is implemented on a computer with a 2.6 GHz CPU and a 1060 GPU with 6 GB of memory.

3.5 Hyperparameter $\lambda $

The hyperparameter $\lambda $ is the penalty coefficient of the Lasso loss function, that is, the weight of the band sparsity regularization in the total loss of the model. It determines the intensity of the band sparsity. Hence, we perform experiments to choose suitable values for the different databases. We select $\lambda $ with a grid search in a conventional validation stage, in which $\lambda $ ranges from 0.0001 to 10 with a ratio of 10 between successive values. The results are shown in Fig. 8. HK-Poly HSFD obtains a good result when $\lambda $ is 0.01, CMU-HSFD reaches a satisfactory result when $\lambda $ is 0.1, and UWA-HSFD has a good result when $\lambda $ is 1. As Fig. 8 shows, the recognition performance first improves as $\lambda $ increases. However, once $\lambda $ reaches a threshold, the accuracy declines and fluctuates. Based on this tendency, $\lambda = 1$ gives good generalization ability on all three datasets and is expected to reproduce good results on other databases. Given this empirical knowledge, the optimal $\lambda $ for a new database can also be found at low cost with the same validation procedure by searching $\lambda $ from 0.001 to 1.
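
The search grid described above, six values from 0.0001 to 10 spaced by a factor of 10, can be generated and scanned as in the sketch below. The function names are hypothetical, and `toy_score` is only a placeholder that peaks at 0.1 so that the example runs; in practice it would be replaced by training SSCANet and measuring validation accuracy, and it does not reproduce any reported result.

```python
import math

def lambda_grid(min_exp=-4, max_exp=1):
    """Powers of ten from 10**min_exp to 10**max_exp inclusive."""
    return [10.0 ** e for e in range(min_exp, max_exp + 1)]

def grid_search(candidates, evaluate):
    """Return the candidate with the highest score and all scores."""
    scores = {c: evaluate(c) for c in candidates}
    best = max(scores, key=scores.get)
    return best, scores

def toy_score(lam):
    # Placeholder objective (NOT real accuracy): highest at lam = 0.1.
    return -abs(math.log10(lam) + 1.0)

grid = lambda_grid()
best, scores = grid_search(grid, toy_score)
print(grid)
print(best)
```

Restricting the exponents to the range from -3 to 0 gives the narrower search from 0.001 to 1 suggested for a new database.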

Fig. 8. Accuracy of different databases using different hyperparameter $\lambda $

3.6 Reduction ratio $r$

The reduction ratio $r$ is introduced in Eq. (4). It is a significant hyperparameter that controls the computational cost of the network. To find the best balance between accuracy and computational cost, we conduct experiments based on SSCANet for a range of reduction ratio values. Before carrying out the experiments, we give Eq. (8) for evaluating the computational complexity:

(8)$$\frac{\textrm{2}}{r}\sum\limits_{m = 1}^M {{N_m} \cdot C_m^2} ,$$

where $M$ is the number of channel attention modules, ${N_m}$ is the number of repeated blocks in the ${m^{th}}$ module, and ${C_m}$ is the number of channels of that module. Each block has two fully connected layers. In this paper, the SSCA network unit has just one channel attention module with one block. The computational costs and results on the different databases are shown in Table 2. HK-Poly HSFD obtains the best balance when the reduction ratio $r$ is 4, and CMU-HSFD and UWA-HSFD obtain the best balance when $r$ is 8. Similar to $\lambda $, the reduction ratio $r$ is determined in a validation stage, in which $r$ ranges from 1 to 32 with a ratio of 2 between successive values. Table 2 shows that the performance first improves and then falls as $r$ increases. Based on this tendency, $r$ can be set to 8 for good generalization ability on the three datasets, although the HK-Poly database then shows a small reduction in accuracy (from 97.778% to 96.508%). The reason is that the data complexity and variation of the HK-Poly database are relatively low compared with the other two databases. For a new database, we set the hyperparameter $r$ to 8 to achieve good overall performance.
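
Eq. (8) is easy to evaluate numerically for the single-module, single-block case. The sketch below is illustrative: the function name `attention_cost` is hypothetical, the expression is evaluated literally (so non-integer values appear when $r$ does not divide the band count; how the bottleneck width is rounded in practice is not stated), and the printed values are not taken from Table 2.

```python
def attention_cost(r, blocks_per_module, channels_per_module):
    """Evaluate Eq. (8): (2 / r) * sum over modules of N_m * C_m ** 2."""
    total = sum(n * c ** 2 for n, c in zip(blocks_per_module, channels_per_module))
    return (2.0 / r) * total

# One module with one block, 33 spectral bands (HK-Poly and UWA cubes).
for r in (1, 2, 4, 8, 16, 32):
    print(r, attention_cost(r, [1], [33]))
```

The cost is inversely proportional to $r$, so doubling the reduction ratio halves the number of weights in the two fully connected layers.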

Table 2. The parameter sizes and accuracy for different databases at different reduction ratios r. Note that “original” denotes a reduction ratio equal to the original number of bands of each database (for HK-Poly HSFD, the original ratio is set to 33).

Note that VGG-19 contains 144 million parameters. The computational cost of the SSCA network unit is therefore negligible compared with that of the VGG-19 network, which demonstrates that our network can improve performance across different databases with only a small increase in computational complexity.

3.7 Comparison results of different methods

To verify the effectiveness of the proposed method, we compare it comprehensively with previous hyperspectral face recognition methods. For a fair comparison, Uzair’s band fusion method [6] and Vivek’s method [1] use the optimal bands reported in their papers. The comparison results are shown in Table 3.

Table 3. Comparison results of different band selection methods.

In Table 3, the results (average recognition rate and variance) are obtained by randomly selecting the training set and test set for each model, and each experiment is repeated 10 times. As shown in Table 3, our method outperforms the other hyperspectral face recognition methods on all three databases. Note that SCANet does not use the Lasso algorithm, and VGG-19 uses neither channel-wise attention nor the Lasso algorithm. SSCANet obtains 97.778% on HK-Poly HSFD, 96.314% on CMU HSFD and 94.328% on UWA HSFD. These results are 2∼4% higher than previous works, while introducing only a negligible computational cost. Hence, the proposed model is shown to be effective and feasible.
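
The evaluation protocol described above (random training/test splits repeated 10 times, reporting the average and the variance) can be sketched as follows. The `evaluate` callback, the 50/50 split fraction, and the seed are hypothetical placeholders; this section does not state them.

```python
import random
import statistics
from typing import Callable, List, Sequence, Tuple


def repeated_split_eval(
    samples: Sequence,
    evaluate: Callable[[List, List], float],
    repeats: int = 10,
    train_fraction: float = 0.5,  # assumption: not specified in this section
    seed: int = 0,
) -> Tuple[float, float]:
    """Return (mean, sample variance) of the accuracy over random splits."""
    rng = random.Random(seed)
    accuracies = []
    for _ in range(repeats):
        shuffled = list(samples)
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * train_fraction)
        train, test = shuffled[:cut], shuffled[cut:]
        accuracies.append(evaluate(train, test))
    return statistics.mean(accuracies), statistics.variance(accuracies)


if __name__ == "__main__":
    # Stand-in scorer: a real one would train the network on `train` and
    # return its recognition rate (%) on `test`.
    def dummy_evaluate(train, test):
        return 100.0 * len(test) / (len(train) + len(test))

    mean_acc, var_acc = repeated_split_eval(range(20), dummy_evaluate)
    print(f"{mean_acc:.3f} ± {var_acc:.3f}")
```

Fixing the seed makes the ten splits reproducible, so different models can be compared on the same partitions.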

3.8 Ablation study

Although SSCANet gives the best results, we also carry out an ablation study [34,35] to analyze the influence of different hyperparameters (such as $\lambda $ and $r$) and of the different components of SSCANet.

The influence of $\lambda $: To better illustrate the influence of $\lambda $ on accuracy, we vary $\lambda $ from 10 to 0.0001. The results are shown in Fig. 8. The recognition rate roughly increases first and then decreases as $\lambda $ goes from 10 to 0.0001. The initial upward trend is easy to understand: the SSCA unit can retain more usable bands as the penalty coefficient $\lambda $ decreases, which contributes to a higher recognition rate. However, when $\lambda $ is too small, the SSCA unit not only retains the useful bands but also chooses some redundant bands that interfere with the result, which leads to lower accuracy.

The influence of $r$: The reduction ratio $r$ is a significant hyperparameter that controls both accuracy and computational cost. To judge its influence, we vary $r$ from 2 to 32 for every database, doubling it at each step. The results are shown in Table 2. HK-Poly HSFD, UWA-HSFD and CMU-HSFD obtain their best results of 97.778%, 93.432% and 96.314% at $r$ = 4, 8 and 8, respectively. The same pattern appears for each database: the recognition rate generally rises to a maximum and then decreases. Based on this phenomenon, we speculate that when $r$ is small and the number of parameters in the convolution layers is large, the SSCA unit extracts more bands than are useful, so the redundant bands interfere with the accuracy. Conversely, when $r$ is large and the number of parameters is small, the layers are unable to select enough informative bands, which results in low accuracy.

The influence of different components in SSCANet: The contributions of the channel attention method, the Lasso algorithm and the VGG-19 network are analyzed by an ablation study. We carried out experiments with four strategies: the VGG-19 network alone; the VGG-19 network with the Lasso algorithm, which is the Deep-SSL proposed by Fariborz [17]; SCANet, composed of the channel attention method and the VGG-19 network; and SSCANet, composed of the channel attention method, the Lasso algorithm and the VGG-19 network. The results are shown in Table 3. The proposed network achieves the best recognition rate among all methods, which demonstrates that SSCANet is effective at band selection. The accuracy of hyperspectral face recognition also increases gradually from VGG-19 to Deep-SSL and then to SSCANet. The first experiment, which uses only the VGG-19 network, obtains the lowest result because the VGG-19 network has no band selection capability. When the Lasso algorithm is added to the VGG-19 network, the network can select some discriminative bands, and the accuracy increases by 0.173%∼0.97%. When channel attention is added to the model, the accuracy rises by a further 1.791%∼4.128%. Thus, it can be inferred that SSCANet adaptively selects bands that are superior to those selected by the group Lasso algorithm. In addition, we perform experiments that apply only channel attention (SCANet). Its accuracy matches or exceeds that of Deep-SSL on all three databases, which suggests that channel attention alone is at least as effective as the Lasso algorithm.

The influence of different regularization: To verify the influence of the regularization on the recognition result, we applied Lasso regression and Ridge regression to the SSCA unit in turn. The comparison results are shown in Fig. 9. With Lasso regression, the recognition rates on the three databases are 2.866%, 1.946% and 1.194% higher, respectively, than with Ridge regression. It can be inferred that Lasso regression suits SSCANet better than Ridge regression. L1 regularization (i.e., Lasso regression) can drive useless coefficients exactly to zero, whereas L2 regularization (i.e., Ridge regression) can only make them close to zero, so L1 regularization yields a sparse solution more easily. Hence, L1 regularization is superior to L2 regularization in band selection for deep hyperspectral face recognition.
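
The difference between the two penalties can be illustrated with their proximal (shrinkage) operators: the L1 operator is soft-thresholding, which sets small coefficients exactly to zero, while the L2 operator only rescales them. This is a generic illustration, not the training procedure used in the paper; the example weights and the penalty value 0.1 are made up.

```python
from typing import List


def l1_shrink(weights: List[float], lam: float) -> List[float]:
    """Soft-thresholding: proximal operator of lam * |w| (Lasso penalty)."""
    out = []
    for w in weights:
        magnitude = max(abs(w) - lam, 0.0)
        out.append(magnitude if w >= 0 else -magnitude)
    return out


def l2_shrink(weights: List[float], lam: float) -> List[float]:
    """Proximal operator of (lam / 2) * w**2 (Ridge penalty): pure rescaling."""
    return [w / (1.0 + lam) for w in weights]


if __name__ == "__main__":
    band_weights = [0.9, 0.05, -0.02, 0.4]  # hypothetical per-band weights
    lam = 0.1                               # hypothetical penalty strength
    sparse = l1_shrink(band_weights, lam)
    dense = l2_shrink(band_weights, lam)
    print("L1 zeros:", sum(v == 0 for v in sparse))
    print("L2 zeros:", sum(v == 0 for v in dense))
```

In this toy case the two weights smaller than the penalty are removed by the L1 operator, while the L2 operator keeps all four bands with slightly reduced weights.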

Fig. 9. Comparison results of different regularization methods

The impact of CLR: To evaluate the impact of CLR, we conduct experiments with the CLR learning strategy and with a normal learning strategy. Both use Adam as the optimizer, configured as described in section 3.4. The normal learning strategy uses the Multi-Step schedule with milestones {30, 60, 90} and gamma 0.1, so that when training reaches epochs 30, 60 and 90 the learning rate is multiplied by 0.1. The detailed configuration of the CLR strategy is also given in section 3.4. The experimental results of the two strategies are shown in Table 4: CLR performs better than the normal learning strategy. The test accuracy curves of the two strategies are shown in Fig. 10.

Fig. 10. Accuracy (%) curves of Adam with CLR and Adam without CLR

Table 4. Comparison results (%) of different learning strategies.

We set the number of iterations to 13 on the three public databases, so the abscissa ranges from 0 to 1300 in steps of 13 iterations. Figure 10 shows that the curve of Adam with CLR is better overall than that of Adam without CLR. This is because CLR lets the learning rate vary cyclically between reasonable boundary values, which helps the optimizer jump out of local optima in search of the global optimum. In addition, CLR eliminates the need to experimentally find the best value and schedule for the global learning rate, which greatly reduces the cost of parameter tuning.
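
The two learning-rate schedules compared above can be sketched in plain Python. The multi-step schedule uses the milestones {30, 60, 90} and gamma 0.1 given in the text; the triangular CLR policy follows Smith [26]. The base rate, maximum rate, and step size below are placeholder values, since the actual configuration is given in section 3.4.

```python
import math
from typing import Sequence


def multistep_lr(base_lr: float, epoch: int,
                 milestones: Sequence[int] = (30, 60, 90),
                 gamma: float = 0.1) -> float:
    """Multiply the rate by `gamma` once for every milestone already reached."""
    passed = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** passed


def triangular_clr(iteration: int, base_lr: float, max_lr: float,
                   step_size: int) -> float:
    """Triangular cyclical rate: rises linearly from base_lr to max_lr over
    `step_size` iterations, then falls back over the next `step_size`."""
    cycle = math.floor(1 + iteration / (2 * step_size))
    x = abs(iteration / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)


if __name__ == "__main__":
    # Placeholder values, not the paper's configuration.
    base_lr, max_lr, step_size = 1e-4, 1e-3, 26
    for epoch in (0, 30, 60, 90):
        print("multi-step", epoch, multistep_lr(base_lr, epoch))
    for it in (0, 13, 26, 39, 52):
        print("CLR", it, triangular_clr(it, base_lr, max_lr, step_size))
```

The multi-step rate only ever decreases, whereas the cyclical rate periodically returns to its maximum, which is the behavior the paragraph above credits with helping the optimizer leave local optima.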

3.9 Results of band selection

In this part, we present the band selection results of our proposed model and compare them with the band selection methods of Uzair [6], V. Sharma [1] and Fariborz [17]. We then discuss why the bands selected by our method lead to better performance. In these experiments, we record the bands chosen for each face and sum them up. The final results are shown in Fig. 11. The comparison of the different band selection methods is listed in Table 5.
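
The counting step described above (recording the bands chosen for each face and summing them) can be sketched as below. The rule that a band counts as chosen when its attention weight is nonzero, the wavelength grid, and the example weights are assumptions for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, Sequence


def count_selected_bands(
    per_face_weights: Iterable[Sequence[float]],
    wavelengths_nm: Sequence[int],
    threshold: float = 0.0,  # assumption: weight above this means "chosen"
) -> Dict[int, int]:
    """Sum, over all faces, how often each wavelength is selected."""
    counts: Counter = Counter({wl: 0 for wl in wavelengths_nm})
    for weights in per_face_weights:
        if len(weights) != len(wavelengths_nm):
            raise ValueError("one weight per band is required")
        for wl, w in zip(wavelengths_nm, weights):
            if abs(w) > threshold:
                counts[wl] += 1
    return dict(counts)


if __name__ == "__main__":
    wavelengths = [420, 430, 440, 450]      # hypothetical band grid (nm)
    faces = [
        [0.7, 0.0, 0.0, 0.2],               # hypothetical attention weights
        [0.5, 0.0, 0.1, 0.0],
        [0.0, 0.0, 0.3, 0.4],
    ]
    print(count_selected_bands(faces, wavelengths))
```

A histogram of these per-band counts is the kind of summary that a figure such as Fig. 11 would display.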

Fig. 11. Number of times each band is selected by our proposed method on the three databases

Table 5. Bands selected by different methods. Bold indicates wavelengths that our method selects in common with the other methods.

As shown in Fig. 11 and Table 5, the bands selected by our method largely contain the bands selected by Uzair [6], V. Sharma [1], and Fariborz [17], which supports the validity of the proposed method. In particular, our method selects discriminative bands adaptively, varying with the subject, instead of the fixed bands used by other methods. In other words, we not only select the optimal bands within the red and green wavelength ranges, as other methods have shown, but also extract discriminative information from the blue wavelength range, which other methods discard. The results show that although the optimal bands concentrate in the red, green and IR ranges, the blue bands still contain useful information for some identities (faces).
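
The overlap discussed above can be checked directly against the HK-Poly HSFD column of Table 5. The sketch below uses only the wavelength sets listed there and treats two bands as shared only when the wavelengths match exactly, which may be a stricter criterion than the text intends; the 500 nm cutoff for "blue" is also an assumption.

```python
# Wavelengths (nm) copied from the HK-Poly HSFD column of Table 5.
SSCANET = {420, 450, 460, 480, 500, 530, 550, 570, 580,
           600, 610, 620, 630, 660, 710}
OTHERS = {
    "Band fusion + PLS [6]": {530, 540, 550, 630, 670},
    "S-CNN [1]": {520, 570, 590},
    "Deep-SSL [17]": {580, 640, 700},
}


def shared_bands(ours: set, theirs: set) -> list:
    """Wavelengths selected by both methods, in ascending order."""
    return sorted(ours & theirs)


if __name__ == "__main__":
    for method, bands in OTHERS.items():
        common = shared_bands(SSCANET, bands)
        print(f"{method}: {len(common)}/{len(bands)} shared -> {common}")
    blue = sorted(b for b in SSCANET if b < 500)  # rough blue cutoff (assumption)
    print("bands below 500 nm:", blue)
```

The same comparison can be repeated for the CMU-HSFD and UWA-HSFD columns by swapping in their wavelength sets.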

To further demonstrate the superiority of the bands selected by SSCANet, we feed the optimal bands of Band fusion-PLS [6], S-CNN [1] and Deep-SSL [17] into the VGG-19 network that serves as the classifier module of the trained SSCANet. The experimental results are shown in Table 6. The bands selected by the SSCA unit perform better than the optimal bands selected by the other methods, which indicates that SSCANet selects more discriminative bands.

Table 6. Comparison results (%) of different optimal bands.

4. Conclusions

In this paper, we design the SSCANet framework to adaptively select bands for hyperspectral face recognition using a channel-wise attention algorithm and a Lasso constraint. Through extensive experiments and analysis on three public hyperspectral face databases, we obtain three insights into hyperspectral face recognition. First, the SSCA network unit effectively enhances performance by learning the importance relationships among bands with a gating mechanism and performing a dynamic band-wise recalibration. Second, adding the Lasso constraint further improves the average recognition rates at a small computational cost. Third, bands in the blue wavelength range also have good discriminative ability, in addition to the red, green and IR wavelength ranges. The experimental results show that the proposed model is superior to other hyperspectral face recognition methods. However, open issues remain. For example, small databases may lead to over-fitting during network training, and it is unclear whether a spatial attention algorithm can be combined with the channel attention algorithm for hyperspectral face recognition. Recent developments in mobile networks and transfer learning may help address the over-fitting that arises when training on small databases. We will also follow progress in attention mechanisms and apply it to hyperspectral face recognition. We will therefore continue to work on hyperspectral face recognition and introduce new methods in future research.

Funding

Graduate Research and Innovation Projects of Jiangsu Province (YC2020-S571); Jiangxi Provincial Department of Science and Technology (GJJ190578); National Natural Science Foundation of China (61861020).

Disclosures

The authors declare that there are no conflicts of interest related to this article.

References

1. V. Sharma and L. Van Gool, “Hyperspectral CNN for image classification & band selection, with application to face recognition,” Technical report KUL/ESAT/PSI/160, KU Leuven, Belgium (2016).

2. H. Fu, L. Bian, X. Cao, and J. Zhang, “Hyperspectral imaging from a raw mosaic image with end-to-end learning,” Opt. Express 28(1), 314–324 (2020).

3. L. Mou and X. X. Zhu, “Learning to Pay Attention on Spectral Domain: A Spectral Attention Module-Based Convolutional Network for Hyperspectral Image Classification,” IEEE Trans. Geosci. Electron. 58(1), 110–122 (2020).

4. R. R. Anderson and J. A. Parrish, “The optics of human skin,” J. Invest. Dermatol. 77(1), 13–19 (1981).

5. E. A. Edwards and S. Q. Duntley, “The pigments and color of living human skin,” Am. J. Anat. 65(1), 1–33 (1939).

6. M. Uzair, A. Mahmood, and A. Mian, “Hyperspectral Face Recognition with Spatiospectral Information Fusion and PLS Regression,” IEEE Trans. on Image Process. 24(3), 1127–1137 (2015).

7. Z. Pan, G. Healey, M. Prasad, and B. J. Tromberg, “Face recognition in hyperspectral images,” IEEE Trans. Pattern Anal. Machine Intell. 25(12), 1552–1560 (2003).

8. D. Ryer, “Quest hierarchy for hyperspectral face recognition,” PhD Dissertation, Air Force Institute of Tech (2012).

9. Z. Pan, G. Healey, and B. J. Tromberg, “Comparison of spectral-only and spectral/spatial face recognition for personal identity verification,” EURASIP J. Adv. Signal Process. 2009(1), 943602 (2009).

10. S. Robila, “Toward hyperspectral face recognition,” Proc. SPIE 6812, 68120X (2008).

11. W. Di, L. Zhang, D. Zhang, and Q. Pan, “Studies on hyperspectral face recognition in visible spectrum with feature band selection,” IEEE Trans. Syst., Man, Cybern. A 40(6), 1354–1361 (2010).

12. L. Shen and S. Zheng, “Hyperspectral face recognition using 3d Gabor wavelets,” in 2012 International Conference on Pattern Recognition (ICPR) (2012), 1574–1577.

13. J. Liang, J. Zhou, and Y. Gao, “3D local derivative pattern for hyperspectral face recognition,” in 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG) (2015), 1–6.

14. M. Uzair, A. Mahmood, and A. Mian, “Hyperspectral face recognition using 3d-dct and partial least squares,” in 2013 the British Machine Vision Conference (BMVC) (2013), 57.1–57.10.

15. V. Sharma and L. Van Gool, “Image-level classification in hyperspectral images using feature descriptors, with application to face recognition,” arXiv preprint arXiv:1605.03428 (2016).

16. Z. Xie, Y. Li, J. Niu, X. Yu, and L. Shi, “Hyperspectral Face Recognition Using Block based Convolution Neural Network and AdaBoost Band Selection,” in 6th International Conference on Systems and Informatics (2019), 1270–1274.

17. F. Taherkhani, J. Dawson, and N. M. Nasrabadi, “Deep Sparse Band Selection for Hyperspectral Face Recognition,” Hyperspectral Image Analysis, 319–350 (Springer, 2020).

18. K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in 2015 International Conference on Learning representation (ICLR) (2015), pp. 1–14.

19. A. Kumar Rai, R. Senthilkumar, and A. Kumar R, “Combining pixel selection with covariance similarity approach in hyperspectral face recognition based on convolution neural network,” Microprocess Microsy. 76(7), 103096 (2020).

20. H. Zhai, H. Zhang, L. Zhang, and P. Li, “Laplacian-regularized low-rank subspace clustering for hyperspectral image band selection,” IEEE Trans. Geosci. Electron. 57(3), 1723–1740 (2019).

21. J. Hu, L. Shen, and G. Sun, “Squeeze-and-Excitation Networks,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2018), 7132–7141.

22. J. Li and Z. Liu, “Efficient camera self-calibration method for remote sensing photogrammetry,” Opt. Express 26(11), 14213–14231 (2018).

23. Z. Ge, G. Cao, X. Li, and P. Fu, “Hyperspectral Image Classification Method Based on 2D–3D CNN and Multibranch Feature Fusion,” IEEE J. Sel. Top. Appl. Earth Observations Remote Sensing 13, 5776–5788 (2020).

24. L. J. Denes, P. Metes, and Y. Liu, “Hyperspectral face database. Carnegie Mellon University,” Tech. Report, CMU-RI-TR-02-25, Carnegie Mellon University (2002).

25. K. Zhang, Z. Zhang, Z. Li, and Y. Qiao, “Joint Face Detection and Alignment Using Multitask Cascaded Convolutional Networks,” IEEE Signal Process. Lett. 23(10), 1499–1503 (2016).

26. L. N. Smith, “Cyclical Learning Rates for Training Neural Networks,” in 2017 IEEE Winter Conference on Applications of Computer Vision (WACV) (2017), 464–472.

27. V. Nair and G. E. Hinton, “Rectified linear units improve restricted boltzmann machines,” in 2010 International Conference on Machine Learning (ICML) (2010), pp. 807–814.

28. M. Yuan and Y. Lin, “Model selection and estimation in regression with grouped variables,” J. Royal. Statistical Soc. B. 68(1), 49–67 (2006).

29. Y. Wen, K. Zhang, Z. Li, and Y. Qiao, “A discriminative feature learning approach for deep face recognition,” in European Conference on Computer Vision (ECCV) (2016), 499–515.

30. G. Krishnan, R. Joshi, T. Connor, F. Pla, and B. Javidi, “Human gesture recognition under degraded environments using 3D-integral imaging and deep learning,” Opt. Express 28(13), 19711–19725 (2020).

31. PolyU-HSFD Database, https://www4.comp.polyu.edu.hk/~biometrics/hsi/hyper_face.htm.

32. D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in International Conference on Learning representation (ICLR) (2015), 1–15.

33. D. Yi, Z. Lei, S. Liao, and S. Z. Li, “Learning face representation from scratch,” arXiv preprint arXiv:1411.7923 (2014).

34. D. Chang, Y. Ding, J. Xie, A. K. Bhunia, X. Li, Z. Ma, M. Wu, J. Guo, and Y. Song, “The Devil is in the Channels: Mutual-Channel Loss for Fine-Grained Image Classification,” IEEE Trans. on Image Process. 29(7), 4683–4695 (2020).

35. H. Hu, Z. Zhang, Z. Xie, and S. Lin, “Local Relation Networks for Image Recognition,” in 2019 IEEE/CVF International Conference on Computer Vision (ICCV) (2019), 3463–3472.

Input	HK-Poly HSFD (%)	CMU-HSFD (%)	UWA-HSFD (%)
Original images	92.38 ± 0.795	84.2719 ± 4.854	73.434 ± 4.48
Images processed by MTCNN	97.778 ± 0.80	96.314 ± 2.916	94.328 ± 2.236

Ratio $r$	HK-Poly accuracy (%)	UWA accuracy (%)	CMU accuracy (%)	Parameters for HK-Poly (millions)	Parameters for UWA (millions)	Parameters for CMU (millions)
2	93.968	87.76	94.722	0.001089	0.001089	0.004225
4	97.778	90.744	95.146	0.000545	0.000545	0.002113
8	96.508	93.432	96.314	0.000273	0.000273	0.001057
16	95.558	92.836	93.794	0.000137	0.000137	0.000529
32	94.422	91.342	95.926	0.000069	0.000069	0.000265
original	93.653	91.042	95.145	0.000066	0.000066	0.00012

Method	HK-Poly HSFD (%)	CMU HSFD (%)	UWA HSFD (%)
Spectral Signature [7]	24.6 ± 3.87	38.1 ± 1.89	40.5 ± 1.08
Spectral Angle [10]	25.4 ± 4.36	38.1 ± 1.89	37.9 ± 4.15
Spectral Eigenface [9]	70.3 ± 3.61	72.1 ± 5.41	91.5 ± 3.07
2D PCA [11]	71.1 ± 3.16	72.1 ± 5.41	83.8 ± 2.42
3D Gabor Wavelets [12]	90.1 ± 2.09	91.6 ± 2.86	91.5 ± 3.07
3D-DCT [14]	84 ± 0.2	88.6 ± 0.75	44.776 ± 0.28
Band fusion + PLS [6]	85.17 ± 1.18	79.612 ± 0.01	46.269 ± 0.1
S-CNN + SVM: Majority-Voting [1]	60 ± 2.6	71 ± 1.2	-
Blocking SI-CNN + Adaboost.M1 [16]	88 ± 1.23	-	-
Deep-SSL [17]	93.65 ± 1.59	93.2 ± 0.975	92.537 ± 2.24
VGG-19(non-CA, non-Lasso) (ours)	93.477 ± 3.95	92.23 ± 1.425	92.236 ± 1.495
SCANet (non-Lasso) (ours)	93.65 ± 3.175	94.74 ± 2.427	93.136 ± 0.745
SSCANet (ours)	97.778 ± 0.80	96.314 ± 2.916	94.328 ± 2.236

Methods	HK-Poly HSFD	CMU-HSFD	UWA-HSFD
Normal learning strategy	93.016 ± 2.38	90.874 ± 4.37	89.55 ± 1.49
CLR learning strategy	97.778 ± 0.80	96.314 ± 2.916	94.328 ± 2.236

Method	HK-Poly HSFD	CMU-HSFD	UWA-HSFD
Band fusion + PLS [6]	{530, 540, 550, 630, 670} nm	{570, 640, 720, 1000} nm	{530, 540, 610, 720} nm
S-CNN [1]	{520, 570, 590} nm	{730, 900, 970} nm	-
Deep-SSL [17]	{580, 640, 700} nm	{750, 810, 920, 990} nm	{570, 650, 680, 710} nm
SSCANet (ours)	{420, 450, 460, 480, 500, 530, 550, 570, 580, 600, 610, 620, 630, 660, 710} nm	{450, 470, 520, 570, 580, 640, 640, 650, 660, 670, 720, 730, 750, 760, 770, 780, 800, 810, 830, 850, 900, 910, 920, 930, 940, 980, 1020, 1050, 1060, 1080} nm	{400, 410, 440, 450, 490, 530, 610, 620, 690, 710} nm

Optics Express

Optimal bands	HK-Poly HSFD	CMU-HSFD	UWA-HSFD
Band fusion + PLS [6]	90.47	77.67	91.04
S-CNN [1]	85.71	82.52	-
Deep-SSL [17]	92.06	94.17	92.5373
SSCANet (ours)	97.78	96.31	94.33