
Semisupervised classification of hyperspectral images with low-rank representation kernel

Open Access

Abstract

A semisupervised deformed kernel function, using low-rank representation with consideration of the local geometrical structure of the data, is presented for the classification of hyperspectral images (HSIs). The proposed method incorporates the wealth of unlabeled information to deal with the limited labeled samples situation that is common in HSI applications. The proposed kernel needs to be computed before training the classifier, e.g., a support vector machine, and it relies on combining the standard radial basis function kernel, based on labeled information, and the low-rank representation kernel, obtained using all available (labeled and unlabeled) information. The low-rank representation kernel can overcome the difficulties of the clustering methods used to construct kernels such as the bagged kernel and the multi-scale bagged kernel. The experimental results on two well-known HSI data sets demonstrate the effectiveness of the proposed method in comparison with cluster kernels obtained using traditional clustering methods and graph learning methods.

© 2020 Optical Society of America

1. INTRODUCTION

Hyperspectral images (HSIs) acquired in hundreds of narrow contiguous bands provide more information than traditional RGB and multispectral images [1,2]. Their high spectral resolution enables the extensive application of HSIs in many fields, such as the military, forestry, agriculture, mineralogy, medicine, and environmental monitoring [3]. As a basic processing step, HSI classification, which aims to assign each pixel to one of the classes based on its spectral characteristics, has become an important research topic. However, obtaining a reliable and accurate class label for each pixel is an expensive and time-consuming task [4,5].

The generally unfavorable ratio between the large number of spectral bands and the limited number of available labeled samples results in several major challenges, e.g., the Hughes phenomenon [6]. To take full advantage of the rich information available in HSI data, various methods have been proposed to tackle the problem of their analysis [7]. Two such families are kernel-based methods [8] and semisupervised learning (SSL) [9]. Kernel-based methods in general, and support vector machines (SVMs) [10] in particular, have been successfully used for HSI classification [11].

Due to the curse of dimensionality [12,13], most supervised classification methods suffer from the limited labeled training samples and high dimensionality of HSI data [14]. Kernel-based methods have been widely utilized due to their insensitivity to the curse of dimensionality [6]. Another appealing property of most kernel-based methods is that there is no need to work explicitly with the mapped data. In fact, the nonlinear relations between data can be determined via a kernel (similarity) function that reproduces the similarity in a reproducing kernel Hilbert space (RKHS) [15]. However, a bottleneck of these methods is that the final performance strongly depends on the quality of the training samples and on the adequate definition of the kernel structural form [11]. Choosing a suitable kernel function that significantly improves the performance of these methods is thus one of the learning problems [13]. Based on the relationship between kernel design and graph learning, different kernels have been induced from the similarity matrix of a constructed graph using a graph-embedding framework such as locally linear embedding (LLE) [16]. However, the LLE method attempts to preserve only the local relationships of samples, and the number of nearest neighbors to be considered dramatically affects its performance. Instead of building a graph in two separate steps (adjacency construction and weight calculation), sparse representation has been exploited to construct a sparse graph in one unified step (the ${\ell _1}$-graph) [17]. The ${\ell _1}$-graph lacks global constraints, which greatly reduces performance when the data are grossly corrupted; in fact, each sample is considered individually in sparse representation.

Recently, some deep learning (DL) models have been proposed for feature extraction and classification of HSIs; they can provide good performance when enough training samples are available. Generally, artificial neural networks (ANNs) provide a powerful representation capability with abundant training samples. Without enough training samples, ANNs face the problem of overfitting, which degrades the classification performance on test samples. Applying DL to HSIs is a challenging problem because of the complex structure of HSI data and the limited number of labeled training samples, which make it difficult to optimally learn a large number of parameters [18].

SSL methods, which aim to improve the performance of classifiers by learning from the information contained in the numerous unlabeled samples and the labeled samples, have aroused a great deal of interest in the machine learning community [19] and have been successfully used in HSI classification [20]. These methods are very beneficial because the collection of labeled samples is fairly expensive, but unlabeled samples could be available in large quantities at low cost [19].

However, the classification performance of SSL methods depends sensitively on the manner in which the labeled and the unlabeled training samples are connected [11]. In particular, as far as HSI classification is concerned, SVM has shown good performance in dealing with high-dimensional HSI data [21] and distinguishes itself in terms of classification accuracy, computational cost, and low vulnerability to the Hughes phenomenon. Also, it works well with few training samples [22,23]. This paper deals with designing a semisupervised kernel function that exploits the abundant information of unlabeled samples and training a SVM with it.

In recent years, several methods have been proposed that model the data structure by exploiting the information contained in the unlabeled samples and then reformulate the base kernel function of the SVM classifier. In Ref. [22], a transductive SVM, which maximizes the margin for the labeled and unlabeled samples, was proposed for HSI classification. In Refs. [15,24], a bagged kernel (${{\textbf{K}}_{\rm bag}}$) encoding similarities between the labeled and the unlabeled samples was introduced to exploit the information in the unlabeled samples. These methods deform the structure of the base kernel using a clustering approach. Despite their admissible performance for some tasks and their simplicity of application, these methods face several challenges. Two parameters specific to the bagged kernels must be defined prior to classification: the number of times the $k$-means clustering is run ($t$) and the number of clusters to create ($k$). In the method proposed in Ref. [24], the $k$-means algorithm is run $t$ times with different initializations and the same number of clusters $k$, which results in $t$ cluster assignments for each sample. Then a bagged kernel is built from the fraction of the number of times that each pair of samples is assigned to the same cluster. Finally, the SVM classifier is trained using the deformed kernel ${{\textbf{K}}_{{\rm S} \text {-} {\rm bag}}}( {{{\textbf{x}}_i},{{\textbf{x}}_j}} ) = \rho {{\textbf{K}}_{\rm RBF}} + ( {1 - \rho } ){{\textbf{K}}_{\rm bag}}$, where ${{\textbf{K}}_{\rm RBF}} = \exp ( { - \|{{\textbf{x}}_i} - {{\textbf{x}}_j}\|^2/2{\sigma ^2}} )$, $\sigma \in {{\mathbb R}^ + }$, is the RBF kernel, ${{\textbf{x}}_i}$ and ${{\textbf{x}}_j}$ are two samples, and ${{\textbf{K}}_{\rm bag}} = \frac{1}{t}\mathop \sum \nolimits_{p = 1}^t [ {{c_p}( {{{\textbf{x}}_i}} ) = {c_p}( {{{\textbf{x}}_j}} )} ]$, where the operator $[ {{c_p}( {{{\textbf{x}}_i}} ) = {c_p}( {{{\textbf{x}}_j}} )} ]$ returns "1" if samples ${{\textbf{x}}_i}$ and ${{\textbf{x}}_j}$ belong to the same cluster according to the $p$th realization of the clustering ${c_p}( \cdot )$ and "0" otherwise. The parameter $\rho \in [ {0,1} ]$ provides a trade-off between ${{\textbf{K}}_{\rm RBF}}$ and ${{\textbf{K}}_{\rm bag}}$. Although the $k$-means clustering algorithm is fast and efficient, it generates overly blocky and sparse kernels because it only gives hard decisions. To overcome this problem, Ref. [15] proposed expectation-maximization (EM) clustering under a multi-scale (MS) Gaussian mixture model (GMM) instead of $k$-means clustering to obtain a smoother kernel (${{\textbf{K}}_{{\rm S} \text {-} {\rm MSbag}}}$). The main limitation of this method is the need for enough training samples; without them, the underlying probability distribution function of the data samples cannot be described properly, and the cluster structure cannot be estimated via the EM-GMM algorithm [24]. In addition, the clustering methods used in Refs. [24] and [15] are sensitive to noise and outliers [25], and they need to know the number of subspaces a priori. Moreover, most of the proposed methods utilize the Euclidean distance (ED). A problem of the ED is that it treats all features equally, so it often does not yield reliable judgments. To overcome the problems of the ED, many distance metric learning (DML) schemes have been developed to replace it [11].
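For concreteness, a minimal sketch of the bagged cluster kernel described above follows (NumPy/scikit-learn assumed; the function names, default parameter values, and seeding scheme are illustrative rather than taken from Refs. [15,24]):

```python
import numpy as np
from sklearn.cluster import KMeans

def bagged_kernel(X, t=10, k=10, seed=0):
    """K_bag of Ref. [24]: fraction of the t k-means runs in which two
    samples are assigned to the same cluster."""
    rng = np.random.RandomState(seed)
    n = X.shape[0]
    K_bag = np.zeros((n, n))
    for _ in range(t):
        labels = KMeans(n_clusters=k, n_init=1,
                        random_state=rng.randint(10**6)).fit_predict(X)
        K_bag += (labels[:, None] == labels[None, :]).astype(float)
    return K_bag / t

def rbf_kernel(A, B, sigma):
    """K_RBF(x_i, x_j) = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

# Deformed kernel of Refs. [15,24]: K_S-bag = rho * K_RBF + (1 - rho) * K_bag
# (K_RBF computed on the labeled samples, K_bag on all samples).
```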

In order to overcome the above problems, a kernel construction method using low-rank representation (LRR) is proposed in this paper. LRR was originally proposed to address the subspace segmentation problem [25]. For a given set of data vectors drawn from a union of multiple subspaces, LRR finds the lowest-rank representation of all data jointly. By choosing an appropriate dictionary, LRR can correctly preserve and reveal the membership of and correlations between samples that belong to the same subspace. It has been shown that the data set itself is the best choice of dictionary in most applications [26].

Compared to the clustering methods used for kernel construction in Refs. [15,24], LRR is better at capturing the global structure of the data, so it better fits the subspace segmentation that reveals the membership of the data samples. In comparison with the clustering methods used in the mentioned works, the main advantage of LRR is that it can handle noise and outliers in the data explicitly: since corruption largely increases the rank of the data matrix, as described in the matrix completion literature [27,28], the lowest-rank criterion enforces the correction of corruptions, so LRR is robust to noise and outliers. Moreover, LRR-based methods do not need to know the dimensions and the number of subspaces a priori [29]. As noted in Ref. [26], since each sample can be represented by itself, the general rank minimization problem of LRR always has feasible solutions, even when the data sampling is insufficient.

However, conventional LRR-based methods usually focus on constructing the global subspace structure, so local information may be missed in the learning process. In order to preserve the local geometrical structure of the data, an additional penalty term is added to the objective function of LRR. After the optimal representation of the data is obtained, the LRR kernel function is formed. The deformed kernel function is then defined by combining ${{\textbf{K}}_{\rm RBF}}$ with the kernel constructed via LRR with consideration of the local geometrical structure of the data, and it is used to train a SVM classifier.

The remainder of this paper is organized as follows: in Section 2, the LRR method is described. In Section 3, the procedure of the proposed LRR kernel is introduced. The experimental results are presented in Section 4. Finally, Section 5 concludes the paper.

2. LRR

Let ${\textbf{X}}{ =} {{\textbf{X}}_{\rm labeled}} \cup {{\textbf{X}}_{\rm unlabeled}} {=} \{ {( {{{\textbf{x}}_i},{l_i}} )} \}_{i = 1}^l \cup \{ {{{\textbf{x}}_i}} \}_{i = l + 1}^{l + u}{\in}\, {{\mathbb R}^{( {l + u} ) \times R}}$ be drawn from a union of multiple independent subspaces $ \cup _{i = 1}^k{{\cal S}_i}$, ${{\cal S}_1}, {{\cal S}_2}, \ldots ,{{\cal S}_k}$, in which each ${{\cal S}_i}$ is a low-dimensional subspace. Here, ${{\textbf{x}}_i}$ represents a sample, and $R$ denotes the number of spectral bands (dimensions). ${l_i} \in \{ {1,2, \ldots ,C} \}$ is the label of ${{\textbf{x}}_i}$, where $C$ is the number of classes and $N = l + u$ is the total number of samples. LRR is a powerful representation tool that finds the lowest-rank representation by solving the convex optimization problem in Eq. (1):

$${\min\nolimits_{{\textbf{Z}},{\textbf{E}}}}\|{\textbf{Z}}\|_{\rm *} + \gamma {\|{\textbf{E}}\|_{2,1}}, \quad {\rm s.t.}\;{\textbf{X}} = {\textbf{AZ}} + {\textbf{E}},$$
where ${\textbf{A}}$ is a dictionary that can span the data subspaces, ${\|{\textbf{E}}\|_{2,1}}$ denotes the ${\ell _{2,1}}$ norm of the matrix ${\textbf{E}}$ of outliers and noise, $\gamma$ is a balance parameter, and ${\textbf{Z}}$ denotes the coefficient matrix [30]. The low-rank constraint on the coefficient matrix ${\textbf{Z}}$ enables ${\textbf{Z}}$ to have a block-diagonal structure, where each block corresponds to one subspace [25]. To handle LRR-based methods more generally, the minimization problem can be written as follows:
$${\min\nolimits_{Z,E}}{\|{\textbf{Z}}\|_{\rm *}} + P\left( {{\textbf{Z}},{\textbf{E}}} \right), \quad {\rm s.t.}\;{\textbf{X}} = {\textbf{XZ}} + {\textbf{E}},$$
where $P$ is a penalty function on ${\textbf{Z}}$ and ${\textbf{E}}$ [30]. Various algorithms can be derived by imposing different definitions of $P$, as in Refs. [25,30,31]. The augmented Lagrange multiplier (ALM) method [32] can be used to solve Eq. (2).
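For readers implementing such solvers, the two proximal operators that ALM-type methods iterate between are sketched below; this is a generic NumPy illustration, not the exact update rules of Ref. [32]. Singular value thresholding handles the nuclear norm, and column-wise shrinkage handles the $\ell_{2,1}$ error term.

```python
import numpy as np

def svt(M, tau):
    """Singular value thresholding: proximal operator of tau * ||.||_* ,
    used to update the coefficient matrix Z inside ALM-type solvers."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def l21_shrink(M, tau):
    """Column-wise shrinkage: proximal operator of tau * ||.||_{2,1},
    used to update the error matrix E."""
    norms = np.maximum(np.linalg.norm(M, axis=0, keepdims=True), 1e-12)
    return M * np.maximum(1.0 - tau / norms, 0.0)
```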

3. PROPOSED SEMISUPERVISED LRR KERNEL

The underlying idea of the proposed method is to construct a semisupervised kernel function to measure the similarity among labeled samples, taking into account the information of all available samples, i.e., labeled ($\ell$) and unlabeled ($ u $) samples. The constructed kernel has two contributions, one using all available ($ {\ell + u} $) samples and the other only computed with the $\ell$ labeled samples. The weighted combination of these two kernels is a valid kernel and can be used in any kernel method for classification or regression, such as SVM.

A. Selection of the Appropriate Kernels

As mentioned before, the adequate definition of the kernel structural form is regarded as the key problem in kernel-based learning methods because of its direct effect on the performance of machine learning techniques [33,34]. Two different methodological approaches that estimate a likelihood kernel according to the unlabeled data structure are graph-based and cluster-based methods [24]. These methods modify the assumed prior kernel that encodes data relations [35]. The main challenge of graph-based methods is the introduction of critical free parameters and a high computational load [24]. On the other hand, the cluster-based methods used for the bagged kernel and the multi-scale bagged kernel have some problems, such as sensitivity to noise and outliers, difficulty in describing the underlying probability distribution function with limited training samples, and the need to know the number of subspaces a priori. In order to overcome these problems, a semisupervised kernel based on LRR is proposed and used with a SVM classifier in this paper. The proposed method can better encode the information carried by unlabeled samples with no modification of the SVM classifier.

B. Learning the Kernel from Unlabeled Samples

In order to define a suitable kernel whose structure can reflect data relations, the unlabeled information and geometrical relationships between labeled and unlabeled samples are utilized. To this end, a method is proposed that captures the local and global information of labeled and unlabeled data and deforms the structure of the base kernel (e.g., linear, polynomial, and RBF) using the unlabeled samples. Specifically, the proposed method combines a standard ${{\textbf{K}}_{\rm RBF}}$ kernel with a kernel constructed via low-rank representation with consideration of the local geometrical structure of data. The ${{\textbf{K}}_{\rm RBF}}$ is built only with the labeled samples. The second one, denoted as LRR kernel (${{\textbf{K}}_{\rm LRR}}$), focuses on the combination of labeled and unlabeled information of the data manifold that reveals the membership of the samples.

Conventional LRR-based methods usually focus on constructing the global subspace structure, so local information may be missed in the learning process. To overcome this problem, an additional penalty term is added to the objective function of LRR in Eq. (1). Here, a graph regularization term is incorporated into the objective function of LRR to preserve the intrinsic local structure of the original hyperspectral data. First, a nearest-neighbor graph $G = \{ {{\textbf{X}},{\textbf{S}}} \}$ with vertex set ${\textbf{X}}$ and similarity matrix ${\textbf{S}} \in {{\mathbb R}^{N \times N}}$ is constructed. If sample ${{\textbf{x}}_i}$ is one of the $k$-nearest neighbors of ${{\textbf{x}}_j}$, the similarity ${S_{ij}}$ is assigned as

$${S_{ij}} = {e^{ - \|{{\textbf{x}}_i} - {{\textbf{x}}_j}\|^2/\sigma }},$$
which is known as the heat kernel [35]. It is expected that the corresponding representations of any two samples ${{\textbf{x}}_i}$ and ${{\textbf{x}}_j}$ maintain the same local structure. To achieve this, a reasonable way is to minimize the following term [35]:
$$\begin{split}\frac{1}{2}\mathop \sum \limits_{i,j = 1}^n \|{z_i} - {z_j}\|_2^2{S_{ij}} &= \mathop \sum \limits_{i = 1}^n z_i^T{z_i}{D_{ii}} - \mathop \sum \limits_{i,j = 1}^n z_i^T{z_j}{S_{ij}}\\& = {\rm trace}( {{\textbf{ZD}}{{\textbf{Z}}^{\textbf{T}}}} ) - {\rm trace}( {{\textbf{ZS}}{{\textbf{Z}}^{\textbf{T}}}} )\\& = {\rm trace}( {{\textbf{ZL}}{{\textbf{Z}}^{\textbf{T}}}} ),\end{split}$$
where ${\textbf{D}}$ is a diagonal matrix whose elements are the row (or column, since ${\textbf{S}}$ is symmetric) sums of ${\textbf{S}}$, ${\textbf{L}} = {\textbf{D}} - {\textbf{S}}$ is the graph Laplacian matrix, and ${S_{ij}}$ is given by Eq. (3). Incorporating this graph regularization term into Eq. (1) and setting ${\textbf{A}} = {\textbf{X}}$ leads to the cost function of the proposed method for constructing ${{\textbf{K}}_{\rm LRR}}$:
$$\begin{split}&{\min\nolimits_{{\textbf{Z}},{\textbf{E}}}}{\|{\textbf{Z}}\|_{\rm *}} + {\lambda _1}{\|{\textbf{E}}\|_1} + {\lambda _2}{\rm trace}\left( {{\textbf{Z}}{\textbf{L}}{{\textbf{Z}}^{\textbf{T}}}} \right) \\ &\quad {\rm s.t.}\;{\textbf{X}} = {\textbf{XZ}} + {\textbf{E}},\;{\textbf{Z}} = {{\textbf{Z}}^{\textbf{T}}},\;{\textbf{Z}} \ge 0,\end{split}$$
where ${\lambda _1} \gt 0$ and ${\lambda _2} \gt 0$ are balance parameters that trade off among the low-rank representation, error, and manifold penalty terms. Since negative coefficients lack a physical interpretation in many real applications, a nonnegativity constraint is imposed on the data representation in Eq. (5). The ALM method [32] is utilized to solve the optimization problem in Eq. (5). The optimal solution ${\textbf{Z}}$ can be considered an improved low-rank representation that preserves local information. Each column of ${\textbf{Z}}$ naturally characterizes the affinity of a sample with the other samples.
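To illustrate the graph regularization ingredient, the sketch below (NumPy/scikit-learn assumed; the neighbor count, $\sigma$, and the toy data are illustrative) builds the heat-kernel kNN graph of Eq. (3), forms the Laplacian ${\textbf{L}} = {\textbf{D}} - {\textbf{S}}$, and numerically checks the identity used in Eq. (4):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def heat_kernel_graph(X, n_neighbors=10, sigma=1.0):
    """Symmetric kNN similarity matrix S of Eq. (3) and Laplacian L = D - S."""
    n = X.shape[0]
    nn = NearestNeighbors(n_neighbors=n_neighbors + 1).fit(X)
    dist, idx = nn.kneighbors(X)              # column 0 is the sample itself
    S = np.zeros((n, n))
    for i in range(n):
        S[i, idx[i, 1:]] = np.exp(-dist[i, 1:] ** 2 / sigma)
    S = np.maximum(S, S.T)                    # symmetrize the graph
    L = np.diag(S.sum(axis=1)) - S
    return S, L

# Sanity check of Eq. (4): 0.5 * sum_ij ||z_i - z_j||^2 S_ij == trace(Z L Z^T).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))
Z = rng.normal(size=(8, 50))                  # columns z_i are sample representations
S, L = heat_kernel_graph(X, n_neighbors=5, sigma=2.0)
lhs = 0.5 * sum(S[i, j] * np.sum((Z[:, i] - Z[:, j]) ** 2)
                for i in range(50) for j in range(50))
rhs = np.trace(Z @ L @ Z.T)
assert np.isclose(lhs, rhs)
```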

After the optimal representation matrix ${\textbf{Z}}$ is obtained, it is normalized to bring all values into the range [0,1]. Then the kernel function (${{\textbf{K}}_{\rm LRR}}$), or similarity matrix of the data set, is defined as ${{\textbf{K}}_{\rm LRR}} = {{\textbf{Z}}^\prime }$, where ${{\textbf{Z}}^\prime }$ is the normalized ${\textbf{Z}}$. Finally, the weighted summation of ${{\textbf{K}}_{\rm RBF}}$ and ${{\textbf{K}}_{\rm LRR}}$ is determined as follows:

$${{\textbf{K}}_{{\rm s} \text {-} {\rm LRR}}} = \rho {{\textbf{K}}_{\rm RBF}} + \left( {1 - \rho } \right){{\textbf{K}}_{\rm LRR}},$$
where $\rho \in [ {0,1} ]$ is a scalar parameter. Then the SVM classifier is trained using ${{\textbf{K}}_{{\rm s} \text {-} {\rm LRR}}}$. To summarize, the main procedure of the proposed semisupervised kernel construction method is shown in Table 1.

Table 1. Procedure of LRR Kernel Construction
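To make the procedure summarized in Table 1 concrete, the following sketch (scikit-learn assumed) normalizes the optimal representation, forms ${{\textbf{K}}_{{\rm s} \text {-} {\rm LRR}}}$ of Eq. (6), and trains a SVM on the precomputed kernel. Restricting ${{\textbf{K}}_{\rm LRR}}$ to its labeled-vs-labeled block, and the variable names such as `Z_opt` and `labeled_idx`, are assumptions made for illustration.

```python
import numpy as np
from sklearn.svm import SVC

def s_lrr_kernel(K_rbf_ll, Z, labeled_idx, rho):
    """Eq. (6) restricted to the labeled block:
    K_s-LRR = rho * K_RBF + (1 - rho) * K_LRR."""
    Z_norm = (Z - Z.min()) / (Z.max() - Z.min() + 1e-12)   # bring Z into [0, 1]
    K_lrr_ll = Z_norm[np.ix_(labeled_idx, labeled_idx)]    # labeled-vs-labeled entries
    return rho * K_rbf_ll + (1.0 - rho) * K_lrr_ll

# Training and prediction with a precomputed kernel (names are illustrative):
# K_train = s_lrr_kernel(K_rbf_ll, Z_opt, labeled_idx, rho=0.4)
# clf = SVC(kernel="precomputed", C=100).fit(K_train, y_train)
# y_pred = clf.predict(K_test)   # K_test: kernel between test and labeled samples
```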

4. DATA SETS AND EXPERIMENTS

A. Data Sets

To validate the effectiveness of the proposed method, two popular HSI data sets are used in our experiments. The first is the Indian Pines image captured by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor. It is of size ${145}\;{\rm pixels} \times {145}\;{\rm pixels}$ and contains 16 classes. The number of spectral bands is 224, initially reduced to 200 by removing 24 water absorption bands. The second is the University of Pavia image acquired by the Reflective Optics System Imaging Spectrometer (ROSIS) sensor over the University of Pavia, Italy. It consists of ${610}\;{\rm pixels} \times {340}\;{\rm pixels}$, 9 classes, and 103 bands after discarding noisy and water absorption bands [1]. The name and the number of samples in each class for both data sets are shown in Table 2.


Table 2. Name and Number of Samples in Both Data Sets

B. Experimental Setup

The effectiveness of the proposed kernel construction method was evaluated in terms of overall classification accuracy (OA) and kappa coefficient $\kappa$, using a SVM classifier, and compared with different existing kernel construction methods: (1) LLE; (2) sparse representation for the ${\ell _1}$-graph [17]; (3) ${{\textbf{K}}_{\rm RBF}}$ [33], which outputs positive similarities in the range [0,1] and thus eases the combination with the other kernels without normalization of the values; (4) ${{\textbf{K}}_{{\rm S} \text {-} {\rm bag}}}$, the weighted summation of ${{\textbf{K}}_{\rm RBF}}$ and ${{\textbf{K}}_{\rm bag}}$ proposed in Ref. [24], where the number of clusters $k$ in the $k$-means algorithm used to build ${{\textbf{K}}_{\rm bag}}$ is set to 10; and (5) ${{\textbf{K}}_{{\rm S} \text {-} {\rm MSbag}}}$, the weighted summation of ${{\textbf{K}}_{\rm RBF}}$ and ${{\textbf{K}}_{\rm MSbag}}$ proposed in Ref. [15]. Similar to what has been done in Ref. [15], the number of times the EM-GMM clustering method is run and the number of clusters were both set to 20. The number of neighbors in LLE was set to 13 and 20 for the Indian Pines data set and the University of Pavia data set, respectively. For the sparse representation used to construct the ${\ell _1}$-graph, the parameters were set as in Ref. [17]. The SVM classifier is trained using the deformed kernel in all cases. To investigate the ability of the low-rank representation kernel, the experiments were also performed using ${{\textbf{K}}_{\rm LRR}}$ alone.

Since a common prescription in machine learning for situations with a low number of samples is to use a low number of folds [15], the parameters $\rho$ and $\sigma$ were optimized by three-fold cross-validation. The parameter $\rho$ was tuned in (0,1) in steps of 0.05, and $\sigma$ was varied in $[ {0.05,2} ] \times d$, where $d$ represents the mean distance between all labeled samples. The optimal ${\lambda _1}$ and ${\lambda _2}$ were tuned over $\{ {{{10}^{ - 2}},{{10}^{ - 1}}} \}$ and $\{ {{{10}^{ - 3}},{{10}^{ - 2}},{{10}^{ - 1}}} \}$, respectively, by a grid search; finally, ${\lambda _1}$ and ${\lambda _2}$ were set to 0.1 and 0.01, respectively. As SVM is a binary classifier, a one-against-one strategy was adopted for the multi-class scheme. The SVM free parameters, i.e., $C$ and $\sigma$, were optimized by three-fold cross-validation in the same range $\{ {{{10}^{ - 3}}, \ldots ,{{10}^3}} \}$. Finally, $C = 100$ and $\sigma = 0.01$ were selected for the SVM.
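A minimal sketch of the three-fold grid search over $\rho$ and $\sigma$ described above is given below (scikit-learn assumed; the grids and helper names are illustrative, and `K_lrr_ll` denotes the labeled-vs-labeled block of the normalized LRR coefficients):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

def select_rho_sigma(X_lab, y_lab, K_lrr_ll, rhos, sigmas, C=100):
    """Three-fold grid search over (rho, sigma) for the deformed kernel of Eq. (6)."""
    d2 = ((X_lab[:, None, :] - X_lab[None, :, :]) ** 2).sum(axis=-1)  # pairwise squared distances
    cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
    best_rho, best_sigma, best_acc = None, None, -np.inf
    for sigma in sigmas:
        K_rbf = np.exp(-d2 / (2.0 * sigma ** 2))
        for rho in rhos:
            K = rho * K_rbf + (1.0 - rho) * K_lrr_ll
            scores = []
            for tr, va in cv.split(X_lab, y_lab):
                clf = SVC(kernel="precomputed", C=C).fit(K[np.ix_(tr, tr)], y_lab[tr])
                scores.append(clf.score(K[np.ix_(va, tr)], y_lab[va]))
            if np.mean(scores) > best_acc:
                best_rho, best_sigma, best_acc = rho, sigma, np.mean(scores)
    return best_rho, best_sigma

# Example grids matching the text (mean_pairwise_distance is an illustrative name):
# rhos = np.arange(0.05, 1.0, 0.05)
# sigmas = np.linspace(0.05, 2.0, 20) * mean_pairwise_distance
```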

In our experiments, both the OA and the kappa statistic are reported for several ill-posed classification cases. To this end, $l = \{ {10,20,30,40} \}$ labeled training samples per class for the University of Pavia data set and $l = \{ {5,15,25,30} \}$ labeled samples per class for the Indian Pines data set were randomly selected. For the Indian Pines data set, at most half of the total samples in the grass-pasture-mowed and oats classes were selected as labeled training samples. In the experiments, $u = 1000$ unlabeled samples are randomly selected to construct the LRR kernel. The sensitivity to the number of unlabeled samples is reported in Section 4.F. Because of the random selection of training samples, 15 realizations were carried out, and the best results are reported.

C. Numerical Results

Table 3 shows the best OA (%) and kappa coefficient for both data sets with different kernel construction methods using the SVM classifier. These results demonstrate that the proposed method produces better classification performance than the other kernel construction methods using a SVM classifier. In addition, it can be seen that the OA improves as the number of labeled samples increases. The improvement is particularly significant when a low number of labeled samples is used. These results confirm the validity of the proposed kernel construction method for classification of HSI data in the limited labeled samples situation, a common case in their analysis. For example, it can be observed that the proposed method yields 51.31%, 75.66%, 89.86%, 53.74%, 26.08%, and 23.08% higher classification results than ${{\textbf{K}}_{{\rm LLE}}}$, ${{\textbf{K}}_{{\ell _1}}}$, ${{\textbf{K}}_{\rm RBF}}$, ${{\textbf{K}}_{{\rm S} \text {-} {\rm bag}}}$, ${{\textbf{K}}_{{\rm S} \text {-} {\rm MSbag}}}$, and ${{\textbf{K}}_{{\rm LRR}}}$, respectively, for five labeled samples per class using a SVM classifier.


Table 3. Best OA% (Kappa Statistic) of Different Kernel Construction Methods Used by SVM versus the Number of Labeled Samples for the Indian Pines Data Set and the University of Pavia Data Set

Figure 1(a) shows the OA (%) with respect to a varying number of labeled samples for the different kernel construction methods. As shown in Fig. 1(a), the proposed method outperforms the compared methods for classification of HSIs, and the classification performance of all methods increases as the number of labeled samples increases. This is due to the advantages of low-rank representation in comparison with LLE, the sparse graph, and the $k$-means and EM-GMM clustering algorithms.


Fig. 1. Best OA% of the kernel functions versus the number of labeled samples based on a SVM classifier for the (a) Indian Pines data set and (b) University of Pavia data set.



Fig. 2. Best classification maps of the different kernel functions used by a SVM classifier for Indian Pines data set (top row) and University of Pavia data set (bottom row): (a) ground truth, (b) ${{\textbf{K}}_{{\rm LLE}}}$, (c) ${{\textbf{K}}_{{\ell _1}}}$, (d) ${{\textbf{K}}_{\rm RBF}}$, (e) ${{\textbf{K}}_{{\rm bag}}}$, (f) ${{\textbf{K}}_{{\rm MS} - {\rm bag}}}$, (g) ${{\textbf{K}}_{{\rm LRR}}}$, and (h) ${{\textbf{K}}_{{\rm S} - {\rm LRR}}}$.


The results for the University of Pavia data set also confirm the effectiveness of the proposed method. Gains in OA of about 21.17%, 24.45%, 32.07%, 17.56%, 12.62%, and 6.56% are observed in comparison with ${{\textbf{K}}_{{\rm LLE}}}$, ${{\textbf{K}}_{{\ell _1}}}$, ${{\textbf{K}}_{\rm RBF}}$, ${{\textbf{K}}_{{\rm S} \text {-} {\rm bag}}}$, ${{\textbf{K}}_{{\rm S} \text {-} {\rm MSbag}}}$, and ${{\textbf{K}}_{{\rm LRR}}}$, respectively, for 10 labeled samples per class. The OA (%) with respect to a varying number of labeled samples for the different kernel construction methods is shown in Fig. 1(b). The proposed method outperforms the compared methods, and the classification performance of all methods increases as more labeled samples become available.

Figure 2 shows the best classification maps of the different kernel methods using the SVM classifier for the Indian Pines data set (top row) and the University of Pavia data set (bottom row), where only $l = 80$ samples per class and $u = 1000$ unlabeled samples were used. Inspection of these classification maps confirms the numerical results in Table 3: the proposed method better integrates local and global information in kernel construction and produces smoother classification maps. This confirms that the proposed semisupervised algorithm can be very useful when a low number of labeled samples is available.

D. Importance of LRR Kernels

The experimental results obtained by training the SVM using ${{\textbf{K}}_{\rm LRR}}$ only, with different numbers of labeled samples, are reported in Table 3 for both data sets. In this step of our experiments, only the low-rank representation coefficients that reveal the membership of the samples (in an unsupervised way) are used to train the SVM. It can be observed from Table 3 that using ${{\textbf{K}}_{\rm LRR}}$ alone led to better results than the standard RBF kernel when few labeled samples are available. This confirms the interest of integrating unsupervised information when few labeled samples are available. Also, the results on both data sets show that the performance of ${{\textbf{K}}_{\rm LRR}}$ alone is comparable with the bagged kernel and the MS-bagged kernel. This is due to the advantages of low-rank representation in comparison with the $k$-means and EM-GMM clustering algorithms. It also provides higher performance than the LLE and sparse methods.

E. Sensitivity to the Weight Parameter

Our experiments showed that when only a few labeled samples are available, the weight of ${{\textbf{K}}_{\rm LRR}}$ in Eq. (6) needs to be larger to obtain accurate classification results. Indeed, the parameter $\rho$ can be small in limited-labeled-sample situations with sufficient unlabeled samples. This is because the kernel obtained by solving Eq. (5) is able to capture useful information about the data manifold; however, as $u$ increases with a constant number of labeled samples, the improvement saturates. It was observed that with more labeled samples, the effect of ${{\textbf{K}}_{\rm LRR}}$ needs to be lower. This suggests that the proposed semisupervised algorithm can be very useful when a low number of labeled samples is available. In addition, it confirms the advantages of the proposed ${{\textbf{K}}_{\rm LRR}}$ in comparison with ${{\textbf{K}}_{{\rm S} \text {-} {\rm bag}}}$ and ${{\textbf{K}}_{{\rm S} \text {-} {\rm MSbag}}}$.

F. Sensitivity to the Number of Unlabeled Samples

Selecting the number of unlabeled samples plays an important role in the construction of ${{\textbf{K}}_{\rm LRR}}$ and affects the classification performance obtained using ${{\textbf{K}}_{\rm LRR}}$ with the SVM classifier. Too few unlabeled samples are not sufficient to exploit the local and global information of the data, while too many unlabeled samples lead to problems in time complexity and memory space [36]. To investigate the sensitivity of the proposed method to the number of unlabeled samples, 20 labeled samples from each class were selected to compose ${{\textbf{X}}_{\rm labeled}}$, and the number of unlabeled samples was varied from $u = 500$ to 5000 in steps of 500. The best results for the Indian Pines and University of Pavia data sets are reported in Fig. 3. As shown in Fig. 3, the results (OA%) increase at first, and then the improvement saturates as more unlabeled samples are used to construct ${{\textbf{K}}_{\rm LRR}}$ and train the SVM.


Fig. 3. OA% versus the number of unlabeled samples for both data sets.


G. Computational Cost

Here, the computational costs of the proposed method and the other selected kernel construction methods are presented. The complexity of the different kernels is as follows: ${{\textbf{K}}_{\rm LLE}}$: ${\cal O}( {D{N^2}} ) + {\cal O}( {DN{k^3}} ) + {\cal O}( {D{N^2}} )$; ${{\textbf{K}}_{{\ell _1}}}$: ${\cal O}( {{D^2} + DN} )$; ${{\textbf{K}}_{\rm RBF}}$: ${\cal O}( {{D^2}N} )$; ${{\textbf{K}}_{{\rm S} \text {-} {\rm bag}}}$: ${\cal O}( {tkND} )$; ${{\textbf{K}}_{{\rm S} \text {-} {\rm MSbag}}}$: ${\cal O}( {NG{D^3}} )$; ${{\textbf{K}}_{\rm LRR}}$: ${\cal O}( {k{N^2}r} )$; ${{\textbf{K}}_{{\rm S} \text{-} {\rm LRR}}}$: ${\cal O}( {k{N^2}r + {D^2}N} )$, where $D$ is the dimensionality, $N$ is the total number of samples, $t$ denotes the number of iterations of the $k$-means clustering algorithm, $k$ is the number of clusters, $r$ is the rank, and $G$ denotes the number of Gaussian components in the ${{\textbf{K}}_{{\rm S} \text{-} {\rm MSbag}}}$ procedure.

5. CONCLUSION

This paper proposed a semisupervised kernel construction method for classification of HSI data. The proposed kernel was built by combining the standard RBF kernel, based on the available labeled information, and the LRR kernel, obtained using all available data (labeled and unlabeled samples). The LRR kernel can overcome the difficulties of the clustering methods that have been used for kernel construction, such as $k$-means for constructing the bagged kernel and EM-GMM for the MS-bagged kernel. The experimental results demonstrate that the proposed method is very effective and useful for semisupervised classification where limited labeled samples are available.

Disclosures

The authors declare that there are no conflicts of interest related to this paper.

REFERENCES

1. L. Fang, S. Li, X. Kang, and J. A. Benediktsson, “Spectral–spatial classification of hyperspectral images with a superpixel-based discriminative sparse model,” IEEE Trans. Geosci. Remote Sens. 53, 4186–4201 (2015). [CrossRef]  

2. D. H. Foster and K. Amano, “Hyperspectral imaging in color vision research: tutorial,” J. Opt. Soc. Am. A 36, 606–627 (2019). [CrossRef]  

3. L. Pan, H. Li, H. Meng, W. Li, Q. Du, and W. J. Emery, “Hyperspectral image classification via low-rank and sparse representation with spectral consistency constraint,” IEEE Geosci. Remote Sens. Lett. 14, 2117–2121 (2017). [CrossRef]  

4. W. Du, M. Lv, Q. Hou, and L. Jing, “Semisupervised dimension reduction based on pairwise constraint propagation for hyperspectral images,” IEEE Geosci. Remote Sens. Lett. 13, 1880–1884 (2016). [CrossRef]  

5. X. Jin, Y. Gu, and T. Liu, “Intrinsic image recovery from remote sensing hyperspectral images,” IEEE Trans. Geosci. Remote Sens. 57, 224–238 (2019). [CrossRef]  

6. J. Li, P. R. Marpu, A. Plaza, J. M. Bioucas-Dias, and J. A. Benediktsson, “Generalized composite kernel framework for hyperspectral image classification,” IEEE Trans. Geosci. Remote Sens. 51, 4816–4829 (2013). [CrossRef]  

7. P. Quesada-Barriuso, F. Argüello, and D. B. Heras, “Spectral–spatial classification of hyperspectral images using wavelets and extended morphological profiles,” IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens. 7, 1177–1185 (2014). [CrossRef]  

8. G. Camps-Valls and L. Bruzzone, “Kernel-based methods for hyperspectral image classification,” IEEE Trans. Geosci. Remote Sens. 43, 1351–1362 (2005). [CrossRef]  

9. S. A. Ahmadi, N. Mehrshad, and S. M. Razavi, “Semisupervised dimensionality reduction for hyperspectral images based on the combination of semisupervised learning and metric learning,” Imag. Sci. J. 66, 320–327 (2018). [CrossRef]  

10. F. Melgani and L. Bruzzone, “Classification of hyperspectral remote sensing images with support vector machines,” IEEE Trans. Geosci. Remote Sens. 42, 1778–1790 (2004). [CrossRef]  

11. Y. Gu and K. Feng, “Optimized Laplacian SVM with distance metric learning for hyperspectral image classification,” IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens. 6, 1109–1117 (2013). [CrossRef]  

12. X. Jia, B. Kuo, and M. M. Crawford, “Feature mining for hyperspectral image classification,” Proc. IEEE 101, 676–697 (2013). [CrossRef]  

13. M. Borhani and H. Ghassemian, “Kernel multivariate spectral–spatial analysis of hyperspectral data,” IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens. 8, 2418–2426 (2015). [CrossRef]  

14. J. Arenas-Garcia, K. B. Petersen, G. Camps-Valls, and L. K. Hansen, “Kernel multivariate analysis framework for supervised subspace learning: a tutorial on linear and kernel multivariate methods,” IEEE Signal Process. Mag. 30(4), 16–29 (2013). [CrossRef]  

15. E. Izquierdo-Verdiguier, L. Gómez-Chova, L. Bruzzone, and G. Camps-Valls, “Semisupervised kernel feature extraction for remote sensing image analysis,” IEEE Trans. Geosci. Remote Sens. 52, 5567–5578 (2014). [CrossRef]  

16. Z. Xue, P. Du, J. Li, and H. Su, “Simultaneous sparse graph embedding for hyperspectral image classification,” IEEE Trans. Geosci. Remote Sens. 53, 6114–6133 (2015). [CrossRef]  

17. A. Soltani-Farani, H. R. Rabiee, and S. A. Hosseini, “Spatial-aware dictionary learning for hyperspectral image classification,” IEEE Trans. Geosci. Remote Sens. 53, 527–541 (2015). [CrossRef]  

18. Y. Chen, H. Jiang, C. Li, X. Jia, and P. Ghamisi, “Deep feature extraction and classification of hyperspectral images based on convolutional neural networks,” IEEE Trans. Geosci. Remote Sens. 54, 6232–6251 (2016). [CrossRef]  

19. R. Luo, W. Liao, X. Huang, Y. Pi, and W. Philips, “Feature extraction of hyperspectral images with semisupervised graph learning,” IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens. 9, 4389–4399 (2016). [CrossRef]  

20. W. Liao, R. Bellens, A. Pizurica, W. Philips, and Y. Pi, “Classification of hyperspectral data over urban areas using directional morphological profiles and semi-supervised feature extraction,” IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens. 5, 1177–1190 (2012). [CrossRef]  

21. B. Mojaradi, H. Abrishami-Moghaddam, M. J. V. Zoej, and R. P. W. Duin, “Dimensionality reduction of hyperspectral data via spectral feature extraction,” IEEE Trans. Geosci. Remote Sens. 47, 2091–2105 (2009). [CrossRef]  

22. L. Bruzzone, M. Chi, and M. Marconcini, “A novel transductive SVM for semisupervised classification of remote-sensing images,” IEEE Trans. Geosci. Remote Sens. 44, 3363–3373 (2006). [CrossRef]  

23. L. Huo, P. Tang, Z. Zhang, and D. Tuia, “Semisupervised classification of remote sensing images with hierarchical spatial similarity,” IEEE Geosci. Remote Sens. Lett. 12, 150–154 (2015). [CrossRef]  

24. D. Tuia and G. Camps-Valls, “Semisupervised remote sensing image classification with cluster kernels,” IEEE Geosci. Remote Sens. Lett. 6, 224–228 (2009). [CrossRef]  

25. J. Chen and J. Yang, “Robust subspace segmentation via low-rank representation,” IEEE Trans. Cybern. 44, 1432–1445 (2014). [CrossRef]  

26. L. Fei, Y. Xu, X. Fang, and J. Yang, “Low rank representation with adaptive distance penalty for semi-supervised subspace classification,” Pattern Recogn. 67, 252–262 (2017). [CrossRef]  

27. E. J. Candes and Y. Plan, “Matrix completion with noise,” Proc. IEEE 98, 925–936 (2010). [CrossRef]  

28. E. J. Candès and B. Recht, “Exact matrix completion via convex optimization,” Found. Comput. Math. 9, 717–772 (2009). [CrossRef]  

29. P. Li, J. Yu, M. Wang, L. Zhang, D. Cai, and X. Li, “Constrained low-rank learning using least squares-based regularization,” IEEE Trans. Cybern. 47, 4250–4262 (2017). [CrossRef]  

30. X. Wang and F. Liu, “Weighted low-rank representation-based dimension reduction for hyperspectral image classification,” IEEE Geosci. Remote Sens. Lett. 14, 1938–1942 (2017). [CrossRef]  

31. X. Lu, Y. Wang, and Y. Yuan, “Graph-regularized low-rank representation for destriping of hyperspectral images,” IEEE Trans. Geosci. Remote Sens. 51, 4009–4018 (2013). [CrossRef]  

32. Z. Lin, R. Liu, and Z. Su, “Linearized alternating direction method with adaptive penalty for low-rank representation,” in 24th International Conference on Neural Information Processing Systems, Granada, Spain (Curran Associates Inc., 2011), pp. 612–620.

33. Y. Motai, “Kernel association for classification and prediction: a survey,” IEEE Trans. Neural Netw. Learn. Syst. 26, 208–223 (2015). [CrossRef]  

34. S. Yang, S. Yan, C. Zhang, and X. Tang, “Bilinear analysis for kernel selection and nonlinear feature extraction,” IEEE Trans. Neural Netw. 18, 1442–1452 (2007). [CrossRef]  

35. M. Belkin, P. Niyogi, and V. Sindhwani, “Manifold regularization: a geometric framework for learning from labeled and unlabeled examples,” J. Mach. Learn. Res. 7, 2399–2434 (2006).

36. J. Xia, J. Chanussot, P. Du, and X. He, “(Semi-) supervised probabilistic principal component analysis for hyperspectral remote sensing image classification,” IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens. 7, 2224–2236 (2014). [CrossRef]  
