
SAHIS-Net: a spectral attention and feature enhancement network for microscopic hyperspectral cholangiocarcinoma image segmentation


Abstract

Cholangiocarcinoma (CCA) poses a significant clinical challenge due to its aggressive nature and poor prognosis. While traditional diagnosis relies on color-based histopathology, hyperspectral imaging (HSI) offers rich, high-dimensional data with the potential for more accurate diagnosis. However, extracting meaningful insights from this data remains challenging. This work investigates the application of deep learning for CCA segmentation in microscopic HSI images and introduces two novel neural networks: (1) Histogram Matching U-Net (HM-UNet) for efficient image pre-processing, and (2) Spectral Attention based Hyperspectral Image Segmentation Net (SAHIS-Net) for CCA segmentation. SAHIS-Net integrates a novel Spectral Attention (SA) module for adaptively weighting spectral information, an improved attention-aware feature enhancement (AFE) mechanism to provide the model with more discriminative features, and a multi-loss training strategy for effective early-stage feature extraction. We compare SAHIS-Net against several general and CCA-specific models, demonstrating its superior performance in segmenting CCA regions. These results highlight the potential of our approach for segmenting medical HSI images.

© 2024 Optica Publishing Group under the terms of the Optica Open Access Publishing Agreement

1. Introduction

Cholangiocarcinoma (CCA), a highly lethal adenocarcinoma of the hepatobiliary system, poses a significant clinical challenge due to its aggressive nature and late-stage presentation [1]. While curative options like surgery and liver transplantation exist for select patients, 5-year survival rates remain abysmally low [2]. The insidious nature of early-stage CCA often delays diagnosis until advanced stages, significantly impacting operability and prognosis. Consequently, early detection has become a paramount concern for clinicians [3]. Imaging modalities such as endoscopic ultrasound [4] and Magnetic Resonance Imaging [5] play a crucial role in preoperative diagnosis, but definitive confirmation demands microscopic pathological examination, the current gold standard [6]. Yet, this process is both laborious and time-consuming [7].

Traditional histopathological examination relies on color-based pathological images, which offer limited information about tissue composition and pathology. Hyperspectral imaging (HSI) captures hundreds or even thousands of spectral bands, providing rich information about the tissues [8]. Recent research has explored the potential of HSI in revolutionizing the diagnosis of various cancers, including CCA [9], brain tumors [10], and thyroid and salivary gland tumors [11]. HSI holds significant potential for more accurate diagnoses by leveraging the unique spectral signatures of healthy and diseased tissues. However, HSI systems are inherently more complex and expensive than their traditional counterparts, making image acquisition particularly challenging in the medical setting [12]. Additionally, extracting meaningful insights from a wealth of spectral information is not as straightforward as from conventional images. Decoding the spectral information often requires advanced algorithms and machine learning techniques to identify subtle differences in the spectral signatures [13]. These challenges are reflected in the limited number of studies addressing medical image segmentation with HSI data [14].

While previous works have explored support vector machines (SVM) for organ segmentation in hyperspectral medical images [15], the advent of deep learning has spurred a revolution. HLCA-UNet [16], a recent network for CCA segmentation, employs hierarchical feature extraction and a channel attention mechanism to deliver encoder features to the decoder. Similarly, the multi-scale U-Net [17] effectively utilizes hyperspectral information for tumor discrimination in ex vivo breast samples, holding potential for real-time margin assessment during surgery.

Certain bands in HSI images may exhibit low contrast and noise, obscuring vital details, while others may carry redundant information, inflating data size and processing complexity [18]. Spectral channel enhancement emerges as a key technique to address these issues, boosting the usability and interpretability of HSI data. Squeeze-and-excitation (SE) networks [19] employ global average pooling to condense feature maps into a vector, followed by two fully connected layers to generate channel-wise attention scores. The Convolutional Block Attention Module (CBAM) [20] further refines this concept by introducing an additional max-pooling descriptor; both descriptors are fed into a shared MLP to generate the final channel attention map. ECA-Net [21] proposes efficient channel attention (ECA), utilizing a single 1D convolution of kernel size $k$ to compute attention scores, maintaining local cross-channel interaction with improved efficiency.

Despite the advancements achieved by the aforementioned networks, limitations persist. Traditional machine learning approaches often struggle to match the efficacy of deep learning methods. HLCA-UNet is trained on full-resolution images and incurs substantial training and inference time. Directly feeding hyperspectral images into the multi-scale U-Net can limit the efficient extraction of relevant information. Additionally, while SE and CBAM achieve impressive results, their structural complexity surpasses that of ECA. Furthermore, ECA exhibits significant sensitivity to initialization: with a typical learning rate of 0.001, the learned weights often remain closely tied to their initial values, hindering adaptability.

This work makes three main contributions. First, we introduce a two-branch U-Net architecture named Histogram Matching U-Net (HM-UNet) for efficient batch preprocessing of images. HM-UNet achieves a remarkable $6.6 \times$ speedup compared to the conventional histogram matching (HM) method, significantly improving the efficiency of image preprocessing pipelines. Second, we introduce a novel Spectral Attention (SA) module. This approach leverages the simplicity and efficiency of $1\times 1$ depthwise convolution combined with a cosine annealing learning rate (CLR) schedule, offering a simpler and more efficient alternative to existing spectral weighting methods while maintaining good performance. Finally, we present the Spectral Attention based Hyperspectral Image Segmentation Net (SAHIS-Net), a spectral attention and feature enhancement network specifically designed for microscopic hyperspectral CCA image segmentation. SAHIS-Net integrates our novel SA module, the improved attention-aware feature enhancement (AFE) module, and a multi-loss training strategy. SAHIS-Net achieves superior segmentation accuracy on the multidimensional choledoch database [9] compared to existing methods, contributing to improved diagnosis and treatment of CCA.

The rest of the paper is organized as follows. The proposed method is described in detail in Sec. 2. In Sec. 3, the experimental results of the proposed networks are presented and discussed. Sec. 4 concludes the paper.

2. Methods

2.1 Image preprocessing

The original hyperspectral image exhibits a non-linear brightness trend across its bands, transitioning from dark to bright and back to dark as the band index increases, as depicted in Figs. 5 and 6. This characteristic poses a challenge for neural networks seeking to effectively utilize information from both early and late bands. To address this issue, we propose a two-branch U-Net architecture for batch image pre-processing based on histogram matching.

Histogram matching is a technique that transforms an image’s intensity distribution to match that of a reference image. This process entails three key steps: (1) calculating the histograms of both the source and reference images; (2) identifying corresponding intensity levels in each histogram with equal cumulative distribution function (CDF) values; and (3) applying the derived mapping function to each pixel of the source image. The pseudo code for histogram matching is given in Algorithm 1.

Algorithm 1. Histogram Matching
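For concreteness, the following is a minimal NumPy sketch of these three steps for a single 8-bit band. It illustrates the procedure summarized in Algorithm 1; function and variable names are ours, not the authors' implementation.

```python
import numpy as np

def histogram_match(source: np.ndarray, reference: np.ndarray) -> np.ndarray:
    """Remap the gray levels of `source` so its histogram matches `reference`.

    Both inputs are assumed to be single-band uint8 images (values 0-255).
    """
    # Step 1: histograms of the source and reference images.
    src_hist, _ = np.histogram(source.ravel(), bins=256, range=(0, 256))
    ref_hist, _ = np.histogram(reference.ravel(), bins=256, range=(0, 256))

    # Step 2: normalized cumulative distribution functions (CDFs), then for
    # each source gray level find the reference level with the closest CDF.
    src_cdf = np.cumsum(src_hist) / source.size
    ref_cdf = np.cumsum(ref_hist) / reference.size
    mapping = np.searchsorted(ref_cdf, src_cdf).clip(0, 255).astype(np.uint8)

    # Step 3: apply the derived mapping to every pixel via a lookup table.
    return mapping[source]
```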

HM-UNet, as depicted in Fig. 1, features four layers and diverges from the typical U-Net [22] by incorporating two encoding branches. These branches independently process the source and reference images, extracting relevant features at each level. The feature maps are then concatenated at corresponding levels before being fed into the decoding branch, which ultimately generates the intensity-transformed image. To train the model, we extracted 12 equally spaced bands (e.g., the 1st, 5th, 10th, $\cdots$, 55th) from each hyperspectral image and their corresponding histogram-matched counterparts as source images and desired outputs, respectively. The 27th band served as the reference image for each pair and provided the reference histogram for matching. This choice is justified by two factors. As depicted in Fig. 6, this band possesses the highest average grayscale value, indicating a brighter image with potentially richer detail. Additionally, it exhibits the highest image contrast, which facilitates the differentiation of fine structures. HM-UNet was trained for 100 epochs using the Adam optimizer with the mean squared error (MSE) loss function to reconstruct the histogram-matched images from their original counterparts.
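A sketch of how these training triples could be assembled is shown below, reusing the `histogram_match` function from the previous sketch. The cube shape, the 0-based band indices, and the function name are our assumptions.

```python
import numpy as np

def build_hm_unet_samples(cube: np.ndarray, ref_band: int = 26):
    """Build (source, reference, target) training triples from one HSI cube.

    `cube` is assumed to be a (H, W, 60) uint8 hyperspectral image; indices
    are 0-based, so the paper's 27th band is index 26 and the 12 sampled
    bands below correspond to the 1st, 5th, 10th, ..., 55th bands.
    """
    band_idx = [0, 4, 9, 14, 19, 24, 29, 34, 39, 44, 49, 54]
    reference = cube[..., ref_band]
    sources = [cube[..., i] for i in band_idx]
    # Desired outputs: histogram-matched counterparts of each source band.
    targets = [histogram_match(s, reference) for s in sources]
    return (np.stack(sources),
            np.stack([reference] * len(sources)),
            np.stack(targets))
```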

Fig. 1. Architecture of HM-UNet.

While extending HM-UNet to accommodate 60-band inputs and outputs with corresponding loss functions presents the possibility of end-to-end joint training with the segmentation network, this approach was not pursued in the current study. This decision considered the computational demands of the segmentation network and the availability of pre-processed images. Notably, pre-processing the images once streamlines the workflow by eliminating the need for redundant generation during training.

2.2 SAHIS-Net architecture

2.2.1 Overall architecture

This work introduces SAHIS-Net, a spectral attention and feature enhancement network built upon our previously proposed “2K-Fold-Net” structure and EF$^3$-Net [23]. The overall architecture of SAHIS-Net is depicted in Fig. 2. SAHIS-Net leverages a 4-layer “4-Fold-Net” structure consisting of two nested sub-U-Nets. Each double convolution block (DC) employs two consecutive convolutional layers with $3\times 3$ kernel size and stride 1. Batch normalization (BN) and ReLU activation follow each convolutional layer. The input first passes through an SA module, which assigns distinct weights to each spectral band to prioritize informative channels. To focus on crucial information and bridge the semantic gap between the encoder and decoder features [24], improved AFE modules are embedded within and between the two sub-U-Nets. The first sub-U-Net generates a coarse output via a $1\times 1$ convolutional layer with Sigmoid activation. This output is then combined with the spectral-weighted input and fed into the second sub-U-Net for further refinement. To enhance feature learning in the early stages of the network, both the coarse and final outputs are trained with the binary cross-entropy (BCE) loss function, weighted at a Coarse $:$ Final ratio of $0.4:0.6$. In our method, the number of filters in the first encoder is set to 22.

Fig. 2. Architecture of SAHIS-Net.

2.2.2 Spectral attention

Deep neural networks often employ channel attention modules to achieve spectral attention by selectively focusing on informative channels. In this work, we propose a novel approach that leverages the simplicity and efficiency of $1\times 1$ depthwise convolution coupled with CLR.

Depthwise convolution offers several advantages for our purposes. Firstly, depthwise convolution applies a single spatial filter to each input channel without mixing information across channels. This inherent channel separation makes it highly suitable for spectral attention tasks. Secondly, depthwise convolution is computationally efficient, reducing the number of parameters compared to SE networks. Notably, we do not employ bias in the depthwise convolution layer.

CLR provides a powerful mechanism for adjusting the learning rate over training. It starts with a high learning rate and gradually decreases it in a smooth, cosine-like fashion, culminating in a minimal learning rate near the end of training. This approach has been shown to improve training performance by preventing early stagnation and promoting convergence. In our implementation, we set the initial learning rate of CLR to 0.1 and decrease it to 0.001 over 100 epochs.

It should be noted that the CLR strategy was used only for optimizing the SA block; a fixed learning rate of 0.001 was used for optimizing the rest of the network.
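A minimal Keras sketch of the SA idea follows (the experiments in Sec. 3 use Keras with a TensorFlow backend). The layer name, optimizer choice, and steps-per-epoch figure are our assumptions; applying the cosine schedule to the SA weights alone, as described above, would in practice require a custom training loop with separate optimizers for the SA variables and the remaining variables.

```python
import tensorflow as tf

def spectral_attention(x):
    """SA block: a bias-free 1x1 depthwise convolution.

    With kernel_size=1 and no bias, each spectral band is scaled by a single
    learned weight, with no mixing of information across channels.
    """
    return tf.keras.layers.DepthwiseConv2D(
        kernel_size=1, use_bias=False, name="sa_depthwise")(x)

# Cosine annealing from 0.1 down to 0.001 over 100 epochs for the SA weights.
steps_per_epoch = 69  # illustrative: ~551 training images / batch size 8
sa_schedule = tf.keras.optimizers.schedules.CosineDecay(
    initial_learning_rate=0.1,
    decay_steps=100 * steps_per_epoch,
    alpha=0.01)  # floor = 0.01 * 0.1 = 0.001
sa_optimizer = tf.keras.optimizers.Adam(learning_rate=sa_schedule)
```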

2.2.3 Feature enhancement

This work presents a simplified version of the AFE [25] module employed in EF$^3$-Net. For channel-wise feature enhancement (CFE), we adopt the gMLP [26] strategy of initializing fully-connected layer weights near zero and biases to one, ensuring training stability. In the spatial-wise feature enhancement (SFE), the ReLU activation between the two convolutional layers is removed for streamlined processing. Notably, the AFE module no longer concatenates the enhanced features with the original input; instead, it combines CFE and SFE sequentially, followed by residual addition with the original input. Given the 4-layer “4-Fold-Net” structure and relatively limited feature channels of SAHIS-Net, we embed the improved AFE modules in the skip connections within and between the two sub-U-Nets, rather than introducing different modules in each layer as in EF$^3$-Net.

The architecture of the AFE module and its sub-modules are depicted in Fig. 3 and mathematically represented as follows:

$$\boldsymbol{Y}_{CFE} = \boldsymbol{U} \otimes \alpha_{sig}\left\{ \text{FC}\left[ \alpha_{ReLu} \left( \text{FC}( \text{AvePool}(\boldsymbol{U}) ) \right) \right] \right\}.$$
$$\boldsymbol{Y}_{SFE} = \boldsymbol{U} \circledast \alpha_{sig}\left\{ \text{DConv}_{3\times3}\left[ \text{Conv}_{3\times3}(\boldsymbol{U}) \right] \right\}.$$
$$\boldsymbol{Y}_{AFE} = \boldsymbol{U} \oplus f_{SFE}\left[ f_{CFE}(\boldsymbol{U})\right].$$

Here, $\boldsymbol {Y}$ and $\boldsymbol {U}$ are respectively the output and input. $\text {AvePool}(\cdot )$ and $\text {FC}(\cdot )$ respectively represent the global average-pooling and fully-connected layers. $\alpha _{sig}(\cdot )$ and $\alpha _{ReLu}(\cdot )$ are respectively the sigmoid and ReLU activation functions. The operator “$\otimes$” stands for channel-wise multiplication. $\text {Conv}_{3\times 3}(\cdot )$ and $\text {DConv}_{3\times 3}(\cdot )$ respectively represent the traditional and depth-wise convolutional layers with a kernel size of $3\times 3$. The operators “$\circledast$” and “$\oplus$” stand for pixel-wise multiplication and pixel-wise addition, respectively.
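A hedged Keras sketch of the three equations above is given below. The reduction ratio of the fully-connected layers and the exact initializer scale are our assumptions; the near-zero-weight, unit-bias initialization follows the gMLP-style scheme described earlier.

```python
import tensorflow as tf
from tensorflow.keras import layers

def cfe(u, reduction=4):
    """CFE (first equation above). The `reduction` ratio is an assumption."""
    c = u.shape[-1]
    init = dict(kernel_initializer=tf.keras.initializers.RandomNormal(stddev=1e-3),
                bias_initializer="ones")  # gMLP-style: weights near zero, biases one
    s = layers.GlobalAveragePooling2D()(u)                       # AvePool(U)
    s = layers.Dense(c // reduction, activation="relu", **init)(s)
    s = layers.Dense(c, activation="sigmoid", **init)(s)
    s = layers.Reshape((1, 1, c))(s)
    return layers.Multiply()([u, s])                             # channel-wise scaling

def sfe(u):
    """SFE (second equation above); no ReLU between the two convolutions."""
    a = layers.Conv2D(u.shape[-1], 3, padding="same")(u)         # Conv_3x3
    a = layers.DepthwiseConv2D(3, padding="same",
                               activation="sigmoid")(a)          # sigmoid(DConv_3x3)
    return layers.Multiply()([u, a])                             # pixel-wise product

def afe(u):
    """Improved AFE (third equation): CFE then SFE, plus a residual addition."""
    return layers.Add()([u, sfe(cfe(u))])
```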

Fig. 3. Architecture of the improved AFE and related modules: (a) CFE, (b) SFE and (c) AFE.

2.2.4 Multi-loss training

To guide the network towards effective early-stage feature extraction, SAHIS-Net employs a multi-loss training strategy. This approach tackles the challenge of vanishing gradients in deep networks by providing additional supervision at intermediate stages of the network. Specifically, two loss terms are calculated: one for the coarse output generated by the first sub-U-Net and another for the final output. By applying loss to the coarse output, we encourage the network to learn discriminative features early on, even before reaching the final stage. This helps to establish a strong foundation for subsequent processing and enhances the overall accuracy of the model.

The weights assigned to each loss term play a crucial role in balancing their influence during training. In this work, SAHIS-Net is trained with a Coarse $:$ Final weight ratio of $0.4:0.6$. We conduct experiments in Sec. 3.5 to compare the impact of different weight assignments on accuracy.
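In Keras, this weighting can be expressed directly at compile time; the sketch below assumes the model exposes two named outputs (the names are ours).

```python
import tensorflow as tf

# `sahis_net` is assumed to be a Keras model with two outputs named "coarse"
# (first sub-U-Net) and "final" (second sub-U-Net).
sahis_net.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
    loss={"coarse": "binary_crossentropy", "final": "binary_crossentropy"},
    loss_weights={"coarse": 0.4, "final": 0.6})

# Both outputs are supervised with the same ground-truth mask, e.g.:
# sahis_net.fit(x_train, {"coarse": y_train, "final": y_train}, ...)
```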

3. Experiments and results

3.1 Data preparation

The multidimensional choledoch database is provided by the Shanghai Key Laboratory of Multidimensional Information Processing, East China Normal University. It contains 880 scenes of multidimensional images captured from choledoch tissues stained with hematoxylin and eosin. Among these images, 689 scenes contain partial cancer areas, 49 scenes are entirely cancerous, and 142 scenes contain no cancer areas. The microscopic hyperspectral pathology images of CCA were acquired by a microscopy hyperspectral imaging (MHSI) system under a 20$\times$ objective. The whole system is shown in Fig. 4. The MHSI system includes an optical microscope, a halogen lamp as the light source, a color CCD, a grayscale scientific complementary metal-oxide-semiconductor (sCMOS) camera, an acousto-optic tunable filter (AOTF) adapter, an SPF Model AOTF driver, and a computer. Each hyperspectral image consists of $1280 \times 1024$ spatial pixels and 60 spectral bands ranging from 550 nm to 1000 nm. All images were manually annotated by experienced pathologists.

Fig. 4. Schematic diagram of the MHSI system.

In this work, we select the 689 images containing partial cancer areas to evaluate the performance of the methods. The images are resized to a resolution of $256 \times 192$.

3.2 Experimental settings

All the models were implemented in Keras with the TensorFlow 2.4 backend, on a desktop with an AMD R7 3700X CPU, 64 GB RAM, and an NVIDIA GTX 1080Ti GPU. We used the Adam optimizer with default settings (learning rate = 0.001, $\beta _{1}=0.9$, $\beta _{2}=0.999$, $\epsilon =10^{-8}$, decay = 0) and the binary cross-entropy (BCE) loss function to train all the segmentation models, i.e.,

$$\mathcal{L}_{BCE}(p,y) ={-}[y\times \log (p)+(1-y)\times \log(1-p)],$$
where $y$ is the true binary label and $p$ is the predicted probability of the positive class.

To better distinguish the performance of the models, data augmentation is not employed. Unless otherwise specified, the “He_normal” initializer was used to set the initial random weights.

The overall performance of the models was further assessed through $5$-fold cross-validation. The dataset was randomly partitioned into five roughly equal subsets; each subset served in turn as the validation set, while the remaining four were combined for training. This process was repeated five times, with each model trained for 100 epochs with a batch size of 8. Following each epoch, the model’s performance was assessed using the mIoU metric. The model achieving the highest mIoU value was then retained for subsequent evaluation on the remaining six metrics. This approach guaranteed that all seven metrics were derived from the same epoch, thereby providing a holistic perspective on the model’s overall performance.
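A sketch of this protocol is shown below. The array names, the model factory, and the random seed are our assumptions, and it assumes mIoU is registered as a Keras metric named `mean_io_u`.

```python
import tensorflow as tf
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=0)  # seed is an assumption
for fold, (train_idx, val_idx) in enumerate(kf.split(images)):
    model = build_model()  # assumed factory returning a freshly compiled model
    # Retain only the epoch with the highest validation mIoU.
    ckpt = tf.keras.callbacks.ModelCheckpoint(
        f"fold{fold}_best.h5", monitor="val_mean_io_u",
        mode="max", save_best_only=True)
    model.fit(images[train_idx], masks[train_idx],
              validation_data=(images[val_idx], masks[val_idx]),
              epochs=100, batch_size=8, callbacks=[ckpt])
```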

The image preprocessing U-Net and HM-UNet were trained using the Adam optimizer coupled with MSE loss function, i.e.,

$$\mathcal{L}_{MSE}(p,y) = \frac{1}{n} \sum_{i = 1}^{n} (p_i-y_i)^2,$$
where $n$ is the number of samples, and $p_i$ and $y_i$ are respectively the predicted and true values of the $i$-th sample.

We evaluate the model’s performance using MSE and Structural Similarity Index Measure (SSIM) to retain the best model.

3.3 Evaluation metrics

Image segmentation tasks are commonly evaluated using the mean intersection over union (mIoU) and Dice coefficient (Dice), both of which measure the overlap between the ground truth and the segmentation results. They are respectively defined as:

$$\text{mIoU}(A,B) = \frac{|A\cap{B}|}{|A\cup{B}|}, \,\, \text{Dice}(A,B) = \frac{2|A\cap{B}|}{|A|+|B|},$$
where $A,\, B$ are respectively the sets of ground truth and segmented pixels.

To better assess the accuracy of the size and shape of the segmented results, the average symmetric surface distance (ASSD) was used. The ASSD is the average of the distances between all points on the boundary of the predicted segmentation and all points on the boundary of the ground truth. It is defined as:

$$\text{ASSD}(P, G) = \frac{1}{|P| + |G|} \left( \sum_{p \in P} \min_{g \in G} \| p - g \| + \sum_{g \in G} \min_{p \in P} \| g - p \| \right),$$
where $P,\, G$ are respectively the sets of points on the boundary of the predicted segmentation and the ground truth; $\|p - q\|$ represents the Euclidean distance between pixels $p$ and $q$; and $|\cdot|$ represents the number of points in a set. Since the physical scale of the images is not available, we calculate the ASSD in units of pixels.
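One possible pixel-unit implementation of the ASSD (not the authors' code) uses binary erosion to extract mask boundaries and a KD-tree for the nearest-neighbor distances:

```python
import numpy as np
from scipy.ndimage import binary_erosion
from scipy.spatial import cKDTree

def assd(pred: np.ndarray, gt: np.ndarray) -> float:
    """Average symmetric surface distance between two binary masks, in pixels."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    # A mask's boundary is the mask minus its erosion.
    pb = np.argwhere(pred & ~binary_erosion(pred))
    gb = np.argwhere(gt & ~binary_erosion(gt))
    # Nearest boundary-to-boundary distances in both directions.
    d_pg, _ = cKDTree(gb).query(pb)  # min_g ||p - g|| for each p in P
    d_gp, _ = cKDTree(pb).query(gb)  # min_p ||g - p|| for each g in G
    return (d_pg.sum() + d_gp.sum()) / (len(pb) + len(gb))
```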

The area under the curve (AUC) is a metric that evaluates the ability of a model to distinguish between positive and negative instances. It is calculated from a receiver operating characteristic (ROC) curve, and a higher AUC indicates better discrimination ability between the two classes.

In addition to the metrics mentioned above, several other popular metrics were also used to evaluate the performance, including Accuracy, Precision and Recall. These metrics are respectively defined as:

$$\text{Accuracy} = \frac{\text{TP}+\text{TN}}{\text{TP}+\text{FP}+\text{TN}+\text{FN}},\,\,\text{Precision} = \frac{\text{TP}}{\text{TP}+\text{FP}},\,\, \text{Recall} = \frac{\text{TP}}{\text{TP}+\text{FN}},$$
where TP, TN, FP, FN represent true positive, true negative, false positive, and false negative, respectively.
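For binary masks, all of these overlap and confusion-matrix metrics reduce to simple pixel counts; a compact NumPy sketch follows (computed per image; means over folds are taken separately).

```python
import numpy as np

def overlap_metrics(pred: np.ndarray, gt: np.ndarray) -> dict:
    """IoU, Dice, Accuracy, Precision, and Recall for a pair of binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.sum(pred & gt)
    tn = np.sum(~pred & ~gt)
    fp = np.sum(pred & ~gt)
    fn = np.sum(~pred & gt)
    return {
        "IoU":       tp / (tp + fp + fn),
        "Dice":      2 * tp / (2 * tp + fp + fn),
        "Accuracy":  (tp + tn) / (tp + tn + fp + fn),
        "Precision": tp / (tp + fp),
        "Recall":    tp / (tp + fn),
    }
```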

3.4 Comparisons on data preprocessing methods

We compared our proposed HM-UNet preprocessing method with brightness enhancement and U-Net based processing.

The brightness enhancement method improves overall image brightness by leveraging the band with the highest average grayscale value, denoted $G_m$. We first calculated the average grayscale value $G_i$ of each remaining band $i$. A coefficient $r_i$ was then obtained by dividing $G_m$ by $G_i$. Finally, the grayscale value of each pixel in the $i$-th band was multiplied by its corresponding $r_i$, thus amplifying the brightness information.
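As a sketch, this per-band scaling amounts to the following (the cube shape is an assumption):

```python
import numpy as np

def brightness_enhance(cube: np.ndarray) -> np.ndarray:
    """Scale each band so its mean gray value matches the brightest band's.

    `cube` is assumed to be a (H, W, B) image with values in [0, 255].
    """
    means = cube.reshape(-1, cube.shape[-1]).mean(axis=0)  # G_i for each band
    r = means.max() / means                                # r_i = G_m / G_i
    return np.clip(cube * r, 0, 255).astype(cube.dtype)    # amplify each band
```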

The U-Net based method utilized a lightweight 4-layer U-Net architecture. For training purposes, we sampled 12 bands at equal intervals (e.g., the 1st, 5th, 10th, $\cdots$, 55th) from each hyperspectral image, alongside their corresponding brightness-enhanced counterparts generated by the aforementioned method. These image pairs formed the training dataset. The U-Net was trained for 100 epochs using the Adam optimizer with the MSE loss function to reconstruct the brightness-enhanced images from their original counterparts.

The effectiveness of each preprocessing method was evaluated by training a standard U-Net with 32 filters in the first layer on images processed by each method. A U-Net with a simple structure can better reflect the effect of the different methods on accuracy. Table 1 presents the results obtained through $5$-fold cross-validation. Comparative analysis reveals that both histogram matching and HM-UNet preprocessing yielded better performance, with an approximately 0.0378 increase in mIoU over the original baseline. Additionally, the HM-UNet preprocessing method significantly improved boundary segmentation accuracy, with a 2.64 decrease in ASSD compared to the baseline. Brightness enhancement also improved segmentation performance, while the U-Net-based method exhibited minimal improvement. Visualizations in Fig. 5 further support these findings. U-Net-processed images, particularly in the first and 50th bands, exhibited lower contrast than those preprocessed with other methods. Histogram matching enhanced both brightness and contrast compared to the brightness enhancement method. Additionally, compared to the histogram-matched images, images processed by HM-UNet exhibited reduced noise in the front and back bands.

Fig. 5. Comparison of results derived from different preprocessing methods. The numerical annotation indicates the corresponding spectral band for each column.

Fig. 6. Statistical analysis of hyperspectral image bands: average and standard deviation of grayscale values before and after histogram matching processing.

Table 1. Testing results of data preprocessing methods.

It should be pointed out that although preprocessing methods inevitably alter the original spectral information, they play a crucial role in highlighting important features and emphasizing relevant details that might be obscured in the raw data.

HM-UNet also demonstrated a significant speed advantage over the conventional HM method for hyperspectral image processing, handling a single scene in 216 milliseconds compared to 1.42 seconds, a $6.6\times$ speedup. A fair comparison was conducted by running both methods on the same desktop. However, a key distinction exists: HM utilized the CPU, while HM-UNet leveraged the GPU, which excels at parallel processing. Implementing traditional HM on a GPU is not straightforward due to limited support from existing libraries. Although the potential exists to optimize HM’s runtime through increased parallelization, data transfer between the CPU and GPU would likely remain a bottleneck. This factor, coupled with the inherently parallel nature of neural networks, suggests an inherent advantage for HM-UNet in terms of computational efficiency.

Furthermore, the HM-UNet achieved an MSE of 0.0010 and an SSIM of 0.9638 after normalization (division of each pixel’s grayscale value by 255). These quantitative metrics substantiate the efficacy of the HM-UNet for image processing tasks.

3.5 Multi-loss training

To investigate the optimal weight distribution for the multi-loss training strategy, we employed a “4-Fold-Net” [23] with BN added after each convolutional layer. Six weight distributions ranging from a Coarse $:$ Final ratio of $0:1.0$ to $0.8:0.2$ were evaluated to comprehensively explore the impact on training performance. Table 2 presents the detailed results obtained through $5$-fold cross-validation, while Fig. 7 depicts the change in mIoU with varying weight distributions.

Fig. 7. The trend in mIoU as the weight distribution changes.

Table 2. Testing results of different weights.

The $0.4:0.6$ weight distribution achieved the best overall performance for the final output, reaching 0.5763 in mIoU and 10.02 in ASSD. Meanwhile, the $0.5:0.5$ distribution produced the highest mIoU for the coarse output (0.5731). As shown in Fig. 7, except for this specific case, the final output consistently outperformed the coarse output across all weight distributions. Overall, supervising both the coarse and final outputs consistently led to superior final segmentation results compared to solely supervising the final output.

Based on these findings, we employed the $0.4:0.6$ weight distribution for training SAHIS-Net.

3.6 Ablation study

We undertook ablation studies to elucidate the contributions of our proposed designs. Recognizing that although individual modules can enhance accuracy, their interactions may be complex, we built upon our previously established architecture and refined it in subsequent stages. In this experiment, we set the kernel size $k$ of ECA to 3 due to the minimal spectral difference between adjacent bands; while larger $k$ values cover a wider range, they do not reflect the advantage of ECA’s low structural complexity. Since increasing model parameters generally enhances learning capacity, we constrained the parameter count to 1.3M $\pm 7{\%}$ for a fair comparison across designs. Table 3 presents the design procedures and the seven evaluation metrics obtained. We mainly focus on mIoU and ASSD, as they best illustrate performance improvements.

Table 3. Effectiveness of the designs; non-bold indicates that the design is not adopted.

The BaseNet was a 4-layer U-Net with BN. Integrating the SE, ECA, and SA modules individually demonstrated accuracy gains, with the SA module achieving the highest mIoU, AUC, accuracy, precision and recall, while ECA contributed most to ASSD reduction. As shown in Fig. 8, all three methods prioritized the first three bands, which correspond to the bands with the lowest average gray values in the original image, highlighting the importance of the proposed preprocessing method. For the last 20 bands, where noise increases and useful information diminishes, SE assigned high weights to some bands, while ECA and SA exhibited overall downward trends. The ECA method struggled with the low learning rate, resulting in minimal weight changes before and after training. In contrast, the CLR-based SA exhibited significant weight adjustments.

Fig. 8. Initial and trained weights of (a) SE, (b) ECA and (c) SA. (d) Representative images corresponding to the top six most informative bands obtained from SA.

Further incorporating the “2K-Fold-Net” structure and expanding the network into a W-shape significantly improved both mIoU ($+0.036$) and ASSD ($-3.36$). Implementing the multi-loss training strategy yielded further gains, raising mIoU by 0.0418 and lowering ASSD by 1.77. Finally, we evaluated the efficacy of AFE, improved AFE, and CBAM for feature enhancement. Both AFE variants outperformed CBAM in terms of mIoU and ASSD, with no significant difference between themselves. However, due to the simpler structure and lower training and inference cost of the improved AFE, we opted to incorporate it.

3.7 Comparison experiments and discussion

This section benchmarks SAHIS-Net against several prominent medical image segmentation networks, including general networks (U-Net, ResUnext [27], UNeXt [28], TransUNet [29], EF$^{3}$-Net and MAEF-Net [30]) and a specialized network (HLCA-UNet). While FLOPs are commonly used to gauge computational complexity, this metric alone often fails to accurately reflect actual training and inference speeds. Models with comparable FLOPs can exhibit significant runtime discrepancies [31] due to factors like memory access cost, fragmentation, target platform, and driver optimization [32]. To provide a more comprehensive evaluation, Table 4 presents the time costs of training and inference, alongside inference frames per second (FPS) measured at a batch size of 1, differing from the batch size of 8 used for training and inference speed evaluation. Additionally, model size and GPU RAM consumption are included for further comparison.

Table 4. Number of parameters ($N_p$), the training and inference time costs ($T_t$ and $T_i$, respectively, in milliseconds per step, ms/step), inference frames per second (FPS), model size, and GPU RAM consumption of the compared networks.

As shown in Table 4, SAHIS-Net outperformed EF$^3$-Net not only by reducing the number of parameters by over 80%, but also by reducing the training and inference time to 57.3% and 33.1%, respectively. While UNeXt performed best in training speed, HLCA-UNet performed best in inference time and FPS. Selecting the input bands and inputting only six of them improves the inference time and FPS of SAHIS-Net to a level comparable to HLCA-UNet.

The testing results are presented in Table 5. The results clearly demonstrate that SAHIS-Net outperformed all the other networks in most metrics, while U-Net achieved the best recall. Overall, hybrid architectures, which combine ConvNets with Transformers, MLPs, or attention mechanisms, outperformed pure convolutional networks such as U-Net, ResUnext and HLCA-UNet. UNeXt suffered from under-representation of complex features caused by having only one convolutional layer in each encoder and decoder stage. EF$^3$-Net and MAEF-Net achieved comparable results, while SAHIS-Net achieved about a 0.005 increase in mIoU. Selecting the input bands and inputting only six bands decreased mIoU by 0.006 and increased ASSD by 0.2.

Table 5. Testing results of different models.

To further assess the efficacy of SA, we selected the top six most informative bands (1st, 2nd, 3rd, 14th, 28th, and 31st) based on their band weights obtained from SA. Figure 8(d) visually demonstrates the distinct contributions of these bands through three sets of corresponding images. The first three bands (1st, 2nd, 3rd) exhibit enhanced background clarity. The 14th band highlights the cancer area and its outer boundary, while the 28th and 31st bands provide sharper visualization of the inner boundaries. These observations suggest that SA effectively identifies bands containing distinctive features relevant to segmentation.

These six bands were then input to SAHIS-Net, forming a variant denoted as “SAHIS-Net-6 bands”. This allowed us to directly compare the performance of SAHIS-Net with and without the SA module. While SAHIS-Net-6 bands exhibited inference time and FPS comparable to HLCA-UNet, it showed a minimal decrease in mIoU of 0.006 and an increase in ASSD of 0.2 compared to the full-band SAHIS-Net. These results suggest the efficacy of SA and offer a promising avenue for improving the efficiency of SAHIS-Net while maintaining acceptable accuracy.
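Reading the band weights out of a trained SA layer is straightforward; the sketch below assumes the layer name `sa_depthwise` from our earlier SA sketch and uses weight magnitude as a proxy for informativeness.

```python
import numpy as np

# The 1x1 depthwise kernel has shape (1, 1, n_bands, 1): one weight per band.
w = sahis_net.get_layer("sa_depthwise").get_weights()[0].ravel()
top6 = np.argsort(np.abs(w))[-6:]  # indices of the six largest-magnitude weights
print(sorted(top6 + 1))            # 1-based band numbers; the paper reports
                                   # bands 1, 2, 3, 14, 28, and 31
```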

To better illustrate the results, Fig. 9 presents representative segmentation outputs. In the first case, U-Net, ResUnext, MultiResUNet, UNeXt, and TransUNet exhibited misinterpretations in the upper left corner. EF$^3$-Net and MAEF-Net struggled with noise and inaccuracies in the lower left corner. HLCA-UNet exhibited over-segmentation within the cancerous region. SAHIS-Net delivered the most accurate result, free from both misinterpretations and noise. However, selectively inputting six bands to SAHIS-Net-6 bands slightly impacted edge detection. In the second case, U-Net, UNeXt, TransUNet, EF$^3$-Net, and MAEF-Net failed to separate the two distinct cancer areas. MultiResUNet segmented only the lower cancerous region. SAHIS-Net successfully distinguished the two areas but exhibited mild under-segmentation in the upper one. Again, restricting the input to six bands noticeably affected the boundary tracking of SAHIS-Net-6 bands.

Fig. 9. Qualitative comparisons. (a) RGB image corresponding to the input HSI image. (b) Ground Truth. Predictions of (c) U-Net, (d) ResUnext, (e) MultiResUNet, (f) UNeXt, (g) TransUNet, (h) EF$^{3}$-Net, (i) MAEF-Net, (j) HLCA-UNet, (k) SAHIS-Net and (l) SAHIS-Net-6 bands.

4. Conclusion

In this work, we first introduce HM-UNet, a two-branch architecture enabling efficient batch pre-processing of hyperspectral images. HM-UNet achieves a remarkable $6.6\times$ speedup compared to traditional methods, significantly improving computational efficiency. Second, we propose a novel spectral attention module that utilizes $1\times 1$ depthwise convolution and a CLR schedule to enhance the usability and interpretability of HSI data. This module effectively highlights relevant spectral features and improves model performance. Finally, the SA module, along with the improved AFE and the multi-loss training strategy, is integrated into SAHIS-Net, a deep learning architecture specifically designed for microscopic hyperspectral CCA image segmentation. SAHIS-Net demonstrates superior performance compared to both general and specialized networks, offering promising potential for improved CCA diagnosis.

However, limitations remain. First, the SA module currently requires manual band selection for optimal performance, limiting its automation potential. Second, while SAHIS-Net achieves superior segmentation results, there remains room for improvement. Third, compared to HLCA-UNet, SAHIS-Net exhibits a larger model size and slower training and inference times, and requires further computational optimization.

Future research directions include extending SAHIS-Net to handle other HSI segmentation tasks beyond CCA, leveraging the effectiveness of the SA module. Furthermore, incorporating boundary feature propagation mechanisms could potentially enhance SAHIS-Net’s ability to accurately track sinuous boundaries.

Acknowledgments

We thank Qingli Li and Qing Zhang at the Shanghai Key Laboratory of Multidimensional Information Processing, East China Normal University for generously providing the multidimensional Choledoch Database.

Disclosures

The authors declare that there are no conflicts of interest related to this article.

Data availability

The multidimensional Choledoch Database underlying the results presented in this paper is available at [9]. The code will be available at [33].

References

1. P. J. Brindley, M. Bachini, S. I. Ilyas, et al., “Cholangiocarcinoma,” Nat. Rev. Dis. Primers 7(1), 65 (2021). [CrossRef]  

2. S. A. Khan, H. C. Thomas, B. R. Davidson, et al., “Cholangiocarcinoma,” The Lancet 366(9493), 1303–1314 (2005). [CrossRef]  

3. B. Doherty, V. E. Nambudiri, and W. C. Palmer, “Update on the diagnosis and treatment of cholangiocarcinoma,” Curr. Gastroenterol. Rep. 19(1), 2–8 (2017). [CrossRef]  

4. A. Strongin, H. Singh, M. A. Eloubeidi, et al., “Role of endoscopic ultrasonography in the evaluation of extrahepatic cholangiocarcinoma,” Endosc. Ultrasound 2(2), 71 (2013). [CrossRef]  

5. K. S. Jhaveri and H. Hosseini-Nik, “MRI of cholangiocarcinoma,” J. Magn. Reson. Imaging 42(5), 1165–1179 (2015). [CrossRef]  

6. L. Sun, M. Zhou, Q. Li, et al., “Diagnosis of cholangiocarcinoma from microscopic hyperspectral pathological dataset by deep convolution neural networks,” Methods 202, 22–30 (2022). [CrossRef]  

7. S. Aishima, Y. Kubo, Y. Tanaka, et al., “Pathological features and prognosis of combined hepatocellular and cholangiocarcinoma by world health organization classification,” in Lab. Invest., vol. 93 (2013), pp. 396A–397A.

8. M. A. Calin, S. V. Parasca, D. Savastru, et al., “Hyperspectral imaging in the medical field: Present and future,” Appl. Spectrosc. Rev. 49(6), 435–447 (2014). [CrossRef]  

9. Q. Zhang, Q. Li, G. Yu, et al., “A multidimensional choledoch database and benchmarks for cholangiocarcinoma diagnosis,” IEEE Access 7, 149414–149421 (2019). [CrossRef]  

10. H. Fabelo, S. Ortega, A. Szolna, et al., “In-vivo hyperspectral human brain image database for brain cancer detection,” IEEE Access 7, 39098–39116 (2019). [CrossRef]  

11. M. Halicek, J. D. Dormer, J. V. Little, et al., “Tumor detection of the thyroid and salivary glands using hyperspectral imaging and deep learning,” Biomed. Opt. Express 11(3), 1383–1400 (2020). [CrossRef]  

12. S. Ortega, M. Halicek, H. Fabelo, et al., “Hyperspectral and multispectral imaging in digital and computational pathology: a systematic review,” Biomed. Opt. Express 11(6), 3195–3233 (2020). [CrossRef]  

13. Y. Zhang, S. Yu, X. Zhu, et al., “Explainable liver tumor delineation in surgical specimens using hyperspectral imaging and deep learning,” Biomed. Opt. Express 12(7), 4510–4529 (2021). [CrossRef]  

14. S. Seidlitz, J. Sellner, J. Odenthal, et al., “Robust deep learning-based semantic organ segmentation in hyperspectral images,” Med. Image Anal. 80, 102488 (2022). [CrossRef]  

15. F. Cervantes-Sanchez, M. Maktabi, H. Köhler, et al., “Automatic tissue segmentation of hyperspectral images in liver and head neck surgeries using machine learning,” Artif. Intell. Surg. 1, 22–37 (2021). [CrossRef]  

16. H. Gao, M. Yang, X. Cao, et al., “A high-level feature channel attention UNet network for cholangiocarcinoma segmentation from microscopy hyperspectral images,” Mach. Vis. Appl. 34(5), 72 (2023). [CrossRef]  

17. E. Kho, B. Dashtbozorg, L. L. De Boer, et al., “Broadband hyperspectral imaging for breast tumor detection using spectral and spatial information,” Biomed. Opt. Express 10(9), 4496–4515 (2019). [CrossRef]  

18. B. Du, M. Zhang, L. Zhang, et al., “PLTD: Patch-based low-rank tensor decomposition for hyperspectral images,” IEEE Trans. Multimedia 19(1), 67–79 (2017). [CrossRef]  

19. J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (2018), pp. 7132–7141.

20. S. Woo, J. Park, J.-Y. Lee, et al., “CBAM: Convolutional block attention module,” in Proceedings of the European Conference on Computer Vision (ECCV), (2018), pp. 3–19.

21. Q. Wang, B. Wu, P. Zhu, et al., “ECA-Net: Efficient channel attention for deep convolutional neural networks,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (2020), pp. 11534–11542.

22. O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional networks for biomedical image segmentation,” in Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015, (Springer, 2015), pp. 234–241.

23. Y. Zhang and J. Dong, “2K-Fold-Net and feature enhanced 4-Fold-Net for medical image segmentation,” Pattern Recognit. 127, 108625 (2022). [CrossRef]  

24. N. Ibtehaz and M. S. Rahman, “MultiResUNet: Rethinking the U-Net architecture for multimodal biomedical image segmentation,” Neural Netw. 121, 74–87 (2020). [CrossRef]  

25. J. Ji, B. Zhong, and K.-K. Ma, “Image interpolation using multi-scale attention-aware inception network,” IEEE Trans. Image Process. 29, 9413–9428 (2020). [CrossRef]  

26. H. Liu, Z. Dai, D. So, et al., “Pay attention to MLPs,” in Advances in Neural Information Processing Systems, vol. 34 (2021), pp. 9204–9215.

27. C. Bled and F. Pitie, “Assessing advances in real noise image denoisers,” in Proceedings of the 19th ACM SIGGRAPH European Conference on Visual Media Production, (2022), pp. 1–9.

28. J. M. J. Valanarasu and V. M. Patel, “UNeXt: MLP-based rapid medical image segmentation network,” in International Conference on Medical Image Computing and Computer-Assisted Intervention, (Springer, 2022), pp. 23–33.

29. J. Chen, Y. Lu, Q. Yu, et al., “TransUNet: Transformers make strong encoders for medical image segmentation,” arXiv, arXiv:2102.04306 (2021). [CrossRef]  

30. Y. Zhang and J. Dong, “MAEF-Net: MLP attention for feature enhancement in U-Net based medical image segmentation networks,” IEEE J. Biomed. Health Inform. 28(2), 846–857 (2024). [CrossRef]  

31. W. Wen, C. Wu, Y. Wang, et al., “Learning structured sparsity in deep neural networks,” in Advances in Neural Information Processing Systems, vol. 29 (Curran Associates, Inc., 2016).

32. N. Ma, X. Zhang, H.-T. Zheng, et al., “ShuffleNet V2: Practical guidelines for efficient CNN architecture design,” in Proceedings of the European Conference on Computer Vision (ECCV), (2018), pp. 116–131.

33. Y. Zhang and J. Dong, “SAHIS-Net: a spectral attention and feature enhancement network for microscopic hyperspectral cholangiocarcinoma image segmentation: code,” Github, 2024, https://github.com/raik7/SAHIS-Net.
