STCS-Net: a medical image segmentation network that fully utilizes multi-scale information

Pengchong Ma; Pengchong Ma; Guanglei Wang; Guanglei Wang; Tong Li; Tong Li; Haiyang Zhao; Yan Li; Hongrui Wang

doi:10.1364/BOE.517737

1. Introduction

Medical image segmentation is a key research area in the field of medical image processing and analysis. It is dedicated to accurately delineating regions of special significance through semi-automatic or automated methods, extracting relevant features, and providing a reliable foundation for clinical diagnosis and pathological research to assist physicians in making more precise diagnoses [1]. Globally, medical image analysis remains a pivotal domain for diagnosing and treating major diseases, with computer-aided diagnosis drawing considerable attention in pathological research and clinical practice by relying on the results of medical image segmentation [2]. However, traditional manual analysis poses challenges such as time consumption, subjectivity, and error susceptibility, while classical segmentation algorithms face limitations in generalizing to complex datasets and are adversely affected by the quality of image acquisition [3]. Therefore, constructing robust and versatile models is crucial to achieving sufficient robustness on challenging images and applicability across various biomedical applications.

With the development of deep learning and the widespread application of various neural networks in the field of computer vision, image processing is experiencing continuous and vigorous growth. Since 2012, Convolutional Neural Networks (CNN) have been the predominant force in the field of computer vision, serving as feature extractors in various visual tasks [4]. In the domain of medical image segmentation, the encoder-decoder architecture has proven effective in restoring detailed information of segmented objects, gradually establishing itself as the benchmark in the medical image segmentation industry. The primary purpose of the convolution operations employed in the encoder-decoder structure is to extract local features from an image by collecting information from neighboring pixels. Typically, the stacking of convolutional layers and consecutive sampling operations continuously expands the receptive field to determine the rough boundaries of objects. Fully Convolutional Networks (FCN) [5] achieve end-to-end mapping from input images to output images. UNet [6] balances local information and contextual information by combining features from different hierarchical levels. Based on this, many variants of UNet have been proposed, such as UNet++ [7], ResUNet [8], UNet3+ [9], Ege-unet [10] etc. Furthermore, researchers have endeavored to introduce different structures into networks to obtain multi-scale information or leverage various attention mechanisms to highlight the most relevant information in feature maps. For instance, pyramid convolution utilizes differently sized convolutional kernels to extract multi-scale information, adapting to targets of different scales [11]. CENet [12] enhances the network's ability to process semantic information in deep feature maps by incorporating a multi-scale mechanism in the bottleneck part of the network. Attention UNet [13] integrates the concept of attention mechanisms into the encoder-decoder network, improving the network's capability to extract important features. Despite the considerable success of the UNet and its variants, these methods are based on convolution operations, exhibiting a strong inductive bias, resulting in locality and translation invariance, and facing challenges in building long-range dependencies. However, global information is equally crucial for the precise localization of medical image lesion areas or organ segmentation [14].

In recent years, Transformer has emerged as a novel network architecture and has achieved significant success in natural language processing. The introduction of Transformer into the field of computer vision, particularly through the Vision Transformer (ViT), has addressed the limited receptive field problem of CNNs [15]. The Transformer maps feature maps to three output feature vectors: Query (Q), Key (K), and Value (V). It utilizes multi-head self-attention mechanisms to allocate attention resources to V based on the matrix multiplication of Q and K, resulting in attention maps. Subsequently, many variants of Transformer-based networks have been developed. Swin Transformer [16] improves computational efficiency by using a sliding window approach. Transformer-XL [17] learns dependencies beyond fixed lengths without disrupting temporal consistency and resolves the context fragmentation issue. Image Transformer [18] excels in efficiently handling high-resolution input images, especially through the introduction of new relative position encoding methods. These methods all excel in capturing rich global information. However, due to the importance and complexity of medical image information, focusing solely on local or global information is insufficient [19–23]. Approaches like TransUnet and UCTransNet fuse CNN with Transformer to capture long-distance dependencies between features [24,25]. Nevertheless, the global information obtained by Transformer may contain redundant details. If not appropriately processed before integration with locally captured information by CNN, this can lead to a decline in model accuracy. This is particularly critical when dealing with segmented targets of varying sizes and shapes, requiring the network to establish robust long-range dependencies to effectively understand and capture the global structure and relationships of the targets.

Therefore, we propose a medical image segmentation architecture (STCS-Net) aimed at facilitating extensive interaction between local information and global contextual information in the decoder, thereby establishing long-range dependencies in the information. By obtaining global information through the encoder and utilizing the designed skip connection module, we capture local information from the global information in the encoder, enhancing feature representation. Additionally, leveraging the decoder emphasizes the importance of information interaction between channels, allowing the acquisition of mutually corroborated multi-scale information. This comprehensive approach fully utilizes diverse information, integrating local details captured by CNN with the global relationships established by Transformer across the entire image. This approach enables consideration of both global and local information at different levels, allowing the model to complement and correct information from different scales, thereby enhancing a comprehensive understanding of the image content and mitigating information loss to some extent. The integrated architecture design contributes to the improved extraction of the overall structure and semantic relationships of the image.

The main contributions of this paper are as follows:

1) This paper proposes a Similarity Attention Decoding Block (SADB), a class attention decoding module, designed to effectively capture information and emphasize boundary features in image segmentation tasks. This module utilizes adaptive pyramid convolution to gather spatial and channel information, where the pyramid convolution integrates multi-scale feature information. Additionally, the introduced class attention interaction structure among convolutions allows for interaction between channels. Furthermore, the parallel operation of depthwise convolution and the channel attention module (SE) further extracts features, enabling the network to better understand image content and thereby improve performance.
2) In this paper, an Information Enhancement Module (CDD) is introduced within the skip connections, aimed at highlighting regions of interest while suppressing irrelevant areas. It incorporates both local and global information, enabling the decoder to acquire diversified features. By alternating between regular convolution and depthwise convolution, this method introduces nonlinear transformation capabilities to some extent, enhancing the processing capability for image edge textures, thus achieving more precise localization of image boundaries. Additionally, the inclusion of dilated convolutions expands the receptive field, supporting a higher-level contextual understanding.
3) Based on the aforementioned points, this paper proposes a U-shaped network based on Transformer-CNN (STCS-Net). The encoder consists of the Transformer part, while the skip connections reconstruct feature maps from the encoder layer by layer to achieve information enhancement. The decoder then comprehensively utilizes information from both the encoder and skip connections to accurately output semantic segmentation maps.

2. Related work

2.1 Convolution-based image segmentation networks

With the development of deep learning in the field of computer vision, there is continuous innovation in medical image segmentation techniques. The purpose of medical image segmentation is to delineate specific regions in an image, providing a reliable basis for clinical diagnosis and pathological research. Since the introduction of CNN, they have become the mainstream in image processing, leveraging weight sharing in local receptive fields to achieve excellent performance with fewer parameters. In 2015, FCN [5] addressed the need for cropping or resizing input images in traditional CNN models. FCN replaced fully connected layers with fully convolutional layers, enabling the network to accept input images of any size. This architecture also allows the network to predict at the pixel level, achieving pixel-level semantic segmentation. Subsequently, structures like UNet [6], SegNet [26], and others, incorporating encoder-decoder and skip-connection structures, have excelled in image segmentation. They extract high-level semantic information during encoding and restore abstracted high-level semantic information to the input feature map size during decoding, enabling pixel-level classification. Skip connections link shallow feature maps with deep feature maps, facilitating information transfer across layers. Variants of UNet emerged, such as DenseNet [27], MultiResUNet [28], ResNet [29], which address gradient vanishing issues by modifying the skip-connection part and introducing different multiscale feature fusion mechanisms, as well as utilizing residual links that add a path from input to output. ResBCU-Net [30] further utilizes bidirectional long short-term memory networks to address the issue of gradient vanishing during training. BLA-Net [31] handles complex images by employing dynamic deformable convolutions and capturing multiscale information. Another noteworthy aspect is the introduction of attention mechanisms. SENet [32], proposed in 2018, effectively establishes inter-channel dependencies by compressing feature maps. Both CBAM [33] and BAM [34] combine channel attention and spatial attention to enhance the model's representational and perceptual capabilities regarding input data. In these networks, modeling is based on convolution, effectively acquiring and fusing local features through weight sharing and local receptive fields [6]. However, they fall short in utilizing global contextual information and cannot model long-range dependencies [15].

2.2 Transformer-based image segmentation networks

Despite the significant success of CNN in the field of computer vision, their limitations have constrained further development and widespread application. To better capture long-range dependencies between features, the computational approach of Transformers has garnered widespread attention. Initially designed and applied in natural language processing [14], Transformers have been introduced into the computer vision domain in recent years, with the proposal of Vision Transformer (ViT). ViT treats images as sequential data, considering each region or pixel in the image as a position in the sequence, enabling the utilization of self-attention mechanisms to capture feature information within the image [15]. However, Transformer networks suffer from the drawbacks of large computational and parameter requirements. To address this issue, Swin Transformer proposed a solution by limiting self-attention within a local window and expanding the receptive field through moving windows [16]. Additionally, CSWin Transformer improved computational efficiency by parallel computing self-attention within vertical and horizontal cross-shaped windows [35]. Transformers lack the inherent inductive bias of CNN. To address this, efforts have been made to integrate the strengths of both CNN and Transformers into hybrid networks. In the field of medical image segmentation, these efforts generally fall into three types [36]: Transformers as encoders, as seen in UNETR [37], VT-UNet [38], and SwinUNETR [39], where sequence-to-sequence modeling is applied as the initial embedding for medical images, offering the advantage of directly generating tokenized patches for feature representation; Transformers as auxiliary encoders, exemplified by TransUNet [24], AFTer-UNet [40], TransClaw [41], etc., which benefit from the inherent inductive bias obtained through CNN encoding, allowing for a significantly reduced computational load when performing global self-attention computation on lower resolutions; Fusion models combining Transformer and CNN encoders in parallel, such as TransFuse [42], FusionNet [43], and ScaleFormer [44], aiming to capture both global and local information simultaneously for improved learning of representations. However, such fusion designs often introduce additional encoding branches, leading to increased model complexity. Among the various models mentioned, most focus on improvements during the encoding phase to obtain superior feature representations. However, the decoder plays a crucial role in determining the final image output. The ability of the decoder to effectively leverage features from the encoder and skip connections, as well as further filter and integrate essential features, is equally important. To address this issue, we propose STCS-Net. It leverages the CDD module in skip connections for information enhancement and thoroughly processes information from both the encoder and skip connections using the decoder.

3. Methods

3.1 STCS-Net

This paper introduces STCS-Net for medical image segmentation, as illustrated in Fig. 1, consisting of three main components. Firstly, a Swin Transformer encoder with a multi-level self-attention mechanism is employed to capture global information. Secondly, two crucial modules proposed in this paper, namely the CDD module and SADB, are introduced. The CDD module, situated within the skip connections, aims to enhance feature representation by focusing on specific regions of interest and suppressing irrelevant parts for the task. This enhances the expressive capability of the features. The SADB serves as the decoder, effectively leveraging information from the encoder and skip connections. It can obtain cross-checked multi-scale information and emphasize important features to accurately capture image boundaries, thereby improving segmentation performance.

Fig. 1. STCS-Net consists of a network encoder with four layers of SwinTransformer blocks of varying depths. The skip connections include four CDD blocks, and the decoder comprises three SADB blocks. The final layer of the network is a standard convolutional layer used to output the final image.

Download Full Size | PDF

3.2 Information enhancement module

Traditional UNet employs a simple concatenation operation in the feature fusion process without introducing an explicit mechanism to reinforce crucial features. When dealing with complex medical images containing irregular pathological areas or organs, which often have rich boundary information, the straightforward concatenation operation may convey excessive redundant information to the decoder. This can lead the model to overly rely on irrelevant features, impacting segmentation results [13,45]. To address this issue, this paper introduces the CDD module in the skip connection to effectively filter features and reduce redundancy. Additionally, to ensure sufficient information for the decoder, the receptive field of the CDD module is increased, ensuring accurate and rich information is obtained during decoding.

The CDD module enhances information acquisition, effectively aggregates and transmits critical features, and enhances the multi-layer information interaction capability. The alternating use of standard convolution and depth convolution enhances the ability to obtain local information in skip connections, enabling the network to more effectively localize image boundaries and preserve important local details. Introducing the dilated convolution strategy helps prevent excessive local information transmission leading to global information loss, thereby maintaining a more comprehensive contextual understanding. By focusing on local information supplemented by global information, and integrating various types of convolution operations, the network gains powerful feature extraction capabilities.

The specific content is shown in Fig. 2, and the structure consists of standard convolution, depth convolution, and dilated convolution. For the input feature map ${F_0} \in {R^{C \times H \times W}}$, it first undergoes standard convolution with a kernel size of $3 \times 3$, followed by depth convolution with a kernel size of $3 \times 3$. The output feature map is denoted as ${F_3} \in {R^{C \times H \times W}}$. This process is repeated three times, and the computation is expressed by (1):

(1)$$\begin{array}{l}{F_1} = \textrm{Conv}1({conv2({{F_0}} )} )+ {F_0}\\{F_2} = \textrm{Conv}1({conv2({{F_1}} )} )+ {F_1}\\{F_3} = \textrm{Conv}1({conv2({{F_2}} )} )+ {F_2}\end{array}$$

Fig. 2. CDD Module, consisting of standard convolution, depthwise convolution, and dilated convolution, used for information enhancement.

Download Full Size | PDF

Here, $Conv1$ represents depthwise convolution, and $Conv2$ represents standard convolution.

The intermediate feature map ${F_3} \in {R^{C \times H \times W}}$ is input into convolutional layers with three different dilation rates and 1 × 1 convolution. This processes the local receptive field at each position, extracting local features from the input image through learned weight parameters of the convolutional kernels. During this process, feature values on each distinct dilated convolution path correspond to their respective feature representations. Finally, the feature representations with different spatial characteristics are weighted and summed element-wise. At this point, the resulting feature representation $C \in {R^{64 \times H \times W}}$ incorporates both local information from the intermediate features and global information from dilated convolutions. The computation is expressed by (2):

(2)$$L = {F_3} + \mathop \sum \limits_{j = 3}^6 Convj({{F_3}} )\times {W_j}$$

Here, $Conv3\sim Conv5$ represent dilated convolutions with kernel sizes of 3 × 3 and dilation rates of 2∼4, respectively. $Conv6$ denotes a 1 × 1 convolution. The parameter set W is trainable, and during the training process, the values of W are updated through backpropagation.

3.3 Similarity attention decoding block

Medical images are typically complex and diverse. During the image segmentation process, the decoder directly influences the final output. Failure to effectively integrate information and highlight crucial content can lead to inaccurate segmentation of target and background pixels [46]. For example, tasks such as skin lesion detection and lung segmentation often face challenges such as widespread distribution, irregular shapes, or blurred organ boundaries [20–22]. Therefore, it is crucial to integrate different information from the encoder and skip connections, obtain multi-scale feature maps, and highlight important information. We designed a class attention decoding block as shown in Fig. 3, which includes a class attention pyramid structure and a fusion channel attention structure. The class attention pyramid structure acquires corrected multi-scale information, while the fusion channel attention structure highlights important features. Together, they enhance performance.

Fig. 3. Decoder SADB Module. It includes a pyramid structure with class attention and information fusion structures.

Download Full Size | PDF

Firstly, the input feature map X of SADB is obtained by concatenating the output from the upper layer and the skip connection of the current level. The skip connection of the current level provides enhanced information, while the output from the upper layer contains high-level semantic information. Therefore, the input of SADB contains both coarse-grained and fine-grained semantic information, avoiding information loss.

Secondly, X is input to the similarity attention pyramid structure, which includes pyramid convolutions and pyramid interaction structures. When X is input to paths with different convolution kernels, each path can learn corresponding feature representations. Smaller convolution kernels are suitable for capturing subtle features in the image, such as edges and textures. Larger convolution kernels are suitable for extracting broader contextual information, such as the layout and shape of the image. Although the pyramid convolution structure allows the capture of features at multiple spatial levels, the features are relatively independent. Therefore, channel-wise similarity attention interaction is introduced in the pyramid convolution structure, enabling neighboring convolution kernels to complement each other in the channel dimension. This aids in enhancing the model's perception of the correlation between different channels, thereby improving the network's expressive power in handling complex features. The specific implementation is expressed by (3).

(3)$$\begin{array}{l}{\hat{X}_1} = ConvA(X )\times \sigma ({ConvB(X )} )\\{\hat{X}_2} = ConvB(X )\times \sigma ({ConvA(X )} )\\{\hat{X}_3} = ConvC(X )\times \sigma ({ConvD(X )} )\\{\hat{X}_4} = \; ConvD(X )\times \sigma ({ConvC(X )} )\\\sigma = Sigmoid(X )= \frac{1}{{1 + \textrm{exp}({ - X} )}}\end{array}$$

Where X is the input of SADB and also the input for each path, $\hat{X}$ is the output of the pyramid convolution for each path. $Conv1\sim Conv4$ represent standard convolutions with kernel sizes $1 \times 1$, $3 \times 3$, $5 \times 5$, $7 \times 7$, respectively. $exp$ is the abbreviation for the exponential function, representing the natural exponential.$\sigma$ is the $Sigmoid$ function, which has the property of mapping inputs to the range (0, 1). The multiplication “$\times$” denotes element-wise multiplication of the attention weights tensor for each channel with the corresponding channel feature of the current path.

The feature map $X$ is passed through $Conv1\sim Conv4$ to obtain respective feature values ${X_1}$∼${X_4}$. In order to generate weight matrices capable of correcting adjacent paths, the $Sigmoid$ operation is performed on ${X_1}$∼${X_4}$ to compress their feature values to a range between 0 and 1, resulting in attention weight tensors ${S_1}\sim {S_4}$ with the same number of input channels and image size. Then, element-wise multiplication is performed on the corresponding channels of features, i.e., ${X_1}\ast {S_2}$, ${X_2}\ast {S_1}$, ${X_3}\ast {S_4}$, ${X_4}\ast {S_3}$, ultimately generating the corrected feature maps ${\hat{X}_1}\sim {\hat{X}_4}$. In this way, the feature vectors generated by normalizing the global weights contain weight information for each position of all channels, and the multiplication with the feature vectors of adjacent paths not only contains the information of the current path but also obtains corrected information.

$Conv1$ interacts with $Conv2$, and $Conv3$ interacts with $Conv4$, i.e., the feature maps generated by 1 × 1 convolution kernels interact with those generated by 3 × 3 convolution kernels, and those generated by 5 × 5 convolution kernels interact with those generated by 7 × 7 convolution kernels. This paper divides the former two into small convolution kernels and the latter two into large convolution kernels. In general, feature captured by small convolution kernels are more suitable for capturing image boundaries and details, while large convolution kernels are more suitable for capturing image shapes and layouts. As pointed out in [47], semantic segmentation requires pixel-wise prediction, i.e., simultaneously completing classification and localization tasks, so the ability of large convolution kernels to extract semantic information is equally important. While retaining all channels, the complete feature map is scaled. If the feature matrices learned by two small convolution kernels have relatively large feature values at a certain position, after scaling and element-wise multiplication, that position will maintain relatively high-weight feature values. If the feature values of the current path are relatively large while those of the other path are relatively small, after scaling and element-wise multiplication, the feature values of the current path will decrease relatively; and vice versa. This method, which does not rely on a single path but combines information from two paths, has more advantages in paying attention to image boundary information and also increases the robustness of the network.

Image boundaries often exhibit problems such as low color contrast between target and background regions, a lack of distinct texture differences, and irregular shapes. In the process of image segmentation, networks often struggle to segment or incorrectly segment these areas. When introducing the class attention pyramid structure, as information interaction occurs between channels, and channel information typically contains a significant amount of boundary information, each scale of the class attention pyramid includes multiple levels of feature maps. These feature maps are obtained from different channels and, through the pyramid's channel interaction structure, receive cross-validated information from multiple channels. This strengthens the capture of boundary information.

After obtaining the feature maps ${\hat{X}_1}$, ${\hat{X}_2}$, ${\hat{X}_3}$, ${\hat{X}_4}$, multiply them by the para meter W. Use broadcasting multiplication to scale the feature values, then concatenate them with the original feature X to obtain the feature vector $O \in {R^{64 \times H \times W}}$. The specific implementation is expressed by (4).

(4)$$O = \; Cat({W_1} \times {\hat{X}_1},{W_2} \times {\hat{X}_2},{W_3} \times {\hat{X}_3},{W_4} \times {\hat{X}_4})$$

Where “ $Cat$ “ represents the concatenation operation, and “ W “ is a learnable parameter. The purpose of introducing learnable parameters is to enhance the model's generalization capability.

In order to address the potential performance bias introduced by the assumption of independence between channel and spatial information in the interaction process [48], this paper introduces learnable parameters denoted as “ W “. Existing methods typically assume that these pieces of information are independent of each other, but this simplification may lead to inaccurate weight allocation for less important information. The introduction of learnable parameters “W” aims to make the model more flexible in learning the complex relationships between information, thus adapting better to different data features and enhancing generalization capabilities.

The feature map $O \in {R^{64 \times H \times W}}$ contains rich spatial and channel information, encompassing both global and local details. While learnable parameters W have been employed in each path, allowing adjustment of the proportion of the current path, there is no explicit feature expression among the outputs of these paths. Due to interactions between small convolutional kernels and small kernels, as well as large kernels with large ones, there are still some barriers to information interaction, such as insufficient fusion of various information and lack of emphasis on important information. To address this, Squeeze-and-Excitation(SE) and deep convolution are introduced to integrate their respective information on two paths separately, and the final fusion is achieved through element-wise addition.

The SE module involves the processes of Squeeze and Excitation. In the Squeeze phase, the feature map $O \in {R^{64 \times H \times W}}$ is first subjected to global average pooling. By averaging the spatial features, a globally smoothed feature is obtained, resulting in a global average feature vector of dimension N. This averaging procedure is given by (5).

(5)$${M_{avg}}(O )= \frac{1}{{H \times W}}\mathop \sum \limits_{i = 1}^H \; \mathop \sum \limits_{j = 1}^W F({i,j} )$$

Here, ${M_{avg}}$ represents global average pooling, which calculates the average value over a subregion of the feature map and then slides this subregion to obtain the feature mapping ${M_{avg}} \in {R^{64 \times 1 \times 1}}$. F(i,j) represents the pixel value at position (i,j) in the input feature map O, where i denotes the row index and j denotes the column index. $\frac{1}{{H \times W}}$ normalizes the result of global average pooling to ensure that the numerical range of the output feature map is within a certain range.

In the Excitation phase, a fully connected layer is introduced to learn the weights between channels. To capture the dependencies between channels effectively, a simple gating mechanism with a $Sigmoid$ activation is employed. This mechanism allows for the learning of non-linear interactions between channels while ensuring that multiple channels can be emphasized. The tensor ${N_1} \in {R^{\frac{{64}}{r} \times 1 \times 1}}$ obtained from pooling is passed through a fully connected layer for dimension reduction, followed by ReLU activation, resulting in a new tensor ${N_1} \in {R^{\frac{{64}}{r} \times 1 \times 1}}$, where $r = 16$. It is then fed into a second fully connected layer for dimension increase, and a $Sigmoid$ function is applied for non-linear activation, producing the tensor ${N_2} \in {R^{64 \times 1 \times 1}}$. Finally, the obtained feature vector ${N_2}$ is multiplied using broadcasting to generate the vector $G \in {R^{64 \times H \times W}}$. The computation is expressed by (6).

(6)$$G = \sigma [{F{C_2}({Relu({F{C_1}(N )} )} )} ]\times O$$

$F{C_1}$ represents the fully connected layer for dimension reduction, $F{C_2}$ is the fully connected layer for dimension increase, “$\sigma $“ represents the sigmoid operation, and ${\times} $ denotes element-wise multiplication.

In order to accurately localize image boundaries, it is essential to have sufficient spatial information while extracting boundary information from the image. Depthwise convolution is an effective convolution operation that can capture rich spatial information by introducing relatively fewer model parameters. In this paper, we utilize three repetitions of depthwise convolution to enhance the acquisition of spatial information, resulting in the output feature map $D \in {R^{64 \times H \times W}}$. The specific computation is expressed by (7).

(7) $$D = DW({DW({DW(O )} )} )$$

Here, “$DW$” represents the depthwise convolution with a convolution kernel size of 3 × 3.

Combine the selected multi-scale information with enhanced spatial information by performing element-wise addition. After undergoing two $3 \times 3$ standard convolutions and one $1 \times 1$ standard convolution, finally, perform element-wise addition with the original SADB input features to allow comprehensive fusion of various information, ultimately obtaining the output feature vector $U \in {R^{64 \times H \times W}}$. The specific expression is shown in (8) as follows:

(8) $$U = Conv8({Conv7({Conv7({D + O} )} )} )+ X$$

Where $Conv7$ represents a $3 \times 3$ standard convolution $Conv8$ represents a $1 \times 1$ standard convolution, and “$ + $ “ denotes element-wise addition.

3.4 Feature fusion

The paragraph describes the feature fusion strategy of SADB, which is distinct from other networks. SADB incorporates an interactive structure in the pyramid convolution to handle high-level semantic information from the decoder and reinforced information from the CDD in the skip connection. This design facilitates the free flow of information between multiple convolution kernels in the pyramid convolution, rather than relying on simple sequential or parallel convolution operations. This structure particularly excels in edge processing, enabling the network to better distinguish between target and background regions. The subsequent introduction of SE and depthwise convolution operations further emphasizes the acquisition of multiscale information, achieving selective emphasis on crucial features. The inclusion of the CDD module in the skip connection enables selective information propagation, making the network more flexible in combining global and local information. This multilevel information interaction ensures that the network comprehensively understands the image at different scales, thereby enhancing overall accuracy.

In Fig. 4, taking the example of segmenting two skin lesions and one lung segment, we present the heatmaps and segmentation results after passing through the CDD module and SADB, respectively. We clearly observe that without the processing of the CDD module, the heatmap from the global information of the SwinTransformer encoder lacks clear color contrast between the target and background regions (where a redder color indicates higher attention). However, after passing through our CDD module, the color contrast between the target and background in the heatmap significantly increases. The background region is not overly ignored, showing a uniformly intense red color distinct from the target, indicating that our module not only highlights important regions but also preserves global information, to some extent, preventing information loss. The areas that are suppressed and emphasized are annotated, and the boundary part of the image is illustratively outlined.

Fig. 4. Feature Fusion Diagram, demonstrating the heatmap changes after passing through the CDD module and the heatmap and feature map changes after passing through the SADB module. The input to the CDD block is the output of the Swin Transformer block. The input to the SADB block is the output of the current-level CDD block and the output of the previous-level decoder.

Download Full Size | PDF

The optimized skip-connection features and the fused features from the SADB are input into our decoder, where they undergo processing through the class attention pyramid structure. From the heatmap, it is evident that there is a clear color contrast between the target and background regions. Emphasizing the importance of multi-channel and multi-scale information in handling image boundaries, we observe that the colors become significantly redder at the edges of the image, indicating high attention from the class attention pyramid structure in our decoder. At this stage, the feature map has taken on a preliminary shape relative to the input. After passing through our fusion channel attention structure, the feature map, which initially contains multi-scale information, undergoes further filtering to highlight important details. The final output represents the results of the decoder at this stage. Compared to the input feature map, it is evident that the output feature map from the decoder reduces segmentation errors in the target region while weakening the background. In the handling of the image boundary region, the heatmap in the intermediate process of the decoder shows extremely high attention. The resulting heatmap from the network output demonstrates a uniformly colored target region, a subdued background, and clear boundaries in the predicted image.

These visual results clearly indicate the excellent performance of our decoder in image segmentation tasks. It effectively acquires and processes multi-scale information, emphasizing critical details, which contributes to improved accuracy in segmenting target areas and enhances focus on image boundaries.

4. Experiments and result

4.1 Datasets and setting

This study conducted comparative experiments on two datasets for skin lesion segmentation and lung segmentation, both of which are part of a common medical dataset. Additionally, ablation experiments were performed on all datasets to validate the correctness and effectiveness of the STCS-Net construction approach. Skin lesion data were provided by the International Skin Imaging Collaboration (ISIC), and we selected ISIC-2016 and ISIC-2018 for experimentation. The ISIC-2016 dataset comprises a total of 1279 images, with 863 images for training, 216 for validation, and 200 for testing. The ISIC-2018 dataset includes a total of 2594 images, with 1995 for training, 499 for validation, and 100 for testing. The Lung dataset is used in competitions such as LUNA and Kaggle Data Science Bowl 2017, involving processing and attempting to find lesions in lung CT images. The lung dataset ‘Lung’ consists of 801 images, with 498 for training, 213 for validation, and 90 for testing.

The experimental workstation used an NVIDIA GeForce RTX 3090 (24 G) graphics card, and training was conducted using the pytorch deep learning framework. All images were resized to 448 × 448 before being input into the network. The Adam optimizer was utilized with a learning rate of 0.001. The loss function chosen was the binary cross-entropy loss (BCELoss), and the training batch size was set to 16. The maximum number of training epochs was set to 150. All models were collected and organized based on the premise of convergence.

4.2 Experimental detail

To comprehensively evaluate the segmentation performance of the proposed model, common evaluation metrics were employed in the experiments, as shown in Eqs. (13)–(17). These metrics include the Dice Similarity Coefficient (DICE), Intersection Over Union (IOU), Accuracy (ACC), Volume Overlap Error (VOE), and Relative Volume Difference (RVD). The formulas for these metrics are expressed as follows:

(13)$$DICE = \frac{{2TP}}{{2TP + FP + FN}}$$

(14)$$IOU = \frac{{TP}}{{TP + FP + FN}}$$

(15)$$ACC = \frac{{TP + TN}}{{TP + FP + FN + TN}}$$

(16)$$\; VOE = 1 - \frac{{A \cap B}}{{A \cup B}}$$

(17)$$RVD = \frac{{|A |- |B |}}{{|B |}}$$

Here, True Positive (TP) and True Negative (TN) represent the number of pixels correctly segmented in the target region and background region, respectively. False Positive (FP) denotes the number of background pixels wrongly marked. False Negative (FN) represents the number of pixels incorrectly predicted as the background region. A and B denote the predicted segmented region in the segmentation map and the true label, respectively.

When performance metrics improve, it indicates that the model performs better overall on the image segmentation task.

An increase in the DICE coefficient and IOU indicates a higher degree of overlap between the segmentation result and the ground truth. This suggests that the segmentation algorithm more accurately locates the boundaries of the target structure, resulting in clearer and more precise segmentation outputs. The improvement in DICE coefficient and IOU indicates that the model has a better understanding of the morphology and positional variations of the target structure. Furthermore, the reduction in false segmentation errors, indicated by the increase in DICE coefficient and IOU, decreases the likelihood of missegmentation, making the segmentation output more reliable.

An improvement in precision means that the proportion of true positives among all samples classified as positive increases. This indicates that the model more accurately excludes negative samples from positive ones, thereby reducing the number of false positives. This implies that non-target areas are correctly excluded from the segmentation output, thereby enhancing the accuracy and reliability of the segmentation results.

VOE and RVD are both metrics used to assess the volumetric differences between the segmentation result and the ground truth. The closer they are to 0, the closer the segmentation result is to the ground truth label in terms of volume.

4.3 Comparing with state-of-the-art on the ISIC2016 dataset

We compared our proposed STCS-Net with ten state-of-the-art methods on the ISIC2016 dataset, including UNet [6], FAT-Net [20], DCSAU-Net [23], VCMix-Net [48], TransUNet [24], UCTransNet [25], ScaleFormer [44], LGI Net [49], TCI-Net [50], and SwinUnet [51]. All the aforementioned methods are the latest segmentation networks specifically designed for medical imaging. To ensure fair experimentation, all competing networks were run under the same computational environment. Table 1 presents the segmentation results of these methods on the dataset.

Table 1. Statistical comparison with different state-of-the-art methods on the ISIC2016 dataset

View Table | View all tables in this article

The experimental results clearly show that our method achieved the highest levels in most metrics, with DICE, IOU, and ACC reaching 90.60%, 84.22%, and 89.09%, respectively. Compared to the original UNet, our method improved by 13.86%, 16.92%, and 8.79%. DCSAU-Net, utilizing the PFC and CSA modules [23], obtained features at different scales, enhancing the richness of both local and global information to some extent. However, this method solely extracts features through convolution, making it challenging to effectively capture global features. Moreover, it lacks a clear information correction mechanism and fails to integrate information scaling brought by attention into an overall perspective. In contrast, our STCS-Net continuously corrects and highlights important information during the decoding process, resulting in better results when handling image boundary information.

We have provided visual comparisons of segmentation results from several different competitive networks, as shown in Fig. 5. Upon observation, our method outperforms most competitors. In skin lesion images, the boundaries of the lesion areas are often blurry, with low contrast to the surrounding healthy areas, and significant variations in the global proportion of lesion areas. If the model cannot consider multiple types of information comprehensively, it is prone to segmentation errors in complex boundaries. For example, UNet [6], SwinUnet [51], and VCMix-Net [48] can only capture relatively singular features. Although other networks fuse convolution with Transformer, the fusion process does not effectively utilize information between different channels, leading to insufficient information exchange.

Fig. 5. Shows the segmentation results of different networks on the ISIC2016 dataset. From left to right, it represents the original image, ground truth, and segmentation results of different networks. The red areas indicate segmentation errors.

Download Full Size | PDF

4.4 Comparing with state-of-the-art on the ISIC2018 dataset

For the ISIC2018 dataset, our proposed STCS-Net was compared with ten state-of-the-art methods. To ensure fairness, the same networks used for the ISIC2016 dataset were employed for comparison, demonstrating the advantages of our network. As shown in Table 2, our method obtained the highest scores in most metrics, achieving 87.13%, 79.04%, and 0.0008 in DICE, IOU, and VOE, respectively. Compared to the original UNet, our method demonstrated improvements of 7.68%, 8.11%, and 0.0037 in DICE, IOU, and VOE, respectively. During this process, a significant increase in the number of parameters was not introduced, confirming that our method improves segmentation performance without significantly increasing parameter count.

Table 2. Statistical comparison with different state-of-the-art methods on the ISIC2018 dataset

View Table | View all tables in this article

Our network achieved the highest standards in terms of DICE and IOU on both skin datasets. This indicates that our method is effective in obtaining and processing various types of information, highlighting important features. It demonstrates excellent performance in handling complex boundaries and successfully captures crucial information.

For cases where the precision did not reach optimal results, it may be due to both the network architecture and the characteristics of the dataset itself. Images in the ISIC dataset often contain large lesion areas, with some images having clear boundaries while others are extremely blurred. This presents a high demand for the model's complexity. In our model architecture, the decoder is designed to enable channel-wise information interaction in pyramid convolutions. This significantly enhances the model's ability to perceive complex boundaries. However, for images with clearer boundaries in the dataset, the model may overly learn noise or local features, leading to overfitting. Additionally, our approach combines CNN with Transformer, integrating global and local information before feature extraction. This fusion process may lead to imbalanced information interaction, limiting segmentation effectiveness. Nevertheless, our method achieved the highest DICE, IOU, and precision scores on the ISIC 2016 skin dataset, providing reliable support for clinical decision-making.

Figure 6 illustrates the segmentation results of different networks on the ISIC2018 dataset. From left to right, it represents the original image, the ground truth label, and the segmentation results of different networks, where the red regions indicate areas of incorrect segmentation.

4.5 Comparing with state-of-the-art on the lung dataset

The experimental results reveal that on the Lung dataset, our proposed STCS-Net was compared with eight state-of-the-art methods, including UNet [6], FAT-Net [20], DCSAU-Net [23], VCMix-Net [48], TransUNet [24], ScaleFormer [44], TCI-Net [50], and SwinUnet [51]. These methods represent the leading edge in medical image segmentation networks. As shown in Table 3, our approach achieved the highest scores in four metrics: DICE, IOU, VOE, and RVD, with percentages of 98.11%, 96.31%, 0.0006, and 0.0006, respectively. Compared to the original UNet, our method demonstrated improvements in DICE, IOU, RVD, and VOE by 0.72%, 1.38%, 0.0004, and 0.0063, respectively. Furthermore, our method generally outperformed most competitors in the majority of the metrics.

Table 3. Statistical comparison with different state-of-the-art methods on the Lung dataset

View Table | View all tables in this article

The ScaleFormer [44] effectively integrates CNN and Transformer, achieving good segmentation results. However, the network incorporates a large number of convolutional and Transformer blocks, resulting in a high parameter count. In VCMix-Net [48], local information and local-global information are obtained through tensor shifting and linear projection. However, compared to our network, it lacks the ability to capture multi-scale information adequately. Therefore, when dealing with the segmentation of complex regions, VCMix-Net faces challenges in accurately locating and segmenting boundaries. Our network achieves a good balance between performance and the number of parameters. All network visualizations are shown in Fig. 7.

Figure 7 shows the segmentation results of different networks on the Lung dataset. From left to right, it represents the original image, the ground truth labels, and the segmentation results of different networks, where the red color indicates areas of incorrect segmentation.

4.6 Ablation study for STCS-Net

To demonstrate the effectiveness of our proposed network, we conducted ablation experiments comparing the performance with different configurations. SwinUnet was used as the base network (BaseNet) [51], ensuring the depth at each stage was consistent with our network. We incorporated the Enhanced Information Module CDD into the skip connections of the base network and replaced the decoder part with SADB. All networks were trained under the same experimental environment. The experiment design is shown in Table 4, and the ablation experiments were conducted on the ISIC2016, ISIC2018, and Lung datasets. The specific experiment configurations were as follows: (1) BaseNet with modified depth; (2) Enhanced Information Module CDD added to skip connections; (3) Decoder part modified with SADB module for obtaining multi-scale information accurately and emphasizing important features; (4) Our proposed STCS-Net, incorporating both CDD and SADB modules.

Table 4. Result of Ablation Study

View Table | View all tables in this article

STCS-Net achieved the best performance by integrating the CDD and SADB methods on all three datasets. As shown in Table 4, compared to the base network, STCS-Net improved DICE and IOU by 3.78% and 5.1%, respectively, on the ISIC2016 dataset; on the ISIC2018 dataset, DICE and IOU increased by 2.08% and 2.51%; and on the Lung dataset, DICE and IOU improved by 0.59% and 1.12%, respectively.

To provide a more intuitive and compelling demonstration of the effectiveness of the CDD and SADB modules, we visualized some feature maps from the ISIC dataset, as the images in this dataset exhibit significant diversity. This is illustrated in Fig. 8.

Fig. 6. Illustrates the segmentation results of different networks on the ISIC2018 dataset. From left to right, it represents the original image, the ground truth label, and the segmentation results of different networks, where the red regions indicate areas of incorrect segmentation.

Download Full Size | PDF

Fig. 7. Shows the segmentation results of different networks on the Lung dataset. From left to right, it represents the original image, the ground truth labels, and the segmentation results of different networks, where the red color indicates areas of incorrect segmentation.

Download Full Size | PDF

Fig. 8. Shows the feature comparison at each stage of the ablation experiment, with each column corresponding to a experimental phase. The red areas indicate mis-segmented regions.

Download Full Size | PDF

Compared to BaseNet, we observed good performance in most cases when only the CDD module was added, but there were also instances of performance degradation. The encoder and decoder of BaseNet adopt a Transformer structure, which can acquire rich global information. The fusion of standard convolution, depthwise convolution, and dilated convolution in the CDD module enables the extraction of rich local and spatial information, enhancing information representation. Therefore, the fusion of local and global information after adding the CDD module contributes to performance improvement. However, since the CDD module is directly added to the skip connections of BaseNet without sufficient information filtering, it may lead to insufficient information fusion.

When only the SADB module was added, all performance metrics outperformed BaseNet. The global information in the encoder is filtered and corrected by the decoder through the SADB. In the SADB structure, not only can feature vectors interact across multiple scales in the channel dimension, but information extracted at different scales can also be amplified or attenuated, which is beneficial for handling complex images. However, due to the rich global information received by the decoder, and the inclusion of a large amount of redundant information in the global information, the decoder cannot fully exploit its function, lacking a process for integrating and reinforcing information.

When both the CDD and SADB modules were added, the performance metrics reached their highest values. The encoder acquires global information, the CDD module enhances information in the skip connections, and the decoder can better filter and correct information, collectively constituting our STCS-Net.

Both the CDD and SADB modules extensively utilize standard convolution and depthwise convolution. This is because the Transformer-based encoder in BaseNet can obtain rich global information, but excessively rich global information is detrimental to image segmentation. Global information often contains a large amount of redundant information, such as noise and excessive background information. Therefore, it is necessary to design modules that can effectively remove redundant information and capture sufficient local information. The use of standard convolution and depthwise convolution meets this requirement well. They can capture local information in the image through the local receptive field of the convolution kernel. By sliding the convolution kernel over the image, different parts of the image can be gradually analyzed, and information can be extracted from local features. Additionally, convolution operations are highly parallelizable, which means they can efficiently process images, increasing the speed of the model.

5. Discussion

Semantic segmentation has found extensive applications in the field of medical imaging. The encoder-decoder architecture has been highly successful in this domain. However, most research has focused on optimizing the encoder while overlooking the importance of the decoder. This paper specifically addresses the role of skip connections and the decoder part, aiming to correct previous information and further filter abstract high-level features to generate the final target output. The SADB decoder corrects information at multiple scales and emphasizes important features. By highlighting channel interactions at different scales, feature information is complemented and corrected, contributing to improved segmentation outcomes. The CDD module enhances information for images of different resolutions, increasing sensitivity to various details and features, thereby enhancing the model's performance and adaptability. In STCS-Net, the use of the SADB module achieves mutual calibration of multi-scale information and highlights important features, resulting in excellent segmentation performance. The encoder in STCS-Net adopts a Transformer structure, which is advantageous for handling large datasets due to the powerful fitting capability of the self-attention mechanism. In contrast, the CNN structure in the decoder generally performs well on smaller datasets, as CNN have the ability to generalize patterns and extract rules from smaller datasets to ensure model stability. The design of STCS-Net combines the strengths of both structures, providing an innovative approach to medical image segmentation.

In summary, this paper focuses on addressing the challenge of medical image segmentation, with a particular emphasis on highlighting the crucial role of the decoder throughout the segmentation process. Through improvements in skip connections and the decoder, effective capture and mutual correction of multi-scale information are achieved, simultaneously emphasizing key information and enhancing perceptual ability for boundary regions. On three datasets, ISIC2016, ISIC2018, and Lung, our network achieves the highest scores in DICE and IOU metrics, providing an innovative approach for automated segmentation of medical images. However, the network applies Transformer to compute attention weights, resulting in higher complexity. Future research directions will concentrate on integrating the strengths of CNN into the self-attention computation of Transformer and minimizing overfitting to improve segmentation performance.

6. Conclusion

This paper introduces a novel network called STCS-Net, designed to capture mutually calibrated multi-scale information and highlight key features in the network. The goal is to provide an effective solution for challenging tasks in medical image segmentation. While CNN excel at extracting local information, Transformers, with their self-attention mechanism, can capture rich global information. STCS-Net can effectively handle various types of information in the decoding stage, incorporating structures for inter-channel calibration and selectively emphasizing critical features. This allows for complementary feature information, enabling more precise information retrieval. The performance of the STCS-Net model was evaluated on three datasets: ISIC2016, ISIC2018, and Lung. Extensive experiments demonstrated the effectiveness of STCS-Net in medical image segmentation tasks, surpassing current state-of-the-art methods.

Funding

National Natural Science Foundation of China (61473112); Hebei Provincial Natural Science Fund Key Project (F2017201222); Hebei Provincial Natural Science Fund Project (F2023201035).

Acknowledgments

This work was supported by the National Natural Science Foundation of China (61473112), the Hebei Provincial Natural Science Fund Key Project (F2017201222), the Hebei Provincial Natural Science Fund Project (F2023201035).

Disclosures

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

Code is presented in [52].

References

1. G. Litjens, T. Kooi, B. E. Bejnordi, et al., “A survey on deep learning in medical image analysis,” Med. Image Anal. 42, 60–88 (2017). [CrossRef]

2. Y. LeCun, L. Bottou, Y. Bengio, et al., “Gradient-based learning applied to document recognition,” Proc. IEEE 86(11), 2278–2324 (1998). [CrossRef]

3. D. Riccio, N. Brancati, M. Frucci, et al., “A new unsupervised approach for segmenting and counting cells in high-throughput microscopy image sets,” IEEE J. Biomed. Health Inform. 23(1), 437–448 (2018). [CrossRef]

4. D. Jha, P. H. Smedsrud, M. A. Riegler, et al., “Resunet++: An advanced architecture for medical image segmentation,” 2019 IEEE international symposium on multimedia (ISM). IEEE, 2019: 225–2255.

5. J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” Proceedings of the IEEE conference on computer vision and pattern recognition. 2015: 3431–3440.

6. O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18. Springer International Publishing, 2015: 234–241.

7. Z. Zhou, M. M. Rahman Siddiquee, N. Tajbakhsh, et al., “Unet++: A nested u-net architecture for medical image segmentation,” Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support: 4th International Workshop, DLMIA 2018, and 8th International Workshop, ML-CDS 2018, Held in Conjunction with MICCAI 2018, Granada, Spain, September 20, 2018, Proceedings 4. Springer International Publishing, 2018: 3–11.

8. Z. Zhang, Q. Liu, and Y. Wang, “Road extraction by deep residual u-net,” IEEE Geoscience and Remote Sensing Letters 15(5), 749–753 (2018). [CrossRef]

9. H. Huang, L. Lin, R. Tong, et al., “Unet 3+: A full-scale connected unet for medical image segmentation,” ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020: 1055–1059.

10. J. Ruan, M. Xie, J. Gao, et al., “Ege-unet: an efficient group enhanced unet for skin lesion segmentation,” proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention (Springer, 2023).

11. I. C. Duta, L. Liu, F. Zhu, et al., “Pyramidal convolution: Rethinking convolutional neural networks for visual recognition,” arXiv, arXiv:2006.11538, (2020). [CrossRef]

12. Z. Gu, J. Cheng, H. Fu, et al., “Ce-net: Context encoder network for 2d medical image segmentation,” IEEE Trans. Med. Imaging 38(10), 2281–2292 (2019). [CrossRef]

13. O. Oktay, J. Schlemper, L. L. Folgoc, et al., “Attention u-net: Learning where to look for the pancreas,” arXiv, arXiv:1804.03999, (2018). [CrossRef]

14. A. Vaswani, N. Shazeer, N. Parmar, et al., “Attention is all you need,” Advances in neural information processing systems30, 1 (2017).

15. A. Dosovitskiy, L. Beyer, A. Kolesnikov, et al., “An image is worth 16 × 16 words: Transformers for image recognition at scale,” arXiv, arXiv:2010.11929, (2020). [CrossRef]

16. J. Ruan, M. Xie, J. Gao, et al., “Swin transformer: Hierarchical vision transformer using shifted windows,” Proceedings of the IEEE/CVF International Conference on Computer Vision (2023).

17. Z. Dai, Z. Yang, Y. Yang, et al., “Transformer-xl: Attentive language models beyond a fixed-length context,” arXiv, arXiv:1901.02860, (2019). [CrossRef]

18. N. Parmar, A. Vaswani, J. Uszkoreit, et al., “Image transformer,” International Conference on Machine Learning PMLR, 2018: 4055–4064.

19. C. F. R. Chen, Q. Fan, and R. Panda, “Crossvit: Cross-attention multi-scale vision transformer for image classification,” Proceedings of the IEEE/CVF international conference on computer vision. 2021: 357–366.

20. H. Wu, S. Chen, G. Chen, et al., “FAT-Net: Feature adaptive transformers for automated skin lesion segmentation,” Med. Image Anal. 76, 102327 (2022). [CrossRef]

21. M. E. Celebi, N. Codella, and A. Halpern, “Dermoscopy image analysis: overview and future directions,” IEEE J. Biomed. Health Inform. 23(2), 474–478 (2019). [CrossRef]

22. D. P. Fan, G. P. Ji, T. Zhou, et al., “Pranet: Parallel reverse attention network for polyp segmentation,” International conference on medical image computing and computer-assisted intervention. Springer International Publishing, Cham: 2020: 263–273.

23. Q. Xu, Z. Ma, H. E. Na, et al., “DCSAU-Net: A deeper and more compact split-attention U-Net for medical image segmentation,” Comput. Biol. Med. 154, 106626 (2023). [CrossRef]

24. J. Chen, Y. Lu, Q. Yu, et al., “Transunet: Transformers make strong encoders for medical image segmentation,” arXiv, arXiv:2102.04306, (2021). [CrossRef]

25. H. Wang, P. Cao, J. Wang, et al., “Uctransnet: rethinking the skip connections in u-net from a channel-wise perspective with transformer,” Proceedings of the AAAI conference on artificial intelligence. 2022, 36(3): 2441–2449.

26. V. Badrinarayanan, A. Kendall, and R. Cipolla, “Segnet: A deep convolutional encoder-decoder architecture for image segmentation,” IEEE transactions on pattern analysis and machine intelligence 39(12), 2481–2495 (2017). [CrossRef]

27. G. Huang, Z. Liu, L. Van Der Maaten, et al., “Densely connected convolutional networks,” Proceedings of the IEEE conference on computer vision and pattern recognition2017: 4700–4708.

28. N. Ibtehaz and M. S. Rahman, “MultiResUNet: Rethinking the U-Net architecture for multimodal biomedical image segmentation,” Neural Networks 121, 74–87 (2020).

29. K. He, X. Zhang, S. Ren, et al., “Deep residual learning for image recognition,” Proceedings of the IEEE conference on computer vision and pattern recognition. 2016: 770–778.

30. N. Badshah and A. Ahmad, “ResBCU-Net: Deep learning approach for segmentation of skin images,” Biomedical Signal Processing and Control 71, 103137 (2022). [CrossRef]

31. R. Feng, L. Zhuo, X. Li, et al., “BLA-Net: Boundary learning assisted network for skin lesion segmentation,” Computer Methods and Programs in Biomedicine 226, 107190 (2022). [CrossRef]

32. J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” Proceedings of the IEEE conference on computer vision and pattern recognition2018: 7132–7141.

33. S. Woo, J. Park, J. Y. Lee, et al., “Cbam: Convolutional block attention module,” Proceedings of the European conference on computer vision (ECCV). 2018: 3–19.

34. J. Park, S. Woo, J. Y. Lee, et al., “Bam: Bottleneck attention module,” arXiv, arXiv:1807.06514, (2018). [CrossRef]

35. X. Dong, J. Bao, D. Chen, et al., “Cswin transformer: A general vision transformer backbone with cross-shaped windows,” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022: 12124–12134.

36. J. Li, J. Chen, Y. Tang, et al., “Transforming medical imaging with Transformers? A comparative review of key properties, current progresses, and future perspectives,” Med. Image Anal. 102762, 1 (2023).

37. A. Hatamizadeh, Y. Tang, V. Nath, et al., “Unetr: Transformers for 3d medical image segmentation,” Proceedings of the IEEE/CVF winter conference on applications of computer vision. 2022: 574–584.

38. H. Peiris, M. Hayat, Z. Chen, et al., “A robust volumetric transformer for accurate 3D tumor segmentation,” International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer Nature Switzerland, Cham: 2022: 162–172.

39. A. Hatamizadeh, V. Nath, Y. Tang, et al., “Swin unetr: Swin transformers for semantic segmentation of brain tumors in mri images,” International MICCAI Brainlesion Workshop. Springer International Publishing, Cham: 2021: 272–284.

40. X. Yan, H. Tang, S. Sun, et al., “After-unet: Axial fusion transformer unet for medical image segmentation,” Proceedings of the IEEE/CVF winter conference on applications of computer vision. 2022: 3971–3981.

41. Y. Chang, H. Menghan, Z. Guangtao, et al., “Transclaw u-net: Claw u-net with transformers for medical image segmentation,” arXiv, arXiv:2107.05188, (2021). [CrossRef]

42. Y. Zhang, H. Liu, and Q. Hu, “Transfuse: Fusing transformers and CNN for medical image segmentation,” Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part I 24. Springer International Publishing, 2021: 14–24.

43. X. Meng, X. Zhang, G. Wang, et al., “Exploiting full resolution feature context for liver tumor and vessel segmentation via integrate framework: Application to liver tumor and vessel 3d reconstruction under embedded microprocessor,” arXiv, arXiv:2111.13299, (2021). [CrossRef]

44. H. Huang, S. Xie, L. Lin, et al., “ScaleFormer: revisiting the transformer-based backbones from a scale-wise perspective for medical image segmentation,” arXiv, arXiv:2207.14552, (2022). [CrossRef]

45. C. Li, Y. Tan, W. Chen, et al., “Attention unet++: A nested attention-aware u-net for liver ct image segmentation,” 2020 IEEE international conference on image processing (ICIP). IEEE, 2020: 345–349.

46. D. Liu, Y. Gao, Q. Zhangli, et al., “Transfusion: multi-view divergent fusion for medical image segmentation with transformers,” International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer Nature Switzerland, Cham: 2022: 485–495.

47. C. Peng, X. Zhang, G. Yu, et al., “Large kernel matters–improve semantic segmentation by global convolutional network,” Proceedings of the IEEE conference on computer vision and pattern recognition. 2017: 4353–4361.

48. H. Zhao, G. Wang, Y. Wu, et al., “VCMix-Net: A hybrid network for medical image segmentation,” Biomedical Signal Processing and Control 86, 105241 (2023). [CrossRef]

49. L. Liu, Y. Li, Y. Wu, et al., “LGI Net: Enhancing local-global information interaction for medical image segmentation,” Comput. Biol. Med. 167, 107627 (2023). [CrossRef]

50. X. Bian, G. Wang, Y. Wu, et al., “Tci-unet: transformer-cnn interactive module for medical image segmentation,” Biomed. Opt. Express 14(11), 5904–5920 (2023). [CrossRef]

51. H. Cao, Y. Wang, J. Chen, et al., “Swin-unet: Unet-like pure transformer for medical image segmentation,” European conference on computer vision. Cham: Springer Nature Switzerland, 2022: 205–218.

52. HP. Ma, G. Wang, T. Li, et al., “STCS-Net: A medical image segmentation network that fully utilizes multi-scale information: code,” Github, 2024, https://github.com/good-ddcc/STCS-Net.

Method	Years	DICE(%)	IOU(%)	ACC(%)	VOE	RVD	Params(M)
UNet [6]	2015	76.74	67.30	80.30	0.0572	0.8670	31.04
SwinUnet [51]	2021	86.82	79.12	86.61	0.0197	0.2943	41.34
TransUnet [24]	2021	89.33	82.37	88.99	-0.0037	0.1623	105.28
ScaleFormer [44]	2022	88.60	81.59	88.61	0.0224	0.3332	113.68
UCTransNet [25]	2022	89.68	82.78	88.98	0.0019	0.1701	66.24
FAT-Net [20]	2022	87.87	80.56	87.21	0.0146	0.3748	29.62
DCSAU-Net [23]	2023	86.40	78.91	85.46	0.0322	0.3534	2.60
VCMix-Net [48]	2023	87.87	80.97	87.30	0.0265	0.4549	39.52
LGI Net [49]	2023	90.24	84.00	89.02	0.0140	0.2343	22.28
TCI-Net [50]	2023	89.36	82.60	88.89	0.0120	0.2321	33.41
STCS-Net	-	90.60	84.22	89.09	0.0323	0.2846	23.25

Method	Years	DICE(%)	IOU(%)	ACC(%)	VOE	RVD	Params(M)
UNet [6]	2015	79.45	70.93	84.22	-0.0045	0.1390	31.04
SwinUnet [51]	2021	84.15	75.32	88.42	0.0009	0.1074	41.34
TransUnet [24]	2021	85.46	76.41	95.76	-0.0169	-0.1508	105.28
ScaleFormer [44]	2022	82.67	73.69	85.80	-0.0067	0.1552	113.68
UCTransNet [25]	2022	85.43	76.50	90.73	-0.0103	-0.0414	66.24
FAT-Net [20]	2022	81.93	73.02	87.15	-0.0080	0.1243	29.62
DCSAU-Net [23]	2023	83.79	75.47	87.71	-0.0107	0.0095	2.60
VCMix-Net [48]	2023	82.89	73.76	87.89	0.0050	0.1628	39.52
LGI Net [49]	2023	85.98	77.42	91.68	-0.0025	0.1252	22.28
TCI-Net [50]	2023	86.53	78.21	92.32	-0.0055	0.0547	33.41
STCS-Net	-	87.13	79.04	90.16	0.0008	0.1273	23.25

Method	Years	DICE(%)	IOU(%)	ACC(%)	VOE	RVD	Params(M)
UNet [6]	2015	97.39	94.93	97.09	-0.0010	0.0069	31.04
SwinUnet [51]	2021	96.02	92.43	96.07	0.0007	0.0014	41.34
TransUnet [24]	2021	97.47	95.10	97.78	-0.0018	-0.0056	105.28
ScaleFormer [44]	2022	98.09	96.30	97.81	0.0018	0.0066	113.68
FAT-Net [20]	2022	97.81	95.74	98.20	-0.0022	-0.0073	29.62
DCSAU-Net [23]	2023	96.28	92.88	97.30	0.0065	-0.0191	2.60
VCMix-Net [48]	2023	97.46	95.06	97.77	0.0026	-0.0054	39.52
TCI-Net [50]	2023	97.42	95.00	97.16	0.0027	0.0063	33.41
STCS-Net	-	98.11	96.31	98.10	0.0006	0.0006	23.25

Method	ISIC2016		ISIC2018		Lung
Method	DICE(%)	IOU(%)	DICE(%)	IOU(%)	DICE(%)	IOU(%)
BaseNet	86.82	79.12	85.05	76.53	97.52	95.19
+CDD	88.25	80.78	83.53	74.72	95.48	91.44
+SADB	87.97	81.05	85.90	77.53	97.70	95.53
STCS-Net	90.60	84.22	87.13	79.04	98.11	96.31

Method	Years	DICE(%)	IOU(%)	ACC(%)	VOE	RVD	Params(M)
UNet [6]	2015	76.74	67.30	80.30	0.0572	0.8670	31.04
SwinUnet [51]	2021	86.82	79.12	86.61	0.0197	0.2943	41.34
TransUnet [24]	2021	89.33	82.37	88.99	-0.0037	0.1623	105.28
ScaleFormer [44]	2022	88.60	81.59	88.61	0.0224	0.3332	113.68
UCTransNet [25]	2022	89.68	82.78	88.98	0.0019	0.1701	66.24
FAT-Net [20]	2022	87.87	80.56	87.21	0.0146	0.3748	29.62
DCSAU-Net [23]	2023	86.40	78.91	85.46	0.0322	0.3534	2.60
VCMix-Net [48]	2023	87.87	80.97	87.30	0.0265	0.4549	39.52
LGI Net [49]	2023	90.24	84.00	89.02	0.0140	0.2343	22.28
TCI-Net [50]	2023	89.36	82.60	88.89	0.0120	0.2321	33.41
STCS-Net	-	90.60	84.22	89.09	0.0323	0.2846	23.25

STCS-Net: a medical image segmentation network that fully utilizes multi-scale information

Abstract

1. Introduction

2. Related work

2.1 Convolution-based image segmentation networks

2.2 Transformer-based image segmentation networks

3. Methods

3.1 STCS-Net

3.2 Information enhancement module

3.3 Similarity attention decoding block

3.4 Feature fusion

4. Experiments and result

4.1 Datasets and setting

4.2 Experimental detail

4.3 Comparing with state-of-the-art on the ISIC2016 dataset

4.4 Comparing with state-of-the-art on the ISIC2018 dataset

4.5 Comparing with state-of-the-art on the lung dataset

4.6 Ablation study for STCS-Net

5. Discussion

6. Conclusion

Funding

Acknowledgments

Disclosures

Data availability

References

Data availability

Cited By

Figures (8)

Tables (4)

Equations (13)

Biomedical Optics Express