
BFE-Net: bilateral fusion enhanced network for gastrointestinal polyp segmentation

Open Access

Abstract

Accurate segmentation of polyp regions in gastrointestinal endoscopic images is pivotal for diagnosis and treatment. Despite advancements, challenges persist, such as accurately segmenting small polyps and maintaining accuracy when polyps resemble surrounding tissues. Recent studies show the effectiveness of the pyramid vision transformer (PVT) in capturing global context, yet it may lack detailed information. Conversely, U-Net excels at extracting local spatial details. Hence, we propose the bilateral fusion enhanced network (BFE-Net) to address these challenges. Our model integrates U-Net and PVT features via a feature enhanced fusion (FEF) module and an attention decoder (AD) module. Experimental results demonstrate significant improvements, validating our model's effectiveness across various datasets and modalities and promising advancements in gastrointestinal polyp diagnosis and treatment.

© 2024 Optica Publishing Group under the terms of the Optica Open Access Publishing Agreement

1. Introduction

Gastric and colorectal polyps are prevalent digestive tract conditions, with untreated cases posing a risk of malignancy [1]. Gastric cancer has an annual incidence of over 1,000,000 cases worldwide [2], ranking it among the top four cancers globally, with a mortality rate of 12.4% [3]. Colorectal cancer ranks third in global cancer incidence and second in cancer-related deaths, with nearly 2 million new cases and almost 1 million deaths in 2020, contributing 10.0% and 9.4% of total cancer incidence and mortality, respectively [4,5].

Gastrointestinal endoscopy is an optical imaging technology. It employs flexible optical fibers to adeptly guide light into the digestive tract, encompassing both the stomach and intestines [6,7]. Utilizing image sensors, the system captures reflections from the mucosa within the tract, seamlessly converting them into electronic signals. These signals undergo a meticulous process of amplification, filtering, and digitization, culminating in the enhancement of endoscopic image quality and the production of detailed visualizations of the digestive tract mucosa [8]. Clinically, gastrointestinal endoscopy is the primary method for polyp detection, followed by biopsy confirmation and appropriate treatment. Minimally invasive surgical techniques like endoscopic mucosal resection (EMR) [9], endoscopic submucosal dissection (ESD) [10], and laparoscopic rectal resection (LRR) [11,12] are common for benign polyps, while chemotherapy, radiation, and subsequent surgery are employed for malignancies. Precise delineation of polyp lesions is crucial, but manual delineation accuracy is subject to clinician experience, coordination, fatigue, and time constraints [13]. Inherent variations in polyps, including low contrast, shape, size, texture, and lighting, further contribute to inaccuracies. Therefore, an automatic polyp segmentation system is urgently needed to enhance surgical precision, particularly in treating gastric and colorectal cancer.

Automatic semantic segmentation, labeling each image pixel based on common visual characteristics, plays a vital role in gastrointestinal polyp segmentation. Traditional methods relying on low-level features such as texture [14] and geometry [15] are prone to false positives and negatives. With the advent of deep learning, convolutional neural networks (CNNs) like U-Net [16], U-Net++ [17], and R2U-Net [18] have made significant strides in polyp segmentation. While U-shaped networks prevail in semantic segmentation due to their simple and flexible design, Fan et al. [19] introduced reverse attention modules for fine segmentation, and He et al. [20] proposed a dual-branch hybrid network using swin-transformer [21] and U-Net [16]. Zhou et al. [22] addressed polyp scale variations but required additional boundary information. Despite improvements in accuracy and generalization with deep learning, precise polyp boundary localization remains challenging, and U-shaped networks may lose detailed information due to down-sampling. In recent years, methods based on Transformers [23] have gained wide attention. Transformers excel at modeling long-range dependencies and attend to global context, achieving better results in computer vision tasks [24–27]. However, Transformer-based methods face high computational complexity. The pyramid vision transformer (PVT) [28], combining Transformer and CNN advantages, provides a solution. Inspired by dual-branch networks [20,22], we propose feeding images simultaneously to PVT and U-Net, optimizing the capture of pivotal features while conserving computational resources. To achieve this, we first develop an effective feature enhanced fusion (FEF) module and then introduce an attention decoder (AD) structure that uses a mixed attention mechanism to decode deep feature information. Finally, we construct a bilateral fusion enhanced network (BFE-Net) to achieve precise segmentation of polyp lesion regions.

In summary, the main contributions of this paper are as follows:

  • We propose a bilateral fusion enhanced network (BFE-Net) for accurate segmentation of gastric and colorectal polyp lesions.
  • We develop the feature enhanced fusion (FEF) module, which efficiently fuses features from PVT and U-Net using a multimodal feature fusion mechanism.
  • We introduce the attention decoder (AD) module, which utilizes a mixed attention mechanism to decode deep features and improve the localization precision of lesion regions.
  • We evaluate the model on five challenging benchmark gastrointestinal polyp datasets, including Kvasir-SEG [29], CVC-ClinicDB [30], CVC-ColonDB [31], CVC-300 [32], and ETIS [33], as well as ISIC2018 [34], BUSI [35], and CAMO [36] to assess the generalization ability of the model. We use professional evaluation metrics to ensure high reliability of model performance. Experimental results demonstrate that our model segments polyp regions accurately, has strong generalizability, and outperforms other state-of-the-art methods on endoscopic images of gastrointestinal polyps.

1.1 Related work

1.1.1 Deep learning technologies in medical image segmentation

In recent years, CNNs have proven successful in various medical image processing challenges, particularly in medical image segmentation tasks [16,37]. The encoder-decoder framework has been especially beneficial. Zhao et al. [38] employed a multi-scale pyramid pooling module within the encoder subnetwork, enhancing feature representation for medical image segmentation. Xue et al. [39] incorporated a multi-scale L1 loss function, compelling the model to assimilate both global and local features; this integrated framework synergistically enhances the efficacy of disease classification and lesion segmentation. Gu et al. [40] adopted atrous dense convolution and residual multi-kernel pooling to form a context encoder network (CE-Net) based on the pyramid pooling strategy, efficiently capturing high-level information while preserving the spatial details crucial for medical image segmentation. Jha et al. [41] aimed to enhance U-Net performance by introducing the DoubleU-Net structure, combining two stacked U-Net architectures with atrous spatial pyramid pooling for contextual information. However, the inherent inductive bias in convolutional architectures hampers the understanding of long-range dependencies in images. To address this limitation, Valanarasu et al. [42] proposed the gated axial attention model, extending the existing architecture with additional control mechanisms in the self-attention module, and further introduced a local-global training strategy (LoGo) to effectively train models on medical images, improving segmentation performance. To mitigate the computational complexity of self-attention mechanisms, they later introduced a tokenized MLP block, reducing parameters and computational complexity while effectively labeling and projecting convolutional features in the latent space [43]. He et al. [44] proposed an efficient hierarchical hybrid vision Transformer (H2Former) for medical image segmentation, seamlessly integrating the strengths of CNNs, multi-scale channel attention, and Transformer mechanisms. Overall, the extraction of image context plays a pivotal role in improving segmentation performance.

In the specific context of medical image segmentation, the transition to organ-specific methods becomes essential. Therefore, we now delve into gastric and colorectal polyp segmentation methods.

1.1.2 Gastric polyp segmentation methods

In the domain of gastric polyp segmentation, Yan et al. [45] proposed a comprehensive evaluation method that combines subjective considerations with objective information. Their findings identified UNet++ [17], employing a MobileNet v2 [46] encoder, as the optimal segmentation model for gastric polyps. Alternatively, Sun et al. [47] applied adversarial training to enhance gastric polyp segmentation. Despite its proficiency in handling polyp boundary information, this method faces challenges in boundary learning. Addressing these challenges, Li et al. [48] introduced a boundary-guided two-stage learning framework for accurate segmentation of small polyps in gastroscopy images. However, the efficacy of this method is offset by increased resource consumption. The persistent challenges in gastric polyp segmentation stem from the low contrast between polyps and surrounding tissue, as well as variations in size and shape.

The transition from gastric to colorectal polyp segmentation requires recognizing the challenges shared across gastrointestinal polyp segmentation, which extend from characteristics unique to gastric polyps to broader colorectal considerations. The colorectal polyp segmentation methods reviewed next build on this foundation.

1.1.3 Colorectal polyp segmentation methods

In the realm of colorectal polyp segmentation, U-shaped networks have demonstrated notable success. Zhou et al. [17] extended U-Net with U-Net++, incorporating nested and denser skip connections and significantly enhancing performance on colon polyp segmentation datasets, although at the cost of increased computation time. Sun et al. [49] introduced dilated convolutions to learn high-level semantic features without resolution reduction, designing a simple decoder with fewer parameters than traditional structures to reduce computation time; nevertheless, on the ETIS dataset, the dice coefficient is only 62.54%. Large-scale medical segmentation models rooted in U-shaped networks have also evolved [50,51], exemplified by nnU-Net [50], which streamlines segmentation by automatically configuring network structures based on input images and annotations, and STU-Net [51], which adapts nnU-Net to larger-parameter segmentation models.

Diverging from the U-shaped network paradigm, Hemin et al. [52] employed a deeper feature extractor based on Mask R-CNN for colorectal polyp segmentation and detection. They emphasized that better training datasets improve network performance and reduce the need for deeper or more complex CNN feature extractors. Yin et al. [53] introduced the dual-context relationship network (DCR-Net) with two parallel context relationship modules focusing on global and local information. This architecture incorporated an adaptive feature fusion mechanism to enhance colon polyp segmentation and provided a queue to store the contextual embedding, albeit with increased memory consumption. Zhao et al. [54] proposed the multi-scale subtraction network (MS-Net), designing a subtraction unit (SU) to generate differential features between adjacent levels in the encoder to obtain rich multi-scale disparity information. This method can provide precise lesion localization in colonoscopy images, but the SU may lead to the removal of small polyps. Addressing the challenging problem of small polyp segmentation, Banik et al. [55] introduced a multi-modal feature fusion strategy, utilizing wavelet pooling and weighted level sets to extract features of different modalities and fuse them at the pixel level. Experimental results demonstrated the advantages of this method in segmenting small polyps. Considering the inconsistency in colors of samples collected under different conditions, Wei et al. [56] proposed a color exchange strategy to eliminate the color differences, forcing the model to focus more on target shape and structure. Moreover, they designed a shallow attention module to filter out the background noise of shallow features and retain small polyp areas. Recently, meta-learning based algorithms have been applied in the field of medical image segmentation, aiming to solve the problems of data scarcity and data shift [57–59]. Khadka et al. [58] introduced meta-learning algorithms, leveraging weights from diverse but smaller training samples, resulting in a 2%–4% improvement in dice coefficient on gastrointestinal endoscopy and dermoscopy datasets.

Most multi-branch methods intuitively mine information through fusion without interactive learning between branches. To address this, we present BFE-Net, integrating PVT and U-Net for feature enhancement fusion, aiming to balance global contextual information capture and spatial detail consideration for gastrointestinal polyp segmentation.

2. Materials and methods

2.1 Datasets

We conducted model performance evaluation on five widely adopted gastrointestinal polyp datasets, namely Kvasir-SEG [29], CVC-ClinicDB [30], CVC-ColonDB [31], CVC-300 [32], and ETIS [33], which are detailed in Supplement 1 Table S1. To assess the generalization ability of our model, we further conducted experiments on two additional publicly available medical image datasets and one natural image dataset: ISIC2018 [34] (International Skin Imaging Collaboration, https://challenge.isic-archive.com/), BUSI [35] (breast ultrasound images, https://www.kaggle.com/datasets/aryashah2k/breast-ultrasound-images-dataset), and CAMO [36] (a natural image dataset with camouflaged objects, https://sites.google.com/view/ltnghia/research/camo). Details of the datasets used in the generalization experiments are presented in Table 1.

Table 1. The dataset details in the generalization experiments

2.2 Method

2.2.1 Overview

The proposed BFE-Net architecture, as depicted in Fig. 1, consists of three key components: (1) the U-Net and PVT branches; (2) the FEF module; and (3) the AD module.

Fig. 1. The general framework of the proposed BFE-Net.

In this architecture, the original gastrointestinal endoscopic images are fed into the U-Net and PVT branches separately and simultaneously. The U-Net branch is responsible for extracting local spatial information at various scales, while the PVT branch focuses on capturing global information. The feature representations generated by U-Net and PVT at level i are denoted as ${u_i}$ and ${p_i}$ $({i = 1,2,3,4} )$, respectively, as illustrated in Fig. 1. The resolution of the first-level features is $\frac{H}{4} \times \frac{W}{4}$ (H and W are the height and width of the feature map, respectively), and the general expression for the level-i resolution is $\frac{H}{{{2^{i + 1}}}} \times \frac{W}{{{2^{i + 1}}}}$ (when $i \ge $ 2). To effectively fuse the features extracted by PVT and U-Net, we develop an efficient FEF module. FEF employs a multi-modal feature fusion mechanism [20] to deeply merge the features from both branches. Furthermore, to enhance the precise localization of polyp regions and recover image details, we introduce an attention decoder (AD) module, which utilizes a mixed attention mechanism to aggregate deep feature information. The features output from the FEF module are passed to the AD module via a residual structure, which outputs the segmentation loss and the final segmentation result. The three key components of BFE-Net are described in detail as follows.

2.2.2 Two branches

The encoder of the U-Net branch consists of five convolutional groups, indicated by the blue region in Fig. 1. Each convolutional group consists of one max-pooling layer and two convolutional layers. In detail, the resolution of the feature map is first halved by a max-pooling layer with a kernel size of $2 \times 2$ to reduce computational cost. Feature extraction is then performed with two convolutional layers with a kernel size of $3 \times 3$ and a stride of 1. After each convolutional layer, a batch normalization (BN) layer and a ReLU activation function are applied. For an input image $X \in {\mathrm{\mathbb{R}}^{H \times W \times 3}}$, this process can be expressed by Eq. (1).

$${u_i} = {W_2}({W_1}({Maxpool}(X))),\quad i = 0,1,2,3,4$$
where $Maxpool(\cdot )$ represents max-pooling with a kernel size of $2 \times 2$, and ${W_1}$ and ${W_2}$ denote two convolution units. Each convolution unit consists of a convolutional layer with a kernel size of $3 \times 3$, a batch normalization layer, and a ReLU activation function.
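To make the construction in Eq. (1) concrete, the following is a minimal PyTorch sketch of one encoder convolutional group (2 × 2 max-pooling followed by two 3 × 3 Conv-BN-ReLU units); it is an illustrative reading of the description above, not the authors' released code.

```python
import torch
import torch.nn as nn

def conv_unit(in_ch, out_ch):
    """One convolution unit W_k: 3x3 conv (stride 1) + BatchNorm + ReLU."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class EncoderGroup(nn.Module):
    """u_i = W_2(W_1(Maxpool(x))): halves the resolution, then extracts features."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.pool = nn.MaxPool2d(kernel_size=2)
        self.conv = nn.Sequential(conv_unit(in_ch, out_ch), conv_unit(out_ch, out_ch))

    def forward(self, x):
        return self.conv(self.pool(x))

# Example: a 352x352 input produces a 176x176 feature map after the first group.
x = torch.randn(1, 3, 352, 352)
u1 = EncoderGroup(3, 64)(x)     # torch.Size([1, 64, 176, 176])
```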

The PVT branch generates four multiscale feature maps at different stages, denoted as ${p_i}$ $({i = 1,2,3,4} )$. Among these feature maps, ${p_1}$ contains information about fine polyp appearances, while ${p_2}$, ${p_3}$ and ${p_4}$ provide higher-level feature representations, as illustrated in the orange area in Fig. 1. Each level of feature map extracted by the U-Net and PVT branches has the same dimensions and resolution, and the resolutions of the feature maps at successive levels are $\left[ {\frac{H}{2},\frac{W}{2},\frac{C}{2}} \right],\; \left[ {\frac{H}{4},\frac{W}{4},C} \right],\; \left[ {\frac{H}{8},\frac{W}{8},2C} \right],\; \left[ {\frac{H}{{16}},\frac{W}{{16}},5C} \right]$ and $\left[ {\frac{H}{{32}},\frac{W}{{32}},8C} \right]$ respectively, with C set to 64 in this paper. Finally, the feature maps extracted from the two branches are transferred to the FEF module for interactive information fusion.
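As a quick sanity check on the resolution schedule quoted above, the snippet below prints the spatial size and channel width of the level-i feature maps used for fusion for a 352 × 352 input, following the $\frac{H}{{{2^{i + 1}}}} \times \frac{W}{{{2^{i + 1}}}}$ rule and the channel widths C, 2C, 5C, 8C listed in the text (C = 64); it is purely illustrative and involves no model weights.

```python
# Level-i feature sizes for a 352 x 352 input, per the schedule in the text.
H = W = 352
C = 64
widths = {1: C, 2: 2 * C, 3: 5 * C, 4: 8 * C}   # channel widths per fusion level
for i in range(1, 5):
    size = H // 2 ** (i + 1)                     # H / 2^(i+1)
    print(f"level {i}: {size} x {size} x {widths[i]}")
# level 1: 88 x 88 x 64
# level 2: 44 x 44 x 128
# level 3: 22 x 22 x 320
# level 4: 11 x 11 x 512
```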

2.2.3 Feature enhanced fusion module

We introduce the FEF module for efficient fusion of feature information from the PVT and U-Net branches; its architecture is depicted in Fig. 2. Unlike approaches that use stepwise down-sampling to fuse features of different scales [20], FEF uses up-sampling operations to preserve the integrity of the information. The module employs a multimodal feature fusion mechanism and linear Hadamard products [60] to achieve interactive feature fusion: the mechanism merges features extracted by the U-Net and PVT branches in their respective feature modalities, and FEF emphasizes the relevance of information from different modalities by feeding the intermediate output of each level to the next. Corresponding to the feature extraction of the two branches, we designed four FEF modules (denoted ${F_1}$–${F_4}$ in Fig. 1) to fuse features of different scales, starting from the fourth-level FEF module and proceeding down to the first-level FEF module. Except for the fourth-level FEF module (${F_4}$), each FEF module introduces the fused feature ${f_{i+1}}$ from the previous level to facilitate cross-modal feature fusion (Fig. 2). We use convolution operations to refine the features from the PVT branch ${p_i}$, the U-Net branch ${u_i}$, and ${f_{i + 1}},\; i = 1,2,3$, obtaining ${F_{{p_i}}}$, ${F_{{u_i}}}$ and ${F_{{f_{i + 1}}}}$, as described in Eqs. (2)-(4).

$${F_{{p_i}}} = Conv1 \times 1({{p_i}} )$$
$${F_{{u_i}}} = Conv1 \times 1({{u_{i\; }}} )$$
$${F_{{f_{i + 1}}}} = Conv1 \times 1({Up2x({{f_{i + 1}}} )} )$$
where, $Conv1 \times 1(\cdot )$ represents a convolution unit, in which the kernel size of convolutional layer is $1 \times 1$. The purpose of this unit is to adjust the channel count of input feature maps and reduce computational complexity. $Up2x(\cdot )$ indicates the upsampling operation with a scale factor of 2, ensuring that the feature map size of ${F_{{f_{i + 1}}}}$ matches those of ${F_{{p_i}}}$ and ${F_{{u_i}}}$. Subsequently, interaction between different modalities is achieved by linear Hadamard products between the refined features ${F_{{p_i}}}$, ${F_{{u_i}}}$, and ${F_{{f_{i + 1}}}}$, resulting in the interaction feature ${H_i}$, as represented in Eq. (5).
$${H_i} = {F_{{p_i}}}\; \odot \; {F_{{u_i}}}\; \odot \; {F_{{f_{i + 1}}}}$$
where, $\odot $ represents the linear Hadamard product, which is element-wise multiplication of tensors. Finally, the interaction feature ${H_i}$ and relevant features ${p_{i\; }}$, ${u_{i\; }}$ are concatenated along the channel dimension and fused through a residual module to obtain the fused feature ${f_{i\; }},i = 1,2,3,4$, as shown in Eq. (6).
$${f_i} = {W_3}({W_2}({W_1}({ReLU}({BN}({Concat}({H_i},{p_i},{u_i})))))) + {H_i}$$
in this equation, ${W_1}(\cdot )$ and ${W_3}(\cdot )$ are both convolutional units with convolution kernel size of $1 \times 1$, where ${W_3}(\cdot )$ does not include batch normalization and ReLU activation. ${W_2}(\cdot )$ is a convolutional unit with convolution kernel size of $3 \times 3$. $Concat(\cdot )$ concatenates feature maps along the channel dimension. When $i = 4,$ ${F_{{f_{i + 1}}}} = 0.$
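A hedged PyTorch sketch of one FEF level following Eqs. (2)-(6) is shown below: 1 × 1 convolution units refine ${p_i}$, ${u_i}$ and the 2× up-sampled ${f_{i+1}}$, a Hadamard product forms the interaction feature ${H_i}$, and a BN-ReLU-$W_1$-$W_2$-$W_3$ branch over the concatenation of ${H_i}$, ${p_i}$ and ${u_i}$ produces ${f_i}$ with a residual connection. The channel counts, the treatment of the top level (where the ${F_{{f_{i+1}}}}$ term is simply omitted), and the exact layer layout are our assumptions for illustration, not the authors' reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_bn_relu(in_ch, out_ch, k):
    """Convolution unit: k x k conv + BatchNorm + ReLU."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, k, padding=k // 2, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class FEF(nn.Module):
    def __init__(self, p_ch, u_ch, prev_ch, out_ch):
        super().__init__()
        self.refine_p = conv_bn_relu(p_ch, out_ch, 1)                          # Eq. (2)
        self.refine_u = conv_bn_relu(u_ch, out_ch, 1)                          # Eq. (3)
        self.refine_f = conv_bn_relu(prev_ch, out_ch, 1) if prev_ch else None  # Eq. (4)
        cat_ch = out_ch + p_ch + u_ch
        self.fuse = nn.Sequential(                                             # Eq. (6)
            nn.BatchNorm2d(cat_ch), nn.ReLU(inplace=True),                     # BN + ReLU on concat
            conv_bn_relu(cat_ch, out_ch, 1),                                   # W_1
            conv_bn_relu(out_ch, out_ch, 3),                                   # W_2
            nn.Conv2d(out_ch, out_ch, 1),                                      # W_3 (no BN / ReLU)
        )

    def forward(self, p_i, u_i, f_next=None):
        h = self.refine_p(p_i) * self.refine_u(u_i)                            # Eq. (5), Hadamard
        if f_next is not None:                                                 # absent at level 4
            f_next = F.interpolate(f_next, scale_factor=2, mode="bilinear",
                                   align_corners=False)
            h = h * self.refine_f(f_next)
        return self.fuse(torch.cat([h, p_i, u_i], dim=1)) + h                  # residual, Eq. (6)

# Example: level-1 fusion of p_1 (64 ch), u_1 (64 ch) and the upsampled f_2 (64 ch).
p1, u1 = torch.randn(1, 64, 88, 88), torch.randn(1, 64, 88, 88)
f2 = torch.randn(1, 64, 44, 44)
f1 = FEF(p_ch=64, u_ch=64, prev_ch=64, out_ch=64)(p1, u1, f2)                  # (1, 64, 88, 88)
```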

Fig. 2. Schematic diagram of the feature enhancement fusion module.

2.2.4 Attention decoder module

We transfer the multi-scale feature information ${f_1}$ fused by the FEF module to the AD module via a residual structure. The function of the AD module is to restore image details and produce the final segmentation output; its schematic diagram is depicted in Fig. 3. To effectively suppress irrelevant regions and achieve finer feature interaction fusion, we employ both a channel attention (CA) module [61] and a spatial attention (SA) module [62] to process the fused feature ${f_1}$ from the FEF module. First, we process ${f_1}$ through the CA module, which uses average-pooling information to excite the features and max-pooling information to retain more relevant details, resulting in the feature ${f_{ca}}$. Then, we process ${f_{ca}}$ through the SA module, whose role is to enhance attention to regions of interest while reducing attention to noise in the image, ultimately yielding the feature ${f_{sa}}$. Finally, we apply convolution and upsampling operations to ${f_{sa}}$ to obtain the final segmentation result. The calculation process of the AD module is described in Eqs. (7)-(9):

$${f_{ca}} = \sigma ({MLP_1}(AP({f_1})) + {MLP_2}(MP({f_1}))) \cdot {f_1}$$
$${f_{sa}} = \sigma (Conv(Concat(Mean({f_{ca}}),Max({f_{ca}})))) \cdot {f_{ca}}$$
$$Output = Up4x(out({f_{sa}}))$$
where $AP(\cdot )$ and $MP(\cdot )$ represent adaptive average pooling and adaptive max pooling layers, respectively. $ML{P_1}(\cdot )$ and $ML{P_2}(\cdot )$ are composed of two convolution layers and one activation function layer, which share parameters. $\sigma (\cdot )$ denotes the sigmoid activation function, and $Up4x(\cdot )$ represents the up-sampling operation with a scale factor of 4.

Fig. 3. Schematic diagram of the attention decoder module.

In the CA module, the max-pooling and average-pooling responses are first added and then mapped to the range $({0,1} )$ by $\sigma (\cdot )$ to obtain the corresponding channel weights. These channel weights are multiplied with the fused feature ${f_1}$ to perform adaptive feature refinement, as described in Eq. (7). Subsequently, the output feature ${f_{ca}}$ from the CA module is fed into the SA module.

In the SA module, the mean and max information are concatenated along the channel dimension and then mapped to the range $({0,1} )$ by $\sigma (\cdot )$ to obtain spatial weights. These spatial weights are multiplied with ${f_{ca}}$ to perform adaptive feature refinement, as illustrated in Eq. (8). Finally, the number of channels of ${f_{sa}}$ is adjusted via convolution layers and the result is upsampled to restore the original image resolution, as detailed in Eq. (9).
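The following is a minimal PyTorch sketch of the AD module following Eqs. (7)-(9) and the description above: a shared MLP over average- and max-pooled channel descriptors for channel attention, a convolution over the concatenated channel-wise mean and max maps for spatial attention, and an output head followed by 4× up-sampling. The spatial kernel size, MLP reduction ratio, and output head are our assumptions for illustration, not the authors' exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionDecoder(nn.Module):
    def __init__(self, channels, n_classes=1, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(                  # shared MLP for the AP and MP paths, Eq. (7)
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
        )
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)   # Eq. (8)
        self.out_head = nn.Conv2d(channels, n_classes, kernel_size=1)

    def forward(self, f1):
        # Channel attention: sigmoid(MLP(AP(f1)) + MLP(MP(f1))) * f1
        ca = torch.sigmoid(self.mlp(F.adaptive_avg_pool2d(f1, 1)) +
                           self.mlp(F.adaptive_max_pool2d(f1, 1)))
        f_ca = ca * f1
        # Spatial attention: sigmoid(Conv(Concat(mean, max))) * f_ca
        sa = torch.sigmoid(self.spatial(torch.cat(
            [f_ca.mean(dim=1, keepdim=True), f_ca.max(dim=1, keepdim=True)[0]], dim=1)))
        f_sa = sa * f_ca
        # Output head and 4x up-sampling back to the input resolution, Eq. (9)
        return F.interpolate(self.out_head(f_sa), scale_factor=4,
                             mode="bilinear", align_corners=False)

# Example: f_1 with 64 channels at 88 x 88 is decoded to a 352 x 352 map.
logits = AttentionDecoder(channels=64)(torch.randn(1, 64, 88, 88))   # (1, 1, 352, 352)
```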

2.2.5 Loss function

The loss function of the BFE-Net network is given by Eq. (10).

$$\mathrm{{\cal L}} = {\mathrm{{\cal L}}_{WIoU}}({Out,GT} )+ {\mathrm{{\cal L}}_{WBCE}}({Out,GT} )$$
where $Out$ stands for the model's output, $GT$ represents the ground truth, and ${\mathrm{{\cal L}}_{WIoU}}(\cdot )$ and ${\mathrm{{\cal L}}_{WBCE}}(\cdot )$ denote the weighted intersection over union (WIoU) loss and weighted binary cross entropy (WBCE) loss [63], respectively. ${\mathrm{{\cal L}}_{WIoU}}(\cdot )$ constrains the model's output at a global structural level, while ${\mathrm{{\cal L}}_{WBCE}}(\cdot )$ focuses on local details. Compared to the standard IoU loss function, ${\mathrm{{\cal L}}_{WIoU}}(\cdot )$ pays more attention to pixels that are challenging to segment. Similarly, in contrast to the standard binary cross entropy (SBCE) loss function, which treats all pixels equally, ${\mathrm{{\cal L}}_{WBCE}}(\cdot )$ takes the importance of each pixel into consideration and assigns higher weights to pixels that are difficult to segment [19].
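As a concrete reference for Eq. (10), the sketch below follows the widely used weighted BCE + weighted IoU formulation popularized by the public PraNet code [19], in which pixels whose local neighborhood disagrees with their own label (largely boundary and other hard pixels) receive larger weights. We assume a comparable weighting is used here, so treat this as one plausible realization rather than the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def bfe_loss(logits, gt):
    """logits: (B, 1, H, W) raw outputs; gt: (B, 1, H, W) binary ground truth."""
    # Emphasize hard (mostly boundary) pixels via a local-neighborhood weight.
    weight = 1 + 5 * torch.abs(F.avg_pool2d(gt, kernel_size=31, stride=1, padding=15) - gt)

    # Weighted binary cross entropy, L_WBCE.
    wbce = F.binary_cross_entropy_with_logits(logits, gt, reduction="none")
    wbce = (weight * wbce).sum(dim=(2, 3)) / weight.sum(dim=(2, 3))

    # Weighted IoU, L_WIoU.
    pred = torch.sigmoid(logits)
    inter = (pred * gt * weight).sum(dim=(2, 3))
    union = ((pred + gt) * weight).sum(dim=(2, 3))
    wiou = 1 - (inter + 1) / (union - inter + 1)

    return (wbce + wiou).mean()
```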

2.2.6 Evaluation metrics

We evaluate the segmentation performance of the proposed method using internationally recognized metrics, including Mean Intersection over Union (MIoU), Mean Dice (MDice), Accuracy, Recall, Precision, Mean Absolute Error (MAE), and F2-Score, as shown in Eqs. (11)-(19). Here, TP, TN, FP, and FN represent true positives, true negatives, false positives, and false negatives, respectively.

$$IoU\_polyp = \frac{{Out\; \cap \; GT}}{{Out\; \cup \; GT}} = \frac{{TP}}{{TP + FP + FN}}$$
$$IoU\_bg = \frac{{Out\; \cap \; GT}}{{Out\; \cup \; GT}} = \frac{{TN}}{{TN + FP + FN}}$$
$$MIoU = \frac{{IoU\_polyp + IoU\_bg}}{2}$$
$$Accuracy = \frac{{TP + TN}}{{TP + TN + FP + FN}}$$
$$Precision = \frac{{TP}}{{TP + FP}}$$
$$Recall = \frac{{TP}}{{TP + FN}}$$
$$MDice = 2 \times \frac{{({Out\; \cap \; GT} )}}{{Out\; + \; GT}} = 2 \times \frac{{Precision \times Recall}}{{Precision + Recall}}$$
$$MAE = \frac{1}{m}\mathop \sum \limits_{j = 1}^m |{Ou{t_j} - G{T_j}} |$$
$$F2 - Score = 5 \times \frac{{Precision \times Recall}}{{4 \times Precision + Recall}}$$
where $IoU\_bg$ and $IoU\_polyp$ are the IoU values of the background and polyp regions in the digestive tract endoscopic image, respectively, and MIoU is their average, as shown in Fig. 4; the Dice coefficient is calculated in a similar way. m is the number of digestive tract endoscopy images in the test set, $Ou{t_j}$ is the model's prediction for the j-th image, and $G{T_j}$ is the corresponding ground-truth label.
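For reproducibility, a small sketch of the per-image metrics in Eqs. (11)-(19) computed from binarized predictions is shown below; averaging the per-image values over the m test images gives the reported scores. The epsilon terms and dictionary layout are illustrative choices, not part of the original definitions.

```python
import numpy as np

def segmentation_metrics(pred, gt, eps=1e-8):
    """pred, gt: binary numpy arrays of the same shape (1 = polyp, 0 = background)."""
    tp = np.logical_and(pred == 1, gt == 1).sum()
    tn = np.logical_and(pred == 0, gt == 0).sum()
    fp = np.logical_and(pred == 1, gt == 0).sum()
    fn = np.logical_and(pred == 0, gt == 1).sum()

    iou_polyp = tp / (tp + fp + fn + eps)                              # Eq. (11)
    iou_bg = tn / (tn + fp + fn + eps)                                 # Eq. (12)
    precision = tp / (tp + fp + eps)                                   # Eq. (15)
    recall = tp / (tp + fn + eps)                                      # Eq. (16)
    return {
        "MIoU": (iou_polyp + iou_bg) / 2,                              # Eq. (13)
        "Accuracy": (tp + tn) / (tp + tn + fp + fn + eps),             # Eq. (14)
        "Dice": 2 * precision * recall / (precision + recall + eps),   # Eq. (17)
        "MAE": np.abs(pred.astype(float) - gt.astype(float)).mean(),   # Eq. (18), per image
        "F2": 5 * precision * recall / (4 * precision + recall + eps), # Eq. (19)
    }
```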

Fig. 4. The calculation of IoU and Dice coefficient. (a) shows the polyp area; (b) shows the background area.

3. Experimental results

In this section, we validate the effectiveness of BFE-Net on five gastrointestinal polyp datasets and compare its performance and generalization capabilities with the existing methods.

3.1 Experiment details

Our code is written in Python 3.6, and we implemented BFE-Net using the PyTorch 1.10 deep learning framework. All experiments were conducted on a server equipped with an NVIDIA GPU (A100-PCIE-80GB). Given the variability in the sizes of endoscopic images, we employed a multi-scale strategy during the training phase [19], with scale factors of [0.5, 1, 1.25]. We first normalized the input images, applied data augmentation strategies such as random rotation and horizontal and vertical mirroring, and resized the input images to $352 \times 352$. For updating network parameters, we used the AdamW optimizer [64] with a learning rate of 1e-4 and a weight decay of 1e-4. The batch size was set to 16, and the number of training epochs was 100.
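A hedged sketch of the training loop described above is given below (AdamW, learning rate 1e-4, weight decay 1e-4, batch size 16, 100 epochs, 352 × 352 inputs, multi-scale factors [0.5, 1, 1.25]). The dataloader, the model, and the `bfe_loss` function from the Section 2.2.5 sketch are assumed to exist, and rounding each scale to a multiple of 32 is our assumption, so this is illustrative rather than the released code.

```python
import torch
import torch.nn.functional as F

def train(model, train_loader, epochs=100, base_size=352, scales=(0.5, 1.0, 1.25)):
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-4)
    model.train()
    for epoch in range(epochs):
        for images, masks in train_loader:           # already normalized and augmented
            for s in scales:                          # multi-scale training strategy [19]
                size = int(round(base_size * s / 32) * 32)   # assumed: keep sizes divisible by 32
                img_s = F.interpolate(images, size=(size, size), mode="bilinear",
                                      align_corners=False)
                gt_s = F.interpolate(masks, size=(size, size), mode="nearest")
                optimizer.zero_grad()
                loss = bfe_loss(model(img_s), gt_s)   # combined WIoU + WBCE loss, Eq. (10)
                loss.backward()
                optimizer.step()
```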

To enhance the robustness of the algorithm, we partitioned the data into training and test sets. The training set consisted of relatively easy-to-segment images from the Kvasir-SEG and CVC-ClinicDB datasets, comprising a total of 1450 samples, of which 900 images are from Kvasir-SEG and 550 images from CVC-ClinicDB. The test set primarily consisted of more challenging polyp images from the five datasets: the remaining 100 images from Kvasir-SEG, the remaining 62 images from CVC-ClinicDB, and the complete CVC-ColonDB, ETIS, and CVC-300 datasets, for a total of 798 images. The partition details of the training and test sets of the five datasets used in this study are presented in Table 2.

Table 2. The partition details of training and test sets in the five gastrointestinal polyp datasets

3.2 Comparative experimental results

In this section, we compare our proposed model with existing gastrointestinal polyp segmentation methods, including EnhancedUNet [65], PraNet [19], SANet [56], HarDNet [66], ACSNet [67], DBHNet [20] and CFANet [22]. We also compare BFE-Net against state-of-the-art medical image segmentation methods: U-Net [16], U-Net++ [17], DeepLabV3 [68], ResUNet [18], and AttUNet [69].

The performance evaluation results of our method and the compared methods on the five databases are shown in Supplement 1 Tables S2 to S6; the Mean Dice is visualized in Fig. 5 and the Mean IoU in Supplement 1 Figure S1. For a fair comparison, we applied the same data preprocessing, pre-trained parameters, and evaluation metrics in all experiments.

Fig. 5. A visual display of the Mean Dice of our method and the comparative methods on the five gastrointestinal polyp datasets. The abscissa represents the epoch and the ordinate indicates the value of the Mean Dice for each method. Different methods are represented by different colors. (a) Kvasir-SEG dataset; (b) CVC-ClinicDB dataset; (c) CVC-ColonDB dataset; (d) CVC-300 dataset; (e) ETIS dataset.

As shown in Fig. 5 (a) and Supplement 1 Table S2, on the Kvasir-SEG dataset, our method outperformed the current state-of-the-art method CFANet and the classical PraNet by 1% and 1.3% in terms of Mean Dice, and it exceeded both PraNet and CFANet by 5.5% in Mean IoU. On the CVC-ClinicDB dataset (Fig. 5 (b) and Supplement 1 Table S3), our method led CFANet by 1.8% in Mean Dice, the second-ranked SANet by 1.7%, and the classical PraNet by 1.8%; in terms of Mean IoU, it was better than the classical PraNet and ahead of CFANet by 4.9%. On the CVC-ColonDB dataset (Fig. 5 (c) and Supplement 1 Table S4), our method outperformed CFANet and SANet by 6.7% and 5.9%, respectively, in Mean Dice, and in Mean IoU it surpassed SANet and the classical polyp segmentation network PraNet by 3.2% and 3.4%, respectively. On the CVC-300 dataset (Fig. 5 (d) and Supplement 1 Table S5), our method exceeded PraNet in terms of Mean Dice by 2.6%. Furthermore, on the ETIS dataset (Fig. 5 (e) and Supplement 1 Table S6), our method outperformed the existing state-of-the-art method CFANet by 3.7% in Mean Dice and was ahead of PraNet by as much as 10.3%; in terms of Mean IoU, it surpassed CFANet and the classical PraNet by 3.3% and 4.4%, respectively.

To illustrate the segmentation performance of BFE-Net, we present scatter plots of Mean Dice (Fig. 6 (a)), Mean IoU (Fig. 6 (b)), and F2-Score (Fig. 6 (c)) for our method and the compared methods on the five datasets. Overall, our model demonstrated excellent performance on all five gastrointestinal polyp datasets, with all three metrics reaching the best levels. Regarding the F2-Score, our method outperformed the existing methods on all five datasets, particularly excelling on the ETIS dataset: our method achieved over 80% while the existing methods reached around 70%, an advantage of up to 10 percentage points.

Fig. 6. Scatter plots of three evaluation indicators for our method versus the comparative methods on the five gastrointestinal datasets. For the same indicator, the results on different datasets are represented by different colors. (a) Mean Dice; (b) Mean IoU; (c) F2-Score. In terms of the overall comparison, BFE-Net is the best network.

Similarly, we compared our method with the existing methods through visualizing some representative segmentation results, as shown in Fig. 7.

Fig. 7. Visualization of the segmentation results of our method and existing methods. Blue represents the GT area, green represents the correctly predicted area, and red represents the incorrectly predicted area. The first row shows endoscopic images of polyps in a complicated digestive tract environment; the second row shows endoscopic images of polyps with interference factors; the third row shows polyps with high similarity to surrounding tissues; the fourth row shows multi-target polyps; the fifth row shows small target polyps. The first column is the original image, the second column is the real label corresponding to the original image, i.e., the ground truth (GT), and the third to ninth columns represent the segmentation results of Ours, SANet, ACSNet, HarDNet, PraNet, R2UNet, and UNet, respectively. As can be seen, the proposed model can accurately localize and segment polyps regardless of their size.

When confronted with intricate digestive tract scenarios typified by diminished luminosity, the presence of interfering factors, and subdued contrast between lesion areas and adjacent tissues (depicted in rows 1 to 3 of Fig. 7), our approach exhibited commendable efficacy in lesion segmentation. Particularly noteworthy is its adept segmentation even in instances of low contrast between polyp areas and surrounding tissues, a capability lacking in alternative methodologies, as illustrated in the third row of Fig. 7. This underscores the innovative feature representation learning inherent in our model, facilitating discernment of subtle distinctions in image features and thereby enabling precise segmentation under challenging conditions.

Furthermore, in the context of segmentation tasks involving diminutive and multiple targets, our method aligns closely with ground truth, as evidenced in the fourth and fifth rows of Fig. 7. This accomplishment can be attributed to the sophisticated architectural design of our model, which integrates attention mechanisms and multi-scale feature fusion, affording it the capacity to effectively capture and delineate intricate structures within the image. In contrast, prevailing methodologies such as SANet, PraNet, and others encounter difficulties in lesion segmentation, often conflating adjacent lesions into a singular entity, as depicted in the fourth row of Fig. 7. Moreover, in tasks focusing on small lesion segmentation, extant methods like SANet, PraNet, and R2UNet manifest conspicuous over-segmentation, exemplified in the fifth row of Fig. 7.

To summarize, the visual depictions in Fig. 7 unequivocally affirm the superior performance of our model in the realm of polyp segmentation utilizing endoscopic images, thereby generating high-quality segmentation outcomes. In a concise evaluation, juxtaposed against classical polyp segmentation network methodologies such as PraNet, SANet, and HarDNet, our approach demonstrates markedly enhanced segmentation performance and introduces an innovative paradigm for addressing the inherent complexities in polyp segmentation, substantiating its superior applicability and clinical potential in this domain.

3.3 Ablation experiment results

We conducted three experiments to verify the effectiveness of the BFE-Net model proposed in this paper. In the first experiment, U-Net and PVT were each applied alone to segment gastrointestinal polyp areas. The second experiment, labeled "PVTUNet with FEF", applied the model with the PVT and U-Net branches fused by FEF to validate the function of the FEF module. In the third experiment, the PVTUNet with FEF + AD model, namely our BFE-Net, was used to illustrate the role of the AD module. Table 3 shows the average value and standard deviation (SD) of Dice obtained in the ablation experiments on the five datasets.

Table 3. Average values and SDs of Dice obtained in the ablation experiments.

From the data in Table 3, we can observe that when segmenting gastrointestinal polyps using only U-Net or PVT, the results are relatively unsatisfactory. On the Kvasir-SEG dataset, the Mean Dice coefficients for U-Net and PVT are 78.7% and 78.4%, respectively, while on the ETIS dataset they are 42.7% and 53.0%, respectively. Introducing the FEF module to fuse the features extracted by both branches raised the Mean Dice coefficient to 90.3% on the Kvasir-SEG dataset, an improvement of 11.6% over the best single-branch result; importantly, on the ETIS dataset, the improvement was as high as 22.6%, demonstrating that the fusion strategy based on the FEF module can significantly enhance segmentation performance. Finally, the "Ours" method, which adds the AD module on top of the FEF module, increased the Mean Dice coefficient to 76.9% on the ETIS dataset and to 81.0% on the CVC-ColonDB dataset, increases of 1.3% and 2.2%, respectively, compared to using the FEF module alone. In summary, our method demonstrated the best segmentation performance on the five datasets, further confirming the effectiveness of fusing PVT and U-Net features with the FEF module and adding the AD module. Figure 8 provides a visual representation of the segmentation results for some representative gastrointestinal polyps in the ablation experiments, supporting these observations.

Fig. 8. Visualization of ablation experiment results. Blue represents the GT area, green represents the correctly predicted area, and red represents the incorrectly predicted area. The first row is intestinal polyps, characterized by a relatively complex imaging environment; the second row is gastric polyps, characterized by the presence of reflective spots in the image; the third row is intestinal polyps, characterized by small polyps; the fourth row is intestinal polyps, characterized by multiple target polyps; the fifth row is intestinal polyps, characterized by the presence of interference factors in the image. The first column is the original image, the second column is the real label corresponding to the original image, i.e., the ground truth (GT), and the third to sixth columns represent the segmentation results of Ours, U-Net, PVT, and PVTUNet with FEF, respectively.

Upon examination of Fig. 8, it becomes evident that the conventional U-Net architecture exhibits substantial deficiencies in segmenting small objects and multiple targets within complex gastrointestinal environments. Specifically, U-Net struggles with the precise delineation of targets against a complex background and with distinguishing actual lesions from reflective artifacts, as indicated in the first and second rows, respectively; this limitation is further accentuated in the fourth row, where its capability to handle multi-object segmentation is challenged. The PVT, while proficient in global information extraction and lesion localization, shows its limitations in detailed segmentation across various scenarios, as depicted in the fifth column of Fig. 8. In contrast, the "PVTUNet with FEF" method notably improves the accuracy of lesion localization, which is particularly salient in cases involving multiple lesions. However, the exclusive use of the FEF module can result in false-positive segmentations of small objects, as observed in the third row. The integration of the AD module in our proposed method mitigates these shortcomings by leveraging an attention mechanism that focuses on lesion features, suppresses extraneous noise, and accurately decodes feature information, thereby enhancing the segmentation of small lesions. The collective results suggest that our approach, designated "Ours" (the third column of Fig. 8), achieves segmentation that closely parallels the ground truth labels, substantiating the proposed method's efficacy in addressing the intricate demands of gastrointestinal image segmentation.

3.4 Generalization ability study

We also conducted experiments on other publicly available datasets: the ISIC2018 dataset, the BUSI dataset, and the CAMO dataset to evaluate the generalization ability of our model. We used the Mean IoU, Mean Dice, and F2-Score metrics to evaluate the experimental results. To ensure a fair comparison, all experiments used the same data preprocessing methods, pre-trained parameters, and evaluation metrics. The performance evaluation results of our method and compared methods on the three datasets are shown in Table 4 and Fig. 9.

Fig. 9. Bar chart of evaluation metrics of our method and the comparative methods on the three generalization experimental datasets. (a) shows the evaluation metrics of each network on the ISIC2018 dataset; (b) shows the evaluation metrics of each network on the BUSI dataset; (c) shows the evaluation metrics of each network on the CAMO dataset. In terms of the overall comparison, BFE-Net is the best network.

Table 4. The average and standard deviation of the evaluation indicators of our method and the comparative methods on the three generalization experimental datasets

From the data in Table 4, we can observe that our method achieved the best results in terms of Mean Dice, Mean IoU, and F2-Score on the ISIC2018 dataset (Fig. 9 (a)), indicating its excellent segmentation ability for skin lesion segmentation tasks. Notably, our model demonstrated robust performance on the diverse and complex ISIC2018 dataset, which encompasses various types of skin lesion images, posing a challenge to the model's generalization ability in real-world scenarios.

On the BUSI dataset (Fig. 9 (b)), our method achieved the best performance in Mean Dice and had results close to the best in terms of Mean IoU, demonstrating the strong performance of our model on different data modalities. The BUSI dataset comprises breast ultrasound images characterized by low resolution, high noise, and intricate tissue structures, thereby increasing the difficulty of image segmentation. Nevertheless, our model effectively captured these complex structures and yielded satisfactory results, further validating the model's versatility and robustness.

For the CAMO dataset (Fig. 9 (c)), where the target regions are highly similar to the background, posing a significant segmentation challenge, our method obtained the best results in terms of Mean Dice, Mean IoU, and F2-Score, all approaching 80%. It outperformed the second-ranked PraNet by 5.4% in Mean Dice. This underscores our model's excellence in handling challenging datasets, effectively distinguishing between target and background. The distinctive feature of the CAMO dataset lies in the high similarity between background and target, presenting difficulties for traditional segmentation methods. However, our deep learning model successfully overcame this challenge by learning feature representations.

Overall, our model demonstrated outstanding performance on these three datasets, showcasing its superiority across different data modalities and challenging conditions. The visual results in Fig. 9 and the data in Table 4 further affirm the model's generalization ability and applicability, indicating its promising practical potential for various image segmentation tasks.

Similarly, we compared our method with existing methods through visual segmentation results, as shown in Fig. 10. In breast lesion ultrasound image segmentation tasks, our method was closest to the ground truth (as shown in the first and second rows, third column in Fig. 10). In tasks involving camouflaged targets, existing methods such as SANet and PraNet struggled with target segmentation and failed to accurately segment boundary areas (as shown in the third row of Fig. 10). Additionally, in skin lesion segmentation tasks, existing methods like SANet, PraNet, and HarDNet exhibited clear over-segmentation (as shown in the fifth row of Fig. 10). In summary, the visual results in Fig. 10 once again confirm the outstanding performance of our model in other segmentation tasks and its excellent generalization performance.

Fig. 10. Visualization of the segmentation results of our method and the comparison methods. The first and second rows are breast ultrasound images; the third and fourth rows are camouflaged object images; and the fifth and sixth rows are dermoscopic images. The first column is the original image, the second column is the real label corresponding to the original image, and the third to ninth columns represent the segmentation results of Ours, SANet, ACSNet, HarDNet, PraNet, R2UNet, and UNet, respectively. As can be seen, the proposed model can accurately localize and segment the lesion area.

4. Discussions

The BFE-Net model was rigorously evaluated using internationally recognized metrics, including Mean Intersection over Union (MIoU), Mean Dice, Accuracy, Recall, Precision, Mean Absolute Error (MAE), and F2-Score. These metrics provide a comprehensive assessment of the model's segmentation accuracy and reliability, as described in Section 2.2.6. The model was verified on five challenging gastrointestinal polyp datasets: Kvasir-SEG, CVC-ClinicDB, CVC-ColonDB, CVC-300, and ETIS. Additionally, to assess its generalization capabilities, experiments were also conducted on other medical and natural image datasets, namely ISIC2018 (skin imaging), BUSI (breast ultrasound images), and CAMO (camouflaged objects).

The performance of BFE-Net was compared with other state-of-the-art medical image segmentation methods such as U-Net, U-Net++, DeepLabV3, ResUNet, AttUNet, EnhancedUNet, PraNet, SANet, HarDNet, ACSNet, DBHNet, and CFANet. The model showed superior performance across all datasets. For instance, on the Kvasir-SEG dataset, BFE-Net achieved higher Mean Dice and Mean IoU scores than the other models, demonstrating its effectiveness in accurately segmenting polyps.

The model was not only effective in gastrointestinal polyp segmentation but also showed exceptional generalization capabilities across different data modalities. For example, on the ISIC2018 dataset, BFE-Net outperformed the other models in terms of Mean Dice, Mean IoU, and F2-Score, indicating its excellent segmentation ability for skin lesion segmentation tasks. The quality of segmentation by BFE-Net was further highlighted through visual comparisons: in complex scenarios involving small and multiple target polyps, the segmentation results were closest to the ground truth, which indicates the model's capacity to handle challenging segmentation tasks effectively. Finally, ablation experiments confirmed the effectiveness of the feature enhanced fusion (FEF) module and the attention decoder (AD) module in BFE-Net; these components contributed significantly to the model's performance, as demonstrated by improvements in Mean Dice scores across the datasets.

We also discuss weighting issues regarding the loss function, some limitations of our model, and future research directions.

4.1 Weight parameters of the weighted loss function

In this paper, we use the loss function $\mathrm{{\cal L}} = \lambda \cdot {\mathrm{{\cal L}}_{WIoU}} + \mu \cdot {\mathrm{{\cal L}}_{WBCE}}$. In Eq. (10), the two terms are weighted equally, i.e., $\lambda = \mu = 1$. In general, however, the weights $\lambda $ and $\mu $ need to be determined based on experimental results; in the following study, their values fall within the range [0, 1] with $\lambda + \mu = 1$. In Table 3, we empirically found that the segmentation results obtained by fusing PVT and U-Net with equal weighting were satisfactory, which motivated the initial setting $\lambda = \mu = 1$. To further explore the influence of $\lambda $ and $\mu $ on the experimental results, we conducted a series of experiments and present the performance of our method under different weight combinations in Fig. 11 (for detailed data, please refer to Supplement 1 Table S7 and Table S8). We observed that our method may require different weight values for $\lambda $ and $\mu $ on different datasets. For example, on the Kvasir-SEG dataset, when $\lambda = 1.0$ and $\mu = 0.0$, our method achieved the optimal Mean Dice and Mean IoU. In addition, on the ETIS dataset, our method exhibited significant performance differences for different $\lambda $ and $\mu $ values: when $\lambda = 0.8$ and $\mu = 0.2$, the Mean Dice reached 81.7% and the Mean IoU reached 86.5%. This further indicates that in more complex segmentation tasks, such as the ETIS dataset, which includes many small target lesions, ${\mathrm{{\cal L}}_{WBCE}}$ focuses more on local details, considering the importance of each pixel and assigning higher weights to pixels that are difficult to segment; in such cases, ${\mathrm{{\cal L}}_{WIoU}}$ does not need to focus as much on global information.

Fig. 11. Visual diagram of the impact of different $\lambda $ and $\mu $ values on the evaluation indicators. (a) The impact of different $\lambda $ and $\mu $ values on the Mean Dice index; (b) the impact of different $\lambda $ and $\mu $ values on the Mean IoU index.

4.2 Limitations

Despite the significant performance advantages demonstrated by the proposed BFE-Net model for gastrointestinal polyp segmentation tasks, some limitations remain in edge segmentation. As shown in Fig. 7 (second row, third column, and fourth row, third column), there is a discernible loss of edge information along the polyp boundaries, leading to minor discrepancies between the generated segmentation map and the actual labels in the vicinity of polyp contours. Such inaccuracies are more noticeable in complex cases involving irregularly shaped or small polyps, where precise edge delineation is crucial, and may be attributed to the inherent challenge of differentiating polyp edges from surrounding mucosal textures, especially in low-contrast areas. It is noteworthy that, although our model has a relatively large number of parameters, its computational complexity is lower than that of existing models, as shown in Table 5.

Table 5. Comparison of the amount of computation and parameters between our model and existing models

4.3 Future direction

It is worth emphasizing that, despite the limitations in edge segmentation, BFE-Net precisely identifies polyp locations and quantities. In light of the challenges posed by edge information loss, we propose that future research should concentrate on devising more resilient polyp segmentation networks. Specifically, the integration of polyp edge-smoothing techniques could play a pivotal role in enhancing the model's capacity to capture and leverage intricate polyp edge and contour details. This strategic focus on refining edge segmentation could lead to more nuanced and accurate polyp segmentation results, ultimately advancing the capabilities of medical imaging and diagnosis in the domain of gastrointestinal health.

5. Conclusion

In this paper, we proposed a bilateral fusion enhanced network aimed at improving the segmentation performance of small and multi-target polyps in gastrointestinal endoscopy images. We introduced the FEF module to fuse the advantageous features extracted by U-Net and PVT, compensating for U-Net's limited ability to capture global context information and PVT's deficiency in capturing spatial details. To locate lesions more precisely and suppress irrelevant noise, we employed the attention decoder (AD), which uses a mixed attention mechanism, as a replacement for traditional decoder structures. Experimental results demonstrated that the FEF and AD modules play a positive role in improving segmentation performance.

Overall, our method achieved the best results across multiple evaluation metrics on five different gastrointestinal polyp segmentation datasets, significantly outperforming existing segmentation models. Additionally, our method also achieved the best results on three different publicly available datasets, indicating the exceptional generalization capability of our method across different data modalities. However, our experiments also revealed that there is still room for performance improvement. In endoscopic image segmentation tasks, lesion boundaries often exhibit complexity due to factors such as possible highlights and reflections, artifacts, and the high similarity of lesion regions to surrounding tissues. These factors explain why our model has relatively poor learning ability at the boundary positions, highlighting areas for potential improvement in performance.

Finally, we hope that this study can inspire more new thinking to promote the clinical application of intelligent polyp segmentation methods.

Funding

National Natural Science Foundation of China (62271127); Medico-Engineering Cooperation Funds from the University of Electronic Science and Technology of China (HXDZ22005, ZYGX2022YGRH011); Sichuan Natural Science Foundation (23NSFSC0627).

Disclosures

The authors declare that there are no conflicts of interest related to this article.

Data availability

Data underlying the results presented in this paper are available in Ref. [19].

Supplemental document

See Supplement 1 for supporting content.

References

1. M.-Y. Cai, L. Zhu, X.-Y. Xu, et al., “Endoscopic mucosal resection of gastrointestinal polyps with a novel low-temperature plasma radio frequency generator: a non-inferiority multi-center randomized control study,” Surg. Endosc. 37(4), 3272–3279 (2023). [CrossRef]  

2. J. Ferlay, M. Colombet, I. Soerjomataram, et al., “Cancer statistics for the year 2020: An overview,” Int. J. Cancer 149(4), 778–789 (2021). [CrossRef]  

3. M. C. S. Wong, J. Huang, P. S. F. Chan, et al., “Global incidence and mortality of gastric cancer, 1980-2018,” JAMA Netw. Open 4(7), e2118457 (2021). [CrossRef]  

4. E. Morgan, M. Arnold, A. Gini, et al., “Global burden of colorectal cancer in 2020 and 2040: incidence and mortality estimates from GLOBOCAN,” Gut 72(2), 338–344 (2023). [CrossRef]  

5. J. Weng, S. Li, Z. Zhu, et al., “Exploring immunotherapy in colorectal cancer,” J. Hematol. Oncol. 15(1), 95 (2022). [CrossRef]  

6. T. Keen and C. Brooks, “Principles of gastrointestinal endoscopy,” Surg. 41(2), 100–105 (2023). [CrossRef]  

7. Y. Hazewinkel and E. Dekker, “Colonoscopy: basic principles and novel techniques,” Nat. Rev. Gastroenterol. Hepatol. 8(10), 554–564 (2011). [CrossRef]  

8. J. Wei, S. Zhao, and Y. Bai, “The impact of texture and color enhancement imaging on adenoma and sessile serrated lesion detection: much more to explore,” Gastroenterology. (2024).

9. H. Ono, H. Kondo, T. Gotoda, et al., “Endoscopic mucosal resection for treatment of early gastric cancer,” Gut. 48(2), 225–229 (2001). [CrossRef]  

10. T. Gotoda, H. Yamamoto, and R. M. Soetikno, “Endoscopic submucosal dissection of early gastric cancer,” J. Gastroenterol. 41(10), 929–942 (2006). [CrossRef]  

11. M. Rottoli, S. Bona, R. Rosati, et al., “Laparoscopic rectal resection for cancer: effects of conversion on short-term outcome and survival,” Ann. Surg. Oncol. 16(5), 1279–1286 (2009). [CrossRef]  

12. S. A. Antoniou, G. A. Antoniou, O. O. Koch, et al., “Robot-assisted laparoscopic surgery of the colon and rectum,” Surg. Endosc. 26(1), 1–11 (2012). [CrossRef]  

13. M. Ishioka, H. Osawa, T. Hirasawa, et al., “Performance of an artificial intelligence-based diagnostic support tool for early gastric cancers: Retrospective study,” Dig. Endosc. Den. 35(4), 483–491 (2022). [CrossRef]  

14. M. Fiori, P. Musé, and G. Sapiro, “A complete system for candidate polyps detection in virtual colonoscopy,” Int. J. Patt. Recogn. Artif. Intell. 28(07), 1460014 (2014). [CrossRef]  

15. T. Shibata, A. Teramoto, H. Yamada, et al., “Automated detection and segmentation of early gastric cancer from endoscopic images using mask R-CNN,” Appl. Sci. 10(11), 3842 (2020). [CrossRef]  

16. O. Ronneberger, P. Fischer, and T. Brox, “U-Net: convolutional networks for biomedical image segmentation,” in Medical Image Computing and Computer Assisted Intervention, MICCAI 2015 (Springer International Publishing, 2015), pp. 234–241.

17. Z. Zhou, M. M. Rahman Siddiquee, N. Tajbakhsh, et al., “UNet++: A nested U-Net architecture for medical image segmentation,” Deep Learn. Med. Image Anal. Multimodal Learn. Clin. Decis. Support. 11045, 3–11 (2018). [CrossRef]  

18. M. Z. Alom, M. Hasan, C. Yakopcic, et al., “Recurrent residual convolutional neural network based on U-Net (R2U-Net) for Medical Image Segmentation,” arXiv, arXiv:1802.06955 (2018). [CrossRef]  

19. D.-P. Fan, G.-P. Ji, T. Zhou, et al., “PraNet: parallel reverse attention network for polyp segmentation,” in Medical Image Computing and Computer Assisted Intervention, MICCAI 2020 (Springer International Publishing, 2020), pp. 263–273.

20. D. He, Y. Zhang, H. Huang, et al., “Dual-branch hybrid network for lesion segmentation in gastric cancer images,” Sci. Rep. 13(1), 6377 (2023). [CrossRef]  

21. Z. Liu, Y. Lin, Y. Cao, et al., “Swin transformer: hierarchical vision transformer using shifted windows,” in IEEE CVF International Conference Computer Vision, ICCV 2021 (IEEE, 2021), 9992–10002.

22. T. Zhou, Y. Zhou, K. He, et al., “Cross-level feature aggregation network for polyp segmentation,” Pattern Recognition 140, 109555 (2023). [CrossRef]  

23. A. Vaswani, N. Shazeer, N. Parmar, et al., “Attention is all you need,” arXiv, arXiv:1706.03762 (2017). [CrossRef]  

24. J. Beal, E. Kim, E. Tzeng, et al., “Toward transformer-based object detection,” arXiv, arXiv:2012.09958 (2020). [CrossRef]  

25. A. Dosovitskiy, L. Beyer, A. Kolesnikov, et al., “An image is worth 16 × 16 words: transformers for image recognition at scale,” arXiv, arXiv:2010.11929 (2020). [CrossRef]  

26. P. Sun, J. Cao, Y. Jiang, et al., “TransTrack: multiple object tracking with transformer,” arXiv, arXiv:2012.15460 (2021). [CrossRef]  

27. N. Carion, F. Massa, G. Synnaeve, et al., “End-to-end object detection with transformers,” in European Conference on Computer Vision, ECCV 2020 (Springer International Publishing, 2020), pp. 213–229.

28. W. Wang, E. Xie, X. Li, et al., “Pyramid vision transformer: a versatile backbone for dense prediction without convolutions,” arXiv, arXiv:2102.12122 (2021). [CrossRef]  

29. D. Jha, P. H. Smedsrud, M. A. Riegler, et al., “Kvasir-SEG: a segmented polyp dataset,” in MultiMedia Modeling 2020 (Springer International Publishing, 2020), pp. 451–462.

30. J. Bernal, F. J. Sánchez, G. Fernández-Esparrach, et al., “WM-DOVA maps for accurate polyp highlighting in colonoscopy: Validation vs. saliency maps from physicians,” Comput. Med. Imaging Graph. 43, 99–111 (2015). [CrossRef]  

31. N. Tajbakhsh, S. R. Gurudu, and J. Liang, “Automated polyp detection in colonoscopy videos using shape and context information,” IEEE Trans. Med. Imaging 35(2), 630–644 (2016). [CrossRef]  

32. D. Vázquez, J. Bernal, F. J. Sánchez, et al., “A benchmark for endoluminal scene segmentation of colonoscopy images,” J. Healthc. Eng. 2017, 1–9 (2017). [CrossRef]  

33. J. Silva, A. Histace, O. Romain, et al., “Toward embedded detection of polyps in WCE images for early diagnosis of colorectal cancer,” Int. J. Comput. Assist. Radiol. Surg. 9(2), 283–293 (2014). [CrossRef]  

34. N. C. F. Codella, D. Gutman, M. E. Celebi, et al., “Skin lesion analysis toward melanoma detection: A challenge at the 2017 International symposium on biomedical imaging (ISBI), hosted by the international skin imaging collaboration (ISIC),” in 2018 IEEE 15th International Symposium on Biomedical Imaging, ISBI 2018 (IEEE, 2018), pp. 168–172.

35. W. Al-Dhabyani, M. Gomaa, H. Khaled, et al., “Dataset of breast ultrasound images,” Data Brief 28, 104863 (2020). [CrossRef]  

36. T.-N. Le, T. V. Nguyen, Z. Nie, et al., “Anabranch network for camouflaged object segmentation,” Comput. Vis. Image Underst. 184, 45–56 (2019). [CrossRef]  

37. Y. Li, H. Zhao, X. Qi, et al., “Fully convolutional networks for panoptic segmentation with point-based supervision,” IEEE Trans. PATTERN Anal. Mach. Intell. 45(4), 4229–4244 (2023). [CrossRef]  

38. H. Zhao, J. Shi, X. Qi, et al., “Pyramid scene parsing network,” in IEEE Computer Society 2017 (IEEE, 2017), pp. 6230–6239.

39. Y. Xue, T. Xu, and X. Huang, “Adversarial learning with multi-scale loss for skin lesion segmentation,” in IEEE 15th International Symposium on Biomedical Imaging, ISBI 2018 (IEEE, 2018), pp. 859–863.

40. Z. Gu, J. Cheng, H. Fu, et al., “CE-Net: context encoder network for 2d medical image segmentation,” IEEE Trans. Med. Imaging 38(10), 2281–2292 (2019). [CrossRef]  

41. D. Jha, M. A. Riegler, D. Johansen, et al., “DoubleU-Net: a deep convolutional neural network for medical image segmentation,” in 2020 IEEE 33rd International Symposium on Computer-Based Medical Systems, CBMS 2020 (IEEE, 2020), pp. 558–564. [CrossRef]  

42. J. M. J. Valanarasu, P. Oza, I. Hacihaliloglu, et al., “Medical transformer: gated axial-attention for medical image segmentation,” in Medical Image Computing and Computer Assisted Intervention, MICCAI 2021 (Springer International Publishing, 2021), pp. 36–46.

43. J. M. J. Valanarasu and V. M. Patel, “UNeXt: MLP-based rapid medical image segmentation network,” arXiv, arXiv:2203.04967 (2022). [CrossRef]  

44. A. He, K. Wang, T. Li, et al., “H2Former: an efficient hierarchical hybrid transformer for medical image segmentation,” IEEE Trans. Med. Imaging 42(9), 2763–2775 (2023). [CrossRef]  

45. T. Yan, Y. Y. Qin, P. K. Wong, et al., “Semantic segmentation of gastric polyps in endoscopic images based on convolutional neural networks and an integrated evaluation approach,” Bioengineering 10(7), 806 (2023). [CrossRef]  

46. M. Sandler, A. Howard, M. Zhu, et al., “MobileNetV2: inverted residuals and linear bottlenecks,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2018 (IEEE, 2018), pp. 4510–4520.

47. Y. Sun, Y. Li, P. Wang, et al., “Lesion segmentation in gastroscopic images using generative adversarial networks,” J. Digit. Imaging 35(3), 459–468 (2022). [CrossRef]  

48. S. Li, X. Tang, B. Cao, et al., “Boundary guided network with two-stage transfer learning for gastrointestinal polyps segmentation,” Expert Syst. Appl. 240, 122503 (2024). [CrossRef]  

49. X. Sun, P. Zhang, D. Wang, et al., “Colorectal polyp segmentation by U-Net with dilation convolution,” arXiv, arXiv:1912.11947 (2019). [CrossRef]  

50. F. Isensee, P. F. Jaeger, S. A. A. Kohl, et al., “nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation,” Nat. Methods 18(2), 203–211 (2021). [CrossRef]  

51. Z. Huang, H. Wang, Z. Deng, et al., “STU-Net: scalable and transferable medical image segmentation models empowered by large-scale supervised pre-training,” arXiv, arXiv:2304.06716 (2023). [CrossRef]  

52. H. A. Qadir, Y. Shin, J. Solhusvik, et al., “Polyp detection and segmentation using mask R-CNN: does a deeper feature extractor CNN always perform better?” in 13th International Symposium on Medical Information and Communication Technology, ISMICT 2019 (IEEE, 2019), pp. 1–6.

53. Z. Yin, K. Liang, Z. Ma, et al., “Duplex contextual relation network for polyp segmentation,” in 2022 IEEE 19th International Symposium on Biomedical Imaging, ISBI 2022 (IEEE, 2022), pp. 1–5.

54. X. Zhao, L. Zhang, and H. Lu, “Automatic polyp segmentation via multi-scale subtraction network,” in Medical Image Computing and Computer Assisted Intervention, MICCAI 2021 (Springer International Publishing, 2021), vol. 12901.

55. D. Banik, K. Roy, D. Bhattacharjee, et al., “Polyp-Net: a multimodel fusion network for polyp segmentation,” IEEE Trans. Instrum. Meas. 70, 1–12 (2021). [CrossRef]  

56. J. Wei, Y. Hu, R. Zhang, et al., “Shallow attention network for polyp segmentation,” arXiv, arXiv:2108.00882 (2021). [CrossRef]  

57. P. Zhang, J. Li, Y. Wang, et al., “Domain adaptation for medical image segmentation: a meta-learning method,” J. Imaging 7(2), 31 (2021). [CrossRef]  

58. R. Khadka, D. Jha, S. Hicks, et al., “Meta-learning with implicit gradients in a few-shot setting for medical image segmentation,” Comput. Biol. Med. 143, 105227 (2022). [CrossRef]  

59. T. Leng, Y. Zhang, K. Han, et al., “Self-sampling meta SAM: enhancing few-shot medical image segmentation with meta-learning,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (IEEE, 2024), pp. 7925–7935.

60. T.-Y. Lin, A. RoyChowdhury, and S. Maji, “Bilinear CNN models for fine-grained visual recognition,” in 2015 IEEE International Conference on Computer Vision, ICCV 2015 (IEEE, 2015), pp. 1449–1457.

61. J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2018 (IEEE, 2018), pp. 7132–7141.

62. M. Sun, K. Liang, W. Zhang, et al., “Non-local attention and densely-connected convolutional neural networks for malignancy suspiciousness classification of gastric ulcer,” IEEE Access 8, 15812–15822 (2020). [CrossRef]  

63. J. Wei, S. Wang, and Q. Huang, “F3net: fusion, feedback and focus for salient object detection,” Proc. AAAI Conf. Artif. Intell. 34(07), 12321–12328 (2020). [CrossRef]  

64. I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” arXiv, arXiv:1711.05101 (2018). [CrossRef]  

65. K. Patel, A. M. Bur, and G. Wang, “Enhanced U-Net: a feature enhancement network for polyp segmentation,” arXiv, arXiv:2301.10847 (2021). [CrossRef]  

66. P. Chao, C.-Y. Kao, Y.-S. Ruan, et al., “HarDNet: a low memory traffic network,” arXiv, arXiv:1909.00948 (2019). [CrossRef]  

67. R. Zhang, G. Li, Z. Li, et al., “Adaptive context selection for polyp segmentation,” in Medical Image Computing and Computer Assisted Intervention, MICCAI 2020 (Springer International Publishing, 2020), pp. 253–262.

68. L.-C. Chen, G. Papandreou, F. Schroff, et al., “Rethinking atrous convolution for semantic image segmentation,” arXiv, arXiv:1706.05587 (2017). [CrossRef]  

69. O. Oktay, J. Schlemper, L. L. Folgoc, et al., “Attention U-Net: learning where to look for the pancreas,” arXiv, arXiv:1804.03999 (2018). [CrossRef]  

70. C.-H. Huang, H.-Y. Wu, and Y.-L. Lin, “HarDNet-MSEG: a simple encoder-decoder polyp segmentation neural network that achieves over 0.9 mean dice and 86 FPS,” arXiv, arXiv:2101.07172 (2021). [CrossRef]  



Figures (11)

Fig. 1. The general framework of the proposed BFE-Net.
Fig. 2. Schematic diagram of the feature enhancement fusion module.
Fig. 3. Schematic diagram of the attention decoder module.
Fig. 4. The calculation of the IoU and Dice coefficients. (a) shows the polyp area; (b) shows the background area.
Fig. 5. Mean Dice of our method and the comparative methods on five gastrointestinal polyp datasets. The abscissa represents the epoch and the ordinate the Mean Dice of each method; different methods are shown in different colors. (a) Kvasir-SEG dataset; (b) CVC-ClinicDB dataset; (c) CVC-ColonDB dataset; (d) CVC-300 dataset; (e) ETIS dataset.
Fig. 6. Scatter plots of three evaluation indicators for our method versus the comparative methods on five gastrointestinal datasets. For each indicator, results on different datasets are shown in different colors. (a) Mean Dice; (b) Mean IoU; (c) F2-Score. In the overall comparison, BFE-Net is the best network.
Fig. 7. Visualization of the segmentation results of our method and existing methods. Blue represents the GT area, green the correctly predicted area, and red the incorrectly predicted area. The first row shows endoscopic images of polyps in a complicated digestive tract environment; the second row shows endoscopic images of polyps with interference factors; the third row shows polyps with high similarity to the surrounding tissues; the fourth row shows multi-target polyps; the fifth row shows small target polyps. The first column is the original image, the second column is the corresponding real label, i.e., ground truth (GT), and the remaining columns show the segmentation results of Ours, SANet, ACSNet, HarDNet, PraNet, R2UNet, and UNet, respectively. The proposed model accurately localizes and segments polyps regardless of their size.
Fig. 8. Visualization of ablation experiment results. Blue represents the GT area, green the correctly predicted area, and red the incorrectly predicted area. The first row shows intestinal polyps in a relatively complex imaging environment; the second row shows gastric polyps with reflective spots in the image; the third row shows small intestinal polyps; the fourth row shows multiple target intestinal polyps; the fifth row shows intestinal polyps with interference factors in the image. The first column is the original image, the second column is the corresponding real label, i.e., ground truth (GT), and the third to sixth columns show the segmentation results of Ours, U-Net, PVT, and PVTUNet with FEF.
Fig. 9. Bar chart of evaluation metrics of our method and the comparative methods on three generalization experimental datasets. (a) ISIC2018 dataset; (b) BUSI dataset; (c) CAMO dataset. In the overall comparison, BFE-Net is the best network.
Fig. 10. Visualization of the segmentation results of our method and the comparison methods. The first and second rows are breast ultrasound images; the third and fourth rows are camouflaged object images; and the fifth and sixth rows are dermoscopic images. The first column is the original image, the second column is the corresponding real label, and the remaining columns show the segmentation results of Ours, SANet, ACSNet, HarDNet, PraNet, R2UNet, and UNet, respectively. The proposed model accurately localizes and segments the lesion area.
Fig. 11. Visual diagram of the impact of different $\lambda$ and $\mu$ values on the evaluation indicators. (a) Impact of different $\lambda$ and $\mu$ values on the Mean Dice index; (b) impact of different $\lambda$ and $\mu$ values on the Mean IoU index.

Tables (5)

Table 1. The dataset details in the generalization experiments
Table 2. The partition details of the training and test sets in the five gastrointestinal polyp datasets
Table 3. Average values and SDs of Dice obtained in the ablation experiments
Table 4. The average and standard deviation of the evaluation indicators of our method and the comparative methods on three generalization experimental datasets
Table 5. Comparison of the amount of calculations and parameters between our model and existing models

Equations (19)


$u_i = W_2\big(W_1(\mathrm{Maxpool}(X))\big), \quad i = 0, 1, 2, 3, 4$
$F_{p_i} = \mathrm{Conv}_{1 \times 1}(p_i)$
$F_{u_i} = \mathrm{Conv}_{1 \times 1}(u_i)$
$F_{f_{i+1}} = \mathrm{Conv}_{1 \times 1}\big(\mathrm{Up}_{2\times}(f_{i+1})\big)$
$H_i = F_{p_i} \otimes F_{u_i} \otimes F_{f_{i+1}}$
$f_i = W_3\Big(W_2\big(W_1(\mathrm{ReLU}(\mathrm{BN}(\mathrm{Concat}(H_i, p_i, u_i))))\big)\Big) + H_i$
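To make the data flow of these FEF equations concrete, the following is a minimal PyTorch sketch of one fusion stage. It is not the authors' released code: the element-wise product used for $H_i$, the output channel width, and the class name FEFSketch are assumptions made purely for illustration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FEFSketch(nn.Module):
    """Illustrative fusion block following the FEF equations; channel width is assumed."""
    def __init__(self, ch_p, ch_u, ch_f, ch_out=64):
        super().__init__()
        # 1x1 projections of the PVT feature p_i, the U-Net feature u_i,
        # and the upsampled deeper fused feature f_{i+1} to a common width
        self.proj_p = nn.Conv2d(ch_p, ch_out, kernel_size=1)
        self.proj_u = nn.Conv2d(ch_u, ch_out, kernel_size=1)
        self.proj_f = nn.Conv2d(ch_f, ch_out, kernel_size=1)
        # BN -> ReLU -> stacked convolutions applied to Concat(H_i, p_i, u_i)
        self.refine = nn.Sequential(
            nn.BatchNorm2d(ch_out + ch_p + ch_u),
            nn.ReLU(inplace=True),
            nn.Conv2d(ch_out + ch_p + ch_u, ch_out, 3, padding=1),
            nn.Conv2d(ch_out, ch_out, 3, padding=1),
            nn.Conv2d(ch_out, ch_out, 3, padding=1),
        )

    def forward(self, p_i, u_i, f_next):
        # F_{f_{i+1}}: upsample the deeper feature by 2x, then project it
        f_up = self.proj_f(F.interpolate(f_next, scale_factor=2,
                                         mode="bilinear", align_corners=False))
        h_i = self.proj_p(p_i) * self.proj_u(u_i) * f_up            # H_i (element-wise product assumed)
        f_i = self.refine(torch.cat([h_i, p_i, u_i], dim=1)) + h_i  # residual connection from H_i
        return f_i
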
$f_{ca} = \sigma\big(\mathrm{MLP}_1(\mathrm{AP}(f_1)) + \mathrm{MLP}_2(\mathrm{MP}(f_1))\big) \otimes f_1$
$f_{sa} = \sigma\big(\mathrm{Conv}(\mathrm{Concat}(\mathrm{Mean}(f_{ca}), \mathrm{Max}(f_{ca})))\big) \otimes f_{ca}$
$\mathrm{Output} = \mathrm{Up}_{4\times}\big(\mathrm{out}(f_{sa})\big)$
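The three AD equations describe channel attention followed by spatial attention, a one-channel prediction head, and 4x upsampling. The sketch below shows one way this could be assembled in PyTorch; the reduction ratio, the 7x7 spatial-attention kernel, and the class name AttentionDecoderSketch are illustrative assumptions, not the paper's exact configuration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionDecoderSketch(nn.Module):
    """Mixed (channel + spatial) attention as in the AD equations; hyperparameters are assumed."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        # MLP_1 and MLP_2 for the average-pooled (AP) and max-pooled (MP) channel descriptors
        self.mlp_avg = nn.Sequential(nn.Conv2d(channels, channels // reduction, 1),
                                     nn.ReLU(inplace=True),
                                     nn.Conv2d(channels // reduction, channels, 1))
        self.mlp_max = nn.Sequential(nn.Conv2d(channels, channels // reduction, 1),
                                     nn.ReLU(inplace=True),
                                     nn.Conv2d(channels // reduction, channels, 1))
        # Convolution over the concatenated per-pixel mean/max maps for spatial attention
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)
        self.head = nn.Conv2d(channels, 1, kernel_size=1)  # "out": 1-channel prediction head

    def forward(self, f1):
        # Channel attention: sigma(MLP1(AP(f1)) + MLP2(MP(f1))) applied to f1
        ca = torch.sigmoid(self.mlp_avg(F.adaptive_avg_pool2d(f1, 1))
                           + self.mlp_max(F.adaptive_max_pool2d(f1, 1)))
        f_ca = ca * f1
        # Spatial attention: sigma(Conv(Concat(Mean(f_ca), Max(f_ca)))) applied to f_ca
        mean_map = f_ca.mean(dim=1, keepdim=True)
        max_map, _ = f_ca.max(dim=1, keepdim=True)
        sa = torch.sigmoid(self.spatial_conv(torch.cat([mean_map, max_map], dim=1)))
        f_sa = sa * f_ca
        # Output = Up_4x(out(f_sa))
        return F.interpolate(self.head(f_sa), scale_factor=4,
                             mode="bilinear", align_corners=False)
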
$L = L_{\mathrm{WIoU}}(\mathrm{Out}, \mathrm{GT}) + L_{\mathrm{WBCE}}(\mathrm{Out}, \mathrm{GT})$
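The loss combines a weighted IoU term and a weighted binary cross-entropy term, in the spirit of F3Net [63] and PraNet [19]. A commonly used PyTorch formulation is sketched below; the boundary-weighting scheme (31x31 average pooling, weight factor 5) follows those works and is an assumption here, not a detail confirmed by this paper.

import torch
import torch.nn.functional as F

def structure_loss(pred, mask):
    """L = L_WIoU + L_WBCE with pixel weights emphasizing boundary regions (F3Net-style)."""
    # Pixels whose local average differs from their own value (boundaries) receive larger weights
    weit = 1 + 5 * torch.abs(F.avg_pool2d(mask, kernel_size=31, stride=1, padding=15) - mask)
    # Weighted binary cross-entropy (pred holds logits)
    wbce = F.binary_cross_entropy_with_logits(pred, mask, reduction="none")
    wbce = (weit * wbce).sum(dim=(2, 3)) / weit.sum(dim=(2, 3))
    # Weighted IoU on the sigmoid probabilities
    prob = torch.sigmoid(pred)
    inter = ((prob * mask) * weit).sum(dim=(2, 3))
    union = ((prob + mask) * weit).sum(dim=(2, 3))
    wiou = 1 - (inter + 1) / (union - inter + 1)
    return (wbce + wiou).mean()
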
$\mathrm{IoU}_{\mathrm{polyp}} = \dfrac{|\mathrm{Out} \cap \mathrm{GT}|}{|\mathrm{Out} \cup \mathrm{GT}|} = \dfrac{TP}{TP + FP + FN}$
$\mathrm{IoU}_{\mathrm{bg}} = \dfrac{|\overline{\mathrm{Out}} \cap \overline{\mathrm{GT}}|}{|\overline{\mathrm{Out}} \cup \overline{\mathrm{GT}}|} = \dfrac{TN}{TN + FP + FN}$
$\mathrm{MIoU} = \dfrac{\mathrm{IoU}_{\mathrm{polyp}} + \mathrm{IoU}_{\mathrm{bg}}}{2}$
$\mathrm{Accuracy} = \dfrac{TP + TN}{TP + TN + FP + FN}$
$\mathrm{Precision} = \dfrac{TP}{TP + FP}$
$\mathrm{Recall} = \dfrac{TP}{TP + FN}$
$\mathrm{MDice} = \dfrac{2 \times |\mathrm{Out} \cap \mathrm{GT}|}{|\mathrm{Out}| + |\mathrm{GT}|} = \dfrac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$
$\mathrm{MAE} = \dfrac{1}{m} \sum_{j=1}^{m} \left| \mathrm{Out}_j - \mathrm{GT}_j \right|$
$F2\text{-}\mathrm{Score} = \dfrac{5 \times \mathrm{Precision} \times \mathrm{Recall}}{4 \times \mathrm{Precision} + \mathrm{Recall}}$
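For reference, the evaluation metrics above can be computed from a predicted probability map and a binary ground-truth mask as in the small NumPy sketch below; the binarization threshold and the function name are illustrative choices rather than details taken from the paper.

import numpy as np

def segmentation_metrics(pred, gt, threshold=0.5, eps=1e-8):
    """Compute the metrics defined above from a probability map `pred` and binary mask `gt`."""
    out = (pred >= threshold).astype(np.float64)
    gt = (gt >= 0.5).astype(np.float64)

    tp = np.sum(out * gt)            # true positives
    fp = np.sum(out * (1 - gt))      # false positives
    fn = np.sum((1 - out) * gt)      # false negatives
    tn = np.sum((1 - out) * (1 - gt))  # true negatives

    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    iou_polyp = tp / (tp + fp + fn + eps)
    iou_bg = tn / (tn + fp + fn + eps)

    return {
        "Dice": 2 * tp / (2 * tp + fp + fn + eps),
        "MIoU": (iou_polyp + iou_bg) / 2,
        "Accuracy": (tp + tn) / (tp + tn + fp + fn + eps),
        "MAE": np.mean(np.abs(pred - gt)),
        "F2": 5 * precision * recall / (4 * precision + recall + eps),
    }
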