
Automatic classification of esophageal disease in gastroscopic images using an efficient channel attention deep dense convolutional neural network

Open Access

Abstract

The accurate diagnosis of various esophageal diseases at different stages is crucial for providing precision therapy planning and improving the 5-year survival rate of esophageal cancer patients. Automatic classification of various esophageal diseases in gastroscopic images can assist doctors in improving diagnostic efficiency and accuracy. Existing deep learning-based classification methods can classify only very few categories of esophageal diseases at the same time. Hence, we propose a novel efficient channel attention deep dense convolutional neural network (ECA-DDCNN), which can classify esophageal gastroscopic images into four main categories: normal esophagus (NE), precancerous esophageal diseases (PEDs), early esophageal cancer (EEC) and advanced esophageal cancer (AEC), covering six common esophageal diseases and the normal esophagus (seven sub-categories). In total, 20,965 gastroscopic images were collected from 4,077 patients and used to train and test the proposed method. Extensive experimental results demonstrate convincingly that the proposed ECA-DDCNN outperforms other state-of-the-art methods. The classification accuracy (Acc) of our method is 90.63% and the averaged area under the curve (AUC) is 0.9877. Compared with other state-of-the-art methods, our method shows better performance in the classification of various esophageal diseases; in particular, for esophageal diseases with similar mucosal features it also achieves higher true positive (TP) rates. In conclusion, the proposed classification method has confirmed its potential for diagnosing a wide variety of esophageal diseases.

© 2021 Optical Society of America under the terms of the OSA Open Access Publishing Agreement

1. Introduction

A gastroscope is an advanced diagnostic imaging tool that provides high-resolution visualization of living esophageal tissues [1,2]. It uses a flexible optical fiber to guide light into the esophageal cavity and an image sensor, such as a charge-coupled device (CCD) or complementary metal-oxide semiconductor (CMOS) sensor, to receive the reflections from the mucous membrane in the cavity [3,4]. The light signals are then converted into electronic signals and, after a series of electrical signal processing steps, gastroscopic images of the esophageal mucosa are generated [5,6]. Currently, the examination and diagnosis of upper gastrointestinal (esophageal and gastric) diseases mainly rely on the gastroscopic images generated in this way [7].

In the clinic, gastroscopic diagnosis of the esophagus encounters a variety of esophageal diseases, which can be roughly grouped into four main categories: normal esophagus (NE), precancerous esophageal diseases (PEDs), early esophageal cancer (EEC) and advanced esophageal cancer (AEC). The division between EEC and AEC is based on the depth to which cancer cells invade beneath the mucous membrane [8]. The 5-year survival rate of patients with AEC is only 15–25% [9,10], whereas the 5-year survival rate of patients with EEC can be as high as 92–93% [9,11]. Therefore, the accurate classification of esophageal diseases is crucial for providing precision therapy planning, and especially for improving the 5-year survival rate of esophageal cancer patients [12]. However, the diagnostic process with a gastroscope is susceptible to a variety of negative factors, such as clinicians' fatigue and lack of experience and the diversity of lesion appearance, which make it prone to misdiagnosis and missed diagnosis [2,7,8,13,14]. Moreover, the mucosal characteristics of the lesion areas of some esophageal diseases are very complex. For example, the apparent mucosal characteristics of some EEC lesions are similar to those of PED and even AEC lesions, which are difficult to distinguish even for experienced clinicians. Computer-aided diagnosis (CAD) has been confirmed to prevent many of these problems and to improve the accuracy and efficiency of the diagnosis of esophageal diseases [2,15,16].

CAD methods for gastroscopic images mainly address three kinds of image processing tasks: classification, segmentation and object detection. The classification task gives an image-wise prediction of the lesion type and needs only image-wise labelled images for training [17–19]. The segmentation task produces a pixel-wise classification of the lesion type [20,21]. The object detection task predicts both the lesion type and its location [15,22]. Automatic classification can assist doctors to quickly screen out images with lesions from a huge number of gastroscopic images, discriminate different types of diseases and save considerable labor time, which is important in the clinic. Traditional classification methods usually use artificially designed algorithms to extract image features and then apply a classifier, such as a support vector machine (SVM), to the extracted features [23,24]. Deep learning-based classification methods have shown better classification ability than traditional methods [2,18,25]. For example, Kumagai et al. [17] developed a classification method based on GoogLeNet that can classify malignant and non-cancerous lesions of esophageal squamous cell carcinoma in endocytoscopic system images. Liu et al. [18] fine-tuned a pre-trained CNN to classify three gastric diseases (chronic gastritis, low-grade neoplasia and early gastric cancer) in magnification narrow-band imaging images. Zhu et al. [19] constructed a CNN-CAD system, based on the same CNN framework as Liu et al., to determine the invasion depth of gastric cancer from gastroscopic images. In recent years, deep learning-based methods have also performed well on the classification of multiple categories of diseases [26,27]. However, a deep learning-based classification method covering a wide variety of diseases in gastroscopic images has not been previously reported.

Long-range dependencies may be easily ignored by traditional CNNs if there is no smart mechanism to guide feature selection [28–30]. More recently, the attention mechanism has been demonstrated to offer great potential for improving the performance of deep CNNs. Attention modules have subsequently developed in two directions: enhancement of feature aggregation and combination of channel and spatial attention. At present, the attention mechanism has been applied to many natural image analysis tasks, such as image captioning [31], image recognition [32] and image classification [33]. There are also many successful applications of the attention mechanism in medical image analysis, such as thorax disease classification [34,35], pulmonary lesion detection [36] and skin lesion segmentation [37,38]. Therefore, the attention mechanism has the potential to improve the classification of gastroscopic images.

To better classify the four main categories (NE, PEDs, EEC and AEC), and inspired by the works in [39] and [40], this study introduces the efficient channel attention mechanism into the dense block of a densely connected CNN to construct a novel network, which we name the efficient channel attention deep dense convolutional neural network (ECA-DDCNN). The purpose of building ECA-DDCNN is to enhance the interdependencies between channels and strengthen feature propagation in the network, in order to highlight and extract the features related to the subtle differences among the various types of lesions. On this basis, an ECA-DDCNN-based method was developed to classify a wide variety of esophageal diseases. An imbalanced sample size among different categories is a common problem in collected medical images, and it also occurs in our esophageal image dataset. To maintain efficient training, we also propose a novel random weighted sampling (RWS) method to balance the sample numbers among different categories.

The main contributions of this paper are as follows:

  • a) The ECA-DDCNN network was proposed, which enhances the interdependencies between channels and strengthens feature propagation in the network.
  • b) The RWS method was presented for balancing the sample numbers of gastroscopic images in different categories.
  • c) A classification method based on ECA-DDCNN was developed for classifying the four main categories of esophagus (seven sub-categories) in gastroscopic images, and it exhibits state-of-the-art performance. The categories of esophageal diseases classified by our method range from NE to AEC, which is the most extensive range among existing related methods.

The rest of this paper is organized as follows. The experimental datasets, data pre-processing method and proposed ECA-DDCNN are introduced in detail in Section 2. The experimental results are reported in Section 3. Further discussion and a summary of the experimental results are presented in Section 4. The conclusions of this work are given in Section 5.

2. Materials and methods

2.1 Materials

The esophageal gastroscopic images used in our experiments were collected by gastroenterologists from the digestive gastroscope center of West China Hospital in Chengdu, China. The images were captured using OLYMPUS GIF-Q260 and GIF-Q290 gastroscopes and saved as JPEG (Joint Photographic Experts Group) files with four resolutions: 1920×1080, 1916×1076, 768×576 and 764×572 pixels. In total, 20,965 conventional non-magnified white-light imaging gastroscopic images from 4077 patients were collected, including 1471 normal esophagus (NE) images from 296 patients, 2183 surgical scar (SC) images from 485 patients, 3377 esophagitis (O) images from 598 patients, 5921 esophageal varices (EV) images from 1300 patients, 1945 esophageal submucous eminence (ESE) images from 368 patients, 2806 early esophageal cancer (EEC) images from 484 patients, and 3262 advanced esophageal cancer (AEC) images from 546 patients. Among these esophageal diseases, SC, O, EV and ESE belong to the PEDs, so the collected images cover the four main categories of esophagus: NE, PEDs, EEC and AEC. Due to the complexity of the mucosal surface and individual differences among patients, the collected images show high inter-class similarity and high intra-class variation. Permission from the medical ethical review committees of West China Hospital and the University of Electronic Science and Technology of China, and informed patient consent, were obtained.

Three gastroenterologists with 5, 10 and 15 years of experience, respectively, reached a consensus on the labels of the images used in this study. All cancerous lesions were confirmed through biopsies. Because the number of images for some categories is very small (for example, there are only 1471 NE images and 1945 ESE images), we divided the dataset into training and test groups only. Images were randomly selected to generate the two groups: approximately 80% of the images of each esophageal disease were assigned to the training group and the remainder to the test group. It should be noted that the disease distributions (i.e., the ratio of each sub-category in each subset) in the training and test sets are the same, and that the gastroscopic images recorded from one patient appear in only one group. The statistics of the training and test groups are listed in Table 1; they consist of 16771 and 4194 gastroscopic images in total, respectively, from 3253 and 824 patients. The overall median age in the test dataset is 56 years, with a wide range of 20–88, and the male-to-female ratio is 569/295.
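The split described above can be sketched in code. The following is a minimal illustration rather than the authors' actual script: it assumes a hypothetical index table with image_path, label and patient_id columns, and uses scikit-learn's GroupShuffleSplit so that all images of one patient stay in a single subset. The class ratios of the resulting split should still be checked against the requirement stated above.

```python
# Hypothetical patient-level 80/20 split sketch (the paper does not provide code).
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

df = pd.read_csv("esophagus_images.csv")   # assumed columns: image_path, label, patient_id
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(df, df["label"], groups=df["patient_id"]))
train_df, test_df = df.iloc[train_idx], df.iloc[test_idx]

# Verify that the per-category ratios are (approximately) preserved, as required above.
print(train_df["label"].value_counts(normalize=True))
print(test_df["label"].value_counts(normalize=True))
```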

Table 1. Statistics of the esophagus dataset used in this work.

2.2 Methods

Figure 1 illustrates the proposed method of classifying esophageal diseases, including data preprocessing, RWS and ECA-DDCNN. The details of each part are stated below.

Fig. 1. The framework of proposed classification method of esophageal diseases.

2.2.1 Data preprocessing and random weighted sampling (RWS)

The collected raw gastroscopic images usually contain a large black background area with some text that generally comprises patient information, as shown in C1 of Fig. 1. This content makes no contribution to the classification task; thus, we cropped out the black background areas using a rectangular box of suitable size. As the cropped images may have different sizes, they were uniformly resized to 224×224, as shown in C2 of Fig. 1.
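As a rough illustration of this preprocessing step (the exact crop box used by the authors is not specified), a simple intensity-based crop followed by resizing could look like the sketch below. Note that bright overlay text would also fall inside such a box, so in practice the crop region may need to be fixed per acquisition format.

```python
# Illustrative sketch only: bound the bright endoscopic field and resize to 224x224.
import cv2
import numpy as np

def crop_and_resize(path, out_size=224, thresh=15):
    img = cv2.imread(path)                          # BGR gastroscopic frame
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    ys, xs = np.where(gray > thresh)                # non-black pixels
    roi = img[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
    return cv2.resize(roi, (out_size, out_size), interpolation=cv2.INTER_AREA)
```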

Our esophageal dataset is imbalanced in the number of images among different categories because the case numbers of different disease types differ in clinical practice. For instance, there are only 1471 NE images from 296 patients, while there are 5921 EV images from 1300 patients. Hence, we designed a random weighted sampling (RWS) method to eliminate this imbalance and improve the overall classification accuracy. For each category, RWS sets a random sampling weight Ws that is inversely proportional to the sample size, as shown in Eq. (1).

$$W_s(i) = \frac{1}{N_s(i)}, \quad \left(N_s(i) > 0,\ i = 1, 2, \ldots, N\right)$$

where Ws(i) and Ns(i) denote the random sampling weight and the sample size of the ith category, respectively, and N refers to the number of categories. After RWS, the numbers of samples drawn from the different categories remain balanced. Figure 1 illustrates the RWS process, in which the thickness of each colored rectangle represents the sample size of the corresponding category.
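One way to realize Eq. (1) during training, sketched here under the assumption that train_labels is a list of integer class indices and train_dataset the corresponding PyTorch dataset (both hypothetical names), is to feed the per-image weights into PyTorch's WeightedRandomSampler; the authors' exact implementation may differ.

```python
# Sketch of RWS via weighted sampling with replacement.
from collections import Counter
import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

counts = Counter(train_labels)                      # N_s(i): images per category
class_w = {c: 1.0 / n for c, n in counts.items()}   # Eq. (1): W_s(i) = 1 / N_s(i)
sample_w = torch.tensor([class_w[y] for y in train_labels], dtype=torch.double)

sampler = WeightedRandomSampler(sample_w, num_samples=len(train_labels), replacement=True)
loader = DataLoader(train_dataset, batch_size=32, sampler=sampler)  # roughly balanced batches
```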

2.2.2 Efficient channel attention deep dense convolutional neural network (ECA-DDCNN)

As previously mentioned, the existing deep learning networks used for the classification of upper gastrointestinal diseases lack a smart mechanism to guide feature selection, which makes it difficult for them to identify the subtle differences among a variety of esophageal diseases (especially PEDs, AEC and EEC). The ECA module [39] can adaptively recalibrate channel-wise feature responses by explicitly modelling interdependencies between channels, while DenseNet [40] strengthens feature propagation and encourages feature reuse. Inspired by [39] and [40], we propose a novel efficient channel attention based dense layer (ECA-DL) and densely connect several ECA-DLs into an ECA-Dense block (ECA-DB). With the equipped ECA-DBs, the accuracy and efficiency of ECA-DDCNN in classifying the four main categories of esophagus (NE, PEDs, AEC and EEC) are enhanced. As shown in Fig. 1, the backbone of ECA-DDCNN consists of four ECA-DBs, three transition layers and one fully connected layer. The transition layer between each pair of ECA-DBs has the same structure as that of DenseNet [40]. The fully connected layer is placed at the end of ECA-DDCNN, and the number of output categories is set to N (N = 7 in this study).

Figure 2 shows the diagrams of the proposed ECA-DL and ECA-DB. In the ECA-DL (Fig. 2(a)), W is the channel-wise attention coefficient computed by the ECA module and U refers to the feature maps extracted by the dense layer. A channel-wise multiplication is performed between W and U, so that the feature maps U are weighted by W, and the ECA-DL outputs these attention-weighted feature maps. The dimension of the feature maps output by an ECA-DL is called the growth rate k; in this study, the growth rate of ECA-DDCNN was set to k = 48. Taking advantage of the ECA module [39], the ECA-DL avoids dimensionality reduction and captures cross-channel interaction in an efficient way. Multiple ECA-DLs are connected to each other in a dense connectivity pattern [40] to form an ECA-DB (Fig. 2(b)). In ECA-DDCNN (Fig. 1), ECA-DB-1 to ECA-DB-4 contain 6, 12, 36 and 24 ECA-DLs, respectively.
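To make the layer structure concrete, the sketch below combines the publicly described ECA module [39] (global average pooling, a 1-D convolution across channels, then a sigmoid) with a standard DenseNet bottleneck layer [40]. It illustrates the idea rather than the authors' exact configuration; the bottleneck width and the 1-D kernel size are assumptions.

```python
# Sketch of an ECA-weighted dense layer with dense connectivity (illustrative only).
import torch
import torch.nn as nn

class ECA(nn.Module):
    """Channel attention without dimensionality reduction: 1-D conv over channel descriptors."""
    def __init__(self, k_size=3):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.conv = nn.Conv1d(1, 1, kernel_size=k_size, padding=k_size // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):                               # x: (B, C, H, W)
        w = self.pool(x)                                # (B, C, 1, 1)
        w = self.conv(w.squeeze(-1).transpose(-1, -2))  # 1-D conv across the C channels
        w = self.sigmoid(w.transpose(-1, -2).unsqueeze(-1))
        return x * w                                    # channel-wise re-weighting (W * U)

class ECADenseLayer(nn.Module):
    """Dense layer (BN-ReLU-1x1 conv, BN-ReLU-3x3 conv) whose k new maps are ECA-weighted."""
    def __init__(self, in_ch, growth_rate=48, bn_size=4):
        super().__init__()
        inter = bn_size * growth_rate
        self.body = nn.Sequential(
            nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
            nn.Conv2d(in_ch, inter, 1, bias=False),
            nn.BatchNorm2d(inter), nn.ReLU(inplace=True),
            nn.Conv2d(inter, growth_rate, 3, padding=1, bias=False),
        )
        self.eca = ECA()

    def forward(self, x):
        u = self.body(x)                                # k new feature maps (U)
        return torch.cat([x, self.eca(u)], dim=1)       # dense connectivity to later layers
```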

Fig. 2. The proposed ECA-Dense layer and ECA-Dense block.

2.2.3 Training details

To speed up the training process, we partially load the DenseNet model pre-trained on ImageNet. Stochastic gradient descent is selected as the optimizer of our network, with an initial learning rate of 5e-3, a momentum of 0.9 and a weight decay of 5e-3. The learning rate decays by half if the averaged training loss stops decreasing for three epochs, which ensures that the network is trained at an appropriate learning rate and speeds up training. The input image size is 224×224, the number of training epochs is 100 and the batch size is 32. The cross-entropy loss function is used to measure the distance between the predicted and target labels during training.
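These settings map directly onto standard PyTorch components. A minimal training-loop sketch is given below, with model and train_loader assumed to be defined elsewhere and device handling omitted; ReduceLROnPlateau with factor 0.5 and patience 3 approximates the learning-rate halving rule described above.

```python
# Training configuration sketch for the stated hyper-parameters.
import torch
import torch.nn as nn

optimizer = torch.optim.SGD(model.parameters(), lr=5e-3, momentum=0.9, weight_decay=5e-3)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min",
                                                       factor=0.5, patience=3)
criterion = nn.CrossEntropyLoss()

for epoch in range(100):                              # 100 epochs
    model.train()
    running = 0.0
    for images, labels in train_loader:               # batch size 32 (set in the loader)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
        running += loss.item()
    scheduler.step(running / len(train_loader))       # halve lr when the mean loss plateaus
```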

3. Experiments and results

To illustrate the effectiveness and performance of our proposed method in the classification of the four main categories (seven sub-categories), we performed extensive experiments, including ablation studies, comparison experiments and a generalization ability validation.

In this work, the implementation language is Python 3.6.4 and the deep learning library is PyTorch 1.0.0 (https://pytorch.org/). All experiments were performed on an Ubuntu 16.04.6 LTS (GNU/Linux 4.8.0-36-generic x86_64) server equipped with four Nvidia GeForce RTX 2080 Ti graphics processing units (11 GB each).

3.1 Metrics

The accuracy (Acc), precision (Pr), recall (Rec) and F1-score (F1) were used to evaluate the classification performance of each method, as shown in Eqs. (2)–(6).

$$\mathrm{Acc} = \frac{\sum\nolimits_i n_{ii}}{\sum\nolimits_{i,j} n_{ij}}$$
$$\mathrm{Pr}_j = \frac{n_{jj}}{\sum\nolimits_i n_{ij}}$$
$$\mathrm{Rec}_j = \frac{n_{jj}}{\sum\nolimits_i n_{ji}}$$
$$\mathrm{F1}_j = \frac{2 \times \mathrm{Pr}_j \times \mathrm{Rec}_j}{\mathrm{Pr}_j + \mathrm{Rec}_j}$$
$$\mathrm{Mean} = \frac{1}{N}\sum\limits_{n = 1}^{N} p_n$$

where i and j refer to the category indices and nij denotes the number of images of the ith category predicted as the jth category; nii and njj are therefore the numbers of correctly predicted images of the ith and jth categories, respectively. Acc and F1 evaluate the comprehensive classification ability, Pr represents the precision of disease recognition, and Rec represents the sensitivity to disease.

Furthermore, the mean values of Pr, Rec and F1 were calculated using Eq. (6), where pn is the value of the currently evaluated metric for the nth category and N refers to the number of classified categories. Bootstrapping with 1000 simulated trials was used to estimate the 95% confidence interval (CI) of the evaluation metrics.
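Equations (2)–(6) can be computed directly from a confusion matrix. The sketch below uses scikit-learn and NumPy, omits the bootstrap confidence intervals, and uses illustrative variable names.

```python
# Per-class precision/recall/F1, their macro means, and overall accuracy (Eqs. (2)-(6)).
import numpy as np
from sklearn.metrics import confusion_matrix

def evaluate(y_true, y_pred, n_classes):
    n = confusion_matrix(y_true, y_pred, labels=list(range(n_classes))).astype(float)
    acc = np.trace(n) / n.sum()                             # Eq. (2)
    prec = np.diag(n) / np.maximum(n.sum(axis=0), 1e-12)    # Eq. (3): column sums (predicted as j)
    rec = np.diag(n) / np.maximum(n.sum(axis=1), 1e-12)     # Eq. (4): row sums (true class j)
    f1 = 2 * prec * rec / np.maximum(prec + rec, 1e-12)     # Eq. (5)
    return acc, prec.mean(), rec.mean(), f1.mean()          # Eq. (6): macro means
```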

3.2 Ablation studies

To demonstrate the effectiveness of the proposed ECA-DB and ECA-DL, we performed ablation studies with DenseNet as the benchmark. We obtained AT-ECA-DDCNN (Fig. S1(a) in Supplement 1) by introducing the ECA module after the transition layer of DenseNet, BT-ECA-DDCNN (Fig. S1(b) in Supplement 1) by introducing the ECA module before the transition layer of DenseNet, and ECA-DDCNN (ours) by introducing the ECA module into the dense block of DenseNet. We then evaluated the classification performance of DenseNet, AT-ECA-DDCNN, BT-ECA-DDCNN and our network on the esophageal disease dataset under identical experimental conditions, comparing the Pr, Rec, F1 and Acc values of the four methods over the esophageal test dataset. As shown in Table 2, our method achieved the optimal values of mean Pr, Rec, F1 and Acc (the optimal values are in bold). This indicates that the overall classification performance of our method is better than that of the other three methods, which is attributed to the proposed ECA-DL and ECA-DB.

Table 2. Statistical comparison of ablation studies

We further compared the classification performance of ECA-DDCNN under two growth rates, k = 32 and k = 48; the statistical results are shown in Table 3. All values of mean Pr, Rec, F1 and Acc under k = 48 are higher than those under k = 32. The growth rate k refers to the amount of new information that each layer contributes to the global state, that is, the dimension of the feature maps of each ECA-DL. A larger growth rate means that more feature maps are reused, which is the main reason why the classification performance of ECA-DDCNN under k = 48 is better than that under k = 32. This shows that increasing the growth rate of ECA-DDCNN is effective rather than redundant.

Table 3. Statistical comparison of our network with different growth rates

3.3 Comparisons of different data sampling methods

To verify the effectiveness of the proposed RWS method, we applied physical data augmentation (Aug) and RWS to balance the distribution among different categories, with the shuffle method as the benchmark. We computed the Pr, Rec, F1 and Acc values of the three methods over the esophageal test dataset (Table 4). Table 4 clearly shows that the overall classification result using RWS (ours) is higher than those of the other two data sampling methods, and the network training time is shorter. This demonstrates that our RWS method effectively improves both the overall classification performance and the training efficiency of the network.

Table 4. Statistical comparison of different data sampling methods

3.4 Comparisons with other related state-of-the-art methods

To validate the performance of our method in classifying the four main categories (seven sub-categories), we compared it with five other related methods on our dataset: the esophageal disease classification method proposed by Kumagai et al. [17], the gastric disease classification method proposed by Liu et al. [19], and three of the most advanced image classification methods, namely SE-ResNet152 proposed by Hu et al. [30], EfficientNet-b5 proposed by Tan et al. [41] and ECA-ResNet152 proposed by Wang et al. [39]. To ensure the best classification performance of each method and to guarantee a fair comparison, the size of the input images was set to be consistent with each original network: the input size of the method of Kumagai et al. [17] was set to 229×229, and those of the other methods were set to 224×224. All other experimental conditions were the same for all methods.

For each method, we calculated the mean values of Pr, Rec and F1, the Acc and the averaged AUC, as well as the network computational complexity. As shown in Table 5, our method achieves the highest mean Pr, Rec and F1, Acc and averaged AUC, exceeding the suboptimal values by 1.07%, 0.76%, 0.94%, 0.73% and 0.01%, respectively. The parameter count of our method is only 26.49 M, which is close to the lowest value of 23.13 M (Kumagai et al. [17]) and less than half of the maximum value of 67.40 M (Hu et al. [30]); thus, our method is relatively lightweight and consumes fewer computing resources. We also calculated the Pr, Rec and F1 values of each method on each sub-category of the test dataset (Table S1 in Supplement 1). As can be seen from Table S1, the F1 values of our method for each esophageal disease are generally higher than those of the other methods. Since F1 is a composite metric combining Pr and Rec, the overall classification ability of our method is better than that of the five comparison methods. Our method also obtains the optimal Rec for four sub-categories, the optimal AUC for three sub-categories and the optimal Pr for two sub-categories. Table 5 and Table S1 fully demonstrate that our method identifies the seven sub-categories of esophageal diseases better than the other comparison methods.

Table 5. Global classification performance and computational complexity comparisons with other related state-of-the-art methods on esophageal dataset

In addition to the statistical comparison of the evaluation metrics, we calculated the confusion matrix of each method on each sub-category of the dataset. The true positive (TP) and all possible false positive counts of all methods over the test dataset for each sub-category are shown in Fig. 3. The confusion matrices in Fig. 3 show that the worst TP rates of Kumagai et al. [17] are obtained for SC and O (both 0.86), with 5% of SC misclassified as EEC and 4% of O misclassified as EV. The worst TP rates of Liu et al. [19] and Hu et al. [30] are both obtained for EEC (0.84 and 0.83, respectively); for these two methods, 9% and 8% of EEC images are misclassified as O, respectively. The worst TP rates of Tan et al. [41] and Wang et al. [39] are both obtained for ESE (0.84), with 4% of ESE misclassified as EEC by the former and 4% misclassified as SC by the latter. For our method, the TP rate of each category is relatively higher and the minimum TP rate is 0.87. In the sub-categories where the five comparison methods obtain their worst TP rates, our method still achieves high TP rates. For example, both Liu et al. [19] and Hu et al. [30] show poor classification ability for EEC, while our method achieves a TP rate of 0.88; Tan et al. [41] and Wang et al. [39] both reach their minimum TP rate for ESE, while our method achieves a TP rate of 0.87. This demonstrates that our method overcomes the classification weaknesses of the other methods on the esophageal dataset and can distinguish various types of esophageal diseases with similar morphology or appearance.
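The per-class TP rates discussed above correspond to the diagonal of a row-normalised confusion matrix; a short sketch, assuming y_true and y_pred hold class indices, is:

```python
# Row-normalised confusion matrix: diagonal entries are the per-class TP rates.
import numpy as np
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_true, y_pred)
cm_norm = cm / cm.sum(axis=1, keepdims=True)   # each row sums to 1
tp_rates = np.diag(cm_norm)
```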

Fig. 3. Confusion matrices of test results for the comparison methods on the esophageal disease classification task. (a) Kumagai et al. [17], (b) Liu et al. [19], (c) Hu et al. [30], (d) Tan et al. [41], (e) Wang et al. [39], (f) our method. All the possible records for each disease are presented using a color gradient and numbers.

Finally, to further evaluate the comprehensive classification ability of each method, we calculated the average receiver operating characteristic (ROC) curves and AUC values on the test dataset, as shown in Fig. 4. Our method outperforms the competitors, achieving the best ROC curve and the highest AUC value of 0.9877.
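A macro-averaged AUC of this kind can be computed with scikit-learn in a one-vs-rest fashion; the sketch below assumes probs holds the softmax scores with shape (num_images, num_classes) and y_true the integer labels (names are illustrative).

```python
# Macro-averaged one-vs-rest AUC from softmax scores.
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import label_binarize

y_bin = label_binarize(y_true, classes=list(range(probs.shape[1])))  # one-vs-rest labels
macro_auc = roc_auc_score(y_bin, probs, average="macro")
```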

Fig. 4. Average ROC curves of the comparison methods on the esophageal dataset. The AUC value of each method is reported in the legend.

3.5 Generalization ability

To verify the generalization ability of our method, we evaluated the classification performance of our method and the five comparison methods on the skin disease dataset ISIC2019 [42]. In skin imaging, photoacoustic (PA) confocal dermoscopes equipped with a waterless coupling and impedance-matching opto-sono probe have been developed to achieve quantitative, high-resolution and high-contrast imaging of human skin [43]. The training dataset of ISIC2019 contains 25331 dermoscopic images covering eight types of skin diseases: actinic keratosis (AK), basal cell carcinoma (BCC), benign keratosis (BKL), dermatofibroma (DF), melanoma (MEL), melanocytic nevus (NV), squamous cell carcinoma (SCC) and vascular lesion (VASC). There is a severe imbalance in the number of images among the various skin diseases: NV contains 12874 images while DF includes only 238. Because the NV category contains far more images than MEL, the category with the second largest number of images (4521), we randomly selected 4000 images from the NV category for our generalization experiment dataset and left the image numbers of the other seven categories unchanged. The resulting dataset contains a total of 16,499 dermoscopic images. We randomly selected 80% of each category as the training set and used the remaining 20% as the test set. Table 6 shows the mean Pr, Rec and F1 and the Acc of each method on the skin disease test dataset. Our method outperforms the other comparison methods by achieving the optimal mean Rec, F1 and Acc, and its mean Pr of 84.46% is close to the highest mean Pr value of 85.04%. We also computed the Pr, Rec and F1 of each method on each category of the test dataset (Table S2 in Supplement 1); our method achieves the best values for most of the eight categories. The confusion matrix of each method is shown in Fig. 5, which illustrates that our method obtains higher TP rates in most skin disease categories than the five comparison methods. The smallest TP rate of our method is 0.78, which is at least 6% higher than the minimum TP rates of the other comparison methods. In summary, the generalization experiment confirms the generalization ability of the proposed ECA-DDCNN on the skin disease classification task.
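The construction of this generalization set can be sketched as follows, assuming a hypothetical table df with image and label columns for the ISIC2019 training data; the authors' random seeds and exact tooling are not specified.

```python
# Subsample NV to 4000 images, keep the other seven classes, then split 80/20 per class.
import pandas as pd
from sklearn.model_selection import train_test_split

nv = df[df["label"] == "NV"].sample(n=4000, random_state=0)
rest = df[df["label"] != "NV"]
subset = pd.concat([nv, rest])                    # 16,499 dermoscopic images in total

train_df, test_df = train_test_split(subset, test_size=0.2, random_state=0,
                                     stratify=subset["label"])
```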

Fig. 5. Confusion matrices of test results for the comparison methods on the skin lesion classification task. (a) Kumagai et al. [17], (b) Liu et al. [19], (c) Hu et al. [30], (d) Tan et al. [41], (e) Wang et al. [39], (f) our method. All the possible records for each disease are presented using a color gradient and numbers.

Table 6. Global classification performance with different state-of-the-art methods on the skin datasets

4. Discussions

Limited by imaging conditions such as lighting, most images taken with a gastroscope do not have the contrast and definition of natural images. In addition, the lesions of esophageal diseases generally have irregular shapes and blurry boundaries. Hence, classifying esophageal diseases is a very challenging task, much harder than classifying natural images. Nevertheless, the proposed classification method based on the ECA-DDCNN network still achieves satisfactory results on the classification of the four main categories (seven sub-categories), namely NE, PEDs, EEC and AEC.

In the ablation studies, we evaluated the contributions of the ECA module, its embedding location and the higher growth rate. The results in Table 2 show that the location of the ECA module is critical. Compared with the benchmark method (DenseNet), the Acc of BT-ECA-DDCNN improves slightly, while the Acc of AT-ECA-DDCNN drops by 0.2%. This phenomenon is closely related to the transition layer. In AT-ECA-DDCNN (Fig. S1(a) in Supplement 1), the ECA module only recalibrates the responses of feature maps that have already undergone dimension reduction, while in BT-ECA-DDCNN (Fig. S1(b) in Supplement 1), the ECA module adaptively recalibrates the channel-wise responses of feature maps before dimension reduction. In our method (Fig. 1 and Fig. 2), the ECA module is embedded in the dense block and placed behind each dense layer, so the channel-wise feature responses of the output of every dense layer are adaptively recalibrated; this is why the classification performance of our method is the best. Unlike the traditional DenseNet, the proposed ECA-DDCNN has a higher growth rate, and the experimental results in Table 3 show that the higher growth rate benefits classification performance.

The experimental results in Table 4 also confirm that the proposed RWS method both improves the classification accuracy and shortens the training time. Although the Aug method can also increase the sample size of categories with few samples, it does not improve the classification accuracy; the accuracy is even slightly lower than that of the shuffle method. This is because pure data augmentation only increases the sample size without effectively increasing the diversity of the dataset. Moreover, the training time of the Aug method is prolonged by the large number of redundant images.

The comparative experiments show that the overall classification performance of our method exceeds that of the state-of-the-art methods, while the proposed ECA-DDCNN remains relatively lightweight and efficient, costing only 26.49 M parameters and 7.82 GFLOPs (Table 5). This is attributed to the proposed ECA-DBs, which encourage feature reuse, strengthen feature propagation and reduce the number of parameters. Additionally, the proposed ECA-DB guides ECA-DDCNN to highlight and extract the subtle differences among the various types of diseases. Therefore, our method achieves higher TP rates in classifying esophageal and skin diseases with high inter-class similarity (Fig. 3 and Fig. 5).

Although we have achieved better results than the other related methods in classifying esophageal diseases, this work still has some limitations. First, our method only classifies esophageal cancer into EEC and AEC, not into further subtypes (e.g., esophageal squamous cell carcinoma and adenocarcinoma). Second, our method is supervised and needs a large number of image-wise labeled images for training, and such large-scale annotation places a burden on doctors. Finally, we did not compare our method with traditional classification methods based on handcrafted features, as many studies have shown that deep learning methods are superior to traditional methods in classification [18,44–47]; furthermore, most traditional methods are limited to binary classification and are difficult to extend to multi-class classification.

5. Conclusions

In this study, we presented ECA-DDCNN, a novel network combining the attention mechanism and a densely connected deep CNN. The proposed ECA-DDCNN is guided by the attention mechanism to highlight and extract the subtle differences among various types of diseases that are difficult to distinguish even for experienced clinicians. On this basis, we developed an ECA-DDCNN-based classification method to classify the four main categories of esophagus (one normal category and six esophageal diseases). The experimental results show that the proposed method is competent at classifying these esophageal categories and achieves higher TP rates on the sub-categories with similar mucosal features than the other state-of-the-art methods. Additionally, the categories covered by the proposed method span the four main categories of esophagus (NE, PEDs, EEC and AEC, comprising seven sub-categories), which is the largest range among existing classification methods for esophageal diseases. Therefore, the proposed classification method is suitable for clinical needs and holds great promise for clinical application.

In future work, we intend to perform clinical tests of the proposed ECA-DDCNN-based classification method. We will also address the limitation of gastroscopic image labeling by designing semi-supervised or unsupervised deep learning methods to classify, detect and segment more types of gastrointestinal diseases.

Funding

Key Research and Development Program of Sichuan Province (2020YFS0243); National Natural Science Foundation of China (61720106004, 61872405).

Disclosures

The authors declare that there are no conflicts of interest related to this article.

Supplemental document

See Supplement 1 for supporting content.

References

1. T. Keen and C. Brooks, “Principles of gastrointestinal endoscopy,” Surgery 38(3), 155–160 (2020). [CrossRef]  

2. W. Du, N. Rao, D. Liu, H. Jiang, C. Luo, Z. Li, T. Gan, and B. Zeng, “Review on the applications of deep learning in the analysis of gastrointestinal endoscopy images,” IEEE Access 7, 142053–142069 (2019). [CrossRef]  

3. J. Kim, H. Al Faruque, S. Kim, E. Kim, and J. Y. Hwang, “Multimodal endoscopic system based on multispectral and photometric stereo imaging and analysis,” Biomed. Opt. Express 10(5), 2289–2302 (2019). [CrossRef]  

4. M. Vasilakakis, A. Koulaouzidis, D. E. Yung, J. N. Plevris, E. Toth, and D. K. Iakovidis, “Follow-up on: optimizing lesion detection in small bowel capsule endoscopy and beyond: from present problems to future solutions,” Expert Rev. Gastroenterol. Hepatol. 13(2), 129–141 (2019). [CrossRef]  

5. M. V. Sivak, “Gastrointestinal endoscopy: Past and future,” Gut 55(8), 1061–1064 (2005). [CrossRef]  

6. A. S. Haase and A. Maier, Endoscopy (Springer, 2018), Chap. 4.

7. J. Mannath and K. Ragunath, “Role of endoscopy in early oesophageal cancer,” Nat. Rev. Gastroenterol. Hepatol. 13(12), 720–730 (2016). [CrossRef]  

8. R. M. Gore and M. S. Levine, Diseases of the Upper GI Tract (Springer, 2018), Chap. 10.

9. M. José, D. Arnal, Á. F. Arenas, and Á. L. Arbeloa, “Esophageal cancer: Risk factors, screening and endoscopic treatment in Western and Eastern countries,” World J. Gastroenterol. 21(26), 7933–7943 (2015). [CrossRef]  

10. F. L. Huang and S. J. Yu, “Esophageal cancer: Risk factors, genetic association, and treatment,” Asian J. Surg. 41(3), 210–215 (2018). [CrossRef]  

11. A. Mocanu, R. Bârla, P. Hoara, and S. Constantinoiu, “Current endoscopic methods of radical therapy in early esophageal cancer,” J. Med. Life 8, 150–156 (2015).

12. Y. Sun, T. Zhang, W. Wu, D. Zhao, N. Zhang, Y. Cui, Y. Liu, J. Gu, P. Lu, F. Xue, J. Yu, and J. Wang, “Risk factors associated with precancerous lesions of esophageal squamous cell carcinoma: A screening study in a high risk Chinese population,” J. Cancer 10(14), 3284–3290 (2019). [CrossRef]  

13. S. Mönig, M. Chevallay, N. Niclauss, T. Zilli, W. Fang, A. Bansal, and J. Hoeppner, “Early esophageal cancer: the significance of surgery, endoscopy, and chemoradiation,” Ann. N. Y. Acad. Sci. 1434, 115–123 (2018). [CrossRef]  

14. K. Goda, A. Dobashi, N. Yoshimura, M. Kato, H. Aihara, K. Sumiyama, H. Toyoizumi, T. Kato, M. Ikegami, and H. Tajiri, “Narrow-band imaging magnifying endoscopy versus lugol chromoendoscopy with pink-color sign assessment in the diagnosis of superficial esophageal squamous neoplasms: a randomised noninferiority trial,” Gastroenterol. Res. Pract. 2015, 1–10 (2015). [CrossRef]  

15. Y. Horie, T. Yoshio, K. Aoyama, S. Yoshimizu, Y. Horiuchi, A. Ishiyama, T. Hirasawa, T. Tsuchida, T. Ozawa, S. Ishihara, Y. Kumagai, M. Fujishiro, I. Maetani, J. Fujisaki, and T. Tada, “Diagnostic outcomes of esophageal cancer by artificial intelligence using convolutional neural networks,” Gastrointest. Endosc. 89(1), 25–32 (2019). [CrossRef]  

16. Y. Mori, S. ei Kudo, H. E. N. Mohmed, M. Misawa, N. Ogata, H. Itoh, M. Oda, and K. Mori, “Artificial intelligence and upper gastrointestinal endoscopy: current status and future perspective,” Dig. Endosc. 31(4), 378–388 (2019). [CrossRef]  

17. Y. Kumagai, K. Takubo, K. Kawada, K. Aoyama, Y. Endo, T. Ozawa, T. Hirasawa, T. Yoshio, S. Ishihara, M. Fujishiro, J. ichi Tamaru, E. Mochiki, H. Ishida, and T. Tada, “Diagnosis using deep-learning artificial intelligence based on the endocytoscopic observation of the esophagus,” Esophagus 16(2), 180–187 (2019). [CrossRef]  

18. X. Liu, C. Wang, J. Bai, and G. Liao, “Fine-tuning pre-trained convolutional neural networks for gastric precancerous disease classification on magnification narrow-band imaging images,” Neurocomputing 392, 253–267 (2020). [CrossRef]  

19. Y. Zhu, Q. C. Wang, M. D. Xu, Z. Zhang, J. Cheng, Y. S. Zhong, Y. Q. Zhang, W. F. Chen, L. Q. Yao, P. H. Zhou, and Q. L. Li, “Application of convolutional neural network in the diagnosis of the invasion depth of gastric cancer based on conventional endoscopy,” Gastrointest. Endosc. 89(4), 806–815.e1 (2019). [CrossRef]  

20. D. Liu, H. Jiang, N. Rao, W. Du, C. Luo, Z. Li, L. Zhu, and T. Gan, “Depth information-based automatic annotation of early esophageal cancers in gastroscopic images using deep learning techniques,” IEEE Access 8, 97907–97919 (2020). [CrossRef]  

21. L. J. Guo, X. Xiao, C. C. Wu, X. Zeng, Y. Zhang, J. Du, S. Bai, J. Xie, Z. Zhang, Y. Li, X. Wang, O. Cheung, M. Sharma, J. Liu, and B. Hu, “Real-time automated diagnosis of precancerous lesions and early esophageal squamous cell carcinoma using a deep learning model (with videos),” Gastrointest. Endosc. 91(1), 41–51 (2020). [CrossRef]  

22. M. Ohmori, R. Ishihara, K. Aoyama, K. Nakagawa, H. Iwagami, N. Matsuura, S. Shichijo, K. Yamamoto, K. Nagaike, M. Nakahara, T. Inoue, K. Aoi, H. Okada, and T. Tada, “Endoscopic detection and differentiation of esophageal lesions using a deep neural network,” Gastrointest. Endosc. 91(2), 301–309.e1 (2020). [CrossRef]  

23. D. Y. Liu, T. Gan, N. N. Rao, Y. W. Xing, J. Zheng, S. Li, C. S. Luo, Z. J. Zhou, and Y. L. Wan, “Identification of lesion images from gastrointestinal endoscope based on feature extraction of combinational methods with and without learning process,” Med. Image Anal. 32, 281–294 (2016). [CrossRef]  

24. F. Riaz, F. B. Silva, M. D. Ribeiro, and M. T. Coimbra, “Invariant Gabor texture descriptors for classification of gastroenterology images,” IEEE Trans. Biomed. Eng. 59(10), 2893–2904 (2012). [CrossRef]  

25. G. Litjens, T. Kooi, B. E. Bejnordi, A. A. A. Setio, F. Ciompi, M. Ghafoorian, J. A. W. M. van der Laak, B. van Ginneken, and C. I. Sánchez, “A survey on deep learning in medical image analysis,” Med. Image Anal. 42, 60–88 (2017). [CrossRef]  

26. A. Esteva, K. Brett, A. Novoa Roberto, J. Ko, S. M. Swetter, H. M. Blau, and S. Thrun, “Dermatologist-level classification of skin cancer with deep neural networks,” Nature 542(7639), 115–118 (2017). [CrossRef]  

27. N. Gessert, M. Nielsen, M. Shaikh, R. Werner, and A. Schlaefer, “Skin lesion classification using ensembles of multi-resolution efficientnets with meta data,” arXiv preprint arXiv:1910.03910 (2019).

28. D. Bahdanau, K. H. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” in International Conference on Learning Representations, ICLR (2015), pp. 1–15.

29. X. Wang, R. Girshick, A. Gupta, and K. He, “Non-local Neural Networks,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (IEEE, 2018), pp. 7794–7803.

30. J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (IEEE, 2018), pp. 7132–7141.

31. Q. You, H. Jin, Z. Wang, C. Fang, and J. Luo, “Image captioning with semantic attention quanzeng,” in Proceedings Ofthe IEEE Conference on Computer Vision and Pattern Recognition (IEEE, 2016), pp. 4651–4659.

32. J. Fu, H. Zheng, and T. Mei, “Look closer to see better: Recurrent attention convolutional neural network for fine-grained image recognition,” in Proceedings - 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017 (IEEE, 2017), pp. 4476–4484.

33. F. Wang, M. Jiang, C. Qian, S. Yang, C. Li, H. Zhang, X. Wang, and X. Tang, “Residual attention network for image classification,” in Proceedings - 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017 (IEEE, 2017), pp. 6450–6458.

34. Q. Guan, Y. Huang, Z. Zhong, Z. Zheng, L. Zheng, and Y. Yang, “Diagnose like a Radiologist: Attention guided convolutional neural network for thorax disease classification,” arXiv preprint arXiv:1801.09927 (2018).

35. B. Chen, J. Li, G. Lu, and D. Zhang, “Lesion location attention guided network for multi-label thoracic disease classification in chest X-rays,” IEEE J. Biomed. Health Inform. 24(7), 2016–2027 (2020). [CrossRef]  

36. E. Pesce, S. Joseph Withey, P. P. Ypsilantis, R. Bakewell, V. Goh, and G. Montana, “Learning to detect chest radiographs containing pulmonary lesions using visual attention networks,” Med. Image Anal. 53, 26–38 (2019). [CrossRef]  

37. R. Gu, G. Wang, T. Song, R. Huang, M. Aertsen, J. Deprest, S. Ourselin, T. Vercauteren, and S. Zhang, “CA-Net: Comprehensive attention convolutional neural networks for explainable medical image segmentation,” IEEE Trans. Med. Imaging 40(2), 699–711 (2021). [CrossRef]  

38. H. Wu, J. Pan, Z. Li, Z. Wen, and J. Qin, “Automated skin lesion segmentation via an adaptive dual attention module,” IEEE Trans. Med. Imaging 40(1), 357–370 (2021). [CrossRef]  

39. Q. Wang, B. Wu, P. Zhu, P. Li, W. Zuo, and Q. Hu, “ECA-Net: Efficient channel attention for deep convolutional neural networks,” in Proceedings the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (IEEE, 2020) pp. 11531–11539.

40. G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in Proceedings - 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017 (IEEE, 2017), pp. 2261–2269.

41. M. Tan and Q. V. Le, “EfficientNet: Rethinking model scaling for convolutional neural networks,” in Proceedings - 36th Int. Conf. Mach. Learn. ICML 2019 (International Machine Learning Society, 2019) pp. 10691–10700.

42. M. Combalia, N. C. F. Codella, V. Rotemberg, B. Helba, V. Vilaplana, O. Reiter, C. Carrera, A. Barreiro, A. C. Halpern, S. Puig, and J. Malvehy, “BCN20000: Dermoscopic lesions in the wild,” arXiv preprint arXiv:1908.02288 (2019).

43. H. Ma, S. Yang, Z. Cheng, and D. Xing, “Photoacoustic confocal dermoscope with a waterless coupling and impedance matching opto-sono probe,” Opt. Lett. 42(12), 2342 (2017). [CrossRef]  

44. J. Bernal, N. Tajkbaksh, F. J. Sanchez, B. J. Matuszewski, H. Chen, L. Yu, Q. Angermann, O. Romain, B. Rustad, I. Balasingham, K. Pogorelov, S. Choi, Q. Debard, L. Maier-Hein, S. Speidel, D. Stoyanov, P. Brandao, H. Cordova, C. Sanchez-Montes, S. R. Gurudu, G. Fernandez-Esparrach, X. Dray, J. Liang, and A. Histace, “Comparative validation of polyp detection methods in video colonoscopy: results from the miccai 2015 endoscopic vision challenge,” IEEE Trans. Med. Imaging 36(6), 1231–1249 (2017). [CrossRef]  

45. X. Wu, H. Chen, T. Gan, J. Chen, C. W. Ngo, and Q. Peng, “Automatic hookworm detection in wireless capsule endoscopy images,” IEEE Trans. Med. Imaging 35(7), 1741–1752 (2016). [CrossRef]  

46. J. Y. He, X. Wu, Y. G. Jiang, Q. Peng, and R. Jain, “Hookworm detection in wireless capsule endoscopy images with deep learning,” IEEE Trans. Image Process. 27(5), 2379–2392 (2018). [CrossRef]  

47. A. Abdolmanafi, L. Duong, N. Dahdah, and F. Cheriet, “Deep feature learning for automatic tissue classification of coronary artery using optical coherence tomography,” Biomed. Opt. Express 8(2), 1203–1220 (2017). [CrossRef]  

