
Tissue self-attention network for the segmentation of optical coherence tomography images on the esophagus

Open Access

Abstract

Automatic segmentation of layered tissue is the key to esophageal optical coherence tomography (OCT) image processing. With the advent of deep learning techniques, frameworks based on fully convolutional networks have proved effective in classifying image pixels. However, due to speckle noise and unfavorable imaging conditions, the esophageal tissue relevant to diagnosis is not always easy to identify. An effective way to address this problem is to extract more powerful feature maps, which give similar representations to pixels in the same tissue and discriminate pixels from different tissues. In this study, we propose a novel framework, called the tissue self-attention network (TSA-Net), which introduces the self-attention mechanism to esophageal OCT image segmentation. The self-attention module in the network captures long-range context dependencies and analyzes the input image in a global view, which helps cluster pixels in the same tissue and reveal differences between layers, thus yielding more powerful feature maps for segmentation. Experiments visually illustrate the effectiveness of the self-attention map, and the advantages of the network over other deep networks are also discussed.

© 2021 Optical Society of America under the terms of the OSA Open Access Publishing Agreement

1. Introduction

Pathological analysis using optical coherence tomography (OCT) is receiving increasing attention due to its high resolution and non-invasive nature [1–3]. The OCT technique was first proposed by Huang et al. in 1991 for lesion detection in ophthalmology [1]. On the basis of Huang’s study, Tearney et al. designed an endoscopic OCT device by combining OCT with a fiber-optic flexible endoscope, which enables the equipment to enter the upper gastrointestinal tract [2]. Using the endoscopic OCT device, we can image the microstructure of esophageal tissues, which is of great significance in diagnosing diseases such as Barrett’s esophagus (BE) [4–6], eosinophilic esophagitis (EoE) [7], and dysplasia [8]. However, OCT equipment generates a large number of images that require experts to read and analyze, which is laborious, and the diagnosis relies on the expert’s experience and subjective judgment. Therefore, an automatic analysis system for esophageal OCT images is of great significance in clinical practice. Many esophageal diseases manifest as changes in tissue microstructure, such as changes in the esophageal layer thickness or disruption of the layers. Accurate quantification of the esophageal layered structures from gastrointestinal endoscopic OCT images can therefore be very valuable for objective diagnosis and assessment of disease severity. As a result, the segmentation algorithm is the core of an intelligent OCT image analysis system, determining whether the system can extract informative characteristics from the image.

Representative studies on automatic esophageal tissue segmentation can be summarized as follows. Ughi et al. proposed a lumen segmentation method by analyzing the A-scans [9], but their study did not include internal tissue segmentation. In 2017, Zhang et al. [10] proposed a multi-layer segmentation method based on graph-theory dynamic programming [11,12], which segmented five esophageal tissues from OCT images. Our group combined graph search with Canny edge detection and achieved multi-layer esophageal tissue segmentation with higher accuracy [13]. In addition, another study from our group reported a classifier-based segmentation method using wavelet features and sparse Bayesian theory to segment esophageal tissues robustly [14].

With the advent of deep learning, several new frameworks have achieved great success in image segmentation [15–18]. In the OCT image segmentation community, deep-learning-based strategies are also regarded as the state of the art [19–22]. One common approach is to use a fully convolutional network (FCN) [23,24]. This kind of method takes advantage of convolutional networks and uses an encoder-decoder architecture to assign each pixel a label. One of the most widely employed works was proposed by Ronneberger et al., who designed a U-shaped FCN called U-Net to deal with biomedical images with small training sets [15]. Based on the FCN, Roy et al. proposed ReLayNet for retinal layer and fluid segmentation in macular OCT images [25]. Devalla et al. designed DRUNET for optic nerve head tissue segmentation in OCT images [26]. Venhuizen et al. implemented retinal thickness measurement and intraretinal cystoid fluid quantification using the FCN framework [27]. In esophageal OCT image processing, Li et al. proposed a U-Net based framework for end-to-end esophageal layer segmentation [28]. Our group designed a fully convolutional network that corrects topological errors of the label mask based on adversarial learning [29].

Although several studies have been reported, diagnosis-relevant esophageal tissues in OCT images are not always easy to identify due to speckle noise, irrelevant structures and unfavorable imaging conditions. The speckle noise can be suppressed in several ways. For instance, Amini and Rabbani developed a spatially constrained Gaussian mixture model for retinal OCT image denoising, and experiments showed an improvement in the segmentation of intraretinal layers when the proposed method was used as a preprocessing step [30]. Ma et al. proposed a combination of structure interpolation and lateral mean filtering to improve the signal-to-noise ratio of retinal images, which also improves the final segmentation performance [31]. However, these methods may also remove useful information when suppressing speckle noise, which makes them less popular in the deep learning community. Moreover, when processing esophageal OCT images, problems other than speckle noise also affect the segmentation. First, the probe-protective sheath (Fig. 1) is sometimes misclassified as tissue. Second, blurred boundaries are difficult to identify, especially for the last two layers. Finally, adjacent tissues with similar layer structures may cause the network to generate incorrect segmentation results. In the deep learning community, the key to solving these problems is to extract more representative feature maps, which have similar expressions for pixels in the same tissue and show discriminability for pixels from different tissues. Several methods have been proposed to achieve more effective feature maps. One common approach is to combine multi-scale feature maps to capture richer context information [32,33]. Although this context fusion technique helps capture features at different scales, it cannot reveal the long-range dependence of structures in a global view, which is important in esophageal tissue segmentation. An alternative strategy is to use recurrent neural networks to exploit long-range dependencies [34,35]. Methods of this type have achieved success in scene segmentation, but the relationship is implicitly learned by the recurrent network, which makes training difficult, and the result is sensitive to the outcome of the long-term memorization.

Fig. 1. Demonstration of (a) a typical esophageal OCT image of a mouse and (b) the corresponding manual segmentation result.

Recently, it has been shown that self-attention, as an attention mechanism, can effectively capture the global dependencies of the input [36]. It was first used in natural language processing and achieved great success in a variety of tasks [37–39]. In 2017, self-attention was used to construct an architecture called the Transformer, which is able to draw global dependencies between input and output much like recurrent networks do [36]. The self-attention mechanism was then introduced to the computer vision community. For instance, Wang et al. employed self-attention to perform class-specific pooling, which results in more accurate image classification [40]. Zhang et al. improved image generation quality by embedding a self-attention structure in a generative adversarial network [41]. Wang et al. proposed non-local self-attention to capture long-range dependencies in images and achieved higher video classification accuracy [42]. Fu et al. proposed the dual-attention network to improve network performance in scene segmentation [43].

In this study, we propose a novel framework, called the tissue self-attention network (TSA-Net), for layer segmentation on esophageal OCT images. The TSA-Net introduces the self-attention mechanism to the segmentation network, which helps capture long-range context dependencies from the image. The network employs the U-Net as the backbone, and a specifically designed TSA module is embedded to accomplish tissue attention. The TSA module is composed of two main parts, namely, the position self-attention module and the channel self-attention module. The position self-attention module is designed to reveal feature similarities between any two positions in the image, thereby capturing the spatial dependencies between different pixels. The channel self-attention module behaves similarly, but it captures dependencies in the channel dimension. By introducing the TSA module, the segmentation network is able to analyze the input image in a global view, which is beneficial for clustering pixels in the same tissue and revealing differences between layers, thereby achieving higher segmentation accuracy. Our main contributions can be summarized as follows:

  • We propose a novel TSA-Net with a self-attention mechanism to extract more powerful feature maps for tissue segmentation on esophageal OCT images.
  • We design the position and channel self-attention modules, whose effectiveness in capturing tissue structures is visually demonstrated.
  • Accuracy improvements over several popular deep networks are experimentally observed.

The rest of this study is organized as follows. Section 2 describes the related theory and the detailed architecture of the proposed TSA-Net. Section 3 describes the experiments, including visualizations of the attention feature maps and comparisons with other deep networks. Discussions and conclusions are given in Sections 4 and 5, respectively.

2. Methods

2.1 Problem statement

Given an esophageal OCT image, the task is to assign each pixel to a particular label representing a certain tissue. A typical esophageal OCT image from a mouse is shown in Fig. 1(a). The target tissue layers marked in the image are the epithelium stratum corneum (SC), epithelium (EP), lamina propria and muscularis mucosae (LP & MM) and submucosa (SM), which are labeled from “1” to “4”, respectively. The remaining part of the image is labeled “0”, as displayed in Fig. 1(b).

2.2 Overview of the TSA network

Typical convolutional networks only focus on local features due to the local receptive field of convolution kernels, which may cause pixels from the same tissue to be misclassified into different classes. In this study, we designed the TSA-Net to capture global contextual information from OCT images by introducing the self-attention mechanism into the segmentation network. The network is intended to have a better feature representation, which is beneficial for esophageal tissue segmentation.

The overall framework of the proposed TSA-Net is shown in Fig. 2. We use the U-Net as the backbone network since it has achieved great success in the field of medical image segmentation [15]. In Fig. 2, “ConvBL”, “ResBL” and “TSA-BL” represent the convolutional block, residual block and the proposed TSA block, respectively, whose structures are shown in Fig. 3. A notation such as “ConvBL 64” indicates that the block outputs 64 channels. The “C” in the architecture indicates a concatenation connection. As shown in Figs. 3(a) and 3(c), the convolutional layers are followed by a batch normalization layer and a PReLU activation layer. The batch normalization layer compensates for covariate shift and helps achieve successful training. The PReLU activation is chosen because it introduces non-linearity and alleviates the vanishing gradient problem. Besides, PReLU converges faster than ReLU [25]. For the residual block, the residual layers are batch normalized and the addition is followed by a PReLU. A dropout layer with a 0.5 dropout rate is applied at the end of the encoder to prevent overfitting.
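For concreteness, the sketch below shows how the ConvBL and ResBL units in Figs. 3(a) and 3(c) could be written with the Keras functional API, the framework used in this study. The 3 × 3 kernel size and the assumption that the residual shortcut already carries the target number of channels are illustrative choices rather than details reported here.

```python
from tensorflow.keras import layers

def conv_block(x, filters):
    """ConvBL sketch: Conv -> BatchNorm -> PReLU (kernel size assumed 3x3)."""
    x = layers.Conv2D(filters, 3, padding="same")(x)
    x = layers.BatchNormalization()(x)
    return layers.PReLU(shared_axes=[1, 2])(x)

def residual_block(x, filters):
    """ResBL sketch: batch-normalized residual branch, addition followed by PReLU.
    Assumes the input already has `filters` channels so the shortcut matches."""
    shortcut = x
    y = layers.Conv2D(filters, 3, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.PReLU(shared_axes=[1, 2])(y)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.BatchNormalization()(y)
    y = layers.Add()([y, shortcut])
    return layers.PReLU(shared_axes=[1, 2])(y)
```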

Fig. 2. Architecture of the TSA-Net.

Fig. 3. Architectures of (a) the convolution block (ConvBL); (b) the tissue self-attention block (TSA-BL) and (c) the residual block (ResBL).

The TSA module is added after the first pooling layer. The reason is that we want the TSA module to retain more details of the original input, which means that it should be placed close to the input layer. However, the TSA module requires large memory when calculating the self-attention map, which limits its input size. As a result, we place the TSA module as shown in Fig. 2, which is a compromise between the desired input size and the memory of the computing device. The TSA module is composed of two sub-modules, namely, the position self-attention module and the channel self-attention module, as shown in Fig. 3(b). The position self-attention module is intended to capture the global contextual features of the input in the spatial dimension, and the channel self-attention module is used to explore the long-range dependence of the input feature map across channels. The final TSA feature map is obtained by aggregating the outputs of these two sub-modules and the input feature map, which generates better feature representations for esophageal tissue segmentation.

2.3 Details of the TSA module

The TSA module is designed based on the attention mechanism, which can be mathematically described by Eq. (1), where $\textbf {Q}$, $\textbf {K}$ and $\textbf {V}$ denote the query matrix, key matrix and value matrix, respectively [36].

$$\textbf{Y} = \textrm{softmax}(\textbf{Q}\textbf{K}^T)\textbf{V}$$

In self-attention theory, Eq. (1) is transformed to Eq. (2),

$$\textbf{Y}= \textrm{softmax}((\textbf{W}_\theta \textbf{X})^T \textbf{W}_\phi \textbf{X}) \textbf{W}_g \textbf{X}$$
where $\textbf {X}$ is the input matrix, and $\textbf {W}_\theta$, $\textbf {W}_\phi$ and $\textbf {W}_g$ are weight matrices learned during training. In this study, the position and channel self-attention modules are designed based on Eq. (2).
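To make Eq. (2) concrete, the NumPy sketch below applies it to a generic feature matrix with one column per spatial position. The embedding dimension of the learned weight matrices is arbitrary here, and the transposes follow Eq. (2) up to the layout convention.

```python
import numpy as np

def self_attention(X, W_theta, W_phi, W_g):
    """X: (C, N) feature matrix with N positions; W_theta, W_phi: (C_e, C); W_g: (C, C)."""
    theta = W_theta @ X                       # query embedding, (C_e, N)
    phi = W_phi @ X                           # key embedding,   (C_e, N)
    g = W_g @ X                               # value embedding, (C, N)
    logits = theta.T @ phi                    # pairwise similarities, (N, N)
    logits -= logits.max(axis=-1, keepdims=True)
    attn = np.exp(logits)
    attn /= attn.sum(axis=-1, keepdims=True)  # row-wise softmax
    return g @ attn.T                         # attention-weighted values, (C, N)

# toy example: 8 channels, 16 positions, embedding dimension 4
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 16))
Y = self_attention(X, rng.normal(size=(4, 8)), rng.normal(size=(4, 8)), rng.normal(size=(8, 8)))
print(Y.shape)  # (8, 16), same size as X
```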

2.3.1 Position self-attention module

The position self-attention module is constructed as shown in Fig. 4. In this figure, $\textbf {X} \in \mathbb {R}^{H \times W \times C}$ denotes the $C$-channel feature map with size $H \times W$. As indicated by Eq. (2), the input matrix is supposed to be multiplied with different weight matrices. In this case, we utilize three $1 \times 1$ convolution kernels instead of fully connected layers to realize a similar transform with a smaller memory requirement. Then, the output with size $H \times W \times C$ is transformed to a matrix with size $HW \times C$ by reshaping the feature map of each channel column-wise into a vector of length $H \times W$. The $\otimes$ in Fig. 4 denotes matrix multiplication and the $\oplus$ indicates element-wise addition of matrices. The $\textbf {M}^s$ in Fig. 4 is regarded as the position self-attention map. This matrix has a clear physical meaning: it describes the spatial relationship between any two pixels of the feature map. The attention map is then multiplied with the value matrix $\textbf {V}^s$, and the result is added to the original features to generate the final representation $\textbf {X}^s$. The $\textbf {X}^s$ is intended to capture long-range contextual information from the image. Moreover, $\textbf {X}^s$ has the same size as $\textbf {X}$, so the module can be conveniently embedded into existing frameworks.
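A minimal Keras sketch of the position self-attention module, built from standard layers, is given below. The channel-reduction factor for the query and key projections is an assumption made for illustration; it is not a hyper-parameter reported in this section.

```python
from tensorflow.keras import layers

def position_self_attention(x, reduction=8):
    """x: feature map of shape (batch, H, W, C); returns X^s of the same shape."""
    h, w, c = x.shape[1], x.shape[2], x.shape[3]
    # 1x1 convolutions play the role of the weight matrices in Eq. (2)
    q = layers.Reshape((h * w, c // reduction))(layers.Conv2D(c // reduction, 1)(x))
    k = layers.Reshape((h * w, c // reduction))(layers.Conv2D(c // reduction, 1)(x))
    v = layers.Reshape((h * w, c))(layers.Conv2D(c, 1)(x))
    # M^s: similarity between every pair of the HW pixel positions
    attn = layers.Softmax(axis=-1)(layers.Dot(axes=2)([q, k]))   # (batch, HW, HW)
    out = layers.Dot(axes=(2, 1))([attn, v])                     # (batch, HW, C)
    out = layers.Reshape((h, w, c))(out)
    return layers.Add()([out, x])                                # X^s = attention output + input
```

Because the output keeps the input size, this block can be dropped into the backbone after the first pooling stage as in Fig. 2.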

Fig. 4. Architecture of the position self-attention module.

2.3.2 Channel self-attention module

The architecture of the channel self-attention module can be found in Fig. 5. Its structure is similar to that of the position self-attention module. The channel attention map $\textbf {M}^c$ is calculated based on the input feature map $\textbf {X}$. This attention map represents the relationship between any two channels, which captures long-range dependencies in the channel dimension. The output $\textbf {X}^c$ has the same size as the input feature map, indicating that this module can be embedded into existing frameworks without additional changes to the original structure.
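The channel self-attention module can be sketched analogously. Here the attention map is computed directly from the reshaped input feature map, following the dual-attention formulation [43] that this design builds on; whether extra 1 × 1 projections are used in Fig. 5 is not stated, so none are assumed.

```python
from tensorflow.keras import layers

def channel_self_attention(x):
    """x: feature map of shape (batch, H, W, C); returns X^c of the same shape."""
    h, w, c = x.shape[1], x.shape[2], x.shape[3]
    f = layers.Reshape((h * w, c))(x)                            # (batch, HW, C)
    # M^c: similarity between every pair of the C channels
    attn = layers.Softmax(axis=-1)(layers.Dot(axes=1)([f, f]))   # (batch, C, C)
    out = layers.Dot(axes=(2, 2))([f, attn])                     # (batch, HW, C)
    out = layers.Reshape((h, w, c))(out)
    return layers.Add()([out, x])                                # X^c = attention output + input
```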

Fig. 5. Architecture of the channel self-attention module.

2.4 Loss function

The overall loss function of the TSA-Net can be expressed as Eq. (3).

$$L = L_{\textrm{CE}} + \lambda L_{\textrm{Dice}}$$

In this equation, $L_{\textrm {CE}}$ denotes the cross-entropy loss, a measure of classification accuracy described by Eq. (4), where $N$ is the number of pixels, $g_l(x)$ is the target probability that pixel $x$ belongs to class $l$ (one for the true label and zero otherwise), and $p_l(x)$ is the estimated probability that pixel $x$ belongs to class $l$.

$$L_{\textrm{CE}} ={-}\sum_{i=1}^{N}g_{l}(x)\log{p_{l}(x)}$$

$L_{\textrm {Dice}}$ represents the Dice loss, which evaluates the spatial overlap between the predicted mask and the ground truth, as defined in Eq. (5),

$$L_{\textrm{Dice}} = 1-\frac{2\sum_{i=1}^{N}p_l(x)g_l(x)}{\sum_{i=1}^{N}p^2_l(x)+\sum_{i=1}^{N}g^2_l(x)}$$
where the parameters are defined in the same way as those in Eq. (4). The parameter $\lambda$ is a user-defined weight that balances the two terms and is set to 0.5 in this study.
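A possible Keras/TensorFlow realization of Eqs. (3)–(5) is sketched below; the small epsilon added to the Dice denominator for numerical stability is our addition and not part of the stated formulation.

```python
import tensorflow as tf

def combined_loss(y_true, y_pred, lam=0.5, eps=1e-7):
    """Eq. (3): cross-entropy plus lambda * Dice loss, with lambda = 0.5.
    y_true is the one-hot label mask, y_pred the softmax output,
    both of shape (batch, H, W, num_classes)."""
    # pixel-wise cross entropy, Eq. (4)
    ce = tf.reduce_mean(tf.keras.losses.categorical_crossentropy(y_true, y_pred))
    # Dice loss, Eq. (5), computed per class over the spatial dimensions
    inter = tf.reduce_sum(y_true * y_pred, axis=[1, 2])
    denom = tf.reduce_sum(y_pred ** 2, axis=[1, 2]) + tf.reduce_sum(y_true ** 2, axis=[1, 2])
    dice = 1.0 - tf.reduce_mean(2.0 * inter / (denom + eps))
    return ce + lam * dice
```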

2.5 Details about the training of the TSA-Net

The TSA-Net was trained end-to-end using the Adam optimizer [44] with a Nesterov momentum of 0.9. An initial learning rate of $1 \times 10^{-4}$ is applied and is decayed by a factor of 10 if the validation loss fails to improve over ten consecutive epochs. Training is performed in batches of 16 randomly chosen samples at each iteration. An epoch is finished after a full pass through the training set, and the network is trained for 100 epochs. Finally, the model with the lowest validation loss is retained and used to measure the segmentation performance of the network on the test dataset.
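The training schedule could be set up in Keras roughly as follows. Adam with Nesterov momentum corresponds to the Nadam optimizer in Keras; the checkpoint filename, the `model` object and the training arrays are placeholders, and the exact callbacks used here are not reported.

```python
import tensorflow as tf

# `model` is the compiled TSA-Net; `combined_loss` is the loss sketched above.
model.compile(optimizer=tf.keras.optimizers.Nadam(learning_rate=1e-4),
              loss=combined_loss, metrics=["accuracy"])

callbacks = [
    # decay the learning rate by a factor of 10 after 10 stagnant validation epochs
    tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.1, patience=10),
    # keep the weights of the model with the lowest validation loss
    tf.keras.callbacks.ModelCheckpoint("tsa_net_best.h5", monitor="val_loss",
                                       save_best_only=True),
]
model.fit(x_train, y_train, batch_size=16, epochs=100,
          validation_data=(x_val, y_val), callbacks=callbacks)
```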

During the training process, data augmentation is used to improve the network robustness. The data augmentation techniques used in this study include random rotation (range = 10), random shear (range = 0.05), random shift (range = 0.05), random zoom (range = 0.05) and random horizontal flip.
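With Keras’ ImageDataGenerator, the listed ranges could be configured as below; the mapping of each “range” value to a specific Keras argument is our interpretation of the description above.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

augmenter = ImageDataGenerator(
    rotation_range=10,        # random rotation within +/- 10 degrees
    shear_range=0.05,         # random shear
    width_shift_range=0.05,   # random horizontal shift
    height_shift_range=0.05,  # random vertical shift
    zoom_range=0.05,          # random zoom
    horizontal_flip=True,     # random horizontal flip
)
# The same random transform must be applied to each image and its label mask,
# e.g. by flowing two generators with an identical random seed.
```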

3. Experiments

3.1 Data

This study was approved by the animal science center of Suzhou Institute of Biomedical Engineering and Technology under protocol number 2020-A02 (from September 1st 2020 to January 31st 2021). OCT images (1840 B-scans) of the esophagus from eight C57BL mice were used to evaluate the proposed segmentation network. These images were collected in vivo from different subjects using an 800 nm ultrahigh-resolution (axial resolution $\leq 3$ $\mu$m) endoscopic OCT system. The probe of the OCT device can enter the upper gastrointestinal tract and rotationally scan the esophagus noninvasively. Each image is initially expressed in polar coordinates and then converted to Cartesian coordinates by the software of the OCT system. During the experiment, 1200 B-scans collected from six C57BL mice were used to establish the segmentation network, among which 800 B-scans were randomly selected for training and the remaining 400 B-scans were used for validation. An independent test set consisting of 240 B-scans was collected from two other mice, ensuring that there is no overlap between the data used for training and testing. All algorithms are evaluated on this independent test dataset.

The size of each B-scan in our dataset is $256 \times 256$. For the data used for training and validation, each B-scan is split width-wise into two non-overlapping slices of size $256 \times 128$ to reduce the GPU memory needed for training. Since our fully convolutional network can process images of arbitrary size, images in the test set can be segmented without slicing.
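The width-wise split amounts to a simple array-slicing step, sketched here for a single B-scan.

```python
import numpy as np

def split_bscan(bscan):
    """Split one 256 x 256 B-scan into two non-overlapping 256 x 128 slices."""
    w = bscan.shape[1]
    return bscan[:, : w // 2], bscan[:, w // 2 :]

left, right = split_bscan(np.zeros((256, 256)))
print(left.shape, right.shape)  # (256, 128) (256, 128)
```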

The annotated labels were generated by an experienced grader using ITK-SNAP [45] and were used for network training and algorithm evaluation. During this process, the grader annotated each image twice, and the average of the two annotations was used as the ground truth. The TSA-Net was implemented in Keras using TensorFlow as the backend. Training of the network was performed on an 11 GB Nvidia GeForce RTX 2080Ti GPU using CUDA 9.2 with cuDNN v7.

3.2 Ablation study for attention module

We employed the TSA module in the segmentation network to capture long-range dependencies for better structure understanding. To verify the performance of the TSA module, we conducted an ablation study comparing the segmentation performance of the network with and without the TSA module.

An intuitive visualization of the TSA effects can be found in Fig. 6. As marked by the white circles, the segmentation network with the TSA module generates more reasonable tissue boundaries, making the segmentation result closer to the ground truth. In the first row, the bottom of the “SM” layer is affected by adjacent tissues when segmenting without the TSA module. In the second row, the network without the TSA module incorrectly treats artifacts as tissues. In the last row, the TSA module helps generate continuous and smooth segmentations of the two bottom layers. These results confirm that the attention modules bring great benefits to OCT image segmentation.

Fig. 6. Visualization of segmentation result of the TSA-Net on esophageal OCT images.

3.3 Visualization of the attention feature maps

The input feature map of the TSA module has size $H \times W \times C$, which is $128 \times 128 \times 128$ in this case. For the position attention module, the size of the self-attention map is $HW\times HW$. As discussed in Section 2, this position attention map provides a position weight matrix (of size $H \times W$) for each point of the feature map. In Fig. 7, for each input image, we select three points (marked as #1, #2, #3) and demonstrate their corresponding position weight matrices in columns 2 to 4. These three points represent different parts of the image. In detail, point #1 is selected from the background, point #2 is chosen from the high-reflective tissues and point #3 belongs to diagnosis-irrelevant low-reflective tissues. It can be found that the position weight matrix is able to capture clear semantic similarity from the image. For instance, the position weight matrix for point #1 (column 2) highlights the upper background where the point is selected from, while for other structures such as the high-reflective tissues the weight value is small, indicating lower relevance. The position self-attention map for point #2 (column 3) presents large values in high-reflective regions, and the upper background has the smallest values, indicating the least relevance. The map corresponding to point #3 (column 4) highlights the low-reflective tissues and has the lowest values in high-reflective regions, since they are regarded by the TSA module as significantly different from the selected point. These results confirm that the position weight matrix is able to capture meaningful positional relationships between different pixels of the image.
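In principle, the weight images in Fig. 7 are obtained by taking the row of the $HW \times HW$ position attention map that corresponds to the selected pixel and reshaping it back to $H \times W$, as sketched below. Row-major flattening is assumed here; the actual ordering must match how the feature map was reshaped inside the module.

```python
import numpy as np

def position_weight_image(attn_map, row, col, height, width):
    """attn_map: (H*W, H*W) position self-attention map for one image.
    Returns the (H, W) weight image of the pixel at (row, col)."""
    idx = row * width + col            # index of the selected pixel after flattening
    return attn_map[idx].reshape(height, width)

# toy example with a 16 x 16 feature map, i.e. a 256 x 256 attention map
demo_map = np.random.rand(16 * 16, 16 * 16)
weights = position_weight_image(demo_map, row=4, col=7, height=16, width=16)
print(weights.shape)  # (16, 16)
```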

Fig. 7. Visualization of weight matrix for the position self-attention map.

For the channel attention module, the self-attention map has size $C \times C$, which means there is a corresponding weight vector for each channel. In this case, it is difficult to provide an intuitive explanation by directly displaying the weight vector. Instead, we show two selected channels from the feature map generated by the channel attention module. In Fig. 8, we present the 20th and 29th channels in columns 2 and 3. For the four target tissues, both channels can distinguish the low-reflective tissues from the high-reflective ones. The difference is that the 20th channel highlights the plastic sheath in the image while the 29th channel behaves in the opposite way. The other channels behave in a pattern similar to one of these two cases. The channel attention map is not as intuitive as the position attention map, but it still encodes specific semantics, which is helpful for tissue segmentation.

Fig. 8. Visualization of weight matrix for the channel self-attention map.

3.4 Comparisons with state-of-the-art

3.4.1 Evaluation metrics

The following metrics are employed to evaluate the different deep networks: precision, recall, the Dice coefficient, the Hausdorff distance (HD) and the average distance (AVD) [46,47]. Precision, recall and the Dice coefficient evaluate the overlap between the predicted mask and the ground truth, while the HD and AVD measure the tissue boundary accuracy of the segmentation. Detailed definitions of the first three metrics can be found in Eqs. (6) to (8),

$$\textrm{Precision} = \frac{S_R \cap S_G}{S_G}$$
$$\textrm{Recall} = \frac{S_R \cap S_G}{S_R}$$
$$\textrm{Dice} = 2 \times \frac{S_R \cap S_G}{S_R + S_G}$$
where $S_R$ and $S_G$ represent the binary segmentation and ground truth areas, respectively. Definition of HD is given by Eq. (9),
$$\textrm{HD}(A, B) = \max(h(A, B), h(B, A))$$
where $A$ and $B$ are two finite point sets, $h(A, B)$ is called the directed Hausdorff distance and given by Eq. (10).
$$h(A, B) = \max_{a \in A} \min_{b \in B} \|a - b\|$$

The AVD, defined in Eq. (11), is less sensitive to outliers than the HD,

$$\textrm{AVD}(A, B) = \max(d_a(A, B), d_a(B, A))$$
where $d_a(A, B)$ is called the directed Average Hausdorff distance given by
$$d_a(A, B) = \frac{1}{N} \sum_{a \in A} \min_{b \in B} \|a - b\|$$
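For reference, Eqs. (9)–(12) can be implemented directly in NumPy as sketched below, where the inputs are arrays of boundary point coordinates extracted from the predicted and ground-truth masks.

```python
import numpy as np

def _pairwise_distances(A, B):
    """A: (n, 2), B: (m, 2) point coordinates; returns the (n, m) distance matrix."""
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)

def hausdorff_distance(A, B):
    """HD(A, B) of Eq. (9), using the directed distance h of Eq. (10)."""
    d_ab = _pairwise_distances(A, B).min(axis=1).max()   # h(A, B)
    d_ba = _pairwise_distances(B, A).min(axis=1).max()   # h(B, A)
    return max(d_ab, d_ba)

def average_distance(A, B):
    """AVD(A, B) of Eq. (11), using the directed average distance of Eq. (12)."""
    d_ab = _pairwise_distances(A, B).min(axis=1).mean()  # d_a(A, B)
    d_ba = _pairwise_distances(B, A).min(axis=1).mean()  # d_a(B, A)
    return max(d_ab, d_ba)
```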

3.4.2 Comparing results

The following state-of-the-art deep networks were compared with the proposed TSA-Net: SegNet [48], PSPNet [16], U-Net [15,49] and Pix2Pix [50]. The number of parameters and the training time for each network are listed in Table 1. It can be found that all networks except Pix2Pix complete the training process within 1.5 hours. Pix2Pix takes much longer to train since it is composed of two deep networks, a generator and a discriminator, which leads to more trainable parameters. In addition, the adversarial training strategy makes the computing device spend more time on the training process. The proposed TSA-Net introduces a new self-attention architecture to the U-Net, which adds about 400,000 parameters. Compared with the total number of parameters, this increase is modest and slows the training by only about 10 minutes compared with the U-Net. Benefiting from the parallel computing capability of the GPU, all the trained networks segment new images quickly, processing the 240-B-scan test set within 2 seconds.

Table 1. Number of parameters and the training time.

The overall segmentation accuracies for the test set are listed in Table 2. In this table, “U-Net+CSA” denotes the U-Net with the proposed channel self-attention module, and “U-Net+PSA” indicates the U-Net with the position self-attention module. It can be seen that these attention modules improve the overall segmentation accuracy, and the TSA-Net achieves the highest accuracy on the test set. However, the advantage of the attention modules in terms of accuracy is not obvious. One reason is that accuracy cannot measure the topological quality of the result, and the tissue area is small compared with the whole image; as a result, changes in tissue labeling do not cause an obvious difference in accuracy.

Table 2. Segmentation accuracies for different methods.

For a more comprehensive comparison, we evaluate the performance of the different networks on each individual layer. Results are listed in Tables 3 to 6, with the best performance in bold. It can be found that SegNet and PSPNet have Dice coefficients similar to the other methods. However, their HD and AVD are significantly larger than the others, indicating that these two networks may generate label masks outside the target tissue region, leading to more topological errors. Pix2Pix performs better than SegNet and PSPNet in HD and AVD, but its overall accuracy and Dice coefficient are lower than those of the two networks. One possible reason is that Pix2Pix focuses more on generating realistic-looking label masks with fewer topological errors rather than on accurately classifying pixels. The U-Net used in this study has the same structure as the proposed TSA-Net without the TSA module. Compared with the three deep networks mentioned above, it achieves more satisfactory segmentation results, indicating its advantages in segmenting esophageal OCT images. “U-Net+CSA” and “U-Net+PSA” achieve better segmentation results than the U-Net, indicating the effectiveness of the proposed attention modules. Besides, “U-Net+PSA” achieves smaller HD and AVD than “U-Net+CSA” in most cases, indicating that the position self-attention module is more effective in capturing the structural information of the input, which is consistent with the feature map behavior of the two modules in Figs. 7 and 8. The combination of the position and channel self-attention modules forms the TSA module, which adaptively captures the tissue structure in the OCT image and achieves the best performance in almost all cases. The higher Dice coefficient of the TSA-Net indicates that the proposed network labels the target tissues more accurately. In addition, the smaller HD and AVD values confirm that the segmentation result of the TSA-Net contains continuous tissues with fewer topological errors.

Table 3. Evaluation of different segmentation methods for the SC layer.

Table 4. Evaluation of different segmentation methods for the EP layer.

Table 5. Evaluation of different segmentation methods for the LP&MM layer.

Table 6. Evaluation of different segmentation methods for the SM layer.

4. Discussions

Automatic segmentation of clinically relevant esophageal tissues is often affected by speckle noise, irrelevant structures and unfavorable image quality. An effective solution to this problem is to extract more powerful feature maps for the deep networks. In this study, we proposed the TSA-Net, which introduces the self-attention mechanism to capture long-range dependencies between pixels of the image. The core of the TSA-Net is the TSA module, which consists of a position self-attention module and a channel self-attention module. As observed in the experiments, the feature map generated by the position self-attention module describes relationships between any two pixels of the image, while the channel self-attention feature map captures long-range dependencies of the features in the channel dimension. In this way, pixels from different structures in the image are clustered in a global view before segmentation, thus generating more discriminative feature maps for tissue identification. Comparisons with other popular segmentation networks confirmed the advantages of the proposed TSA-Net, which achieved the best performance on most quantitative indicators.

In this study, the loss function is composed of the cross-entropy and the Dice loss, as shown in Eq. (3). To confirm the effectiveness of these two terms, we carried out an ablation study in which the TSA-Net was trained with only the cross-entropy loss or only the Dice loss. The average accuracies on the test set using the cross-entropy and the Dice loss alone are 0.9192 and 0.8951, respectively, which are much lower than the combined result of 0.9681 (Table 2). One reason is that the esophageal tissue occupies a small area compared with the entire image, which makes it difficult for the algorithm to find the correct optimization direction based only on the cross entropy or the Dice coefficient. The weight in Eq. (3) is set to 0.5, which was chosen based on experiments. The model gives similar performance when the weight is set in the range from 0.4 to 0.6.

In the TSA-Net architecture shown in Fig. 2, a dropout layer with rate 0.5 is added between the encoder and the decoder. The reason is that, given the tens of millions of parameters, the network is prone to overfitting. Adding the dropout layer in the latent space is most effective since the latent representation significantly affects the network output, which means a single dropout layer in that location is sufficient to alleviate the overfitting problem. To verify the effectiveness of this structure, we conducted an ablation study that removed the dropout layer. Results show that the segmentation accuracy on the test dataset is reduced from 0.9681 to 0.9562.

The TSA module is designed to generate an output of the same size as the input, which makes it convenient to be embedded into existing frameworks. In this study, we use the U-Net as the backbone since it has achieved great success in medical image segmentation [15,25] and its performance in this task is also superior to other testing deep networks as presented in the experiments. The proposed TSA-Net has the potential to be further improved when more powerful segmentation networks are designed.

A limitation of the TSA module is that it requires a large amount of memory to compute the self-attention maps. For example, a $256 \times 256$ input generates a $65536 \times 65536$ position self-attention map, which is a heavy burden for the computing device. To alleviate this problem, the network is designed in a fully convolutional manner, so it can be trained on image slices instead of the original images, and the trained network can be applied directly to images of the original size.

In this study, due to GPU memory limitations, we only added one TSA module, after the first pooling layer of the U-Net. To verify the performance of embedding multiple TSA modules at different stages of the network, we added two TSA modules after the first two pooling layers of the U-Net and trained the new network on a 16 GB NVIDIA Tesla P100 GPU. The results show that the segmentation accuracy is improved from 0.9681 to 0.9684. When we tried to add three TSA modules to the U-Net, the network could not be trained because of insufficient GPU memory. Adding more TSA modules to the network therefore has the potential to improve segmentation performance; however, considering the large memory requirements, this may not be worth the effort.

Evaluation of the TSA-Net was performed using esophageal OCT images from mice. This dataset intuitively illustrates the effectiveness of the self-attention mechanism and confirms the advantages of the proposed TSA-Net. To move the proposed network from the laboratory to the clinic, esophageal OCT images from humans with various health conditions will be collected.

5. Conclusions

In this study, we proposed the TSA-Net for esophageal tissue segmentation on OCT images. The TSA-Net introduces the self-attention mechanism to capture long-range feature dependencies in a global view. The core TSA module is composed of a position self-attention module and a channel self-attention module. The position self-attention module is designed to describe relations between any two pixels of the image, while the channel self-attention module reveals long-range dependencies among different channels. In this way, the TSA-Net is able to generate an attention feature map that clusters pixels from the same tissue and discriminates pixels from different structures. The experiments visually demonstrated the effectiveness of the self-attention map and confirmed the advantage of the TSA-Net over other popular deep networks in segmenting esophageal OCT images. The TSA-Net is amenable to further improvement, such as using a more powerful backbone than U-Net. Moreover, it can easily be applied to esophageal OCT images from humans since it is an end-to-end fully convolutional network. These characteristics make the TSA-Net clinically attractive.

Funding

Natural Science Foundation of Jiangsu Province (BK20200216); Jiangsu Planned Projects for Postdoctoral Research Funds of China (2018K007A, 2018K044C).

Disclosures

The authors declare that there are no conflicts of interest related to this article.

References

1. D. Huang, E. A. Swanson, C. P. Lin, J. S. Schuman, W. G. Stinson, W. Chang, M. R. Hee, T. Flotte, K. Gregory, C. A. Puliafito, and J. G. Fujimoto, “Optical coherence tomography,” Science 254(5035), 1178–1181 (1991). [CrossRef]  

2. G. J. Tearney, M. E. Brezinski, B. E. Bouma, S. A. Boppart, C. Pitris, J. F. Southern, and J. G. Fujimoto, “In vivo endoscopic optical biopsy with optical coherence tomography,” Science 276(5321), 2037–2039 (1997). [CrossRef]  

3. I. P. Okuwobi, Z. Ji, W. Fan, S. T. Yuan, L. Bekalo, and Q. Chen, “Automated quantification of hyperreflective foci in SD-OCT with diabetic retinopathy,” IEEE J. Biomed. Health Inform. 24(4), 1125–1136 (2020). [CrossRef]  

4. J. M. Poneros, S. Brand, B. E. Bouma, G. J. Tearney, C. C. Compton, and N. S. Nishioka, “Diagnosis of specialized intestinal metaplasia by optical coherence tomography,” Gastroenterology 120(1), 7–12 (2001). [CrossRef]  

5. X. Qi, M. V. Sivak, G. Isenberg, J. E. Willis, and A. M. Rollins, “Computer-aided diagnosis of dysplasia in Barrett’s esophagus using endoscopic optical coherence tomography,” J. Biomed. Opt. 11(4), 044010 (2006). [CrossRef]  

6. X. Qi, Y. S. Pan, M. V. Sivak, J. E. Willis, G. Isenberg, and A. M. Rollins, “Image analysis for classification of dysplasia in Barrett’s esophagus using endoscopic optical coherence tomography,” Biomed. Opt. Express 1(3), 825–847 (2010). [CrossRef]  

7. Z. Y. Liu, J. F. Xi, M. Tse, A. C. Myers, X. D. Li, P. J. Pasricha, and S. Y. Yu, “Allergic inflammation-induced structural and functional changes in esophageal epithelium in a guinea pig model of eosinophilic esophagitis,” Gastroenterology 146(5), S92 (2014). [CrossRef]  

8. M. J. Suter, M. J. Gora, G. Y. Lauwers, T. Arnason, J. Sauk, K. A. Gallagher, L. Kava, K. M. Tan, A. R. Soomro, T. P. Gallagher, J. A. Gardecki, B. E. Bouma, M. Rosenberg, N. S. Nishioka, and G. J. Tearney, “Esophageal-guided biopsy with volumetric laser endomicroscopy and laser cautery marking: a pilot clinical study,” Gastrointest. Endosc. 79(6), 886–896 (2014). [CrossRef]  

9. G. J. Ughi, M. J. Gora, A. F. Swager, A. Soomro, C. Grant, A. Tiernan, M. Rosenberg, J. S. Sauk, N. S. Nishioka, and G. J. Tearney, “Automated segmentation and characterization of esophageal wall in vivo by tethered capsule optical coherence tomography endomicroscopy,” Biomed. Opt. Express 7(2), 409–419 (2016). [CrossRef]  

10. J. L. Zhang, W. Yuan, W. X. Liang, S. Y. Yu, Y. M. Liang, Z. Y. Xu, Y. X. Wei, and X. D. Li, “Automatic and robust segmentation of endoscopic oct images and optical staining,” Biomed. Opt. Express 8(5), 2697–2708 (2017). [CrossRef]  

11. S. J. Chiu, X. T. Li, P. Nicholas, C. A. Toth, J. A. Izatt, and S. Farsiu, “Automatic segmentation of seven retinal layers in sdoct images congruent with expert manual segmentation,” Opt. Express 18(18), 19413–19428 (2010). [CrossRef]  

12. L. Y. Fang, D. Cunefare, C. Wang, R. H. Guymer, S. T. Li, and S. Farsiu, “Automatic segmentation of nine retinal layer boundaries in oct images of non-exudative amd patients using deep learning and graph search,” Biomed. Opt. Express 8(5), 2732–2744 (2017). [CrossRef]  

13. M. Gan, C. Wang, T. Yang, N. Yang, M. Zhang, W. Yuan, X. D. Li, and L. R. Wang, “Robust layer segmentation of esophageal OCT images based on graph search using edge-enhanced weights,” Biomed. Opt. Express 9(9), 4481–4495 (2018). [CrossRef]  

14. C. Wang, M. Gan, N. Yang, T. Yang, M. Zhang, S. H. Nao, J. Zhu, H. Y. Ge, and L. R. Wang, “Fast esophageal layer segmentation in oct images of guinea pigs based on sparse Bayesian classification and graph search,” Biomed. Opt. Express 10(2), 978–994 (2019). [CrossRef]  

15. O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” Med. Image Comput. Comput. Interv. Pt III 9351, 234–241 (2015). [CrossRef]  

16. H. S. Zhao, J. P. Shi, X. J. Qi, X. G. Wang, and J. Y. Jia, “Pyramid scene parsing network,” in 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), (2017), pp. 6230–6239.

17. H. M. Li, J. H. Fang, S. F. Liu, X. W. Liang, X. Yang, Z. X. Mai, M. T. Van, T. F. Wang, Z. Y. Chen, and D. Ni, “Cr-unet: A composite network for ovary and follicle segmentation in ultrasound images,” IEEE Journal of Biomedical and Health Informatics 24(4), 974–983 (2020) [CrossRef]  .

18. R. N. Zhang, X. Y. Xiao, Z. Liu, Y. J. Li, and S. Li, “MRLN: Multi-task relational learning network for mri vertebral localization, identification, and segmentation,” IEEE J. Biomed. Health Inform. 24(10), 2902–2911 (2020). [CrossRef]  

19. D. Romo-Bucheli, P. Seebock, J. I. Orlando, B. S. Gerendas, S. M. Waldstein, U. Schmidt-Erfurth, and H. Bogunovic, “Reducing image variability across oct devices with unsupervised unpaired learning for improved segmentation of retina,” Biomed. Opt. Express 11(1), 346–363 (2020). [CrossRef]  

20. J. Wang, T. T. Hormel, L. Q. Gao, P. X. Zang, Y. K. Guo, X. G. Wang, S. T. Bailey, and Y. L. Jia, “Automated diagnosis and segmentation of choroidal neovascularization in oct angiography using deep learning,” Biomed. Opt. Express 11(2), 927–944 (2020). [CrossRef]  

21. H. Stegmann, R. M. Werkmeister, M. Pfister, G. Garhofer, L. Schmetterer, and V. A. Dos Santos, “Deep learning segmentation for optical coherence tomography measurements of the lower tear meniscus,” Biomed. Opt. Express 11(3), 1539–1554 (2020). [CrossRef]  

22. R. Rasti, M. J. Allingham, P. S. Mettu, S. Kavusi, K. Govind, S. W. Cousins, and S. Farsiu, “Deep learning-based single-shot prediction of differential effects of anti-vegf treatment in patients with diabetic macular edema,” Biomed. Opt. Express 11(2), 1139–1152 (2020). [CrossRef]  

23. J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (2015), pp. 3431–3440.

24. J. Wang, Z. Wang, F. Li, G. X. Qu, Y. Qiao, H. R. Lv, and X. L. Zhang, “Joint retina segmentation and classification for early glaucoma diagnosis,” Biomed. Opt. Express 10(5), 2639–2656 (2019). [CrossRef]  

25. A. G. Roy, S. Conjeti, S. P. K. Karri, D. Sheet, A. Katouzian, C. Wachinger, and N. Navab, “Relaynet: retinal layer and fluid segmentation of macular optical coherence tomography using fully convolutional networks,” Biomed. Opt. Express 8(8), 3627–3642 (2017). [CrossRef]  

26. S. K. Devalla, P. K. Renukanand, B. K. Sreedhar, G. Subramanian, L. Zhang, S. Perera, J. M. Mari, K. S. Chin, T. A. Tun, N. G. Strouthidis, T. Aung, A. H. Thiery, and M. J. A. Girard, “Drunet: a dilated-residual u-net deep learning network to segment optic nerve head tissues in optical coherence tomography images,” Biomed. Opt. Express 9(7), 3244–3265 (2018). [CrossRef]  

27. F. G. Venhuizen, B. van Ginneken, B. Liefers, F. van Asten, V. Schreur, S. Fauser, C. Hoyng, T. Theelen, and C. I. Sanchez, “Deep learning approach for the detection and quantification of intraretinal cystoid fluid in multivendor optical coherence tomography,” Biomed. Opt. Express 9(4), 1545–1569 (2018). [CrossRef]  

28. D. W. Li, J. M. Wu, Y. F. He, X. W. Yao, W. Yuan, D. F. Chen, H. C. Park, S. Y. Yu, J. L. Prince, and X. D. Li, “Parallel deep neural networks for endoscopic oct image segmentation,” Biomed. Opt. Express 10(3), 1126–1135 (2019). [CrossRef]  

29. C. Wang, M. Gan, M. Zhang, and D. Y. Li, “Adversarial convolutional network for esophageal tissue segmentation on oct images,” Biomed. Opt. Express 11(6), 3095–3110 (2020). [CrossRef]  

30. Z. Amini and H. Rabbani, “Optical coherence tomography image denoising using gaussianization transform,” J. Biomed. Opt. 22(8), 1 (2017). [CrossRef]  

31. Y. S. Ma, Y. Z. Gao, Z. L. Li, A. Li, Y. Wang, J. Liu, Y. Yu, W. B. Shi, and Z. H. Ma, “Automated retinal layer segmentation on optical coherence tomography image by combination of structure interpolation and lateral mean filtering,” J. Innovative Opt. Health Sci. 14(01), 2140011–00 (2021). [CrossRef]  

32. L. C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs,” IEEE Transactions on Pattern Analysis Mach. Intell. 40(4), 834–848 (2018). [CrossRef]  

33. H. H. Ding, X. D. Jiang, B. Shuai, A. Q. Liu, and G. Wang, “Context contrasted feature and gated multi-scale aggregation for scene segmentation,” in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2018), pp. 2393–2402.

34. W. Byeon, T. M. Breuel, F. Raue, and M. Liwicki, “Scene labeling with lstm recurrent neural networks,” in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (2015), pp. 3547–3555.

35. B. Shuai, Z. Zuo, B. Wang, and G. Wang, “Scene segmentation with dag-recurrent neural networks,” IEEE Transactions on Pattern Analysis Mach. Intell 40(6), 1480–1493 (2018). [CrossRef]  

36. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems 30 (NIPS 2017), (2017).

37. K. Stefanov, J. Beskow, and G. Salvi, “Self-supervised vision-based detection of the active speaker as support for socially aware language acquisition,” IEEE Transactions on Cogn. Dev. Syst 12(2), 250–259 (2020). [CrossRef]  

38. W. J. Li, F. Qi, M. Tang, and Z. T. Yu, “Bidirectional lstm with self-attention mechanism and multi-channel features for sentiment classification,” Neurocomputing 387, 63–77 (2020). [CrossRef]  

39. T. Huang, Z. H. Deng, G. H. Shen, and X. Chen, “A window-based self-attention approach for sentence encoding,” Neurocomputing 375, 25–31 (2020). [CrossRef]  

40. F. Wang, M. Q. Jiang, C. Qian, S. Yang, C. Li, H. G. Zhang, X. G. Wang, and X. O. Tang, “Residual attention network for image classification,” in 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), (2017), pp. 6450–6458.

41. H. Zhang, I. Goodfellow, D. Metaxas, and A. Odena, “Self-attention generative adversarial networks,” arXiv:1805.08318 (2018).

42. X. L. Wang, R. Girshick, A. Gupta, and K. M. He, “Non-local neural networks,” in 2018 IEEE/Cvf Conference on Computer Vision and Pattern Recognition (CVPR) (2018), pp. 7794–7803.

43. J. Fu, J. Liu, H. J. Tian, Y. Li, Y. J. Bao, Z. W. Fang, and H. Q. Lu, “Dual attention network for scene segmentation,” in 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019) (2019), pp. 3141–3149.

44. D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in International Conference on Learning Representations (ICLR) (2015).

45. P. A. Yushkevich, J. Piven, H. C. Hazlett, R. G. Smith, S. Ho, J. C. Gee, and G. Gerig, “User-guided 3d active contour segmentation of anatomical structures: Significantly improved efficiency and reliability,” NeuroImage 31(3), 1116–1128 (2006). [CrossRef]  

46. A. A. Taha and A. Hanbury, “Metrics for evaluating 3D medical image segmentation: analysis, selection, and tool,” BMC Med. Imaging 15(1), 29–00 (2015). [CrossRef]  

47. H. S. Zhao, B. He, Z. Y. Ding, K. Y. Tao, T. D. Lai, H. Kuang, R. Liu, X. G. Zhang, Y. C. Zheng, J. Y. Zheng, and T. G. Liu, “Automatic lumen segmentation in intravascular optical coherence tomography using morphological features,” IEEE Access 7, 88859–88869 (2019). [CrossRef]  

48. V. Badrinarayanan, A. Kendall, and R. Cipolla, “Segnet: A deep convolutional encoder-decoder architecture for image segmentation,” IEEE Trans. Pattern Anal. Mach. Intell. 39, 2481–2495 (2015). [CrossRef]  

49. K. M. He, X. Y. Zhang, S. Q. Ren, and J. Sun, “Deep residual learning for image recognition,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016), pp. 770–778.

50. P. Isola, J. Y. Zhu, T. H. Zhou, and A. A. Efros, “Image-to-image translation with conditional adversarial networks,” in 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017) (2017), pp. 5967–5976.
