Connectivity-based deep learning approach for segmentation of the epithelium in in vivo human esophageal OCT images

Ziyun Yang; Somayyeh Soltanian-Zadeh; Kengyeh K. Chu; Haoran Zhang; Lama Moussa; Ariel E. Watts; Nicholas J. Shaheen; Adam Wax; Sina Farsiu; Sina Farsiu

doi:10.1364/BOE.434775

1. Introduction

Esophageal cancer is the seventh most common cancer and the sixth most common cause of cancer mortality worldwide [1]. Early detection of precancerous esophagus and dysplasia can help reduce the morbidity and mortality of esophageal cancer [2]. Systematic biopsies can detect high-grade dysplasia [3] but are limited by their invasiveness and by blind sampling of tissue. Optical coherence tomography (OCT) allows non-invasive cross-sectional imaging of soft tissues [4] and has been used for examining the upper gastrointestinal tract [5–9]. Accurate interpretation of esophageal OCT images is essential for detection of dysplasia [10,11] and for diagnosis of esophageal diseases such as eosinophilic esophagitis [12] and Barrett’s esophagus (BE) [13,14]. In BE, the normal stratified squamous epithelium of the esophagus is replaced by a specialized columnar epithelium [2,15]. However, due to the large volume of OCT images and patient-specific confounders such as tissue folding and mucus covering, analysis of OCT images by gastroenterologists is time-consuming and subjective. To enable high-throughput clinical use of OCT in esophageal cancer screening, an automated segmentation algorithm to accurately quantify tissue characteristics such as thickness and shape is needed.

Existing automated esophageal OCT segmentation algorithms fall into two main categories: traditional image processing-based methods and deep learning-based methods. Zhang et al. [16] and Gan et al. [17] proposed traditional image processing algorithms to alleviate speckle noise and improved graph searching-based methods for esophageal layer segmentation in guinea pigs. Wang et al. [18] combined a sparse Bayesian classifier with graph theory and dynamic programming [19] to segment esophageal layers, also in guinea pigs. Since these traditional image processing-based methods rely on predefined features, they are less reliable in dealing with variation among images.

Deep learning-based segmentation methods are more generalizable than traditional image processing methods as they may benefit from implicit learning of complex segmentation rules from diverse sets of training data. Several popular deep learning-based segmentation methods use an encoder-decoder structured neural network to assign each pixel a class label. An encoder-decoder model that is widely used in medical image segmentation is the U-shaped fully convolutional network (U-Net) proposed by Ronneberger et al. [20]. Li et al. [21] connected multiple U-Nets in parallel to segment esophageal layers in OCT images of guinea pigs. Wang et al. [22] proposed a self-attention network in which channel- and position-aware attention modules were added to U-Net for mouse esophageal OCT segmentation. However, the predictions of esophageal layers by these encoder-decoder models often suffer from topological errors such as outlier predictions, disconnected regions, and non-simply connected predictions [21,23]. One reason for topological errors is that these studies model segmentation as a pure pixel-level classification problem. These models give insufficient attention to inter-pixel relationships since they ignore that the class predictions of neighboring pixels are correlated; it is suboptimal to treat them independently. To alleviate these topological errors, Wang et al. [23] proposed an adversarial convolutional network consisting of an improved U-Net-based generator and discriminator trained through adversarial learning to segment the esophagus layer in guinea pig OCT images. Despite encouraging results showing that adversarial learning reduced topological errors by encoding high-order pixel relationships, this study did not demonstrate significant improvement in overall pixel-wise segmentation accuracy and Dice coefficient compared to U-Net. In addition, adversarial training requires complicated hyperparameter tuning and exhibits unstable model training behavior [24–26]. Further, current deep learning-based algorithms for esophageal OCT segmentation have been designed mainly for analyzing ex vivo images from small animals. The work of Ughi et al. focused on automatically segmenting the esophageal wall in clinical human OCT images [27], but their method was not an end-to-end learning algorithm as it was based on feature extractions that depended on prior knowledge. Currently, there is no end-to-end learning algorithm designed for automated segmentation of in vivo human esophageal OCT images, which are inherently lower quality than ex vivo images, and are further complicated by artifacts, patient-specific confounders such as tissue folding, mucus covering, and disease-related changes in the epithelium.

Here we propose a novel end-to-end connectivity-based deep learning algorithm, Bilateral connectivity-based CE-Net (Bicon-CE), for epithelial layer segmentation from in vivo human esophageal OCT images. We build upon our recent findings in salient object detection [28] to develop a novel OCT layer segmentation algorithm. We model the layer segmentation task as a combination of pixel connectivity modeling and pixel-wise tissue classification. To adapt our recent theoretical work [28] for this biomedical application, we use one of the most successful medical segmentation networks, CE-Net, as our backbone in Section 2.4. Furthermore, we propose a new aggregation operation in Section 2.5 to alleviate the interference of noisy pixels and enhance the learning efficacy. Also, we analyze the data composition of esophageal OCT images and revise the loss function to balance the background and target pixels in Section 2.6. In the result sections, we compare the topological relation of predictions and performance of Bicon-CE with commonly used methods in Section 3.2 and 3.3. We test the robustness of Bicon-CE under real-world clinical imaging artifacts and complicated patient-specific scenarios in Section 3.4, then assess the clinical potential of Bicon-CE by using it to identify changes to the epithelium due to BE in Section 3.5.

2. Methods

2.1 Dataset and annotation

Human subjects research was conducted with the approval of the Institutional Review Boards of Duke University Health System (Pro0090173) and University of North Carolina (UNC; 17-3037). Subjects were recruited from patients undergoing routine care endoscopy at UNC Healthcare. Of 54 patients initially recruited, 30 were successfully imaged using OCT (784 B-scans total); of these 30 subjects, six had non-dysplastic BE and one had BE with low-grade dysplasia.

The imaging technique design is described in [7]. Briefly, a paddle-shaped probe that is attached externally to an endoscope provides cross-sectional spectral domain OCT images of the esophageal mucosa to supplement standard video endoscopy, in a form factor compatible with existing workflow and clinical practice.

Acquired OCT data were cropped manually around the region of interest (ROI; the esophagus) and were labeled independently by three graders who are experienced at evaluating OCT images. The annotations of the most experienced grader (Grader #1) served as the gold standard, and those of the other two graders were used to test inter-grader variability. We note that Graders #2 and #3 were trained by Grader #1 prior to labeling the images presented in the paper, with in vivo human esophageal OCT images that were not part of the dataset included in the manuscript. Since segmentation is challenging even for experts, the graders were asked to segment the epithelium only in regions in which they were confident about the accuracy of their annotation. Therefore, manual segmentations often did not span the entire B-scan. We refer to the horizontal range of an OCT B-scan that was manually segmented by graders as their “trainable interval”. To avoid interference from unsegmented epithelium regions in network training, we cropped the Grader #1’s trainable intervals of the OCT ROI images and used only the resulting cropped images for training (Fig. 1 (a)). In the testing phase, we used the original uncropped ROI images as the network input. However, the performance of different methods and graders was compared to the gold standard only over the specifically-defined trainable interval. We define two intervals for evaluation: first is the trainable interval of Grader #1, which is used for comparing the Bicon models with the baseline methods in Section 3.3. The second one is the graders’ consensus interval (Fig. 1(b)), defined as the overlap of all three graders’ trainable intervals. The graders’ consensus interval is used in Section 3.2 and 3.4 for experiments that include inter-grader analysis to provide a fair comparison among different graders. All OCT images and their corresponding manual annotations by the three graders used in this paper are available at [29].

Fig. 1. (a) Illustration of the preparation of training data. Original OCT images were cropped around the region of interest (ROI; the esophagus) and were labeled manually by the most experienced grader (Grader #1). The resulting horizontal range (the “trainable interval”) was used to crop ROI images and labels for training Bicon-CE. (b) Process of finding the graders’ consensus interval, which is defined as the overlap of the trainable intervals of all three graders.

Download Full Size | PDF

2.2 Bilateral connectivity network with CE-Net backbone

Building upon our recent work [28], we constructed a bilateral connectivity network to fully model pixel connectivity along with pixel-wise tissue classification. Figure 2 provides an overview of the method. Our network contains three parts: a connectivity-based CE-Net [30] backbone, a bilateral voting (BV) module, and a region-guided channel aggregation (RCA) module. As we use CE-Net as the backbone of our model, we refer to it as Bicon-CE. For the loss function, we adopted a modified Bicon loss [28]. We describe each component in detail in the following sections.

Fig. 2. Overview of Bicon-CE, which contains a connectivity-based CE-Net backbone, a bilateral voting (BV) module, and a region-guided channel aggregation (RCA) module. DAC: dense atrous convolution, RMP: residual multi-kernel pooling.

Download Full Size | PDF

2.3 Connectivity mask

Classic convolutional neural network (CNN)-based methods treat image segmentation as a pure pixel label assignment problem. We refer to these methods as pixel classification-based methods. Since this modeling strategy neglects inter-pixel relationships [28], it can result in inconsistent boundaries and topological issues in layer segmentation (examples in Section 3.3). We propose an alternative model to address these problems. Unlike general semantic segmentation tasks in which a single class may contain multiple instances in a dataset, in esophagus layer segmentation each layer class contains only one simply connected region — the pixels from the same layer are all topologically connected. Therefore, strong inter-pixel coherence exists between pixels of the same layer. Inspired by this feature, we design our CNN such that it learns to classify image pixels in concert with modeling the connectivity between pixels from the same class.

In the manually labeled binary masks, areas belonging to the epithelial layer are marked with 1 and are referred to as positive pixels, while all other pixels are marked with 0. We define two pixels as connected if and only if they are adjacent and both are positive pixels. As shown in Fig. 3 (a), given a pixel in a binary mask G_S, we find its 8 neighboring pixels (C1-8) using the 8-neighborhood [31] system. Then, we construct an 8-entry connectivity vector for the center pixel in which each entry represents the connectivity between the center pixel and one neighboring pixel in a specific direction. Thus, given a binary mask G_S, we generate an 8-channel mask by deriving connectivity vectors for all of its pixels. We call this 8-channel mask the connectivity mask (G_C). For each two neighboring pixels in G_S, there are two specific paired elements in G_C that represent the mutual connectivity between them. We call these two elements a connectivity pair (Fig. 3 (b)). We use the connectivity mask as the label to model pixel connectivity for supervised learning.

Fig. 3. (a) Illustration of generating the connectivity mask (G_C) from the segmentation mask (G_S). Given an arbitrary pixel in G_S (yellow), we find its 8 neighboring pixels (C1-8). Then we convert the pixel into a connectivity vector (yellow vector in G_C). The subfigure on the top shows an example of converting the pixel GS (1,4) into a connectivity vector. G_S is zero-padded at the boundaries. G_C is obtained after all pixels of G_S are converted. (b) Example of a connectivity pair. The connectivity pair corresponding to the two green pixels in G_S is shown as the red boxed pair in G_C. G_C₁ (2,2) represents the top-left connectivity of G_S (2,2) and G_C₈ (1,1) represents the bottom-right connectivity of G_S (1,1).

Download Full Size | PDF

2.4 Connectivity-based CE-Net

We used CE-Net [30] as our network’s backbone. Like U-Net, CE-Net has a U-shaped encoder-decoder structure. To extract high-resolution feature maps, CE-Net consists of a dense atrous convolution (DAC) block at its last encoder block. To maintain the multi-scale information from the DAC block, CE-Net includes a residual multi-kernel pooling (RMP) block. For single-class tasks like epithelial layer segmentation, the original output layer of CE-Net is a single-channel fully connected (FC) layer. To introduce the pixel connectivity information, we replace the output layer with an 8-channel FC layer and use connectivity masks as the training labels. By doing so, we construct a connectivity-based CE-Net which, given an OCT esophagus image, outputs an 8-channel map which we call the connectivity map (Conn map, C). Every pixel in C represents the unidirectional connection probability of a pixel in a specific connectivity pair, and every channel represents the unidirectional pixel connection probability in a specific direction.

2.5 BV and RCA modules

After obtaining the Conn map, we enhance the coherence between neighboring pixels via the bilateral voting module (Fig. 4). In the BV module, we multiply the two elements in every connectivity pair and assign the resulting value to both elements, yielding a new map called the bilateral connectivity map (Bicon map, $\tilde{C}$):

(1)$$\begin{aligned} {{\tilde{C}}_j}({x,y} )&= {{\tilde{C}}_{9 - j}}({x + a,y + b} )\\& = {C_j}({x,y} )\times {C_{9 - j}}({x + a,y + b} )\end{aligned}$$

where j is the ${j_{th}}$ channel, and $a,\; b\; \in \{{0,\; \pm 1} \}$ index the front-view (spatial-view) relative position of the two pixels in this connectivity pair. Every channel of $\tilde{C}$ represents the bidirectional pixel connection probability of a specific direction.

Fig. 4. The connectivity modeling process. In the BV module, every connectivity pair in C is multiplied to generate a new map ($\tilde{C}$). In the RCA module, a channel-wise aggregation function f is applied to every connectivity vector (highlighted in yellow) to generate a single-channel map.

Download Full Size | PDF

As stated in Section 2.3, connectivity is defined only for adjacent positive pixels. Thus, pixel connectivity and pixel positivity are closely correlated. If two neighboring pixels are positive, then they are connected; conversely, if we know two pixels are connected, then both are positive. Therefore, the probability of a pixel being connected with others is the probability of it being positive. This reverse inference is done in the RCA module (Fig. 4). Given the Bicon map $\tilde{C}$, we derive the overall connectivity probability map via a channel-wise aggregation function $f$:

(2)$$\tilde{S}({x,y} )= f\{{{{\tilde{C}}_i}({x,y} )} \}_{i = 1}^8, $$

where f is an adaptive aggregating operation that varies with location (x, y), i is the ${i_{th}}$ channel, and ${\tilde{C}_i}$ is one channel of $\tilde{C}$ showing the bidirectional connection probability of the ${i_{th}}$ direction. $\tilde{S}$ is a single-channel map representing the overall aggregated probability of each pixel being positive. Here we use two types of aggregation methods to define f, resulting in two single-channel output maps. In the first method, which is different from the method in [28], we take the maximum value among channels to construct the global map ${\tilde{S}_{global}}$:

(3)$${\tilde{S}_{global}}({x,y} )= max\{{{{\tilde{C}}_i}({x,y} )} \}_{i = 1}^8\; . $$

Using the maximum connection probability across channels, we enforce a high probability for the pixel at (x, y) of ${\tilde{S}_{global}}$ to be positive as long as it is connected with at least one of its neighbors. This strategy encourages the pixel to focus on learning its connectivity with its highest likely connected neighboring pixel, thus alleviating the effects of noisy pixels. Next, to emphasize the boundary of the epithelial layer, we use a second method, called edge-guided aggregation, which combines the channels differently at edge and non-edge locations. This yields a new map called the edge-decouple map, ${\tilde{S}_{decouple}}$, as we described previously [28]:

(4)$${\tilde{S}_{decouple}}({x,y} )= \left\{ {\begin{array}{cc} {1 - min\{{{{\tilde{C}}_i}({x,y} )} \}_{i = 1}^8}&{({x,y} )\in {P_{edge}}}\\ {max\{{{{\tilde{C}}_i}({x,y} )} \}_{i = 1}^8}&{({x,y} )\notin {P_{edge}}} \end{array}} \right., $$

where $\,{P_{edge}}$ is the set of ground truth edge pixels which are obtained from the connectivity mask [28]. Both ${\tilde{S}_{global}}$ and ${\tilde{S}_{decouple}}$ are used during the training process, and ${\tilde{S}_{global}}$ is used as the final prediction in the testing phase.

2.6 Loss function

As in [28], we define the overall loss function of our network as

(5)$${L_{bicon}} = {L_{decouple}} + {L_{con\_const}} + {L_{dice.}}. $$

The first term, ${L_{decouple}}$, is the edge-decoupled loss, which is the binary cross-entropy (BCE) loss between ${\tilde{S}_{decouple}}$ and the ground truth segmentation mask ${G_s}$:

(6)$${L_{decouple}} = {L_{BCE}}({{{\tilde{S}}_{decouple}},{G_s}} )= \left\{ {\begin{array}{cc} {{L_{BCE}}({1 - min\{{{{\tilde{C}}_i}({x,y} )} \}_{i = 1}^8,{G_S}({x,y} )} )}&{({x,y} )\in {P_{edge}}}\\ {{L_{BCE}}({max\{{{{\tilde{C}}_i}({x,y} )} \}_{i = 1}^8,{G_S}({x,y} )} )}&{({x,y} )\notin {P_{edge}}} \end{array}} \right.. $$

L_con_{_}_const is the connectivity consistency loss. Unlike in [28], where this loss is defined as a weighted sum of BCE losses applied to both the Conn map (L_conmap) and the Bicon map (L_bimap), here we define the connectivity consistency loss only for the Conn map:

(7)$${L_{con\_const}} = {L_{conmap}} = {L_{BCE}}(C,{G_C}). $$

We define L_con_{_}_const only for the Conn map because L_bimap automatically gives greater weight to boundary pixels and less weight to background regions [28]. This strategy works well in natural images because the inter-class difference between background and positive pixels is relatively large, while the intra-class difference between positive pixels is small. However, in in vivo human esophageal OCT images, due to the limited information in grey level pixels, the lower image quality, and the complex characteristics of tissues, it is usually hard to observe a large overall inter-pixel difference between epithelial layer pixels and pixels of other layers. Therefore, we use L_con_{_}_const = L_conmap to give the same weight to positive pixels and background pixels.

The third term in Eq. (5), L_dice, is the dice loss [32], defined as

(8)$${L_{dice}} = \; 1 - \frac{{2 \times \mathop \sum \nolimits_{i,j}^{H,W} ({{{\tilde{S}}_{global}} \times {G_S}} )}}{{\mathop \sum \nolimits_{i,j}^{H,W} {{\tilde{S}}_{global}} + \; \mathop \sum \nolimits_{i,j}^{H,W} {G_S}}}\; \; ,$$

where $H\; \textrm{and\; }W$ are the height and width of the input image, respectively.

2.7 Training

We used 10-fold cross-validation, where for each fold we randomly chose 3 subjects that were not previously selected for testing, and used the remaining 27 subjects for training. There was no overlap between the training and testing sets.

We pretrained the CE-Net backbone on ImageNet, as in the original CE-Net paper [29]. We trained Bicon-CE on cropped data, which contained only the trainable interval. We did not perform data augmentation during training and kept the original aspect ratio of the ROI images for training and testing. We used a mini-batch with batch size of 8 to train the network. To use the mini-batch while keeping the original aspect ratio, for every batch, we first found the maximum width (x_max) and height (y_max) of the images in the batch. Then, we zero padded every image in the batch to the size of (x_max, y_max). We used the Adam optimizer with $({{\beta_1},\; {\beta_2}} )= ({0.9,\; 0.999} )$ and weight decay = 0.0001. We trained our network for 45 epochs in total. The learning rate was initially 2e-4, and was reduced by a factor of 0.2 at the 30^th epoch.

2.8 Prediction

In the testing phase, we used uncropped ROI images as the network inputs. We obtained the final prediction by thresholding the output global map ${\tilde{S}_{global}}$ at 0.5 (Fig. 2).

3. Experimental design and results

3.1 Evaluation metrics

We calculated the dice coefficient (DSC) [33] for each prediction as

(9)$$DSC = \; \frac{{2 \times TP}}{{2 \times TP + FP + FN}}, $$

where TP is the number of true positive pixels, FP is the number of false positive pixels, and FN is the number of false negative pixels in the predicted binary map. DSC ranges from 0 to 1, where a higher value means the prediction is closer to the gold standard. While DSC is a commonly used metric in medical segmentation, it is not sensitive to small outliers. Therefore, we also calculated the mean total error, ${E_t}$, and the mean net error, ${E_n}$ [34], of the predicted epithelial layer:

(10)$${E_t} = \frac{k}{N} \times |{FP + FN} |, $$

(11)$${E_n} = \frac{k}{N} \times |{FP - FN} |, $$

where N is the total number of B-scan columns and k = 6.5 is the scaling factor for converting pixels to microns. To quantify tissue characteristics, we calculated the overall thickness of the predicted epithelial layer. To evaluate the statistical significance of results, we calculated p-values using the Wilcoxon signed-rank test, where p < 0.05 indicated statistical significance.

3.2 Comparison with alternative methods

We compared our approach with three widely used medical image segmentation models, U-Net, UNet++ [35], and CE-Net (Table 1). Bicon-CE outperformed U-Net, U-Net++, [35] and CE-Net across all listed metrics. The average DSC of our method was significantly higher (p < 0.0001) than that of CE-Net (5.1% higher), U-Net++ (5.2% higher), and U-Net (6.7% higher). The overall layer thickness values of Bicon-CE were closer to the gold standard (Grader #1) than those of the other two methods. Both the net error and the total error of Bicon-CE segmentations were lower than those of U-Net, U-Net++, and CE-Net. Other than Bicon-CE, we constructed two other Bicon version models: Bicon-UNet and Bicon-UNet++, and reported their results. The analysis of the alternative Bicon version models is included in Section 3.3.

Table 1. Statistical comparisons between Bicon-CE, alternative methods, and human graders. The numbers in each field denote mean ± standard deviation, median across 30 subjects (29 subjects for Grader #2). The best performance scores are written in bold. The results are evaluated over the graders’ consensus intervals.

View Table | View all tables in this article

We also compared the labels of Graders #2 and #3 with those of the more experienced Grader #1 (Table 1, row 2-3). Grader #2 could not confidently grade one of the subjects and therefore graded only 29 of the 30 subjects; Grader #3 graded all subjects. The results show variability and disagreement between human graders, reflecting the subjectivity and challenging nature of this task. We can also see that none of the baseline models outperformed the manual segmentation results of Grader #3, while Bicon-CE significantly outperformed it, demonstrating the effectiveness of our model. Note that the results in Table 1 were evaluated over the graders’ consensus intervals to provide a fair comparison between different graders as they had different trainable intervals.

3.3 Performance of connectivity modeling

In the previous section, we showed that Bicon-CE performed better than CE-Net, U-Net++, and U-Net. Here, we show that for the tested models: (1) connectivity modeling is superior to pixel classification-based methods in general and is independent of the backbone selection; (2) our Bicon-based method is compatible with other image segmentation models. For this, we constructed the Bicon enhanced models of U-Net (Bicon-UNet) and U-Net++ (Bicon-UNet++). In Table 2, we compare the Bicon enhanced models with the corresponding baseline networks. We used Grader #1’s trainable interval for evaluation since it reflects the performances over a wider region. All three Bicon enhanced models significantly (p < 0.0001) outperformed their corresponding baseline methods, indicating the effectiveness of the connectivity modeling method. Moreover, the extra computational cost of our method is negligible; for example, Bicon-UNet increased DSC by 6.3% compared to U-Net with only 455 extra parameters. These results also show that the connectivity modules are compatible with other pixel classification-based neural networks.

Table 2. Statistical comparisons of connectivity-based models (Bicon-CE, Bicon-UNet++, and Bicon-UNet) with CE-Net, UNet++, and U-Net. The numbers in each field denote mean ± standard deviation, median across 30 subjects. The best performance scores are written in bold. The results here were evaluated over the Grader #1’s trainable intervals. FPS: frame per second.

View Table | View all tables in this article

This connectivity modeling method reduces topological problems such as outlier prediction, disconnected prediction, and non-simply connected prediction (examples in Fig. 5). As shown in (a), both CE-Net and U-Net generated an outlier region and a non-simply connected region, which did not occur in Bicon-CE or Bicon-UNet. UNet++ predicted an incorrect boundary likely due to artifacts, while Bicon-UNet++ avoided it. In (b), likely due to the non-uniform intensity of the esophagus, none of the three baselines made a horizontally continuous prediction. But the three Bicon enhanced models all made continuous predictions that covered the layer region. In (c), likely due to strong ring artifacts, the baselines were negatively affected around the artifact area; however, three Bicon enhanced models made a continuous prediction despite these artifacts. The feature space of Bicon-CE also demonstrates its ability to extract layered features while avoiding artifacts (See Supplement S1 for supporting content).

Fig. 5. Qualitative comparisons of connectivity-based modeling and pixel classification-based methods. Examples of topological errors such as outlier prediction, disconnected prediction, and non-simply connected prediction are highlighted in green, blue, and yellow bounding boxes, respectively. The trainable interval of Grader #1 is marked at the bottom of each image.

Download Full Size | PDF

Comparisons to the results of Bicon-UNet and Bicon-UNet++ demonstrate the effectiveness and efficiency of Bicon-CE. Compared to Bicon-UNet, Bicon-CE achieved significantly lower net thickness error. Compared to Bicon-UNet++, Bicon-CE had significantly less net and total thickness errors. As shown in Fig. 5, although all three Bicon models largely avoided the topological issues, Bicon-CE predicted a smoother boundary and a more continuous shape even in the presence of strong artifacts compared to the other two. To show the efficiency of Bicon-CE, we reported the processing speeds of all networks when tested on OCT B-scans of size 512×512 pixels in Table 2. Bicon-CE was faster than Bicon-UNet and Bicon-UNet++ even though it had a larger number of parameters than Bicon-UNet. Thus, we chose CE-Net as the backbone as it enabled the extraction of high-resolution multi-level features while maintaining a fast-processing speed.

3.4 Robustness analysis

Automated segmentation of clinical in vivo human esophageal OCT images is challenging not only due to imaging artifacts and noise, but also due to patient-specific outliers such as irregular tissue shape, mucus, and in-layer image intensity non-uniformity [36]. Examples of the robustness of our method are shown in qualitative comparisons between our method and others under different scenarios in Fig. 6: (a) mucus covering; (b) OCT discontinuity; (c-d) imaging artifacts; (e) low contrast imaging; (f) non-uniform intensity. Mucus is produced by glands in the esophageal lining to keep the passageway moist; however, due to variable thickness and scattering content, mucus does not always appear in OCT images. Therefore, it is important for the automated method to handle segmentation with and without mucus. As shown in Fig. 6 (a), neither U-Net nor CE-Net could accurately exclude mucus from the epithelial layer, while Bicon-CE made a precise prediction. Figure 6 (b) shows an example of OCT discontinuity, which can be caused by non-uniform refractive index in an overlying layer (such as a bubble) causing a sudden apparent change in depth of the tissue. Again, only Bicon-CE predicted an accurate segmentation. Figure 6 (c-d) show examples of two imaging artifacts: (c) a ring artifact caused by internal reflection in the probe, and (d) an artifact (top left) caused by the adhesive on the inside of the probe paddle window. Both U-Net and CE-Net were misled by these artifacts and gave erroneous segmentations, whereas Bicon-CE gave an accurate segmentation. Figure 6 (e) shows an example of low contrast imaging, caused inadvertently by a non-optimized selection of the OCT reference position. Again, only our method robustly handled this situation. Lastly, Fig. 6 (f) shows a case of non-uniform intensity (highlighted by yellow arrows), which may be due to a duct or vessel that was fluid-filled and weakly reflective. Both U-Net and CE-Net were affected by this non-uniformity in the tissue, whereas Bicon-CE made a continuous prediction. These results validate our motivation: by focusing more on inter-pixel relationships, Bicon-CE makes a connected prediction and avoids topological errors.

Fig. 6. Visualization of model performance in different scenarios: (a) mucus covering (blue shaded area); (b) OCT discontinuity; (c)-(d) imaging artifacts; (e) low contrast and (f) non-uniform intensity. The region indicated by the yellow arrows in (f) is likely a fluid-filled region. The trainable interval of Grader #1 is marked at the bottom of each image.

Download Full Size | PDF

We further tested the robustness of the algorithm with respect to the human grading used for training and testing. In supplementary material S2, we used the majority-based markings of all graders as the gold standard. These experiments demonstrate that Bicon-CE still performed superior to other techniques (and is close to human grading).

3.5 Clinical potential

Bicon-CE was robust under different clinical conditions, as shown in Section 3.4. Here we demonstrate the potential applicability of our segmentation model in detecting BE by assessing the performance of our method on images from diseased patients (6 non-dysplastic BE and 1 had BE with low-grade dysplasia) and healthy subjects. The results are summarized in Table 3. For both patient groups, our model outperformed the baseline neural networks and the human graders by achieving significantly higher DSC scores (p < 0.0001). Representative examples are visualized in Fig. 7, where Bicon-CE shows a smoother and more continuous prediction of the segmented layer.

Fig. 7. Qualitative comparison between BE samples and healthy samples. The trainable interval of Grader #1 is marked at the bottom of each image.

Download Full Size | PDF

Table 3. Statistical comparisons of Bicon-CE, CE-Net, UNet++, and U-Net, and human graders on samples from volumes of 23 healthy (HL) and of 7 diseased (DE) patients. The numbers in each field denote mean ± standard deviation, median across all subjects in each group. The best performance scores are written in bold. The results are evaluated over the graders’ consensus intervals.

View Table | View all tables in this article

With the new low-cost imaging device [7], we were able to quantify the changes in epithelial layer thickness due to BE. The results in Table 3 show that healthy subjects’ mean overall epithelial layer thickness was significantly larger (p < 0.001) than those with BE. Compared to the baselines, Bicon-CE’s estimated difference in mean layer thickness (28.2 μm) between the normal and BE subjects was closer to the gold standard (21.8 μm) while maintaining significantly lower (p < 0.0001) net error and total error. Although the BE-related change in epithelial layer thickness has not been proven in a large-scale randomized clinical trial, we believe this pilot observation can provide a potential guide for further studies of BE.

Lastly, we investigated the computational cost of our method for clinical application. In our experiments, the processing time for a single esophageal OCT ROI image in our dataset was 0.024 ${\pm} $ 0.005 seconds (median, 0.023 seconds) on an Ubuntu system with a GTX 2080Ti GPU running on Pytorch with data loaded from a solid-state drive. The average acquisition time for the full field-of-view OCT B-scan was ∼0.05 seconds. Thus, our method can potentially be utilized in the clinic for real-time segmentation of the epithelial layer.

4. Discussion

In this work we proposed a bilateral connectivity-based neural network to accurately segment the epithelial layer from in vivo human esophageal OCT images. This network, Bicon-CE, models the single-class segmentation task as a combination of pixel connectivity modeling and pixel-wise tissue classification. Bicon-CE significantly outperformed popular alternative segmentation models (U-Net, U-Net++, and CE-Net), and outperformed human graders. Connectivity-based versions of these models were superior to the baseline methods, indicating the general superiority of a connectivity modeling approach. The robustness of Bicon-CE was shown by testing it under different image artifacts and variants. The potential clinical application of Bicon-CE was shown by its ability to segment the epithelium in samples from patients with BE and to detect the potential thickness changes due to BE, suggesting that it can be used as a part machine learning approach for accurate real-time monitoring of esophageal diseases.

To the best of our knowledge, Bicon-CE is the first end-to-end layer segmentation algorithm for in vivo human esophageal OCT images. Processing in vivo OCT images is more challenging than processing ex vivo OCT images due to the sometimes-lower image quality of in vivo images and due to variations between images due to imaging artifacts and disease conditions. In our study, manual segmentation of in vivo OCT data was challenging even for our most experienced grader, and the labeling by the three graders showed great disagreement. We defined a “trainable interval” over which the labels were of high confidence for training; as a result, we lost some information from the unlabeled regions. Our model could be further improved with better manual segmentation labels, with a larger dataset, or with data augmentation (e.g., by simulating the ring artifacts) in the training stage. Different methods can be used to generate the gold standard labels for training deep learning algorithms and assessing their performance. In Tables 1–3, we used the single expert Grader #1 (KKC), with significant experience in assessing esophageal OCT images, as the gold standard. In addition, in the supplementary material S2, we used the majority-based markings of all three graders as the gold standard. In both cases, Bicon-CE was shown to have superior performance. At the inference stage, while we evaluated the results only across the trainable interval, the predictions of our CNN were made on the entire image. Bicon-CE’s predictions in these “non-trainable regions” were rational (see Fig. 6), suggesting that Bicon-CE has the potential to provide guidance to clinicians for interpreting data even in challenging areas.

There are a few possible avenues for improving our method. First, this work can be extended to multi-layer segmentation problems given a dataset that is labeled for multiple layers and with improvements in the model architecture. Technically, although our Bicon-CE can be readily extended to a multi-class model by changing its output layers, exploiting the special properties of multi-class data could produce even a stronger model for the multi-class segmentation problem. For example, in multi-class data, there exists not only intra-class connectivity but also inter-class relationship. To utilize this property, a channel-wise attention module could potentially capture the inter-class information. Secondly, the number of individuals with BE in our dataset was small compared to the number of healthy subjects. Our method could be improved by increasing the amount of BE training data. Furthermore, although the BE-related changes in the thickness of the epithelium as measured on OCT haven’t been investigated in a large-scale randomized clinical trial, our observation provides a potential direction for future studies on the diagnosis and prognosis of BE using in vivo OCT imaging. As part of our future studies, we will utilize this technology for differentiating different stages of esophageal diseases. Last, the processing speed can be improved with a hardware update. We envision that our deep learning method will reduce the workload of human grading and improve the accuracy of segmenting the epithelial layer in in vivo human esophageal OCT images.

Funding

National Institutes of Health (1R01CA210544, P30EY005722); Duke University (Research to Prevent Blindness Unrestricted Grant); Hartwell Foundation (Postdoctoral Fellowship); Duke University Fitzpatrick Institute for Photonics (the Chambers Fellowship).

Disclosures

Adam Wax is founder and president of Lumedica, Inc.

Data availability

Esophageal OCT images and their corresponding manual annotations by the three graders used in this paper are available at [29].

Supplemental document

See Supplement 1 for supporting content.

References

1. H. Sung, J. Ferlay, R. L. Siegel, M. Laversanne, I. Soerjomataram, A. Jemal, and F. Bray, “Global cancer statistics 2020: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries,” CA A Cancer J Clin 71(3), 209–249 (2021). [CrossRef]

2. R. C. Haggitt, “Barrett's esophagus, dysplasia, and adenocarcinoma,” Human Pathol. 25(10), 982–993 (1994). [CrossRef]

3. B. J. Reid, W. M. Weinstein, K. J. Lewin, R. C. Haggitt, G. VanDeventer, L. DenBesten, and C. E. Rubin, “Endoscopic biopsy can detect high-grade dysplasia or early adenocarcinoma in Barrett's esophagus without grossly recognizable neoplastic lesions,” Gastroenterology 94(1), 81–90 (1988). [CrossRef]

4. D. Huang, E. A. Swanson, C. P. Lin, J. S. Schuman, W. G. Stinson, W. Chang, M. R. Hee, T. Flotte, K. Gregory, and C. A. Puliafito, “Optical coherence tomography,” Science 254(5035), 1178–1181 (1991). [CrossRef]

5. G. J. Tearney, M. E. Brezinski, B. E. Bouma, S. A. Boppart, C. Pitris, J. F. Southern, and J. G. Fujimoto, “In vivo endoscopic optical biopsy with optical coherence tomography,” Science 276(5321), 2037–2039 (1997). [CrossRef]

6. B. E. Bouma, G. J. Tearney, C. C. Compton, and N. S. Nishioka, “High-resolution imaging of the human esophagus and stomach in vivo using optical coherence tomography,” Gastrointest. Endosc. 51(4), 467–474 (2000). [CrossRef]

7. Y. Zhao, K. Chu, M. Crose, Y. Ofori-Marfoh, H. Cirri, N. Shaheen, and A. Wax, “Esophageal OCT using endoscope-coupled paddle probe (Conference Presentation),” Proc. SPIE10854, 108540K (2019).

8. E. T. Jelly, W. Kendall, R. Schmitz, S. J. Knechtle, D. L. Sudan, A. Joseph, J. Roper, J. Kwun, and A. Wax, “Novel implementations of optical coherence tomography for clinical applications in the lower gastrointestinal tract,” in Biophotonics Congress: Biomedical Optics 2020 (Translational, Microscopy, OCT, OTS, BRAIN), OSA Technical Digest (Optical Society of America, 2020), OTu4E.4.

9. S. Kim, M. Crose, L. A. Kresty, and A. Wax, “Guidance of angle-resolved low coherence interferometry using co-located optical coherence tomography on rat esophageal tissue,” in Biomedical Optics2016, OSA Technical Digest (online) (Optical Society of America, 2016), JTu3A.16.

10. M. J. Suter, M. J. Gora, G. Y. Lauwers, T. Arnason, J. Sauk, K. A. Gallagher, L. Kava, K. M. Tan, A. R. Soomro, T. P. Gallagher, J. A. Gardecki, B. E. Bouma, M. Rosenberg, N. S. Nishioka, and G. J. Tearney, “Esophageal-guided biopsy with volumetric laser endomicroscopy and laser cautery marking: a pilot clinical study,” Gastrointest. Endosc. 79(6), 886–896 (2014). [CrossRef]

11. P. A. Testoni and B. Mangiavillano, “Optical coherence tomography in detection of dysplasia and cancer of the gastrointestinal tract and bilio-pancreatic ductal system,” World J. Gastroenterol. 14(42), 6444 (2008). [CrossRef]

12. Z. Liu, J. Xi, M. Tse, A. C. Myers, X. Li, P. J. Pasricha, and S. Yu, “426 allergic inflammation-induced structural and functional changes in esophageal epithelium in a guinea pig model of eosinophilic esophagitis,” Gastroenterology 146(5), S-92 (2014). [CrossRef]

13. J. Sauk, E. Coron, L. Kava, M. Suter, M. Gora, K. Gallagher, M. Rosenberg, A. Ananthakrishnan, N. Nishioka, and G. Lauwers, “Interobserver agreement for the detection of Barrett’s esophagus with optical frequency domain imaging,” Dig. Dis. Sci. 58(8), 2261–2265 (2013). [CrossRef]

14. J. M. Poneros and N. S. Nishioka, “Diagnosis of Barrett's esophagus using optical coherence tomography,” Gastrointest. Endosc. Clin. N. Am. 13(2), 309–323 (2003). [CrossRef]

15. W. Hameeteman, G. Tytgat, H. Houthoff, and V. D. Tweel, “Barrett's esophagus; development of dysplasia and adenocarcinoma,” Gastroenterology 96(5), 1249–1256 (1989). [CrossRef]

16. J. Zhang, W. Yuan, W. Liang, S. Yu, Y. Liang, Z. Xu, Y. Wei, and X. Li, “Automatic and robust segmentation of endoscopic OCT images and optical staining,” Biomed. Opt. Express 8(5), 2697–2708 (2017). [CrossRef]

17. M. Gan, C. Wang, T. Yang, N. Yang, M. Zhang, W. Yuan, X. Li, and L. Wang, “Robust layer segmentation of esophageal OCT images based on graph search using edge-enhanced weights,” Biomed. Opt. Express 9(9), 4481–4495 (2018). [CrossRef]

18. C. Wang, M. Gan, N. Yang, T. Yang, M. Zhang, S. Nao, J. Zhu, H. Ge, and L. Wang, “Fast esophageal layer segmentation in OCT images of guinea pigs based on sparse Bayesian classification and graph search,” Biomed. Opt. Express 10(2), 978–994 (2019). [CrossRef]

19. S. J. Chiu, X. T. Li, P. Nicholas, C. A. Toth, J. A. Izatt, and S. Farsiu, “Automatic segmentation of seven retinal layers in SDOCT images congruent with expert manual segmentation,” Opt. Express 18(18), 19413–19428 (2010). [CrossRef]

20. O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional networks for biomedical image segmentation,” in 18th Medical Image Computing and Computer-Assisted Intervention (MICCAI), (Springer, 2015), 234–241.

21. D. Li, J. Wu, Y. He, X. Yao, W. Yuan, D. Chen, H.-C. Park, S. Yu, J. L. Prince, and X. Li, “Parallel deep neural networks for endoscopic OCT image segmentation,” Biomed. Opt. Express 10(3), 1126–1135 (2019). [CrossRef]

22. C. Wang and M. Gan, “Tissue self-attention network for the segmentation of optical coherence tomography images on the esophagus,” Biomed. Opt. Express 12(5), 2631–2646 (2021). [CrossRef]

23. C. Wang, M. Gan, M. Zhang, and D. Li, “Adversarial convolutional network for esophageal tissue segmentation on OCT images,” Biomed. Opt. Express 11(6), 3095–3110 (2020). [CrossRef]

24. C. Chu, K. Minami, and K. Fukumizu, “Smoothness and stability in gans,” in International Conference on Learning Representations (ICLR), (2020), pp. 1–30.

25. O. Oktay, E. Ferrante, K. Kamnitsas, M. Heinrich, W. Bai, J. Caballero, S. A. Cook, A. d. Marvao, T. Dawes, D. P. O’Regan, B. Kainz, B. Glocker, and D. Rueckert, “Anatomically constrained neural networks (ACNNs): Application to cardiac image enhancement and segmentation,” IEEE Trans. Med. Imaging 37(2), 384–395 (2018). [CrossRef]

26. K. Kurach, M. Lučić, X. Zhai, M. Michalski, and S. Gelly, “A large-scale study on regularization and normalization in GANs,” in Proceedings of the 36th International Conference on Machine Learning (ICML), (PMLR, 2019), pp. 3581–3590.

27. G. J. Ughi, M. J. Gora, A.-F. Swager, A. Soomro, C. Grant, A. Tiernan, M. Rosenberg, J. S. Sauk, N. S. Nishioka, and G. J. Tearney, “Automated segmentation and characterization of esophageal wall in vivo by tethered capsule optical coherence tomography endomicroscopy,” Biomed. Opt. Express 7(2), 409–419 (2016). [CrossRef]

28. Z. Yang, S. Soltanian-Zadeh, and S. Farsiu, “BiconNet: an edge-preserved connectivity-based approach for salient object detection,” Pattern Recognit. 121, 108231 (2022). [CrossRef]

29. Z. Yang, S. Soltanian-Zadeh, K. K. Chu, H. Zhang, L. Moussa, A. E. Watts, N. J. Shaheen, A. Wax, and S. Farsiu, “Connectivity-based deep learning approach for segmentation of the epithelium in in vivo human esophageal OCT images,” Duke University Repository (2021). http://people.duke.edu/∼sf59/Yang_BOE_2021.htm

30. Z. Gu, J. Cheng, H. Fu, K. Zhou, H. Hao, Y. Zhao, T. Zhang, S. Gao, and J. Liu, “CE-Net: Context encoder network for 2D medical image segmentation,” IEEE Trans. Med. Imaging 38(10), 2281–2292 (2019). [CrossRef]

31. R. C. Gonzalez and R. E. Woods, Digital Image Processing (3rd Edition) (Prentice-Hall, Inc., 2006).

32. F. Milletari, N. Navab, and S. Ahmadi, “V-Net: Fully convolutional neural networks for volumetric medical image segmentation,” inFourth International Conference on 3D Vision, (2016), pp. 565–571.

33. L. R. Dice, “Measures of the amount of ecologic association between species,” Ecology 26(3), 297–302 (1945). [CrossRef]

34. J. Loo, L. Fang, D. Cunefare, G. J. Jaffe, and S. Farsiu, “Deep longitudinal transfer learning-based automatic segmentation of photoreceptor ellipsoid zone defects on optical coherence tomography images of macular telangiectasia type 2,” Biomed. Opt. Express 9(6), 2681–2698 (2018). [CrossRef]

35. Z. Zhou, M. M. R. Siddiquee, N. Tajbakhsh, and J. Liang, “UNet++: redesigning skip connections to exploit multiscale features in image segmentation,” IEEE Trans. Med. Imaging 39(6), 1856–1867 (2020). [CrossRef]

36. J. Xi, A. Zhang, Z. Liu, W. Liang, L. Y. Lin, S. Yu, and X. Li, “Diffractive catheter for ultrahigh-resolution spectral-domain volumetric OCT imaging,” Opt. Lett. 39(7), 2016–2019 (2014). [CrossRef]

Methods	Metric
Methods	DSC	Thickness (μm)	$E_{n}$ (μm)	$E_{t}$ (μm)	P-value^a
Gold Standard	-	314.2 ± 93.3, 304.5	-	-	-
Grader #2	0.764 ± 0.227, 0.834	249.0 ± 68.2, 243.5	90.1 ± 85.7, 56.6	131.2 ± 126.4, 82.6	<0.0001
Grader #3	0.864 ± 0.160, 0.917	357.3 ± 99.3, 360.6	59.0 ± 74.0, 26.8	85.0 ± 92.5, 52.0	<0.0001
U-Net	0.846 ± 0.125, 0.883	345.4 ± 91.9, 320.7	64.6 ± 66.0, 45.1	99.3 ± 82.6, 73.7	<0.0001
Bicon-UNet	0.891 ± 0.117, 0.933	325.8 ± 84.4, 312.0	38.7 ± 49.8, 22.1	64.0 ± 65.5, 41.5	-
U-Net++	0.858 ± 0.143, 0.904	332.6 ± 80.8, 301.6	52.8 ± 60.9, 30.7	82.2 ± 76.2, 57.1	<0.0001
Bicon-UNet++	0.879 ± 0.117, 0.918	333.3 ± 77.8, 323.5	49.6 ± 63.9, 28.7	74.2 ± 69.0 51.5	-
CE-Net	0.859 ± 0.116, 0.897	334.1 ± 76.9, 315.0	51.9 ± 59.3, 33.1	87.9 ± 71.9, 64.5	<0.0001
Bicon-CE	0.903 ± 0.105. 0.934	324.5 ± 82.8, 308.9	33.7 ± 40.7, 20.2	56.7 ± 52.8, 39.9	-

Methods	Metric
Methods	DSC	Thickness (μm)	$E_{n}$ (μm)	$E_{t}$ (μm)	Number of Param.	Speed (FPS)	P-value^a
Gold Standard	-	311.1 ± 92.4, 298.9	-	-			-
U-Net	0.822 ± 0.153, 0.867	333.2 ± 88.8, 320.2	61.8 ± 61.4, 44.7	106.6 ± 76.5, 86.5	13,394,177	40	<0.0001
Bicon-UNet	0.874 ± 0.130, 0.918	315.3 ± 79.6, 306.4	40.0 ± 47.7, 22.89	73.4 ± 68.1, 51.6	13,394,632	39	-
U-Net++	0.858 ± 0.143, 0.904	332.6 ± 80.8, 301.6	52.8 ± 60.9, 30.7	82.2 ± 76.2, 57.0	36,628,481	11	<0.0001
Bicon-UNet++	0.843 ± 0.107, 0.873	333.3 ± 77.8, 323.5	49.6 ± 63.9, 28.7	74.2 ± 69.0, 51.5	36,628,936	11	-
CE-Net	0.859 ± 0.116, 0.897	330.0 ± 71.5, 312.7	51.3 ± 51.2, 35.8	97.6 ± 65.4, 77.1	28,996,821	44	<0.0001
Bicon-CE	0.889 ± 0.101, 0.920	312.3 ± 75.9, 300.3	35.3 ± 38.6, 21.4	64.8 ± 49.9, 48.9	28,998,844	43	-

Vol.	Methods	Metric
Vol.	Methods	DSC	Thickness (μm)	$E_{n}$ (μm)	$E_{t}$ (μm)	P-value^a
HL	Gold Standard	-	319.0 ± 97.3, 305.4	-	-	-
	Grader #2	0.750 ± 0.227, 0.834	244.3 ± 64.1, 237.5	99.5 ± 91.6, 62.3	142.8 ± 133.1, 87.8	<0.0001
	Grader #3	0.865 ± 0.159, 0.914	360.4 ± 101.0, 360.6	56.5 ± 68.0, 28.4	85.2 ± 92.1, 54.0	<0.0001
	U-Net	0.846 ± 0.117, 0.883	354.2 ± 96.1, 326.7	68.7 ± 69.7, 48.8	102.9 ± 81.1, 76.0	<0.0001
	U-Net++	0.865 ± 0.124, 0.904	322.3 ± 81.0, 313.4	53.6 ± 61.4, 31.5	81.0 ± 68.3, 57.8	<0.0001
	CE-Net	0.858 ± 0.111, 0.892	344.8 ± 76.8, 326.2	53.9 ± 64.7, 35.1	90.9 ± 70.7, 68.1	<0.0001
	Bicon-CE	0.910 ± 0.081, 0.934	330.7 ± 87.7, 314.3	33.9 ± 39.8, 20.9	54.4 ± 44.8, 40.5	-
DE	Gold Standard	-	297.2 ± 74.8, 301.5	-	-	-
	Grader #2	0.815 ± 0.191, 0.881	265.5 ± 78.8, 270.6	57.7 ± 48.9, 44.7	90.9 ± 88.5, 65.0	<0.0001
	Grader #3	0.861 ± 0.159, 0.914	346.1 ± 92.3, 354.1	67.5 ± 91.6, 20.4	84.5 ± 89.3, 41.0	<0.0001
	U-Net	0.848 ± 0.149, 0.907	313.9 ± 66.5, 301.1	50.1 ± 47.5, 31.7	86.5 ± 86.6, 61.0	<0.0001
	U-Net++	0.833 ± 0.195, 914.0	278.2 ± 70.0, 284.9	49.6 ± 59.0, 26.8	86.3 ± 99.1, 52.2	<0.0001
	CE-Net	0.863 ± 0.132, 0.917	296.1 ± 64.1, 279.1	44.6 ± 49.1, 29.8	77.1 ± 74.9, 50.5	<0.0001
	Bicon-CE	0.878 ± 0.160, 0.937	302.5 ± 56.8, 299.5	33.1 ± 43.8, 13.9	64.6 ± 74.1, 40.0	-

Methods	Metric
Methods	DSC	Thickness (μm)	$E_{n}$ (μm)	$E_{t}$ (μm)	P-value^a
Gold Standard	-	314.2 ± 93.3, 304.5	-	-	-
Grader #2	0.764 ± 0.227, 0.834	249.0 ± 68.2, 243.5	90.1 ± 85.7, 56.6	131.2 ± 126.4, 82.6	<0.0001
Grader #3	0.864 ± 0.160, 0.917	357.3 ± 99.3, 360.6	59.0 ± 74.0, 26.8	85.0 ± 92.5, 52.0	<0.0001
U-Net	0.846 ± 0.125, 0.883	345.4 ± 91.9, 320.7	64.6 ± 66.0, 45.1	99.3 ± 82.6, 73.7	<0.0001
Bicon-UNet	0.891 ± 0.117, 0.933	325.8 ± 84.4, 312.0	38.7 ± 49.8, 22.1	64.0 ± 65.5, 41.5	-
U-Net++	0.858 ± 0.143, 0.904	332.6 ± 80.8, 301.6	52.8 ± 60.9, 30.7	82.2 ± 76.2, 57.1	<0.0001
Bicon-UNet++	0.879 ± 0.117, 0.918	333.3 ± 77.8, 323.5	49.6 ± 63.9, 28.7	74.2 ± 69.0 51.5	-
CE-Net	0.859 ± 0.116, 0.897	334.1 ± 76.9, 315.0	51.9 ± 59.3, 33.1	87.9 ± 71.9, 64.5	<0.0001
Bicon-CE	0.903 ± 0.105. 0.934	324.5 ± 82.8, 308.9	33.7 ± 40.7, 20.2	56.7 ± 52.8, 39.9	-

Methods	Metric
Methods	DSC	Thickness (μm)	$E_{n}$ (μm)	$E_{t}$ (μm)	Number of Param.	Speed (FPS)	P-value^a
Gold Standard	-	311.1 ± 92.4, 298.9	-	-			-
U-Net	0.822 ± 0.153, 0.867	333.2 ± 88.8, 320.2	61.8 ± 61.4, 44.7	106.6 ± 76.5, 86.5	13,394,177	40	<0.0001
Bicon-UNet	0.874 ± 0.130, 0.918	315.3 ± 79.6, 306.4	40.0 ± 47.7, 22.89	73.4 ± 68.1, 51.6	13,394,632	39	-
U-Net++	0.858 ± 0.143, 0.904	332.6 ± 80.8, 301.6	52.8 ± 60.9, 30.7	82.2 ± 76.2, 57.0	36,628,481	11	<0.0001
Bicon-UNet++	0.843 ± 0.107, 0.873	333.3 ± 77.8, 323.5	49.6 ± 63.9, 28.7	74.2 ± 69.0, 51.5	36,628,936	11	-
CE-Net	0.859 ± 0.116, 0.897	330.0 ± 71.5, 312.7	51.3 ± 51.2, 35.8	97.6 ± 65.4, 77.1	28,996,821	44	<0.0001
Bicon-CE	0.889 ± 0.101, 0.920	312.3 ± 75.9, 300.3	35.3 ± 38.6, 21.4	64.8 ± 49.9, 48.9	28,998,844	43	-

Connectivity-based deep learning approach for segmentation of the epithelium in in vivo human esophageal OCT images

Abstract

1. Introduction

2. Methods

2.1 Dataset and annotation

2.2 Bilateral connectivity network with CE-Net backbone

2.3 Connectivity mask

2.4 Connectivity-based CE-Net

2.5 BV and RCA modules

2.6 Loss function

2.7 Training

2.8 Prediction

3. Experimental design and results

3.1 Evaluation metrics

3.2 Comparison with alternative methods

3.3 Performance of connectivity modeling

3.4 Robustness analysis

3.5 Clinical potential

4. Discussion

Funding

Disclosures

Data availability

Supplemental document

References

Supplementary Material (1)

Data availability

Cited By

Figures (7)

Tables (3)

Equations (11)

Biomedical Optics Express