Weakly supervised serous retinal detachment segmentation in SD-OCT images by two-stage learning

Ruiwen Xing; Ruiwen Xing; Sijie Niu; Sijie Niu; Xizhan Gao; Xizhan Gao; Tingting Liu; Tingting Liu; Wen Fan; Yuehui Chen; Yuehui Chen

doi:10.1364/BOE.416167

1. Introduction

Central serous chorioretinopathy (CSC) is a neuroepithelial detachment in the macular area or the posterior pole due to impaired pigment epithelial barrier function, resulting in fluid entering the nerve epithelium, accompanied by retinal pigment epithelial (RPE) detachment [1–3]. In recent years, the use of spectral domain optical coherence tomography (SD-OCT) imaging technology, in particular, has greatly improved the ability of ophthalmologists to detect subtle internal changes. This imaging technique can provide ophthalmologists with a large amount of available image data, especially the quantitative assessment of CSC in outline, shape and area. Recently, different shape and type pigment epithelial detachment (PED) with CSC are reported. The majority of choroidal neovascularisation (CNV) cases are linked to flat irregular PED. The greater width and en-face area of PED may point towards the presence of an underlying CNV network which maybe need anti-vascular endothelial growth factor (anti-VEGF) and half-dose photodynamic therapy (PDT) [4]. However, obtaining these quantitative data still requires a lot of manpower and resources. Hence an automatic segmentation method is highly desired in clinical applications.

Serous retinal detachment, such as pigment epithelial detachment and neurosensory retinal detachment (NRD), is a prominent characteristic of CSC. It is well known that accurate and automatic segmentation of CSC in SD-OCT images is very challenging, as shown in Fig. 1. There are two main reasons why this challenge has been allowed to continue. First, the region of lesion is almost the same with background, and the size of lesion region would generally have a large difference, so it’s difficult to precisely locate for the lesion region. Second, compared with natural images or MRI images, SD-OCT images are low contrast and weak boundaries, which is hard to achieve accurate segmentation. To address this problem, many studies have been reported to develop automatic CSC segmentation methods [5–7]. Especially, with the drastic advance of deep learning, recent deep networks have demonstrated successful performance in image segmentation task [8–11]. However, a major bottleneck to good performance is the high cost of getting high-quality annotations in the fully supervised learning method. Since it does not require expensive human efforts, weakly supervised image recognition methods [12–14] have been extensively studied.

Fig. 1. (a) SD-OCT image with the normal. (b) SD-OCT image with NRD. (c) SD-OCT image with NRD and PED. In our classification task, SD-OCT image with the normal and NRD are labeled as 0 and 1, respectively.

Download Full Size | PDF

Previous work has explored various alternative weak annotations, such as points [15], object bounding boxes [16], and scribble [17]. Among them, one of the most attractive approaches is to segment images only from image-level annotations methods [18–20] that require less human efforts. For such methods, the most critical problem to be solved is how to accurately and intensively locate the target area, so as to obtain high-quality target clues, and further improve segmentation model training performance [21,22]. And such methods achieved good results, even close to fully supervised learning performance on Pascal VOC-07/10/12 object detection datasets [23]. However, weakly supervised learning only used image-level label has not made a significant breakthrough and doesn’t show striking improvement on benchmark tasks in biomedical field. The reason is that, compared with annotation of the natural image, biomedical image segmentation data require professional labeling and a great deal of patience. Many researchers have also made many attempts at weakly supervised learning in the medical field [24–29]. Whether it’s prostate cancer detection, chest X-ray localization or gastric tumor segmentation. Unfortunately, weakly supervised learning has not been involved in SD-OCT images.

In this work, we propose a two-stage weakly supervised learning method to segment CSC accurately and automatically in SD-OCT images only by using image-level labels. During the first phase, a discriminative region of the lesion can be obtained by our proposed Located-CNN. Then, we evolve it as the initial contour of the level set method and produce pseudo pixel-level labels. During the second phase, the generated segmentation labels obtained above are used to train the segmentation network, which uses a active-contour loss function. To sum up, the main contributions of this work are three-fold:

$\bullet$ A pinpoint Located-CNN based on classification network is trained by simply using image-level label.

$\bullet$ We customize the active-contour loss function in deep networks to achieve the effective segmentation of the lesion area.

$\bullet$ To the best of our knowledge, we are the first to use weakly supervised learning to solve CSC segmentation problems. Without using pixel-level ground truth in the whole segmentation process, our segmentation results show that the effectiveness of our proposed segmentation method is as competitive as those relying on stronger supervision.

2. Related works

2.1 Conventional segmentation methods

For the subretinal fluid segmentation, various of unsupervised approaches have emerged to deal with this problem, whether the thresholding-based algorithms [30], or more complicated methods based on enface fundus driven method (EFD) [5]. Active contour models (ACMs) have shown better performance as represented by the active contour without edge (ACWE) model [31,32]. In Chan and Vese’s work, level set functions are introduced to formulate the segmentation model treated as an energy minimization problem solved through dealing with partial differential equations (PDEs); Then, because of low contrast and speckle noise in retinal SD-OCT image, the approaches of semi-supervised learning have begun to use some prior knowledge to overcome these problems. Wang et al. [7] utilized the label propagation and higher-order constraint-based segmentation of fluid-associated regions in retinal SD-OCT images, but the performance is heavily influenced by the key slice selected. Wu et al. [6] proposed a three-dimensional continuous max flow optimization-based serous retinal detachment segmentation approach to segment NRD and PED under the restriction of the fluid region selected; In addition, supervised learning methods, including random forest [33] and K nearest neighbor [34], have been introduced to identify the fluid region from the background.

2.2 CNN-based segmentation methods

Methods Based on Supervised Segmentation: Recent years have witnessed the successful application of deep learning in image segmentation, because it has the ability of automatic feature extraction. The DeepLab-v1 model [35] brings together methods from atrous convolution and conditional random fields (CRFs) for addressing the task of pixel-level classification. The DeepLab-v2 model [36] has better segmentation of objects at multiple scales by using atrous spatial pyramid pooling (ASPP). The DeepLab-v3 model [37] is general and could be applied to any network. Ronneberger et al. [38] proposed a novel method named U-Net to segment medical images automatically. Recently, for segmenting subretinal fluid in SD-OCT images, Gao et al. [39] proposed a novel image-to-image double-branched and area-constraint fully convolutional networks (DA-FCN). With strong supervision from pixel-level masks, the above approaches have greatly boosted the performance of segmentation. However, the problem that how to achieve good segmentation performance under weak supervision remains open.

Methods Based on Weakly Supervised Segmentation: There are various of ways of weakly supervised segmentation. Among them, the most attractive one is learning to segment images from only image-level annotations. Image-level label, which is easy to obtain, is the simplest supervision for leaning to segment. Some works [40–44] utilize deep activation for location with image-level labels. This method can aggregate the features of the last convolutional layer to generate discriminative class activation maps (CAM) [45]. However, we observe that some critical issues exist in such solutions, mainly including: failing to localize integral regions of the target objects densely within an image. Using more discriminative regions found, Wei et al. [46] and Zhang et al. [47] trained extra independent networks for generating class-specific activation maps with the assistance of the pre-trained networks in a post-processing step. However, the above methods are not apply to the medical fields which are a small set of samples and weak boundaries, especially for the precise segmentation by means of weakly supervised learning. In this case, we proposed our two-stage learning architecture, which will be demonstrated in Section 3.

3. Proposed segmentation method

In this section, we describe the details of the proposed method for automated serous retinal detachment segmentation. First, we present Located-CNN, which is a more precision way for producing object localization maps. Then, the proposed level set is presented for evolving high-quality pseudo object pixel-wise labels. Finally, the segmentation labels obtained by above procedure are used as supervision, so as to train the segmentation network with active-contour loss function. Architecture of our approach is illustrated in Fig. 2.

Fig. 2. Architecture of our approach. (i) A Located-CNN is designed to detect the location of lesion regions only used image-level annotations and highlight the distinguishing regions. (ii) If input is judged to be NRD will highlight the region by Located-CNN, and we regard this significance region as the initial contour of level set method to obtain pseudo pixel-level labels. Otherwise, the non-NRD images generate empty labels. (iii) Furthermore, the generated pixel-level annotations are employed to train the segmentation model with active-contour loss function.

Download Full Size | PDF

3.1 Highlighting the lesion area by located-CNN

CAM [45] shed light on how it explicitly enables the convolutional neural network to have remarkable localization ability despite being trained on image-level labels. It shows that the deep neural network really pays attention to object localization in the classification process. However, this approach is hard to accurate locate the position for CSC lesion area in SD-OCT images. There are two reasons for this: (1) the region of lesion is almost the same with background area; (2) SD-OCT images are low contrast and weak boundaries, which are also hard to achieve precise location of lesion area. Based on the characteristics of CSC, we customized Located-CNN to highlight the lesion area. Specifically, we have made two improvements to CAM: (i) The fully connected layer is discarded to avoid the influence of weight on the location feature map. Because we found that the effect of weights on the localization map can cause the network attention to deviate. (ii) To locate the lesion more accurately, a spatial attention module was added further to capture the lesion area’s location information. Through the above two improvements, our method can overcome the above two difficulties.

For a given image, suppose we are given a convolutional neural network with last convolutional feature maps denoted as $S\in {{R}^{K\times H\times W}}$, here $H\times W$ is the spatial size of feature map and $K$ is the number of channels. In our method, we denote the max value of the ${{P}^{k}}$ feature map by:

(1)$${{{P}^{k}}=\underset{H,W}{\mathop{max}}\,S_{H,W}^{k},\forall k\in \{1,\ldots,K\}}$$

where ${{P}^{k}}$ is achieved by aggregating the $K\times H\times W$ matrix of output for $H\times W$ different positions of the input window using a global max-pooling operation into a single $K\times$1 vector.

Inspired by [48], we generate a spatial attention map by utilizing the inter-spatial relationship of features. The spatial attention focuses on ’where’ is an informative part. To compute the spatial attention, we apply mean operation along the channel axis to generate an efficient feature descriptor, which is defined as:

(2)$${u = \sigma (\frac{{\sum\nolimits_{k = 1}^K {{P}^{k}} }}{K})}$$

where $\sigma$ denotes the sigmoid function. After above operations, we optimize the classification network by minimizing binary cross-entrophy (BCE) loss:

(3) $$BCELoss({u,v})={-}({{v}log{u}+}(1{-v}){log}(1{-u}))$$

where $u$ and $v$ represent predicted value and image-level label.

Through the modification of the above network, we can obtain the location of the lesion area after training. Specifically, the feature map obtained by the convolution of the last layer of the network is defined as follows:

(4)$${{F}_{H,W}}=\sum\nolimits_{k=1}^{K}{{S_{H,W}^{k}}},{S_{H,W}^{k}}\in {{R}^{H\times W}}$$

where add all the features $S_{H,W}^{k}$ along the $K$ dimension to get ${{F}_{H,W}}$. Then simply upsampling the ${{F}_{H,W}}$ to the size of the input image denote as $F$. After the above series of operations, we can find that part of the lesion area is accurately highlighted by Located-CNN. It should be noteworthy to mention that the Locate-CNN is a classification network. After this procession, classification network can tell whether input has lesion or not. Only when input are judged to be NRD will highlight the area by Located-CNN.

3.2 Pseudo pixel-level label obtained by using level set

In medical segmentation field, the quality and quantity of pixel-level label can great influence the performance of deep learning. Unfortunately, in our work, we do not have accurate pixel-level label to train a segmentation network. So in order to get the pseudo pixel-level label of NRD lesion area in SD-OCT image, we use the level set method to segment automatically. However, this method needs an accurate initial contour when evolving toward object boundary. In level set based image segmentation techniques, initial contours should be generated carefully since their sizes and locations can affect the segmentation performance. To automatically obtain initial contour, a suppression mask is designed to select the highlighted regions from Section A, defined as:

(5)$${M}= \begin{cases} 0, & \mbox{if } {F}<0.6\cdot max({F}) \\ 255, & \mbox{otherwise}. \end{cases}$$

where applying hard thresholding on the heatmap $F$ reveals the discriminative region $M$. Then, we use level set method to evolve the obtained significant seed regions $M$.

Chan-Vese model (C-V model) [31], a kind of level set method based on region can perform well on medical image segmentation by statistical intensity distribution on different regions. The energy functional is defined as follows:

(6)$$\begin{aligned} E^{CV}({{c}_{1}},&{{c}_{2}},\phi )=\mu \cdot Length(\phi (x,y)) \\ +&{{\lambda }_{1}}\cdot \int_{\Omega}^{{}}{|I (x,y)-{{c}_{1}}}{{|}^{2}}{H_\varepsilon }(\phi (x,y))dxdy \\ +&{{\lambda }_{2}}\cdot \int_{\Omega}^{{}}{|I (x,y)-{{c}_{2}}{{|}^{2}}(1 - {H_\varepsilon }(\phi (x,y)))dxdy} \end{aligned}$$

where $\mu$ , ${{\lambda }_{1}}$ and ${{\lambda }_{2}}$ are positive constants. The first term of energy functional is length regularization term, which is used to smooth the evolution region $\Omega$; the second and third term are fidelity terms, which are responsible for attracting the evolution region $\Omega$ to the expected boundary. $\phi (x,y)$ is an level set function. $I (x,y)$ represents the original image. ${{c}_{1}}$ and ${{c}_{2}}$ are global average intensity inside and outside the contour, which can be defined as follows:

(7)$$\left\{ \begin{array}{l} {c_1} = \frac{{\int_{\Omega}^{{}} {I(x,y){H_\varepsilon }(\phi (x,y))dxdy} }}{{\int_{{\Omega _{}}} {{H_\varepsilon }(\phi (x,y))dxdy} }} \\ {c_2} = \frac{{\int_{\Omega}^{{}} {I(x,y)(1-{H_\varepsilon }(\phi (x,y)))dxdy} }}{{\int_{{\Omega _{}}} {(1-{H_\varepsilon }(\phi (x,y)))dxdy} }} \end{array}\right.$$

where $H_\varepsilon (\phi (x,y))$ is Heaviside function :

(8)$${{H}_{\varepsilon }}(\phi (x,y))=\frac{1}{2}[1+\frac{2}{\pi }\arctan (\frac{\phi (x,y)}{\varepsilon })]$$

where $\varepsilon$ is a tiny positive constant.

By minimizing the energy functional of level set, we can get the confidence regions as pixel-level labels of lesion area. In our work, the lesion area can be accurately segmented after 300 iterations.

3.3 Segmentation network with active-contour loss function

The pixel-level labels obtained by above procedure are used as supervision to train a segmentation network. From the above Section 3.2, we can see that the closed lesion area can be evolved by the level set method from a significant located seed. Essentially the level set approach is based on the feature of the image itself, such as gray scale distribution, foreground and background. However, these characteristics are only based on some low level features of the image itself; for example, color, texture, etc., yet it can not deal with some high-level semantic features of the target that related to the object category itself. So we consider giving the level set method the ability to combine high-level semantic features.

Inspired by [49], we try to embed the level set into the neural network as the loss function of the deep neural network. In the process of deep learning, deep features are rich in semantic information, and iteratively optimize the whole neural network through the gradient descent of the neural network. Furthermore, for different loss functions, the predicted value will be close to the real value in different ways. In this way, the level set can be fully considered as a loss function combined with the high-level semantic features in the training of neural network, while maintaining the influence on the low-level texture boundary features. The active contour loss function (ACLoss) can be expressed as:

(9)$$\begin{aligned}{ACLoss=}&{{{\lambda }_{\textrm{1}}} \sum_{\Omega }^{i=1,j=1}{{{u}_{i,j}}{{\textrm{(}{{d}_{1}}-{{v}_{i,j}}\textrm{)}}^{2}}}} \\ {+}&{{{\lambda }_{\textrm{2}}} \sum_{\Omega }^{i=1,j=1}{\textrm{(}1-{{u}_{i,j}}\textrm{)(}{{d}_{2}}-{{v}_{i,j}}{{\textrm{)}}^{2}}}} \end{aligned}$$

different from the C-V model above, ${d}_{1}$ and ${d}_{2}$ are represented as the gray mean of inside (foreground) and outside (background). And in here, due to supervised-learning framework, ${d}_{1}$ and ${d}_{2}$ can be simply defined as constants in advance as ${{d}_{1}}=1$ and ${{d}_{2}}=0$. $u$ and $v$ are represented as output of network and a pixel-level label respectively. It can give full play to the role of level set method in maintaining boundaries and the effective combination of deep learning in extracting high-level features of targets. So far, the proposed framework above, to generate pseudo pixel-level labels and combine with the segmentation network, is named "Ours".

3.4 Implementation details

For the first stage of classification network, we choose a popular CNN, ResNet [50]. Specifically for ResNet-101, we remove the layer4 and layer5 after the layer3 and replace them with global max pooling followed by averaging operation. We use Adam [51] with an initial learning rate of 1e-4 to update the weights. In our classification task, SD-OCT image with the normal and NRD are labeled as 0 and 1, respectively. For the second learning stage, we use DeepLab [37] as our segmentation network and use sigmoid function in the final output. Our backbone network is ResNet-101, and dilated rate in ASPP is 1, 6, 12, 18 respectively. Furthermore, the generated pixel-level annotations are employed to train the segmentation network with active-contour loss function. In practice, we set ${\lambda }_{\textrm {1}}$=${\lambda }_{\textrm {2}}$=1. The classification network and segmentation network is implemented by using Keras and Pytorch; The level set method and statistical analysis of experiment are implemented by using MATLAB R2017b. Both two networks are trained on NVIDIA GeForce GTX 1080Ti with 11GB memory.

4. Experiments

4.1 Dataset and performance metrics

Dataset In our work, 23 volume cases from 12 patients containing NRD-fluid from [5,52] make up our dataset. Each case contains 128 images, and the size of each image is $1024\times$512. This challenging dataset is used to evaluate our proposed method. Of particular note is that a patient may have more than one case, which means that a patient may have similar tissue characteristics.To ensure that patients are independent of each other, we divided 12 patients into four groups on a patient-by-patient basis for cross-validation. Then the results of the four groups of experiments were averaged as the final evaluation index. In the meantime, this work was approved by the Institutional Review Board (IRB) of the First Affiliated Hospital of Nanjing Medical University with informed consent. We obtained two sets of segmented ground truth labeled by two experienced ophthalmology experts.

Metrics We define the following evaluation criteria to indicate the accuracy of the positioning. Precision can be quantified as:

(10)$$\textrm{Precision=}\frac{{{{M}}\cap {{P}}}}{{{M}}}$$

where ${M}$ is the highlighted region of applying hard thresholding and ${P}$ is the ground truth labeled by experts. By the area of the intersection of ${{M}}$ and ${{P}}$ can reflect the size of accuracy of regional location of lesion. The larger the value, the more accurate our position is.

There are three criterions, which are the true positive volume fraction (TPVF), positive predicative value (PPV) and dice similarity coefficient (DSC) respectively, to evaluate the segmentation results:

(11)$$TPVF\textrm{ }=\textrm{ }\frac{\textrm{ }\!\!|\!\!\textrm{ }{{V}_{TP}}\textrm{ }\!\!|\!\!}{\!\!|\!\!\textrm{ }{{V}_{G}}\textrm{ }\!\!|\!\!} $$

(12)$$PPV\textrm{ }=\textrm{ }\frac{|{{V}_{TP}}|}{|{{V}_{TP}}|+|{{V}_{FP}}|} $$

(13)$$DSC\textrm{ }=\textrm{ }\frac{\textrm{2 }\!\!|\!\!\textrm{ }{{V}_{TP}}|}{|{{V}_{R}}|+|{{V}_{G}}|} $$

where $|.|$ represents the volume of slices, ${{V}_{R}}$ is the predicted results of our method, ${{V}_{TP}}$ and ${{V}_{FP}}$ represent the true positive and false positive volumes of the results of our method, respectively and ${{V}_{G}}$ is the ground truth labeled by the experts.

4.2 Location precision and evolution evaluation

To get the high-quality pseudo pixel-level labels for the segmentation network’s training, we customized a Located-CNN to highlight NRD lesion regions and then utilized the level set to generate pixel-level labels automatically. To illustrate the importance of Located-CNN, we compared the classification accuracy and location precision with CAM [45]. From Fig. 3, we can observe that the customized location approach can more highlight the lesion region than CAM that focuses on the retinal.

Fig. 3. Visualization of CAM [45] for generic CNN and the proposed Located-CNN. CNNs can be attracted by the contour of the lesion area. In contrast, Located-CNN focuses on lesion regions during training to discover more discriminative differences. Best viewed in color.

Download Full Size | PDF

Then we evaluated the classification accuracy of CAM and Located-CNN as 98.44% and 97.25%, respectively. Although there’s not much difference between CAM and Located-CNN in the classification accuracy, the focus of lesion region is more accurate. And the localization precision can reach 90.25%. It can be seen from the results that our Located-CNN does not significantly reduce classification performance, at the same time, it can ensure the higher localization precision. Furthermore, the positioning region by our method is more prone to consider as initial contour to generate pixel-level compared with CAM.

In addition, using the location seeds obtained above, we apply the level set method to evolve the lesion region. Then, we quantitatively evaluate all the test sets as shown in Fig. 4.

Fig. 4. The quantitative results obtained from the evolution of level set method. Blue and orange represent the assessment results of two related experts.

Download Full Size | PDF

4.3 Comparison of four loss functions

In this case, we use four different loss functions to train a segmentation model for pseudo pixel-level labels generated in the previous section. Three different loss functions, MSEloss (MSE), BCELoss (BCE) and DiceLoss (Dice), are defined as follows:

(14)$${MSELoss= \left|\sum_{\Omega }^{i=1,j=1}{{{\left( {{u}_{i,j}}-{{v}_{i,j}} \right)}^{2}}}\right|} $$

(15)$${BCELoss={-}\sum_{\Omega }^{i=1,j=1}{\left( {{v}_{i,j}}log{{u}_{i,j}}+(1-{{v}_{i,j}})log(1-{{u}_{i,j}}) \right)}} $$

(16)$${DiceLoss= 1-2\cdot \sum_{\Omega }^{i=1,j=1}{\frac{({{u}_{i,j}}\times {{v}_{i,j}})}{{{u}_{i,j}}+{{v}_{i,j}}}}} $$

At the bottom of Table 1, we quantitatively analyze the influence of different loss functions on the segmentation network. On the whole, active-contour loss function is better the other three loss functions. Figure 5 shows segmentation results of four examples using DeepLab and four different loss functions. From left to right, the segmentation results by Ours+MSE [53], Ours+BCE, Ours+Dice [54] and Ours+AC [49] are shown respectively. It can be directly seen from the Fig. 5 that the result of ACLoss as a loss function is closer to the ground truth.

Fig. 5. Segmentation results of four examples using DeepLab and four different loss functions. From left to right, segmentation results by Ours+MSE, Ours+BCE, Ours+Dice and Ours+AC are shown respectively. The red curves denote the ground truth marked by expert, and the green curves denote automatic segmentation results.

Download Full Size | PDF

Table 1. Comparison of eight methods and four loss functions for only NRD volume

View Table

4.4 Comparison with existing methods

In this paper, eight state-of-the-art methods are used to verify further the validity of our model, including label propagation and higher-order constraint (LPHC) [7], a random forest classifier based method (RF) [33], a stratified sampling k-nearest neighbor classifier based algorithm (SS-KNN) [34], an Enface fundus-driven method (EFD) [5], a continuous max-flow approach (CMF) [6], the fully convolutional networks (FCN) [21], a fuzzy level set with cross-sectional voting (FLSCV) [32], and the double-branched and area-constraint fully convolutional networks (DA-FCN) [39]. As presented in Table 1, the quantitative comparison of various methods indicates that our approach can achieve more accurate segmentation than other methods in some aspects. Note that our method is not comparable to other methods. On the one hand, other methods rely either on pixel-level labels or on the layer information of lesion area; on the other hand, our method only uses image-level labels from the beginning to the end. The results show that our proposed method is very close to the segmentation results of fully supervised learning, even on TPVF the index to get the best results.

In addition, we evaluated from the perspective of qualitative segmentation of NRD volume. Through the qualitative analysis of Fig. 6, the proposed method can locate and segment NRD regions more accurately by comparing with other existing methods. Figure 9 shows the 3D volume segmentation results of ours presented in Table 1. From the figure, we can also visually see the continuity and integrity of our segmentation result. We use linear correlation analysis and Bland-Altman reproducibility approach shown in Fig. 8 to compare our approach with the segmentation result of expert 1 and expert 2 respectively, which indicates that the prediction results show high agreement with the ground truth.

Fig. 6. Examples of NRD segmentation results. From the first column to eighth column,the segmentation results were obtained by LPHC, RF, KNN, EFD, CMF, FCN, FLSCV, DA-FCN and the proposed method respectively. The red curves denote the ground truth marked by expert, and the green curves denote segmentation results.

Download Full Size | PDF

4.5 Multi-lesions segmentation analysis

At present, our learning framework can effectively segment the single lesion area (only NRD) with image-level annotations. Further more, our model is verified by multi-lesion data (NRD&PED), and Fig. 7 shows the segmentation results of two examples containing both NRD and PED using DeepLab and two different loss functions. From left to right, the example-original SD-OCT image, ground truth, segmentation results by Ours+AC, Ours+Dice are shown respectively. It fully shows that our method has good boundary retention effect on multi-lesion segmentation, which further demonstrates the robustness of our method.

Fig. 7. Segmentation results of two examples containing both NRD and PED using DeepLab and two different loss functions. From left to right, the example-original SD-OCT image, ground truth, segmentation results by Ours+AC and Ours+Dice are shown respectively.

Download Full Size | PDF

5. Discussion

Training an effective segmentation network requires precise pixel-level labels, whereas the available generation of pixel-level labels is critical for weakly supervised learning. In this work, we propose a two-stage weakly supervised learning method to segment CSC accurately and automatically in SD-OCT images only by using image-level labels. A challenging dataset is used to evaluate our proposed method. The results demonstrate that the proposed method consistently outperforms some current models trained with different supervision levels and is even as competitive as those relying on stronger supervision. Furthermore, it can be applied to segment fluid tissues of the OCT image, such as lamellar macular holes and macular pseudo holes. At the same time, this will significantly reduce labeling for professional doctors and provide solutions to related problems.

In our work, the critic here is whether the localization maps are in the lesion area rather than feature localization manifestation. From Fig. 3, it can be observed that our proposed Located-CNN is a more precise way for producing object localization maps than conventional CAM. Because of Located-CNN, our method can automatically evolve with the help of the level set method and obtain high-quality pseudo labels. Figure 4 shows that the evolution result of the level set method can get available quantitative results. Although the results are relatively satisfactory, it has been found that the pseudo pixel-level label obtained depends on the location precision, and the selection of threshold itself will cause some significant regions to be missed. Another reason for this result is that the level set method is based on the gray distribution in the evolution process. Therefore, when the lesion area and tissue are close to each other in gray scale distribution, excessive segmentation will occur. In conclusion, we define the above mentioned abnormal samples as noise labels in pseudo pixel-level labels obtained by level set method. Some of the problems mentioned above inspire us to try to solve noise labels in the future.

Fortunately, the number of noise labels described above accounts for only a small fraction of the number of training samples. Because of the similarity among samples, the proportion of the complete lesion segmentation results are much higher than that of noise labels. The segmentation network is more inclined to the distribution of complete lesion segmentation results in the learning process, but will play a role in correcting some noise labels. Additionally, the loss function is a measure tool that can well reflect the gap between the predicted output and the ground truth. Therefore, we adopted four different loss functions to train the segmentation model. It can be directly seen from the Fig. 5 that the result of ACLoss as a loss function is closer to the ground truth. At the bottom of Table 1, we quantitatively analyze the influence of different loss functions on the segmentation network. Although BCE can achieve the best performance on the TPVF index, it gets the worst performance in PPV and DSC. On the contrary, AC is very balanced and very close to the fully supervised segmentation method. To sum up, we choose the active-contour loss function in our segmentation network, named Ours+AC. To further verify the validity of our model, eight state-of-the-art methods are used to compare as presented in Table 1. The results show that our proposed method is very close to the segmentation results of fully supervised learning. Besides, we can see more intuitively from the last two columns of Fig. 6 that our method is very close to DA-FCN. In the last part of the experiment, we further verify our model’s sensitivity and robustness for multiple lesions. From Fig. 7, we can intuitively see that Ours + AC can achieve a better segmentation effect than Ours + Dice.

Fig. 8. (a) Statistical correlation analysis between the proposed method and specialist 1. (b) Bland-Altman plots for the proposed method and specialist 1. (c) Statistical correlation analysis between the proposed method and specialist 2. (d) Bland-Altman plots for the proposed method and specialist 2.

Download Full Size | PDF

Fig. 9. From (a) to (f):3D segmentation results of the proposed method. The blue surface denotes the IB RPE, and the green region indicates the subretinal fluid.

Download Full Size | PDF

Although our proposed method can achieve encouraging results using a weakly supervised learning method, there are still some limitations in our proposed segmentation method. Firstly, the noise labels in pseudo pixel-level labels affects the segmentation accuracy of our model to some extent. However, there does not include noise label suppression module in our proposed method. Therefore, we would like to design a noise label suppression module in the future to improve the quality of pseudo pixel-level labels and further improve the performance of the segmentation network. Secondly, a large amount of data can be beneficial to segmentation performance of the deep learning model, ensuring the diversity of data learning by the model and overcoming some test data’s uncertainty. Nevertheless, the limitations of our approach lie not only in a larger dataset and the quality of the data, but also in the difficulty of annotating the data and the need for a professional physician to annotate the data, which is a long and costly process. Meanwhile, our proposed method focuses on the segmentation of a single lesion area. It is unable to identify each type of lesion when having multi-lesion areas in the SD-OCT image. In the future, we hope to efficiently segment serous retinal detachment by combing semi-supervised and active learning methods.

6. Conclusion

In order to solve the problems that obtaining pixel-level needs lots of time and money in medical image segmentation, in this work, a two-stage learning architecture is proposed for weakly supervised SD-OCT retinal image segmentation. Extensive experiment results validate our proposed weakly supervised learning architecture on a highly challenging dataset. Compared with other methods, our segmentation results are even as competitive as those relying on stronger supervision. The proposed weakly supervised learning architecture could greatly reduce the cost of obtaining pixel-level label and the limitations of ophthalmic image processing. In addition, we found that for this task, there is some independence in our methods. We try to break that independence by mining the potential of deep learning. In the future, we hope to realize automated serous retinal detachment segmentation in a more efficient way.

Funding

National Natural Science Foundation of China (61701192, 61671242, 61872419, 61873324); Natural Science Foundation of Shandong Province (ZR2017QF004, ZR2019MF040, ZR2019MH106); Key Technology Research and Development Program of Shandong (2017CXGC0810); China Postdoctoral Q1 Science Foundation (2017M612178).

Disclosures

The authors declare that there are no conflicts of interest in this work.

References

1. K. K. Dansingani, C. Balaratnasingam, S. Mrejen, M. Inoue, K. B. Freund, J. M. Klancnik Jr, and L. A. Yannuzzi, “Annular lesions and catenary forms in chronic central serous chorioretinopathy,” Am. J. Ophthalmol. 166, 60–67 (2016). [CrossRef]

2. Y. Kuroda, S. Ooto, K. Yamashiro, A. Oishi, H. Nakanishi, H. Tamura, N. Ueda-Arakawa, and N. Yoshimura, “Increased choroidal vascularity in central serous chorioretinopathy quantified using swept-source optical coherence tomography,” Am. J. Ophthalmol. 169, 199–207 (2016). [CrossRef]

3. R. Hua, L. Liu, C. Li, and L. Chen, “Evaluation of the effects of photodynamic therapy on chronic central serous chorioretinopathy based on the mean choroidal thickness and the lumen area of abnormal choroidal vessels,” Photodiagn. Photodyn. Ther. 11(4), 519–525 (2014). [CrossRef]

4. T. Liu, W. Lin, S. Zhou, and X. Meng, “Optical coherence tomography angiography of flat irregular pigment epithelial detachments in central serous chorioretinopathy,” Br. J. Ophthalmol. 105(2), 233–238 (2021). [CrossRef]

5. M. Wu, Q. Chen, X. He, P. Li, W. Fan, S. Yuan, and H. Park, “Automatic subretinal fluid segmentation of retinal sd-oct images with neurosensory retinal detachment guided by enface fundus imaging,” IEEE Trans. Biomed. Eng. 65(1), 87–95 (2018). [CrossRef]

6. M. Wu, W. Fan, Q. Chen, Z. Du, X. Li, S. Yuan, and H. Park, “Three-dimensional continuous max flow optimization-based serous retinal detachment segmentation in SD-OCT for central serous chorioretinopathy,” Biomed. Opt. Express 8(9), 4257–4274 (2017). [CrossRef]

7. T. Wang, Z. Ji, Q. Sun, Q. Chen, S. Yu, W. Fan, S. Yuan, and Q. Liu, “Label propagation and higher-order constraint-based segmentation of fluid-associated regions in retinal SD-OCT images,” Inf. Sci. 358-359, 92–111 (2016). [CrossRef]

8. G. Wang, W. Li, M. A. Zuluaga, R. Pratt, P. A. Patel, M. Aertsen, T. Doel, A. L. David, J. Deprest, and S. Ourselin, “Interactive medical image segmentation using deep learning with image-specific fine tuning,” IEEE Trans. Med. Imaging 37(7), 1562–1573 (2018). [CrossRef]

9. A. A. Shvets, A. Rakhlin, A. A. Kalinin, and V. I. Iglovikov, “Automatic instrument segmentation in robot-assisted surgery using deep learning,” in 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA), (IEEE, 2018), pp. 624–628.

10. J. Funke, F. Tschopp, W. Grisaitis, A. Sheridan, C. Singh, S. Saalfeld, and S. C. Turaga, “Large scale image segmentation with structured loss based deep learning for connectome reconstruction,” IEEE Trans. Pattern Anal. Mach. Intell. 41(7), 1669–1680 (2019). [CrossRef]

11. L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam, “Encoder-decoder with atrous separable convolution for semantic image segmentation,” in Proceedings of the European Conference on Computer Vision (ECCV), (2018), pp. 801–818.

12. Z. Jie, Y. Wei, X. Jin, J. Feng, and W. Liu, “Deep self-taught learning for weakly supervised object localization,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (2017), pp. 1377–1385.

13. X. Liang, S. Liu, Y. Wei, L. Liu, L. Lin, and S. Yan, “Towards computational baby learning: a weakly-supervised approach for object detection,” in Proceedings of the IEEE International Conference on Computer Vision, (2015), pp. 999–1007.

14. D. Pathak, E. Shelhamer, J. Long, and T. Darrell, “Fully convolutional multi-class multiple instance learning,” arXiv preprint arXiv:1412.7144 (2014).

15. O. Russakovsky, A. L. Bearman, V. Ferrari, and F.-F. Li, “What’s the point: semantic segmentation with point supervision,” in Proceedings of the European Conference on Computer Vision (ECCV), (2016), pp. 549–565.

16. J. Dai, K. He, and J. Sun, “Boxsup: Exploiting bounding boxes to supervise convolutional networks for semantic segmentation,” in 2015 IEEE International Conference on Computer Vision (ICCV), (2015), pp. 1635–1643.

17. P. Vernaza and M. Chandraker, “Learning random-walk label propagation for weakly-supervised semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (2017), pp. 2953–2961.

18. P. O. Pinheiro and R. Collobert, “From image-level to pixel-level labeling with convolutional networks,” in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (2015), pp. 1713–1721.

19. A. Kolesnikov and C. H. Lampert, “Seed, expand and constrain: three principles for weakly-supervised image segmentation,” in European Conference on Computer Vision, (Springer, 2016), pp. 695–711.

20. Y. Wei, X. Liang, Y. Chen, X. Shen, M. Cheng, J. Feng, Y. Zhao, and S. Yan, “STC: a simple to complex framework for weakly-supervised semantic segmentation,” IEEE Trans. Pattern Anal. Mach. Intell. 39(11), 2314–2320 (2017). [CrossRef]

21. J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (2015), pp. 3431–3440.

22. S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. H. Torr, “Conditional random fields as recurrent neural networks,” in Proceedings of the IEEE International Conference on Computer Vision, (2015), pp. 1529–1537.

23. M. Everingham, S. M. A. Eslami, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, “The pascal visual object classes challenge: a retrospective,” Int. J. Comput. Vis. 111(1), 98–136 (2015). [CrossRef]

24. Q. Liang, Y. Nan, G. Coppola, K. Zou, W. Sun, D. Zhang, Y. Wang, and G. Yu, “Weakly supervised biomedical image segmentation by reiterative learning,” IEEE J. Biomed. Health Inform. 23(3), 1205–1214 (2019). [CrossRef]

25. X. Yang, C. Liu, Z. Wang, J. Yang, H. Le Min, L. Wang, and K.-T. T. Cheng, “Co-trained convolutional neural networks for automated detection of prostate cancer in multi-parametric mri,” Med. Image Anal. 42, 212–227 (2017). [CrossRef]

26. L. Yao, J. Prosky, E. Poblenz, B. Covington, and K. Lyman, “Weakly supervised medical diagnosis and localization from multiple resolutions,” arXiv preprint arXiv:1803.07703 (2018).

27. X. Wang, Y. Peng, L. Lu, Z. Lu, M. Bagheri, and R. M. Summers, “Chestx-ray8: hospital-scale chest X-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (2017), pp. 2097–2106.

28. C. Sun, A. Xu, D. Liu, Z. Xiong, F. Zhao, and W. Ding, “Deep learning-based classification of liver cancer histopathology images using only global labels,” IEEE Journal of Biomedical and Health Informatics p. 1 (2019).

29. R. J. G. van Sloun and L. Demi, “Localizing b-lines in lung ultrasonography by weakly supervised deep learning, in-vivo results,” IEEE J. Biomed. Health Inform. 24(4), 957–964 (2020). [CrossRef]

30. G. R. Wilkins, O. M. Houghton, and A. L. Oldenburg, “Automated segmentation of intraretinal cystoid fluid in optical coherence tomography,” IEEE Trans. Biomed. Eng. 59(4), 1109–1114 (2012). [CrossRef]

31. T. F. Chan and L. A. Vese, “Active contours without edges,” IEEE Trans. on Image Process. 10(2), 266–277 (2001). [CrossRef]

32. J. Wang, M. Zhang, A. D. Pechauer, L. Liu, T. S. Hwang, D. J. Wilson, D. Li, and Y. Jia, “Automated volumetric segmentation of retinal fluid on optical coherence tomography,” Biomed. Opt. Express 7(4), 1577–1589 (2016). [CrossRef]

33. A. Lang, A. Carass, E. K. Swingle, O. Al-Louzi, P. Bhargava, S. Saidha, H. S. Ying, P. A. Calabresi, and J. L. Prince, “Automatic segmentation of microcystic macular edema in oct,” Biomed. Opt. Express 6(1), 155–169 (2015). [CrossRef]

34. G. Quellec, K. Lee, M. Dolejsi, M. K. Garvin, M. D. Abramoff, and M. Sonka, “Three-dimensional analysis of retinal layer texture: identification of fluid-filled regions in SD-OCT of the macula,” IEEE Trans. Med. Imaging 29(6), 1321–1330 (2010). [CrossRef]

35. L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Semantic image segmentation with deep convolutional nets and fully connected CRFS,” arXiv preprint arXiv:1412.7062 (2014).

36. L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Deeplab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFS,” IEEE Transactions on Pattern Analysis Mach. Intell. 40(4), 834–848 (2018).

37. L.-C. Chen, G. Papandreou, F. Schroff, and H. Adam, “Rethinking atrous convolution for semantic image segmentation,” arXiv preprint arXiv:1706.05587 (2017).

38. O. Ronneberger, P. Fischer, and T. Brox, “U-net: convolutional networks for biomedical image segmentation,” in International Conference on Medical Image Computing and Computer-assisted Intervention, (Springer, 2015), pp. 234–241.

39. K. Gao, S. Niu, Z. Ji, M. Wu, Q. Chen, R. Xu, S. Yuan, W. Fan, Y. Chen, and J. Dong, “Double-branched and area-constraint fully convolutional networks for automated serous retinal detachment segmentation in SD-OCT images,” Comput. Methods Programs Biomed. 176, 69–80 (2019). [CrossRef]

40. D. Kim, D. Cho, D. Yoo, and I. So Kweon, “Two-phase learning for weakly supervised object localization,” in Proceedings of the IEEE International Conference on Computer Vision, (2017), pp. 3534–3543.

41. Y. Zhu, Y. Zhou, Q. Ye, Q. Qiu, and J. Jiao, “Soft proposal networks for weakly supervised object localization,” in Proceedings of the IEEE International Conference on Computer Vision, (2017), pp. 1841–1850.

42. Y. Wei, H. Xiao, H. Shi, Z. Jie, J. Feng, and T. S. Huang, “Revisiting dilated convolution: a simple approach for weakly-and semi-supervised semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (2018), pp. 7268–7277.

43. X. Zhang, Y. Wei, G. Kang, Y. Yang, and T. Huang, “Self-produced guidance for weakly-supervised object localization,” in Proceedings of the European Conference on Computer Vision (ECCV), (2018), pp. 597–613.

44. S. Kwak, S. Hong, and B. Han, “Weakly supervised semantic segmentation using superpixel pooling network,” in Thirty-First AAAI Conference on Artificial Intelligence, (2017), pp. 4111–4117.

45. B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, “Learning deep features for discriminative localization,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (2016), pp. 2921–2929.

46. Y. Wei, J. Feng, X. Liang, M.-M. Cheng, Y. Zhao, and S. Yan, “Object region mining with adversarial erasing: a simple classification to semantic segmentation approach,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (2017), pp. 1568–1576.

47. X. Zhang, Y. Wei, J. Feng, Y. Yang, and T. S. Huang, “Adversarial complementary learning for weakly supervised object localization,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (2018), pp. 1325–1334.

48. S. Woo, J. Park, J.-Y. Lee, and I. So Kweon, “Cbam: convolutional block attention module,” in Proceedings of the European Conference on Computer Vision (ECCV), (2018), pp. 3–19.

49. X. Chen, B. M. Williams, S. R. Vallabhaneni, G. Czanner, R. Williams, and Y. Zheng, “Learning active contour models for medical image segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (2019), pp. 11632–11640.

50. K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (2016), pp. 770–778.

51. D. P. Kingma and J. Ba, “Adam: a method for stochastic optimization,” arXiv preprint arXiv:1412.6980.

52. Y. Zheng, J. Sahni, C. Campa, A. N. Stangos, A. Raj, and S. P. Harding, “Computerized assessment of intraretinal and subretinal fluid regions in spectral-domain optical coherence tomography images of the retina,” Am. J. Ophthalmol. 155(2), 277–286.e1 (2013). [CrossRef]

53. O. Köksoy, “Multiresponse robust design: mean square error (MSE) criterion,” Appl. Math. Comput. 175(2), 1716–1729 (2006). [CrossRef]

54. F. Milletari, N. Navab, and S.-A. Ahmadi, “V-net: fully convolutional neural networks for volumetric medical image segmentation,” in 2016 Fourth International Conference on 3D Vision (3DV), (IEEE, 2016), pp. 565–571.

	Expert 1			Expert 2
Method	TPVF(%)	PPV(%)	DSC(%)	TPVF(%)	PPV(%)	DSC(%)
LPHC [7]	81.3 $\pm$ 9.4	55.6 $\pm$ 13.3	65.3 $\pm$ 10.4	81.2 $\pm$ 9.3	55.8 $\pm$ 12.8	65.7 $\pm$ 10.5
RF [33]	92.6 $\pm$ 4.4	92.4 $\pm$ 2.0	87.1 $\pm$ 4.3	92.5 $\pm$ 4.3	91.9 $\pm$ 2.2	88.9 $\pm$ 4.2
KNN [34]	80.9 $\pm$ 6.6	91.9 $\pm$ 3.8	86.1 $\pm$ 4.1	80.3 $\pm$ 6.5	91.8 $\pm$ 3.8	85.9 $\pm$ 4.1
EFD [5]	94.2 $\pm$ 5.2	93.7 $\pm$ 4.0	93.0 $\pm$ 4.8	94.3 $\pm$ 5.1	94.6 $\pm$ 4.1	94.1 $\pm$ 5.3
CMF [6]	92.1 $\pm$ 4.1	93.9 $\pm$ 2.5	93.0 $\pm$ 3.4	92.0 $\pm$ 3.9	94.3 $\pm$ 2.6	94.0 $\pm$ 3.5
FCN [21]	82.6 $\pm$ 20.3	95.0 $\pm$ 5.1	86.6 $\pm$ 15.6	82.4 $\pm$ 20.0	96.0 $\pm$ 5.1	95.9 $\pm$ 5.1
FLSCV [32]	84.4 $\pm$ 16.0	78.9 $\pm$ 21.7	86.2 $\pm$ 7.3	84.4 $\pm$ 15.1	79.4 $\pm$ 20.2	63.4 $\pm$ 7.3
DA-FCN [39]	94.3 $\pm$ 1.8	95.0 $\pm$ 1.8	95.8 $\pm$ 1.4	94.4 $\pm$ 3.2	95.6 $\pm$ 1.6	97.0 $\pm$ 1.1
Ours+MSE	96.5 $\pm$ 0.9	87.5 $\pm$ 3.6	91.8 $\pm$ 2.2	96.8 $\pm$ 0.7	88.6 $\pm$ 1.1	92.6 $\pm$ 0.7
Ours+BCE	96.7 $\pm$ 0.6	74.6 $\pm$ 11.4	83.4 $\pm$ 7.3	97.1 $\pm$ 0.0	75.0 $\pm$ 10.5	84.2 $\pm$ 6.7
Ours+Dice	96.2 $\pm$ 1.9	83.1 $\pm$ 5.0	91.8 $\pm$ 2.2	96.7 $\pm$ 1.8	83.4 $\pm$ 4.0	89.7 $\pm$ 1.7
Ours+AC	95.0 $\pm$ 1.4	92.3 $\pm$ 2.4	93.6 $\pm$ 1.5	95.2 $\pm$ 1.8	92.9 $\pm$ 1.0	94.1 $\pm$ 0.5

Weakly supervised serous retinal detachment segmentation in SD-OCT images by two-stage learning

Abstract

1. Introduction

2. Related works

2.1 Conventional segmentation methods

2.2 CNN-based segmentation methods

3. Proposed segmentation method

3.1 Highlighting the lesion area by located-CNN

3.2 Pseudo pixel-level label obtained by using level set

3.3 Segmentation network with active-contour loss function

3.4 Implementation details

4. Experiments

4.1 Dataset and performance metrics

4.2 Location precision and evolution evaluation

4.3 Comparison of four loss functions

4.4 Comparison with existing methods

4.5 Multi-lesions segmentation analysis

5. Discussion

6. Conclusion

Funding

Disclosures

References

Cited By

Figures (9)

Tables (1)

Equations (16)

Biomedical Optics Express