
Leveraging non-expert crowdsourcing to segment the optic cup and disc of multicolor fundus images

Open Access

Abstract

Multicolor scanning laser imaging (MCI) has broad application potential in the diagnosis of fundus diseases such as glaucoma. However, the performance of automatic computer-aided diagnosis systems based on MCI images is limited by the lack of high-quality annotations for large numbers of images. Producing annotations for vast amounts of MCI images would be a prolonged process if we relied on experts alone. We therefore consider non-expert crowdsourcing, an alternative approach that produces useful annotations efficiently and at low cost. In this work, we aim to explore the effectiveness of non-expert crowdsourcing for the segmentation of the optic cup (OC) and optic disc (OD) in MCI images, an upstream task for glaucoma diagnosis. To this end, desensitized MCI images are independently annotated by four non-expert annotators, yielding a crowdsourcing dataset. To profit from crowdsourcing, we propose a model consisting of a coupled regularization network and segmentation network. The regularization network generates learnable pixel-wise confusion matrices (CMs) that reflect the preferences of each annotator. During training, the CMs and the segmentation network are optimized simultaneously to trade off non-expert annotations dynamically and generate reliable predictions. Crowdsourcing learning with our method achieves an average Mean Intersection Over Union ($\mathcal {M}$) of 91.34%, while the average $\mathcal {M}$ of the model trained with expert annotations is 91.72%. In addition, comparative experiments show that, in our segmentation task, non-expert crowdsourcing can be on a par with an expert who annotates 90% of the data. Our work suggests that crowdsourcing for OC and OD segmentation in MCI images has the potential to substitute for expert annotation, which will accelerate the construction of large datasets and facilitate the application of deep learning to clinical diagnosis with MCI images.

© 2022 Optica Publishing Group under the terms of the Optica Open Access Publishing Agreement

1. Introduction

Multicolor scanning laser imaging (MCI) was introduced by Heidelberg Engineering; it simultaneously uses blue reflectance (BR, $\lambda$ = 488 nm), green reflectance (GR, $\lambda$ = 515 nm), and infrared reflectance (IR, $\lambda$ = 820 nm) scanning lasers to illuminate the fundus [1]. MCI is popular with ophthalmologists because the three reflectance images that make up a multicolor image can be viewed separately, allowing the ophthalmologist to navigate through each level of detail [2,3]. In addition, the contrast of the optic cup (OC) and optic disc (OD) is more distinct than in traditional color fundus photography, which may have potential for future applications in glaucoma [4]. However, due to the lack of large MCI datasets, the reliability of MCI image-based automatic fundus disease diagnosis systems still falls far short of expectations. Experts are scarce, and the segmentation annotation process is time-consuming and labor-intensive [5], hindering the creation of large MCI datasets. To address this issue, we employ crowdsourcing learning by hiring non-expert individuals to perform segmentation annotation for dataset construction.

Crowdsourcing, a portmanteau of "crowd" and "outsourcing", is a method of solving problems with the help of the wisdom of the group [6]. Laypersons, amateurs, and professionals can participate equally in crowdsourcing through open recruitment or online platforms such as Amazon Mechanical Turk [7]. For large-scale data analysis, crowdsourcing reduces costs and increases the speed of research progress [8]. Furthermore, obtaining multiple annotations for the same instance can provide a considerable advantage [9]. In medical image-based diagnostic studies, crowdsourcing has been demonstrated to produce accurate results, including grading images for glaucoma and diabetic retinopathy [10], instrument segmentation in laparoscopic images [11], and kidney segmentation on CT scans [12]. However, when crowdsourcing tasks are highly specialized and the participants are non-experts, unreliable annotations containing human error tend to be generated [13]. This drawback is particularly evident in pixel-wise medical image segmentation. Therefore, it is crucial to devise methods that can learn from noisy non-expert crowdsourcing labels and remain comparable to expert labels. In this work, we apply crowdsourcing to OC and OD segmentation using multicolor images, an upstream task for glaucoma diagnosis, and propose a regularized model to learn from non-expert crowdsourcing annotations.

To explore whether crowdsourcing can compete with experts on the segmentation of the OC and OD in MCI fundus images, we construct a crowdsourcing segmentation dataset using in-house data. Specifically, we recruit two non-ophthalmology medical students, $M_1$ and $M_2$, and two computer vision researchers, $M_3$ and $M_4$, where $M$ refers to the masses. They independently perform the segmentation of the OC and OD after a brief (10 min) training session given by an ophthalmologist. Figure 1 shows a visualization of an exemplar of MCI data and the corresponding crowdsourcing annotations. Given the speed, scalability, and low cost of such annotation, the medical image analysis community may no longer need to rely on experts to generate reference or training data for certain applications.


Fig. 1. (a) An exemplar of a group of MCI images: IR, GR, BR, and fused image in order. (b) Visualization of OC (inner) and OD (outer) annotations of non-experts. The colors of the annotations correspond to the legend.


Aggregating labels by majority voting is the most commonly used approach in crowdsourcing learning. However, when weak annotators dominate, the result of majority voting is usually not credible. Other label fusion methods, such as STAPLE [14], struggle to achieve learnable, instance-dependent fusion. Deep learning provides a solution to these limitations of label fusion methods. Recently, Zhang et al. [15] proposed a segmentation model that combines the STAPLE method with a convolutional neural network (CNN) and achieved a considerable improvement in multi-expert tasks. This model addresses the common drawback of traditional approaches to integrating information across training data, and it exploits the training ability of CNNs to model human behavior at a level of complexity that relatively simple functions cannot reach. Inspired by [15], we propose an architecture involving two coupled CNNs; that is, we add a regularization network to the segmentation backbone. In the regularization network, the focal loss is minimized so that the confusion matrices reflect the annotation preferences of the non-experts, while the trace of the confusion matrices is minimized, which treats the estimated annotators as maximally unreliable. Furthermore, the trace of the confusion matrices is first maximized during a warm-up period to ensure diagonal dominance and to let the segmentation network generate positive predictions. Finally, the segmentation network, regularized by the regularization network, can be used to segment input images directly.

In summary, the main contributions of this study are as follows:

  • (1) We recruit four non-experts to annotate the OC and OD on MCI images, deploying crowdsourcing for segmentation dataset construction.
  • (2) We propose a segmentation network with a regularization network to better utilize crowdsourcing annotations from non-experts.

2. Related work

Label fusion strategies are used to deal with inconsistencies among multiple annotations caused by annotator differences; examples include majority voting, local weighted voting [16], joint label fusion [17], and the STAPLE approach [14]. Among them, STAPLE statistically evaluates the reliability of the annotators and weights their opinions during the fusion process to obtain the optimal combination of annotations; it is the most common method for solving multi-expert problems in medical image analysis. Several variations of the STAPLE algorithm have emerged and are widely used [18,19]. For instance, Joskowicz et al. [20] proposed an automatic segmentation variability estimation method based on STAPLE that avoids convergence and parameter-setting issues specific to other methods.

In recent years, supervised models based on deep neural networks have been widely used in crowdsourcing learning because they can automatically learn the consistency and inconsistency information in crowdsourcing annotations [21,22]. Recently, Zhang et al. [15] proposed a model with two coupled CNNs, one of which uses a weight matrix to evaluate and update the segmentation predictions so that a reliable result can be obtained from multiple annotators. Ji et al. [23] proposed MRNet, which embeds the expertise level of individual raters as prior knowledge into the model and reconstructs the multi-rater gradings from coarse predictions during training. Compared with traditional label fusion methods, these approaches achieve considerable improvements. However, their performance on non-expert crowdsourcing segmentation requires further investigation.

3. Materials and method

In this section, we first describe the process of creating a non-expert crowdsourcing dataset of MCI images. We then define the crowdsourcing segmentation problem and introduce the proposed architecture in detail.

3.1 Crowdsourcing process

The purpose of our work is to investigate whether multiple non-experts can serve as an alternative to ophthalmologists on a simple segmentation task. In particular, we conduct tests on the segmentation of the OC and OD in MCI images. The high contrast of the OC and OD in MCI images allows annotators to recognize them without extensive experience. However, to guarantee the reliability of non-expert annotation, instruction by an ophthalmologist is provided before the formal annotation process. The four non-expert annotators who participated in crowdsourcing are listed below:

  • M1: non-ophthalmological medical student A
  • M2: non-ophthalmological medical student B
  • M3: computer vision researcher A
  • M4: computer vision researcher B
It is worth mentioning that in actual crowdsourcing deployment, non-experts are non-anonymous, which prevents malicious annotators from participating in crowdsourcing to a certain extent. The crowdsourcing process is shown in Fig. 2. After 10 minutes of simple instruction, the non-expert annotators use LabelMe [24] to annotate the OC and OD independently on the desensitized MCI images. The instruction covers imaging principles, the definitions of the optic cup and optic disc, and labeling techniques. Finally, four groups of non-expert annotations are obtained. For the evaluation phase, we propose to use the similarity between a small number of expert annotations and the non-expert annotations to assess the reliability of non-expert annotators (a minimal sketch of this screening step is given below). Unreliable non-expert annotators may be denied participation in the current annotation task or be retrained until reliable. In this work, the annotations obtained by the four non-experts after a single instruction session are generally satisfactory (as shown in Section 4.4), so we keep all non-expert annotators and their corresponding annotations. Furthermore, we investigate the performance level when certain non-expert annotators are removed in Section 4.6.
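As a concrete illustration of this screening step, the following sketch (hypothetical function and variable names; the 0.61 threshold simply marks the lower bound of the "substantial" agreement range) compares one annotator's masks against a small expert-labelled subset using Cohen's kappa:

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

def screen_annotator(annotator_masks, expert_masks, threshold=0.61):
    """Average Cohen's kappa between one non-expert and the expert on a small subset.

    annotator_masks, expert_masks: lists of HxW integer label maps
    (e.g., 0 = background, 1 = optic disc, 2 = optic cup).
    Returns the mean kappa and whether the annotator passes the threshold.
    """
    kappas = [cohen_kappa_score(a.ravel(), e.ravel())
              for a, e in zip(annotator_masks, expert_masks)]
    mean_kappa = float(np.mean(kappas))
    return mean_kappa, mean_kappa >= threshold
```

An annotator whose mean kappa falls below the chosen threshold would be retrained or excluded before their labels enter model training.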


Fig. 2. Schematic diagram of the process from raw data to crowdsourcing data with annotations.


In this work, an ophthalmologist annotates all images so that the performance of the proposed approach can be evaluated. Because non-experts and the expert differ in proficiency with the annotation tool, we do not compare their labeling efficiency.

3.2 Architecture

3.2.1 Overview

Crowdsourcing learning can be defined as the task of making reliable predictions after learning from multiple labels. Specifically, we consider a scenario in which we obtain fused images $\left \{\mathbf {x}_{n} \in \mathbb {X}^{W \times H \times C}\right \}_{n=1}^{N}$ ($W, H, C$ denote the width, height, and channels of the image) from MCI images, where $N$ represents the number of input images, together with manual segmentation labels annotated by non-expert masses $\left \{\mathrm {y}_{n}^{(m)} \in Y^{W \times H}\right \}_{n=1, \ldots, N}^{m\in S(\mathbf {x}_{n})}$, where $S(\mathbf {x}_{n})$ denotes the set of all non-expert annotators who label image $\mathbf {x}_{n}$. The problem of interest is to learn the unobserved true segmentation distribution $p(y \mid x)$, which in our task is taken to be the distribution free of non-expert preferences, from such a non-expert-labelled dataset $D = \{\mathbf {x}_{n},\mathrm {y}^{(m)}_{n} \}^{m\in S(\mathbf {x}_{n})}_{n=1,\ldots, N}$. During evaluation, we treat the expert's manual annotations as ground truth.
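A minimal sketch of how such a crowdsourced dataset $D$ could be organised for training (the array conventions and class layout are our assumptions, not part of the released data):

```python
import torch
from torch.utils.data import Dataset

class CrowdsourcedMCIDataset(Dataset):
    """Pairs each fused MCI image x_n with the masks y_n^(m) of all non-expert annotators."""

    def __init__(self, images, annotator_masks):
        # images: list of HxWx3 float arrays in [0, 1]
        # annotator_masks: list (one entry per image) of lists of HxW integer masks,
        #                  one mask per annotator in S(x_n)
        self.images = images
        self.annotator_masks = annotator_masks

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        x = torch.from_numpy(self.images[idx]).permute(2, 0, 1).float()     # C, H, W
        ys = torch.stack([torch.from_numpy(m).long()
                          for m in self.annotator_masks[idx]])              # M, H, W
        return x, ys
```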

In this work, we propose a segmentation architecture consisting of a segmentation network and a regularization network. The two networks are coupled, i.e., their parameters influence each other and are updated simultaneously during training. By optimizing the proposed loss function, the segmentation network is constrained by the crowdsourcing annotations and outputs predictions containing the least non-expert preference. The input of the proposed architecture is the fused image, i.e., the RGB image produced by channel stacking of the infrared, green, and blue laser images. The overall architecture is shown in Fig. 3.


Fig. 3. The architecture consists of two parts: (a) a segmentation network, parameterized by $\varphi$, that generates a segmentation probability distribution $\hat {\boldsymbol {p}}_{\varphi }$; (b) a regularization module, consisting of CNNs parameterized by $\gamma ^{(m)}$, which uses the input image to generate four pixel-wise confusion matrices $\widehat {\mathbf {A}}_{\gamma }^{(m)}$. During training, ($\varphi$, $\gamma$) are learned jointly by optimizing the total loss function. At test time, only the segmentation network is used for prediction.


3.2.2 Segmentation network based on UNet

The segmentation network uses UNet [25], parameterized by $\varphi$, to generate a predicted probability distribution denoted by $\hat {\boldsymbol {p}}_{\varphi }(x)\in [0,1]^{W \times H \times L}$, which represents the probability that each pixel falls into a given class. $L$ represents the number of classes; for example, $L=2$ if the ground truth of the input image is a binary map. UNet has become the preferred network for medical image segmentation because its U-shaped structure is effective at extracting both low- and high-level features. In addition, UNet performs well on small datasets and is thus suitable for our task. Other CNN-based segmentation networks could also be used [26–28]. However, the focus of our research is not the structure of the segmentation network; therefore, we choose UNet as the segmentation backbone for all methods in our work to ensure a fair comparison.
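For orientation, a much-reduced two-level U-Net sketch is shown below; the actual backbone is the five-level U-Net with 32–512 channels and instance normalization described in Section 4.3, so treat this only as an illustration of the interface (an image in, a per-pixel softmax over $L$ classes out):

```python
import torch
import torch.nn as nn

def conv_block(cin, cout):
    # Two 3x3 convolutions with instance normalization and ReLU.
    return nn.Sequential(
        nn.Conv2d(cin, cout, 3, padding=1), nn.InstanceNorm2d(cout), nn.ReLU(inplace=True),
        nn.Conv2d(cout, cout, 3, padding=1), nn.InstanceNorm2d(cout), nn.ReLU(inplace=True))

class TinyUNet(nn.Module):
    """Two-level toy U-Net; the paper's backbone has depth 5 with 32-512 channels."""

    def __init__(self, in_ch=3, n_classes=2, base=32):
        super().__init__()
        self.enc1 = conv_block(in_ch, base)
        self.enc2 = conv_block(base, base * 2)
        self.pool = nn.MaxPool2d(2)
        self.up = nn.ConvTranspose2d(base * 2, base, 2, stride=2)
        self.dec1 = conv_block(base * 2, base)
        self.head = nn.Conv2d(base, n_classes, 1)

    def forward(self, x):
        e1 = self.enc1(x)                                  # B, base, H, W
        e2 = self.enc2(self.pool(e1))                      # B, 2*base, H/2, W/2
        d1 = self.dec1(torch.cat([self.up(e2), e1], dim=1))
        return torch.softmax(self.head(d1), dim=1)         # B, L, H, W class probabilities
```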

3.2.3 Regularization network based on CNN

We now describe the regularization network, parameterized by $\gamma$, which aims to recover the true segmentation probability distribution by modeling the preferences of the annotators, as shown in Fig. 4. Preference is defined as each annotator's propensity to mis-segment, e.g., the propensity to label bright optic cup edges as optic disc. In our work, preferences are modeled by confusion matrices (CMs). When the true label $y$ is known, each element of the true CM is calculated as follows:

$$a^{(m)}(\mathbf{x}, w, h)_{i j}=p\left(y_{w h}^{(m)}=i \mid y_{w h}=j, \mathbf{x}\right),$$
where $a^{(m)}(\mathbf {x}, w, h)_{i j}$ denotes the $(i,j)^{th}$ element of the CM of annotator $m$ at pixel $(w,h)$ in input image $\mathbf {x}$, and $y_{wh}$ and $y_{wh}^{(m)}\in \{1,\ldots,L\}$ refer to the $(w,h)^{th}$ element of the true label and of annotator $m$'s label, respectively. In particular, it satisfies $p(y_{w h}^{(m)}=i\mid \mathbf {x})=\sum _{j=1}^{L}a^{(m)}(\mathbf {x}, w, h)_{i j}\cdot p(y_{wh}=j \mid \mathbf {x})$. Intuitively, the CM reflects the annotation preferences of the annotator. However, the true labels are unobservable, so the true CM cannot be calculated directly.


Fig. 4. The structure of regularization network.


Therefore, the regularization network is introduced to model the CMs $\{\widehat {\mathbf {A}}^{(m)}_{\gamma }(\mathbf {x}) \in [0,1]^{W \times H \times L \times L}\}_{m=1}^{M}$. Each CM is the output of a sub-network CNN whose input is the fused image $\mathbf {x}$. Each CNN, parameterized by $\gamma ^{(m)}$, has the same structure and is randomly initialized and independently optimized. $\hat {\boldsymbol {p}}_{\varphi }(x)$, output by the segmentation network, is the predicted probability distribution. The estimated probability distributions of the annotators are then obtained by element-wise multiplication of the CMs with the predicted probability distribution:

$$\hat{\boldsymbol{p}}_{\varphi, \gamma}^{(m)}(\mathbf{x})=\widehat{\mathbf{A}}_{\gamma}^{(m)}(\mathbf{x}) \cdot \hat{\boldsymbol{p}}_{\varphi}(\mathbf{x})$$
$$\begin{bmatrix} {p}_{1} \\ {p}_{2} \\ \vdots \\ {p}_{L} \\ \end{bmatrix} _{\varphi, \gamma} = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1L} \\ a_{21} & a_{22} & \cdots & a_{2L} \\ \vdots & \vdots & \ddots & \vdots \\ a_{L1} & a_{L2} & \cdots\ & a_{LL} \\ \end{bmatrix} _{\gamma} \times \begin{bmatrix} p_{1} \\ p_{2} \\ \vdots \\ p_{L} \\ \end{bmatrix} _{\varphi}$$
where the “$\cdot$” operator denotes element-wise matrix multiplication over the spatial dimensions $(W,H)$; that is, the $L\times L$ matrix-vector product of Eq. (3) is applied independently at each pixel.

It follows from Eq. (2) that when $\hat {\boldsymbol {p}}_{\varphi, \gamma }^{(m)}$ and $\widehat {\mathbf {A}}_{\gamma }^{(m)}$ are approximately true, the predicted probability distribution $\hat {\boldsymbol {p}}_{\varphi }$ containing the least noise is recovered. The CMs $\widehat {\mathbf {A}}_{\gamma }^{(m)}$ are optimized under the trace norm theory [15,29], while $\hat {\boldsymbol {p}}_{\varphi, \gamma }^{(m)}$ are encouraged to be as similar as possible to the annotations of the non-expert annotators by minimizing the segmentation loss function. In short, the optimal confusion matrices and estimated probability distributions are obtained under the joint optimization of the segmentation network and the regularization network, as detailed in the next section.
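The per-pixel matrix-vector product of Eqs. (2)-(3) can be written compactly with a batched einsum. The sketch below assumes the regularization CNN for annotator $m$ outputs a tensor of shape (B, L*L, H, W) that is normalized with a softmax so that each pixel-wise CM has columns summing to one; these implementation choices are assumptions on our part, not spelled out above:

```python
import torch

def apply_confusion_matrix(cm_logits, p_seg):
    """Estimate one annotator's probability map from the segmentation output (Eq. (2)).

    cm_logits: (B, L*L, H, W) raw outputs of the regularization CNN for annotator m
    p_seg:     (B, L, H, W)   segmentation probabilities p_hat_phi
    returns:   (B, L, H, W)   estimated annotator distribution p_hat_{phi,gamma}^(m)
    """
    B, L, H, W = p_seg.shape
    cm = cm_logits.view(B, L, L, H, W)      # (b, i, j, h, w): a pixel-wise L x L CM
    cm = torch.softmax(cm, dim=1)           # each column j sums to 1 over i
    # p^(m)_i(w, h) = sum_j a^(m)_ij(w, h) * p_j(w, h), applied at every pixel
    return torch.einsum('bijhw,bjhw->bihw', cm, p_seg)
```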

3.2.4 Loss function

Next, we describe how we jointly optimize the parameters of the segmentation network, $\varphi$, and the parameters of the regularization network, $\gamma$.

To encourage the estimated probability distributions with annotator preferences to be as close as possible to the actual annotations of the corresponding non-expert annotators, a cross-entropy loss could be used. However, in view of the imbalance of positive and negative samples in this segmentation task, we choose the focal loss [30,31] as the segmentation loss function. Specifically, the positive samples in the optic cup and optic disc segmentation task are the optic cup and optic disc of each image, while the negative samples are the background. The expert annotations show that the positive-sample area in the MCI images accounts for only about 1-2% of the entire image. The focal loss helps balance the samples during training and reduces the network's attention to the easily predicted background. We define the multi-class focal loss under the crowdsourcing task as:

$$\mathcal{L}_{seg}(\varphi, \gamma)= -\sum_{m=1}^{M} \sum_{c=1}^{L} y_{c}^{(m)}\,\epsilon\,(1-\hat{\boldsymbol{p}}_{c}^{(m)})^{\mu}\log(\hat{\boldsymbol{p}}_{c}^{(m)})$$
where $c$ denotes the class index and $y_{c}^{(m)}=1$ indicates that the one-hot vector $y^{(m)}$ specifies manual annotation class $c$ ($c=1,\ldots,L$). $\epsilon =1$ is the balancing weight and $\mu =2$ is the focusing parameter.
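A direct translation of Eq. (4) into PyTorch might look as follows (one-hot targets, $\epsilon=1$, $\mu=2$; averaging over pixels instead of summing is our choice for numerical scale and is not specified above):

```python
import torch
import torch.nn.functional as F

def crowd_focal_loss(p_annotators, targets, eps=1.0, mu=2.0, tiny=1e-7):
    """Multi-annotator focal loss of Eq. (4).

    p_annotators: list of M tensors (B, L, H, W), one estimated distribution per annotator
    targets:      (B, M, H, W) integer (long) labels from the M non-experts
    """
    loss = 0.0
    for m, p_m in enumerate(p_annotators):
        L = p_m.shape[1]
        y_onehot = F.one_hot(targets[:, m], num_classes=L).permute(0, 3, 1, 2).float()
        p_m = p_m.clamp(min=tiny, max=1.0)
        # the (1 - p)^mu factor down-weights pixels that are already predicted confidently
        loss = loss - (y_onehot * eps * (1.0 - p_m) ** mu * torch.log(p_m)).mean()
    return loss
```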

Next, we introduce the regularization term according to the trace norm theorem [15,29], which we restate below. Theorem. For the $(i, j)^{\text {th }}$ pixel in an input image $\mathbf {x}$, define the mean confusion matrix $\mathbf {A}^{*}:=\frac{1}{M}\sum _{m=1}^{M} \mathbf {A}^{(m)}$ and its estimate $\hat {\mathbf {A}}^{*}:=\frac{1}{M}\sum _{m=1}^{M} \hat {\mathbf {A}}^{(m)}$. If the annotators' segmentation probabilities are perfectly modelled for the input image, i.e., $\hat {\mathbf {A}}^{(m)} \hat {\boldsymbol {p}}_{\varphi }(\boldsymbol {x})=\mathbf {A}^{(m)} \boldsymbol {p}(\boldsymbol {x})\ \forall m=1, \ldots, M$, and the mean true confusion matrix $\mathbf {A}^{*}$ at a given pixel and its estimate $\hat {\mathbf {A}}^{*}$ satisfy $a_{k k}^{*}>a_{k j}^{*}$ for $j \neq k$ and $\hat {a}_{i i}^{*}>\hat {a}_{i j}^{*}$ for all $i \neq j$, where $k$ is the correct pixel class, then minimizing the total trace $\sum _{m=1}^{M}\operatorname {tr}(\hat {\mathbf {A}}^{(m)})$ recovers the true confusion matrices, i.e., $\hat {\mathbf {A}}^{(m)}=\mathbf {A}^{(m)}$ for $m=1,\ldots,M$.

The implication of the theorem is that, provided the trace of the CM is dominant, minimizing the trace in the operation of Eq. (2) drives the estimated CM of each annotator to match its true value. The trace of the CMs is optimized via the following loss function:

$$\mathcal{L}_{r e g}(\gamma)=\sum_{m=1}^{M} \operatorname{tr}\left(\widehat{\mathbf{A}}_{\gamma}^{(m)}\left(\mathbf{x}\right)\right)$$
where $\operatorname {tr}\left (\widehat {\mathbf {A}}\right )$ denotes the trace of $\widehat {\mathbf {A}}$.
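Correspondingly, Eq. (5) sums the traces of the pixel-wise CMs over all annotators; averaging the per-pixel trace over the image (so that the term is comparable with the $\lambda(L\times M)$ threshold of Eq. (7)) is an assumption of this sketch:

```python
import torch

def trace_regularizer(cm_list):
    """Sum over annotators of the spatially averaged trace of the pixel-wise CMs (Eq. (5)).

    cm_list: list of M tensors of shape (B, L, L, H, W), already column-normalized.
    """
    reg = 0.0
    for cm in cm_list:
        diag = torch.diagonal(cm, dim1=1, dim2=2)   # (B, H, W, L): a_kk at every pixel
        reg = reg + diag.sum(dim=-1).mean()         # sum over k, average over pixels
    return reg
```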

In general, minimizing the trace encourages the estimated annotators to be maximally unreliable, while minimizing the focal loss ensures fidelity to the observed non-expert annotations. The total loss is composed of the focal loss and the trace-norm term:

$$\mathcal{L}_{\text{total }}(\varphi,\gamma)=\mathcal{L}_{\text{seg }}(\varphi,\gamma)+\alpha \mathcal{L}_{\text{reg }}(\gamma)$$
where $\alpha$ denotes the weight of the regularization term; the optimal $\alpha$ is determined in the quantitative comparison. The parameters $\{ \varphi, \gamma \}$ are learned by minimizing this combined loss through stochastic gradient descent.

However, the parameters of the regularization network are randomly initialized, and directly minimizing the loss function would cause the trace to converge to zero, yielding degenerate results. Therefore, to encourage $\{ \hat {A}^{(1)},\ldots, \hat {A}^{(M)}\}$ to be diagonally dominant, we add a warm-up period via the following piece-wise loss function. We thus optimize the combined loss:

$$\mathcal{L}_{\text{total}}(\varphi, \gamma)=\left\{ \begin{array}{ll} \mathcal{L}_{\text{seg}}(\varphi, \gamma)-\alpha \mathcal{L}_{\text{reg}}(\gamma), & \mathcal{L}_{\text{reg}}(\gamma)\leq \lambda (L\times M) \\ \mathcal{L}_{\text{seg}}(\varphi, \gamma)+\alpha \mathcal{L}_{\text{reg}}(\gamma), & \mathcal{L}_{\text{reg}}(\gamma)>\lambda (L\times M) \end{array} \right.$$
where $\lambda$, set to 0.99, is the approximate fraction of the maximum total trace of the confusion matrices used as the switching threshold. This piece-wise function warms up the model by first maximizing the trace (the $-\alpha \mathcal{L}_{\text{reg}}$ branch); once diagonal dominance is reached, trace minimization (the $+\alpha \mathcal{L}_{\text{reg}}$ branch) is carried out.
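Reading Eq. (7) together with the warm-up description, one step of the combined loss might be computed as below; whether the sign switch is re-evaluated at every step or latched once the threshold is crossed is not specified, so this sketch simply re-evaluates it:

```python
def combined_loss(seg_loss, trace_term, n_classes, n_annotators, alpha=0.8, lam=0.99):
    """Piece-wise total loss of Eq. (7).

    seg_loss:   focal loss of Eq. (4)
    trace_term: trace regularizer of Eq. (5)
    While the summed trace is below lam * (L * M), the trace is maximized (warm-up,
    pushing the CMs towards diagonal dominance); afterwards it is minimized.
    """
    threshold = lam * n_classes * n_annotators
    if trace_term.item() <= threshold:
        return seg_loss - alpha * trace_term    # warm-up branch: maximize the trace
    return seg_loss + alpha * trace_term        # refinement branch: minimize the trace
```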

After the model is trained, the test segmentation results can be obtained using only the segmentation network without the regularization network.

4. Experiments

4.1 Dataset details

We reviewed and collected MCI images of 57 subjects from Shandong Provincial Hospital Affiliated to Shandong University to construct an in-house dataset (the dataset is used for a retrospective study, and subjects' personal information is not disclosed). Our dataset consists of 57 sets of images, each comprising BR, GR, and IR scanning laser images with wavelengths of 488, 515, and 820 nm, respectively. These images have 24-bit depth, are provided in JPG format, and have a typical size of $768\times 868$ pixels. The subjects, with an average age of 52 years (range 18 to 81 years), include normal cases as well as macular edema, branch retinal vein occlusion, vitreous opacity, and other conditions. For security and privacy reasons, the image region containing the subject's personal information is cropped and is not released to annotators or the public.

In our crowdsourcing dataset, each fused image corresponds to four non-expert annotations. To evaluate the performance of the various methods, expert annotations are used as ground truth only in the testing phase. This study was approved by the review board of Shandong Provincial Hospital Affiliated to Shandong University; because of the retrospective nature of the study, patient consent for inclusion was waived. To better verify the generalization of our method, experiments are performed with 3-fold cross-validation: all 57 cases are randomly divided into three non-overlapping subsets in advance, and in each fold two subsets are used as the training set and the remaining one as the test set. For fairness, hyperparameter selection is performed on a separate random split rather than on the aforementioned cross-validation splits. After division, the dataset is expanded by data augmentation (including random 90$^{\circ }$ rotation, random flipping, blurring, Gaussian noise, and color transformation), increasing the total volume to 855 images; a minimal augmentation sketch is given below. All experiments are performed on this augmented dataset for fair comparison.
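The listed augmentations must be applied consistently to the image and to all four annotator masks (geometric transforms) or to the image alone (photometric transforms). A minimal NumPy sketch follows; the noise level, blur width, and jitter range are placeholders, not the settings used in this work:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def augment(image, masks, rng=None):
    """Jointly augment one HxWx3 image (values in [0, 1]) and its list of HxW label masks."""
    rng = rng or np.random.default_rng()
    k = int(rng.integers(0, 4))                             # random multiple of 90 degrees
    image = np.rot90(image, k, axes=(0, 1)).copy()
    masks = [np.rot90(m, k).copy() for m in masks]
    if rng.random() < 0.5:                                  # random horizontal flip
        image = image[:, ::-1].copy()
        masks = [m[:, ::-1].copy() for m in masks]
    if rng.random() < 0.5:                                  # blur (placeholder sigma)
        image = gaussian_filter(image, sigma=(1.0, 1.0, 0.0))
    if rng.random() < 0.5:                                  # additive Gaussian noise
        image = np.clip(image + rng.normal(0.0, 0.02, image.shape), 0.0, 1.0)
    if rng.random() < 0.5:                                  # simple brightness/color jitter
        image = np.clip(image * rng.uniform(0.9, 1.1), 0.0, 1.0)
    return image, masks
```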

4.2 Performance metrics

Herein, we briefly describe several performance metrics that are commonly used in the segmentation of the OC and OD, and apply them in model training and testing. Various assessment metrics, namely, Dice Similarity Coefficient ($\mathcal {D}$), Mean Intersection Over Union ($\mathcal {M}$), Sensitivity (SEN), and Specificity (SPE) are used to evaluate the performance of the proposed method in segmenting the OC and OD relative to the ground truth (expert annotations). These performance metrics are defined as follows:

$$\mathcal{D}=\frac{2 \times T P}{2 \times T P+F P+F N}$$
$$\mathcal{M}=\frac{T P}{F P+F N+T P}$$
$$SPE=\frac{T N}{T N+F P}$$
$$SEN=\frac{T P}{T P+F N}$$
where TP, FP, FN, and TN stand for true positives, false positives, false negatives, and true negatives, respectively, in the evaluation confusion matrix (see Table 1). Furthermore, Cohen's Kappa Coefficient ($\mathcal {K}$) [32] is used to measure the similarity between non-expert annotations and expert annotations, statistically representing the inconsistency between them:
$$\mathcal{K}=\frac{N\times(TP+TN)-(A_1P_1+A_2P_2)}{N^2-(A_1P_1+A_2P_2)}$$
where $N$ is the total number of pixels, $A_1=TP+FN$ and $A_2=FP+TN$ are the actual (reference) class totals, and $P_1=TP+FP$ and $P_2=FN+TN$ are the predicted class totals.
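All five metrics can be computed from the same binary confusion counts; a sketch for one positive class (OC or OD) against background, without handling of empty masks:

```python
import numpy as np

def region_metrics(pred, gt):
    """Dice, IoU, sensitivity, specificity, and Cohen's kappa for one binary mask pair."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.sum(pred & gt)
    fp = np.sum(pred & ~gt)
    fn = np.sum(~pred & gt)
    tn = np.sum(~pred & ~gt)
    n = tp + fp + fn + tn
    dice = 2 * tp / (2 * tp + fp + fn)                      # Eq. (8)
    miou = tp / (tp + fp + fn)                              # Eq. (9)
    spe = tn / (tn + fp)                                    # Eq. (10)
    sen = tp / (tp + fn)                                    # Eq. (11)
    # Cohen's kappa (Eq. (12)), with A1, A2 the reference class totals and P1, P2 the predicted ones
    a1, a2, p1, p2 = tp + fn, fp + tn, tp + fp, fn + tn
    kappa = (n * (tp + tn) - (a1 * p1 + a2 * p2)) / (n ** 2 - (a1 * p1 + a2 * p2))
    return dice, miou, sen, spe, kappa
```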


Table 1. Confusion matrix for evaluation.

4.3 Implementation details

The proposed method is implemented in PyTorch. The segmentation network is a UNet with a depth of 5, whose layers have 32, 64, 128, 256, and 512 channels, and it uses instance normalization. The hyperparameters of the CNNs in the regularization module are the same as those of the UNet. We set $epoch = 60$ and $\alpha = 0.8$. Adam [33] is used as the optimizer. The learning rate varies with the piece of the loss function: $1\times 10^{-4}$ in the first piece and $1\times 10^{-7}$ in the second. The batch size is set to 4 during training. We use an NVIDIA Tesla V100 32 GB GPU for all experiments. Images input to the network are resized to $256\times 256\times 3$, and each channel is normalized to values between 0 and 1. All experimental results are reported as $Average \pm Standard\ deviation$ over three 3-fold cross-validations.
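Tying these pieces together, one training step might look like the sketch below. The helper functions refer to the earlier sketches (`crowd_focal_loss`, `trace_regularizer`, `combined_loss`) and are not part of any released code; the per-piece learning-rate change described above is omitted for brevity:

```python
import torch

def train_step(seg_net, reg_nets, optimizer, x, y_annotators, alpha=0.8):
    """One optimization step of the coupled segmentation / regularization networks.

    seg_net:      U-Net producing p_hat_phi of shape (B, L, H, W)
    reg_nets:     list of M CNNs, one per annotator, each producing (B, L*L, H, W)
    x:            (B, 3, H, W) fused MCI images
    y_annotators: (B, M, H, W) integer labels from the M non-experts
    """
    p_seg = seg_net(x)
    L = p_seg.shape[1]
    cm_list, p_annotators = [], []
    for net in reg_nets:
        logits = net(x)
        B, _, H, W = logits.shape
        cm = torch.softmax(logits.view(B, L, L, H, W), dim=1)            # pixel-wise CMs
        cm_list.append(cm)
        p_annotators.append(torch.einsum('bijhw,bjhw->bihw', cm, p_seg)) # Eq. (2)
    seg_loss = crowd_focal_loss(p_annotators, y_annotators)              # Eq. (4)
    reg_loss = trace_regularizer(cm_list)                                # Eq. (5)
    loss = combined_loss(seg_loss, reg_loss, n_classes=L,                # Eq. (7)
                         n_annotators=len(reg_nets), alpha=alpha)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```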

4.4 Performance of the proposed approach

To evaluate the quality of the non-expert annotations, we analyze their similarity to the expert annotations and the performance of networks trained on each of them alone. Specifically, we use Cohen's Kappa Coefficient ($\mathcal {K}$) to calculate the average similarity between non-expert and expert annotations on a random set of 20 instances; the similarities fall into either the substantial ($\mathcal {K}$: 0.61$\sim$0.80) or the almost perfect ($\mathcal {K}$: 0.81$\sim$1.00) range. As shown in Fig. 5, $\mathcal {K}$ is statistically proportional to the performance of the corresponding segmentation network.


Fig. 5. Comparison histogram of annotation similarity and segmentation accuracy between non-experts and an expert. Values are presented as $Average \pm Std$.


We consider that non-expert annotators can be effectively screened by evaluating their similarity to a small number of expert annotations, reducing the risk of malicious annotators participating in crowdsourcing learning. In this work, the average $\mathcal {K}$ value of the annotation subset generated by every non-expert annotator after one instruction session is greater than 0.81, indicating almost perfect agreement, so we retain all annotators and their corresponding annotations.

For the model to learn from non-experts and make reliable predictions, we optimize the regularization term and the segmentation loss. Figure 6 compares 1) simply minimizing the trace of the confusion matrix, 2) initializing the confusion matrix to a diagonal matrix, 3) learning a low-rank representation of the confusion matrix [15], and 4) the proposed piece-wise loss function. The proposed piece-wise loss function makes the regularization term first increase and then decrease, guaranteeing diagonal dominance while avoiding the degenerate convergence in which the trace tends to zero. Although the low-rank representation also guarantees diagonal dominance and lets the model's $\mathcal {D}$ reach a stable value faster, its small number of parameters limits its final performance. Figure 7 and Fig. 8 show the advantage of the focal loss in our model and the performance under different hyperparameters, respectively. As shown in Table 2, the proposed method outperforms networks trained on the annotations of any single non-expert in terms of the $\mathcal {D}$ and $\mathcal {M}$ metrics. Compared to annotator $M_2$, the non-expert most similar to the expert, our method achieves a 1.21% improvement in $\mathcal {M}$. This indicates that the proposed method learns from multiple non-expert annotators and outperforms the best individual non-expert. Table 2 also reports ablation experiments on the CMs, including 1) removing the confusion matrices (w/o CM) and 2) using a global confusion matrix (GCM) of shape $L\times L$ for each annotator. The ablation results demonstrate the effectiveness of the pixel-wise CMs used to model non-expert annotator preferences in crowdsourcing segmentation learning. To verify that non-expert crowdsourcing has the potential to replace experts in annotating large datasets, we additionally train expert networks on reduced data volumes, as shown in Table 2. Specifically, instances in the training set are randomly removed until the volume is reduced to 60%, 80%, and 90%. The results show that non-expert crowdsourcing with our method outperforms the expert who annotates only 80% of the data and is on a par with the expert who annotates 90% of the data.


Fig. 6. Left: curves of the regularization-term value during training under different regularization strategies. Right: curves of validation accuracy during training.



Fig. 7. Left: curves of different loss functions, scaled to the same interval by Min-Max normalization. Right: curves of validation accuracy during training.



Fig. 8. Curves of validation accuracy during training of our model for a range of hyperparameters. For our method, the scaling of the trace regularization varies in $[0, 0.3, 0.5, 0.8, 1.2, 1.5]$.



Table 2. Results of proposed method for OC and OD segmentation compared with the expert and non-experts.

4.5 Comparisons with state-of-the-arts

To demonstrate the advantage of the proposed method, we compare it with state-of-the-art (SOTA) methods on our task. We retrain the SOTA methods using their publicly released code with default parameters and the same training and test sets as ours. For a fair comparison, all methods use the U-Net architecture as the segmentation backbone. Table 3 quantitatively compares our framework with majority voting, the most commonly used label fusion method, and four SOTA multi-mask methods: STAPLE [14], spatial STAPLE [18], DHE [15], and MRNet [23].


Table 3. Results of proposed method for OC and OD segmentation compared with the SOTA methods.

As shown in Table 3, the proposed model achieves superior $\mathcal {D}$ and $\mathcal {M}$ performance compared with the SOTA multi-mask methods. The improvement is prominent for both OC and OD, with increases of 0.76% and 0.48% in the $\mathcal {M}$ metric over the current best method. Figure 9 shows several visualizations of predictions generated by each Annotator-Net and the SOTA methods. The distribution of red and blue dots reflects the error propensity induced by annotation preference; e.g., $M_3$-Net tends to under-segment the OD and $M_4$-Net tends to over-segment it. The predictions generated by our method show fewer and more uniformly distributed red and blue dots, indicating that our method is less affected by annotators with strong annotation preferences.


Fig. 9. Examples illustrating the performance of different methods. From left to right columns: (a) input image; (b) expert GT; (c) results by UNet using GT; (d)–(g) results by UNet using a single label ($M_1$, $M_2$, $M_3$, and $M_4$, respectively); (h) DHE; (i) MRNet and (j) proposed method. Red is FP for OD, blue is FN for OD, and the rest are TP or TN. Color viewer recommended.


4.6 Additional comparisons

In this section, we present a series of additional comparative results, including the number of parameters (#Parameters) and Floating Point Operations (FLOPs) of the proposed method and SOTA methods, ablation experiments of non-expert annotators, and the performance of the proposed method under different segmentation architectures.

When comparing methods, the number of parameters and the complexity of the model should be considered in addition to performance. Table 4 compares #Parameters, FLOPs, and training time on the full training set between our method and the SOTA methods mentioned in the previous section. The regularization network in the proposed method has only 0.1M parameters; more precisely, only 37.4k parameters are required to generate the confusion matrices for each non-expert annotator. Furthermore, the FLOPs and training time of the proposed method are only half those of MRNet, whose performance is closest to ours. The small model size will facilitate the application of the proposed model.


Table 4. Comparison of the number of parameters, FLOPs and training time between the proposed method and the SOTA methods.

In real crowdsourcing scenarios, the limited reliability of non-expert annotators makes it difficult to involve all of them in model training. In addition to fixing a similarity threshold, we propose retaining non-expert annotators proportionally, either 1) by selecting the annotators with the highest $\mathcal {K}$ values or 2) by selecting annotators at random. In Table 5, ablation studies of annotators are performed, including (i) the 75% of annotators with the top $\mathcal {K}$ values, (ii) a random 75% of annotators, (iii) the 50% of annotators with the top $\mathcal {K}$ values, (iv) a random 50% of annotators, and (v) all annotators. $\mathcal {D}_{OC}$ and $\mathcal {D}_{OD}$ denote the Dice coefficients of optic cup and optic disc segmentation, respectively. We note that $\mathcal {D}_{OD}$ improves by 0.36% when $M_4$, the annotator with the lowest $\mathcal {K}$ value, is removed (Table 5, (i) vs. (v)). However, when the annotator $M_2$ with the highest $\mathcal {K}$ value is removed at random, both $\mathcal {D}_{OC}$ and $\mathcal {D}_{OD}$ decrease substantially (Table 5, (i) vs. (iv)). Therefore, provided a small subset of expert annotations is available, using the similarity measure to screen reliable non-expert annotators offers a way to improve the performance of crowdsourcing learning. Compared to a fixed similarity threshold, retaining annotators proportionally appears to be more effective.


Table 5. Results of ablation experiments of annotators for OC and OD segmentation.

The architecture of the segmentation network is critical to model performance. Table 6 compares $\mathcal {D}_{OC}$ and $\mathcal {D}_{OD}$ of the proposed method under different advanced segmentation architectures, including CENet [34], UNet with a ResNet34 backbone [35], UNet++ [27], and our UNet. As listed in Table 6, UNet++ and ResUNet improve $\mathcal {D}_{OC}$ and $\mathcal {D}_{OD}$ by 0.17% and 0.24%, respectively, but at the cost of 0.6M and 13.2M additional parameters. It is worth mentioning that we do not restrict the choice of segmentation architecture, because the regularization network can be applied to any advanced segmentation architecture. In clinical applications, we recommend trading off the number of parameters against the level of performance.


Table 6. Results of the proposed method for OC and OD segmentation using different segmentation architectures.

5. Conclusion and future work

In this work, we explore the effectiveness of non-expert crowdsourcing for the segmentation of the OC and OD in MCI images. To this end, we recruit four non-experts to annotate OC and OD segmentation masks on MCI images and construct a crowdsourcing dataset. To benefit from crowdsourcing, we propose a regularized segmentation model: the segmentation and regularization networks are optimized simultaneously by minimizing our proposed piece-wise loss function, trading off non-expert errors related to annotation preferences and recovering reliable predictions. Extensive experiments demonstrate the superior performance of our method on the crowdsourcing segmentation task of the OC and OD using MCI images. In addition, we find that crowdsourcing learning with our method is comparable to an expert who annotates 90% of the data; in other words, the advantages of crowdsourcing emerge when non-experts annotate far more images than experts do. We believe that crowdsourcing learning will improve the efficiency of constructing reliable large medical datasets. In the future, we will continue to explore the utility of crowdsourcing in medical image analysis to reduce the pressure on clinicians, improve the efficiency of building large datasets, and increase public participation in health and wellness.

Funding

National Natural Science Foundation of China (61773246, 81871508); Taishan Scholar Project of Shandong Province (TSHW201502038); Natural Science Foundation of Shandong Province (ZR2018ZB0419).

Acknowledgement

We thank Prof. Qingliang Zeng for his discussions and suggestions on this work.

Disclosures

The authors declare no conflicts of interest.

Data availability

The data underlying the results presented in this paper are not publicly available at this time, but may be obtained from the authors upon reasonable request.

References

1. L. He, C. Chen, Z. Yi, X. Wang, J. Liu, and H. Zheng, “Clinical application of multicolor imaging in central serous chorioretinopathy,” Retina 40(4), 743–749 (2020). [CrossRef]  

2. C. Zimmer, D. Kahn, R. Clayton, P. Dugel, and K. Freund, “Innovation in diagnostic retinal imaging: multispectral imaging,” Retina Today 9, 94–99 (2014).

3. J. Lian, Y. Zheng, P. Duan, W. Jiao, B. Zhao, Y. Ren, and D. Shen, “Measuring spectral inconsistency of multispectral images for detection and segmentation of retinal degenerative changes,” Sci. Rep. 7(1), 11288 (2017). [CrossRef]  

4. A. C. Tan, M. Fleckenstein, S. Schmitz-Valckenberg, and F. G. Holz, “Clinical application of multicolor imaging technology,” Ophthalmologica 236(1), 8–18 (2016). [CrossRef]  

5. G. Lim, Y. Cheng, W. Hsu, and M. L. Lee, “Integrated optic disc and cup segmentation with deep learning,” in 2015 IEEE 27th International Conference on Tools with Artificial Intelligence (ICTAI), (IEEE, 2015), pp. 162–169.

6. K. Wazny, “Crowdsourcing ten years in: A review,” J. Global Health 7(2), 020601 (2017). [CrossRef]  

7. G. Paolacci, J. Chandler, and P. G. Ipeirotis, “Running experiments on amazon mechanical turk,” Judgment and Decision making 5, 411–419 (2010).

8. S. Vermicelli, L. Cricelli, and M. Grimaldi, “How can crowdsourcing help tackle the covid-19 pandemic? An explorative overview of innovative collaborative practices,” R&D Manag. 51(2), 183–194 (2021). [CrossRef]  

9. P. G. Ipeirotis, F. Provost, V. S. Sheng, and J. Wang, “Repeated labeling using multiple noisy labelers,” Data Min. Knowl. Discov. 28(2), 402–441 (2014). [CrossRef]  

10. D. Mitry, T. Peto, S. Hayat, P. Blows, J. Morgan, K.-T. Khaw, and P. J. Foster, “Crowdsourcing as a screening tool to detect clinical features of glaucomatous optic neuropathy from digital photography,” PLoS One 10(2), e0117401 (2015). [CrossRef]  

11. L. Maier-Hein, S. Mersmann, D. Kondermann, S. Bodenstedt, A. Sanchez, C. Stock, H. G. Kenngott, M. Eisenmann, and S. Speidel, “Can masses of non-experts train highly accurate image classifiers?” in International Conference on Medical Image Computing and Computer-assisted Intervention, (Springer, 2014), pp. 438–445.

12. P. Mehta, V. Sandfort, D. Gheysens, G.-J. Braeckevelt, J. Berte, and R. M. Summers, “Segmenting the kidney on ct scans via crowdsourcing,” in 2019 IEEE 16th International Symposium on Biomedical Imaging (ISBI 2019), (IEEE, 2019), pp. 829–832.

13. H. Song, M. Kim, D. Park, Y. Shin, and J.-G. Lee, “Learning from noisy labels with deep neural networks: a survey,” IEEE Transactions on Neural Networks Learn. Syst. (2022).

14. S. K. Warfield, K. H. Zou, and W. M. Wells, “Simultaneous truth and performance level estimation (staple): an algorithm for the validation of image segmentation,” IEEE Trans. Med. Imaging 23(7), 903–921 (2004). [CrossRef]  

15. L. Zhang, R. Tanno, M.-C. Xu, C. Jin, J. Jacob, O. Ciccarelli, F. Barkhof, and D. C. Alexander, “Disentangling human error from the ground truth in segmentation of medical images,” arXiv preprint arXiv:2007.15963 (2020).

16. X. Artaechevarria, A. Muñoz-Barrutia, and C. Ortiz-de Solórzano, “Efficient classifier generation and weighted voting for atlas-based segmentation: Two small steps faster and closer to the combination oracle,” Proc. SPIE 6914, 69141W (2008). [CrossRef]  

17. H. Wang, J. W. Suh, S. R. Das, J. B. Pluta, C. Craige, and P. A. Yushkevich, “Multi-atlas segmentation with joint label fusion,” IEEE Trans. Pattern Anal. Mach. Intell. 35(3), 611–623 (2013). [CrossRef]  

18. A. J. Asman and B. A. Landman, “Formulating spatially varying performance in the statistical fusion framework,” IEEE Trans. Med. Imaging 31(6), 1326–1336 (2012). [CrossRef]  

19. A. Akhondi-Asl, L. Hoyte, M. E. Lockhart, and S. K. Warfield, “A logarithmic opinion pool based staple algorithm for the fusion of segmentations with associated reliability weights,” IEEE Trans. Med. Imaging 33(10), 1997–2009 (2014).

20. L. Joskowicz, D. Cohen, N. Caplan, and J. Sosna, “Automatic segmentation variability estimation with segmentation priors,” Med. Image Anal. 50, 54–64 (2018). [CrossRef]  

21. R. Kuga, A. Kanezaki, M. Samejima, Y. Sugano, and Y. Matsushita, “Multi-task learning using multi-modal encoder-decoder networks with shared skip connections,” in Proceedings of the IEEE International Conference on Computer Vision Workshops, (2017), pp. 403–411.

22. S. Liu, E. Johns, and A. J. Davison, “End-to-end multi-task learning with attention,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (2019), pp. 1871–1880.

23. W. Ji, S. Yu, J. Wu, K. Ma, C. Bian, Q. Bi, J. Li, H. Liu, L. Cheng, and Y. Zheng, “Learning calibrated medical image segmentation via multi-rater agreement modeling,” in Proceedings of CVPR, (2021), pp. 12341–12351.

24. B. C. Russell, A. Torralba, K. P. Murphy, and W. T. Freeman, “Labelme: a database and web-based tool for image annotation,” Int. J. Comput. Vis. 77(1-3), 157–173 (2008). [CrossRef]  

25. O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical Image Computing and Computer-assisted Intervention (Springer, 2015), pp. 234–241.

26. K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016), pp. 770–778.

27. Z. Zhou, M. M. R. Siddiquee, N. Tajbakhsh, and J. Liang, “Unet++: A nested u-net architecture for medical image segmentation,” in Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support (Springer, 2018), pp. 3–11.

28. O. Oktay, J. Schlemper, L. L. Folgoc, M. Lee, M. Heinrich, K. Misawa, K. Mori, S. McDonagh, N. Y. Hammerla, B. Kainz, B. Glocker, and D. Rueckert, “Attention u-net: Learning where to look for the pancreas,” arXiv preprint arXiv:1804.03999 (2018).

29. R. Tanno, A. Saeedi, S. Sankaranarayanan, D. C. Alexander, and N. Silberman, “Learning from noisy labels by regularized estimation of annotator confusion,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, (2019), pp. 11244–11253.

30. T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” in Proceedings of the IEEE international conference on computer vision, (2017), pp. 2980–2988.

31. J. Chang, X. Zhang, M. Ye, D. Huang, P. Wang, and C. Yao, “Brain tumor segmentation based on 3d unet with multi-class focal loss,” in 2018 11th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI), (IEEE, 2018), pp. 1–5.

32. M. L. McHugh, “Interrater reliability: the kappa statistic,” Biochem. Med. 22, 276–282 (2012). [CrossRef]  

33. D. P. Kingma and J. Ba, “Adam: a method for stochastic optimization,” arXiv preprint arXiv:1412.6980 (2014).

34. Z. Gu, J. Cheng, H. Fu, K. Zhou, H. Hao, Y. Zhao, T. Zhang, S. Gao, and J. Liu, “Ce-net: context encoder network for 2D medical image segmentation,” IEEE Trans. Med. Imaging 38(10), 2281–2292 (2019). [CrossRef]  

35. K. He, X. Zhang, S. Ren, and J. Sun, “Identity mappings in deep residual networks,” in European Conference on Computer Vision (Springer, 2016), pp. 630–645.


