
Boosting few-shot confocal endomicroscopy image recognition with feature-level MixSiam

Open Access

Abstract

As an emerging technology for the early diagnosis of gastrointestinal diseases, confocal laser endomicroscopy lacks large-scale, well-annotated data, which makes learning discriminative semantic features a major challenge. How, then, should we learn representations with no labels, or only a few? In this paper, we propose feature-level MixSiam, a method built on the traditional Siamese network that learns discriminative features of probe-based confocal laser endomicroscopy (pCLE) images for gastrointestinal (GI) tumor classification. The proposed method comprises two stages: self-supervised learning (SSL) and few-shot learning (FS). First, in the self-supervised learning stage, the proposed feature-level mixing approach introduces more task-relevant information via regularization, helping the traditional Siamese structure adapt to the large intra-class variance of the pCLE dataset. Then, in the few-shot learning stage, we adopt the pre-trained model obtained through self-supervised learning as the base learner in the few-shot learning pipeline, enabling the feature extractor to learn richer and more transferable visual representations that generalize rapidly to other pCLE classification tasks when labeled data are limited. The proposed method is evaluated on two disjoint pCLE gastrointestinal image datasets. Under the linear evaluation protocol, feature-level MixSiam outperforms the baseline by 6% (Top-1) and the supervised model by 2% (Top-1), demonstrating the effectiveness of the proposed feature-level mixing method. Furthermore, the proposed method outperforms previous baseline methods on the few-shot classification task, which can help improve the classification of pCLE images for different stages of tumor development when large-scale annotated data are unavailable.

© 2023 Optica Publishing Group under the terms of the Optica Open Access Publishing Agreement

1. Introduction

Gastrointestinal (GI) cancer, which includes stomach, intestinal, and esophageal cancer, is among the most common malignant tumors in the world, with higher incidence and mortality rates than lung cancer, seriously affecting people’s health and lives [1]. Cancer detection, diagnosis, and early intervention are critical for improved survival rates [2]. Probe-based confocal laser endomicroscopy (pCLE) is a new type of endoscope that employs laser scanning confocal imaging; it has been widely utilized and has achieved satisfactory results in various preclinical and clinical studies across different organs and tissue types, including the gastrointestinal tract [3], bronchi [4], breast [5], and urinary tract [6]. The pCLE probe consists of a flexible fiber optic bundle and micro-optics that provide microscopic imaging of in situ and in vivo tissue features as a pseudo-color video sequence, enabling real-time optical biopsies in a clinical setting [7]. However, as a new imaging technology, the morphology and content of pCLE images are significantly different from white-light endoscopic images and whole-slide images. Moreover, diagnostic results have been observed to differ significantly among trained professionals, which limits the diagnostic capability of pCLE. It is therefore essential to develop a computer-aided diagnostic (CAD) system that can recognize pCLE images and assist physicians in making more precise and prompt decisions.

Deep learning has recently made unprecedented advances in various vision tasks, including image classification [8], semantic segmentation [9], and medical image analysis [10]. This success has primarily been attributed to training deep models on large-scale labeled datasets to learn the general patterns of the data. However, as an emerging technique for early aided diagnosis of gastrointestinal diseases, pCLE lacks large-scale data, so training supervised deep models for pCLE image recognition is a significant challenge. This has inspired research that combines unsupervised learning with few-shot learning, focusing on models pre-trained on limited unlabeled datasets that can generalize quickly to downstream few-shot tasks, as shown in Fig. 1.

Fig. 1. The proposed method comprises two stages: a pre-trained model is obtained via self-supervised learning and assessed with a linear evaluation task, and the pre-trained embedding model is then applied to downstream few-shot learning.

As a special case of unsupervised learning, self-supervised learning generates annotations by mining relationships within large-scale unlabeled data (such as ImageNet) to train high-capacity deep neural networks [11]. Generally, self-supervised learning explores visual representations through various designed surrogate tasks (e.g., rotation [12] and jigsaw [13]). Other self-supervised methods adopt a contrastive loss, which increases the similarity between different augmented views of the same instance (positive pairs) and decreases the similarity between different instances (negative pairs) to learn image representations that are invariant to data augmentation [14,15]. In addition, there are methods [16,17] that predict the representation of one augmented view from another view of the same image, without negative pairs. In this paper, we adopted the pre-trained model obtained from limited unlabeled data through self-supervised learning as the base learner of the few-shot learning pipeline, enabling quick generalization to other pCLE tasks.

Recent popular approaches treat the few-shot classification problem as a meta-learning task, dividing the dataset into different subsets to learn how to adjust the parameters of a pre-trained embedding model in response to changes in the data [18]. Moreover, these methods [19,20] often rely heavily on an effective pre-trained embedding model to achieve consistent performance gains.

In this paper, we propose Feature-Level MixSiam, a novel self-supervised image representation learning method based on the traditional Siamese network, which learns discriminative representations from a limited set of pCLE gastrointestinal disease images and combines them with metric-based meta-learning to solve the few-shot classification problem for pCLE gastrointestinal disease images. The proposed method contains two stages: a self-supervised learning stage and a few-shot learning stage. In the self-supervised learning stage, two augmented views of an instance are fed into the backbone to obtain augmented features, and we then perform a feature-level mixture operation on these features to obtain two mixed features, inspired by [21]. Beyond the Siamese contrastive learning framework, we expect the representation of one view to predict the mixed representation of the other view of the same instance. In this way, more task-relevant information is introduced through a regularized approximation, which helps cope with the large intra-class variance of the pCLE dataset and avoids the risk of over-fitting to the minimum mutual information between views. On the SSL_pCLE gastrointestinal disease image dataset (Fig. 2), the proposed method improves over the baseline by 6% (Top-1) and over the supervised model by 2% (Top-1) under linear evaluation, which demonstrates the effectiveness of the proposed feature-level mixing strategy. In the few-shot learning stage, we adopt the pre-trained embedding model from stage one as a strong foundation and use metric-based meta-learning to fine-tune the model. In our experiments on the FS_pCLE gastrointestinal disease image dataset (Fig. 3), the method based on the pre-trained model performs significantly better than the baseline for few-shot image classification, further demonstrating that Feature-Level MixSiam has excellent transferability. The proposed method shows promise as a quantitative tool for assisting pathologists in the diagnosis of disease.

Fig. 2. A typical example for each class in the SSL_pCLE dataset.

Fig. 3. A typical example for each class in the FS_pCLE dataset. Due to limited space, only some classes are visualized.

The contributions of our work are as follows:

  • 1. We propose Feature-Level MixSiam, a novel self-supervised learning method for image representation based on the traditional Siamese structure, which incorporates more task-relevant information through feature-level mixing to provide robust, discriminative representations for downstream tasks.
  • 2. We employ the pre-trained model obtained by self-supervised learning as the base learner of the few-shot learning pipeline, improving the latter’s adaptability to new classes with few samples and thereby achieving accurate classification of gastrointestinal disease images.
  • 3. On the SSL_pCLE dataset, we demonstrate that Feature-Level MixSiam achieves competitive performance compared with existing state-of-the-art methods, and we obtain comparable performance on the FS_pCLE dataset for the few-shot classification task. In addition, we explore the impact of multiple multi-layer perceptrons (MLPs) on the classification performance of self-supervised learning.
The rest of the paper is structured as follows: Section 2 briefly reviews related work on self-supervised learning, few-shot learning, and pCLE classification. Section 3 presents the proposed method. Section 4 reports experiments on two gastrointestinal disease datasets that validate the effectiveness of the proposed method and exhibit its potential clinical value. Finally, conclusions are given in Section 5.

2. Related works

2.1 Self-supervised learning

Self-supervised learning (SSL) can be considered a particular branch of unsupervised learning that learns powerful representations directly from the data itself without the need for a large-scale labeled dataset [11]. One of the most common SSL strategies is to design various pretext tasks, such as image jigsaw [13], patch relative position [22,23], image inpainting [24], image colorization [25,26], and image rotation [12,27]. Contrastive learning has also been used to acquire invariant representations and achieve state-of-the-art performance [14,15,28]. Contrastive learning methods are typically based on instance-wise discrimination, which reduces the distance between different views of the same instance and increases the distance between different instances, so that like attracts and unlike repels [14,29]. There has also been growing interest in non-contrastive methods [16,17,30,31] that rely only on two views generated from the same instance (no negative instances are involved); several techniques, including stop-gradient, cross-correlation matrices, and whitening, have been proposed to prevent model collapse, all of which have shown promising results.

Beyond natural images, self-supervised learning is also widely used for medical images. Before contrastive learning, pretext tasks mainly consisted of jigsaw puzzles [32,33] or reconstruction [34,35]. In addition, Xie et al. [36] proposed self-supervised learning based on a triplet loss for nuclei images. Building on image reconstruction as a pretext task, Haghighi et al. [37] developed a branch for classifying high-level features as anatomical patterns. Taleb et al. [38] proposed 3D CPC (Contrastive Predictive Coding) to learn features of 3D medical images through contrastive learning. MoCo [15] was used by Sowrirajan et al. [39] to classify chest x-ray images. Some methods [40,41] incorporated contrastive learning with image reconstruction to enhance the performance of contrastive learning. However, existing contrastive and non-contrastive methods fail to effectively tackle the small inter-class variance and large intra-class variance of the pCLE dataset. We argue that the aforementioned methods are designed for large datasets and are unsuitable for scenarios involving small amounts of pCLE data. Therefore, our method follows the principles of non-contrastive learning and introduces more task-relevant information through feature-level feature mixing to avoid over-fitting to the minimum mutual information between views.

2.2 Few-shot learning

Few-shot learning, an active subset of machine learning, learns a robust embedding representation from base classes and then applies it to classes with only a few labeled training samples, enabling the classification of unseen data from new classes [42]. Many methods have recently been developed for few-shot classification, grouped into two major categories: metric-based meta-learning methods and gradient-based meta-learning methods. Metric-based meta-learning methods [43–47] generally learn to compare data samples; for few-shot classification, they classify test samples based on the similarity between test and training samples. Gradient-based meta-learning methods [48–50] consist of a base learner and a meta-learner, where the base learner is first trained on many samples to obtain a pre-trained embedding; it is then quickly adapted to previously unseen tasks with only a small number of samples through gradient updates from the meta-learner, enabling the model to adapt to new learning tasks with limited training data without over-fitting [18]. Finn et al. [49] proposed MAML, which learns a general initialization by optimizing second-order gradients so that the model converges easily when faced with new tasks. In contrast to MAML, REPTILE [51] is a first-order gradient-based meta-learning algorithm that applies multiple first-order gradient steps on each task to update the parameters of the base learner, but it still operates in a high-dimensional parameter space. To overcome over-fitting caused by the high-dimensional parameter space, LEO [50] learns a low-dimensional embedding of the model’s high-dimensional parameter space and performs gradient descent in that low-dimensional space. Recent studies [18,52] have combined self-supervision with few-shot learning; the key idea is to train an effective pre-trained model through self-supervised learning and quickly adapt it to previously unseen classes. In this paper, the proposed method obtains the pre-trained embedding model for few-shot learning without relying on any supervised signal.

2.3 pCLE classification

Existing work on pCLE image classification aims to improve classification accuracy by learning discriminative representations of the images and falls into two main categories: unimodal methods [53–58] and multimodal methods [7,59–61]. In unimodal methods, discriminative features are learned from pCLE images only. Andre et al. [53,54] proposed using SIFT (Scale-Invariant Feature Transform) and nearest-neighbor methods for pCLE image retrieval. To improve the robustness of SIFT feature matching, SIFT features based on multiscale detection operators were exploited in [56] to calculate the similarity between bag-of-words (BoW) features. Andre et al. [57] introduced attribute information to characterize the structure of cells and gain more visual semantic information. By learning query-specific schemes, Tafresh et al. [58] extracted RoIs (Regions of Interest) and relevant subsequences from videos. With the rapid development of CNNs, Gu et al. [55] proposed an end-to-end weakly supervised method that unifies feature learning, global diagnosis, and local detection and achieves higher performance in global diagnosis and local detection.

In general, multimodal methods enhance the learning of discriminative features by fusing various forms of data. The pCLE multimodal classification methods [60,61] pair pCLE mosaics (obtained by stitching multiple pCLE images using SIFT, HoG (Histogram of Oriented Gradients), etc.) with histology images and then learn discriminative feature embeddings in a supervised manner to achieve accurate classification of pCLE images. However, because images of different modalities are difficult to pair, Gu et al. [59] proposed an unsupervised graph learning method that learns the latent relationships between pCLE mosaics and histology images to obtain discriminative features. Gu et al. [7] further proposed using Siamese networks to learn latent features while considering both intra-modal and inter-modal consistency, enabling the model to learn more robust representations. Although multimodal methods can improve classification accuracy, problems remain, such as inaccurate pairing and the difficulty of obtaining data from different modalities. In this paper, we propose combining self-supervised learning with few-shot learning to solve the pCLE image classification problem. To the best of our knowledge, this may be the first time this approach has been applied to the pCLE image classification task.

3. Method

Since this work involves two disjoint datasets, one of which contains only a few samples per class, we adopt few-shot learning to handle it. Because few-shot learning is highly dependent on an effective pre-trained embedding network, we divide this work into two stages. First, for SSL_pCLE dataset classification, we adopt self-supervised learning. The pre-trained model obtained through self-supervised learning is then used as the base learner to handle the few-shot classification problem.

3.1 Self-supervised learning stage

As shown in Fig. 4, we propose the Feature-Level MixSiam method based on SimSiam. SimSiam [17] adopts a Siamese network to measure the similarity between different augmented views of the same instance without negative samples. Specifically, two randomly augmented views are fed into the encoder to obtain an embedding pair, and each embedding is then fed into the predictor to predict the other embedding.

Fig. 4. The structure of the proposed method. Two augmented views are fed into the backbone. In addition to the original structure, feature mixing and the stop-gradient operation are used to obtain mixed embedding pairs. The similarity between the mixed embeddings and the outputs of the prediction MLP from the original structure is then maximized. The final loss is a weighted sum of this similarity term and the loss of the original structure.

Formally, for each training instance $x$, two randomly augmented instances $x_1$ and $x_2$ are fed into the encoder $f$ to obtain $z_1$ and $z_2$, respectively. Typically, the encoder $f$ is a classical classification model (e.g., ResNet18 or ResNet50 [8]) followed by an MLP. The weights of encoder $f$ are shared between the two branches, formulated as follows:

$$\begin{aligned} z_1 &= f(x_1) \\ z_2 &= f(x_2) \end{aligned}$$
Next, the predictor $h$ maps each embedding so that it can predict the other embedding,
$$\begin{aligned} p_1 &= h(z_1) \\ p_2 &= h(z_2) \end{aligned}$$
Then, the distance between the two views is measured by the negative cosine similarity,
$$\begin{aligned}\mathcal{D}\left(p_{1}, z_{2}\right)&={-}\frac{p_{1}}{\left\|p_{1}\right\|_{2}} \cdot \frac{z_{2}}{\left\|z_{2}\right\|_{2}} \\ \mathcal{D}\left(p_{2}, z_{1}\right)&={-}\frac{p_{2}}{\left\|p_{2}\right\|_{2}} \cdot \frac{z_{1}}{\left\|z_{1}\right\|_{2}} \end{aligned}$$
where $\|\cdot \|_2$ denotes the $L_2$ norm. Moreover, only one branch performs the gradient update, which prevents the model from collapsing. The distance between the two views is minimized with the following loss function:
$$\mathcal{L}=\frac{1}{2} \mathcal{D}\left(p_{1}, \text{stopgrad}\left(z_{2}\right)\right)+\frac{1}{2} \mathcal{D}\left(p_{2}, \text{stopgrad}\left(z_{1}\right)\right)$$
where $\text{stopgrad}(\cdot)$ indicates that the gradient is not propagated through its argument.
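For concreteness, a minimal PyTorch sketch of this objective is given below. It is a schematic rather than the authors' released code; encoder and predictor stand in for $f$ and $h$ above.

import torch
import torch.nn.functional as F

def neg_cosine(p, z):
    # D(p, z) in Eq. (3), averaged over the batch; z is detached to implement stopgrad(z)
    return -F.cosine_similarity(p, z.detach(), dim=-1).mean()

def simsiam_loss(encoder, predictor, x1, x2):
    z1, z2 = encoder(x1), encoder(x2)        # Eq. (1)
    p1, p2 = predictor(z1), predictor(z2)    # Eq. (2)
    # symmetric loss of Eq. (4)
    return 0.5 * neg_cosine(p1, z2) + 0.5 * neg_cosine(p2, z1)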

As shown in Fig. 5, pCLE data typically exhibit small inter-class variance and large intra-class variance in visual appearance. However, because this method considers only the mutual information between two augmented views generated from the same instance, it risks over-fitting to the mutual information between views, which may reduce its ability to cope with intra-class variation. To address this issue, we introduce our method below.

Fig. 5. Typical examples of pCLE images from the SSL_pCLE dataset. Although the first and third images belong to the same class, they differ greatly in visual appearance, whereas the second and third images belong to different classes but look similar.

To enrich the latent feature representation, we introduce feature-level feature mixing and increase the variance of the features through regularization to boost the diversity among views, thereby improving the robustness of the model. Unlike Mixup [62] or MixSiam [63], which directly mix images (and labels) for training, our method directly mixes the embeddings output by the encoder. In particular, this is equivalent to introducing more potentially hard samples and thus obtaining more discriminative features.

For a given image $x$, the positive embeddings $z_1$ and $z_2$ are obtained from the encoder. We define $z_{1mix}$ and $z_{2mix}$ as the embeddings obtained by feature-level mixing of $z_1$ and $z_2$, respectively, calculated as:

$$\begin{aligned} {z}_{1mix} &=\lambda_{mix} z_{1}+\left(1-\lambda_{mix}\right) z_{2}\\{z}_{2mix} &=\lambda_{mix} z_{2}+\left(1-\lambda_{mix}\right) z_{1} \end{aligned}$$
where $\lambda _{mix}$ is the mixture weight and, as in Mixup [62], the two weights sum to 1. More importantly, we require that the similarity between $z_{1mix}$ and $p_2$ does not exceed that between $z_1$ and $p_2$, namely $z_{1mix} \times p_2 \leqslant z_1 \times p_2$. In addition, we trained SimSiam on the SSL_pCLE dataset and found that the cosine similarity between $z_2$ and $p_2$ (equivalent to $z_2 \times p_2$; mean 0.95091 on the SSL_pCLE training set) is greater than that between $z_1$ and $p_2$ (mean 0.90649), namely $z_2 \times p_2 \geqslant z_1 \times p_2$. The mixed similarity is calculated as follows:
$$\begin{aligned}S_{mix}&=z_{1mix} \times p_2=\left(1-\lambda_{\text{mix }}\right) \cdot\left(z_2 p_2-S\right)+S \\ &\leq z_1 \times p_2 = S \end{aligned}$$
Since the similarity values $S_{mix}$ and $S$ lie in $[-1,1]$, ensuring $S_{mix} \leqslant S$ requires $\lambda _{mix} \geqslant 1$ so that $(1-\lambda _{mix}) \cdot \left (z_2 p_2-S\right ) \leqslant 0$. We therefore draw $\lambda _{mix}$ from a beta distribution shifted by 1, $\lambda _{mix} \sim Beta(\alpha )+1$, so that $\lambda _{mix}$ lies in $(1, 2)$. We tested several values of $\alpha$ and finally set it to 2.
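A short sketch of this mixing step, assuming PyTorch and a symmetric $Beta(\alpha, \alpha)$ distribution (the text writes $Beta(\alpha)$ with a single parameter), is given below.

import torch

def mix_embeddings(z1, z2, alpha=2.0):
    # lambda_mix ~ Beta(alpha, alpha) + 1, so lambda_mix lies in (1, 2)
    lam = torch.distributions.Beta(alpha, alpha).sample().item() + 1.0
    z1_mix = lam * z1 + (1.0 - lam) * z2   # Eq. (5)
    z2_mix = lam * z2 + (1.0 - lam) * z1
    return z1_mix, z2_mix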

Using Eq. (5), we obtain $z_{1mix}$ and $z_{2mix}$ and optimize their similarity with the predicted embeddings,

$$\begin{aligned}\mathcal{L}_{mix}&=\frac{1}{2} \mathcal{D}\left(p_{1}, \text{stopgrad}\left(z_{2mix}\right)\right) \\ &+\frac{1}{2} \mathcal{D}\left(p_{2}, \text{stopgrad}\left(z_{1mix}\right)\right) \end{aligned}$$
By optimizing the loss function in Eq. (7), the model’s discriminative representation is forced to predict the representation obtained after feature-level mixing. The features obtained with Eq. (5) are more challenging, implicitly introducing harder samples. Equation (7) thus acts as a regularization constraint, forcing the model to incorporate more task-relevant information, reducing the risk of over-fitting, and strengthening the robustness of the model.

The final optimization objective is:

$$\mathcal{L}_{\text{total}}= \mathcal{L} + \lambda \mathcal{L}_{\text{mix}},$$
where $\lambda \in [0,1]$ is used to constrain the effect of the mixture features on the model.

In summary, in the self-supervised learning stage the Siamese network is enabled to learn more discriminative representations by implicitly introducing harder features through feature-level feature mixing. This acts as a strong regularization constraint that allows the model to learn representations that are robust to the large intra-class variance of the pCLE dataset. The whole process of the self-supervised learning stage is given in Algorithm 1.


Algorithm 1. Feature-level MixSiam
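The following self-contained PyTorch sketch combines the pieces above into one training step of feature-level MixSiam. It reflects our reading of Algorithm 1, not the authors' code; the encoder, predictor, and optimizer are assumed to be constructed elsewhere.

import torch
import torch.nn.functional as F

def neg_cosine(p, z):
    return -F.cosine_similarity(p, z.detach(), dim=-1).mean()   # stop-gradient on z

def mixsiam_step(encoder, predictor, optimizer, x1, x2, alpha=2.0, loss_weight=0.7):
    z1, z2 = encoder(x1), encoder(x2)
    p1, p2 = predictor(z1), predictor(z2)
    # feature-level mixing with lambda_mix in (1, 2), Eq. (5)
    lam = torch.distributions.Beta(alpha, alpha).sample().item() + 1.0
    z1_mix = lam * z1 + (1.0 - lam) * z2
    z2_mix = lam * z2 + (1.0 - lam) * z1
    loss = 0.5 * neg_cosine(p1, z2) + 0.5 * neg_cosine(p2, z1)              # Eq. (4)
    loss_mix = 0.5 * neg_cosine(p1, z2_mix) + 0.5 * neg_cosine(p2, z1_mix)  # Eq. (7)
    total = loss + loss_weight * loss_mix                                   # Eq. (8)
    optimizer.zero_grad()
    total.backward()
    optimizer.step()
    return total.item()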

3.2 Few-shot learning stage

In this stage, we take the pre-trained model obtained in the self-supervised learning stage and fine-tune it in an episodic manner so that it can classify the FS_pCLE dataset, which has few samples per class. Specifically, the few-shot learning dataset $D =\{(x_1, y_1), (x_2, y_2),\ldots,(x_n, y_n)\}$ contains many classes, each consisting of multiple samples, where $n$ is the number of classes in the dataset. In the training stage, $C$ classes are randomly selected from the dataset, each with $K$ samples ($C \times K$ samples in total), to form the support set $S$ of the model. Then $m$ samples are selected from the remaining data of these $C$ classes as the query set $Q$ to construct a meta-task; that is, the model must learn to distinguish these $C$ classes from the $C \times K$ support samples. This task is called the $C$-way $K$-shot problem.
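As an illustration, a hypothetical sketch of sampling one such $C$-way $K$-shot episode is shown below; data_by_class (a mapping from class label to that class's samples) is a placeholder of ours, not part of the original pipeline.

import random

def sample_episode(data_by_class, c_way=5, k_shot=1, m_query=10):
    classes = random.sample(sorted(data_by_class), c_way)
    support, query = [], []
    for episode_label, cls in enumerate(classes):
        picked = random.sample(data_by_class[cls], k_shot + m_query)
        support += [(s, episode_label) for s in picked[:k_shot]]   # C x K support samples in total
        query += [(s, episode_label) for s in picked[k_shot:]]     # m query samples per class
    return support, query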

In this paper, we train the proposed model on the support set, mapping all samples of the same class to a mean vector in the embedding space that serves as the prototype center of that class [45]. For class $c$, the prototype center can be represented as:

$$P_{c}=\frac{1}{\left|S_{c}\right|} \sum_{\left(x_{i}, y_{i}\right) \in S_{c}}f\left(x_{i}\right)$$
where $S_c$ denotes the support samples of class $c$, and $f(\cdot)$ is the model initialized in the first stage.

The classification problem then reduces to nearest-neighbor search in the embedding space, so the distribution of a query sample $q$ in query set $Q$ over all classes can be calculated as:

$$p(y=c \mid q)=\frac{\exp \left({-}d\left(f(q), P_{c}\right)\right)}{\sum_{c^{\prime}} \exp \left({-}d\left(f(q), P_{c^{\prime}}\right)\right)}$$
where $d(\cdot)$ denotes the Euclidean distance function. As shown in Eq. (10), the distribution is a softmax over the distances between the sample $q$ and the prototype centers $P_{c}$, so the loss function in the few-shot learning stage can be written as:
$$\mathcal{L}_{\text{few-shot}}=d\left(f(q), P_{c}\right)+\log \sum_{c^{\prime}} \exp\left(-d\left(f(q), P_{c^{\prime}}\right)\right)$$
In short, the pre-trained model is first obtained via self-supervised learning and then used as the base learner of the few-shot learning pipeline to quickly learn suitable representations from the FS_pCLE dataset.
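A minimal PyTorch sketch of Eqs. (9) and (10) is given below; embed stands for the stage-one encoder $f$, and the support/query tensors are placeholders for one episode.

import torch
import torch.nn.functional as F

def prototypical_logits(embed, support_x, support_y, query_x, n_way):
    s_emb = embed(support_x)                              # [C*K, d] support embeddings
    q_emb = embed(query_x)                                # [M, d] query embeddings
    prototypes = torch.stack([s_emb[support_y == c].mean(dim=0)
                              for c in range(n_way)])     # Eq. (9): class means, [C, d]
    dists = torch.cdist(q_emb, prototypes)                # Euclidean distances, [M, C]
    return -dists                                         # softmax over -d gives Eq. (10)

# fine-tuning minimizes the cross-entropy of these logits against the query labels,
# e.g. F.cross_entropy(prototypical_logits(embed, sx, sy, qx, 5), qy)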

4. Experimental evaluation

We conduct detailed experiments on the SSL_pCLE and FS_pCLE datasets to evaluate the methods of the two stages separately and compare them with baseline methods. Finally, a detailed ablation study is performed, and the limitations of our method are discussed.

4.1 Datasets

In this paper, we use the SSL_pCLE GI disease image dataset for linear evaluation in the first stage and the FS_pCLE GI disease few-shot image dataset for few-shot classification in the second stage. The two datasets were provided by different hospitals, and their classes are disjoint.

4.1.1 SSL_pCLE dataset

The dataset was collected with the preclinical pCLE system we developed from 86 patients. The SSL_pCLE gastrointestinal disease image dataset consists of 2369 images, including intestinal epithelial hyperplasia (405), low-level endomorphism (277), high-level endomorphism (167), atrophic gastritis (568), normal sinus (627), and normal gastric body (325). The class distribution is imbalanced, with a ratio of nearly 1:4 between the smallest and largest classes. A typical example of each class is shown in Fig. 2. Following previous work [7], we randomly selected 70 patients as the training set and 16 patients as the testing set. Note that the images for training/validation come from the same set of patients, while the images for testing come from other patients, to avoid information leakage. All experiments are repeated five times to demonstrate replicability.

4.1.2 FS_pCLE dataset

The dataset was collected with the preclinical pCLE system we developed from 20 patients. The FS_pCLE image dataset has 14 classes with approximately 40 samples per class; its categories are disjoint from those of the SSL_pCLE dataset. Typical examples from part of this dataset are shown in Fig. 3.

4.2 Self-supervised learning stage

4.2.1 Implementation details

For the encoder $f$, we chose ResNet18 [8] as the backbone (with the last fully connected layer removed), followed by a projection multi-layer perceptron (MLP) head. The projection head consists of $FC-BN-ReLU-FC-BN-ReLU-FC-BN$ with input and output dimensions of 512 and 2048, respectively. For the predictor $h$, we adopted batch normalization (BN) in the hidden layer and set its dimensionality to 512; the output layer has neither BN nor ReLU.
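A PyTorch sketch of these two heads is shown below. The hidden width of the projection head (2048) and the output width of the predictor (2048, matching the projector) are our assumptions, since the text specifies only the 512-d input, the 2048-d projector output, and the 512-d predictor hidden layer.

import torch.nn as nn

feat_dim, proj_dim, pred_hidden = 512, 2048, 512   # ResNet18 features, projector output, predictor hidden

projection_head = nn.Sequential(                   # FC-BN-ReLU-FC-BN-ReLU-FC-BN
    nn.Linear(feat_dim, proj_dim), nn.BatchNorm1d(proj_dim), nn.ReLU(inplace=True),
    nn.Linear(proj_dim, proj_dim), nn.BatchNorm1d(proj_dim), nn.ReLU(inplace=True),
    nn.Linear(proj_dim, proj_dim), nn.BatchNorm1d(proj_dim),
)

predictor = nn.Sequential(                         # BN and ReLU only in the hidden layer
    nn.Linear(proj_dim, pred_hidden), nn.BatchNorm1d(pred_hidden), nn.ReLU(inplace=True),
    nn.Linear(pred_hidden, proj_dim),              # no BN or ReLU in the output layer
)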

For pre-training on the SSL_pCLE dataset, we adopted ResNet18 [8] as the backbone. Models pre-trained on natural images (such as ImageNet) are not well suited to medical images because the intensity distributions differ, so we train from scratch. We resized the images to $256 \times 256$ and used random cropping ($224 \times 224$, crop scale $[0.2, 1.0]$) and random horizontal/vertical flipping with probability 0.5. Since the two pCLE image datasets (Fig. 2 and Fig. 3) are obtained by extracting frames from pseudo-color video sequences, we also employ random color jittering (brightness = 0.4, contrast = 0.4, saturation = 0.4, hue = 0.1). Furthermore, we used Gaussian blur (probability 0.2) to mimic the out-of-focus condition of pCLE images during acquisition. We used SGD as the optimizer, set the base learning rate and weight decay to 0.02 and 5e-4, respectively, and employed a cosine decay schedule [64] to adjust the learning rate. The default batch size is 128. The hyperparameters $\alpha$ and $\lambda$ are set to 2.0 and 0.7, respectively. The model is trained for 800 epochs with a 10-epoch warm-up, without negative pairs or a momentum encoder. We used PyTorch [65] on an Nvidia RTX5000 GPU for all experiments. As shown in Fig. 6(a), the proposed method converges well.
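The augmentation pipeline and optimizer described above can be sketched with torchvision as follows; the Gaussian-blur kernel size and the SGD momentum are not stated in the text and are our assumptions.

import torch
from torchvision import transforms

pcle_augment = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.RandomResizedCrop(224, scale=(0.2, 1.0)),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomVerticalFlip(p=0.5),
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4, hue=0.1),
    transforms.RandomApply([transforms.GaussianBlur(kernel_size=23)], p=0.2),
    transforms.ToTensor(),
])

# optimizer = torch.optim.SGD(model.parameters(), lr=0.02, momentum=0.9, weight_decay=5e-4)
# scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=800)  # cosine decay over 800 epochs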

For linear evaluation, we froze the features at the global average pooling layer of ResNet18 [8] and trained a linear classifier to evaluate the performance of the proposed method, computing classification accuracy with the Precision@N metric. The learning rate and batch size of this linear classifier are 0.01 and 128, respectively; the learning rate is halved at epochs 60 and 80 for a total of 100 training epochs. SGD is used as the optimizer, with momentum 0.9 and weight decay 0. In addition, we performed experiments with two linear layers $(FC-ReLU-BN-FC)$ as the classifier, with all other settings as above.
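A minimal sketch of this linear evaluation protocol, assuming the 512-d frozen ResNet18 features and the six SSL_pCLE classes, is:

import torch
import torch.nn as nn

linear_classifier = nn.Linear(512, 6)              # frozen 512-d features -> 6 SSL_pCLE classes
optimizer = torch.optim.SGD(linear_classifier.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=0.0)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[60, 80], gamma=0.5)

# for epoch in range(100):
#     for feats, labels in feature_loader:         # features from the frozen backbone
#         loss = nn.functional.cross_entropy(linear_classifier(feats), labels)
#         optimizer.zero_grad(); loss.backward(); optimizer.step()
#     scheduler.step()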

4.2.2 Evaluation

To demonstrate the effectiveness of the proposed method, we compared it with supervised and unsupervised methods and report the results in Table 1. For a fair comparison, we retrained the officially provided code of SimSiam [17], AMDIM [66], MixSiam [63], and ResNet18 [8] on the SSL_pCLE dataset with a batch size of 128. All experiments were performed on an Nvidia RTX5000 GPU, with other settings as in the original papers. All methods are trained from scratch without ImageNet pre-trained models. Our method consistently outperforms SimSiam (the baseline) under both the standard linear evaluation (denoted as one) and the two-linear-layer evaluation (denoted as two); it improves on SimSiam by 6% and 5% for the one and two evaluations, respectively. Moreover, the proposed method is competitive with the supervised method under both evaluations. Table 1 also shows that MixSiam [63] performs worse on the SSL_pCLE dataset than other methods (such as SimSiam [17] and AMDIM [66]), indicating that MixSiam [63] fails to adapt well to pCLE images. These results suggest that, with the proposed strategy, more task-related information is introduced through regularized approximation to increase the diversity among views, improving the robustness of the model. We present only top-1 accuracy in Table 1 because all of the above methods reach 100% top-5 accuracy.


Table 1. Comparisons with the state-of-the-art methods under the linear evaluation protocol on the SSL_pCLE dataset.a

Following [29], we also used a KNN classifier to evaluate the learned representations; the proposed method shows a substantial improvement of 2% over SimSiam, indicating that the features learned by our method yield a considerably more suitable metric, as shown in Table 1. In addition, we performed statistical analysis with 95% confidence intervals, also reported in Table 1. Our method achieves competitive performance compared with the other methods, implying that it produces higher-quality representations.

Fig. 6. (a) Self-supervised pre-training loss on SSL_pCLE; (b) performance curves for different values of $\lambda$ on SSL_pCLE, where $\lambda$ controls the effect of the mixed features on the model (Eq. (8)); (c) performance curves for different values of $\alpha$ on SSL_pCLE, where $\alpha$ parameterizes $\lambda _{mix} \sim Beta(\alpha )+1$.

4.2.3 Ablation study

In Eq. (8), $\lambda$ is used to constrain the effect of the mixed features on the model. We report the performance for $\lambda$ from 0 to 1 in Fig. 6(b). The case $\lambda = 0$ corresponds to no feature-level mixing and is identical to SimSiam (learning rate: 0.02), suggesting that SimSiam suffers from over-fitting to the minimum mutual information between views. Because the pCLE dataset has large intra-class variance, additional features can be appropriately introduced through regularization to constrain the optimization of the model. We observe that performance consistently increases as $\lambda$ grows from 0 to 0.7, demonstrating that an appropriate constraint helps improve the robustness of the model and that more discriminative features are learned. We therefore chose 0.7 as the value of the hyperparameter $\lambda$.

In Eq. (5), $\lambda _{mix}$ is the coefficient used to mix the two embeddings, drawn from a beta distribution with parameter $\alpha$, $\lambda _{mix} \sim Beta(\alpha )+1$. We report the performance for $\alpha$ from 0.2 to 2.0 in Fig. 6(c). Different values of $\alpha$ lead to significant differences in performance, and the model is most accurate when $\alpha = 2.0$, so we chose 2.0 as the value of the hyperparameter $\alpha$.

4.2.4 Limitation

Figure 7 displays t-SNE [67] visualizations of the feature distributions for all classes in the SSL_pCLE test set. Compared with the visualization for SimSiam, the features learned by the proposed method form multiple clusters within each class without overlapping feature distributions between clusters, which aligns with our primary motivation (reducing large intra-class variance). However, we found that the feature distributions of different classes are not adequately separated; that is, the small inter-class variance of the pCLE dataset is not effectively resolved.

Fig. 7. Feature distributions learned on SSL_pCLE, visualized using t-SNE.

4.3 Few-shot learning stage

4.3.1 Implementation

Several works [18,20] have pointed out that few-shot learning performance depends heavily on an effective pre-trained model. In this paper, we adopt the pre-trained model from stage one (with all fully connected (FC) layers removed) for few-shot learning. The FS_pCLE dataset was randomly divided into 8 classes for training and 6 classes for testing, with an input image size of 224 $\times$ 224 and an output dimension of 512. Note that no two classes contain data collected from the same patient. We used the Adam optimizer with a learning rate of 0.0005. Each class contains 10 query points per episode during training and testing. The model is trained for 50 epochs, and the learning rate is halved every 10 epochs.
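A sketch of this fine-tuning setup, assuming a recent torchvision ResNet18 into which the stage-one weights would be loaded and whose FC layer is stripped, is:

import torch
import torchvision

embed = torchvision.models.resnet18(weights=None)   # stage-one pre-trained weights would be loaded here
embed.fc = torch.nn.Identity()                      # drop the FC layer -> 512-d embeddings
optimizer = torch.optim.Adam(embed.parameters(), lr=5e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)  # halve lr every 10 epochs

# for epoch in range(50):
#     ...  # episodic training over sampled support/query sets (Section 3.2)
#     scheduler.step()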

4.3.2 Evaluation

We conducted two common tasks (5-way 1-shot and 5-way 5-shot) on the FS_pCLE dataset to evaluate the proposed method and compare it with $\text {ProtoNet}^{+}$ [45], DN4 [47], and SSLFL [18]. For a fair comparison, we retrained the official code of $\text {ProtoNet}^{+}$ [45], DN4 [47], and SSLFL [18]. Table 2 shows that the proposed method achieves competitive performance compared with state-of-the-art methods, with improvements of 2.8% and 0.8% over $\text {ProtoNet}^{+}$ [45] and SSLFL [18], respectively, on the 5-way 1-shot task. For $\text {ProtoNet}^{+}$, ResNet18 [8] pre-trained on the SSL_pCLE dataset is used as the embedding model, with the other procedures following the proposed method. We observe a similar improvement in the 5-way 5-shot experiment, which again demonstrates the effectiveness of our approach.


Table 2. Few-shot classification accuracy on FS_pCLE dataset

4.3.3 Ablation study

We investigated the effect of cosine distance and Euclidean distance on 5-way classification accuracy. As seen in Fig. 8, Euclidean distance significantly outperforms cosine distance, indicating that Euclidean distance is a better choice for computing prototype centers, because cosine distance is not a Bregman divergence [45]. In contrast, the Euclidean distance corresponds to a Bregman divergence whose minimizer can always be found; in other words, the Euclidean objective is convex while the cosine objective is not, so the Euclidean distance is the better metric function here.
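As a brief justification (using the standard definition of a Bregman divergence), the squared Euclidean distance is the divergence generated by $F(x)=\|x\|_2^{2}$, for which the class mean minimizes the total divergence to the support points [45], whereas cosine distance has no such generator:
$$d_{F}(q, p)=F(q)-F(p)-\langle\nabla F(p),\, q-p\rangle=\|q\|_2^{2}-\|p\|_2^{2}-2\langle p,\, q-p\rangle=\|q-p\|_2^{2}$$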

Fig. 8. Effect of different distance metrics on 5-way 1-shot and 5-way 5-shot classification accuracy on the FS_pCLE dataset.

5. Conclusion

Learning the discriminative visual features intrinsic to the data is an essential part of self-supervised learning. We examined existing Siamese-based methods and found that they fail to handle the large intra-class variance of the pCLE dataset because they learn only from simple randomly augmented views. To this end, we proposed the Feature-Level MixSiam method, based on the Siamese structure, to introduce more task-relevant information; consequently, the risk of over-fitting to the minimum mutual information is reduced. Furthermore, the discriminative features of each instance are extracted from the corresponding mixed features in order to reduce intra-class variance and improve robustness. Compared with the baseline method, the proposed method is more effective. In the future, we will strive to address the limitations discussed in Section 4.2.4 and conduct additional experiments to demonstrate the method’s generalizability on downstream tasks.

Few-shot learning is highly dependent on an effective pre-trained model. In this paper, we proposed employing the model obtained through self-supervised learning for pCLE few-shot image classification. After fine-tuning in an episodic fashion, the resulting model was evaluated on the pCLE few-shot dataset and shown to be superior to state-of-the-art methods. The pCLE image classification system presented in this paper provides operators with more clinically relevant information to facilitate clinical diagnosis with pCLE.

Funding

National Natural Science Foundation of China (81971692).

Acknowledgments

The authors thank the Biopsee (Wuhan) medical technology Co., Ltd. for their pCLE dataset.

Disclosures

The authors declare that there are no conflicts of interest related to this article.

Data availability

Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request.

References

1. L. Lin, L. Yan, Y. Liu, F. Yuan, H. Li, and J. Ni, “Incidence and death in 29 cancer groups in 2017 and trend analysis from 1990 to 2017 from the global burden of disease study,” J. Hematol. Oncol. 12(1), 96 (2019). [CrossRef]  

2. M. I. Scarano, E. Rippa, F. Altieri, C. S. Di, G. M. Stadio, and P. Arcari, “Role of gastrokine 1 in gastric cancer: a potential diagnostic marker and antitumor drug,” Cutting Edge Therapies for Cancer in the 21st Century p. 253 (2014).

3. M. B. Wallace, A. Meining, M. I. Canto, P. Fockens, S. Miehlke, T. Roesch, C. J. Lightdale, H. Pohl, D. Carr-Locke, and M. Löhr, “The safety of intravenous fluorescein for confocal laser endomicroscopy in the gastrointestinal tract,” Alimentary pharmacology & therapeutics 31(5), 548–552 (2010). [CrossRef]  

4. J. Yserbyt, C. Dooms, V. Ninane, M. Decramer, and G. Verleden, “Perspectives using probe-based confocal laser endomicroscopy of the respiratory tract,” Swiss Med Wkly 143, w13764 (2013). [CrossRef]  

5. T. P. Chang, D. R. Leff, S. Shousha, D. J. Hadjiminas, R. Ramakrishnan, M. R. Hughes, G.-Z. Yang, and A. Darzi, “Imaging breast cancer morphology using probe-based confocal laser endomicroscopy: towards a real-time intraoperative imaging tool for cavity scanning,” Breast Cancer Res. Treat. 153(2), 299–310 (2015). [CrossRef]  

6. K. Wu, J.-J. Liu, W. Adams, G. A. Sonn, K. E. Mach, Y. Pan, A. H. Beck, K. C. Jensen, and J. C. Liao, “Dynamic real-time microscopy of the urinary tract using confocal laser endomicroscopy,” Urology 78(1), 225–231 (2011). [CrossRef]  

7. Y. Gu, K. Vyas, M. Shen, J. Yang, and G.-Z. Yang, “Deep graph-based multimodal feature embedding for endomicroscopy image retrieval,” IEEE Trans. Neural Netw. Learning Syst. 32(2), 481–492 (2021). [CrossRef]  

8. K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (IEEE Computer Society, 2016), pp. 770–778.

9. L.-C. Chen, G. Papandreou, I. Kokkinos, K. P. Murphy, and A. L. Yuille, “Deeplab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs,” IEEE Trans. Pattern Anal. Mach. Intell. 40(4), 834–848 (2018). [CrossRef]  

10. S. Shurrab and R. Duwairi, “Self-supervised learning methods and applications in medical imaging analysis: a survey,” arXiv, arXiv:2109.08685 (2021). [CrossRef]  

11. L. Jing and Y. Tian, “Self-supervised visual feature learning with deep neural networks: a survey,” IEEE Trans. Pattern Anal. Mach. Intell. 43(11), 4037–4058 (2021). [CrossRef]  

12. S. Gidaris, P. Singh, and N. Komodakis, “Unsupervised representation learning by predicting image rotations,” arXiv, ArXiv: 1803.07728 (2018). [CrossRef]  

13. M. Noroozi and P. Favaro, “Unsupervised learning of visual representations by solving jigsaw puzzles,” in European conference on computer vision, vol. 9910 of Lecture Notes in Computer Science (Springer, 2016), pp. 69–84.

14. T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A simple framework for contrastive learning of visual representations,” in Proceedings of the 37th International Conference on Machine Learning, vol. 119 of Proceedings of Machine Learning Research (PMLR, 2020), pp. 1597–1607.

15. K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick, “Momentum contrast for unsupervised visual representation learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (IEEE Computer Society, 2020), pp. 9726–9735.

16. J.-B. Grill, F. Strub, F. Altché, C. Tallec, P. Richemond, E. Buchatskaya, C. Doersch, B. Avila Pires, Z. Guo, M. Gheshlaghi Azar, B. Piot, K. Kavukcuoglu, R. Munos, and M. Valko, “Bootstrap your own latent - a new approach to self-supervised learning,” in Advances in Neural Information Processing Systems, vol. 33 (Curran Associates, Inc., 2020), pp. 21271–21284.

17. X. Chen and K. He, “Exploring simple siamese representation learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (2021), pp. 15745–15753.

18. D. Chen, Y. Chen, Y. Li, F. Mao, Y. He, and H. Xue, “Self-supervised learning for few-shot image classification,” in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), (IEEE, 2021), pp. 1745–1749.

19. X. Jiang, M. Havaei, F. Varno, G. Chartrand, N. Chapados, and S. Matwin, “Learning to learn with conditional class dependencies,” in 7th International Conference on Learning Representations, (OpenReview.net, 2019).

20. S. Qiao, C. Liu, W. Shen, and A. L. Yuille, “Few-shot image recognition by predicting parameters from activations,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (IEEE Computer Society, 2018), pp. 7229–7238.

21. R. Zhu, B. Zhao, J. Liu, Z. Sun, and C. W. Chen, “Improving contrastive learning by visualizing feature transformation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, (IEEE Computer Society, 2021), pp. 10306–10315.

22. C. Doersch and A. Zisserman, “Multi-task self-supervised visual learning,” in Proceedings of the IEEE International Conference on Computer Vision, (IEEE Computer Society, 2017), pp. 2070–2079.

23. C. Doersch, A. Gupta, and A. A. Efros, “Unsupervised visual representation learning by context prediction,” in Proceedings of the IEEE International Conference on Computer Vision, (IEEE Computer Society, 2015), pp. 1422–1430.

24. D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros, “Context encoders: feature learning by inpainting,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (IEEE Computer Society, 2016), pp. 2536–2544.

25. G. Larsson, M. Maire, and G. Shakhnarovich, “Learning representations for automatic colorization,” in European Conference on Computer Vision, (Springer, 2016), pp. 577–593.

26. R. Zhang, P. Isola, and A. A. Efros, “Colorful image colorization,” in European Conference on Computer Vision, (Springer, 2016), pp. 649–666.

27. A. Dosovitskiy, J. T. Springenberg, M. Riedmiller, and T. Brox, “Discriminative unsupervised feature learning with convolutional neural networks,” in Proceedings of the 27th International Conference on Neural Information Processing Systems, vol. 27 (MIT Press, 2014), pp. 766–774.

28. T. Chen, S. Kornblith, K. Swersky, M. Norouzi, and G. Hinton, “Big self-supervised models are strong semi-supervised learners,” arXiv, arXiv:2006.10029 (2020). [CrossRef]  

29. Z. Wu, Y. Xiong, S. X. Yu, and D. Lin, “Unsupervised feature learning via non-parametric instance discrimination,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (IEEE Computer Society, 2018), pp. 3733–3742.

30. J. Zbontar, L. Jing, I. Misra, Y. LeCun, and S. Deny, “Barlow twins: self-supervised learning via redundancy reduction,” in Proceedings of the 38th International Conference on Machine Learning, vol. 139 of Proceedings of Machine Learning Research (PMLR, 2021), pp. 12310–12320.

31. A. Ermolov, A. Siarohin, E. Sangineto, and N. Sebe, “Whitening for self-supervised representation learning,” in International Conference on Machine Learning, (PMLR, 2021), pp. 3015–3024.

32. J. Zhu, Y. Li, Y. Hu, K. Ma, S. K. Zhou, and Y. Zheng, “Rubik’s cube+: a self-supervised feature learning framework for 3d medical image analysis,” Med. Image Anal. 64, 101746 (2020). [CrossRef]  

33. X. Zhuang, Y. Li, Y. Hu, K. Ma, Y. Yang, and Y. Zheng, “Self-supervised feature learning for 3d medical images by playing a rubik’s cube,” in International Conference on Medical Image Computing and Computer-Assisted Intervention, (Springer, 2019), pp. 420–428.

34. L. Chen, P. Bentley, K. Mori, K. Misawa, M. Fujiwara, and D. Rueckert, “Self-supervised learning for medical image analysis using image context restoration,” Med. Image Anal. 58, 101539 (2019). [CrossRef]  

35. Z. Zhou, V. Sodha, J. Pang, M. B. Gotway, and J. Liang, “Models genesis,” Med. Image Anal. 67, 101840 (2021). [CrossRef]  

36. X. Xie, J. Chen, Y. Li, L. Shen, K. Ma, and Y. Zheng, “Instance-aware self-supervised learning for nuclei segmentation,” in Medical Image Computing and Computer Assisted Intervention, vol. 12265 of Lecture Notes in Computer Science (Springer, 2020), pp. 341–350.

37. F. Haghighi, M. R. H. Taher, Z. Zhou, M. B. Gotway, and J. Liang, “Learning semantics-enriched representation via self-discovery, self-classification, and self-restoration,” in Medical Image Computing and Computer Assisted Intervention, vol. 12261 of Lecture Notes in Computer Science (Springer, 2020), pp. 137–147.

38. A. Taleb, W. Loetzsch, N. Danz, J. Severin, T. Gaertner, B. Bergner, and C. Lippert, “3d self-supervised methods for medical imaging,” in Proceedings of the 34th International Conference on Neural Information Processing Systems, vol. 33 (Curran Associates, Inc., 2020), pp. 18158–18172.

39. H. Sowrirajan, J. Yang, A. Y. Ng, and P. Rajpurkar, “Moco-cxr: moco pretraining improves representation and transferability of chest x-ray models,” in ACM Conference on Health, Inference, and Learning (ACM-CHIL) Workshop 2021, (2021).

40. S. Chakraborty, A. R. Gosthipaty, and S. Paul, “G-simclr: self-supervised contrastive learning with guided projection via pseudo labelling,” in 2020 International Conference on Data Mining Workshops (ICDMW), (IEEE Computer Society, 2020), pp. 912–916.

41. H. Zhou, C. Lu, S. Yang, X. Han, and Y. Yu, “Preservational learning improves self-supervised medical image models by reconstructing diverse contexts,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, (IEEE Computer Society, 2021), pp. 3479–3489.

42. W. Chen, Y. Liu, Z. Kira, Y. F. Wang, and J. Huang, “A closer look at few-shot classification,” in 7th International Conference on Learning Representations, (OpenReview.net, 2019).

43. G. Koch, R. Zemel, and R. Salakhutdinov, “Siamese neural networks for one-shot image recognition,” in Proceedings of the 38th International Conference on Machine Learning Workshop, vol. 2 (Lille, 2015).

44. O. Vinyals, C. Blundell, T. Lillicrap, K. Kavukcuoglu, and D. Wierstra, “Matching networks for one shot learning,” in Proceedings of the 30th International Conference on Neural Information Processing Systems, (Curran Associates Inc., 2016), pp. 3637–3645.

45. J. Snell, K. Swersky, and R. Zemel, “Prototypical networks for few-shot learning,” in Proceedings of the 31st International Conference on Neural Information Processing Systems, (Curran Associates Inc., 2017), pp. 4080–4090.

46. F. Sung, Y. Yang, L. Zhang, T. Xiang, P. H Torr, and T. M. Hospedales, “Learning to compare: relation network for few-shot learning,” in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, (2018), pp. 1199–1208.

47. W. Li, L. Wang, J. Xu, J. Huo, Y. Gao, and J. Luo, “Revisiting local descriptor based image-to-class measure for few-shot learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (2019), pp. 7260–7268.

48. M. Andrychowicz, M. Denil, S. G. Colmenarejo, M. W. Hoffman, D. Pfau, T. Schaul, B. Shillingford, and N. de Freitas, “Learning to learn by gradient descent by gradient descent,” in Proceedings of the 30th International Conference on Neural Information Processing Systems, (Curran Associates Inc., 2016), pp. 3988–3996.

49. C. Finn, P. Abbeel, and S. Levine, “Model-agnostic meta-learning for fast adaptation of deep networks,” in International Conference on Machine Learning, (PMLR, 2017), pp. 1126–1135.

50. A. A. Rusu, D. Rao, J. Sygnowski, O. Vinyals, R. Pascanu, S. Osindero, and R. Hadsell, “Meta-learning with latent embedding optimization,” in 7th International Conference on Learning Representations, (OpenReview.net, 2019).

51. A. Nichol and J. Schulman, “Reptile: a scalable metalearning algorithm,” arXiv, arXiv:1803.02999 (2018).

52. Y. An, H. Xue, X. Zhao, and L. Zhang, “Conditional self-supervised learning for few-shot classification,” in Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, (International Joint Conferences on Artificial Intelligence Organization, Montreal, Canada, 2021), pp. 2140–2146.

53. B. Andre, T. Vercauteren, A. Perchant, A. M. Buchner, M. B. Wallace, and N. Ayache, “Endomicroscopic image retrieval and classification using invariant visual features,” in 2009 IEEE International Symposium on Biomedical Imaging: From Nano to Macro, (IEEE, 2009), pp. 346–349.

54. B. Andre, T. Vercauteren, A. M. Buchner, M. B. Wallace, and N. Ayache, “Endomicroscopic video retrieval using mosaicing and visualwords,” in 2010 IEEE International Symposium on Biomedical Imaging: From Nano to Macro, (IEEE, 2010), pp. 1419–1422.

55. Y. Gu, K. Vyas, J. Yang, and G.-Z. Yang, “Weakly supervised representation learning for endomicroscopy image analysis,” in International Conference on Medical Image Computing and Computer-Assisted Intervention, (Springer, 2018), pp. 326–334.

56. B. Andre, T. Vercauteren, A. M. Buchner, M. B. Wallace, and N. Ayache, “Retrieval evaluation and distance learning from perceived similarity between endomicroscopy videos,” in International Conference on Medical Image Computing and Computer-Assisted Intervention, (Springer, 2011), pp. 297–304.

57. B. Andre, T. Vercauteren, A. M. Buchner, M. B. Wallace, and N. Ayache, “Learning semantic and visual similarity for endomicroscopy video retrieval,” IEEE Trans. Med. Imaging 31(6), 1276–1288 (2012). [CrossRef]  

58. M. K. Tafresh, N. Linard, B. André, N. Ayache, and T. Vercauteren, “Semi-automated query construction for content-based endomicroscopy video retrieval,” in International Conference on Medical Image Computing and Computer-Assisted Intervention, (Springer, 2014), pp. 89–96.

59. Y. Gu, K. Vyas, J. Yang, and G.-Z. Yang, “Unsupervised feature learning for endomicroscopy image retrieval,” in International Conference on Medical Image Computing and Computer-Assisted Intervention, (Springer, 2017), pp. 64–71.

60. Y. Gu, J. Yang, and G.-Z. Yang, “Multi-view multi-modal feature embedding for endomicroscopy mosaic classification,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, (2016), pp. 11–19.

61. Y. Gu, K. Vyas, J. Yang, and G.-Z. Yang, “Transfer recurrent feature learning for endomicroscopy image recognition,” IEEE Trans. Med. Imaging 38(3), 791–801 (2019). [CrossRef]  

62. H. Zhang, M. Cissé, Y. N. Dauphin, and D. Lopez-Paz, “Mixup: beyond empirical risk minimization,” in 6th International Conference on Learning Representations, (OpenReview.net, 2018).

63. X. Guo, T. Zhao, Y. Lin, and B. Du, “Mixsiam: a mixture-based approach to self-supervised representation learning,” arXiv, arXiv:2111.02679 (2021). [CrossRef]  

64. P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He, “Accurate, large minibatch SGD: training imagenet in 1 hour,” arXiv, arXiv:1706.02677 (2017). [CrossRef]  

65. A. Paszke, S. Gross, F. Massa, et al., “Pytorch: an imperative style, high-performance deep learning library,” Adv. Neural Inf. Process. Syst. 32, 8026–8037 (2019).

66. P. Bachman, R. D. Hjelm, and W. Buchwalter, “Learning representations by maximizing mutual information across views,” in Proceedings of the 33rd International Conference on Neural Information Processing Systems, (Curran Associates Inc., 2019), pp. 15509–15519.

67. L. van der Maaten and G. Hinton, “Visualizing data using t-sne,” J. Mach. Learning Res. 9, 2579–2605 (2008).
