NeuroSeg-III: efficient neuron segmentation in two-photon Ca2+ imaging data using self-supervised learning

Open Access

Abstract

Two-photon Ca2+ imaging technology plays an increasingly essential role in neuroscience research. However, the requirement for extensive professional annotation poses a significant challenge to improving the performance of neuron segmentation models. Here, we present NeuroSeg-III, an innovative self-supervised learning approach specifically designed to achieve fast and precise segmentation of neurons in imaging data. This approach consists of two modules: a self-supervised pre-training network and a segmentation network. After pre-training the encoder of the segmentation network via self-supervised learning without any annotated data, we only need to fine-tune the segmentation network with a small amount of annotated data. The segmentation network is built on YOLOv8s together with FasterNet, an efficient multi-scale attention mechanism (EMA), and a bi-directional feature pyramid network (BiFPN), which enhance segmentation accuracy while reducing computational cost and parameter count. The generalization of our approach was validated across different Ca2+ indicators and imaging scales. Significantly, the proposed neuron segmentation approach exhibits exceptional speed and accuracy, surpassing the current state-of-the-art benchmarks when evaluated on a publicly available dataset. The results underscore the effectiveness of NeuroSeg-III, which employs an efficient training strategy tailored to two-photon Ca2+ imaging data and delivers remarkable precision in neuron segmentation.

© 2024 Optica Publishing Group under the terms of the Optica Open Access Publishing Agreement

1. Introduction

The utilization of two-photon microscopy and Ca2+ indicators for imaging neuronal activity is crucial in contemporary neuroscience research [1–6]. Advances in imaging techniques allow high-speed, large field-of-view recording in vivo, which has resulted in an influx of experimental data requiring efficient processing [7,8]. A comprehensive pipeline for handling two-photon imaging data has been established, encompassing video denoising, registration, neuron segmentation, and extraction of Ca2+ signals from the monitored neurons [9,10]. Precise neuron segmentation is a necessary prerequisite for analysis, yet neuroscientists normally undertake the laborious task of annotating these data manually. Although manual segmentation remains the gold standard for ground truth (GT) marking to determine whether segmented areas are neurons, it is inefficient and labor-intensive. To address this issue, numerous automated segmentation methods have been developed that can perform segmentation in real time with promising accuracy.

As deep learning continues to gain traction in the field of neuroscience, neuron segmentation methods can broadly be classified into unsupervised and supervised approaches, depending on their reliance on manual labels. Among unsupervised learning methods, some obtain the positions of neurons by signal filtering in each frame of the video. These methods employ pixel-value thresholding, convolution with specific filters, graph-based clustering, or combinations thereof to emphasize the structural features of neuron boundaries, which facilitates neuron segmentation [11–15]. Another branch of unsupervised learning relies on matrix factorization. In previous studies, principal component/independent component analysis (PCA/ICA) [16], non-negative matrix factorization (NMF) [17], constrained non-negative matrix factorization (CNMF) [18], Suite2p [19], and online analysis of streaming Ca2+ imaging data (OnACID) [20] have been developed for neuron segmentation. Although the unsupervised learning methods mentioned above do not require manual labeling, further improvements are needed in terms of segmentation accuracy and speed.

For the category of supervised learning methods, robust classifiers are trained for neuron segmentation using manually annotated masks. Typically, these methods require a substantial quantity of manual annotations for training, extract features from the training data, i.e., labelled image (2D) data or video (3D) data, and generalize to new data with similar characteristics. Convolutional neural networks (CNNs) and the U-Net architecture are widely employed in supervised learning. Specifically, CNNs are typically used for pixel classification [21], while the U-Net architecture is commonly applied to image segmentation tasks (e.g., UNet2Ds and Shallow U-Net Neuron Segmentation (SUNS)) [22–24]. Additionally, our previous work utilized a network based on Mask R-CNN to extract neuronal features from 2D two-photon Ca2+ imaging data [25,26]. The 2D projected-image approach offers the benefit of processing speed but suffers from information loss, because the projected images may not effectively capture neurons with slow firing rates or with fluorescence emissions weaker than the baseline fluorescence or background. In contrast to approaches focused on 2D image processing, spatiotemporal methods that process 3D video offer the potential for improved accuracy in detecting sparsely firing and overlapping neurons, albeit at the cost of increased computational complexity. One notable example is STNeuroNet, an end-to-end model that utilizes a 3D CNN to segment active neurons [27]. Another approach, known as CaImAn, combines both unsupervised and supervised learning paradigms: it employs unsupervised learning to identify active components and then utilizes supervised learning to refine these components [28].

To leverage the high segmentation accuracy of supervised learning methods while minimizing the need for extensive manual labeling, we propose NeuroSeg-III, a novel self-supervised learning approach specifically designed to enhance neuron segmentation. This approach combines a self-supervised pre-training model with an improved segmentation network, leveraging the transformation invariance and covariance contrast (TiCo) learning method to reduce reliance on annotated data [29]. Our segmentation network incorporates YOLOv8s, FasterNet [30], the EMA attention mechanism [31], and BiFPN [32] to boost accuracy while reducing computational load. We enrich the input data with spatiotemporal information by fusing maximum-projection and correlation-map images [19,26]. NeuroSeg-III stands out for its generalizability across various Ca2+ indicators and imaging scales, outperforming existing methods in segmentation speed and accuracy.

2. Materials and methods

2.1 Dataset

2.1.1 Dataset from our laboratory

In this study, we conducted two-photon Ca2+ imaging experiments using C57BL/6J mice provided by the Laboratory Animal Center at the Third Military Medical University. The experimental procedures were conducted in accordance with the Third Military Medical University Animal Care and Use Committee’s approved protocols.

During the experiments, we first exposed the auditory cortex region [33,34], followed by injection of a Ca2+ indicator (GCaMP6f, Cal-520 AM, or OGB-1 AM) into the same area. Following a 2-hour incubation period, Ca2+ imaging was performed and imaging data were recorded with a custom-built two-photon microscope system (LotosScan, Suzhou Institute of Biomedical Engineering and Technology, Suzhou, China) [35,36].

In total, 212 imaging videos (OGB-1: 61 samples; Cal-520: 132 samples; GCaMP6f: 19 samples) were generated in our laboratory. Three skilled annotators individually labelled each neuron, and the resulting labels were compared to generate a final consensus, which served as the ground truth; each imaging plane contained 30–240 neurons.

2.1.2 Dataset from Allen Brain Observatory (ABO)

Two groups of ABO datasets were utilized in this work. The first group was used for mixed training with the dataset from our lab to demonstrate the segmentation model's generalization capability, while the second group was employed for comparison with other segmentation methods. The ABO-mixed dataset comprises 132 images extracted from 71 two-photon videos covering various brain regions and layers. Specifically, 66 images were captured at an imaging depth of 175 µm and another 66 at an imaging depth of 275 µm. For the data acquired at 175 µm, there are 17 images from primary visual cortex (VISp), 12 from posteromedial visual cortex (VISpm), 12 from lateral visual cortex (VISl), 6 from rostrolateral visual cortex (VISrl), 6 from anteromedial visual cortex (VISam), and 13 from anterolateral visual cortex (VISal). For the images at 275 µm, there are 24 from VISp, 9 from VISpm, 15 from VISl, and 18 from VISal. Three skilled annotators individually labelled each neuron, and the resulting labels were compared to generate a final consensus as GT.

The second group included 10 videos acquired at an imaging depth of 275 µm and 10 videos acquired at an imaging depth of 175 µm using two-photon microscopy (VISp, Experiment IDs: 501271265, 501484643, 501574836, 501704220, 501729039, 501836392, 502115959, 502205092, 502608215, 503109347, 504637623, 510214538, 510514474, 510517131, 524691284, 527048992, 531006860, 539670003, 540684467, 545446482). Each frame of the 20 ABO videos was cropped from 512 × 512 pixels to 487 × 487 pixels to remove the black boundary regions. The ground truth for this dataset was carefully proofread from the work of STNeuroNet [27]. To assess the efficacy of neuron segmentation methods, this dataset was used in a two-round generalization cross-validation that considered the different recording depths: in one round, the 10 ABO 175 µm videos were used for training and the 10 ABO 275 µm videos for validation, and in the other round the roles were reversed.

All mice in both groups of the ABO dataset expressed the GCaMP6f indicator, and each video contains approximately 100–400 neurons.

2.2 Data preprocessing

In our previous work, we employed an image fusion strategy [26]. However, considering the imaging characteristics of the ABO dataset, we opted to fuse the maximum projection, rather than the average projection, with a correlation map [37–40]. The final image was produced by linearly weighting the maximum projection and the correlation map in a ratio of 1:1, followed by normalization. Here, we created the correlation map by calculating the weighted multi-dimensional correlation of each pixel with its neighboring pixels, as follows:

$$c_w(f_1, f_2, \cdots) = \frac{\left\| \sum_i g_i f_i \right\|^2}{\sum_i g_i \left\| f_i \right\|^2}$$
where $g_i$ denotes the Gaussian kernel used for filtering, $f_i$ refers to the traces of the neighboring pixels, and $c_w$ denotes the correlation across dimensions [19]. This process can be interpreted as performing a Gaussian filtering operation on $f_i$ or $\|f_i\|^2$. High values in the correlation map indicate potential neuron locations. These fused images were used as image data for training and validation with the proposed method. Considering the training requirements and computational cost, we performed a maximum projection of the raw two-photon videos every 20 frames, equivalently down-sampling the original data to 1/20th of its initial frame rate. This maximum-projection dataset was used as the training set for self-supervised learning.
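As an illustration of this preprocessing, a minimal NumPy/SciPy sketch of the correlation map, the 1:1 fusion with the maximum projection, and the block-wise maximum projection might look as follows; the Gaussian width, the per-pixel mean subtraction, and the min-max normalization are assumptions rather than values stated in the paper.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def correlation_map(video, sigma=2.0):
    """Weighted correlation of each pixel with its neighbours (equation above).
    video: (T, H, W) float array; sigma is an assumed Gaussian kernel width."""
    f = video - video.mean(axis=0, keepdims=True)                            # zero-mean trace per pixel
    num = (gaussian_filter(f, sigma=(0, sigma, sigma)) ** 2).sum(axis=0)     # ||sum_i g_i f_i||^2
    den = gaussian_filter((f ** 2).sum(axis=0), sigma=sigma)                 # sum_i g_i ||f_i||^2
    return num / (den + 1e-12)

def _norm01(x):
    return (x - x.min()) / (x.max() - x.min() + 1e-12)

def fused_input(video, sigma=2.0):
    """Fuse the maximum projection and the correlation map with 1:1 weights, then normalize."""
    fused = 0.5 * _norm01(video.max(axis=0)) + 0.5 * _norm01(correlation_map(video, sigma))
    return _norm01(fused)

def block_max_projections(video, block=20):
    """Maximum projection over every `block` frames (the 1/20 down-sampling used for pre-training)."""
    t = (video.shape[0] // block) * block
    return video[:t].reshape(-1, block, *video.shape[1:]).max(axis=1)
```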

2.3 Framework of NeuroSeg-III

NeuroSeg-III consists of two major parts: the first is a self-supervised learning network (TiCo) that acquires intricate feature representations from unlabeled data samples [29]; the second is a segmentation network improved from YOLOv8s and fine-tuned to perform neuron segmentation [30–32]. Figure 1 shows the framework of the proposed approach.


Fig. 1. The framework of NeuroSeg-III. (A) Schematic of the self-supervised learning process based on TiCo. The input of TiCo is a series of maximum projection images generated from each video batch (n = 20 frames). // indicates a stop-gradient to halt backpropagation, • symbolizes a bifurcation, ⊕ represents an addition, △ denotes multiplication by a scalar value, and Δt signifies a temporal delay of one unit. Contractions legend: AAP: adaptive average pooling, Proj: projector, FC: fully connected layer. (B) Schematic of the proposed neuron segmentation method utilizing YOLOv8s architecture. The input of the backbone module is derived from the fusion of maximum projection and correlation map, which complemented the spatiotemporal information. We transfer the pre-trained weights from TiCo to initialize the weights of the backbone module in the training process.


2.3.1 Self-supervised learning

The self-supervised learning network is presented in Fig. 1(A). This technique simultaneously optimizes the objectives of transformation invariance and covariance contrast, thereby efficiently regularizing the covariance matrix of the embeddings. It serves a dual purpose, functioning both as a contrastive learning method and as a redundancy reduction method. After stochastic data augmentation, two encoders $f_\theta$ and $f_\xi$ with identical architecture (the backbone module of the segmentation network) and two projectors $g_\theta$ and $g_\xi$, whose parameters ($\theta$ and $\xi$) and weights are coupled through a momentum update rather than directly shared, generate feature representations $z_i'$ and $z_i''$, where $z_i'' = g_\xi(f_\xi(x_i''))$, $z_i' = g_\theta(f_\theta(x_i'))$, and $z_i', z_i'' \in \mathbb{R}^d$. The projection network operates on the feature maps produced by the backbone module of the segmentation network (Fig. 1(B) and Fig. 2). It incorporates adaptive average pooling, ReLU activation, batch normalization (BN), and fully connected (FC) layers. The final feature representations are produced by an additional FC layer.
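For concreteness, a PyTorch sketch of a projector head with the listed components (adaptive average pooling, FC, BN, ReLU, and a final FC) is shown below; the 512-channel backbone output and the layer widths are assumptions, since the paper does not state them here.

```python
import torch.nn as nn

# Hedged sketch of the projector head described above; widths are assumed values.
projector = nn.Sequential(
    nn.AdaptiveAvgPool2d(1),          # AAP over the backbone feature map
    nn.Flatten(),
    nn.Linear(512, 1024),             # FC (assumed 512-channel backbone output)
    nn.BatchNorm1d(1024),             # BN
    nn.ReLU(inplace=True),
    nn.Linear(1024, 256),             # final FC -> feature representation z
)
```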


Fig. 2. Detailed network architecture of NeuroSeg-III. (A) Overview of the improved YOLOv8s architecture. The block with a light lavender-gray background shows the backbone module, and the block with a pale orange background shows the neck module. In the backbone module, we modified the C2f block by incorporating the FasterNet block and the EMA attention mechanism. For the neck module, a BiFPN is adopted to allow efficient multi-scale feature fusion, and the C2f block proposed in YOLOv8s is retained. (B) The structures of the Faster-EMA, Bottleneck, CBS and SPPF blocks used in the segmentation network; ⊕ denotes tensor concatenation. The Faster-EMA and Bottleneck blocks take input from the split block, and their outputs are concatenated as the input for the CBS block. (C) The structures of the C2f-Faster-EMA block used in the backbone module and the C2f block used in the neck module. These blocks receive information from feature fusion as input and pass it to the segment head. (D) The structure of the segmentation head and the projector in TiCo. (E) Neuron segmentation result by NeuroSeg-III (ABO Experiment ID: 501704220). CBS: convolution, batch normalization, and SiLU activation function; SPPF: spatial pyramid pooling-fast; Up: upsampling; Conv: convolution; PConv: partial convolution; BN: batch normalization; MP: max pooling; FC: fully connected layer.


In the setup of TiCo, we used a momentum encoder technique, i.e., only the parameter $\theta$ is updated through backpropagation, while the parameter $\xi$ is updated as the exponential moving average of $\theta$:

$$\xi_t = \alpha \xi_{t-1} + (1 - \alpha)\theta_t$$
where $\alpha \in [0,1]$ is a hyperparameter and $t$ is the time step. During training, the covariance matrix $C_t$ is updated at time step $t$:
$$C_t = \begin{cases} 0, & t = 0 \\ \beta C_{t-1} + (1 - \beta)\dfrac{1}{n}\displaystyle\sum_{i = 1}^{n} z_i' z_i'^{\,T}, & \textrm{otherwise} \end{cases}$$
where $\beta \in [0,1]$ is a hyperparameter. Hence, the loss function is designed to jointly optimize two objectives:
$$\ell_{\textrm{TiCo}}(z_1', \ldots, z_n', z_1'', \ldots, z_n'') = \frac{1}{2n}\sum_{i = 1}^{n} \| z_i' - z_i'' \|^2 + \frac{\rho}{n}\sum_{i = 1}^{n} z_i'^{\,T} C_t z_i'$$

The first term aims to minimize the difference between embeddings of various data augmentations of the same image. The second term aims to constrain each vector towards the subspace associated with smaller eigenvalues of the covariance matrix.
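The three equations above translate directly into a few lines of PyTorch. The sketch below assumes L2-normalized embeddings and uses assumed default values for $\alpha$, $\beta$, and $\rho$; it is an illustration of the objective, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def tico_step(z_online, z_target, C_prev, beta=0.9, rho=8.0):
    """One evaluation of the TiCo objective following the equations above (a sketch;
    beta and rho are assumed values). z_online, z_target: (n, d) embeddings from the
    theta branch and the momentum (xi) branch."""
    z1 = F.normalize(z_online, dim=1)
    z2 = F.normalize(z_target, dim=1)
    n = z1.shape[0]
    C_t = beta * C_prev + (1.0 - beta) * (z1.T @ z1) / n                 # covariance update C_t
    invariance = 0.5 * (z1 - z2).pow(2).sum(dim=1).mean()                # (1/2n) sum ||z' - z''||^2
    covariance = rho * torch.einsum('nd,de,ne->n', z1, C_t, z1).mean()   # (rho/n) sum z'^T C_t z'
    return invariance + covariance, C_t.detach()

@torch.no_grad()
def momentum_update(online_encoder, target_encoder, alpha=0.99):
    """xi_t = alpha * xi_{t-1} + (1 - alpha) * theta_t for the momentum branch."""
    for p_t, p_o in zip(target_encoder.parameters(), online_encoder.parameters()):
        p_t.mul_(alpha).add_(p_o, alpha=1.0 - alpha)
```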

2.3.2 Segmentation network

The rapid development of two-photon Ca2+ imaging technology and Ca2+ indicators poses challenges to the accuracy and speed of neuron segmentation [5,6]. The YOLO (You Only Look Once) family of models offers several advantages, including compact model size, rapid processing, and high accuracy [41], and has been extensively applied in the domain of object detection [42–44]. In this study, we adopted the YOLOv8s model, considering the balance between hardware capabilities and segmentation accuracy. The original YOLOv8s backbone architecture consists of Conv, C2f, and SPPF modules, wherein the C2f module plays a vital role in learning residual features.

Within the backbone, we replaced the original C2f module with the enhanced C2f-Faster-EMA module. This hybrid module combines the functionality of the C2f module with FasterNet and integrates the EMA attention mechanism (Fig. 2(A)). We did not replace the C2f module in the neck module; instead, we enhanced the neck by replacing the original PANet structure with a BiFPN (Fig. 2(A-C)), which enables effective cross-scale connections and weighted feature fusion. No modifications were made to the segment head module (Fig. 2(D)). These improvements to the segmentation network simultaneously reduce the parameter count and enhance segmentation performance through the incorporation of an attention mechanism. The BiFPN in the neck module of YOLOv8s deepens information mining and further improves the model's multi-scale neuronal feature extraction capability while reducing the model's parameters. Although integrating the attention mechanism slightly increases the computational complexity of the model, it proved highly effective during image feature extraction. Each Faster-EMA block consists of a PConv layer [30], followed by two Conv layers and the EMA (Fig. 2(B)). As input to the network, we used the fused image obtained from the maximum projection and correlation map; a segmentation result of the network is illustrated in Fig. 2(E).
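To make the block structure more concrete, the following PyTorch sketch shows a partial convolution (PConv) in the spirit of FasterNet and a simplified Faster-EMA-style block. The expansion ratio, normalization, residual connection, and the placeholder attention module are assumptions rather than the exact design of Fig. 2(B).

```python
import torch
import torch.nn as nn

class PConv(nn.Module):
    """Partial convolution in the spirit of FasterNet [30]: a 3x3 conv is applied to the
    first 1/div of the channels, and the remaining channels pass through untouched."""
    def __init__(self, channels, div=4):
        super().__init__()
        self.c_conv = channels // div
        self.conv = nn.Conv2d(self.c_conv, self.c_conv, 3, padding=1, bias=False)

    def forward(self, x):
        x1, x2 = x[:, :self.c_conv], x[:, self.c_conv:]
        return torch.cat((self.conv(x1), x2), dim=1)

class FasterEMABlock(nn.Module):
    """Simplified Faster-EMA-style block: PConv followed by two pointwise convs and an
    attention module (Identity placeholder where the EMA module [31] would plug in)."""
    def __init__(self, channels, expansion=2, attention=None):
        super().__init__()
        hidden = channels * expansion
        self.pconv = PConv(channels)
        self.pw = nn.Sequential(
            nn.Conv2d(channels, hidden, 1, bias=False),
            nn.BatchNorm2d(hidden),
            nn.SiLU(),
            nn.Conv2d(hidden, channels, 1, bias=False),
        )
        self.attn = attention if attention is not None else nn.Identity()

    def forward(self, x):
        return self.attn(x + self.pw(self.pconv(x)))   # residual connection assumed
```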

We visualized how the attention module (Fig. 3(A)) highlights specific features to demonstrate the necessity of the EMA module. Figure 3(B) shows that, for imaging data with different Ca2+ indicators and varying numbers of neurons (n = 7, 19, 34, 70), the inclusion of the EMA mechanism led to a significant increase in the deep-red area on the map compared to the model without the attention mechanism, which was not able to focus on neurons. This indicates that the network, when integrated with EMA, effectively learned to use information from neuron regions and to consolidate features from these regions.


Fig. 3. Attention mechanism used in the backbone module. (A) Diagram of EMA. The terms ‘X Avg Pool’ and ‘Y Avg Pool’ denote 1D horizontal and vertical global pooling, respectively. ⊕ represents an addition module. (B) Visualization of attention mechanisms across different imaging scales and Ca2+ indicators. Note that the EMA module identifies more neurons with greater detail.


2.4 Data augmentation

We employed data augmentation strategies before feeding the data into TiCo and during the training of the segmentation network. For pre-training, we used the same augmentation strategy as in TiCo. Each maximum-projection input image is transformed twice to produce the two distorted views shown in Fig. 1(A). The augmentation pipeline consists of random cropping (probability 1.0), resizing to 224 × 224 (probability 1.0), random horizontal flipping (probability 0.5), Gaussian blurring (probability 0.5), and solarization (probability 1.0). During the training of the improved YOLOv8s model, the image augmentation pipeline consists of the following transformations: mosaic data augmentation, random blurring, median blurring, random changes in brightness and contrast, contrast-limited adaptive histogram equalization, and image compression. The first transformation (mosaic data augmentation) is always applied, while the others are applied randomly, each with a probability of 0.05.
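A torchvision sketch of the two-view self-supervised augmentation described above is given below; the blur kernel size, blur sigma range, and solarization threshold are assumed values, not parameters stated in the paper.

```python
import torchvision.transforms as T

# Sketch of the two-view augmentation applied before TiCo pre-training.
ssl_augment = T.Compose([
    T.RandomResizedCrop(224),                                                   # random crop + resize to 224x224
    T.RandomHorizontalFlip(p=0.5),                                              # horizontal flip, p = 0.5
    T.RandomApply([T.GaussianBlur(kernel_size=23, sigma=(0.1, 2.0))], p=0.5),   # Gaussian blur, p = 0.5
    T.RandomSolarize(threshold=128, p=1.0),                                     # solarization, p = 1.0
    T.ToTensor(),
])

# Two distorted views of the same maximum-projection image:
# view_a, view_b = ssl_augment(img), ssl_augment(img)
```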

2.5 Model training strategies

2.5.1 Model training with TiCo

To train our proposed model, we used the SGD optimizer with a fixed learning rate of 0.2 × batch size / 256. Due to GPU memory limitations, we selected a batch size of 128. Pre-training was conducted on the maximum-projection dataset extracted from the 20 ABO videos (10 videos at 175 µm and 10 videos at 275 µm) used for comparing the neuron segmentation methods. This dataset was sampled at a rate of 1/20 and did not use any annotations. In total, we trained for 1000 epochs and selected the model with the lowest loss as the pre-trained model to be transferred to the segmentation network.
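A minimal sketch of this optimizer setup is shown below; the tiny CNN is only a stand-in for the YOLOv8s backbone plus projector, and the momentum and weight-decay values are assumptions.

```python
import torch
import torch.nn as nn

# Stand-in encoder (the real model is the YOLOv8s backbone + projector).
encoder = nn.Sequential(
    nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 128),
)
batch_size = 128
lr = 0.2 * batch_size / 256        # fixed learning rate: 0.2 x batch size / 256 = 0.1
optimizer = torch.optim.SGD(encoder.parameters(), lr=lr, momentum=0.9, weight_decay=1e-4)
```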

2.5.2 Model training with mixed dataset

To evaluate the performance of the model modifications relative to the original YOLOv8s, we utilized a dataset containing 344 images (212 from our lab and 132 from the ABO). These images were divided into three distinct sub-datasets (training set: 270 images; validation set: 37 images; testing set: 37 images). The segmentation network was trained for 200 epochs (batch size: 8). The training phase incorporated an early-stopping strategy to improve the model's generalization ability. During the final 10 epochs, mosaic data augmentation was disabled to further enhance the model's performance [45].
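For orientation, a hedged sketch of such a fine-tuning configuration using the stock Ultralytics API as a stand-in for the modified YOLOv8s-seg network might look as follows; the dataset file name and the patience value are assumptions.

```python
from ultralytics import YOLO

# Stock YOLOv8s-seg used as a stand-in for the modified C2f-Faster-EMA / BiFPN network.
model = YOLO("yolov8s-seg.yaml")
model.train(
    data="neurons.yaml",    # hypothetical dataset config listing train/val images and the 'neuron' class
    epochs=200,             # 200 training epochs
    batch=8,                # batch size 8
    patience=50,            # early stopping (assumed patience)
    close_mosaic=10,        # disable mosaic augmentation for the final 10 epochs
)
```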

2.5.3 Model training with the ABO dataset

To evaluate neuron segmentation performance, all methods underwent two-round generalization cross-validation using the 20 ABO videos. During this evaluation, the compared methods, including CITE-On [46], SUNS [24], STNeuroNet [27], the classifier of CaImAn [28], and the ROI classifier in Suite2p [19], were trained and tested using the video data. All these methods were optimized according to their respective publications. For training and validating NeuroSeg-III, we employed image data. The best result of NeuroSeg-III (split = 8) was compared with the other neuron segmentation methods.

The training of NeuroSeg-III was conducted as follows: first, we pre-trained the encoder of the segmentation network with the self-supervised learning network (TiCo). Subsequently, we initialized the encoder of the segmentation network via transfer learning and then fine-tuned all the weights of the segmentation network. During fine-tuning, we also applied an early-stopping strategy, and data augmentation was disabled during the final 10 epochs.

2.6 Evaluation metrics

Three metrics (precision, recall, and F1-score) were employed [27] for the evaluation of the segmentation performance:

$$\textrm{Precision} = \frac{N_{\textrm{TP}}}{N_{\textrm{detected}}}$$
$$\textrm{Recall} = \frac{N_{\textrm{TP}}}{N_{\textrm{GT}}}$$
$$\textrm{F1} = \frac{2 \times \textrm{Recall} \times \textrm{Precision}}{\textrm{Recall} + \textrm{Precision}}$$
where $N_{\textrm{GT}}$ is the number of GT neurons, $N_{\textrm{TP}}$ is the number of true-positive neurons, and $N_{\textrm{detected}}$ is the number of detected neurons. The degree of overlap between a detected neuron and a GT mask is quantified using the Intersection-over-Union (IoU) metric:
$$\textrm{IoU}(m_1, m_2) = \frac{|m_1 \cap m_2|}{|m_1 \cup m_2|}$$
where $m_1$ and $m_2$ are two binary masks. The distance (Dist) between masks is calculated as:
$$\textrm{Dist}(m_i^{\textrm{GT}}, M_j) = \begin{cases} 1 - \textrm{IoU}(m_i^{\textrm{GT}}, M_j), & \textrm{IoU}(m_i^{\textrm{GT}}, M_j) \ge 0.5 \\ 0, & m_i^{\textrm{GT}} \subseteq M_j \textrm{ or } M_j \subseteq m_i^{\textrm{GT}} \\ \infty, & \textrm{otherwise} \end{cases}$$
where ${M_j}$ and $m_i^{\textrm{GT}}$ are the masks for the detected and GT neurons, respectively. Subsequently, the Hungarian algorithm was utilized for the matching calculations.
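A NumPy/SciPy sketch of this matching and scoring procedure is given below; it follows the equations above and uses scipy.optimize.linear_sum_assignment for the Hungarian matching, with a large finite cost standing in for the infinite distance.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_and_score(gt_masks, det_masks, iou_thresh=0.5, big=1e6):
    """Match detected masks to GT masks with the Hungarian algorithm and compute
    precision, recall, and F1 as defined above. Masks are boolean (H, W) arrays."""
    n_gt, n_det = len(gt_masks), len(det_masks)
    cost = np.full((n_gt, n_det), big)
    for i, g in enumerate(gt_masks):
        for j, d in enumerate(det_masks):
            inter = np.logical_and(g, d).sum()
            union = np.logical_or(g, d).sum()
            iou = inter / union if union else 0.0
            if iou >= iou_thresh:
                cost[i, j] = 1.0 - iou                                  # Dist = 1 - IoU
            elif inter and (inter == g.sum() or inter == d.sum()):
                cost[i, j] = 0.0                                        # one mask contained in the other
    rows, cols = linear_sum_assignment(cost)
    n_tp = int(np.sum(cost[rows, cols] < big))                          # matched (true-positive) neurons
    precision = n_tp / max(n_det, 1)
    recall = n_tp / max(n_gt, 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-12)
    return precision, recall, f1
```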

3. Results

Data preprocessing, model training, and testing were performed on a workstation with an Intel Xeon Gold 6258R CPU, an NVIDIA RTX A6000 GPU, and 640 GB of RAM.

3.1 C2f-Faster-EMA and BiFPN improved segmentation accuracy while reducing the model's parameters

To demonstrate the efficacy of the model improvements, we carried out ablation experiments in which we selectively removed the FasterNet, EMA, and BiFPN components from the enhanced segmentation network. The results, shown in Fig. 4(A), indicate that our enhanced model (0.9394 ± 0.0288) exhibits a significant increase in F1-score compared to the original YOLOv8s (0.9254 ± 0.035, P < 0.0001). Notably, the EMA module has the most pronounced impact on model accuracy. While the FasterNet and BiFPN modules also contribute a slight improvement in accuracy, their primary contribution lies in reducing the model's computational cost (GFLOPs) and parameters (Fig. 4(B)). This reduction enables faster training and shorter inference times, making it easier to generalize to new data and to deploy the model on mobile computing devices, especially when hardware resources are limited. Although the EMA structure introduces a slight increase in computational cost and parameters, this increment is negligible compared to its enhancement of segmentation accuracy.


Fig. 4. The improvements over YOLOv8s and the pre-trained weights obtained via self-supervised learning achieve higher segmentation accuracy while reducing the model's parameters. (A) Ablation experiment evaluating the network components, including the FasterNet, EMA, and BiFPN modules (*P < 0.05; **P < 0.005; ****P < 0.0001; n = 37 images; two-sided Wilcoxon signed-rank test; error bars are SD; n.s., not significant). (B) The segmentation module of NeuroSeg-III significantly reduced GFLOPs and the number of model parameters. (C) Pre-trained weight transfer via self-supervised learning enhances the segmentation ability of our proposed model, especially with few labelled training data. We used the 10 ABO 275 µm videos and the 10 ABO 175 µm videos separately as the training and validation datasets. The horizontal axis represents the number of partitions applied to the original ABO videos.


3.2 Self-supervised module improved the model's segmentation accuracy while reducing the reliance on GT

Existing neuron segmentation methods based on supervised learning often require a large amount of labelled data to achieve good performance. How, then, can we achieve comparable performance with a small amount of labelled data? To tackle this challenge, we investigated the potential of integrating self-supervised learning techniques with the segmentation network. We divided each of the 20 ABO videos into 1 to 10 segments. For each segment of a video, we generated a merged image by combining features from the maximum projection and correlation map. Each training set was composed of merged images created from the segmented video segments and 10 merged images generated from complete videos. Additionally, each validation dataset consisted of 10 merged images generated from complete videos. We meticulously verified the ground truth for each image. In total, we conducted 10 groups of two-round generalization cross-validation experiments to assess the impact of self-supervised pre-training on the segmentation model's performance.

From Fig. 4(C), it is evident that the model trained with pre-trained weights consistently outperforms the model trained without them. The improvement is larger when the amount of labelled data is smaller, and it is most pronounced when split = 1, where loading the self-supervised pre-trained weights has the greatest impact. When split = 5, the segmentation network with TiCo pre-training (0.8787 ± 0.226) achieved an F1-score similar to that at split = 10 without pre-trained weights (0.8781 ± 0.223), nearly halving the amount of training data required. This demonstrates that the framework of NeuroSeg-III, which combines the self-supervised module with the segmentation network, can achieve high accuracy with fewer training samples, significantly reducing the labor of labeling data.

3.3 Segmentation network achieved precise and generalized neuron segmentation

By training the segmentation network of NeuroSeg-III with a mixed dataset, we evaluated its performance in generalized neuron segmentation tasks. The dataset encompassed images derived from various Ca2+ indicators, imaging scales and depths, brain regions, and neuron activity patterns. This enabled the proposed model to undergo extensive learning and acquire generalized neuron segmentation capabilities.

After completing the hybrid training, we evaluated the model's performance on two-photon imaging datasets that included three different Ca2+ indicators and imaging scales [47–49]. The results, visualized in Fig. 5, demonstrate the segmentation network's strong performance across different Ca2+ indicators and imaging scales. Furthermore, we examined the activities of the segmented neurons, as illustrated in Figs. 5(A-C). Our model effectively identified active neurons, indicating that the spatiotemporal information was successfully integrated by the image fusion preprocessing and thereby contributed to good segmentation performance. Consequently, the trained segmentation network of NeuroSeg-III amalgamates diverse neuron characteristics from imaging data, laying the groundwork for generalized neuron segmentation tasks.


Fig. 5. Example results of neuron segmentation with NeuroSeg-III. (A–C) Representative examples of neuron segmentation in imaging data for (A) OGB-1 (F1 = 0.9552; data from our lab), (B) Cal-520 (F1 = 0.9870; data from our lab) and (C) GCaMP6f (F1 = 0.896; data from ABO dataset, Experiment ID: 527048992). The left side shows the segmented neurons, and the right side shows the neuronal activities. GT neurons are outlined in yellow, while those detected by NeuroSeg-III are outlined in orange. Scale bar: OGB-1, 20 µm; Cal-520, 20 µm; GCaMP6f, 50 µm.


3.4 Comparing the accuracy and speed of NeuroSeg-III with other neuron segmentation methods reveals notable enhancement

To compare NeuroSeg-III with other existing neuron segmentation techniques, including CITE-On [46], SUNS [24], STNeuroNet [27], CaImAn [28], and Suite2p [19], we utilized the ABO dataset to perform a two-round cross-validation. The representative image (Fig. 6(A)), showing the segmented neurons alongside the GT, serves as concrete evidence of our network's strong performance on this dataset. Figure 6(B) provides a closer view, showing that only our method accomplished precise neuron segmentation without omissions or misidentifications. The performance of each method was evaluated in terms of precision, recall, F1-score, and inference speed, as shown in Figs. 6(C, D).


Fig. 6. NeuroSeg-III exhibits superior performance in terms of both accuracy and speed compared to other methods. (A) Representative examples (ABO Experiment ID: 527048992) demonstrating the results of NeuroSeg-III (F1 = 0.9044), CITE-On (F1 = 0.885), SUNS (F1 = 0.881), STNeuroNet (F1 = 0.8521), CaImAn (F1 = 0.8035) and Suite2p (F1 = 0.7616). The GT neurons (yellow) and segmented neurons (other colors) are overlaid on the two-photon images. Scale bar, 50 µm. (B) Zoomed-in images of the green-boxed regions depict examples of segmented neurons. Scale bar, 20 µm. (C) Comparison of NeuroSeg-III with other methods in terms of the segmentation metrics. The score for each image is represented as a gray dot. (D) Comparison of NeuroSeg-III with other methods in terms of processing speed. **P < 0.005; ***P < 0.001; ****P < 0.0001; n = 20 images or videos; two-sided Wilcoxon signed-rank test; n.s., not significant; error bars are SD.


Figure 6(C) shows that our method (0.8836 ± 0.0231) surpassed the current state-of-the-art segmentation method, SUNS (0.8564 ± 0.0253; P = 0.0002), in terms of F1-score. It also outperformed all other algorithms in F1-score (P < 0.0001). NeuroSeg-III's precision (0.9073 ± 0.0297) was higher than that of SUNS (0.8764 ± 0.051, P = 0.0081) and significantly (P < 0.0001) higher than those of CITE-On (0.8309 ± 0.0419), STNeuroNet (0.7183 ± 0.0624), CaImAn (0.7693 ± 0.0475), and Suite2p (0.7626 ± 0.0872). NeuroSeg-III's recall (0.8621 ± 0.0313) was comparable to those of CITE-On (0.8632 ± 0.03), SUNS (0.838 ± 0.05), and STNeuroNet (0.8670 ± 0.0278), with no statistically significant differences (P > 0.5). However, it was still significantly (P < 0.0001) higher than those of Suite2p (0.7673 ± 0.0746) and CaImAn (0.625 ± 0.0665).

In addition to its higher accuracy, the inference speed of NeuroSeg-III (Fig. 6(D)) also exceeded that of SUNS, albeit not statistically significantly (P = 0.0501); NeuroSeg-III processed videos at about 2 ms/frame and was significantly (P < 0.0001) faster than the other methods.

4. Discussion

Self-supervised learning methods enable the learning of data features without labelled data and represent a future trend in deep learning. We chose TiCo as our self-supervised learning network, which combines contrastive learning with redundancy reduction. To enrich the information contained in each training image and reduce training time, we performed a maximum projection every 20 frames, which is equivalent to subsampling the original video. Additionally, we made slight adjustments to the training parameters tailored to the features of the imaging data. The evaluation results (Fig. 4(C)) indicate that the trained self-supervised model efficiently captured the data features of the ABO dataset. It improved the segmentation performance of the model regardless of the amount of training data available to the segmentation network, and the improvement is particularly pronounced for smaller subsets of training images. Moreover, using only half of the ground truth data for training, we achieved results that outperformed those obtained with the segmentation network alone.

The proposed segmentation network was built upon YOLOv8s with the following modifications: the backbone module incorporates FasterNet and the EMA attention mechanism, while the neck module uses BiFPN connections. Before training the segmentation network, we applied preprocessing operations similar to our previous work [26]. However, in this case, we tailored the image fusion to the imaging characteristics of the ABO dataset by combining the maximum projection, instead of the average projection, with correlation maps. This allows each 2D input image to contain more spatiotemporal information. We performed ablation experiments to demonstrate the effectiveness of each of our improvement modules individually. Combining FasterNet and BiFPN effectively reduces the model's GFLOPs and parameters (Fig. 4(B)), thereby enhancing computational efficiency. EMA, a novel and efficient multi-scale attention mechanism, introduces a restructuring approach wherein a portion of the channels is reorganized into the batch dimension, while the channel dimension is divided into multiple sub-features. This restructuring ensures that spatial semantic features are evenly distributed within each feature group, effectively retaining channel-level information while minimizing computational overhead. As a result, EMA not only significantly enhances the segmentation performance of the model but also requires only a negligible increase in computational cost. The visualized heatmaps (Fig. 3(B)) show that EMA focuses more on the neuron regions across various neuron numbers and imaging scales in two-photon Ca2+ imaging data. Combined with the results of the ablation experiments (Fig. 4(A)), it is evident that FasterNet, EMA, and BiFPN, when integrated with YOLOv8s, have a synergistic effect, further enhancing the model's segmentation capability.

5. Conclusion

In this study, we propose an efficient automated neuron segmentation approach named NeuroSeg-III. The approach, based on self-supervised learning, consists of two modules: a self-supervised network and an improved YOLOv8s segmentation network. We used the self-supervised network to pre-train the encoder by learning unsupervised feature representations of the data. Subsequently, we fine-tuned the segmentation network for the downstream neuron segmentation task with a limited number of labelled data samples. Our approach is a generalized model that achieves promising segmentation results across various brain regions, imaging depths, calcium indicators, and scales of two-photon imaging data. Furthermore, it does not require complex parameter tuning. When testing the segmentation performance of NeuroSeg-III on a public dataset (ABO), our approach outperformed the state-of-the-art methods (CITE-On, Suite2p, CaImAn, STNeuroNet, and SUNS), achieving the highest precision and F1-score. In addition to the highest segmentation accuracy, NeuroSeg-III was also much faster than the other methods.

In the future, we plan to develop an end-to-end neuron segmentation approach based on self-supervised learning. With such an approach, there would be no need for weight transfer; instead, the model could be trained directly on raw data to extract neuron features and perform neuron segmentation.

Funding

National Natural Science Foundation of China (31925018, 32127801, 32171096); Guangxi Science and Technology Base & Talents Fund (GUIKE AD22035948).

Acknowledgments

The authors are grateful to Ms. Jia Lou for technical assistance. X.C. is a member of the CAS Center for Excellence in Brain Science and Intelligence Technology.

Disclosures

The authors declare no conflicts of interest.

Data availability

The imaging data supporting the results of this article are available from the corresponding authors upon reasonable request. All the code of NeuroSeg-III together with the pre-trained self-supervised model and the segmentation model are freely available at [50]. The public Allen Brain Observatory (ABO) dataset used in this work can be found at [51].

References

1. F. Helmchen and W. Denk, “Deep tissue two-photon microscopy,” Nat. Methods 2(12), 932–940 (2005). [CrossRef]  

2. B. F. Grewe, D. Langer, H. Kasper, et al., “High-speed in vivo calcium imaging reveals neuronal network activity with near-millisecond precision,” Nat. Methods 7(5), 399–405 (2010). [CrossRef]  

3. T. W. Chen, T. J. Wardill, Y. Sun, et al., “Ultrasensitive fluorescent proteins for imaging neuronal activity,” Nature 499(7458), 295–300 (2013). [CrossRef]  

4. H. Dana, Y. Sun, B. Mohar, et al., “High-performance calcium sensors for imaging activity in neuronal populations and microcompartments,” Nat. Methods 16(7), 649–657 (2019). [CrossRef]  

5. C. Stringer, M. Pachitariu, N. Steinmetz, et al., “Spontaneous behaviors drive multidimensional, brainwide activity,” Science 364(6437), 255 (2019). [CrossRef]  

6. Y. Zhang, M. Rózsa, Y. Liang, et al., “Fast and sensitive GCaMP calcium indicators for imaging neural populations,” Nature 615(7954), 884–891 (2023). [CrossRef]  

7. N. J. Sofroniew, D. Flickinger, J. King, et al., “A large field of view two-photon mesoscope with subcellular resolution for in vivo imaging,” eLife 5, e14472 (2016). [CrossRef]  

8. T. H. Kim and M. J. Schnitzer, “Fluorescence imaging of large-scale neural ensemble dynamics,” Cell 185(1), 9–41 (2022). [CrossRef]  

9. E. A. Pnevmatikakis, “Analysis pipelines for calcium imaging data,” Curr. Opin. Neurobiol. 55, 15–21 (2019). [CrossRef]  

10. C. Stringer and M. Pachitariu, “Computational processing of neural recordings from calcium imaging data,” Curr. Opin. Neurobiol. 55, 22–31 (2019). [CrossRef]  

11. T. Liu, G. Li, J. Nie, et al., “An automated method for cell detection in zebrafish,” Neuroinformatics 6(1), 5–21 (2008). [CrossRef]  

12. J. Tomek, O. Novak, and J. Syka, “Two-photon processor and SeNeCA: a freely available software package to process data from two-photon calcium imaging at speeds down to several milliseconds per frame,” J. Neurophysiol. 110(1), 243–256 (2013). [CrossRef]  

13. P. Kaifosh, J. D. Zaremba, N. B. Danielson, et al., “SIMA: Python software for analysis of dynamic fluorescence imaging data,” Front. Neuroinform. 8, 80 (2014). [CrossRef]  

14. A. I. Mohammed, H. J. Gritton, H. A. Tseng, et al., “An integrative approach for analyzing hundreds of neurons in task performing mice using wide-field calcium imaging,” Sci. Rep. 6(1), 20986 (2016). [CrossRef]  

15. J. Guan, J. Li, S. Liang, et al., “NeuroSeg: automated cell detection and segmentation for in vivo two-photon Ca2+ imaging data,” Brain. Struct. Funct. 223(1), 519–533 (2018). [CrossRef]  

16. E. A. Mukamel, A. Nimmerjahn, and M. J. Schnitzer, “Automated analysis of cellular signals from large-scale calcium imaging data,” Neuron 63(6), 747–760 (2009). [CrossRef]  

17. R. Maruyama, K. Maeda, H. Moroda, et al., “Detecting cells using non-negative matrix factorization on calcium imaging data,” Neural Netw. 55, 11–19 (2014). [CrossRef]  

18. E. A. Pnevmatikakis, D. Soudry, Y. Gao, et al., “Simultaneous Denoising, Deconvolution, and Demixing of Calcium Imaging Data,” Neuron 89(2), 285–299 (2016). [CrossRef]  

19. M. Pachitariu, C. Stringer, M. Dipoppa, et al., “Suite2p: beyond 10,000 neurons with standard two-photon microscopy,” bioRxiv, bioRxiv:061507 (2017).

20. A. Giovannucci, J. Friedrich, M. Kaufman, et al., “OnACID: online analysis of calcium imaging data in real time,” in 31st Annual Conference on Neural Information Processing Systems (pp. 1–11) (2017).

21. B. Kayalibay, G. Jensen, and P. Smagt, “CNN-based segmentation of medical imaging data,” arXiv, arXiv:1701.03056 (2017).

22. O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18 (pp. 234–241) (2015).

23. A. Klibisz, D. Rose, M. Eicholtz, et al., “Fast, simple calcium imaging segmentation with fully convolutional networks,” in International Workshop on Deep Learning in Medical Image Analysis (pp. 285–293) (2017).

24. Y. Bao, S. Soltanian-Zadeh, S. Farsiu, et al., “Segmentation of Neurons from Fluorescence Calcium Recordings Beyond Real-time,” Nat. Mach. Intell. 3(7), 590–600 (2021). [CrossRef]  

25. K. He, G. Gkioxari, P. Dollár, et al., “Mask R-CNN,” in Proceedings of the IEEE International Conference on Computer Vision (pp. 2980–2988) (2017).

26. Z. Xu, Y. Wu, J. Guan, et al., “NeuroSeg-II: A deep learning approach for generalized neuron segmentation in two-photon Ca2+ imaging,” Front. Cell. Neurosci. 17, 1127847 (2023). [CrossRef]  

27. S. Soltanian-Zadeh, K. Sahingur, S. Blau, et al., “Fast and robust active neuron segmentation in two-photon calcium imaging using spatiotemporal deep learning,” Proc. Natl. Acad. Sci. U.S.A. 116(17), 8554–8563 (2019). [CrossRef]  

28. A. Giovannucci, J. Friedrich, P. Gunn, et al., “CaImAn an open source tool for scalable calcium imaging data analysis,” Elife 8, e38173 (2019). [CrossRef]  

29. J. Zhu, R. M. Moraes, S. Karakulak, et al., “TiCo: Transformation Invariance and Covariance Contrast for Self-Supervised Visual Representation Learning,” arXiv, arXiv: 2206.10698 (2022).

30. J. Chen, S. H. Kao, H. He, et al., “Run, don't walk: chasing higher flops for faster neural networks,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 12021–12031) (2023).

31. D. Ouyang, S. He, G. Zhang, et al., “Efficient multi-scale attention module with cross-spatial learning,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 1–5) (2023).

32. M. Tan, R. Pang, and Q. V. Le, “EfficientDet: Scalable and Efficient Object Detection,” in 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 10778–10787) (2020).

33. J. Li, X. Liao, J. Zhang, et al., “Primary auditory cortex is required for anticipatory motor response,” Cereb. Cortex 27(6), 3254–3271 (2017). [CrossRef]  

34. M. Wang, X. Liao, R. Li, et al., “Single-neuron representation of learned complex sounds in the auditory cortex,” Nat. Commun. 11(1), 4361 (2020). [CrossRef]  

35. H. Jia, N. L. Rochefort, X. Chen, et al., “Dendritic organization of sensory input to cortical neurons in vivo,” Nature 464(7293), 1307–1312 (2010). [CrossRef]  

36. H. Jia, Z. Varga, B. Sakmann, et al., “Linear integration of spine Ca2+ signals in layer 4 cortical neurons in vivo,” Proc. Natl. Acad. Sci. U.S.A. 111(25), 9277–9282 (2014). [CrossRef]  

37. H. Foroosh, J. B. Zerubia, and M. Berthod, “Extension of phase correlation to subpixel registration,” IEEE Trans. on Image Process. 11(3), 188–200 (2002). [CrossRef]  

38. A. Alba, J. F. Vigueras-Gomez, E. R. Arce-Santana, et al., “Phase correlation with sub-pixel accuracy: A comparative study in 1D and 2D,” Comput. Vis. Image Und. 137, 76–87 (2015). [CrossRef]  

39. S. P. Shen, H. A. Tseng, K. R. Hansen, et al., “Automatic cell segmentation by adaptive thresholding (ACSAT) for large-scale calcium imaging datasets,” eNeuro 5(5), ENEURO.0056-18.2018 (2018). [CrossRef]  

40. D. Stoyanov, Z. Taylor, G. Carneiro, et al., Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support (pp. 285–293) (2018).

41. J. Terven and D. Cordova-Esparza, “A comprehensive review of YOLO: From YOLOv1 to YOLOv8 and beyond,” arXiv, arXiv:2304.00501 (2023).

42. J.-H. Kim, N. Kim, Y. W. Park, et al., “Object detection and classification based on YOLO-V5 with improved maritime dataset,” J. Mar. Sci. Eng. 10(3), 377 (2022). [CrossRef]  

43. A. Farid, F. Hussain, K. Khan, et al., “A fast and accurate real-time vehicle detection method using deep learning for unconstrained environments,” Appl. Sci. 13(5), 3059 (2023). [CrossRef]  

44. M. Hussain, “YOLO-v1 to YOLO-v8, the rise of YOLO and its complementary nature toward digital manufacturing and industrial defect detection,” Machines 11(7), 677 (2023). [CrossRef]  

45. Z. Ge, S. Liu, F. Wang, et al., “Yolox: Exceeding yolo series in 2021,” arXiv, arXiv:2107.08430 (2021).

46. L. Sità, M. Brondi, P. Lagomarsino de Leon Roig, et al., “A deep-learning approach for online cell identification and trace extraction in functional two-photon calcium imaging,” Nat. Commun. 13(1), 1529 (2022). [CrossRef]  

47. C. Xu, W. Zipfel, J. B. Shear, et al., “Multiphoton fluorescence excitation: new spectral windows for biological nonlinear microscopy,” Proc. Natl. Acad. Sci. U.S.A. 93(20), 10763–10768 (1996). [CrossRef]  

48. M. Tada, A. Takeuchi, M. Hashizume, et al., “A highly sensitive fluorescent indicator dye for calcium imaging of neural activity in vitro and in vivo,” Eur. J. Neurosci. 39(11), 1720–1728 (2014). [CrossRef]  

49. J. P. Gilman, M. Medalla, and J. I. Luebke, “Area-specific features of pyramidal neurons—a comparative study in mouse and rhesus monkey,” Cereb. Cortex 27(3), 2078–2094 (2016). [CrossRef]  

50. Y. Wu, Z. Xu, and S. Liang, “NeuroSeg-III: pre-trained self-supervised model and the segmentation model,” Github, 2024, https://github.com/zimo-k/NeuroSeg3

51. ABO, “Allen Brain Observatory,” ABO, 2018, https://observatory.brain-map.org/visualcoding
