Cross-modal attention network for retinal disease classification based on multi-modal images

Zirong Liu; Yan Hu; Yan Hu; Zhongxi Qiu; Yanyan Niu; Dan Zhou; Xiaoling Li; Junyong Shen; Hongyang Jiang; Heng Li; Jiang Liu; Jiang Liu

doi:10.1364/BOE.516764

1. Introduction

Retinal diseases often impair a patient’s vision or even lead to blindness [1–3]. With the advancement of imaging technology, ophthalmologists often rely on different modal medical images for diagnosis, such as optical coherence tomography (OCT) [4] and color fundus photography (CFP) [5], which offer differential perspectives of retina. The large amount of images bring more references for diagnosis, but they also bring a significant burden to read and determine the results [6]. Therefore, to aid disease diagnosis, researchers have tried to propose various automatic algorithms for high-accuracy disease classification.

Depending on the input images, the algorithms can be classified as single-model based and multiple-mode based methods. For the former, just one modal retinal image is used as input [5,7–10]. For example, Shen et al. [7] proposed a structure-oriented Transformer framework to further construct the relationship between lesions and age-related macular degeneration (AMD) on OCT images. Li et al. [9] developed a clinically feasible deep learning system for predicting and stratifying the risk of glaucoma onset and progression from CFP and performed clinical validation in an external population cohort. Peng et al. [8] used Inception V3 [11] and ResNet-50 [12] for CFP image classification for early diagnosis of retinal diseases. However, ophthalmologists often determine the final diagnosis based on multiple modal information [13], as single-modal data only provides limited information, which easily results in biased assistance.

The view of lesions provided by the different data helps ophthalmologists to build a comprehensive perception of the disease situation [13–16]. For example, OCT and CFP are two commonly-applied [17], as shown in Fig. 1. The CFP image provides a 2D projection of the retina with wide field of view, and the OCT captures a 3D cross-sectional view. Several automatic diagnosis algorithms using multi-modal images as input have emerged [14,16,18], including early input concatenation [18] or late features fusion for classification [14,16]. Hua et al. [18] proposed a network to predict the severity of diabetic retinopathy (DR) based on the concatenation of the CFP images and swept-source optical coherence tomography angiography images. Wang et al. [16] extracted features CFP and OCT based on two different branches and then classified using their late-fused features. Expanding on the work [16], He et al. [14] applied attention mechanisms to different branches of their model, which is opt to emphasize modality-specific features. Thus, these algorithms tend to treat images of different modalities as independent images, but in fact there is a correspondence between images of the same patient in different modalities.

Fig. 1. Examples of CFP and OCT images. CFP provides a two-dimensional projection view of the retinal, and OCT provides detailed images of changes in different layers of the retina.

Download Full Size | PDF

Referring to ophthalmologists’ diagnosis based on multi-modal images, before giving the determination, they often repeatedly compare these images and observe their correlations with diseases. Therefore, we propose a new cross-modal retinal disease diagnosis network (CRD-Net), which mines the correlation between multi-modal images for retinal disease diagnosis. To achieve the purpose, we propose a cross modal attention (CMA) module to capture and emphasize the relationships between different modal images. CMA module extracts relevant information while suppressing irrelevant informative features, thereby improving the overall discriminative power of the model. Therefore, the contributions of the paper are listed as:

1. We propose a new cross-modal retinal disease diagnosis network (CRD-Net) for multiple retinal diseases diagnosis based on multiple modal images.
2. We propose a cross modal attention (CMA) module for CRD-Net to capture relationships between different modal images.
3. We propose multiple loss functions based on different modal inputs to provide suitable constraints for the network.
4. Extensive experiments based on three publicly available retinal disease datasets are conducted to prove the effectiveness of our proposed CRD-Net.

2. Related works

2.1. Disease diagnosis based on multi-modal images

The success of deep learning based visual recognition in various applications has stimulated interest in solving retinal disease classification tasks from multi-modal images [14,15,19–23]. Yoo et al. [20] employed random forests and VGG networks to extract features from OCT and CFP images, and then experimented with feature concatenation to aid in the multi-modal image diagnosis of AMD. Zou et al. [21] introduced a multi-modal evidence fusion pipeline for eye disease screening that provides a single-modal confidence measure and integrates multi-modal information from a multiple distributional fusion perspective. However, this simple concatenation method can result in the loss of relevant information between modalities. To reduce the loss of relevant features, researchers have explored different algorithms. Chen et al. [22] designed a vertical plane feature fusion method for multi-modal fusion to predict AMD disease use infrared reflectance and OCT images. Li et al. [15] proposed a multi-modal multi-instance Learning deep learning framework using CFP and OCT, selectively fusing CFP and OCT modalities feature. Song et al. [23] developed a multi-modal information bottleneck network (MMIB-Net) leveraging information bottleneck theory for feature representation in multiple modalities. Moreover, attention mechanisms are also used to retain modality-specific features for representation.

2.2. Attention mechanism

The concept of attention mechanism is inspired by the way humans selectively focus on relevant information while ignoring irrelevant details. The basic idea is to introduce a learnable mechanism in the model to automatically select and weigh important features [24]. Most of the existing studies [14,25–31] fall into the category of self-attention [14]. The self-attention receives a feature map F and outputs the self-attention feature map $F_{SA}$ of the same shape as $F$. To produce the self-attention feature maps, the input feature maps $F$ are fed into three learnable 1 $\times$ 1 convolution operations, respectively to produce three tensors (i.e., queries Q, keys K, and values V). The attention weights are generated by calculating the dot products of the Q with all K. The mathematical representation is as follows:

(1)$$\begin{aligned}{F_{SA}}& =Attention(Q, K, V)\\ & = softmax(\frac{QK^T}{\sqrt{d_k}})V \end{aligned}$$

where $T$ denotes the transpose of a matrix, $d_k$ represent the feature dimensions. Chen et al. [31] incorporated trained single-modal self-attention into the CNN layers of their multi-modal network for feature extraction and fine-tuning, but this approach necessitates separate pre-training of single-modal models before their integration into the multi-modal framework. He et al. [14] introduced the self-attention mechanism into multi-modal retinal aided diagnosis. They proposed a modality-specific attention network (MSAN) to obtain the specificity of CFP and OCT by applying different self-attention mechanisms to CFP and OCT respectively. However, this method requires huge computational overhead and mainly focuses on the specificity of the retinal modality itself. It lacks the ability to leverage multi-modal relevant information like ophthalmologists do. To solve these limitations, our framework uses cross-modal attention to extract and fuse lesion-related features.

3. Method

3.1. Framework structure

Clinical ophthalmologists typically consider multiple imaging modalities simultaneously and integrate relevant information between modalities to diagnose eye diseases. Our CRD-Net simulates the real-world diagnosis process, as shown in Fig. 2. Firstly, due to the noticeable difference in the visual appearance of CFP and OCT, as shown in Fig. 1 (OCT and CFP), the ophthalmologist separately identify the CFP and OCT images to look for lesion information. We simulate the step with two CNNs to extract instance-level features for the two modalities and map them to specific feature representations ${F}_{x}$ and ${F}_{y}$. Secondly, when ophthalmologists identify a lesion on one of the modes, they carry the information to find the corresponding features on the other modes and integrate them. In the CRD-Net, this process is completed through the CMA module. Finally, the ophthalmologist makes a diagnosis based on the matching features of the lesions on CFP and OCT. Correspondingly, the combination of features from different modalities is fed into the classifier to predict the class of retina-related diseases in an end-to-end manner. We provide a detailed introduction to each step of the proposed method in the following sections.

Fig. 2. CRD-Net architecture for multi-modal retinal image classification. Feature extraction is performed through two independent CNNs. Then cross-modal feature interactive fusion through CMA. Finally, the interactive features are connected and a fully connected layer is used to predict different categories. We use MHSA to denote multi-head self-attention and UFFN for unified feedforward networks.

Download Full Size | PDF

3.2. Cross modal attention

When clinicians read multimodal medical images, they search for lesion information from the global scale of the image, and then carry the lesion information to another modality to query the corresponding representation. In CRD-Net, the last layer of the backbone produces global-scale feature representations ${F}_{x}$ and ${F}_{y}$. However, ${F}_{x}$ and ${F}_{y}$ tend to pay loose attention to important areas at the global scale. Attention mechanisms make the network to focus more on feature related to disease. Therefore, in the CMA module, we first pay attention to the lesion area of a specific modality through the attention mechanism, and then perform cross-modal related feature fusion.

We use multi-head self-attention (MHSA) to focus on disease-related features, while fine-tuning uses unify feedforward network (UFFN) and 1 $\times$ 1 convolutional layers. The output results are mapped to ${F}_{x}^{'}$ and ${F}_{y}^{'}$. Considering the efficiency of this stage, we construct MHSA and UFFN according to [32]. This process can be expressed as:

(2)$${f}^{'}=CONV_{1\times 1}(MHSA(CONV_{1\times 1}({F})))$$

(3)$${F}^{'}= UFFN({f}^{'} + {F})$$

where ${F}$ denotes ${F}_{x}$ and ${F}_{y}$, ${f}^{'}$ represents the intermediate feature between MHSA and UFFN, and ${F}^{'}$ represents ${F}_{x}^{'}$ and ${F}_{y}^{'}$.

After obtaining the modal-specific features ${F}_{x}^{'}$ and ${F}_{y}^{'}$, we propose the modal interaction part of cross-modal attention, based on the operation of multimodal image observation by ophthalmologists in the previous article. Specifically, in the CFP branch, ${F}_{x}^{'}$ is mapped to key and value, and the feature ${F}_{y}^{'}$ from the OCT branch is mapped to query. Since ${F}_{x}^{'}$ and ${F}_{y}^{'}$ contain modality-specific lesion information, by calculating the attention of ${F}_{x}^{'}$ on ${F}_{y}^{'}$, we can get the lesion features associated with OCT on the CFP branch. In the OCT branch, ${F}_{y}^{'}$ is mapped to key and value, and the feature ${F}_{y}^{'}$ from the CFP branch is mapped to query. Since ${F}_{y}^{'}$ and ${F}_{x}^{'}$ contain modality-specific lesion information, by calculating the attention of ${F}_{y}^{'}$ on ${F}_{x}^{'}$, we can get the lesion features associated with CFP on the OCT branch. In addition, we use residual connections in the module to ensure the efficiency of network propagation. The representation of the entire block is:

(4)$$a_{x,y} = \text{softmax}\left(\frac{W^{Q}_{x}{W^{K}_{y}}^\top}{\sqrt{d_x}}\right) W^{V}_{y} + {F}_{x}^{'}$$

(5)$$a_{y,x} = \text{softmax}\left(\frac{W^{Q}_{y}{W^{K}_{y}}^\top}{\sqrt{d_y}}\right) W^{V}_{x} + {F}_{y}^{'}$$

where ${F}_{x}^{'}$ and ${F}_{y}^{'}$ are the input feature maps, $W^{Q}_{x}$, $W^{K}_{x}$, $W^{V}_{x}$ and $W^{Q}_{y}$, $W^{K}_{y}$, $W^{V}_{y}$ are the parameter matrices obtained after convolution operations on ${F}_{x}^{'}$ and ${F}_{y}^{'}$ respectively. The dimensions are ${d_x}$ and ${d_y}$ respectively. $a_{x,y}$ and $a_{y,x}$ represent the outputs of the CFP branch and OCT branch, respectively.

3.3. Classifier and objective function

After the lesion correlation features are generated, we concatenate them to obtain the multi-modal features and train a fully connected classifier to make the final prediction. We also predict retinal diseases from single OCT and CFP modality, respectively. Due to the diverse sources of features fed into the input classifier, employ independent loss functions to constrain features from different sources can be beneficial for learning modality-specific characteristics and enhance the synergy between different modalities. For training the three classification branches, we design a multiple loss module in our CRD-Net. The use of multiple loss functions provides comprehensive constraints on the overall optimization process of the network, thereby reduce the impact of a single branch on bias induction. We can represent the whole module as:

(6)$$\Gamma = SGD\left ( L_{OCT}+L_{Both}+L_{Fundus} \right )$$

where $L_{OCT}$ and $L_{Fundus}$ represent the losses of each single branch computed using cross-entropy. $L_{Both}$ represents the cross-entropy loss obtained by fusion both modalities. We add these three losses together and then optimize using stochastic gradient descent (SGD). During the training process, the goal of the entire model is to minimize the primary loss function $\Gamma$, which is the sum of the three losses. In the ablation experiment stage, we demonstrated the superiority of this constraint.

4. Experiments

4.1. Datasets

We evaluate the effectiveness of our method based on three publicly available datasets (MMC-AMD [16,33], APTOS-2021 [34], and GAMMA [35]). Compared to acquiring volume 3D OCT data, obtaining single 2D OCT images is a simpler process. Therefore, the input for the CRD-Net consists of one 2D CFP image with one 2D OCT image. For the selection of 2D OCT images, we take different approaches depending on the datasets. For the MMC-AMD dataset, the 2D OCT images were selected by professional ophthalmologist, focusing on the diseased areas of the patients. This process is specifically described in Ref. [16,33]. For the APTOS-2021 and GAMMA datasets, we selected the center frame of 3D OCT volume data as input. For APTOS-2021, the images are collected for some macular related diseases, which is also the focus center while captured, so the center of macular is the main concern of ophthalmologist. For GAMMA datasets, it is related to glaucoma mainly related to optics, but the ophthalmologists often check the condition of macular. Therefore, based on the knowledge discussed with doctors, we selected the center frame of 3D OCT volume as the input of OCT branch in our framework. The details are listed in Table 1 and illustrated as follows:

Table 1. Dataset Size Table

View Table | View all tables in this article

MMC-AMD: The MMC-AMD dataset [16,33] comprises two modalities: CFP and OCT. To adapt to the task of aid diagnosis using multimodal images of the same patient, we organized a total of 768 pairs of samples by associating CFP images with OCT images through matching patient identifiers. This dataset encompasses four categories: normal (195), dry AMD (57), Polypoidal Choroidal Vasculopathy (PCV) (185), and wet AMD (331). We divide the dataset into train and test sets in an 8:2 ratio use random partition. Specifically, 615 pairs of samples are allocated for train, while 153 pairs are designated for test.

APTOS-2021: The APTOS-2021 dataset [34] is present by the asia-pacific tele-ophthalmology society (APTOS) during the 2021 asia-pacific ophthalmology society’s big data competition. Its image modalities include CFP (fundus photography) and OCT images. To ensure that the primary research question is not affected by class imbalance, we select three diseases to validate the performance of our method, with a total of 1,497 pairs of samples. These diseases include wet AMD (596), PCV (406), and diabetic macular edema (DME) (495). Following the official data partition, 1,298 pairs of samples are allocated for training and 199 pairs for testing.

GAMMA: The GAMMA dataset [35] originates from the 2021 MICCAI Contest: GAMMA Challenge Task 1: Multi-modal Glaucoma Grading. The dataset comprises 200 pairs of clinical multi-modal images, of which 100 pairs for training and 100 pairs for testing. It includes CFP and OCT modalities, divided into three categories: Normal, Early Glaucoma, and Advanced Glaucoma. Due to the unavailability of labels for the test set in the challenge, this study solely utilized the training dataset. We reconfigure the dataset into train and test sets at an 8:2 ratio using a random split approach. Specifically, 80 pairs of samples are allocated for train, while 20 pairs are reserved for test purposes.

4.2. Implementation details

We implement the models with the PyTorch [36] framework and all experiments are conducted on a Geforce RTX 2080 Ti GPU. The optimizer, weight decay, momentum, epoch, and batch size are SGD, 1$e-$4, 0.9, 150, and 8, respectively. The initial learning rate is 1$e-$4 and decreases to 0.1$\times$ at epochs 50, and 100. We adopted the same input resolution as in the case of [16] to capture more intricate image details. During the training process, we perform data augmentation strategies on each CFP and OCT image using RandomHorizontalFlip with a probability of 0.5 and RandomRotation within a range of $\pm$30$^{\circ }$. We refer [14] for standardisation and normalisation of the input image. Parameters for the comparative methods are configured based on the implementation details in the respective papers, and the best performance is reported. All experiments used pre-trained weights on IMAGENET1K.

4.3. Evaluation metrics

We comprehensively evaluate the performance of our method using six metrics: Recall, Precision (Pre), Specificity (Spe), F1 score, Kappa, and Accuracy (ACC). Sensitivity is equivalent to Recall. The Kappa coefficient and F1 score provide insights into the method’s reliability. The notations for these evaluation metrics are as follows:

(7)$$\begin{aligned}Recall & = Sensitivity = \frac{TP}{TP+FN}\\ Precision & = \frac{TP}{TP+FP}\\ Specificity & = \frac{TN}{FP+TN}\\ F1& = \frac{2*Precision*Recall}{Precision+Recall} = \frac{2*TP}{2*TP+FP+FN}\\ p_{0}& =\frac{TP+TN}{TP+TN+FP+FN}\\ p_{e}& =\frac{\sum_{i=1}^{C}(a_{i}\times b_{i})}{N\times N}\\ Kappa& =\frac{P_{o}-P_{e}}{1-p_{e}}\\ Accuracy& = \frac{TP+TN}{TP+TN+FP+FN} \end{aligned}$$

where $TP$ represents true positives, $FP$ stands for false positives, $FN$ denotes false negatives, $TN$ represents true negatives, $a_i$ indicates the actual number of samples in class $i$, $b_i$ refers to the predicted number of samples in class $i$, C represents the number of classes, and $N$ stands for the total number of samples. Particularly, the multi-class precision, specificity, sensitivity, and F1 results in the following table are the weighted averages across all classes.

4.4. Ablation study

This section introduces our ablation experiments on the MMC-AMD dataset from the following four aspects: composition of the CMA module, the position of CMA, the number of loss branches and the different modules of the algorithm. We use multi-modal CNN (MM-CNN) [16] as the baseline model and compare the corresponding model performance by adding different modules at different positions. The "✓" in the following tables indicates that the network contains the corresponding module.

Ablation study of CMA module The CMA module simulates the operation of an ophthalmologist observing multi-modal images through two parts. To facilitate the study of the contributions of the different parts of the CMA module, we packaged the part before the UFFN as the specific modal attention (SMA) module, and the part after the UFFN as the cross modal fusion (CMF) module. Thus, the main components of CMA are SMA and our proposed CMF. The ablation studies about the two elements are analyzed in this section. To verify the validation of SMA and CMF in our CMA, we delete SMA and CMF respectively on CRD-Net and conducted experiments, as shown in Table 2. The first line is the experimental results of the baseline model. Our SMA enables the network to get performance improvement as shown in the second row of Table 2, which means that our SMA can better focus on lesion features. Our CMF provides the highest precision, as shown in the third row of Table 2, which illustrates the effectiveness of our proposed fusion strategy for cross-modal querying of relevant lesion features. The last line of results shows that our model achieves the best performance on the five metrics with both SMA and CMF. Thus, our CMA module is helpful to dig the relevant features of multi-modal images, which improves the accuracy and stability performance on the aided diagnosis.

Table 2. Comparison of cross-modal attention module combinations. The best results in this table are marked in bold.

View Table | View all tables in this article

Ablation study of CMA module locations The attention mechanism has limitations in its inductive bias capabilities. Its performance is affected by whether the input features are fully extracted. Therefore, we analyze the impact of the location of the CMA module on network performance. We conducted two sets of experiments, introducing the CMA module into the third and fourth layers of the backbone respectively.

The experimental results are shown in Table 3, showing the effect of CMA module position on performance. When the CMA module is integrated into the fourth layer (CMA-Layer4), it achieves the best results on the six evaluation metrics. In the same setup, the performance decreases when the CMA module is integrated into the third layer (CMA-Layer3). Due to the inherent limitations of the attention mechanism’s inductive bias ability, input feature extraction is relatively underdeveloped, which results in a bias while capturing target features. Therefore, we place the CMA module at the end (Layer4) of the backbone.

Table 3. The impact of the placement of the CMA module on performance. The reported MM-CNN refers to the baseline method, and Layer 3 or 4 after CMA indicates the placement position of the CMA module. The best results in this table are marked in bold.

View Table | View all tables in this article

Ablation Study of multiple losses The objective function plays a key role in the optimization of the algorithm. Our objective function (Eq. (6)) combines losses from three branches. To explore the best combination of different branch objective functions, we perform ablation experiments by different combinations of $L_{Fundus}$, $L_{OCT}$ and $L_{Both}$ on the overall structure of the model. We run a total of 4 sets of experiments on single branch loss, double branch loss and triple branch loss.

The results of combining different branching objective functions are shown in Table 4. The first row shows the baseline results. We find that using a combination of $L_{Fundus}$, $L_{OCT}$ and $L_{Both}$ effectively improves the performance of classification. When only $L_{Both}$ is used, all evaluation metrics exceed the baseline. However, adding either $L_{Fundus}$ or $L_{OCT}$ to $L_{Both}$ leads to a decline in network performance compared to $L_{Both}$, as shown in the third and fourth row of Table 2. Considering the model structure, adding only $L_{Fundus}$ or $L_{OCT}$ to $L_{Both}$ implies that the network pays more attention to one of OCT or CFP, leading to a bias in network learning. We believe that adding only one loss branch to $L_{Both}$ may distort the learning. When combining $L_{Fundus}$, $L_{OCT}$ and $L_{Both}$ the network gets the best results on the five metrics. It indicates that the combination of the three loss functions provides the most suitable constraints for the learning process of the network.

Table 4. Ablation experiment results with multiple loss functions. The first row is the result of the baseline. The best results in this table are marked in bold.

View Table | View all tables in this article

Ablation Study of Modules Finally, we analyse the effectiveness of our proposed modules applied in our network. We conduct ablation study experiments by adding CMA and multiple loss functions to the baseline model, the results shown in Table 5. When CMA or multiple losses are independently added to the baseline model, the model performance declines. However, when the two modules act together, the model performance reaches its optimum. This suggests that the CMA module requires appropriate constraints for capturing cross-modal lesion-related features. On the other hand, when the objective function collaborates with the CMA module, all six metrics achieve their optimal scores, indicating the overall effectiveness of the model structure.

Table 5. The ablative study results of the contributions to model performance by different modules. The best results in this table are marked in bold.

View Table | View all tables in this article

4.5. Comparison experiments

To prove the effectiveness of our proposed method, we compared with the following state-of-the-art neural networks on three publicly available datasets (MMC-AMD, APOTS, and GAMMA). We divide current methods into two kinds of structure: single-modal model (ResNet18 [12], InceptionV3 [37], VIT [38], SEResNet18 [26], FlexiViT [39], ConvNeXt [40]) and multi-modal model (MM-CNN [16], MSAN [14], DeepGuide [41]). The model based on single mode is further divided into using OCT or CFP as input. We add "OCT" or "CFP" prefixes before the method name to identify single-model input. These models are initialized with pretrained weights provided by the Timm package [42].

The experimental results based on three datasets are shown in Table 6, 7, 8, respectively. The advantages of different methods vary on different datasets, and our method has achieved the best results on the three datasets. Table 6 shows the comparison results on the MMC-AMD dataset, which shows that the same method of using OCT as input generally yields better classification results than using CFP. For AMD disease grading, OCT images are the gold standard for diagnosis. Our CRD-Net relies on capturing the relevant features of lesions on multi-modal images and achieves the best classification results: ACC of 85.62%, precision of 81.77%, sensitivity of 87.75%, specificity of 95.11%, F1 score is 84.66%, and kappa is 79.57%. It outperforms OCT-VIT with ACC score of approximately 83.01%. We use multi-modal inputs in our model, which can make more accurate diagnoses by focusing on the correlation of diseases under different modalities. The same situation also occurs in the APTOS-2021 dataset, as shown in Table 7. The situation changes on the GAMMA dataset, where the use of CFP as input generally obtains better classification results than the use of OCT as input, as the change in the optic disc from fundus images can more likely predict the onset of glaucoma than in OCT images. Even in such case, our method still achieves the best performance, proving the robustness of our method. By comparing the performance of different methods on GAMMA and MMC-AMD datasets, for ophthalmic disease diagnosis, different modal images hold their specific advantages aided for the diagnosis, which further proves the necessity of using the correlation of multi-modal images aiding for disease diagnosis.

Table 6. Performance comparison on MMC-AMD dataset: Results obtained from different methods using single-modal (OCT or CFP) and multi-modal, with the best result highlighted.

View Table | View all tables in this article

Table 7. Performance comparison on APTOS-2021 dataset: Results obtained from different methods using single-modal (OCT or CFP) and multi-modal, with the best result highlighted.

View Table | View all tables in this article

Table 8. Performance comparison on GAMMA dataset: Results obtained from different methods using single-modal (OCT or CFP) and multi-modal, with the best result highlighted.

View Table | View all tables in this article

To further analyze the classification results, we also list the detailed confusion matrices and the class activation mapping (CAM) visualisation by the Grad-CAM method [43] of our algorithm and other comparison algorithms. As shown in Fig. 3, 4 and 5, the overall number of correctly classified images based on CRD-Net is higher than other methods. Our classification for each class superiors to most of other methods without bias. Figures also tell us that our model achieves more comprehensive and robust classification for all classes, as we adopt the correlation between multi-modal images. In the context of medical diagnosis, the interpretability of model decisions is crucial. We try to further understand the mechanism of the network through the result of Grad CAM. The reddish areas mean the highest contribution to the classification followed by yellowish pixels while the bluish ones contribute the lowest to the classification. As shown in Fig. 6, the model exhibits strong attention (red) towards the diseased areas. Additionally, the model attends to corresponding regions of the disease in both modalities. This shows that the multi-modal images inputted into the model contribute to network inference and the multi-modal clinical knowledge aids in the network’s decision-making process. Therefore, our algorithm achieves a promising performance.

Fig. 3. The confusion matrices of results on the AMD-MMC dataset. The four diseases normal, dry_AMD, PCV, and wet_AMD are represented by I, II, III, and IV, respectively.

Download Full Size | PDF

Fig. 4. The confusion matrices of results on the APTOS-2021 dataset. The Wet_AMD, PCV, and DME are represented by I, II, and III, respectively. The results show that proposed method has the smallest sample error and the best classification performance compared to other methods.

Download Full Size | PDF

Fig. 5. The confusion matrices of results on the GAMMA dataset. The Normal, Early Glaucoma, and Advanced Glaucoma are indicated with I, II, and III, respectively.

Download Full Size | PDF

Fig. 6. Class activation map (CAM) visualization of the proposed CRD-Net method. The red area contributes the most to the classification, followed by the yellow pixels, while the blue area contributes the least to the classification. The results show that CRD-Net focuses well on the corresponding lesion locations of CFP and OCT.

Download Full Size | PDF

The results show that proposed method has the smallest sample error and achieves the best performance in classification compared to other methods.

5. Discussion

There are inherent advantages to using a single-modal to design an aided diagnostic algorithm. It is often easier to obtain single-modal images of the same patient than paired multi-modal images. Although aided diagnosis methods based on single-modal medical images have achieved good performance, in clinical practice, doctors can be more confident in their diagnosis by referring to the patient’s different modalities (e.g. CFP and ophthalmic OCT). Therefore, the design of aided diagnostic algorithms based on multi-modal images has attracted much attention. The same lesion is often correlated with different modal medical images with different expressions. It has become a new challenge to design an algorithm that focuses on the information related to retinal diseases in different modalities.

This work focuses on using multi-modal images of patients, designing a cross-modal retinal disease diagnosis network that can focus on relevant information between different modalities, and using multiple losses to help the model learn better features. However, although our method achieves good classification results on multiple clinical datasets, the limitations of our method still exist.

Doctors often fully consider whether patients suffer from multiple diseases simultaneously to avoid the missed diagnosis. However, the existing algorithm cannot consider this factor, as we cannot obtain such related data for validation, which may limit the application scenarios of our algorithm. Due to the use of publicly datasets in our research, patient demographic information was not disclosed; thus, we did not discuss patient demographics in the paper. We will continue to pay attention to the dynamics of these datasets, and if relevant information is disclosed in the future, we can conduct some experimental discussions on this.

6. Conclusion

We introduced an CRD-Net algorithm that integrates the CMA module with multiple loss functions to achieve comprehensive optimization. Extensive experiments proves the effectivity of our proposed module and loss function. Comparison experiments confirms that the relevant features from different modal images are helpful aided diseases diagnosis.

Funding

The National Natural Science Foundation of China (82102189 and 82272086); Shenzhen Stable Support Plan Program (20220815111736001).

Disclosures

The authors declare no conflicts of interest.

Data availability

Data used for training and test underlying the results presented in this paper are available in Ref. [16,33–35].

The code will be publicly at [44].

References

1. L. S. Lim, P. Mitchell, J. M. Seddon, et al., “Age-related macular degeneration,” The Lancet 379(9827), 1728–1738 (2012). [CrossRef]

2. C. M. G. Cheung, T. Y. Lai, P. Ruamviboonsuk, et al., “Polypoidal choroidal vasculopathy: definition, pathogenesis, diagnosis, and management,” Ophthalmology 125(5), 708–724 (2018). [CrossRef]

3. T. A. Ciulla, A. G. Amador, and B. Zinman, “Diabetic retinopathy and diabetic macular edema: pathophysiology, screening, and novel therapies,” Diabetes Care 26(9), 2653–2664 (2003). [CrossRef]

4. D. Huang, E. A. Swanson, C. P. Lin, et al., “Optical coherence tomography,” Science 254(5035), 1178–1181 (1991). [CrossRef]

5. D. Milea, R. P. Najjar, Z. Jiang, et al., “Artificial intelligence to detect papilledema from ocular fundus photographs,” N. Engl. J. Med. 382(18), 1687–1695 (2020). [CrossRef]

6. E. R. Dow, T. D. Keenan, E. M. Lad, et al., “From data to deployment: the collaborative community on ophthalmic imaging roadmap for artificial intelligence in age-related macular degeneration,” Ophthalmology 129(5), e43–e59 (2022). [CrossRef]

7. J. Shen, Y. Hu, X. Zhang, et al., “Structure-oriented transformer for retinal diseases grading from oct images,” Comput. Biol. Med. 152, 106445 (2023). [CrossRef]

8. Y. Pan, J. Liu, Y. Cai, et al., “Fundus image classification using inception v3 and resnet-50 for the early diagnostics of fundus diseases,” Front. Physiol. 14, 160 (2023). [CrossRef]

9. F. Li, Y. Su, F. Lin, et al., “A deep-learning system predicts glaucoma incidence and progression using retinal photographs,” J. Clin. Invest. 132(11), e157968 (2022). [CrossRef]

10. X. Zhang, Z. Xiao, R. Higashita, et al., “A novel deep learning method for nuclear cataract classification based on anterior segment optical coherence tomography images,” in 2020 IEEE International Conference on Systems, Man, and Cybernetics (SMC), (2020), pp. 662–668.

11. C. Szegedy, V. Vanhoucke, S. Ioffe, et al., “Rethinking the inception architecture for computer vision,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (2016), pp. 2818–2826.

12. K. He, X. Zhang, S. Ren, et al., “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (2016), pp. 770–778.

13. A. Arrigo, E. Aragona, M. B. Parodi, et al., “Quantitative approaches in multimodal fundus imaging: state of the art and future perspectives,” Prog. Retinal Eye Res. 92, 101111 (2023). [CrossRef]

14. X. He, Y. Deng, L. Fang, et al., “Multi-modal retinal image classification with modality-specific attention network,” IEEE Trans. Med. Imaging 40(6), 1591–1602 (2021). [CrossRef]

15. X. Li, Y. Zhou, J. Wang, et al., “Multi-modal multi-instance learning for retinal disease recognition,” in Proceedings of the 29th ACM International Conference on Multimedia (Association for Computing Machinery, 2021), MM ’21, pp. 2474–2482.

16. W. Wang, X. Li, Z. Xu, et al., “Learning two-stream CNN for multi-modal age-related macular degeneration categorization,” IEEE J. Biomed. Health Inform. 26(8), 4111–4122 (2022). [CrossRef]

17. M. Hadziahmetovic, P. Nicholas, S. Jindal, et al., “Evaluation of a remote diagnosis imaging model vs dilated eye examination in referable macular degeneration,” JAMA Ophthalmol. 137(7), 802–808 (2019). [CrossRef]

18. C.-H. Hua, K. Kim, T. Huynh-The, et al., “Convolutional network with twofold feature augmentation for diabetic retinopathy recognition from multi-modal images,” IEEE J. Biomed. Health Inform. 25(7), 2686–2697 (2021). [CrossRef]

19. X. Qian, J. Pei, H. Zheng, et al., “Prospective assessment of breast cancer risk from multimodal multiview ultrasound images via clinically applicable deep learning,” Nat. Biomed. Eng. 5(6), 522–532 (2021). [CrossRef]

20. T. K. Yoo, J. Y. Choi, J. G. Seo, et al., “The possibility of the combination of oct and fundus images for improving the diagnostic accuracy of deep learning for age-related macular degeneration: a preliminary experiment,” Med. Biol. Eng. Comput. 57(3), 677–687 (2019). [CrossRef]

21. K. Zou, T. Lin, X. Yuan, et al., “Reliable multimodality eye disease screening via mixture of student’s t distributions,” arXiv, arXiv:2303.09790 (2023). [CrossRef]

22. M. Chen, K. Jin, Y. Yan, et al., “Automated diagnosis of age-related macular degeneration using multi-modal vertical plane feature fusion via deep learning,” Med. Phys. 49(4), 2324–2333 (2022). [CrossRef]

23. J. Song, Y. Zheng, J. Wang, et al., “Multicolor image classification using the multimodal information bottleneck network (mmib-net) for detecting diabetic retinopathy,” Opt. Express 29(14), 22732–22748 (2021). [CrossRef]

24. A. Vaswani, N. Shazeer, N. Parmar, et al., “Attention is all you need,” in Advances in Neural Information Processing Systems, vol. 30I. Guyon, U. V. Luxburg, S. Bengio, et al., eds. (Curran Associates, Inc., 2017).

25. Z. Li, Y. He, S. Keel, et al., “Efficacy of a deep learning system for detecting glaucomatous optic neuropathy based on color fundus photographs,” Ophthalmology 125(8), 1199–1206 (2018). [CrossRef]

26. J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (2018), pp. 7132–7141.

27. M. Jaderberg, K. Simonyan, A. Zisserman, et al., “Spatial transformer networks,” in Advances in Neural Information Processing Systems, vol. 28C. Cortes, N. Lawrence, D. Lee, et al., eds. (Curran Associates, Inc., 2015).

28. F. Wang, M. Jiang, C. Qian, et al., “Residual attention network for image classification,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (2017), pp. 3156–3164.

29. A. G. Roy, N. Navab, and C. Wachinger, “Concurrent spatial and channel squeeze & excitation in fully convolutional networks,” in Medical Image Computing and Computer Assisted Intervention–MICCAI 2018: 21st International Conference, Granada, Spain, September 16-20, 2018, Proceedings, Part I, (Springer, 2018), pp. 421–429.

30. Y. Zhu, C. Zhao, H. Guo, et al., “Attention couplenet: Fully convolutional attention coupling network for object detection,” IEEE Trans. on Image Process. 28(1), 113–126 (2019). [CrossRef]

31. Q. Chen, T. D. Keenan, A. Allot, et al., “Multimodal, multitask, multiattention (M3) deep learning detection of reticular pseudodrusen: Toward automated and accessible classification of age-related macular degeneration,” J. Am. Med. Informatics Assoc. 28(6), 1135–1148 (2021). [CrossRef]

32. Y. Li, J. Hu, Y. Wen, et al., “Rethinking vision transformers for mobilenet size and speed,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), (2023), pp. 16889–16900.

33. W. Wang, Z. Xu, W. Yu, et al., “Two-stream CNN with loose pair training for multi-modal AMD categorization,” in MICCAI, (2019), pp. 156–164.

34. APTOS, “Aptos cross-country datasets benchmark,” Tianchi, 2021, https://tianchi.aliyun.com/specials/promotion/APTOS?spm=a2c22.12281978.0.0.

35. J. Wu, H. Fang, F. Li, et al., “Gamma challenge: glaucoma grading from multi-modality images,” Med. Image Anal. 90, 102938 (2023). [CrossRef]

36. A. Paszke, S. Gross, F. Massa, et al., “Pytorch: An imperative style, high-performance deep learning library,” in Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, et al., eds. (Curran Associates, Inc., 2019), pp. 8024–8035.

37. C. Szegedy, V. Vanhoucke, S. Ioffe, et al., “Rethinking the inception architecture for computer vision,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (2016).

38. A. Dosovitskiy, L. Beyer, A. Kolesnikov, et al., “An image is worth 16×16 words: Transformers for image recognition at scale,” in International Conference on Learning Representations, (2021).

39. L. Beyer, P. Izmailov, A. Kolesnikov, et al., “Flexivit: One model for all patch sizes,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (2023), pp. 14496–14506.

40. Z. Liu, H. Mao, C.-Y. Wu, et al., “A convnet for the 2020s,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (2022), pp. 11976–11986.

41. M. Mallya and G. Hamarneh, “Deep multimodal guidance for medical image classification,” in Medical Image Computing and Computer Assisted Intervention–MICCAI 2022: 25th International Conference, Singapore, September 18–22, 2022, Proceedings, Part VII (Springer, 2022), pp. 298–308.

42. R. Wightman, “Pytorch image models,” Github, 2019, https://github.com/rwightman/pytorch-image-models.

43. R. R. Selvaraju, M. Cogswell, A. Das, et al., “Grad-cam: Visual explanations from deep networks via gradient-based localization,” in Proceedings of the IEEE international conference on computer vision, (2017), pp. 618–626.

44. Z. Liu, Y. Hu, Z. Qui, et al., “Cross-modal attention network for retinal disease classification based on multi-modal images,” Github, 2024, https://github.com/ZirongLiu/CRD-Net.

Dataset	Modal	Class	Data size (pair)
MMC-AMD	OCT CFP	Normal(195) dryAMD(57) PCV(185) wetAMD(331)	Trainset 615 Testset 153
APTOS-2021	OCT CFP	wetAMD(596) PCV(406) DME(495)	Trainset 1298 Testset 199
GAMMA	OCT CFP	Normal(50) Early Glaucoma(26) Advanced Glaucoma(24)	Trainset 80 Testset 20

SMA	CMF	Recall	Pre	Spe	F1	Kappa	Acc
		76.70%	76.62%	91.66%	79.03%	71.88%	80.39%
$✓$		84.22%	79.21%	94.11%	81.63%	75.82%	83.01%
	$✓$	82.64%	82.42%	93.45%	82.53%	75.13%	83.01%
$✓$	$✓$	87.75%	81.77%	95.11%	84.66%	79.57%	85.62%

Modal	Recall	Pre	Spe	F1	Kappa	Acc
MM-CNN	76.70%	76.62%	91.66%	79.03%	71.88%	80.39%
CMA-Layer3	37.54%	41.79%	80.89%	39.55%	24.29%	51.63%
CMA-Layer4	87.75%	81.77%	95.11%	84.66%	79.57%	85.62%

$L_{B o t h}$	$L_{F u n d u s}$	$L_{O C T}$	Recall	Pre	Spe	F1	Kappa	Acc
			76.70%	76.62%	91.66%	79.03%	71.88%	80.39%
$✓$			82.64%	82.42%	93.45%	82.53%	75.13%	83.01%
$✓$	$✓$		85.98%	78.62%	93.61%	82.13%	74.07%	81.70%
$✓$		$✓$	86.87%	78.14%	93.97%	82.27%	74.49%	81.70%
$✓$	$✓$	$✓$	87.75%	81.77%	95.11%	84.66%	79.57%	85.62%

CMA	Multiple Losses	Recall	Pre	Spe	F1	Kappa	Acc
		76.70%	76.62%	91.66%	79.03%	71.88%	80.39%
	$✓$	75.51%	80.10%	91.50%	77.73%	68.51%	79.08%
$✓$		81.50%	76.73%	92.56%	79.05%	70.78%	79.74%
$✓$	$✓$	87.75%	81.77%	95.11%	84.66%	79.57%	85.62%

Cross-modal attention network for retinal disease classification based on multi-modal images

Abstract

1. Introduction

2. Related works

2.1. Disease diagnosis based on multi-modal images

2.2. Attention mechanism

3. Method

3.1. Framework structure

3.2. Cross modal attention

3.3. Classifier and objective function

4. Experiments

4.1. Datasets

4.2. Implementation details

4.3. Evaluation metrics

4.4. Ablation study

4.5. Comparison experiments

5. Discussion

6. Conclusion

Funding

Disclosures

Data availability

References

Data availability

Cited By

Figures (6)

Tables (8)

Equations (7)

Biomedical Optics Express

Modal	Recall	Pre	Spe	F1	Kappa	Acc
OCT-ResNet18	83.14%	77.73%	93.07%	80.35%	72.05%	80.39%
OCT-InceptionV3	80.05%	76.52%	92.28%	78.25%	69.81%	79.08%
OCT-VIT	82.64%	82.42%	93.45%	82.53%	75.13%	83.01%
OCT-SEResNet18	81.82%	78.73%	93.03%	80.25%	72.65%	81.05%
OCT-FlexiViT	81.31%	82.37%	93.74%	81.84%	76.01%	83.66%
OCT-ConvNeXt	81.19%	82.42%	93.35%	81.80%	74.29%	82.35%
CFP-ResNet18	58.45%	81.01%	88.67%	67.91%	57.56%	72.55%
CFP-InceptionV3	59.59%	62.48%	88.47%	61.00%	55.19%	69.93%
CFP-VIT	65.03%	71.89%	88.70%	68.28%	57.92%	72.55%
CFP-SEResNet18	64.65%	66.91%	87.97%	65.76%	53.71%	68.63%
CFP-FlexiViT	70.32%	70.93%	89.34%	70.62%	58.82%	71.90%
CFP-ConvNeXt	53.71%	53.71%	87.92%	53.71%	53.49%	69.93%
MM-CNN	76.70%	76.62%	91.66%	79.03%	71.88%	80.39%
MSAN	63.94%	71.10%	83.77%	67.33%	5234%	70.35%
DeepGuide	54.35%	55.90%	87.69%	55.11%	53.88%	70.59%
Ours	87.75%	81.77%	95.11%	84.66%	79.57%	85.62%

Modal	Recall	Pre	Spe	F1	Kappa	Acc
OCT-ResNet18	67.02%	69.61%	85.09%	68.29%	56.08%	72.36%
OCT-InceptionV3	67.78%	68.40%	85.24%	68.09%	55.92%	71.86%
OCT-VIT	68.45%	68.43%	85.10%	68.44%	55.08%	70.85%
OCT-SEResNet18	66.86%	67.01%	84.20%	66.94%	52.63%	69.35%
OCT-FlexiViT	65.94%	65.66%	83.79%	65.80%	51.23%	68.34%
OCT-ConvNeXt	70.36%	71.30%	86.29%	70.83%	59.19%	73.87%
CFP-ResNet18	64.05%	64.53%	83.64%	64.29%	50.97%	68.84%
CFP-InceptionV3	60.12%	59.60%	80.36%	59.86%	40.23%	60.30%
CFP-VIT	60.25%	80.92%	82.29%	69.07%	48.17%	68.34%
CFP-SEResNet18	61.75%	61.40%	81.85%	61.57%	45.01%	64.32%
CFP-FlexiViT	61.70%	59.39%	83.35%	60.52%	50.95%	69.85%
CFP-ConvNeXt	56.57%	58.70%	79.80%	57.62%	39.76%	62.31%
MM-CNN	70.92%	70.63%	86.37%	70.77%	58.91%	73.37%
MSAN	50.03%	48.90%	76.51%	49.46%	29.83%	56.63%
DeepGuide	59.49%	54.15%	82.09%	56.70%	46.61%	66.83%
Ours	73.33%	74.39%	87.62%	73.86%	63.20%	76.38%

Modal	Recall	Pre	Spe	F1	Kappa	Acc
OCT-ResNet18	61.11%	67.59%	82.37%	64.19%	42.45%	60.00%
OCT-InceptionV3	68.52%	67.78%	85.11%	68.15%	53.85%	70.00%
OCT-VIT	63.62%	70.00%	82.39%	66.66%	42.86%	60.00%
OCT-SEResNet18	77.7%8	80.00%	87.50%	78.87%	57.75%	70.00%
OCT-FlexiViT	59.26%	47.14%	83.68%	52.51%	50.41%	70.00%
OCT-ConvNeXt	59.26%	46.67%	83.22%	52.21%	50.00%	70.00%
CFP-ResNet18	84.26%	82.50%	92.80%	83.37%	76.83%	85.00%
CFP-InceptionV3	84.26%	82.50%	93.27%	83.37%	77.01%	85.00%
CFP-VIT	75.00%	89.77%	91.38%	81.72%	75.10%	85.00%
CFP-SEResNet18	68.52%	68.06%	86.07%	68.29%	54.89%	70.00%
CFP-FlexiViT	88.89%	85.71%	93.75%	87.27%	77.70%	85.00%
CFP-ConvNeXt	80.56%	78.57%	90.72%	79.55%	69.70%	80.00%
MM-CNN	80.56%	78.57%	90.72%	79.55%	69.70%	80.00%
MSAN	52.38%	46.67%	79.25%	49.36%	39.13%	65.00%
DeepGuide	84.26%	82.50%	92.80%	83.37%	76.83%	85.00%
Ours	92.59%	88.89%	95.83%	90.70%	84.85%	90.00%