Multicolor image classification using the multimodal information bottleneck network (MMIB-Net) for detecting diabetic retinopathy

Jingqi Song; Yuanjie Zheng; Jing Wang; Muhammad Zakir Ullah; Wanzhen Jiao

doi:10.1364/OE.430508

1. Introduction

Multicolor (MC) Images can be obtained by combing multicolor scanning laser ophthalmoscope and optical coherence tomography (OCT) fluorescein angiography system, developed by Heidelberg Engineering [1]. It provides insights into many ocular pathologies that affect the treatment of retinal disorders and leads to an accurate diagnosis of retinal abnormalities. Multicolor imaging is more sensitive compared to conventional fundus imaging due to the ability to detect symptomatic features that are reflected in different spectral ranges. Due to the ability to suppress light scattering and reveal retinal findings at different depths, multicolor imaging provides a novel way to examine the retina and may help to diagnose diabetic retinopathy earlier than conventional fundoscopy.

MC Images are of increasing interest to retinal physicians as such images can record en face cSLO fundus/angiographic and cross-sectional imaging with a single device. MC imaging employs three different wavelength laser, viz., blue reflectance (BR, 488 nm), green reflectance (GR, 515 nm), and infrared reflectance (IR, 820 nm), to create a discrete spectral slice [2]. By combining three laser wavelength reflectance images into a single image for each wavelength in one of the three color channels (red, green, blue), a combined-pseudocolor image (multicolor image) will also be obtained. An example of the MC Images is presented in Fig. 1. The blue wavelength scanning light does not penetrate deeper than the retinal nerve fiber layer and can capture details of superficial retinal structures, such as epiretinal membranes, retinal nerve fiber layer thinning, and macular pigment changes. The green wavelength light penetrates the retina and can provide vascular details of the retina such as hard exudates and blood vessels. Due to the longer wavelength, the infrared laser light can reach the retinal pigment epithelium (RPE) and allow better imaging of the retinal pigment epithelium and the choroid [3]. Each slice is a cumulative en face spectral layer of the fundus with specific laser wavelengths [4]. In addition, each modality has two separate field angles. Thus, a complete MC examination of a patient yields 16 images (8 images for each right and left eye), i.e. BR, GR, IR and combined-pseudocolor image (30$^{\circ }$ and 55$^{\circ }$ views).

Fig. 1. A collection of MC Images(55$^{\circ }$) from our dataset. From left to right are Blue Reflectance, Green Reflectance, Infrared Reflectance and combined-pseudocolor image.

Download Full Size | PDF

Screening and early detection of diabetic retinopathy to prevent vision loss can be beneficial to the individual patient. These properties of MC Images can be used as an option clinically to improve the detection rate of diabetic retinopathy. From the ophthalmologist’s point of view, one advantage of multicolor imaging is that a single device can perform multiple imaging modalities required. Patients can benefit from a one-stop-shop, which can be a time-saving and cost-effective approach. In addition to this, the use of small single wavelength laser spots in multicolor imaging is also friendly to patients with photophobia [1]. As MC Images provide structural information from different depths within the retina, it is commonly carried out by examining multiple slices of the clinicians’ diagnosis. However, there are some problems associated with the manual diagnosis of diabetic retinopathy, such as labor-intensive and costly, so there is a growing need for automated quantitative analysis of MC Images. Compared with traditional machine learning models, automatic diagnosis of DR by deep learning models can achieve extremely high accuracy. Therefore, deep learning methods can be used to analyze multicolor images containing diabetic retinopathy.

To the best of our knowledge, there are relatively few studies using deep learning to analyze MC images. Section 2 briefly describes the methods used for DR that use other types of images, such as optical coherence tomography [5] or digital fundus images [6], and these methods are not applicable to the analysis of MC Images. Many traditional machine learning approaches on DR are mostly based on hand-designed features which require an in-depth study of the various options and tedious parameter settings [7]. Most deep learning-based techniques for identification of DR lesions only convert the input natural image into retinal fundus image. However, unlike other fundus images, MC Images contain multimodal detection images. If a single modal is used for analysis, the diagnostic result is bound to be inaccurate. Those simple assumptions prevent these techniques from dealing with MC Images in the analysis of DR. In addition, there are redundant information representations between different modal images. Finding common representations among different modalities of MC Images that are relevant to prediction while reducing redundant information can improve the accuracy of classification.

To effectively considering the specific characteristics of MC Images, we proposed a multimodal information bottleneck network (MMIB-Net) (shown in Fig. 2) for analyzing DR in MC Images. Our approach achieves promising results due to several novel implications of our approach. First, we build a multimodal deep learning model to analyze MC Images for diagnosing DR, which can simultaneously extract multiple features for classification. Secondly, to the best of our knowledge, we first innovatively introduce information bottleneck theory into MC Images analysis, thus enabling the model to extract representation relevant to classification. Finally, we expand the information bottleneck theory into a multimodal model, thus enabling the proposed approach to remove irrelevant information from multiple modalities and learn joint representations across modalities.

Fig. 2. Multimodal information bottleneck network (MMIB-Net) framework diagram.

Download Full Size | PDF

The rest of this paper is divided into the following sections. Section 2 presents a review of related works. Section 3 includes the details of the proposed methodology. Section 4 shows all experimental results of the proposed method. In Section 5, we evaluate the effect of different parameters of the proposed method. In Section 6, we describe the conclusion of the paper.

2. Related work

2.1 Classification of fundus images to diagnose DR

The classification of fundus images to diagnose diabetic retinopathy has been widely studied in the field of medical image analysis. Table 1 collates the studies on DR classification in recent years. The main image modalities utilized in studies on DR classification are: color fundus image, fluorescein angiography, optical coherence tomography and Ultra-wide field SLO. The classification methods can be broadly classified into traditional machine learning-based techniques [8–11] and deep learning-based methods [12–14]. Math et al. [11] proposed an adaptive machine learning classification method for the detection of diabetic retinopathy. In the study performed by Y Kang et al. [15], an automated detection model of diabetic retinopathy based on the statistical method and Nave Bayesian classifier was proposed. In [16], the authors used fractional max-pooling to derive more discriminative features for the identification of DR. T. Li et al. [12] fine-tuned the GoogleNet [17], ResNet-18 [18], DenseNet-121 [19], VGG-16 [20], and SE-BN-Inception [21] to detect the DR. X. Li et al. [13] classified the public IDRiD dataset [22] into five DR stages by using the ResNet50 with four attention mechanisms. T.R. Gadekallu et al. [23] built a principal component analysis-based deep neural network model using Grey Wolf Optimization algorithm to determine whether an image was referable DR. The color fundus image allows rapid acquisition of fundus images with different fields of view. However, the images of this modality do not easily detect deep retinal microangiomas. X. Pan et al. [24] concluded that DenseNet is a suitable model for automatic detection and analysis of fundus fluorescein angiography (FFA) by comparing three convolutional neural networks. T. Nagasawa et al. [25] conducted training with the VGG-16 model using fluorescein angiography and demonstrated it has a high sensitivity of 94.7% and a high specificity of 97.2%, with an AUC of 0.969 in the detection of proliferative diabetic retinopathy. FFA is highly sensitive in the early diagnosis of DR and has a high confirmation rate. However, it is time-consuming and labor-intensive, and patients may have allergic reactions. Optical coherence tomography (OCT) [26] is also a medical imaging modality used to diagnose associated eye diseases such as vitreous traction or diabetic retinopathy. A. Tanboly et al. [27] developed a deep fusion classification network for the classification of normal and diabetic retinopathy using optical coherence tomography (OCT) images. D. Le et al. [28] achieved automatic classification of DR in OCTA images by performing transfer learning on VGG networks. OCT can show the thickness changes of the retinal nerve fiber layer, but it cannot determine the presence of microvascular tumors. F. Tang et al. [14,29] used deep learning methods to classify diabetic retinopathy on ultra-wide field scanning laser ophthalmoscopy (UWF-SLO) with excellent results. UWF-SLO is a valuable imaging tool for detecting fundus anomalies, but it does not provide multiple modal images like multi-color images.

Table 1. Research of DR with different modal images in recent years.

View Table | View all tables in this article

2.2 Information bottleneck theory

Information Bottleneck (IB) [30] is a technique based on information theory. It formalizes the intuition of information and provides a measure of the "relevance" of information, which is implemented by optimizing the constrained problem

(1)$$\mathop {\arg \max }_{t \in \Delta } I(y;t){\ }s.t.\, I(x;t) \le r,$$

where x is an input variable, y is the output variable, r is a level of compression as the constraint. IB attempts to find the "bottleneck" variable t which is a representation of x by maximizing the mutual information I (y; t), while minimizing mutual information I (x; t)between x and t. $\Delta$ is the set of random variables t that obey the Markov condition y - x - t [31]. This Markov condition guarantees that any information in t about y is extracted from x and t is conditionally independent of y for a given x. [32]. In practice, it is hard to obtain the desired compression level r in Eq. (1) [33]. The typical way for solving Equation (1) is to transform it into the so-called IB Lagrangian [30,34,35]

(2)$${L_{IB}}(r) = I(y;r) - \beta I(x;r),$$

where $\beta$ is the Lagrange multiplier which controls the tradeoff between compression and prediction and is typically set in [0,1] [33].

According to the information bottleneck theory, the network squeezes information out of a bottleneck, removes noise input data with irrelevant details, and only retains the features most related to the general concept [36]. In [36], Tishby reveals the relationship between information bottleneck and deep learning. Since then, information bottleneck has become a popular research topic in recent years. The IB approach has found extraordinary applications in unsupervised [37–41], supervised [42–45] and self-supervised [46–50] learning problems. Because there is no exact solution when learning potential representation with deep neural networks, there are also many kinds of research [51–53] on how to accurately solve mutual information. In addition, IB has a commonly used in domains such as attribution [54,55], classification [43,50,56], speech decomposition [41], and image recognition [57]. Similar to our study, IB in multi-view studies [40,53,58–61] aims to find the MI between different views. [61] proposed a supervised multi-view information bottleneck method to remove superfluous information and learned an accurate joint representation from multiple views. In [40], the authors extended the information bottleneck to the unsupervised multi-view setting to learn a joint latent representation.

3. Our approach

We believe that a successful technique for achieving accurate DR classification of MC Images depends on several key factors: the ability of the neural network to extract key features and the capability to integrate multiple feature representations. To target these, we developed a multimodal information bottleneck network (MMIB-Net) for the classification of MC Images. It adds the loss function of IB to the network of classification. Besides, it uses multimodal information bottleneck theory to extract the feature representations related to classification in different modal images. Moreover, it integrates the feature representations of different modalities of MC Images to improve classification accuracy.

3.1 Structure of MMIB-Net

As described above, discrimination performance can be improved by using multimodal images. Inspired by this motivation, we propose a novel network to classify MC Images. As shown in Fig. 2, the model employs four modalities: blue reflectance, green reflectance, infrared reflectance, and combined-pseudocolor image. The backbone network can be one of the well-known deep learning architecture. The feature map of each modality is used as an input to the information bottleneck module and fed into the multilayer perceptron (MLP) to obtain a representation. These image representations are subsequently used to compose the mean and variance of the Gaussian distribution when computing mutual information. At the same time, the representation of each modality is concatenated together and fed to the decoder to obtain the joint representation of all modalities. The feature representations of each modality and their joint feature representations are subsequently used to compute the mutual information with the labels and inputs to obtain the loss function of the information bottleneck principle. Meanwhile the image features of the four modalities obtained by the backbone network are fed into the fully connected layer separately. The four feature vectors output from the fully connected layer are concatenated together and then input to the Sigmoid activation layer to get the classification results and obtain the loss function for classification. Using the information bottleneck theory, the model can identify redundant information in each modality and thus find the feature representation associated with the label. In addition, the proposed MMIB-Net fuses the representations of multiple modalities and learns the potential representations between different modalities. With the multimodal feature extraction module of the model, we can obtain features of different modalities at the same time and improve the performance of the model by combining these features.

3.2 Problem formulation

Given a set of MC Images{(x$_{ij}$,y$_{i}$), $i$ = 1,2,…, $N$;$j$ = 1,2,…, $M$}, where x$_{i}$ denotes the ith set of input MC Image. The ${j}$ denotes the number of modalities of MC images. y$_{i}$ denotes the ground truth label assigned to ith input set. $N$ is the number of our case. $M$ is the number of the modalities, and in our method $M$ = 4. Our goal is to classify them to determine if they are DR. In the field of computer vision, there are many networks for natural image classification. Using the classification network we can obtain a predictive classification label $\hat {y}$ for the MC Images. To better classify MC Images, we added the information bottleneck loss function on top of that. The overall loss function of our method can be written in the following form.

(3)$$L\textrm{oss} = los{s_1}(y,\hat y) + los{s_2}(IB).$$

$Los{s_1}$ is for the simple classification network which is defined as

(4)$$los{s_1} = \frac{1}{N}\sum_{i = 1}^N {H({y_i},{{\hat y}_i})},$$

where H() is the cross-entropy function for measuring the dissimilarity between $\hat {y}_{i}$ and y$_{i}$.

According to IB theory, $los{s_2}$ (IB) should like be the Eq. (2). However, in MC Images classification, we have four modal images, so we should consider extending Eq. (2) to a multimodal function. These modalities are learned jointly by the information bottleneck as

(5)$$los{s_2}(IB) = \mathop {\max }_{R,{R_1},{R_2},\ldots{R_m}} I(Y;R) - \sum_\textrm{j}^m {{\lambda _j}I({X_j};{R_j})} \textrm{ }s.t.\textrm{ }R = {f_\theta }({R_1},{R_2},\ldots{R_m}),$$

where ${\lambda _j}$ is the regularization parameter. ${X_j}$ is the input image. ${R_j}$ denote the latent representation of ${X_j}$. $m$ is the number of the modality. ${f_\theta }$ is the network with parameter ${\theta }$. $R$ is the learned joint representation from all modalities of MC Images. This loss consists of three terms. The first term is to learn joint potential representations in all modalities by maximizing the mutual information between the joint representation and the label ${Y}$. Each ${X_j}$ has a latent representation ${R_j}$ corresponding to it. The second term minimizes the mutual information between the latent representation of each modality and the original input. With the first two terms, we can obtain the most relevant representation of the label ${Y}$ in the input ${X}$. The last term is where we learn the joint representation $R$ of each modality using a neural network ${f}$ with parameter ${\theta }$.

3.3 Optimization

The key to solving Eq. (5) is the calculation of the mutual information. The mutual information between any two random variables $A$ and $B$ is defined as:

(6)$$I(A;B) = \mathbb{E}{_{p(a,b)}}\left[ {\log \frac{{p(a|b)}}{{p(a)}}} \right] = \int {dbda} p(a,b)\log \frac{{p(a|b)}}{{p(a)}},$$

where $p(a)$ is the marginal probability density function of $A$. $p(a,b)$ is the joint probability density function of $a$ and $b$. $p(a|b)$ is the conditional probability distribution of $a$ given $b$ [52]. In practice, the estimation of mutual information is computationally intractable as the underlying distributions of data are not available in high dimension [52,62]. In order to establish tractable objectives, variational methods [45,51,52,57,63] can be used to solve such problems by maximizing the lower bound on the variance of the objective function. Specifically, these methods use some known distributions to approximate the intractable distributions and to obtain a lower bound on the original objective function [61]. In our experiments, we performed a z-Test on our null hypothesis that the extracted deep features by our network come from a normal distribution and the z-Test function returns 0 which means it obtain a goodness of fit related to Gaussian distribution. By restricting the lower bound, we can obtain the approximate solution of the mutual information.

To simplify the analysis, we explain Eq. (5) with an example of images of two modalities. After that, we will expand it to multiple modalities in general. Given two modalities of MC Images ${X_1}$, ${X_2}$ and the class labels Y, Eq. (5) would then be simplified to the following form.

(7)$$\mathop{\max }\limits_{R,{R_1},{R_2}} I(Y;R) - {\lambda _1}I({X_1};{R_1}) - {\lambda _2}I({X_2};{R_2}), s.t.R = {f_\theta }({R_1},{R_2}).$$

As for the mutual information $I(Y;R)$, we can use the following equation to solve.

(8)$$I(Y;R) = \int {dydrp(y,r)} \log \frac{{p(y|r)}}{{p(y)}}.$$

Since ${p(y|r)}$ is intractable, let a distribution ${q(y|r)}$ be a approximation to it. The Kullback Leibler divergence (KL-divergence) between ${p(y|r)}$ and ${q(y|r)}$ is always positive [45]. we have

(9)$$KL\left[ {p(y|r),q(y|r)} \right] \ge 0 \Rightarrow \int {dydrp(y,r)\log (p(y|r))} \ge \int {dydrp(y,r)\log (q(y|r))},$$

and hence

(10)$$I(Y;R) \ge \int {dydrp(y,r)} \log \frac{{q(y|r)}}{{p(y)}} = \int {dydrp(y,r)} \log q(y|r) - \int {dyp(y)\log p(y)} .$$

Notice that the second term in Eq. (10) is the entropy of label $H(Y)$ which is independent of the optimization. So we can drop it in the procedure. The variation lower bound of $I(Y;R)$ is

(11)$$I(Y;R) \ge \int {dydrp(y,r)} \log q(y|r).$$

Focusing on the $p(y,r)$, we can rewrite it as

(12)$$p(y,r) \textrm{ = }\int {d{x_1}d{x_2}d{r_1}d{r_2}} p({x_1},{x_2},{r_1},{r_2},y,r).$$

Therefore, Eq. (11) can be rewritten as

(13)$$I(Y;R) \ge \int {dydrd{x_1}d{x_2}d{r_1}d{r_2}} p({x_1},{x_2},{r_1},{r_2},y,r)\log q(y|r).$$

In order to solve Eq. (13), we need to find the joint probability density function of all variables to obtain its variational lower bound. Leveraging Markov assumption $p({x_1},{x_2},{r_1},{r_2},y,r)$ can be represented as

(14)$$p({x_1},{x_2},{r_1},{r_2},y,r) = p(r|{r_1},{r_2},{x_1},{x_2},y)p({r_1}|{r_2},{x_1},{x_2},y)p({r_2}|{x_1},{x_2},y)p({x_1},{x_2},y).$$

$x_1$ and $x_2$ are two modalities of MC Images. $r_1$ and $r_2$ are learned from them respectively. Therefore, we assume given $x_1$, $r_1$ is independent of $x_2$, $r_2$ and $y$. Accordingly, we also assume that given $x_2$, $r_2$ is independent of $x_1$, $r_1$ and $y$. In addition, we assume that $r_1$, $r_2$, $r$ is independent of $x_1$, $x_2$ and $y$. The Markov chain between these variables is shown in Fig. 3. Substituting Eq. (14) into Eq. (13) while applying our assumptions, we can obtain a new lower bound of the mutual information between $Y$ and $R$ :

(15)$$I(Y;R) \ge \int {d{x_1}d{x_2}dy} p({x_1},{x_2},y)\int {drd{r_1}d{r_2}p(r|{r_1},{r_2})} p({r_1}|{x_1})p({r_2}|{x_2})\log q(y|r).$$

Fig. 3. The Markov Chain assumed in our eqution.

Download Full Size | PDF

For the second term of Eq. (7), the mutual information between $X_1$ and $R_1$ is

(16)$$I({X_1};{R_1}) = \int {d{r_1}d{x_1}p({x_1},{r_1})} \log \frac{{p({r_1}|{x_1})}}{{p({r_1})}}.$$

Since it is difficult to compute $p(r_1)$, so let $t_1(r_1)$ to approximate it. Because of the property of the KL-divergence $KL[p(r_1),t_1(r_1)]$ $\ge$ 0 $\Rightarrow$ $\int {dr} p({r_1})\log p({r_1}) \ge \int {drp({r_1})\log } {t_1}({r_1})$ , we have the following upper bound of $I({X_1};{R_1})$:

(17)$$I({X_1};{R_1}) \le \int {d{r_1}d{x_1}p({x_1},{r_1})} \log \frac{{p({r_1}|{x_1})}}{{{t_1}({r_1})}} = \int {d{x_1}d{x_2}dy} d{r_1}p({x_1},{x_2},{r_1},y)\log \frac{{p({r_1}|{x_1})}}{{{t_1}({r_1})}}.$$

Using our assumption that given $X_1$, $R_1$ is independent of $X_2$, $R_2$, $Y$, we have

(18)$$I({X_1};{R_1}) \le \int {d{x_1}d{x_2}dy} p({x_1},{x_2},y)\int {d{r_1}p({r_1}|{x_1})\log } \frac{{p({r_1}|{x_1})}}{{{t_1}({r_1})}}.$$

Using a similar derivation procedure for $I({X_2};{R_2})$, we can obtain the mutual information between $X_2$ and $R_2$:

(19)$$I({X_2};{R_2}) \le \int {d{x_1}d{x_2}dy} p({x_1},{x_2},y)\int {d{r_2}p({r_2}|{x_2})\log } \frac{{p({r_2}|{x_2})}}{{{t_2}({r_2})}}.$$

Combining these bounds in Eq. (15), Eq. (18) and Eq. (19), we have the variational lower bound:

(20)$$\begin{aligned} &I(Y;R) - {\lambda _1}I({X_1};{R_1}) - {\lambda _2}I({X_2};{R_2}) \ge \int {d{x_1}d{x_2}dy} p({x_1},{x_2},y)\\ &\{ \int {drd{r_1}d{r_2}p(r|{r_1},{r_2})} p({r_1}|{x_1})p({r_2}|{x_2})\log q(y|r)\\ &- {\lambda _1}\int {d{r_1}p({r_1}|{x_1})\log } \frac{{p({r_1}|{x_1})}}{{{t_1}({r_1})}} - {\lambda _2}\int {d{r_2}p({r_2}|{x_2})\log } \frac{{p({r_2}|{x_2})}}{{{t_2}({r_2})}}\}. \end{aligned}$$

Next, we derive how to calculate Eq. (20) in practice. We can use the empirical data distribution to approximate the integral over $X_1$, $X_2$ and $y$ (Monte Carlo sampling [64]), and Eq. (20) can be written as

(21)$$\begin{aligned} &I(Y;R)-{\lambda _1}I({X_1};{R_1})-{\lambda _2}I({X_2};{R_2}) \ge \\ &\frac{1}{N}\mathop{\sum}\limits_{i}^{N} \{\int {drd{r_1}d{r_2}p(r|{r_1},{r_2})} p({r_1}|{x_1})p({r_2}|{x_2})\log q(y|r) \\ &- {\lambda _1}\int {d{r_1}p({r_1}|{x_1})\log } \frac{{p({r_1}|{x_1})}}{{{t_1}({r_1})}} - {\lambda _2}\int {d{r_2}p({r_2}|{x_2})\log } \frac{{p({r_2}|{x_2})}}{{{t_2}({r_2})}}\}. \end{aligned}$$

We assume that $p({r_1}|{x_1}) = \textrm {N}({\mu _1}({x_1};{\varphi _1}),{\delta _1}({x_1};{\varphi _1}))$, $p({r_2}|{x_2}) = \textrm {N}({\mu _2}({x_2};{\varphi _2}),{\delta _2}({x_2};{\varphi _2}))$ and $p(r|{r_1},{r_2}) = \textrm {N}(\mu ({r_1},{r_2};\varphi ),\delta ({r_1},{r_2};\varphi ))$ are Gaussian distributions, and their means(${\mu _1}$, ${\mu _2}$, ${\mu }$) and variances(${\delta _1}$, ${\delta _2}$, ${\delta }$) are learned by the MLPs. ${\varphi _1}$, ${\varphi _2}$, ${\varphi }$ are the parameters of MLPs, respectively. The outputs of MLP are K-dimensional mean ${\mu }$ and $K \times K$ covariance matrix ${\delta }$. We transform ${r_1}$, ${r_2}$, ${r}$ into deterministic functions using the reparameterization trick [65], i.e., ${r_1} = \mu ({x_1};{\varphi _1}) + \delta ({x_1};{\varphi _1}){\varepsilon _1}$, ${r_2} = \mu ({x_2};{\varphi _2}) + \delta ({x_2};{\varphi _2}){\varepsilon _2}$, $r = \mu ({r_1},{r_2};\varphi ) + \delta ({r_1},{r_2};\varphi )\varepsilon$. where ${\varepsilon _1}$, ${\varepsilon _2}$, ${\varepsilon }$ $\sim$ $\textrm {N}$ $(0,I)$ are Gaussian random variables. Therefore the Eq. (7) is transformed into the following form.

(22)$$\max \frac{1}{N}\sum_{}^N {(\mathbb{E}{_\varepsilon }\mathbb{E}{_{{\varepsilon _1}}}\mathbb{E}{_{{\varepsilon _2}}}\log q(y|r) - {\lambda _1}\mathbb{E}{_{{\varepsilon _1}}}\log \frac{{p({r_1}|{x_1})}}{{{t_1}({r_1})}} - {\lambda _2}\mathbb{E}{_{{\varepsilon _2}}}\log \frac{{p({r_2}|{x_2})}}{{{t_2}({r_2})}})},$$

where $\mathbb {E}$ is the expected value. Next, we add the corresponding information constraint terms to generalize Eq. (22) to settings with more than two modalities.

(23)$$\max \frac{1}{N}\sum_{}^N {(\mathbb{E}{_\varepsilon }{\mathbb{E}_{{\varepsilon _1}}}{\mathbb{E}_{{\varepsilon _2}}}\cdots{\mathbb{E}_{{\varepsilon _m}}}\log q(y|r) - {\lambda _j}{\mathbb{E}_{{\varepsilon _j}}}\log \frac{{p({r_j}|{x_j})}}{{{t_j}({r_j})}})},$$

where ${\varepsilon }$ , ${\varepsilon _1}$, ${\varepsilon _2}$ $\cdots$ ${\varepsilon _m}$ $\sim$ $\textrm {N}$ $(0,I)$. ${t_j}({r_j})$ are assumed as ${t_j}({r_j})$ $\sim$ $\textrm {N}$ $(0,I)$. $p({r_j}|{x_j})$ are assumed as Gaussian with means ${\mu }$ and variances ${\delta }$ which are learned from the MLPs. Finally, the total loss can be writen as:

(24)$$L\textrm{oss} = \frac{1}{N}\sum_{i = 1}^N {H({{\hat y}_i},{y_i})} + \max \frac{1}{N}\sum_{}^N {(\mathbb{E}{_\varepsilon }{\mathbb{E}_{{\varepsilon _1}}}{\mathbb{E}_{{\varepsilon _2}}}\cdots{\mathbb{E}_{{\varepsilon _m}}}\log q(y|r) - {\lambda _j}{\mathbb{E}_{{\varepsilon _j}}}\log \frac{{p({r_j}|{x_j})}}{{{t_j}({r_j})}})}.$$

4. Experiment

4.1 Data and evaluation

We report the performance of the proposed model on the in-house dataset. We retrospectively collected the MC examinations performed from 24 patients with a total of 240 images (some cases with missing images). Classification models always perform well when the dataset is large enough. The number of raw images is far from adequate for our needs. Data augmentation can be used to increase the number of images in the dataset, thus preventing overfitting when training on little data. With the image enhancement method, the number of images has increased from 240 to 11576. In our experiments, we use 70% (8103 images) of the data set as the training set, 20% (2315 images) as the validation set, and 10% (1158 images) as the test set. All experimental results are obtained on the test set. These images are 24-bit depth, provided in the JPG format. The typical size of the image is 768$\times$868. The ophthalmologist manually labeled them as DR or not. These labels are treated as the ground truth for validating our algorithm. In order to quantitatively measure the performance of the proposed DR diagnosis method, the assessment of accuracy, precision, recall, F1_score, specificity and AUC are used in this study.

4.2 Implementation details

The optimization process is run for 50 epochs. We conducted a set of experiments to empirically select the optimal value of the parameters in Eq. (24). In our method, the structure of MLP is M-1024-1024-2K. M is the size of the output features of the backbone network, which depends on the size of the network. The size of the bottleneck is parameter K. The first K is used to compose the ${\mu }$ and the latter K is used to determine the variance ${\delta }$. We set the parameter K to 144. The hyperparameter ${\lambda _1}$ = ${\lambda _2}$ = ${\lambda _3}$ = 0.3, ${\lambda _4}$ = 0.05. The selection of parameters K and ${\lambda _j}$ will be verified in the subsequent experiments. We altered the open-source in [61] to meet our four modalities needs. In our experimental work, Pytorch is used as the programming language. The software environment is Ubuntu 16.04 and the hardware environment is CPU E5-2630 and NVIDIA GV100GL Tesla V100 32GB GPU. We set the learning rate as 0.01, and the batch size is set as 16. All methods are optimized by Adam optimizer [66]. Figure 4 shows the loss and accuracy curves during the training and validation process. As the training proceeds, the loss on the training and validation sets gradually decreases and then level off. Meanwhile, the corresponding accuracy rate gradually increases and then reaches a maximum. This shows that our method has obtained favourable results in the training process.

Fig. 4. Loss and accuracy curves for networks with different modalities during training and validation. The blue box, green box, yellow box and colored box represent the results when experiments are performed using BR images, GR images, IR images and multimodal images, respectively.

Download Full Size | PDF

4.3 Evaluation

We first use a deep learning model (ResNet50 [67]) to classify a single-modality MC Images to determine whether the image is normal or DR. Each case includes images of multiple modalities, and we start by using images of only one modality as input to the network to evaluate its classification performance. If the results of each modal image indicate that the case is DR, then the final diagnosis of the patient is likely to be positive. As mentioned before, the pseudo-color image is synthesized from the other three images. To investigate whether the combined-pseudocolor image can improve the performance of the model, we also designed a model in which only three images, BR, GR, and IR, are used as inputs and tested its performance. As depicted in Table 2, the accuracy of the model is mostly above 91% when only one modality is occupied to classify MC Images. The model using combined-pseudocolor images alone (ResNet50-Pseudocolor) performs slightly better than the other unimodal models. When we use three images as input, the model (ResNet50-Three) performs better than any of the unimodal states. The performance of the model (ResNet50-Multimodal) in reaching the optimum when we extract these four image features simultaneously. It indicates that the images of each modality provide useful information for model classification.

Table 2. Performance of classification using different unimodal MC Images. The first four rows are the results obtained with a single modal image as model input. ResNet50-Three is the network using both BR,GR and IR images as input. ResNet50-Multimodal is the network that takes four modalities of MC images as input.

View Table | View all tables in this article

To understand the regions that the network focuses on when classifying, we visualized the discriminative regions using Grad-Class Activation Mapping (Grad-CAM) [68]. The activation maps allow us to understand which parts of the image contribute more to the final output of the model. With the Grad-CAM tool, we can see which part of the image activates the classification the most. The brighter the part of the graph, the higher the level of network attention. Figure 5 presents a plot of the results of the Grad-CAM for a set of MC Images. We can also see from the figure that the network focuses on different areas of the image for different modalities, so the information provided by different modalities is not the exact same.

Fig. 5. Visualization of class activation mapping of MC images. The first row shows the images of the different modalities of the input. The second row is the corresponding class activation mapping.

Download Full Size | PDF

It is a bit complicated to train a model for each modal image to perform diagnosis, so we extract features for multiple modal images simultaneously, and it is possible to improve classification accuracy by considering the features of multiple modalities simultaneously. To verify our conjecture, we adapt the ResNet to fit our multimodal image input and evaluate its performance on the test set. We first use images of a single modality as the input to the network and then test its performance. The results are in the first four rows of Table 2. Since the combined-pseudocolor images are synthesized from the other three images, in order to test whether we can get better results using only these three images, we only use BR, GR, and IR images as the input of the model to test its performance. Finally, we train the images of the four modalities simultaneously as the input of the model and record the results. As shown in Table 2, with the exception of the precision metric, all five metrics receive a performance improvement of approximately 2%. From the Table 2 we can conclude that the multimodal based network is useful and it performs much better than the single modality because multimodal networks can learn more features.

To demonstrate the generality of multimodal networks, other attempts also involved modeling the dataset using deep learning models such as SE-ResNet [69], ResNeXt [70], WRN [71] and Res2Net [72]. The results of the other multimodal networks for classification are shown in Table 3. As can be seen from the Table 3, satisfactory results are also obtained for the relevant evaluation parameters, which point to the effectiveness of our proposed multimodal scenario for MC Images classification.

Table 3. Classification results of different multimodal networks for the diagnosis of DR

View Table | View all tables in this article

In order to remove redundant representations between different modalities while integrating the joint representation, we added the IB technique to the above network. The classification results of the network model with the addition of IB for the DR diagnosis of MC Images as demonstrated in Table 4. From the Table 4, we find that the ResNet50-based network has the best performance in four metrics, with the classification accuracy, precision, F1_score, and AUC of 96.994%, 100.000%, 0.968, 0.972, respectively. In the other hand, from the comparison of Table 3 and Table 4, the results are clearly consistent, with all IB-based networks outperforming the multimodal-based classification networks alone in general. The method that uses the multimodal information bottleneck model improves the accuracy by an average of 2.2% compared to the network that uses multimodality alone. The results show that the MMIB-Net obtain better performance output and achieve promising results.

Table 4. The performance of network model with IB added for the diagnosis of DR

View Table | View all tables in this article

From the above results, we can see that our algorithm achieves a promising performance. This leads us to verify several of the previously mentioned statements. First, our algorithm which is designed specifically for classification of MC Images outperforms generic image classification approaches, possibly because of the integration of different modalities of images. Second, we discard irrelevant and noisy information and extract diagnostically relevant representations by IB theory, a strategy that is better than traditional deep learning-based classification methods.

5. Discussion

This work presents a multimodal deep learning model for DR diagnosis using MC imaging data with performance improvement exploiting information bottleneck theory. The final numerical outcomes indicated the outperformance of the multimodal and IB-based deep learning algorithms over the single-modal classifier. In this section we design several experiments to evaluate the effect of two important parameters of this method, K and ${\lambda }$, on the experimental results.

The choice of $K$ value is related to the amount of sampling. If $K$ is too small, there is very little information to get past the bottleneck. If $K$ is too large, it does not serve the purpose of compressing the information. We designed a set of experiments to determine the optimal value of $K$. Figure 6 shows that when the value of $K$ is too small, the accuracy rate is not very satisfactory. As the $K$ value increases, the bottleneck slowly becomes larger. More information can be passed, and the accuracy rate gradually increases. However, when $K$ increases to a certain degree ($K$=144), increasing the number of samples will inevitably introduce redundant information and lead to a decrease in accuracy.

Fig. 6. Classification accuracy of the model at different $K$ values. The solid line is the accuracy of the model classification corresponding to different $K$ values. The dashed line is the trend of its accuracy for different $K$ values.

Download Full Size | PDF

We go a step further to evaluate the effect of parameter ${\lambda }$ on the experimental results. For each modal image, we consider its contribution to the final result to be the same, so for each modal parameter ${\lambda _j}$($j$=1,2,3,4), we set the same value. We use resnet18 as the base network architecture to extract image features, and keep changing the value of ${\lambda }$ to explore its effect on the experimental results. Figure 7 provides a visualization of the variation in model performance. The accuracy tends to increase as the parameter value increases and reaches a maximum of ${\lambda }$ = 0.3, and then starts to decrease. At the beginning, as ${\lambda }$ increases, the input redundant information gradually becomes less and the accuracy is increased. When ${\lambda }$ exceeds 0.35, continuing to increase the value of ${\lambda }$ causes less useful information to pass through, which causes the accuracy rate to begin to decrease.

Fig. 7. Classification accuracy of the model at different ${\lambda }$ values. The solid line is the accuracy of the model classification corresponding to different ${\lambda }$ values. The dashed line is the trend of its accuracy for different ${\lambda }$ values.

Download Full Size | PDF

To better understand the impact of each channel, we tested the impact of ${\lambda }$ values separately. To achieve this, we alter one of the values separately, while fixing the rest of the values equal to 0.3, and the experimental results are shown in Fig. 8. As can be seen from the graph, the experimental results fluctuate more when the values of ${\lambda _2}$ and ${\lambda _3}$ are changed. It can be guessed that ${\lambda _2}$ and ${\lambda _3}$ have more influence on the final experiment. While the fluctuations of ${\lambda _1}$ and ${\lambda _4}$ are relatively smooth. The accuracy of the model reaches a maximum of 96.994% when ${\lambda _4}$ is taken as 0.05 and the other values are taken as 0.3.

Fig. 8. The effect of different ${\lambda }$ values on the experiment. ${\lambda _1}$ means that we fix the other three parameters and change the value of ${\lambda _1}$. The other three representations of ${\lambda }$ are the same.

Download Full Size | PDF

6. Conclusion and future work

In this paper, we provide a solution to an important problem of MC Images classification. Our three significant technical contributions are: First, We build a model that can simultaneously take into account the joint representation of different modalities and can extract cross-modal image features when performing DR classification. Second, We novelly introduce the information bottleneck theory into the classical deep learning framework for MC Images classification. The information bottleneck (IB) can discard irrelevant information of classification and extract image representation associate with DR, which is validated to help produce better accuracy due to obtaining the relative feature of classification. Finally, to learn the relationship between modalities, we use multimodal information bottleneck to take the joint representation of different modalities into account when performing DR classification. Our future work will not only include further validations of the method using more MC Images, but also finer classification of DR diseases using our algorithm. In addition, we are interested in investigating the detection of other ocular diseases in MC Images.

Funding

Taishan Scholar Project of Shandong Province (TSHW201502038); Natural Science Foundation of Shandong Province (ZR2018ZB0419); National Natural Science Foundation of China (61773246, 81871508).

Disclosures

The authors declare no conflicts of interest.

Data availability

Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request.

References

1. A. Giridhar and P. Ranjith, “Clinical applications of multicolor imaging technology in epiretinal membrane,” Kerala J. Ophthalmol. 30(2), 72 (2018). [CrossRef]

2. Z. Zhang, M. Li, Y. Sun, Y. Wei, and S. Zhang, “Multicolor scanning laser ophthalmoscopy strengthens surgeons?’ preoperative decision-making and intraoperative performance on epiretinal membrane,” Trans. Vis. Sci. Tech. 9(13), 36 (2020). [CrossRef]

3. A. C. Tan, M. Fleckenstein, S. Schmitz-Valckenberg, and F. G. Holz, “Clinical application of multicolor imaging technology,” Ophthalmologica 236(1), 8–18 (2016). [CrossRef]

4. V. Govindahari, S. Fraser-Bell, A. Ayachit, A. Invernizzi, U. Nair, D. Nair, M. Lupidi, S. Randhir Singh, A. Rajendran, D. Zur, R. Gallego-Pinazo, R. Dolz-Marco, C. Cagini, M. Cozzi, and J. Chhablani, “Multicolor imaging in macular telangiectasia—a comparison with fundus autofluorescence,” Graefe’s Arch. Clin. Exp. Ophthalmol. 258(11), 2379–2387 (2020). [CrossRef]

5. J. Wang, T. Hormel, Q. S. You, Y. Guo, X. Wang, L. Chen, T. Hwang, and Y. Jia, “Robust non-perfusion area detection in three retinal plexuses using convolutional neural network in oct angiography,” Biomed. Opt. Express 11(1), 330 (2020). [CrossRef]

6. R. Sayres, A. Taly, E. Rahimy, K. Blumer, D. Coz, N. Hammel, J. Krause, A. Narayanaswamy, Z. Rastegar, D. Wu, S. Xu, S. Barb, A. Joseph, M. Shumski, J. Smith, A. Sood, G. Corrado, L. Peng, and D. Webster, “Using a deep learning algorithm and integrated gradients explanation to assist grading for diabetic retinopathy,” Ophthalmology 126(4), 552–564 (2019). [CrossRef]

7. N. Asiri, M. Hussain, F. Al Adel, and N. Alzaidi, “Deep learning based computer-aided diagnosis systems for diabetic retinopathy: A survey,” Artif. Intelligence Medicine 99, 101701 (2019). [CrossRef]

8. D. K. Prasad, L. Vibha, and K. Venugopal, “Early detection of diabetic retinopathy from digital retinal fundus images,” in 2015 IEEE Recent Advances in Intelligent Computational Systems (RAICS), (IEEE, 2015), pp. 240–245.

9. V. Raman, P. Then, and P. Sumari, “Proposed retinal abnormality detection and classification approach: Computer aided detection for diabetic retinopathy by machine learning approaches,” in 2016 8th IEEE International Conference on Communication Software and Networks (ICCSN), (IEEE, 2016), pp. 636–641.

10. M. U. Akram, S. Khalid, A. Tariq, S. A. Khan, and F. Azam, “Detection and classification of retinal lesions for grading of diabetic retinopathy,” Comput. Biol. Med. 45, 161–171 (2014). [CrossRef]

11. L. Math and R. Fatima, “Adaptive machine learning classification for diabetic retinopathy,” Multimed. Tools Appl. 80(4), 5173–5186 (2021). [CrossRef]

12. T. Li, Y. Gao, K. Wang, S. Guo, H. Liu, and H. Kang, “Diagnostic assessment of deep learning algorithms for diabetic retinopathy screening,” Inf. Sci. 501, 511–522 (2019). [CrossRef]

13. X. Li, X. Hu, L. Yu, L. Zhu, C. Fu, and P. Heng, “Canet: cross-disease attention network for joint diabetic retinopathy and diabetic macular edema grading,” IEEE Trans. Med. Imaging 39(5), 1483–1493 (2020). [CrossRef]

14. F. Tang, P. Luenam, A. R. Ran, A. A. Quadeer, R. Raman, P. Sen, R. Khan, A. Giridhar, S. Haridas, M. Iglicki, D. Zur, A. Loewenstein, H. P. Negri, S. Szeto, B. K. Y. Lam, C. C. Tham, S. Sivaprasad, M. Mckay, and C. Y. Cheung, “Detection of diabetic retinopathy from ultra-widefield scanning laser ophthalmoscope images: A multicenter deep learning analysis,” Ophthalmology Retina (2021).

15. Y. Kang, Y. Fang, and X. Lai, “Automatic detection of diabetic retinopathy with statistical method and bayesian classifier,” J. Med. Imaging Heal. Inf. 10(5), 1225–1233 (2020). [CrossRef]

16. Y. Li, N. Yeh, S. Chen, and Y. Chung, “Computer-assisted diagnosis for diabetic retinopathy based on fundus images using deep convolutional neural network,” Mob. Inf. Syst. 2019, 1–10 (2019). [CrossRef]

17. C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proceedings of the IEEE conference on computer vision and pattern recognition, (2015), pp. 1–9.

18. K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, (2016), pp. 770–778.

19. G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, (2017), pp. 4700–4708.

20. K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556 (2014).

21. J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, (2018), pp. 7132–7141.

22. P. Porwal, S. Pachade, R. Kamble, M. Kokare, G. Deshmukh, V. Sahasrabuddhe, and F. Meriaudeau, “Indian diabetic retinopathy image dataset (idrid): a database for diabetic retinopathy screening research,” Data 3(3), 25 (2018). [CrossRef]

23. T. R. Gadekallu, N. Khare, S. Bhattacharya, S. Singh, P. K. R. Maddikunta, and G. Srivastava, “Deep neural networks to predict diabetic retinopathy,” Journal Of Ambient Intelligence and Humanized Computing pp. 1–14 (2020).

24. X. Pan, K. Jin, J. Cao, Z. Liu, J. Wu, K. You, Y. Lu, Y. Xu, Z. Su, J. Jiang, K. Yao, and J. Ye, “Multi-label classification of retinal lesions in diabetic retinopathy for automatic analysis of fundus fluorescein angiography based on deep learning,” Graefe’s Arch. Clin. Exp. Ophthalmol. 258(4), 779–785 (2020). [CrossRef]

25. T. Nagasawa, H. Tabuchi, H. Masumoto, H. Enno, M. Niki, Z. Ohara, Y. Yoshizumi, H. Ohsugi, and Y. Mitamura, “Accuracy of ultrawide-field fundus ophthalmoscopy-assisted deep learning for detecting treatment-naïve proliferative diabetic retinopathy,” Int. Ophthalmol. 39(10), 2153–2159 (2019). [CrossRef]

26. S. Baamonde, J. de Moura, J. Novo, P. Charló, and M. Ortega, “Automatic identification and characterization of the epiretinal membrane in o“detection of diabetic retinopathy from ultra-widefield scanning laser ophthalmoscope images: Act images,” Biomed. Opt. Express 10(8), 4018 (2019). [CrossRef]

27. A. ElTanboly, M. Ghazal, A. Khalil, A. Shalaby, A. Mahmoud, A. Switala, M. El-Azab, S. Schaal, and A. El-Baz, “An integrated framework for automatic clinical assessment of diabetic retinopathy grade using spectral domain oct images,” in 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018), (IEEE, 2018), pp. 1431–1435.

28. D. Le, M. Alam, C. K. Yao, J. I. Lim, Y.-T. Hsieh, R. V. Chan, D. Toslak, and X. Yao, “Transfer learning for automated octa detection of diabetic retinopathy,” Trans. Vis. Sci. Tech. 9(2), 35 (2020). [CrossRef]

29. F. Tang, P. Luenam, A. A. Quadeer, A. Ran, S. Sivaprasad, P. Sen, R. Raman, G. Anantharaman, M. Gopalakrishnan, S. Haridas, M. R. Mckay, and C. Y. lui Cheung, “Detection of referable and vision-threatening diabetic retinopathy using deep learning on ultra-wide field scanning laser ophthalmoscope images,” Investigative Ophthalmology & Visual Science 61, 1646 (2020).

30. N. Tishby, F. C. Pereira, and W. Bialek, “The information bottleneck method,” University of Illinois 411, 368–377 (2000).

31. H. Witsenhausen and A. Wyner, “A conditional entropy bound for a pair of discrete random variables,” IEEE Trans. Inf. Theory 21(5), 493–501 (1975). [CrossRef]

32. A. Kolchinsky, B. D. Tracey, and D. H. Wolpert, “Nonlinear information bottleneck,” Entropy 21(12), 1181 (2019). [CrossRef]

33. Z. Pan, L. Niu, J. Zhang, and L. Zhang, “Disentangled information bottleneck,” arXiv preprint arXiv:2012.07372 (2020).

34. B. Rodríguez Gálvez, R. Thobaben, and M. Skoglund, “The convex information bottleneck lagrangian,” Entropy 22(1), 98 (2020). [CrossRef]

35. R. Gilad-Bachrach, A. Navot, and N. Tishby, “An information theoretic tradeoff between complexity and accuracy,” in Learning Theory and Kernel Machines, (Springer, 2003), pp. 595–609.

36. N. Tishby and N. Zaslavsky, “Deep learning and the information bottleneck principle,” in 2015 IEEE Information Theory Workshop (ITW), (IEEE, 2015), pp. 1–5.

37. Y. Luo, P. Liu, T. Guan, J. Yu, and Y. Yang, “Significance-aware information bottleneck for domain adaptive semantic segmentation,” in 2019 IEEE/CVF International Conference on Computer Vision (ICCV), (2020).

38. J. Dong, Y. Cong, G. Sun, B. Zhong, and X. Xu, “What can be transferred: Unsupervised domain adaptation for endoscopic lesions segmentation,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, (2020), pp. 4023–4032.

39. X. Yan, Z. Lou, S. Hu, and Y. Ye, “Multi-task information bottleneck co-clustering for unsupervised cross-view human action categorization,” ACM Trans. Knowl. Discov. Data 14(2), 1–23 (2020). [CrossRef]

40. M. Federici, A. Dutta, P. Forré, N. Kushman, and Z. Akata, “Learning robust representations via multi-view information bottleneck,” arXiv preprint arXiv:2002.07017 (2020).

41. K. Qian, Y. Zhang, S. Chang, M. Hasegawa-Johnson, and D. Cox, “Unsupervised speech decomposition via triple information bottleneck,” in International Conference on Machine Learning, (PMLR, 2020), pp. 7836–7846.

42. A. M. Saxe, Y. Bansal, J. Dapello, M. Advani, A. Kolchinsky, B. D. Tracey, and D. D. Cox, “On the information bottleneck theory of deep learning,” J. Stat. Mech.: Theory Exp. 2019(12), 124020 (2019). [CrossRef]

43. H. Hafez-Kolahi and S. Kasaei, “Information bottleneck and its applications in deep learning,” arXiv preprint arXiv:1904.03743 (2019).

44. M. Vera, P. Piantanida, and L. R. Vega, “The role of the information bottleneck in representation learning,” in 2018 IEEE International Symposium on Information Theory (ISIT), (IEEE, 2018), pp. 1580–1584.

45. A. A. Alemi, I. Fischer, J. V. Dillon, and K. Murphy, “Deep variational information bottleneck,” arXiv preprint arXiv:1612.00410 (2016).

46. Á. S. Hervella, J. Rouco, J. Novo, and M. Ortega, “Self-supervised multimodal reconstruction of retinal images over paired datasets,” Expert. Syst. with Appl. 161, 113674 (2020). [CrossRef]

47. A. Taleb, C. Lippert, T. Klein, and M. Nabi, “Multimodal self-supervised learning for medical image analysis,” arXiv preprint arXiv:1912.05396 (2019).

48. V. Ngampruetikorn, W. Bialek, and D. Schwab, “Information-bottleneck renormalization group for self-supervised representation learning,” Bulletin of the American Physical Society65 (2020).

49. P. West, A. Holtzman, J. Buys, and Y. Choi, “Bottlesum: Unsupervised and self-supervised sentence summarization using the information bottleneck principle,” arXiv preprint arXiv:1909.07405 (2019).

50. R. A. Amjad and B. C. Geiger, “Learning representations for neural network-based classification using the information bottleneck principle,” IEEE Trans. Pattern Anal. Mach. Intell. 42(9), 2225–2239 (2020). [CrossRef]

51. M. I. Belghazi, A. Baratin, S. Rajeswar, S. Ozair, Y. Bengio, A. Courville, and R. D. Hjelm, “Mine: mutual information neural estimation,” arXiv preprint arXiv:1801.04062 (2018).

52. B. Poole, S. Ozair, A. Van Den Oord, A. Alemi, and G. Tucker, “On variational bounds of mutual information,” in International Conference on Machine Learning, (PMLR, 2019), pp. 5171–5180.

53. R. D. Hjelm, A. Fedorov, S. Lavoie-Marchildon, K. Grewal, P. Bachman, A. Trischler, and Y. Bengio, “Learning deep representations by mutual information estimation and maximization,” arXiv preprint arXiv:1808.06670 (2018).

54. K. Schulz, L. Sixt, F. Tombari, and T. Landgraf, “Restricting the flow: Information bottlenecks for attribution,” arXiv preprint arXiv:2001.00396 (2020).

55. Z. Jiang, R. Tang, J. Xin, and J. Lin, “Inserting information bottlenecks for attribution in transformers,” arXiv preprint arXiv:2012.13838 (2020).

56. L. Ardizzone, R. Mackowiak, C. Rother, and U. Köthe, “Training normalizing flows with the information bottleneck for competitive generative classification,” arXiv preprint arXiv:2001.06448 (2020).

57. Y. Gu, Y. Li, Y. Chen, J. Wang, and J. Shen, “A collaborative multi-modal fusion method based on random variational information bottleneck for gesture recognition,” in International Conference on Multimedia Modeling, (Springer, 2021), pp. 62–74.

58. O. Henaff, “Data-efficient image recognition with contrastive predictive coding,” in International Conference on Machine Learning, (PMLR, 2020), pp. 4182–4192.

59. Y. Tian, D. Krishnan, and P. Isola, “Contrastive multiview coding,” arXiv preprint arXiv:1906.05849 (2019).

60. P. Bachman, R. D. Hjelm, and W. Buchwalter, “Learning representations by maximizing mutual information across views,” arXiv preprint arXiv:1906.00910 (2019).

61. Q. Wang, C. Boudreau, Q. Luo, P.-N. Tan, and J. Zhou, “Deep multi-view information bottleneck,” in Proceedings of the 2019 SIAM International Conference on Data Mining, (SIAM, 2019), pp. 37–45.

62. D. McAllester and K. Stratos, “Formal limitations on the measurement of mutual information,” in International Conference on Artificial Intelligence and Statistics, (PMLR, 2020), pp. 875–884.

63. A. van den Oord, Y. Li, and O. Vinyals, “Representation learning with contrastive predictive coding,” arXiv preprint arXiv:1807.03748 (2018).

64. A. Shapiro, “Monte carlo sampling methods,” Handbooks Operations Res. Management Sci. 10, 353–425 (2003). [CrossRef]

65. D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” arXiv preprint arXiv:1312.6114 (2013).

66. D. P. Kingma and J. A. Ba, “A method for stochastic optimization,” arXiv preprint arXiv:1412.6980 434 (2019).

67. K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, (2016), pp. 770–778.

68. R. Rs, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, “Grad-cam: Visual explanations from deep networks via gradient-based localization,” Int. J. Comput. Vis. 128(2), 336–359 (2020). [CrossRef]

69. J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, (2018), pp. 7132–7141.

70. S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He, “Aggregated residual transformations for deep neural networks,” in Proceedings - 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, vol. 2017-January (2017), pp. 5987–5995.

71. S. Zagoruyko and N. Komodakis, “Wide Residual Networks,” British Machine Vision Conference 2016, BMVC 2016 2016-September, 87.1–87.12 (2016).

72. S. Gao, M. Cheng, K. Zhao, X. Zhang, M. Yang, and P. H. Torr, “Res2net: A new multi-scale backbone architecture,” IEEE Transactions on Pattern Analysis and Machine Intelligence PP, 1–1 (2019).

Author	Year	Method	Image modality	Performance
Math et al. [11]	2021	Deep learning approach	Color fundus image	Sensitivity: 96.37%
Y. Kang et al. [15]	2020	Naïve Bayesian	Color fundus image	Accuracy: 93.44%
Y.-H. Li et al. [16]	2019	Convolutional neural network	Color fundus image	Accuracy: 86.17%
T. Li et al. [12]	2019	Deep learning algorithm	Color fundus image	Accuracy: 82.84%
X. Li et al. [13]	2019	Deep learning algorithm	Color fundus image	Accuracy: 92.6%
T. R. Gadekallu et al. [23]	2020	Deep neural network	Color fundus image	Accuracy: 97.3%
X. Pan et al. [24]	2020	Convolutional neural network	Fluorescein angiography	AUC: 0.8703
T. Nagasawa et al. [25]	2019	Convolutional neural network	Fluorescein angiography	Sensitivity: 94.7%
A. Tanboly et al. [27]	2018	Fusion classification network	Optical coherence tomography	Accuracy: 93%
D. Le et al. [28]	2020	Convolutional neural network	Optical coherence tomography	Specificity: 90.82%
F. Tang et al. [14]	2021	Deep learning method	Ultra-wide field SLO	Sensitivity: 94.9%
F. Tang et al. [29]	2020	Deep learning algorithm	Ultra-wide field SLO	Accuracy: 92.8%

Networks	Accuracy	Precision	Recall	F1_score	Specificity	AUC
ResNet50-BR	91.914%	98.233%	86.335%	0.919	0.864	0.923
ResNet50-GR	91.089%	96.241%	85.333%	0.905	0.871	0.917
ResNet50-IR	91.914%	97.509%	86.709%	0.918	0.871	0.923
ResNet50-Pseudocolor	92.047%	96.918%	88.162%	0.923	0.873	0.921
ResNet50-Three	94.146%	95.608%	92.182%	0.939	0.927	0.942
ResNet50-Multimodal	94.304%	95.318%	92.834%	0.941	0.934	0.944

Networks	Accuracy	Precision	Recall	F1_score	Specificity	AUC
ResNet18	95.411%	98.601%	91.857%	0.951	0.928	0.957
SE-ResNet18	96.203%	99.303%	92.834%	0.960	0.936	0.965
SE-ResNet50	95.570%	96.346%	94.463%	0.954	0.949	0.956
WRN-18	95.886%	97.627%	93.811%	0.957	0.944	0.960
WRN-50	90.794%	94.946%	85.668%	0.901	0.875	0.912
ResNeXt18	94.762%	96.599%	92.508%	0.945	0.896	0.949
ResNeXt50	93.968%	94.684%	92.834%	0.938	0.933	0.940
Res2Net18	89.365%	92.857%	84.691%	0.886	0.866	0.897
Res2Net50	90.159%	96.935%	82.410%	0.891	0.854	0.912

Networks	Accuracy	Precision	Recall	F1_score	Specificity	AUC
ResNet18-IB	96.032%	97.635%	94.137%	0.959	0.946	0.961
MMIB-Net(ResNet50)	96.994%	99.654%	93.811%	0.968	0.945	0.972
SE-ResNet18-IB	96.361%	97.651%	94.788%	0.962	0.952	0.964
SE-ResNet50-IB	95.714%	97.297%	93.811%	0.955	0.943	0.958
WRN-18-IB	96.349%	97.333%	95.114%	0.962	0.955	0.964
WRN-50-IB	94.921%	97.578%	91.857%	0.946	0.927	0.951
ResNeXt18-IB	95.873%	97.627%	93.811%	0.957	0.943	0.960
ResNeXt50-IB	96.394%	98.630%	93.811%	0.962	0.944	0.965
Res2Net18-IB	94.462%	95.638%	92.834%	0.942	0.934	0.945
Res2Net50-IB	96.044%	98.621%	93.160%	0.958	0.939	0.962

Author	Year	Method	Image modality	Performance
Math et al. [11]	2021	Deep learning approach	Color fundus image	Sensitivity: 96.37%
Y. Kang et al. [15]	2020	Naïve Bayesian	Color fundus image	Accuracy: 93.44%
Y.-H. Li et al. [16]	2019	Convolutional neural network	Color fundus image	Accuracy: 86.17%
T. Li et al. [12]	2019	Deep learning algorithm	Color fundus image	Accuracy: 82.84%
X. Li et al. [13]	2019	Deep learning algorithm	Color fundus image	Accuracy: 92.6%
T. R. Gadekallu et al. [23]	2020	Deep neural network	Color fundus image	Accuracy: 97.3%
X. Pan et al. [24]	2020	Convolutional neural network	Fluorescein angiography	AUC: 0.8703
T. Nagasawa et al. [25]	2019	Convolutional neural network	Fluorescein angiography	Sensitivity: 94.7%
A. Tanboly et al. [27]	2018	Fusion classification network	Optical coherence tomography	Accuracy: 93%
D. Le et al. [28]	2020	Convolutional neural network	Optical coherence tomography	Specificity: 90.82%
F. Tang et al. [14]	2021	Deep learning method	Ultra-wide field SLO	Sensitivity: 94.9%
F. Tang et al. [29]	2020	Deep learning algorithm	Ultra-wide field SLO	Accuracy: 92.8%

Multicolor image classification using the multimodal information bottleneck network (MMIB-Net) for detecting diabetic retinopathy

Abstract

1. Introduction

2. Related work

2.1 Classification of fundus images to diagnose DR

2.2 Information bottleneck theory

3. Our approach

3.1 Structure of MMIB-Net

3.2 Problem formulation

3.3 Optimization

4. Experiment

4.1 Data and evaluation

4.2 Implementation details

4.3 Evaluation

5. Discussion

6. Conclusion and future work

Funding

Disclosures

Data availability

References

Data availability

Cited By

Figures (8)

Tables (4)

Equations (24)

Optics Express