
End-to-end algorithm for the automatic detection of the neural canal opening in OCT images based on a multi-task deep learning model

Open Access

Abstract

Neural canal opening (NCO) points are important landmarks of the retinal pigment epithelium layer in the optic nerve head region. Conventional NCO detection employs multimodal measurements and feature engineering, which is usually suitable for only one specific task. In this study, we propose an end-to-end deep learning scenario for NCO detection based on single-modality features (OCT). The proposed method contains two visual tasks: one verifies the existence of NCO points as a binary classification, and the other locates the NCO points as a coordinate regression. The feature representation of OCT images, extracted by a MobileNetV2 architecture, was evaluated on new testing data, with an average Euclidean distance error of 5.68 ± 4.45 pixels and an average intersection over union of 0.90 ± 0.03. This suggests that data-driven scenarios have the potential to provide a universal and efficient solution to various visual tasks on OCT images.

© 2023 Optica Publishing Group under the terms of the Optica Open Access Publishing Agreement

1. Introduction

Morphologic changes of the optic nerve head (ONH) are among the most important indicators that help ophthalmologists assess the risk of glaucoma. Glaucoma is a chronic neurodegenerative disease of the optic nerve that eventually leads to optic disc cupping on the surface of the ONH. The optic disc is the clinically visible surface of the ONH, which marks the peripheral boundary of the neural tissue [1], and its size is often used to estimate the cup-to-disc ratio, an indicator used to monitor the progression of glaucoma and to assess the efficacy of treatment.

The ONH is where the retinal nerve fibers converge at the fundus, and it is typically bordered by an anatomical opening in Bruch's membrane (also known as the Bruch's membrane opening, BMO). The breakpoint of Bruch's membrane is called the neural canal opening (NCO) point. Spectral-domain optical coherence tomography (SD-OCT) makes it possible to reconstruct the 3D structure of the ONH and to identify the location of the NCO. However, it can be difficult to accurately distinguish Bruch's membrane from the retinal pigment epithelium (RPE) at the resolution of a typical OCT system. Therefore, the breakpoint of the BM/RPE complex layer is usually considered the NCO point [14]. The NCO point can be used to estimate the margin of the retinal nerve fiber layer (RNFL) [5]. In addition, the NCO point is an important morphological landmark for assessing the anatomy and pathology of the lamina cribrosa [6].

Various segmentation methods for the ONH have been proposed, and some of them rely on the features of color fundus images [7–11]. For example, Yin et al. combined edge detection and the circular Hough transform to detect the ONH [11]. However, this branch of methods is limited by the image quality of the fundus image. On the other hand, another cluster of approaches aims to identify the NCO points from a single OCT image. Khalil et al. reported a method that combines a series of morphological strategies and known retinal structural constraints to extract the NCO points [12]. Hussain et al. proposed an approach based on a graph search over three benchmark reference layers to identify the NCO points [13]. Nevertheless, these methods often require many constraints and much prior knowledge to accomplish the detection task.

Several methods use OCT volumetric scans to detect NCOs, integrating feature engineering and machine learning techniques [3,14,15]. Wu et al. published a method that first detects the position of the NCO from the SD-OCT volume and then uses a pre-trained SVM classifier to identify patches with NCO points at their centers [14]. Chen et al. proposed a similar coarse-to-fine detection method, replacing the SVM classifier with a U-Net structure that segments the circular region centered at the NCO point [3,16]. This circular region provides the network with additional neighborhood information about the NCO, making the segmentation easier than directly segmenting the NCO point itself. Miri et al. proposed a multi-modal method that combines features of fundus and OCT images via graph theory and machine learning [15]. It differs from the method of Hussain et al. in that it first clusters cup, rim, and background pixels of fundus images via a random forest. The OCT volume is then resampled and projected into a radial projection image to obtain BMO features. Finally, all of this information is used in the cost functions of a graph-based approach to obtain the optimal NCO locations.

Recently, approaches for NCO detection based on deep learning (DL) have been published and have achieved outstanding performance [4,17,18]. Sułot et al. proposed a fully semantic segmentation method that divides a B-scan OCT image into normal and BMO regions [4]. Each A-scan line is categorized into one of these two classes to obtain a prediction mask. The A-scan lines corresponding to the disc margin can then be identified, and the coordinates of the BMO are extracted accordingly. Devalla et al. reported another semantic segmentation method for six ONH tissue layers [17,18]. The terminations of the RPE layer around the retinal nerve fiber layer and the prelaminar tissue are identified as the BMO points. However, these approaches still require an additional step to extract the coordinates of the NCO points from the output image, since the architectures of their DL models are designed for image-to-image segmentation. Therefore, the DL models are trained and optimized using a segmentation mask as the ground truth rather than the coordinates of the NCO points, which leads to a disconnection between the loss function and the distance error between the target and predicted NCO coordinates.

In this study, we proposed a multi-task DL model for the automatic detection of the existence of NCO points and their coordinates. The idea of this multi-task DL model lies in the integration of image classification and landmark detection, whereby two tasks are conducted: determining whether an OCT image contains NCO points or not, and coordinate regression for the NCO points if present. To achieve this dual goal with one DL model, the MobileNetV2 was used as the backbone architecture to extract the features of the input image [19]. These features are then passed to two different networks for the image classification and coordinate regression tasks, meaning that these two networks share the common image features. Unlike DL models in previous works, the proposed method does not require pre-processing for pixel clustering, region-of-interest detection, or post-processing to extract the coordinates of the NCO points from the output of the DL model. We believe that the proposed DL model with this kind of end-to-end design can offer more versatility to OCT-based clinical diagnoses.

2. Material and methods

2.1 Subjects and data acquisition

In this work, volumetric OCT data obtained with the ACT100 (Medimaging Integrated Solutions Inc., Taiwan) was used for DL model training, validation, and testing. The data was collected from 106 volunteers who participated in a clinical trial (serial number: CMUH109-REC2-109) approved by the Taiwanese Association of Institutional Review Boards, with all participants providing written consent. There were 33 females, 50 males, and 23 volunteers who preferred not to disclose their gender. The ages ranged from 25 to 87 years; 36 were normal subjects, 25 were diagnosed with glaucoma, and 45 were diagnosed with retinopathy. For each participant, one eye was randomly selected for single or multiple OCT measurements. In total, 153 OCT volumes of the ONH region were acquired via raster scanning during the clinical trial.

2.2 Datasets

Each OCT volume comprises 128 horizontal B-scans, each an 8-bit grey-scale image of $1024 \times 512$ pixels (height ${\times} $ width), corresponding to $9.00 \times 9.00 \times 2$ mm$^3$ (height ${\times} $ width ${\times} $ depth) in tissue. The OCT volumes were first randomly divided into three sub-datasets (training, validation, and testing) and then manually re-assorted according to participant to ensure that OCT images belonging to the same person do not appear in different sub-datasets. As a result, there are 82 volumes in the training dataset, 14 volumes in the validation dataset, and 57 volumes in the testing dataset, as shown in Table 1.
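As an illustration of this subject-wise split, the following Python sketch assigns whole participants to the three sub-datasets; the container names, target volume counts, and greedy assignment are illustrative assumptions rather than the authors' actual procedure.

```python
# A minimal sketch of a subject-wise split: whole participants are assigned to one
# sub-dataset so that images of the same person never span two sub-datasets.
import random

def split_by_participant(volumes_by_subject, n_train=82, n_val=14, seed=0):
    """volumes_by_subject: dict mapping participant ID -> list of OCT volume paths."""
    subjects = sorted(volumes_by_subject)
    random.Random(seed).shuffle(subjects)

    train, val, test = [], [], []
    for sid in subjects:
        vols = volumes_by_subject[sid]
        if len(train) + len(vols) <= n_train:      # fill the training set first
            train += vols
        elif len(val) + len(vols) <= n_val:        # then the validation set
            val += vols
        else:                                      # everything else goes to testing
            test += vols
    return train, val, test
```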


Table 1. The Amount of the Dataset

Since the scanning area is larger than the entire ONH region, there is no guarantee that NCO points exist in every B-scan image. Among the three datasets, there are 10496, 1792, and 7296 B-scan images in the training, validation, and testing datasets, respectively, of which 2546, 412, and 1807 B-scan images contain NCO points manually classified and marked by experienced experts. For the inter-/intra-observer reliability test, please refer to Supplement 1. Each B-scan image also has a 5-dimensional vector as its label, as shown in Fig. 1. The first dimension is a binary value ${c_n}$ indicating whether the image has visible NCO points (0 indicates no visible NCO points, 1 otherwise). The other four dimensions represent the corresponding coordinates of the NCO points and are zero if there is no NCO point.
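The label vector can be built, for example, as in the short sketch below; the function name and the annotation format are hypothetical placeholders consistent with Fig. 1.

```python
# A toy illustration of the 5-dimensional label vector [c_n, x1, y1, x2, y2].
import numpy as np

def make_label(nco_points):
    """nco_points: None, or ((x1, y1), (x2, y2)) annotated in pixel coordinates."""
    if nco_points is None:
        return np.zeros(5, dtype=np.float32)                 # c_n = 0, coordinates all zero
    (x1, y1), (x2, y2) = nco_points
    return np.array([1, x1, y1, x2, y2], dtype=np.float32)   # c_n = 1 plus the two NCO points
```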


Fig. 1. An OCT B-scan image with NCO points (orange dots). Each B-scan image in the dataset contains a 5-dimensional vector as its label. The first dimension ${c_n}$ indicates which class (0 for an image without NCO) it belongs to, as marked in green, and n stands for the ${n^{th}}$ OCT sample. The other four $({{x_{1n}},{y_{1n}},{x_{2n}},{y_{2n}}} )$ represent the corresponding coordinates $({{x_1},{y_1}} )$ and $({{x_2},{y_2}} )$ of the NCO points. The coordinates are zero if there is no NCO point in the image.


2.3 Data preprocessing and augmentation for model training

In this work, to increase the generality of the proposed DL model, both data preprocessing and data augmentation are applied during the training process. These techniques involve contrast enhancement, coordinate normalization, random vertical translation, and random rotation.

Since the proposed DL model aims to locate the termination of the BM/RPE complex layer, we took advantage of the optical characteristic that the BM/RPE layer has the highest contrast in an OCT image and applied contrast enhancement so that the proposed DL model focuses on the features of the BM/RPE layer during training. Additionally, the aspect ratio of the original image, 2 (1024/512), may affect the model's performance in the coordinate prediction task due to the unequal lengths along different axes. Therefore, all input OCT images are first resized to 512 × 512 pixels and the dynamic range of the pixel values is normalized between 0 and 1. The normalized image I is then rescaled by gamma correction as:

$$V({i,j} )= I{({i,j} )^\gamma },$$
where V represents the normalized image after gamma correction. The parameter $\gamma $ for the contrast enhancement is chosen to be $1.3$ in this work to retain the image features of the other low-contrast retinal layers. As described in the previous section, every B-scan image has a 5-dimensional vector as its label, including one class label and four coordinates, as shown in Fig. 1. Each coordinate value is also normalized between 0 and 1 before model training.
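A minimal preprocessing sketch following this section (resizing, intensity normalization, gamma correction with γ = 1.3, and coordinate normalization) is given below; the function signature and the use of OpenCV are our assumptions, not the authors' implementation.

```python
# Preprocessing sketch: resize to 512 x 512, normalize intensities to [0, 1],
# apply gamma correction, and normalize the NCO coordinates to [0, 1].
import cv2
import numpy as np

def preprocess(bscan, coords, gamma=1.3, out_size=512):
    """bscan: 8-bit grayscale image of shape (1024, 512); coords: (x1, y1, x2, y2) in pixels."""
    h, w = bscan.shape
    img = cv2.resize(bscan, (out_size, out_size), interpolation=cv2.INTER_CUBIC)
    img = img.astype(np.float32) / 255.0       # dynamic range normalized to [0, 1]
    img = np.power(img, gamma)                 # contrast enhancement (gamma correction, Eq. (1))
    x1, y1, x2, y2 = coords
    norm = np.array([x1 / w, y1 / h, x2 / w, y2 / h], dtype=np.float32)  # coordinates in [0, 1]
    return img, norm
```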

Finally, although the OCT device used in this work has an operating range that excludes the situation where the retina is very close to either the top or bottom boundary of the B-scan, it is still observed that retinal features at different depths, or at different rotation angles caused by oblique measurement, significantly affect the performance of the proposed DL model. To address this, random vertical translation and random rotation about the center of the B-scan image are sequentially applied to every input image for data augmentation. The range of the random vertical translation is approximately ±102 pixels (±20% of the image height), and the random rotation angle ranges from -45° to 45°. Each operation is achieved through an affine transformation with bicubic interpolation and a uniform probability density function. Figure 2 illustrates images of (a) normal and (b) oblique measurement, as well as the results after affine transformation with different rotation angles (c) 0°, (d) 15°, (e) 30°, and (f) 45°.
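The following sketch illustrates one possible implementation of these augmentations with OpenCV; applying the same affine map to the NCO coordinates is our assumption of how the labels are kept consistent with the warped image.

```python
# Augmentation sketch: random vertical translation (±20% of the image height) and
# random rotation (±45° about the image center) combined into one affine transform.
import cv2
import numpy as np

def augment(img, coords, max_shift=0.2, max_angle=45.0, rng=np.random):
    """img: B-scan image; coords: array (x1, y1, x2, y2) in pixel coordinates."""
    h, w = img.shape[:2]
    dy = rng.uniform(-max_shift, max_shift) * h
    angle = rng.uniform(-max_angle, max_angle)

    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)   # rotation about the center
    M[1, 2] += dy                                             # add the vertical shift
    warped = cv2.warpAffine(img, M, (w, h), flags=cv2.INTER_CUBIC)

    pts = np.asarray(coords, dtype=np.float64).reshape(2, 2)  # (x1, y1), (x2, y2)
    pts = np.hstack([pts, np.ones((2, 1))]) @ M.T             # same affine map on the labels
    return warped, pts.reshape(-1)
```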


Fig. 2. An OCT B-scan image with (a) normal measurement and (b) oblique measurement. OCT images after affine transformation with different rotation angles (c) $0^\circ $, (d) $15^\circ $, (e) $30^\circ $, and (f) $45^\circ $ from its image center.


2.4 Network architecture

To accomplish multiple tasks with a single DL model, one of the most important requirements is to extract sufficient features from the input image. In this work, we choose MobileNetV2 [19], which is well-known for its relatively low computational complexity and outstanding performance in object detection, segmentation, and landmark detection tasks [20–22], as the backbone architecture for feature extraction.

To execute the two tasks simultaneously, the feature extracted via the backbone and the global average pooling (GAP) layer is then passed separately to the following networks for classification and coordinate regression. Each of these two networks is a fully connected (FC) layer consisting of $1280$ neurons with the same activation function (sigmoid). The major difference between them is the output dimension, as shown in Fig. 3. The classification network produces a single number as the predicted class ${\hat{c}_n}$, whereas the regression network outputs a four-dimensional vector as the predicted coordinates ${\hat{{\mathbf {x}}}_n}$. It should be noted that, under this architecture, the proposed DL model outputs ${\hat{{\mathbf {x}}}_n}$ for both predicted class 0 (OCT image without NCO points) and predicted class 1 (OCT image with NCO points). However, only the coordinates with ${\hat{c}_n} = 1$ are kept to represent the NCO points in the evaluation described in Sec. 3.2.
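A minimal sketch of this two-head architecture, assuming the stock torchvision MobileNetV2 backbone, is shown below. The grayscale B-scan is assumed to be replicated to three channels, and the 5 × 5 spatial convolutions mentioned in Sec. 3.1 are not reproduced here.

```python
# Two-head network sketch: shared MobileNetV2 features, GAP, then a classification
# head (NCO present or not) and a regression head (four normalized coordinates).
import torch
import torch.nn as nn
from torchvision.models import mobilenet_v2

class NCONet(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = mobilenet_v2().features                          # 1280-channel feature maps
        self.gap = nn.AdaptiveAvgPool2d(1)                               # global average pooling
        self.cls_head = nn.Sequential(nn.Linear(1280, 1), nn.Sigmoid())  # predicted class c_n
        self.reg_head = nn.Sequential(nn.Linear(1280, 4), nn.Sigmoid())  # (x1, y1, x2, y2) in [0, 1]

    def forward(self, x):
        f = self.gap(self.backbone(x)).flatten(1)                        # shared 1280-d feature vector
        return self.cls_head(f), self.reg_head(f)

# Usage: c_hat, x_hat = NCONet()(torch.rand(12, 3, 512, 512))
```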


Fig. 3. The architecture of the proposed DL model. GAP: global average pooling layer.


2.5 Loss function

In order to train a multi-task network like the proposed DL model, it is necessary to apply different loss functions for the different tasks. In this work, two loss functions are applied: a binary cross-entropy (BCE) loss for the binary classification and a mean absolute error (MAE) loss for the coordinate regression.

For the classification task, the proposed DL model is optimized by minimizing the BCE loss, which is defined as:

$${L_{BCE}} ={-} [{{c_n}\cdot \log {{\hat{c}}_n} + ({1 - {c_n}} )\cdot \log ({1 - {{\hat{c}}_n}} )} ],$$
where ${\hat{c}_n}$ and ${c_n}$ represent the predicted and target classes of the ${n^{th}}$ training OCT image.

To predict the four values representing the coordinates of the two NCO points, we train the model by minimizing the mean absolute error (MAE) loss, which is defined as:

$${L_{MAE}} = \frac{1}{N}\mathop \sum \nolimits_{n = 1}^N {c_n}\cdot |{{{\hat{{\mathbf {x}}}}_n} - {{\mathbf {x}}_n}} |,$$
where ${\hat{{\mathbf {x}}}_n}$ and ${{\mathbf {x}}_n}$ represent the predicted and target coordinates of the ${n^{th}}$ training OCT image.

It should be noted that the MAE loss used in this work differs slightly from its common form by the coefficient ${c_n}$. This coefficient prevents the model from being penalized for the NCO coordinate predictions of OCT images that contain no NCO points.

Since the magnitudes and convergence rates of the tasks are different, a common way to optimize a multi-task network is to combine the weighted losses into a single aggregated loss function during model training [23]. Inspired by the concepts of loss weighting and adaptive scheduling [24–26], we balance the aggregated loss function with a hyperparameter $\alpha $ as shown in Eq. (4),

$${\mathrm{{\cal L}}_{model}} = \alpha \cdot {L_{BCE}} + ({1 - \alpha } )\cdot {L_{MAE}}.$$

Considering that the magnitude of the BCE loss is much larger than that of the MAE loss, the hyperparameter $\alpha $ is set to $0.001\cdot {({0.8} )^{epoch/10}}$, which gradually decreases during training. In other words, the model pays more attention to the accuracy of the classification task at the beginning of training and then gradually shifts its attention from the classification task to the regression task.
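A sketch of the aggregated loss of Eqs. (2)-(4) is given below; the reduction details (averaging over the batch and summing over the four coordinates) are our assumptions.

```python
# Aggregated multi-task loss: BCE for the classification head and a c_n-masked MAE
# for the coordinate head, mixed by the epoch-dependent weight alpha.
import torch
import torch.nn.functional as F

def multitask_loss(c_hat, x_hat, c, x, epoch):
    """c_hat: (B, 1), x_hat: (B, 4) model outputs; c: (B,), x: (B, 4) targets."""
    bce = F.binary_cross_entropy(c_hat.squeeze(1), c)             # Eq. (2), averaged over the batch
    mae = (c.unsqueeze(1) * (x_hat - x).abs()).sum(dim=1).mean()  # Eq. (3): zero for images without NCO
    alpha = 1e-3 * 0.8 ** (epoch / 10)                            # Eq. (4) weight, decays during training
    return alpha * bce + (1 - alpha) * mae
```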

3. Results and discussions

3.1 Training device and parameters

The proposed DL model is based on the MobileNetV2 architecture described in Ref. [19], with the kernel size of all spatial convolutions set to 5 × 5. During the training phase, the network was optimized using the adaptive moment estimation (Adam) algorithm [27] with an initial learning rate of $10^{-3}$, which was tapered off by a factor of 0.8 every 10 epochs. The architecture has about 2 M trainable parameters and was trained, validated, and tested on an NVIDIA GeForce RTX 2080Ti GPU with CUDA v10.3. The batch size and the total number of training epochs were 12 and 100, respectively. All results in the subsequent sections were obtained using the same training parameters described here.
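Under these settings, the optimization loop could look like the following sketch, which reuses the NCONet model and multitask_loss function from the earlier sketches and assumes a train_loader yielding batches of 12 preprocessed B-scans with their labels.

```python
# Training-loop sketch: Adam with initial lr 1e-3, decayed by 0.8 every 10 epochs,
# batch size 12, 100 epochs.
import torch

model = NCONet().cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.8)

for epoch in range(100):
    for img, c, x in train_loader:               # batches of 12 preprocessed B-scans
        c_hat, x_hat = model(img.cuda())
        loss = multitask_loss(c_hat, x_hat, c.cuda(), x.cuda(), epoch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()                             # learning-rate decay every 10 epochs
```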

3.2 Evaluation metrics

To evaluate the performance of the proposed multi-task DL model, different quantitative metrics are required for the different tasks. In this work, we applied the concept of the confusion matrix to evaluate the performance of the binary classification task. In addition, the Euclidean distance was used to evaluate the difference between the predicted coordinates ${\hat{{\mathbf {x}}}_n}$ and the annotated coordinates ${{\mathbf {x}}_n}$ of the NCO points. It should be noted that images belonging to the true-negative category are omitted from the Euclidean distance calculation because the distance is meaningless for them.

In addition, if we project all of the annotated and predicted NCO points onto the en face image of the OCT volume, we can acquire the corresponding segmentation results of the annotated ONH region ${A_t}$ and the predicted ONH region ${A_p}$. To further evaluate this segmentation result, the intersection over union (IOU), the most popular evaluation metric for object detection and segmentation tasks [28–30], is also introduced in this work. The IOU is defined as the ratio of the intersection of ${A_t}$ and ${A_p}$ to their union, as shown in Eq. (5),

$$\textrm{IOU} = \frac{{{A_p} \cap {A_t}}}{{{A_p} \cup {A_t}}}.$$
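The two metrics can be computed, for instance, as in the sketch below; the binary en face masks of the predicted and annotated ONH regions are assumed to be given as boolean arrays.

```python
# Evaluation sketch: per-image Euclidean distance between predicted and annotated
# NCO points, and IOU between the projected ONH regions (Eq. (5)).
import numpy as np

def nco_distance(x_hat, x):
    """Mean Euclidean distance (in pixels) over the two NCO points of one B-scan."""
    d = np.linalg.norm(x_hat.reshape(2, 2) - x.reshape(2, 2), axis=1)
    return d.mean()

def iou(mask_pred, mask_true):
    """Intersection over union of the predicted and annotated ONH region masks."""
    inter = np.logical_and(mask_pred, mask_true).sum()
    union = np.logical_or(mask_pred, mask_true).sum()
    return inter / union
```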

3.3 Performance of model trained with different loss functions

Compared with the binary classification task, the coordinate regression task offers many choices of loss function. To further explore and compare the performance of different loss functions for coordinate regression, the proposed DL model was also trained using the mean squared error (MSE) loss.

Figure 4 shows the learning curves of the proposed DL model trained with the different loss functions. The red curves represent the results with the MAE loss, and the black curves those with the MSE loss. The solid and dashed lines represent the training and validation results, respectively. Judging from the training results alone, the model trained with the MSE loss seems to perform better. However, this performance is not reflected in the validation results. A closer look at the learning curves of the total loss shows that the model using the MSE loss gradually overfits as the training epochs increase. This can be attributed to the fact that, compared with the MAE loss, the MSE loss is more sensitive to outliers. As a result, the model tends to fit the outliers during the training process, leading to worse validation results.


Fig. 4. Learning curves, (a) classification accuracy, and (b) total loss of the proposed DL model with different loss functions (red: MAE loss, black: MSE loss). The solid and dash dot lines represent the training and validation results, respectively.


On the other hand, a more intuitive criterion for evaluating the models trained with different loss functions on the coordinate regression task is to directly compare their testing results in terms of Euclidean distance and IOU. As Table 2 shows, the proposed DL model achieved equivalent accuracy and IOU on the testing dataset with the two loss functions. However, the model trained with the MAE loss provided more accurate predictions of the NCO points in a single B-scan image. This echoes the previous paragraph: the model trained with the MSE loss is prone to overfitting because of the outliers, making it unable to provide accurate predictions on unseen data.


Table 2. The Testing Results of Models Trained with Different Loss Functions

In addition, the reason why these two models show no significant difference in IOU is that the IOU calculation first projects the predicted NCO points of each B-scan image onto the en face image of the OCT volume. This projection significantly reduces the final discrepancy in IOU, because the prediction error of the NCO points in a single image is compensated by its neighbors.

Upon closer examination of the prediction results of these two models for OCT B-scan images, as seen in Fig. 5, there are three different situations: (a) normal case, (b) oblique measurement, and (c) large neural canal obliqueness [1,2]. The red, cyan, and orange dots represent the ground truth, prediction by the model trained with MAE loss, and MSE loss, respectively. It can be seen that the model trained with MAE loss is more robust in predicting NCO points than the one trained with MSE loss.


Fig. 5. Prediction of NCO points on 3 different OCT B-scan images, (a) normal case, (b) oblique measurement, and (c) large neural canal obliqueness. The red, cyan, and orange dots represent the ground truth, prediction by the model trained with MAE loss, and MSE loss, respectively.


Figure 6 shows the testing results for the different loss functions, including (a) a histogram of IOU and (b)-(c) RPE en face images with different IOU scores. The RPE en face images are acquired by averaging 5 pixels superior and inferior to the RPE layer along the A-scan direction. The red, cyan, and orange curves represent the ellipses fitted to the projected NCO points of the ground truth, the prediction of the model trained with the MAE loss, and that with the MSE loss, respectively. As shown in Fig. 6(a), the distribution of the predictions by the model trained with the MAE loss is more concentrated than the others. As mentioned in the previous paragraph, the model trained with the MSE loss is more susceptible to outliers and is therefore more likely to learn abnormal image features during training. Such abnormal image features might not exist in unseen data, resulting in unstable performance on the testing data. These en face images reveal an interesting result: when the IOU score exceeds 0.89, there is no significant difference in the segmentation result of the ONH. Nonetheless, the specific IOU score required for various applications is beyond the scope of this work.
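For reference, the sketch below shows one way the projected NCO points could be turned into an elliptical ONH region of the kind overlaid in Fig. 6 and used for the IOU; the use of cv2.fitEllipse and the en face image size are our assumptions, not necessarily the authors' fitting routine.

```python
# Ellipse-fitting sketch: fit an ellipse to the projected NCO points and rasterize it
# into a binary ONH-region mask on the en face image.
import cv2
import numpy as np

def onh_mask_from_nco(points_xy, shape=(128, 512)):
    """points_xy: (N, 2) array of projected NCO points (x = A-scan column, y = B-scan row)."""
    pts = np.asarray(points_xy, dtype=np.float32)
    ellipse = cv2.fitEllipse(pts)                 # least-squares ellipse through the points
    mask = np.zeros(shape, dtype=np.uint8)
    cv2.ellipse(mask, ellipse, 1, -1)             # filled ellipse = ONH region
    return mask.astype(bool)
```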


Fig. 6. Testing results of the different loss functions. (a) Histogram of IOU, where the red and black bins stand for the results with the MAE and MSE losses. (b)-(c) RPE en face images with different IOU scores, where the red, cyan, and orange curves are the ellipses fitted to the projected NCO points of the ground truth, the prediction of the model trained with the MAE loss, and that with the MSE loss.


3.4 Influence of data augmentation

In Section 2.3, we described the data augmentation applied during model training, including random vertical translation and random rotation, to improve model performance. In this section, we investigate the effect of these data augmentation techniques on model performance. It is worth noting that the loss function used in this discussion is the MAE loss, as it is more resilient to outliers than the MSE loss.

Examining the impact of different levels of random vertical translation (5%, 10%, 15%, 20%, 25%, and 30%) on model performance, as shown in Table 3, reveals that the proposed model achieves around 99% accuracy in distinguishing whether an image contains NCO points, regardless of the level of random vertical translation applied. Additionally, in the task of predicting the coordinates of the NCO points, there is no significant difference, as the prediction error falls within a single-digit pixel range for every level of random vertical translation. This is also reflected in the subsequent IOU scores. We infer that this is because, regardless of the depth at which the retinal features are located within the OCT image, the feature extraction backbone used in this work (MobileNetV2) can efficiently extract the information of the NCO points at different depths. These features allow the subsequent neural networks to provide stable prediction results.


Table 3. Model Performance on Testing Data with Different Levels of Random Vertical Translation

The results of the model performance on the testing data with and without random rotation are shown in Table 4. There is no significant difference between the models trained with and without random rotation on the classification task. However, a substantial gap in the coordinate prediction of the NCO points can be observed, which is consistent with the description in Sec. 2.3. The tilt angle of the retina in an OCT image has a significant effect on the image features, and OCT images with largely oblique retinal features severely impact the accuracy and stability of the proposed DL model on the coordinate prediction task. This phenomenon can be seen in single OCT images, as shown in Fig. 7. This figure displays four OCT images taken under different conditions: normal, large shooting angle (oblique measurement), deep shooting depth, and large neural canal obliqueness. The red dots represent the ground truth. The cyan and yellow dots are the NCO points predicted by the networks trained with and without random rotation, respectively. The first and second rows are the results without and with 20% random vertical translation. For more results around the superior and inferior disc regions, and for the model performance on the different categories (normal, glaucoma, and retinopathy), please refer to Supplement 1.


Fig. 7. Prediction of NCO points on 4 different OCT B-scan images by models trained with different data augmentation techniques, (a)-(d) no random vertical translation, and (e)-(h) 20% random vertical translation. Columns from left to right represent the normal measurement, oblique measurement, deep retina feature, and large neural canal obliqueness, respectively. The red, cyan, and yellow dots denote the ground truth and the predictions by the models trained with and without random rotation, respectively.



Table 4. Model Performance on Testing Data with Different Data Augmentation Techniques

Figure 8 shows the histograms of the Euclidean distances calculated for each OCT image in the testing data using the proposed model trained with different data augmentation techniques (left: without random vertical translation, right: with 20% random vertical translation). The blue and yellow bins in both subfigures represent the results with and without random rotation, respectively. From this result, it can be seen that there is no significant difference in coordinate prediction accuracy between training with and without random vertical translation. However, different levels of obliqueness induce large changes in the retinal features, which affect the model performance on coordinate prediction. As a result, whether or not random rotation is applied during model training has a significant impact on the coordinate prediction accuracy of the proposed model. Therefore, if one wishes to use deep learning for various applications on retinal OCT images, such as image segmentation, retinal disease identification, or feature point prediction, applying random rotation during model training will likely improve the model performance.


Fig. 8. Histogram of the Euclidean distance of the predictions on the testing data via different models (a) without random vertical translation and (b) with 20% random vertical translation. The blue and yellow bins represent the prediction errors with and without random rotation.


3.5 Comparison with state-of-the-art approach

Since the proposed method in this study is an automatic NCO detection algorithm based on deep learning, the main benchmark is the multi-modal coarse-to-fine detection method proposed by Chen et al. [3]. However, the coarse detection in their method requires a fundus image for the preliminary segmentation and registration of the ONH, which is absent in this study. Therefore, a random selection technique was implemented to mimic the coarse detection as closely as possible without a fundus image. For more details about this technique and the corresponding parameters used to reproduce the method of Chen et al., please refer to Supplement 1.

The corresponding results are shown in Table 5, from which it can be observed that there is no significant difference between the proposed method and the multi-modal coarse-to-fine method in terms of Euclidean distance or IOU. However, it should be noted that the design concepts of these two approaches are completely different, which is reflected in the accuracy entry of Table 5. In the multi-modal coarse-to-fine method, the DL model does not need to classify whether the input image contains the target feature points, because this task has already been accomplished by multi-modal image registration in the coarse detection step. Moreover, the objective of that DL model is to accomplish an image segmentation task by producing a "binary" image as its output. Therefore, additional post-processing techniques, such as screening out outliers or calculating the corresponding geometric center, are still required to obtain the final coordinates of the target feature points.


Table 5. Model Performance on Testing Data with Different Approaches

On the other hand, the design concept of the proposed method is to reduce the pre-processing and post-processing algorithms as much as possible and to use the DL model to achieve the image classification and coordinate regression tasks at the same time based on shared features. This design concept is also reflected in the architecture of the DL model. The architecture proposed by Chen et al. is based on U-Net, an image-in image-out neural network, so additional post-processing is necessary to obtain the final result in any case, as described in the previous paragraph. In contrast, the proposed method applies an image-in coordinate-out architecture, which allows the DL model to directly output the classification result and the positions of the feature points at the same time. This benefit is also reflected in the number of model parameters. For an image-in image-out model such as U-Net, the output image resolution is expected to be the same as the input image resolution, which means the DL model must restore the image resolution through a series of up-sampling layers (or blocks) from high-semantic but low-resolution information in the latent space. Such a restoration process requires heavy computational resources, leading to a tradeoff between performance and efficiency. This is usually disproportionate, since the target feature points usually appear in only a few small regions. Compared with distributing computational resources to resolution restoration, a more efficient approach is to directly accomplish the image classification and coordinate regression tasks by utilizing the high-semantic information extracted by the backbone of the DL model in the latent space. As a result, this kind of end-to-end design provides equivalent performance.

4. Conclusions

We proposed a multi-task deep learning (DL) model that aimed to automatically detect the neural canal opening (NCO) points in an optical coherence tomography (OCT) image. The design concept of this model was based on the idea of image classification and landmark detection. Compared to the traditional method, the proposed model can directly classify the existence of NCO points and identify their coordinates if an OCT image contains them. To achieve this multi-task goal with one DL model, MobileNetV2 was chosen as the backbone architecture for feature extraction. Unlike the DL models reported in previous works, the proposed method does not require any pre-processing for pixel clustering or region-of-interest detection, nor any post-processing to extract the coordinates of the NCO points from the output of the DL model.

The experimental results show that the proposed method achieves 99% accuracy in image classification, regardless of whether data augmentation is applied. For the coordinate regression task, since the image features of the retina are likely to change with its position and obliqueness, data augmentation techniques such as random vertical translation and random rotation were introduced during model training, reducing the coordinate prediction error to 5.68 ± 4.45 pixels (the prediction error was 8.47 ± 4.89 pixels without any data augmentation). It was also found that, compared with random vertical translation, random rotation had a more significant impact on the model performance.

We believe that the proposed DL model, with its end-to-end design, provides more flexibility and possibilities for OCT systems and clinical diagnoses. For example, the cup-to-disc ratio can be calculated to facilitate subsequent clinical diagnosis and tracking once the coordinates of the NCO points have been located. The concept of landmark detection can also be applied to identify and locate the landmarks of other retinal diseases. Furthermore, this model can serve as a reference for developing more sophisticated models that detect not only the NCO coordinates but also other important features of the OCT image. The model can be modified and customized to fit the specific needs and requirements of the medical field by adjusting the neural network architecture, tweaking the training parameters, or fine-tuning the data pre-processing pipeline. Additionally, the model can be used to improve the accuracy and robustness of the algorithms used in OCT systems.

Funding

National Science and Technology Council (MOST 110-2221-E-A49-094-).

Disclosures

Chu-Ming Cheng: MiiS (F).

Data availability

Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request.

Supplemental document

See Supplement 1 for supporting content.

References

1. N. G. Strouthidis, H. Yang, J. F. Reynaud, J. L. Grimm, S. K. Gardiner, B. Fortune, and C. F. Burgoyne, “Comparison of clinical and spectral domain optical coherence tomography optic disc margin anatomy,” Invest. Ophthalmol. Visual Sci. 50(10), 4709–4718 (2009). [CrossRef]  

2. S. Hong, H. Yang, S. K. Gardiner, H. Luo, C. Hardin, G. P. Sharpe, J. Caprioli, S. Demirel, C. A. Girkin, J. M. Liebmann, C. Y. Mardin, H. A. Quigley, A. F. Scheuerle, B. Fortune, B. C. Chauhan, and C. F. Burgoyne, “OCT-detected optic nerve head neural canal direction, obliqueness, and minimum cross-sectional area in healthy eyes,” Am. J. Ophthalmol. 208, 185–205 (2019). [CrossRef]  

3. Z. Chen, P. Peng, H. Shen, H. Wei, P. Ouyang, and X. Duan, “Region-segmentation strategy for Bruch’s membrane opening detection in spectral domain optical coherence tomography images,” Biomed. Opt. Express 10(2), 526–538 (2019). [CrossRef]  

4. D. Sułot, D. Alonso-Caneiro, D. R. Iskander, and M. J. Collins, “Deep learning approaches for segmenting Bruch’s membrane opening from OCT volumes,” OSA Continuum 3(12), 3351–3364 (2020). [CrossRef]  

5. A. S. Reis, N. O’Leary, H. Yang, G. P. Sharpe, M. T. Nicolela, C. F. Burgoyne, and B. C. Chauhan, “Influence of clinically invisible, but optical coherence tomography detected, optic disc margin anatomy on neuroretinal rim evaluation,” Invest. Ophthalmol. Visual Sci. 53(4), 1852–1860 (2012). [CrossRef]  

6. G. Rebolleda, J. García-Montesinos, E. De Dompablo, N. Oblanca, F. J. Muñoz-Negrete, and J. J. González-López, “Bruch's membrane opening changes and lamina cribrosa displacement in non-arteritic anterior ischaemic optic neuropathy,” Br. J. Ophthalmol. 101(2), 143–149 (2017). [CrossRef]  

7. P. S. Mittapalli and G. B. Kande, “Segmentation of optic disk and optic cup from digital fundus images for the assessment of glaucoma,” Biomed. Signal Process. Control 24, 34–46 (2016). [CrossRef]  

8. S. Morales, V. Naranjo, J. Angulo, and M. Alcañiz, “Automatic detection of optic disc based on PCA and mathematical morphology,” IEEE Trans. Med. Imag. 32(4), 786–796 (2013). [CrossRef]  

9. L. Xiong and H. Li, “An approach to locate optic disc in retinal images with pathological changes,” Comput. Med. Imag. Graph. 47, 40–50 (2016). [CrossRef]  

10. M. T. Mahmood and I. H. Lee, “Optic disc localization in fundus images through accumulated directional and radial blur analysis,” Comput. Med. Imag. Graph. 98, 102058 (2022). [CrossRef]  

11. F. Yin, J. Liu, S. H. Ong, Y. Sun, D. W. Wong, N. M. Tan, C. Cheung, M. Baskaran, T. Aung, and T. Y. Wong, “Model-based optic nerve head segmentation on retinal fundus images,” in Proceedings of IEEE Conference on Engineering in Medicine and Biology Society (IEEE, 2011), pp. 2626–2629.

12. T. Khalil, M. U. Akram, H. Raja, A. Jameel, and I. Basit, “Detection of glaucoma using cup to disc ratio from spectral domain optical coherence tomography images,” IEEE Access 6, 4560–4576 (2018). [CrossRef]  

13. M. A. Hussain, A. Bhuiyan, and K. Ramamohanarao, “Disc segmentation and BMO-MRW measurement from SD-OCT image using graph search and tracing of three bench mark reference layers of retina,” in Proceedings of IEEE Conference on Image Processing (IEEE, 2015), pp. 4087–4091.

14. M. Wu, T. Leng, L. de Sisternes, D. L. Rubin, and Q. Chen, “Automated segmentation of optic disc in SD-OCT images and cup-to-disc ratios quantification by patch searching-based neural canal opening detection,” Opt. Express 23(24), 31216–31229 (2015). [CrossRef]  

15. M. S. Miri, M. D. Abràmoff, K. Lee, M. Niemeijer, J. K. Wang, Y. H. Kwon, and M. K. Garvin, “Multimodal segmentation of optic disc and cup from SD-OCT and color fundus photographs using a machine-learning graph-based approach,” IEEE Trans. Med. Imaging 34(9), 1854–1866 (2015). [CrossRef]  

16. O. Ronneberger, P. Fischer, and T. Brox, “U-Net: convolutional networks for biomedical image segmentation,” in Proceedings of Medical Image Computing and Computer-Assisted Intervention, N. Navab, J. Hornegger, W. M. Wells, and A. F. Frangi, eds. (Springer, 2015), pp. 234–241.

17. S. K. Devalla, P. K. Renukanand, B. K. Sreedhar, G. Subramanian, L. Zhang, S. Perera, J. M. Mari, K. S. Chin, T. A. Tun, N. G. Strouthidis, T. Aung, A. H. Thiéry, and M. J. A. Girard, “DRUNET: a dilated-residual U-Net deep learning network to segment optic nerve head tissues in optical coherence tomography images,” Biomed. Opt. Express 9(7), 3244–3265 (2018). [CrossRef]  

18. S. K. Devalla, T. H. Pham, S. K. Panda, L. Zhang, G. Subramanian, A. Swaminathan, C. Z. Yun, M. Rajan, S. Mohan, R. Krishnadas, V. Senthil, J. M. S. De Leon, T. A. Tun, C.-Y. Cheng, L. Schmetterer, S. Perera, T. Aung, A. H. Thiéry, and M. J. A. Girard, “Towards label-free 3D segmentation of optical coherence tomography images of the optic nerve head using deep learning,” Biomed. Opt. Express 11(11), 6356–6378 (2020). [CrossRef]  

19. M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. Chen, “MobileNetV2: inverted residuals and linear bottlenecks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (IEEE, 2018), pp. 4510–4520.

20. P. Mishra and K. Sarawadekar, “Fingertips detection in egocentric video frames using deep neural networks,” in Proceedings of IEEE Conference on Image and Vision Computing New Zealand (IEEE, 2019), pp. 1–6.

21. X. Guo, S. Li, J. Yu, J. Zhang, J. Ma, L. Ma, W. Liu, and H. Ling, “PFLD: a practical facial landmark detector,” arXiv, arXiv:1902.10859v2 (2019). [CrossRef]  

22. Z. Zhang, P. Luo, C.C. Loy, and X. Tang, “Facial landmark detection by deep multi-task learning,” in Proceedings of European Conference on Computer Vision, D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars, eds. (Springer, 2014), pp. 94–108.

23. M. Crawshaw, “Multi-task learning with deep neural networks: a survey,” arXiv, arXiv:2009.09796v1 (2020). [CrossRef]  

24. S. Jean, O. Firat, and M. Johnson, “Adaptive scheduling for multi-task learning,” arXiv, arXiv:1909.06434v1 (2019). [CrossRef]  

25. M. Guo, A. Haque, D.-A. Huang, S. Yeung, and F.-F. Li, “Dynamic task prioritization for multitask learning,” in Proceedings of European Conference on Computer Vision, V. Ferrari, M. Hebert, C. Sminchisescu, and Y. Weiss, eds. (Springer, 2018), pp. 282–299.

26. C. Li, J. Yan, F. Wei, W. Dong, Q. Liu, and H. Zha, “Self-paced multi-task learning,” arXiv, arXiv:1604.01474v2 (2016). [CrossRef]  

27. D. Kingma and J. Ba, “Adam: a method for stochastic optimization,” arXiv, arXiv:1412.6980v9 (2014). [CrossRef]  

28. M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, “The pascal visual object classes (VOC) challenge,” Int. J. Comput. Vis. 88(2), 303–338 (2010). [CrossRef]  

29. T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: common objects in context,” in Proceedings of European Conference on Computer Vision, D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars, eds. (Springer, 2014), pp. 740–755.

30. K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask R-CNN,” in Proceedings of IEEE International Conference on Computer Vision (IEEE, 2017), pp. 2961–2969.
