Multi-modal automatic montaging of adaptive optics retinal images

Min Chen; Robert F. Cooper; Grace K. Han; James Gee; David H. Brainard; Jessica I. W. Morgan

doi:10.1364/BOE.7.004899

1. Introduction

Adaptive optics (AO) ophthalmoscopy offers high resolution in vivo imaging and analysis of microscopic structures in the retina by using a wavefront sensor and deformable mirror to measure and correct optical aberrations in the eye’s optics. [1] This technique provides near diffraction limited resolution and has enabled visualization of the cone and rod photoreceptor mosaic, [2,3] retinal pigment epithelial mosaic, [4] retinal nerve fiber layer, [5] and retinal vasculature. [6] Increasingly, AO technology is being applied to study the structure and function of retinal cells both in the normal and diseased eye. [7]

While AO imaging has been used with great success in research settings, the analysis of AO images is not straightforward, which is one factor limiting the transition of AO to the clinic. A single AO image typically only covers a small fraction of the total retina (0.085–0.34 mm²). Thus, to examine a large retinal area, multiple adjacent images must be acquired across overlapping retinal regions and analyzed jointly. This requires constructing a montage from the individual AO images, which can be achieved by determining the overlap between images of adjacent retinal regions as well as detecting any discontinuities in the retinal regions covered by a set of images (Fig. 1). Accurate montaging is particularly crucial for estimating cell density across the retina and comparing such density to normative values, as such comparisons require accurate knowledge of the retinal eccentricity of each imaged location.

Fig. 1 Temporal arm of a set of confocal AO images of cone outer segments (a) before and (b) after manual montaging. In (a), individual image thumbnails are shown in order of their nominal retinal locations, as determined by the location of the fixation point that the subject was instructed to look at during image acquisition. The larger images show more detail for two adjacent images. In (b), a manual montage of this image set is shown.

Download Full Size | PDF

Current practice is typically to manually montage each AO dataset such that overlapping areas are aligned, with detected discontinuities then filled in by including additional images in the dataset, if possible. Manual montaging is disadvantageous in three regards. First and most important, manual montaging is expensive. Each montage demands significant time commitment, and requires trained personnel. Second, montages are typically constructed “by eye” and this procedure does not offer a metric that may be used to evaluate how well the overlapping regions of the individual images are aligned. Third, the presence of pathology can disrupt the typical cellular structure and patterns in the retinal images, which in turn can make manual montaging yet more difficult and laborious.

A limited literature has reported on automated solutions for montaging AO images. To date the only published work of which we are aware is that by Li et al. (2012) [8] in which local scale invariant features (PCA-SIFT [9]) are used to match adjacent images. The method produced a montage that appeared good as judged qualitatively on a single dataset, but the paper did not provide a quantitative analysis of the results nor a comparison with manual montaging. In addition, the algorithm was designed for a single imaging modality and did not include a provision to detect discontinuities in the image set.

In this work we build upon the method presented by Li. et al by including the ability to use multi-modal features from simultaneously acquired confocal, split-detection [10], and dark-field [11] AO images (examples shown in Fig. 2). In addition, we designed an approach for detecting discontinuities in the montages. In Section 2, we describe our method in more detail. In Section 3 we perform an extensive analysis of the algorithm’s performance on data acquired for both healthy and diseased retinae. We compare the accuracy of our algorithm against montages of the same datasets generated by a human rater using two quantitative metrics. To do so, we measure the similarity of the alignments in overlapping regions and the number of the discontinuities in each montage. We also analyze the contribution of the three different modalities by observing the changes in performance when using them together and separately in the algorithm. In Section 4 we discuss the merits and limitations of our algorithm given our experimental results. We conclude (Section 5) with closing observations and a discussion of future directions.

Fig. 2 Examples of three simultaneously acquired AO image modalities (confocal, split detection, and dark field), at two adjacent nominal locations. The confocal images are the same as shown in detail in Fig. 1.

Download Full Size | PDF

2. Methods

2.1. AO Image Acquisition

The research protocol and data presented in this study was approved by the institutional review board at the University of Pennsylvania and the Children’s Hospital of Philadelphia, and followed the tenets of the Declaration of Helsinki. All light exposures adhered to the maximum permissible exposure limits set by the American National Standards Institute standards. [12] Following an explanation of the study requirements and potential risks, subjects gave informed consent and voluntarily enrolled in the study.

Eight eyes of five normal sighted volunteers (control), two eyes of two patients diagnosed with central serous chorioretinopathy (CSCR), and one eye of a patient diagnosed with retinitis pigmentosa (RP) were included in the present study. Subjects’ eyes were dilated with phenylephrine hydrochloride (2.5%) and tropicamide (1%), and a dental impression “bite bar” was used to align the subject to the AOSLO.

The AOSLO used in this study has been previously described. [13] Briefly, wavefront sensing was done using a superluminescent diode centered at 840nm (Superlum, Ireland), aberration correction was performed using a 97 actuator deformable mirror (Alpao SAS, France), and imaging was done using a superluminescent diode centered at 790nm (Superlum, Ireland). Three photomultiplier tubes (PMT, Hamamatsu, Corporation, Japan) were used as point detectors to collect reflected light from the retina simultaneously, and videos of the retina were built up in real time at 16.7 frames per second as the scanners steered the beam over the 1° by 1° retinal area imaged. One PMT was used to collect light passing through a confocal pinhole placed optically conjugate with the retina, thereby yielding a confocal retinal video of the photoreceptor mosaic (termed confocal). The remaining two PMTs collected non confocal light from opposing semi-annular apertures also placed optically conjugate with the retina, yielding videos of non-confocal right and left half multiply-scattered light from the retina. [10] The videos acquired by these two detectors were summed point by point to yield a non-confocal video (dark field). [11] The point by point difference of the videos divided by their sum yielded a third video sequence (split detection). [10] The confocal, split detection and dark field retinal videos are three simultaneously collected AOSLO imaging modalities, each showing different features of the retina at the same spatial location. Each video frame has a pixel resolution of 0.4581 μm/pixel assuming an axial length of 24 mm. Pixel size was scaled according to each subject’s axial length; the average pixel size for the 11 datasets included in this study was 0.4777 μm/pixel in both the x and y directions. The average pixel dimension of the video frames was 609 pixels in both directions.

Subjects were instructed to fixate (using the imaged eye) as steadily as possible at a target while the AOSLO videos were acquired. Following successful imaging acquisition at one nominal location, the fixation target was translated to an overlapping, adjacent location and another AOSLO video was acquired. In the normal sighted controls and CSCR subjects, images were collected across the horizontal and vertical meridians of the retina from the fovea out to approximately 1800μm eccentricity, with additional images collected in the central 600μm by using the corners of the scanning square as the fixation location. The RP patient images extended approximately 1200μm from the fovea, along the temporal arm of the horizontal meridian.

2.2. Preprocessing

The confocal, split detection and dark field retinal videos were preprocessed using custom software [14] to dewarp the effect of the sinusoidal scan and register frames together to alleviate the effects of intraframe fixational eye motion. Reference frames were manually chosen from the confocal video, frames of the video were registered together, and an averaged image comprised of up to 50 individual frames was produced from each AOSLO confocal video. Averaged images from the simultaneously acquired split detection and dark field videos were created using the same intra-frame translations used to dewarp and register the confocal image. The result was a confocal, split detection, and dark field image at each nominal retinal location. Altogether, this provided 11 datasets for testing our automated algorithm, each consisting of 47–108 locations with confocal, split detection and dark field modalities. To serve as a comparison, each dataset was manually montaged using Photoshop (Adobe Creative Suite), and discontinuities in each dataset were assessed and noted by the manual rater.

2.3. Problem Framework

Given an AO dataset, we refer to each image as I_i,m(x_i), where i = (i, j) is the image’s nominal retinal location as determined by the fixation used during image acquisition, m is the modality of the image at that location (confocal, split detection, or dark field), and x_i = (x_i, y_i) is the pixel location within each image’s coordinate space. We designate an image I_ref,m(x_ref) as the initial reference image, to which the rest of the images in the dataset are aligned. In our experiments, we chose the image nominally at the origin (ref = (0, 0)) of the fixation coordinate grid as this reference. However, this is without loss of generality and the reference image may be chosen as any image in the dataset and can correspond to any nominal location.

The goal of the montaging algorithm is to find an alignment transformation between the coordinate system x_i from each image I_i,m(x_i) in the dataset and the global coordinate system x_ref. This transformation is described by the mapping

x_{i} = T_{ref \to i} (x_{ref}),

which tells us how each point in x_i directly corresponds to each point in x_ref. Given T_ref→i for each image I_i,m(x_i), we can create the aligned image

I_{i, m}^{ref} (x_{ref}) = I_{i, m} (T_{ref \to i} (x_{ref}))

by using the relationship from Eq. (1) to find the corresponding image value from I_i,m(x_i) for each point in x_ref. In technical terms, T_ref→i serves as a pullback function that we substitute into the image I_i,m(x_i) to transform the image into the global coordinate system x_ref. To the extent that the alignment transformations are accurate, this procedure results in a single global coordinate system anchored by the reference image, such that the same retinal structure in two different images will have the same global coordinates.

2.4. Scale Invariant Feature Transforms (SIFT)

To find a transformation T_ref→i(x) we first consider the case where I_i,m(x_i) and I_ref,m(x_ref) are adjacent and overlapping images. In this case, we know that both images must contain a number of similar (if not identical) image features due to the same retinal structures being imaged in the overlapping region. To identify the location of these structures, we follow [8] and employ a widely-used feature detection method known as Scale Invariant Feature Transforms (SIFT) [15] to find matching features in both images.

This approach uses difference of Gaussian functions within a scale space to find key locations in both images. Each identified location is assigned a size and orientation according to the result of the difference of Gaussian filtering. This process produces feature descriptions that may be assigned a quantitative matching score that is invariant across rotation and scale differences between the two images. The identified features in the two images are then compared in this manner and the top N matches are used to provide a set of candidate matching locations. For our method, we set N as the top 10% of all matches found. For each n ∈ {1, 2, 3 . . . N}, SIFT gives us a pair of corresponding feature locations f_i,n = (x_i,n, y_i,n) and f_ref,n = (x_ref,n, y_ref,n), that potentially represent pixel locations of the same structure in the i^th and reference images, respectively. Under ideal conditions with limited noise, these features describe the relationship

I_{i, m} (f_{i, n}) \approx I_{ref, m} (f_{ref, n}),

where here ≈ denotes equality of the underlying retinal structure in the two images. Figure 3 shows examples of SIFT feature matches found between each of the AO modalities for a pair of adjacent images. Using the candidate matched feature locations, we estimate the transformation between the two image coordinate spaces using the linear relationship

F_{i}^{'} = T_{ref \to i} F_{ref}^{'},

where

F_{i} = [\begin{matrix} x_{i, 1} & y_{i, 1} & 1 \\ x_{i, 2} & y_{i, 2} & 1 \\ \dots \\ x_{i, n} & y_{i, n} & 1 \end{matrix}]

and

F_{ref} = [\begin{matrix} x_{ref, 1} & y_{ref, 1} & 1 \\ x_{ref, 2} & y_{ref, 2} & 1 \\ \dots \\ x_{ref, N} & y_{ref, N} & 1 \end{matrix}]

are the N pairs of feature locations stacked in matrix form, and the prime (′) symbolizes the transpose of the matrix. The transformation T_ref→i is represented by a matrix that depends on the type of alignment transformation that we specify. For a transformation consisting entirely of translation,

T_{Trans} = [\begin{matrix} 1 & 0 & a \\ 0 & 1 & b \\ 0 & 0 & 1 \end{matrix}],

where a and b are the column and row translation parameters we solve for. For a transformation consisting of both translation and rotation we use

T_{Rigid} = [\begin{matrix} cos θ & sin θ & a \\ - sin θ & cos θ & b \\ 0 & 0 & 1 \end{matrix}],

where an additional rotation θ (in radians) must be solved for. Other more general types of transformations, such as those that contain scaling or warping, may also be considered, but these are the two that we have implemented in the current version of our algorithm in order to restrict the transformations to represent those attributable to real eye motions.

Fig. 3 Shown are feature matches found by SIFT on the confocal, split detection, and dark field images between two adjacent nominal locations. The individual lines connect matched features in the two images. Note that most matches suggest a consistent global coordinate transformation between the two images, but that in each case there are also outlier matches.

Download Full Size | PDF

It is important to note that this estimate of the transformation is dependent on the assumption that each corresponding feature pair is correctly matched. However, from Fig. 3 we see that while the majority of the matches are consistent with a single global transformation, there are also outliers which are presumably feature mismatches. Including these will distort the transformation estimate, and we describe an approach for handling this in Section 2.5 below.

Once we have matched an image to the reference image, we can then consider the case where an image I_(i+1),m (x_(i+1)) does not directly overlap I_ref,m(x_ref), but instead is separated from it by the intermediate image I_i,m(x_i). In this case, I_(i+1),m (x_(i+1)) overlaps with I_i,m(x_i), and I_i,m(x_i) also overlaps with I_ref,m(x_ref). In this scenario, we first use SIFT (as described above) to find two pairwise transformations T_i→(i+1)(x_i) and T_ref→i(x_ref). We then compose the two transformations together to find the total transformation to the reference image,

T_{ref \to (i + 1)} (x_{ref}) = T_{i \to (i + 1)} (T_{ref \to i} (x_{ref})) .

This process can be repeated for any number of connected images to bring all images into the global image coordinate space. In Section 2.7 we discuss the criteria we use to handle situations where more than two images are overlapping, and how the algorithm detects and handles the cases where there is a discontinuity that prevents an image from being connected to the reference image.

2.5. Random sample consensus (RANSAC)

One potential problem with using SIFT feature matching to find the correct alignment is the presence of false positive matches between locations with similar feature properties (Fig. 3 and appendix Fig. 8). To address these, we use a technique known as random sample consensus (RANSAC) [16] to remove outlier matches. This idea is regularly used in conjunction with SIFT features [15] and was also adopted in the earlier work on montaging AO images. [8] In this method, a random subset of R < N matched feature pairs is taken from the N prospective SIFT matches and used to estimate a candidate transformation T_R. This transformation is then applied to all N prospective SIFT feature pairs and the squared error

{| f_{i, n} - T_{R} (f_{ref, n}) |}^{2}

is calculated between the feature location in one image, and the transformed feature location for the other image. This error is used to calculate a RANSAC score for the candidate transformation,

S_{R} = \sum_{n = 1}^{N} 1_{(< σ^{2})} ({| f_{i, n} - T_{R} (f_{ref, n}) |}^{2}),

where 1_(<σ²) (d) is an indicator function that returns one when the squared error is within the square of the tolerance σ and returns zero otherwise. Effectively, S_R represents the number of prospective SIFT features that fit the candidate transformation. In the absence of noise or matching error, S_R = N regardless of the subset of R prospective features that are used to estimate the transformation. However, in the presence of errors, we must repeatedly test different subsets of features to find the candidate transformation that maximizes the number of features that fits the transformation (i.e.,

max_{T_{R}} S_{R}

). This is done iteratively by randomly selecting a new subset of R features at each iteration, and then calculating and recording T_R and S_R given the new subset of features. At the end of G iterations, the transformation that produced the best RANSAC score is used as the solution. In our experiments, we’ve found that R = 3, G = 5000, and a tolerance (σ) of 6 pixels generally produced good results. In our AO images, 6 pixels is equivalent to less than 3 μm. Hence, this tolerance level ensures that all feature matches selected by RANSAC will follow the estimated alignment transformation with errors smaller than the size of a cone.

2.6. Multi-modal matching using RANSAC

To use the multiple modalities together, we start with a multi-channel approach for generating the SIFT features matches, such that we generate N feature matches for each modality m, where each feature pair is described by

f_{i, m, n} = (x_{i, m, n}, y_{i, m, n})

and

f_{ref, m, n} = (x_{ref, m, n}, y_{ref, m, n}) .

We then make the observation that all three image modalities are already aligned for a given nominal location, because they are acquired simultaneously. Hence, for a given nominal location, the features from each modality must be in the same coordinate system. Since, the goal of RANSAC is to estimate a transformation between the coordinate systems, it is then inconsequential where the features originated from. We can combine features from each modality into a single set, and use an aggregated score

S_{R, M} = \sum_{m = 1}^{M} \sum_{n = 1}^{N} 1_{(< σ^{2})} ({| f_{i, m, n} - T_{R} (f_{ref, m, n}) |}^{2}),

to evaluate the estimated transformation, where M is the total number of modalities (three in our case). This effectively pools the feature information from all three modalities into a single estimation problem, which allows the best matches across the three modalities to drive the transformation used for the alignment. Figure 4(a) shows the remaining SIFT matches from Fig. 3 after using multi-modal RANSAC to remove outliers from all three modalities simultaneously. Figure 4(b) shows the image pairs montaged by applying the transformation learned from matching the SIFT features. For visualization, the intensities of the overlapping regions in the figure were averaged at each pixel to show the alignment of the corresponding structures in the two images.

Fig. 4 Shown are (a) the remaining SIFT matches from Fig. 3 after using multi-modal RANSAC to remove outliers from all three modalities simultaneously, and (b) the montage created by applying the transformation learned from the matching features to the second image. For visualization, the intensities in the overlapping regions were averaged at each pixel to show the alignment of the corresponding structures in the two images.

Download Full Size | PDF

2.7. Matching criterion and discontinuity detection

The algorithm described so far assumes that each nominal location only has one image, and that it is always possible to align images between adjacent nominal locations to connect the whole montage. However, this is rarely the case in practice. Multiple images are often acquired at the same nominal location, which means that an image can have several possible choices of reference image for aligning that image to the global coordinate system. And due to different degrees of noise and artifacts, some of these possible choices will serve as a better reference than others. In addition, fixation error can sometimes cause images at adjacent nominal locations to have no overlap at all. This can result in a discontinuous montage, where disjoint pieces of the montage must be constructed separately by selecting a new global coordinate system for each of the pieces.

To address these potential problems the algorithm requires two key features. First, while aligning an image to the global coordinate system, it needs the capability to choose the best reference image when there are multiple viable images connected to the current montage. Second, it must be able to detect when none of the currently unaligned images can align well enough with the available reference images to justify joining any of them to the current montage. Fortunately, one major advantage of using RANSAC to remove outlier matches between the SIFT features is that it can provide several quantitative measures on the reliability of each alignment. In particular, we can calculate the absolute number of matches between the images that fit the model parameter, the percent of inlier matches out of the total number of matches found by SIFT, and the average mean square error between the feature locations in one image and the transformed locations of the corresponding matches from the other image. We use these measures to determine quantitatively when a pair of images should or shouldn’t be matched together.

We use the absolute number of inlier matches to decide which image to use as the reference image when there are multiple viable choices aligned to the current montage. To do this, we first determine a search range, K, around the current image, which we are trying to transform into the global coordinate system. If an aligned image (i.e, an image in the current montage, already transformed into the global coordinate space) is within K nominal locations, we use SIFT and RANSAC to find the alignment between it and the current image. We then repeat this for all aligned images within K nominal locations, and the image that produces the most inlier matches is used as the reference for montaging the current image. The reasoning for this is that the number of inlier matches represents the quality of the consensus for a particular transformation model. Hence, the more inlier matches, the more likely that the model is correct and will produce an accurate alignment. In addition, since the algorithm is estimating a linear transformation, a higher number of inlier matches allows for more robust and accurate estimations by averaging out the noise in the feature matching. However, we note that this improvement rapidly plateaus with increasing numbers of features. In general, we find that having more than 100 inlier matches only provides marginal improvement to the final estimation. We believe this is due to the strict outlier removal provided by RANSAC, and the low dimension of the transformation being estimated. From our experiments, we also found that K = 3 is sufficient for most cases, but may need to be expanded when there are many fixation errors in the data or can be reduced when there is a known large amount of overlap between the nominal locations.

To detect discontinuities in the montage, we set a minimum threshold on the number and percentage of SIFT inlier matches required for an alignment to be valid. In our algorithm, we assume that an alignment is always valid if at least 10 inlier matches are found. If there are fewer than 10 inlier matches, then the algorithm assumes the alignment is valid only if at least 50% of the total matches are inliers matches. Any alignment with 2 or fewer total matches is always considered invalid, since that is not enough feature points to estimate a transformation. These thresholds were determined during preliminary work, using 3 datasets that were independent from the datasets used in the experiments presented in Section 3.

When the current image does not meet the above criterion with any of the aligned images, the algorithm skips the image temporarily and continues the process for all remaining unaligned images. If a new image is aligned, then all skipped images are checked against the newly aligned image to see if an intermediate connection can be formed. When none of the remaining images meet the alignment criterion, then the algorithm decides that there is a discontinuity in the montage. The image from the remaining unaligned images which is closest to the reference image (according to the nominal position) is selected as a new reference image, and a new montage is constructed using the remaining images. This process is repeated until every image is aligned.

3. Experiments

3.1. Alignment accuracy

In our first experiment, we evaluated the accuracy of the montages produced by the algorithm. To do this, we compared the alignment of the image intensities in the overlapping regions in the automated montage against that of the manual montages. We used two common image similarity measures for this evaluation. First, we evaluated the normalized cross-correlation (NCC) measure given by

NCC = \frac{1}{n} \frac{\sum_{x} [A (x) - \bar{a}] [B (x) - \bar{b}]}{σ_{A} σ_{B}},

where A and B are the intensities in the overlapping region in the two images (after aligning to the global coordinate system), ā and b̄ are the mean intensities over A and B, σ_A and σ_B are the standard deviation of the intensities over A and B, and n is the number of pixels in the region being evaluated. Normalized cross-correlation ranges from −1 to 1, where 1 represents a perfect alignment. The metric is robust to global intensity shifts, but can be unreliable in the presence of local intensity inhomogeneity. Normalized cross-correlation is widely used as a measure of image alignment and registration [17,18]. The second measure we evaluated was the normalized mutual information (NMI) metric [19], given by

NMI = \frac{H (A) + H (B) - H (A, B)}{\sqrt{H (A) H (B)}},

where

H (A) = - \sum_{i = 1}^{W} P_{A} (a_{i}) log P_{A} (a_{i}),

H (B) = - \sum_{j = 1}^{V} P_{B} (b_{j}) log P_{B} (b_{j}),

and

H (A, B) = - \sum_{i = 1}^{W} \sum_{j = 1}^{V} P_{A, B} (a_{i}, b_{j}) log P_{A, B} (a_{i}, b_{j})

are the marginal and the joint entropy [20] measures, described by the marginal [P_A(a_i) and P_B(b_j)] and joint [P_A,B(a_i, b_j))] probability density estimate of the W and V possible image intensities (a_i and b_j) in images A and B, respectively. NMI is a statistical evaluation of the image intensities in each image and their joint relationship. It measures the likelihood that the intensities in one image can be predicated using the intensities from the other image. The measure ranges from 0 to 1, with 1 representing perfect alignment. One primary advantage of NMI is that, unlike NCC, it does not assume a linear relationship between the image intensities in the two images. This allows it to be more robust to systematic intensity variations specific to particular structures or tissues. Like NCC, NMI is also widely used as metric of image similarity in image alignment and registration. [21]

In Table 1 we show the average and standard deviation of the NCC and NMI measures evaluated over all overlapping regions in the manual and automated montages for each modality. We divided the analysis across the 3 types of datasets (control, CSCR, RP). The algorithm was evaluated using both the translation only transformation model and the rigid (translation and rotation) transformation model.

Table 1. Average (standard deviation) normalized cross-correlation (NCC) and normalized mutual information (NMI) of all overlapping regions in the confocal modality over the 11 datasets after manual and automated montaging. *Significant (α = .05) improvements when compared to manual results using paired Wilcoxon test.

View Table | View all tables in this article

From our results we see that for both measures the automated montages on average produced better alignments than the manual montages. In addition, we see that this advantage is maintained regardless of the modality, and across the control, CSCR, and RP datasets. We also see that in general the rigid transformation model outperformed the translation only model. Figure 5 shows examples of a manual and automated montage of the same dataset, and Fig. 6 shows the NCC measure calculated over every overlapping region for these montages.

Fig. 5 An example of a full manual and automated (rigid) montage of the same dataset.

Download Full Size | PDF

Fig. 6 Average normalized cross-correlation of each overlapping region in the manual and automated (rigid) montage shown in Fig. 5

Download Full Size | PDF

We performed paired Wilcoxon tests between the automated and manual metrics, and found significant improvements (α = .05) for the automated results when evaluating all 11 datasets together, and when evaluating just the 8 control datasets. We also found trends suggesting improvements in the CSCR and RP cases, but could not establish significance due to the small sample size (N=2 and N=1, respectively).

3.2. Contribution from multiple modalities

For our second experiment, we evaluated the contribution the 3 modalities provided to our algorithm. We evaluated this by repeating the overlap metric experiment in Section 3.1, except that we adjusted the algorithm input to use each modality individually. In addition, we calculated the NCC and NMI measures when using all three modalities together. Table 2 shows this analysis averaged over all 11 datasets.

Table 2. Average (and standard deviation) normalized cross-correlation (NCC) and normalized mutual information (NMI) of all overlapping regions over the 11 datasets when using the automated algorithm with individual or multiple modalities as inputs. *Significant (α = .05) improvements when compared to single modality results using paired Wilcoxon test.

View Table | View all tables in this article

We see from the table that, in all cases, using all three modalities in the algorithm improved the average alignment in the overlapping regions. The most noticeable gain was in the confocal modality, while the dark field modality gained the least benefit. Performing paired Wilcoxon tests between the single modal and multi-modal results, we found significant improvements (α = .05) when using the multi-modal inputs, regardless of the modality evaluated.

3.3. Discontinuity detection

Our third experiment evaluated the reliability of the montage discontinuities detected by our algorithm. For each discontinuity found by the algorithm in each montage we checked if the discontinuity was a legitimate gap in the acquisition also found by the human rater, or if the algorithm was unable to locate the match found by the human rater. We then repeated the process for the manual montages, and determined if any discontinuities found by the human rater had a match found by the algorithm that we determined ex-post to be correct. Together, this gives us a reliability matrix (see Table 3) that represents the accuracy of the human rater and automated algorithm for detecting discontinuities in the dataset.

Table 3. Counts of the correct and incorrect discontinuities found in the manual and automated montages, over all datasets.

View Table | View all tables in this article

We see from the table that for the control and CSCR datasets, the algorithm had an additional error in each group relative to the human rater. However, for the RP dataset, the human rater found several additional incorrect discontinuities that the automated algorithm was able to match correctly. It is also worth noting that across the 11 datasets, there were no cases where the algorithm made an incorrect match when there should have been a discontinuity.

3.4. Overlap requirement of the algorithm

Upon inspection of the cases where the algorithm incorrectly marked discontinuities, we determined that all four errors were due to very limited regions of overlap in the images. In these cases, the overlap between the connecting image pairs were only 58, 64, 54, and 38 pixel columns or rows (whichever was the smaller direction) for the two control errors and two CSCR errors, respectively. This motivated our final experiment, where we assessed how much overlap is required for the algorithm to correctly make a match. To evaluate this, we took 22 overlapping image pairs (2 different pairs from each dataset) with at least 300 pixel columns of overlap. We then cropped one of the images in each pair by 5 pixel columns and reran our algorithm to determine if a match could be made using the cropped image. This was repeated until the images were cropped such that there was no longer an overlap. The alignment accuracy of our algorithm was then determined by evaluating the percentages of matches found at each amount of overlap across all the images. Figure 7 shows this accuracy at each point in the experiment for the three different dataset groups, when using each modality individually or all three modalities together as the algorithm input. We observe from the results that the control and CSCR alignment accuracy decreases when there are fewer than 75–100 pixel columns (36–48 μm) of overlap. This became more stringent for the RP images, where over 250 pixel columns (118 μm) were required for 100% alignment accuracy. We also see that in all three data groups, using all three modalities together ensured high accuracy at a lower amount of overlap than using any of the modalities individually.

Fig. 7 Plots showing the pairwise alignment accuracy using different numbers of columns (pixels) of overlap between the images. Shown are the results when aligning images from the (a) control, (b) CSCR, and (c) RP datasets when using each modality [confocal, split detection (SD), dark field (DF)] individually or all three modalities (All Modalities) together as inputs in the algorithm.

Download Full Size | PDF

4. Discussion

4.1. Algorithm accuracy, robustness, and limitations

From Tables 1 and 3, we see that our proposed automated montaging algorithm is accurate and robust, providing levels of performance comparable to manual montaging. Indeed, from our metric analysis of alignment quality in overlapping image regions, we see that on average the algorithm produces better alignment of overlapping regions than the human rater.

One interesting observation from Fig. 6 is that both the manual and automated montages had lower NCC in the fovea than in the rest of the montage. We believe there are two primary reasons for this. First, we know that each AO image contains small local distortions (caused by eye motion that occurs during acquisition of the reference frame) that prevent perfect alignment between two adjacent images. The effect of these errors on NCC is more visible in the fovea because, in our datasets, we acquire images more densely across the fovea. Since the map in Fig. 6 is the average NCC of all pairwise overlap at each point, the fovea region contained considerably more comparisons than the arms of the montage. For example, in the montage shown in Fig 6 that the foveal region contained the average of up to 10 overlapping images (45 pairwise comparisons), whereas the arms had at most 3 overlapping images (3 pairwise comparisons), with the large majority having only a single overlap. This means that many more images with their individual distortion errors are evaluated in the foveal region, thus lowering the average NCC in the region.

We believe our automated algorithm performed better than the manual rater in this regard because the algorithm first considers all possible overlapping images in the region and then chooses the pair with the best alignment. This is then repeated for subsequent overlapping images. In contrast, the manual rater typically starts from a pair of images and then adds to the montage piece by piece. In the fovea where the montage has multiple overlapping images, the local distortions make it likely that later images will not align very well with earlier images in the montage. We can observe this effect in Fig. 6 where we see that the bottom right corner of the fovea in the manual montage was aligned well, but the NCC decreases as you go clockwise around the fovea.

The second reason the NCC is likely lower in the fovea is that similarity metrics (such as NCC) are more sensitive when comparing images with high spatial-frequency structure than low frequency structure. This is the same reason that the dark field and split detection images (which are spatially smoother) typically have higher NCC and NMI than the confocal images at the same locations. In the fovea, the cones are very densely packed which results in more high frequency structure. This means that the same transformation error will cause more intensity mismatches (thus a lower NCC) in the fovea than in regions away from the fovea where the structures are larger and more sparse.

Our algorithm showed good discrimination for finding discontinuities in the 11 datasets. For the control and CSCR datasets, our algorithm performed only slightly worse than the human rater, producing an additional error in each group. However, in the RP dataset, the algorithm was able to find 9 additional matches that were missed by the human rater. This difference is possibly due to fixation errors being higher in the RP case than the control or CSCR cases, which causes each image’s nominal retinal location to be less accurate. As a result, we need to search additional images at more distant nominal locations to correctly find the matching structures for each alignment. For the human rater, increasing this search range makes it more difficult to locate corresponding structures between images, which increases the likelihood that a match is missed and determined (incorrectly) to be a discontinuity. The automated algorithm is less affected by this problem, because searching through additional image pairs only increases its runtime and has no impact on its accuracy.

In Fig. 7 we showed that our algorithm is generally robust to the amount of overlap needed for aligning two images. For the control and CSCR data, around 75 pixel columns of overlap were required to achieve 100% alignment accuracy. For our images, this meant that only about 11% of the total image had to be overlapping for the alignment to be successful. For the RP case, this requirement increased to around 35% of the image (250 pixel columns). However, we note that these values are dependent on the content of the randomly selected image pairs used in the analysis. Different areas of the retina and different availability of features in the overlapping regions can both affect the alignment accuracy. This is more likely to be relevant for the RP and CSCR analysis than the control analysis, due to the smaller number of image pairs analyzed. In addition, the signal to noise ratio (SNR) of the images is likely to impact how much overlap is required by the algorithm. For keypoint feature detection (such as SIFT), high SNR allows for more reliable feature descriptors to be found in the images. This enables more accurate feature matching with fewer false positive matches, which can lead to more accurate and robust montaging and a lower overlap requirement. Conversely, low SNR due to motion or pathology can make features more difficult to match between images, resulting in a higher overlap requirement.

4.2. Advantages of using multiple modalities

From the results in Table 2, we see that there are significant gains from using multiple modalities in the algorithm. For all three modalities, the average similarity of the overlapping regions increased when using all three modalities together, compared to using each alone. This is particularly true for the confocal modality. From the table, we see that the dark field modality gained the least improvement from using additional modalities in the algorithm. However, it is worth noting that in general the dark field images are typically much smoother and lack the high frequency structures we see in the confocal images. As a result, both similarity metrics are inherently much higher and less discriminative when assessing the dark field montage.

From Fig. 7 we see that using all three modalities together in the algorithm also reduces the amount of overlap required for the algorithm to successfully align adjacent images. In each of the three data groups (control, CSCR, and RP) we see that using all modalities together performed better in our overlap experiment than any of the individual modalities by themselves. On average, the multi-modal approach allowed overlapping regions to be 15 pixel columns smaller in the control and CSCR data, and 100 pixel columns smaller in the RP data. This advantage directly improves the general robustness of the algorithm, allowing it to tolerate larger shifts in fixation positions.

In addition, we see from the results that the algorithm can operate successfully (both separately and jointly) on 3 different modalities with very distinct features and structure representations. This suggests that our algorithm has high generality, and is not limited to just the AO images explored in our experiments. This is in line with our model design, which does not use any prior information inherent to our AO images to perform the alignment. Hence, there is good reason to believe that our algorithm may be easily adapted for montaging other retinal imaging modalities, such as flood illuminated AO images or en-face optical coherence tomography images.

4.3. Advantages over existing approach

In this work, we have added two additional features to the method presented by Li et al., 2012 [8]. Most notable is the ability to use multi-modal images, which we have shown can significantly improve alignment accuracy over mono-modal alignment using only the confocal modality. In addition, we have added the ability to detect discontinuities, which is a vital part of practical AO image analysis, since it is common to have such discontinuities in AO datasets. In addition to these technical improvements to the algorithm, we have presented an in-depth analysis of the algorithm’s capabilities and performance relative to manual montaging, which we believe supports the algorithm’s viability as a montaging tool for AO analysis.

5. Conclusion

We have presented a fully automated algorithm for montaging AO images of the retina. Compared to manual montages, our results show good alignment accuracy in overlapping regions and good identification of discontinuities in the image dataset. In addition, our method is capable of using multiple AO modalities simultaneously to further improve the alignment accuracy. Together this suggests that our algorithm can be used as a viable alternative to manual montaging that is often more accurate, and significantly less user intensive. The only manual interaction needed for our algorithm is the several minutes required to load the dataset and nominal starting positions into the software. From there, all alignment and processing is fully automated and can run unattended.

Several future directions can be taken to extend this work. Groupwise alignment using more than 2 images simultaneously could provide better alignment between images with limited overlapping regions by extending the matching area. In addition, recent developments in key point detection techniques, such as SURF [22], Affine-SIFT [23], and bag-of-feature models [24], could be incorporated to find more reliable feature matches. RANSAC also could be improved by taking into account the relative number of feature matches when deciding if an image is discontinuous with the reference image. Lastly, the method could be extended to address longitudinal alignment of AO datasets collected over extended time periods, which is an open problem particularly in the presence of pathology. The software we have presented in this work is available open source and can be downloaded at https://github.com/BrainardLab/AOAutomontaging.

Disclosure

JIWM: Canon Inc. (F), US Patent 8226236 (P)

A. Appendix

Fig. 8 Shown are (a) two adjacent confocal foveal images, (b) the SIFT matches found between the images, and (c) the remaining SIFT features after RANSAC.

Download Full Size | PDF

Funding

National Eye Institute, National Institute of Health (NEI, NIH) (P30 EY001583, U01 EY025477); Research To Prevent Blindness Stein Innovation Award, the Foundation Fighting Blindness; the F. M. Kirby Foundation; and the Paul and Evanina Mackall Foundation Trust.

Acknowledgments

We thank Drs. Alfredo Dubra, Yusufu Sulai, Benjamin J. Kim, and Albert M. Maguire.

References and links

1. J. Liang, D. R. Williams, and D. T. Miller, “Supernormal vision and high-resolution retinal imaging through adaptive optics,” J. Opt. Soc. Am. 14, 2884–2892 (1997). [CrossRef]

2. A. Roorda, A. B. Metha, P. Lennie, and D. R. Williams, “Packing arrangement of the three cone classes in primate retina,” Vision Res. 41, 1291–1306 (2001). [CrossRef] [PubMed]

3. A. Dubra, Y. Sulai, J. L. Norris, R. F. Cooper, A. M. Dubis, D. R. Williams, and J. Carroll, “Noninvasive imaging of the human rod photoreceptor mosaic using a confocal adaptive optics scanning ophthalmoscope,” Biomed. Opt. Express 2, 1864–1876 (2011). [CrossRef] [PubMed]

4. J. I. Morgan, A. Dubra, R. Wolfe, W. H. Merigan, and D. R. Williams, “In vivo autofluorescence imaging of the human and macaque retinal pigment epithelial cell mosaic,” Invest. Ophthalmol. Vis. Sci. 50, 1350–1359 (2009). [CrossRef]

5. A. Roorda, F. Romero-Borja, W. Donnelly III, H. Queener, T. Hebert, and M. Campbell, “Adaptive optics scanning laser ophthalmoscopy,” Opt. Express 10, 405–412 (2002). [CrossRef] [PubMed]

6. T. Y. Chui, D. A. VanNasdale, and S. A. Burns, “The use of forward scatter to improve retinal vascular imaging with an adaptive optics scanning laser ophthalmoscope,” Biomed. Opt. Express 3, 2537–2549 (2012). [CrossRef] [PubMed]

7. J. I. Morgan, “The fundus photo has met its match: optical coherence tomography and adaptive optics ophthalmoscopy are here to stay,” Ophthalmic. Physiol. Opt. 36, 218–239 (2016). [CrossRef] [PubMed]

8. H. Li, J. Lu, G. Shi, and Y. Zhang, “Automatic montage of retinal images in adaptive optics confocal scanning laser ophthalmoscope,” Opt. Eng. 51, 057008 (2012). [CrossRef]

9. Y. Ke and R. Sukthankar, “PCA-SIFT: A more distinctive representation for local image descriptors,” in “Proc. of 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition,”, vol. 2 (IEEE, 2004), vol. 2, pp. II–506.

10. D. Scoles, Y. N. Sulai, C. S. Langlo, G. A. Fishman, C. A. Curcio, J. Carroll, and A. Dubra, “In vivo imaging of human cone photoreceptor inner segments,” Invest. Ophthalmol. Vis. Sci. 55, 4244–4251 (2014). [CrossRef] [PubMed]

11. D. Scoles, Y. N. Sulai, and A. Dubra, “In vivo dark-field imaging of the retinal pigment epithelium cell mosaic,” Biomed. Opt. Express 4, 1710–1723 (2013). [CrossRef] [PubMed]

12. A. Standard, “American national standard for the safe use of lasers.American National Standards Institute,” Inc., New York (1993).

13. A. Dubra and Y. Sulai, “Reflective afocal broadband adaptive optics scanning ophthalmoscope,” Biomed. Opt. Express 2, 1757–1768 (2011). [CrossRef] [PubMed]

14. A. Dubra and Z. Harvey, “Registration of 2D images from fast scanning ophthalmic instruments,” in “Proc. of 2010 International Workshop on Biomedical Image Registration,” (Springer, 2010), pp. 60–71.

15. D. G. Lowe, “Object recognition from local scale-invariant features,” in “Proc. of 1999 IEEE international Conference on Computer Vision,”, vol. 2 (Ieee, 1999), vol. 2, pp. 1150–1157.

16. M. A. Fischler and R. C. Bolles, “Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography,” Commun. ACM 24, 381–395 (1981). [CrossRef]

17. B. B. Avants, C. L. Epstein, M. Grossman, and J. C. Gee, “Symmetric diffeomorphic image registration with cross-correlation: evaluating automated labeling of elderly and neurodegenerative brain,” Med. Image Anal. 12, 26–41 (2008). [CrossRef]

18. A. Sotiras, C. Davatzikos, and N. Paragios, “Deformable medical image registration: A survey,” IEEE Trans. Med. Imag. 32, 1153–1190 (2013). [CrossRef]

19. A. Strehl and J. Ghosh, “Cluster ensembles—a knowledge reuse framework for combining multiple partitions,” J. Mach. Learn. Res. 3, 583–617 (2002).

20. C. E. Shannon, “A mathematical theory of communication,” ACM SIGMOBILE Mobile Computing and Communications Review 5, 3–55 (2001). [CrossRef]

21. J. P. Pluim, J. A. Maintz, and M. A. Viergever, “Mutual-information-based registration of medical images: a survey,” IEEE Trans. Med. Imag. 22, 986–1004 (2003). [CrossRef]

22. H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool, “Speeded-up robust features (SURF),” Comput. Vis. Image Und. 110, 346–359 (2008). [CrossRef]

23. J.-M. Morel and G. Yu, “Asift: A new framework for fully affine invariant image comparison,” SIAM J. Imaging Sci. 2, 438–469 (2009). [CrossRef]

24. J. C. Niebles, H. Wang, and L. Fei-Fei, “Unsupervised learning of human action categories using spatial-temporal words,” Int. J. Comput. Vision 79, 299–318 (2008). [CrossRef]

	control (N=8)		CSR (N=2)		RP (N=1)

	NCC	NMI	NCC	NMI	NCC	NMI

Manual	0.389 (0.161)	0.059 (0.034)	0.564 (0.186)	0.097 (0.049)	0.245 (0.145)	0.039 (0.023)
Auto–Translation	0.407* (0.153)	0.062* (0.038)	0.585 (0.168)	0.103 (0.051)	0.289 (0.159)	0.044 (0.024)
Auto–Rigid	0.469* (0.139)	0.069* (0.034)	0.628 (0.155)	0.112 (0.050)	0.311 (0.149)	0.046 (0.024)

	Confocal Alignment		Split Detection Alignment		Dark Field Alignment

	NCC	NMI	NCC	NMI	NCC	NMI

Confocal Only	0.461 (0.174)	0.072 (0.041)
Split Detection Only			0.725 (0.104)	0.108 (0.035)
Dark Field Only					0.874 (0.073)	0.197 (0.057)
All Modalities	0.484* (0.160)	0.075* (0.041)	0.731* (0.099)	0.110* (0.034)	0.875* (0.072)	0.198* (0.058)

		Automated				Automated				Automated

		Correct	Error			Correct	Error			Correct	Error
Manual	Correct	11	2	Manual	Correct	2	2	Manual	Correct	4	0
Manual	Error	1	0	Manual	Error	1	0	Manual	Error	9	0

control (N = 8)				CSCR (N = 2)				RP (N = 1)

	control (N=8)		CSR (N=2)		RP (N=1)

	NCC	NMI	NCC	NMI	NCC	NMI

Manual	0.389 (0.161)	0.059 (0.034)	0.564 (0.186)	0.097 (0.049)	0.245 (0.145)	0.039 (0.023)
Auto–Translation	0.407* (0.153)	0.062* (0.038)	0.585 (0.168)	0.103 (0.051)	0.289 (0.159)	0.044 (0.024)
Auto–Rigid	0.469* (0.139)	0.069* (0.034)	0.628 (0.155)	0.112 (0.050)	0.311 (0.149)	0.046 (0.024)

	Confocal Alignment		Split Detection Alignment		Dark Field Alignment

	NCC	NMI	NCC	NMI	NCC	NMI

Confocal Only	0.461 (0.174)	0.072 (0.041)
Split Detection Only			0.725 (0.104)	0.108 (0.035)
Dark Field Only					0.874 (0.073)	0.197 (0.057)
All Modalities	0.484* (0.160)	0.075* (0.041)	0.731* (0.099)	0.110* (0.034)	0.875* (0.072)	0.198* (0.058)

Multi-modal automatic montaging of adaptive optics retinal images

Abstract

1. Introduction

2. Methods

2.1. AO Image Acquisition

2.2. Preprocessing

2.3. Problem Framework

2.4. Scale Invariant Feature Transforms (SIFT)

2.5. Random sample consensus (RANSAC)

2.6. Multi-modal matching using RANSAC

2.7. Matching criterion and discontinuity detection

3. Experiments

3.1. Alignment accuracy

3.2. Contribution from multiple modalities

3.3. Discontinuity detection

3.4. Overlap requirement of the algorithm

4. Discussion

4.1. Algorithm accuracy, robustness, and limitations

4.2. Advantages of using multiple modalities

4.3. Advantages over existing approach

5. Conclusion

A. Appendix

Funding

Acknowledgments

References and links

Cited By

Figures (8)

Tables (3)

Equations (19)

Biomedical Optics Express