
Registration of eye reflection and scene images using an aspherical eye model

Open Access

Abstract

This paper introduces an image registration algorithm between an eye reflection and a scene image. Although there are currently a large number of image registration algorithms, this task remains difficult due to nonlinear distortions at the eye surface and large amounts of noise, such as iris texture, eyelids, eyelashes, and their shadows. To overcome this issue, we developed an image registration method combining an aspherical eye model that simulates nonlinear distortions considering eye geometry and a two-step iterative registration strategy that obtains dense correspondence of the feature points to achieve accurate image registrations for the entire image region. We obtained a database of eye reflection and scene images featuring four subjects in indoor and outdoor scenes and compared the registration performance with different asphericity conditions. Results showed that the proposed approach can perform accurate registration with an average accuracy of 1.05 deg by using the aspherical cornea model. This work is relevant for eye image analysis in general, enabling novel applications and scenarios.

© 2016 Optical Society of America

1. INTRODUCTION

Eye image analysis has been studied in several different research fields. One major application is iris-texture-based personal identification (biometrics) [1,2]. Since iris textures are unique to the individual, this type of analysis is widely used in security systems. Other studies have focused on the eye surface reflection. As the cornea reflects light from the surrounding scene, visual information pertaining to the environment can be obtained from eye images. This has been formalized as a corneal imaging system [3]. The environment map (EM) obtained using this idea enables several applications, such as reconstructing scene panoramas, 3D modeling [3], super-resolution imaging [6], and illumination normalization for face recognition [5]. In particular, corneal-imaging-based point of gaze (PoG) estimation is one of the most promising applications as it provides several advantages compared to existing eye gaze tracking (EGT), such as being calibration- and parallax-error-free [4].

In these applications, finding the relation between an eye reflection and a scene image is key because it increases the number of potential application scenarios. For example, it is possible to estimate the PoG in a scene from only a pair of freely captured eye and scene photos. In iris-texture-based personal identification [7], this can enable the elimination of scene reflections from iris images to increase the reliability of the result.

In this paper, we introduce an algorithm to automatically and robustly align a pair of eye reflection and scene images as shown in Fig. 1. While there are already a large number of image registration algorithms, it is difficult to apply them with scene reflections in an eye image due to the large amount of noise, including iris texture contamination, eyelid and eyelash shadows, and nonlinear distortion from the reflection at a curved surface. The most distinctive difference from conventional image registration is the difficulty of defining a transformation function and obtaining dense correct correspondence pairs.

Fig. 1. Overview of the algorithm.

Fig. 2. (a) Cross section of the human eye and (b) a geometric eye model with two overlapping surfaces for eyeball and cornea. (c) Spherical (Q = 0) and aspherical shapes (Q = −0.1 to −0.5) for the geometric corneal model. The asphericity of the human corneal surface is Q = −0.33 ± 0.1.

Fig. 3. 3D eye pose estimation from the projected limbus (iris contour).

Fig. 4. Light ray E(u) is reflected at the aspherical corneal surface point U and reaches the image point u.

Fig. 5. Relation of eye reflection and scene images and their EMs.

Fig. 6. Initial registration algorithm using RANRESAC.

Fig. 7. Corneal imaging camera.

Fig. 8. Experimental results. In original scene images, green crosses are the key points (ground-truth points) used for evaluating registration errors. Yellow crosses and yellow lines in the resulting images indicate corresponding key points and the errors to the ground-truth points.

Fig. 9. Sensitivity of the warping function with respect to lens and asphericity. (a) Result using the corneal imaging camera, (b) result using the SLR. In both cases, the left plots show the projection results of grid points in an eye image to the scene image, where red, green, and blue markers indicate the projections for Q = −0.4, −0.2, and 0.0, respectively. The right graphs visualize the relation between the angle from the projection center and the angular difference of the projected point among different asphericities. The difference increases with the distance (angle) from the projection center.

Fig. 10. Residual errors after initial and fine registrations. The fine registration decreases errors at any asphericity.

Fig. 11. Residual errors with respect to the image positions (Q = −0.4). The x-axis shows the angle from the image center and the y-axis indicates the residual error. The red crosses are the results of the initial registration and the green asterisks are the results after the fine registration.

We developed the following approach to overcome this problem. First, we use an aspherical eye model to obtain the environment map for an eye image and formulate the alignment problem as the estimation of the 3D rotation between the environment maps of an eye reflection and a scene image, which reduces the number of parameters to be estimated. This assumption holds when the distance between the eye and the scene camera is much smaller than the distance between the eye and the scene, and it is a popular technique in making image mosaics [8–10] because it satisfies both simplicity of representation and applicability to real-world scenes.

Second, we developed a single-point registration algorithm that aligns the environment maps from a single pair of local feature correspondences, and combined it with a two-step iterative registration strategy to determine a correct and accurate registration. The single-point nature increases robustness when aligning noisy images, where it is difficult to obtain several correct correspondence pairs, and the second fine registration step increases accuracy by using dense corresponding points.

Our main contributions in this paper are as follows:

  • We present an aspherical model for the corneal shape and the formulation of light reflection at its surface, and we use these for an image registration task (Section 4).
  • Our method combines a single-point coarse registration, called random resample consensus (RANRESAC), with a fine registration step (Sections 3, 5, and 6).
  • Since the coarse registration requires only a single correct correspondence pair to compute the transformation and uses a robust verification scheme, it is robust against image noise such as eyelids, shadows, and iris textures.
  • Our fine registration decreases the registration error from 3.15 [deg] to 1.05 [deg] by using dense pairs of points taken between the original and synthesized scene images (Sections 6 and 7).

2. RELATED WORK

This section discusses related work on eye image analysis and image registration.

Eye image analysis: The iris region in an eye image is a mixture of the refracted iris texture and the corneal surface reflection of the scene illumination. As the iris texture is important for personal identification [1,2] and iris biometrics [11], several works have investigated methods to separate the iris texture and the corneal reflection. He et al. obtain a reflection map from an iris region using an adaptive thresholding approach and apply bilinear interpolation to fill in the region [11]. Tan and co-workers use labeling-based corneal reflection removal for the purpose of iris segmentation [12]. Wang et al. exploit the color chromaticity of the iris texture for this task [13]; to estimate the scene illumination, they take the consensus of corneal reflections from the images of both eyes. As these approaches rely on heuristic rules, such as assuming bright scene reflections with sharp edges or consistent chromaticity in iris colors, they perform poorly in scenes where the assumptions do not hold. We believe that this problem can be solved easily and accurately when a pixel-wise correspondence between an eye and a scene image is available. Moreover, while these approaches are purely image based, we show that explicit geometric modeling of the eye and the light reflection at the corneal surface is beneficial for this task.

A corneal imaging technique was first developed by Nishino and Nayar [3] and then several research groups conducted extensive studies [14,15]. In these works, a camera capturing an image of the eye that exhibits corneal reflections is modeled as a non-rigid catadioptric imaging system [16]. Applying this model enables the scene illumination to be reconstructed from an eye image. In this study, we apply the same geometric and optic model for reconstructing corneal reflections and solving the registration problem. The solution to the registration problem allows robust automatic mapping between eye and scene or eye and eye images, which is essential for practical implementation and extension of the various corneal imaging applications introduced in [3], including object recognition in single eye images, and 3D reconstruction and visual tracking over multiple eye images.

Image registration: There have been many studies on pairwise image alignment as it is a fundamental topic in image processing. Due to advances in local feature descriptors, such as the scale-invariant feature transform (SIFT) [17], speeded up robust features (SURF) [18], and maximally stable extremal regions (MSER) [19], feature-point-based alignments are the current state of the art. These works can be categorized by the degrees of freedom of their image warping functions, in other words, by how much deformation can be assumed in the warping. The highest-dimensional category uses pixel-wise or feature-point-wise deformation models, such as free-form deformations [20] and flow-based representations [21]. Although these models can adapt to image-region-dependent deformations, they require many correctly matched pairs of points and are therefore sensitive to image noise.

The random sample consensus (RANSAC) algorithm [22] and its extensions are popular techniques to robustly estimate the transformation parameters between noisy images. RANSAC is based on the idea that the "majority" of pairs are inliers: first, a transformation hypothesis is obtained by using randomly sampled pairs of points, and then it is verified by counting how many other points are correctly warped by the transformation. RANSAC works well for most image registration problems, even under noisy conditions, but the original RANSAC implementation is problematic in that it requires iterative sampling until a correct hypothesis is obtained, which incurs a large computational cost and does not guarantee optimality. Several studies have been devoted to solving this problem. For example, progressive sample consensus (PROSAC) introduced progressive sampling for hypothesis generation to reduce the number of iterations [23]. In this approach, diverse samples for each iteration are carefully chosen to find the correct hypothesis. Preemptive RANSAC limits the number of hypotheses and compares their quality to reduce computational and memory costs [24]. In the robotics research field, several approaches have introduced motion priors to reduce the number of random samplings and achieved real-time simultaneous localization and mapping [25,26]. These methods can dramatically decrease the iterations and computational time, thereby finding the optimal RANSAC solution efficiently, but they require a threshold value to define the inliers and do not work well in cases where inliers cannot be assumed to be the majority, i.e., in very noisy images.

McAuley and colleagues proposed an optimal point matching method based on a new graph structure and loopy belief propagation [27]. They performed experiments using the CMU house dataset and showed that their algorithm is also applicable to data that includes Gaussian noise. In recent years, Ask and colleagues used truncated norms to overcome the optimality problem [28,29]. They split each pair of points into inliers and outliers and obtained the optimal transformation by using a robust loss function. This method guarantees the optimality of the solutions but still requires task-dependent parameters to discriminate inliers from outliers.

In reality, our problem is much harder, since we can usually find only one or two correct correspondence pairs between eye and scene images. In addition, it is difficult to define a uniform threshold value because our transformation is a complicated nonlinear function, and the pixel density of the transformed points can therefore vary considerably within one image.

In this paper, we use the RANRESAC strategy that our group recently developed [30]. This algorithm is more robust against noise than the RANSAC families, and it therefore has the potential to solve our registration task. The original algorithm is verified only for images with affine transformations; here, we extend it to nonlinear deformations considering the corneal surface reflection system.

3. ALGORITHM OVERVIEW

Figure 1 illustrates our algorithm, which consists of initial and fine registration steps. In the initial registration, the 3D eye pose parameters are obtained by using the corneal boundary and prior-given geometric eye parameters. Then, we compute the inverse reflection at the corneal surface and obtain an EM of the scene from the eye image. Similarly, an EM is computed from a scene image by using the scene camera parameters. These two EMs are aligned by using the one-point algorithm that uses the correspondence of local feature points (SURF) between the eye and scene images. As a result, the initial registration produces the rotation matrix R0 between the two EMs and the warping function W0 that transforms a point in the eye image to the scene image coordinates. However, this warping function is not accurate for the entire image region due to (1) the sparsity of the feature points, (2) the fixed geometric eye parameters, and (3) errors in the eye pose estimation.

To solve these problems, we perform fine registration using additional dense correspondences of points and a simultaneous optimization of the EM rotation matrix, the geometric eye model parameters, and the eye pose estimate. In this step, we first synthesize a scene image from an eye image using the initial warping function, and then take dense corresponding points between the synthesized scene image and the original scene image using the Kanade–Lucas–Tomasi (KLT) feature tracker [31]. Since the deformation of a synthesized scene image is relatively small compared to that of the original eye image, we can obtain denser correct pairs of points than the ones used in the initial step. Using these dense correspondences, we optimize the parameters of the geometric eye model, the eye pose with respect to the eye camera, and the rotation matrix of the EMs (R0) by minimizing the residual error of the corresponding points. We iteratively perform this image synthesis and parameter optimization and obtain the resulting warping function W*.

4. EYE REFLECTION MODELING

This section describes the method to produce an EM from an eye image using a 3D aspherical eye model and model the light reflection at the corneal surface.

A. Aspherical Eye Model

Figures 2(a) and 2(b) show the cross section of the human eye and the geometric eye model. The human eye consists of two main segments: the anterior corneal surface and the posterior eyeball. Since our work treats the corneal surface reflection for image registration, the geometric model of the corneal surface is quite important.

Regarding corneal shape, existing computer vision studies have used a fixed-size sphere [3,4,6,15], but in reality, the cornea forms a conic section (an ellipsoid) whose asphericity Q is −0.33 ± 0.1 [32,33]. Since our work aims at an accurate pixel-wise image registration, we introduce an aspherical model and its reflection formulations.

Figure 2(c) shows the cross sections of the spherical and aspherical surface models while changing the asphericity from Q = 0.0 (sphere) to Q = −0.5. Taking the eye optical axis as the Z-axis and Q as the asphericity, the surface can be formalized as

$$X^2 + Y^2 + pZ^2 = r_0^2, \qquad p = 1 + Q.$$
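
As a concrete illustration (not the authors' code), the following Python sketch evaluates the aspherical surface above and its outward normal; the radii $r_0 = 8.0$ mm and $r_L = 5.75$ mm follow the fixed parameters reported in Section 7, and the function names are ours.

```python
import numpy as np

# Aspherical corneal surface: X^2 + Y^2 + p*Z^2 = r0^2, with p = 1 + Q.
# Q = 0 gives a sphere; negative Q gives the prolate (ellipsoidal) cornea.

def corneal_sag(x, y, r0=8.0, Q=-0.33):
    """Z coordinate of the corneal surface above the XY plane [mm]."""
    p = 1.0 + Q
    return np.sqrt((r0**2 - x**2 - y**2) / p)

def corneal_normal(point, Q=-0.33):
    """Unit outward normal of the implicit surface at a 3D surface point."""
    p = 1.0 + Q
    x, y, z = point
    n = np.array([x, y, p * z])        # gradient of X^2 + Y^2 + p*Z^2
    return n / np.linalg.norm(n)

if __name__ == "__main__":
    r_L = 5.75                          # limbus radius [mm] (Section 7)
    print("apex height, sphere     :", corneal_sag(0.0, 0.0, Q=0.0))
    print("apex height, Q = -0.33  :", corneal_sag(0.0, 0.0, Q=-0.33))
    print("normal at the limbus    :", corneal_normal([r_L, 0.0, corneal_sag(r_L, 0.0)]))
```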

B. Eye Pose Estimation

The 3D pose of the eye with respect to an eye camera is obtained by the shape of the circular limbus, which is described by the center point L and the normal vector g. Figure 3 illustrates the projection of the eye corneal surface to a camera image. Assuming a weak perspective projection for the eye camera, as the depth of a tilted limbus is much smaller than the distance between eye and camera, the almost circular limbus projects to an ellipse that is described by five parameters: the center iL, the major and minor radii rmax and rmin, and the rotation angle φ. As the corneal limbus coincides with the contour of the visible iris, its pose is obtained from the elliptical contour of the imaged iris [3,4].

The 3D position of the limbus center is calculated from the ellipse center and the distance to the camera $d = r_L \cdot f / r_{\max}$, as $L = d\,K_e^{-1} i_L$, where $K_e$ is the 3×3 eye camera internal matrix, $i_L$ is the ellipse center in homogeneous coordinates, and $f$ is the focal length in pixels. The gaze direction $g$, equal to the optical axis of the eye, is obtained as $g = [\sin\tau\sin\varphi \;\; -\sin\tau\cos\varphi \;\; -\cos\tau]^T$, where the angle $\tau = \pm\arccos(r_{\min}/r_{\max})$ corresponds to the tilt of the limbus plane w.r.t. the image plane, and the angle $\varphi$ is already known as the rotation of the limbus ellipse in the image plane.

Considering $X^2 + Y^2 = r_L^2$ in the limbus plane in Eq. (1), the distance $d_{LC}$ between the center of the corneal surface and the limbus center is obtained as $d_{LC} = \sqrt{(r_0^2 - r_L^2)/p}$. Therefore, the center of the corneal surface is obtained as $C = L - d_{LC}\,g$.
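
A minimal numerical sketch of this pose computation, assuming the five ellipse parameters are already given; the sign convention of the gaze vector $g$ is the common one for a camera facing the eye and may need to be flipped depending on the coordinate conventions of a particular implementation.

```python
import numpy as np

def eye_pose_from_limbus(i_L, r_max, r_min, phi, K_e, r_L=5.75, r0=8.0, Q=-0.33):
    """3D limbus center L, gaze direction g, and cornea center C (camera coords).

    i_L   : 2D ellipse center in the eye image [pixels]
    r_max : major radius, r_min : minor radius [pixels]
    phi   : ellipse rotation in the image plane [rad]
    K_e   : 3x3 eye camera intrinsic matrix
    """
    f = K_e[0, 0]                          # focal length in pixels
    d = r_L * f / r_max                    # distance from camera to limbus center
    L = d * np.linalg.inv(K_e) @ np.array([i_L[0], i_L[1], 1.0])
    tau = np.arccos(r_min / r_max)         # tilt of the limbus plane (+/- ambiguity)
    g = np.array([np.sin(tau) * np.sin(phi),
                  -np.sin(tau) * np.cos(phi),
                  -np.cos(tau)])            # optical axis / gaze direction
    p = 1.0 + Q
    d_LC = np.sqrt((r0**2 - r_L**2) / p)   # offset from limbus center to cornea center
    C = L - d_LC * g
    return L, g, C

if __name__ == "__main__":
    K_e = np.array([[2400.0, 0, 640], [0, 2400.0, 512], [0, 0, 1.0]])
    L, g, C = eye_pose_from_limbus((700, 500), 210.0, 190.0, 0.2, K_e)
    print(L, g, C, sep="\n")
```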

C. Corneal Surface Reflection

The following describes a corneal reflection model to calculate the inverse light path from an image pixel u to a corneal surface location U and estimate the scene incoming light ray E(u), as shown in Fig. 4.

We first obtain the normalized back-projection vector

$$A_e(u) = \frac{K_e^{-1} u}{\|K_e^{-1} u\|},$$

so that the surface point lies in this direction at $U = t_1\,A_e(u)$ for some distance $t_1$ from the camera. To recover $U$, we need to calculate the intersection with the corneal surface ellipsoid.

To simplify the formulation, we introduce a 4×4 homogeneous matrix $T_{IC}$ that transforms camera coordinates to corneal surface coordinates, whose z-axis is the eye optical axis, namely,

$$T_{CI} = \begin{bmatrix} R_{CI} & C \\ \mathbf{0} & 1 \end{bmatrix}, \quad R_{CI} = R_z(\varphi)\,R_x(\tau), \quad T_{IC} = T_{CI}^{-1}, \quad R_{IC} = R_{CI}^{-1},$$

where $R_x$ and $R_z$ are 3×3 rotation matrices about the x- and z-axes, respectively. Using these transformations, the light ray from the camera projection center to the corneal surface point $U$ can be formulated as

$$U = R_{IC}\,A_e(u)\,t_1 + T_{IC}\,[0\;\,0\;\,0\;\,1]^T.$$

Solving t1 using Eqs. (1) and (3), we obtain the reflection point U on the corneal surface in the corneal surface coordinates.

Finally, we compute the light reflection E(u) using the surface normal N at the point U, namely,

$$E(u) = R_{IC}A_e(u) - 2\big(R_{IC}A_e(u)\cdot N(U)\big)\,N(U), \qquad N(U) = \frac{[X_u\;\,Y_u\;\,pZ_u]^T}{\sqrt{X_u^2 + Y_u^2 + p^2 Z_u^2}},$$
where the reflection point $U$ is denoted by $(X_u, Y_u, Z_u)$ in the corneal surface coordinates. Registering the reflection rays for the complete iris region on a sphere around the cornea creates a map of the incident illumination at the eye (EM) (Fig. 5).
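
The inverse light path can be sketched as follows: the pixel ray is transformed into corneal coordinates, intersected with the ellipsoid of Eq. (1) by solving the quadratic in $t_1$, and mirrored about the surface normal. This is an illustrative implementation under our own reading of the sign conventions, not the authors' code; the pose in the demo at the bottom is an arbitrary toy configuration.

```python
import numpy as np

def backproject(u, K):
    """Unit back-projection ray A(u) of pixel u through intrinsics K."""
    v = np.linalg.inv(K) @ np.array([u[0], u[1], 1.0])
    return v / np.linalg.norm(v)

def reflect_at_cornea(u, K_e, R_IC, t_IC, r0=8.0, Q=-0.33):
    """Corneal surface point U and reflected scene ray E(u), in corneal coords.

    R_IC, t_IC : rotation / translation taking camera coords to corneal coords.
    Returns None if the pixel ray misses the corneal ellipsoid.
    """
    p = 1.0 + Q
    M = np.diag([1.0, 1.0, p])
    d = R_IC @ backproject(u, K_e)          # ray direction in corneal coords
    o = np.asarray(t_IC, dtype=float)       # camera center in corneal coords
    # Intersection: (o + t d)^T M (o + t d) = r0^2  ->  quadratic in t
    a = d @ M @ d
    b = 2.0 * (o @ M @ d)
    c = o @ M @ o - r0**2
    disc = b * b - 4.0 * a * c
    if disc < 0.0:
        return None
    t1 = (-b - np.sqrt(disc)) / (2.0 * a)   # nearest intersection (surface facing the camera)
    U = o + t1 * d
    N = np.array([U[0], U[1], p * U[2]])
    N /= np.linalg.norm(N)
    E = d - 2.0 * (d @ N) * N               # mirror reflection of the viewing ray
    return U, E

if __name__ == "__main__":
    K_e = np.array([[2400.0, 0, 640], [0, 2400.0, 512], [0, 0, 1.0]])
    R_IC = np.eye(3)                        # toy pose, for illustration only
    t_IC = np.array([0.0, 0.0, -100.0])     # camera 100 mm from the cornea center
    print(reflect_at_cornea((700, 500), K_e, R_IC, t_IC))
```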

5. INITIAL REGISTRATION: ONE-POINT REGISTRATION USING RANDOM RESAMPLE CONSENSUS

To solve the registration problem, we assume a distant scene condition under which the two optics of the cornea and scene camera share the same environment map. Thus, the alignment problem is reduced to finding a rotation matrix that transforms the environment map obtained from an eye reflection image to that of a scene image (Fig. 5). This can be formulated as

$$E(u) = R\,A_s(v), \qquad A_s(v) = \frac{K_s^{-1}[v^T\;1]^T}{\|K_s^{-1}[v^T\;1]^T\|},$$
where $u$ and $v$ are a pair of points in an eye and a scene image, and $E$ and $A_s$ are the functions that transform 2D image points to 3D points on the eye reflection EM and the scene EM, respectively. $R$ is the 3×3 rotation matrix that warps the EM of the scene camera to that of the corneal sphere, and $K_s$ is the 3×3 scene camera internal matrix.

Theoretically, R can be solved from just two pairs of points. However, in reality, since the quality of an eye image is low, there is no guarantee of obtaining multiple correct matches. To overcome this problem, we combine a single point registration algorithm from a single pair of rotation invariant features (such as SIFT, SURF, or MSER) with verification through a RANRESAC strategy. This approach contributes to the robustness of noisy image registration in two ways: (1) a single point registration requires only a single correct correspondence pair, thereby enabling robust estimation of the correct warping function even from noisy eye images; and (2) the RANRESAC strategy verifies a registration hypothesis obtained from a single correspondence pair through resampling pairs of points in accordance with the warping function. In contrast to RANSAC, this scheme does not assume the “majority” of initial correspondence pairs to be correct (inliers). It can therefore be robustly applied to noisy eye images where it is difficult to obtain multiple correct pairs. The algorithm flow is shown in Fig. 6.

A. Detection of Point Pairs

First, we directly apply local feature detection to find pairs of points in an eye and a scene image by using an orientation invariant local feature descriptor. Since the eye reflection is the mirrored image of the scene, we horizontally flip the eye image during matching, and then flip back the image with the detected feature points and adjust their orientations.

Orientation-invariant local features consist of four components: the position $x$, the feature vector $F(x)$ describing the local texture information, the orientation $\theta_x$ of the major axis of the feature descriptor, and the scale parameter $s_x$. As the detection result, we obtain $M$ pairs of points and their feature descriptors as $\{(u_i, F(u_i), \theta_i^u, s_i^u)\,|\,i=1,\ldots,M\}$ and $\{(v_i, F(v_i), \theta_i^v, s_i^v)\,|\,i=1,\ldots,M\}$ in the eye image and the scene image, respectively.
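
This detection step can be sketched with OpenCV as below. The paper uses SURF, which is not included in default OpenCV builds, so SIFT is substituted here as a stand-in that also provides position, scale, and orientation; the flip-and-unflip bookkeeping follows the description above.

```python
import cv2

def detect_and_match(eye_img, scene_img, max_matches=200):
    """SIFT matches between a horizontally flipped eye image and a scene image.

    Returns lists of (position, angle_deg, size) for the eye image (flipped back
    to its original coordinates) and the scene image.
    """
    flipped = cv2.flip(eye_img, 1)                  # mirror: corneal reflection -> scene-like
    sift = cv2.SIFT_create()
    kp_e, des_e = sift.detectAndCompute(flipped, None)
    kp_s, des_s = sift.detectAndCompute(scene_img, None)

    matcher = cv2.BFMatcher(cv2.NORM_L2, crossCheck=True)
    matches = sorted(matcher.match(des_e, des_s), key=lambda m: m.distance)[:max_matches]

    w = eye_img.shape[1]
    eye_feats, scene_feats = [], []
    for m in matches:
        ke, ks = kp_e[m.queryIdx], kp_s[m.trainIdx]
        # Undo the horizontal flip: mirror x and the descriptor orientation.
        x = w - 1 - ke.pt[0]
        angle = (180.0 - ke.angle) % 360.0
        eye_feats.append(((x, ke.pt[1]), angle, ke.size))
        scene_feats.append((ks.pt, ks.angle, ks.size))
    return eye_feats, scene_feats
```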

B. Obtaining a Warping Function

For each pair of points, we calculate a rotation matrix Ri and obtain a warping function Wi that transforms a point in an eye image to a point in a scene image. We first obtain the 3D tangent vectors at eye reflection and scene EMs by using the position and orientation information of the feature points [Fig. 5 and Eq. (5)]. Then, the Ri can be solved by aligning the vectors, namely,

$$R_i = [\hat{E}_x\;\,\hat{E}_y\;\,\hat{E}_z]\,[\hat{A}_x\;\,\hat{A}_y\;\,\hat{A}_z]^{-1},$$
where
$$\begin{aligned}
E_x &= E(u_i), & E_y &= E(u_i)\times\big(E(u_i)\times E(u_i,\theta_i^u)\big), & E_z &= E(u_i)\times E(u_i,\theta_i^u),\\
A_x &= A_s(v_i), & A_y &= A_s(v_i)\times\big(A_s(v_i)\times A_s(v_i,\theta_i^v)\big), & A_z &= A_s(v_i)\times A_s(v_i,\theta_i^v),\\
E(u,\theta^u) &= E(u + h(\theta^u)) - E(u), & A_s(v,\theta^v) &= A_s(v + h(\theta^v)) - A_s(v), & h(\theta) &= [\cos\theta\;\,\sin\theta]^T.
\end{aligned}$$

Here, $\hat{x}$ denotes the normalized vector of $x$. $E(x,\theta)$ and $A_s(x,\theta)$ are functions that obtain 3D tangent vectors on the spherical surfaces of the eye reflection and scene EMs using the 2D positions $x$ and the orientations $\theta$ of the local features of the eye and scene images, respectively. Namely, $E_x$ and $A_x$ are the directions from the EM sphere centers to the EM surface points, and $E(u,\theta^u)$ and $A_s(v,\theta^v)$ transform the orientations of the local features to the 3D spherical coordinates. Thus, $(E_y, E_z)$ and $(A_y, A_z)$ can be computed through orthogonalization using $(E_x, E(u,\theta^u))$ and $(A_x, A_s(v,\theta^v))$, respectively. As a result, we obtain the warping function that transforms a point in an eye image to a point in a scene image by using the $i$th pair of points, as in

$$W_i(u) = \frac{\begin{bmatrix}1 & 0 & 0\\ 0 & 1 & 0\end{bmatrix} K_s\,R_i^{-1}\,E(u)}{[0\;\,0\;\,1]\,K_s\,R_i^{-1}\,E(u)}.$$
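
A compact sketch of the one-point alignment, with the EM mappings $E$ and $A_s$ passed in as callables (for example, the reflection function of Section 4.C and a pinhole back-projection). The toy demo at the end uses a synthetic rotation between two identical pinhole EMs rather than real eye data, simply to check that the recovered $R_i$ matches the ground truth.

```python
import numpy as np

def _frame(dir_fn, x, theta_deg, eps=1.0):
    """Orthonormal frame from an EM direction and a local-feature orientation."""
    th = np.deg2rad(theta_deg)
    h = eps * np.array([np.cos(th), np.sin(th)])
    e = dir_fn(x)
    de = dir_fn(np.asarray(x, dtype=float) + h) - e   # finite-difference tangent E(u, theta)
    ez = np.cross(e, de)
    ey = np.cross(e, ez)
    F = np.stack([e, ey, ez], axis=1)
    return F / np.linalg.norm(F, axis=0)              # normalize each column

def one_point_rotation(E_fn, As_fn, u, theta_u, v, theta_v):
    """Rotation R_i aligning the eye-reflection EM to the scene EM from one pair."""
    Fe = _frame(E_fn, u, theta_u)
    Fa = _frame(As_fn, v, theta_v)
    return Fe @ np.linalg.inv(Fa)

def warp_point(u, E_fn, R_i, K_s):
    """W_i(u): project the eye-reflection ray of pixel u into the scene image."""
    ray = K_s @ np.linalg.inv(R_i) @ E_fn(u)
    return ray[:2] / ray[2]

if __name__ == "__main__":
    # Toy demo: both EMs are pinhole back-projections related by a known rotation.
    K = np.array([[800.0, 0, 320], [0, 800.0, 240], [0, 0, 1.0]])
    backproj = lambda x: (lambda r: r / np.linalg.norm(r))(
        np.linalg.inv(K) @ np.array([x[0], x[1], 1.0]))
    Rz = lambda a: np.array([[np.cos(a), -np.sin(a), 0.0],
                             [np.sin(a),  np.cos(a), 0.0], [0.0, 0.0, 1.0]])
    R_true = Rz(0.1)
    E_fn = lambda x: R_true @ backproj(x)             # "eye" EM = rotated scene EM
    R_est = one_point_rotation(E_fn, backproj, (400, 260), 30.0, (400, 260), 30.0)
    print("rotation recovered:", np.allclose(R_est, R_true, atol=1e-6))
    print("warped point      :", warp_point((500, 300), E_fn, R_est, K))
```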

C. Verifying Warping Functions Using RANRESAC

Now we want to choose the most appropriate warping function to transform points between the two images. In the usual RANSAC framework, this verification is achieved by counting the number of inliers, i.e., the number of points that can be correctly transformed by the warping function. However, since we cannot expect the “majority” of pairs to be inliers, we generate new point pairs in the eye and scene images for a particular warping function hypothesis, and then verify the correctness according to the similarity of the local feature vectors at these resampled point pairs.

For each warping function Wi, we randomly sample K point pairs {(uj*,Wi(uj*))|j=1,,K} in an eye and a scene image, by choosing only points within the corneal boundary that have a warped location within the boundary of the scene image.

Next, we define the likelihood of the correct warping function by considering (1) the correlation of local feature vectors describing the texture consistency, and (2) the similarity of local feature orientations. We assume Gaussian distributions for these similarity measures and obtain the following evaluation function:

$$P(W_i) = P_t(W_i)\cdot P_o(W_i) \propto \exp\!\left(-\sum_{j=1}^{K}\frac{\|F(u_j^*) - F(v_j^*)\|^2}{2\sigma_t^2}\right)\exp\!\left(-\alpha\sum_{j=1}^{K}\frac{\big(1 - \big(\hat{E}(u_j^*,\theta_{u_j^*}),\,\hat{A}_s(v_j^*,\theta_{v_j^*})\big)\big)^2}{2\sigma_o^2}\right),$$
where $v_j^* = W_i(u_j^*)$. $\sigma_t$, $\sigma_o$, and $\alpha$ are positive constants that weight the texture and orientation similarities, and $(\cdot,\cdot)$ denotes the vector inner product. Finally, we choose the $W_i$ that maximizes Eq. (8) as the initial warping function, as in
$$i^\ast = \arg\max_i P(W_i).$$

It is very important to set appropriate scale parameters for the local features at the sampled points, which define the spatial size of the descriptor. Matching performs best when the local features cover the same spatial area in a 3D scene. Therefore, we set the scale according to the size of the 3D ray volume of a pair of points. Namely, we obtain the value at each pair of points (u,v) as

$$s_i^v = \frac{1}{2\sqrt{2}}\,\big\|W_i(u + [1\;\,1]^T) - W_i(u - [1\;\,1]^T)\big\| \cdot s_i^u,$$
where $s_i^u$ is the user-defined scale parameter in the eye image and $s_i^v$ is the scale of the corresponding point in the scene image.
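
The verification can be sketched as follows. Normalized gray patches stand in for the SURF feature vectors $F$, the corneal region is approximated by a circle, and only the texture term $P_t$ of the likelihood is scored (the orientation term $P_o$ would be added analogously from the EM tangent directions); the scene-side patch size follows the scale rule just above. All of these simplifications are ours.

```python
import cv2
import numpy as np

def _patch_descriptor(img, center, size, out=16):
    """Normalized gray patch as a stand-in for the local feature vector F."""
    size = max(int(round(size)), 3)
    p = cv2.getRectSubPix(img, (size, size), (float(center[0]), float(center[1])))
    p = cv2.resize(p, (out, out)).astype(np.float32).ravel()
    p -= p.mean()
    return p / (np.linalg.norm(p) + 1e-8)

def ranresac_score(warp, eye_gray, scene_gray, iris_center, iris_radius,
                   K=50, sigma_t=0.2, scale_eye=12.0, rng=None):
    """Texture term of the RANRESAC likelihood for one warping hypothesis `warp`."""
    rng = rng or np.random.default_rng(0)
    h, w = scene_gray.shape[:2]
    score, n, attempts = 0.0, 0, 0
    while n < K and attempts < 50 * K:
        attempts += 1
        # Resample a point inside the corneal boundary (approximated by a circle) ...
        r = iris_radius * np.sqrt(rng.random())
        a = 2.0 * np.pi * rng.random()
        u = (iris_center[0] + r * np.cos(a), iris_center[1] + r * np.sin(a))
        v = warp(u)
        # ... and keep it only if its warp falls inside the scene image.
        if not (0 <= v[0] < w and 0 <= v[1] < h):
            continue
        # Scene-side scale follows the local stretch of the warp (scale rule above).
        d = np.linalg.norm(np.asarray(warp((u[0] + 1, u[1] + 1))) -
                           np.asarray(warp((u[0] - 1, u[1] - 1))))
        F_e = _patch_descriptor(eye_gray, u, scale_eye)
        F_s = _patch_descriptor(scene_gray, v, d / (2.0 * np.sqrt(2.0)) * scale_eye)
        score += -np.sum((F_e - F_s) ** 2) / (2.0 * sigma_t ** 2)   # log Gaussian term
        n += 1
    return score    # the hypothesis W_i with the highest score is chosen
```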

6. FINE REGISTRATION

Starting from the initial warping function W0 obtained by the initial step, we perform a fine registration step, as illustrated in Fig. 1. This step consists of (1) synthesizing a scene image from an eye image using the latest warping function, (2) obtaining dense corresponding points between a synthesized scene image and an original scene image using KLT, and (3) refining 3D eye pose, eye geometric parameters, rotation matrix R, and warping function through the minimization of residual error of the dense corresponding points.

A. Synthesize a Scene Image from an Eye Image

First, we synthesize a scene image from the eye image using the latest image warping function $W_n$.

B. Obtain Dense Corresponding Points Using KLT

We obtain dense corresponding points between the synthesized scene image and the original scene image using KLT. Since the synthesized scene image is more similar to the input scene image than the eye image is, we can obtain dense pairs of points. Afterward, we eliminate outlier pairs using a distance threshold and obtain new corresponding points $(u_j^+, v_j^+)$ $(j = 1,\ldots,J)$.
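
A sketch of this step with OpenCV's pyramidal Lucas–Kanade tracker; the corner detector, window size, and the 15-pixel outlier threshold are illustrative choices, not values from the paper.

```python
import cv2
import numpy as np

def dense_correspondences(synth_scene, scene, max_corners=500, dist_thresh=15.0):
    """KLT correspondences between a synthesized scene image and the real one.

    Both inputs are grayscale uint8 images of the same size. Pairs whose
    displacement exceeds `dist_thresh` pixels are discarded as outliers.
    """
    pts = cv2.goodFeaturesToTrack(synth_scene, max_corners, qualityLevel=0.01,
                                  minDistance=7)
    tracked, status, _ = cv2.calcOpticalFlowPyrLK(synth_scene, scene, pts, None,
                                                  winSize=(21, 21), maxLevel=3)
    pts, tracked = pts.reshape(-1, 2), tracked.reshape(-1, 2)
    ok = status.ravel() == 1
    ok &= np.linalg.norm(tracked - pts, axis=1) < dist_thresh
    return pts[ok], tracked[ok]      # (u_j^+, v_j^+) pairs in scene coordinates
```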

C. Refine Eye Pose, Eye Geometry, EM Rotation Parameters

We optimize the eye pose parameters $(\tau, \varphi)$, the distance $d$, and the three rotation parameters of the environment map rotation matrix $R$ by minimizing the residual error of the corresponding points, $\sum_{j=1}^{J}\|u_j^+ - v_j^+\|^2$, using Levenberg–Marquardt optimization [34]. Using the resulting parameters, we update the warping function $W_{n+1}$. We iterate these steps until convergence and obtain the final warping function.
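
One way to reproduce this refinement step is scipy.optimize.least_squares with method='lm' (Levenberg–Marquardt). The sketch below assumes a user-supplied build_warp function that maps a parameter vector (tau, phi, d, and the three EM rotation parameters) to a warping function, and that the eye-image preimages of the tracked points are available; both are placeholders for the components described above, not the authors' implementation.

```python
import numpy as np
from scipy.optimize import least_squares

def refine_parameters(params0, build_warp, pts_eye, pts_scene):
    """Levenberg-Marquardt refinement of the registration parameters.

    params0    : initial vector, e.g. (tau, phi, d, rx, ry, rz)
    build_warp : maps a parameter vector to a warping function W(u) -> scene point
    pts_eye    : eye-image points whose warps generated the tracked synthesized-
                 scene points (one per KLT correspondence)
    pts_scene  : the matched points v_j^+ in the original scene image
    """
    pts_scene = np.asarray(pts_scene, dtype=float)

    def residuals(params):
        W = build_warp(params)
        warped = np.array([W(u) for u in pts_eye])
        return (warped - pts_scene).ravel()     # stacked x/y residuals

    result = least_squares(residuals, params0, method="lm")
    return result.x, build_warp(result.x)
```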

7. EXPERIMENTS

We conducted two experiments with four subjects for several indoor and outdoor scenes. Corneal reflections and scene images were taken at the same time.

In our first experiment, we examined the registration accuracy while changing the asphericity to determine the effectiveness of the proposed aspherical corneal surface model and registration algorithms. The second experiment was performed to evaluate the robustness of the registration, namely, we compared the number of frames where the initial registration was successful using the proposed RANRESAC algorithm, naïve one-point registration, two-point RANSAC, and two-point optimal RANSAC algorithms.

All experiments were conducted using the SURF descriptor under MATLAB 2013b on an Intel Core i7 3.2 GHz, 16 GB RAM PC. The parameter values are fixed for all experiments as (rL,r0)=(5.75mm,8.0mm) and (σt,σo)=(0.2,0.2) and α=1.0.

A. Cameras

We used two types of cameras for taking eye and scene images. A corneal imaging camera obtained a close-up view of corneal reflections, and a single-lens reflex (SLR) camera captured one outdoor scene to evaluate the applicability of the proposed method for a conventional consumer hardware setup. All cameras and lenses were calibrated by using Zhang's algorithm [35].

Corneal imaging camera: For the purpose of obtaining clear eye and scene images, we designed a corneal imaging camera that can capture both images simultaneously. This camera consists of a head rig and two camera modules (IDS UI-1241LE-C-HQ, 1/1.8 in. CMOS, 1280×1024 [pixel]) (Fig. 7). The eye camera has a 12 mm lens {(H,V)=(33.3,24.8) [deg]} and the scene camera has a 4 mm lens {(H,V)=(83.8,60.1) [deg]}. The device can capture a close-up view of an eye with an iris diameter of 400–450 [pixel] at a distance of 70–110 [mm]. To obtain corneal reflections under dark illumination conditions, we adjust the sensor sensitivity parameters, including gain and exposure. The cameras are connected to a PC, where images are captured at 10 fps.

SLR: We used a Nikon D800E {(H,V)=(7360,4912) [pixel]} with a facial lens (SIGMA 24–70 mm, f=2.8 IF EX DG HSM, viewing angle (H,V)=(30.8,21.4) [deg]) and a wide-angle scene lens (SIGMA 15 mm EX Diagonal Fisheye, f=2.8, viewing angle (H,V)=(92.72,69.97) [deg]). In this setup, we sequentially captured eye and scene images while changing the camera position and lens. The facial image is taken about 400 [mm] away from the subject's face [Fig. 8(e)], with an iris diameter of 420–460 [pixel].

B. Dataset

We acquired eye and scene images for four subjects in five scenes including indoor and outdoor environments. Table 1 lists the scenes and cameras used.

Table 1. Corneal Reflection and Scene Dataset

Regarding the scenes using the corneal imaging camera, the distance from eye camera to corneal surface was about 100 [mm]. Regarding the scene using the SLR, subjects sat in front of the camera about 400 [mm] away. In all images, corneal boundaries were detected by combining edge detection and RANSAC-based ellipse parameter estimation.
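
The corneal boundary detection mentioned here can be sketched as a Canny edge map followed by a RANSAC loop over five-point ellipse fits; the thresholds and iteration counts below are illustrative, and the inlier test uses a simple algebraic approximation of the point-to-ellipse distance.

```python
import cv2
import numpy as np

def fit_limbus_ellipse(eye_gray, n_iter=200, tol=0.08, rng=None):
    """Corneal (limbus) boundary as an ellipse via Canny edges + RANSAC.

    eye_gray is a grayscale uint8 image. Returns the OpenCV ellipse
    ((cx, cy), (2a, 2b), angle_deg) refit on the largest inlier set.
    """
    rng = rng or np.random.default_rng(0)
    edges = cv2.Canny(eye_gray, 50, 150)
    pts = np.column_stack(np.nonzero(edges)[::-1]).astype(np.float32)  # (x, y)

    def inliers(ellipse):
        (cx, cy), (d1, d2), ang = ellipse
        a, b, t = d1 / 2.0, d2 / 2.0, np.deg2rad(ang)
        if min(a, b) < 1.0:                       # degenerate fit
            return np.zeros(len(pts), bool)
        c, s = np.cos(t), np.sin(t)
        x = (pts[:, 0] - cx) * c + (pts[:, 1] - cy) * s
        y = -(pts[:, 0] - cx) * s + (pts[:, 1] - cy) * c
        return np.abs((x / a) ** 2 + (y / b) ** 2 - 1.0) < tol

    best, best_count = None, -1
    for _ in range(n_iter):
        sample = pts[rng.choice(len(pts), 5, replace=False)]
        ellipse = cv2.fitEllipse(sample)
        count = int(np.count_nonzero(inliers(ellipse)))
        if count > best_count:
            best, best_count = ellipse, count
    return cv2.fitEllipse(pts[inliers(best)])     # refit on the inlier set
```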

C. Sensitivity of Image Warping Function with Respect to Asphericity

We first perform a simulation of corneal surface reflection under different asphericity to observe the relation between asphericity and projection in a scene image.

In the simulation, we assume the eye optical axis is opposite to the optical axis of the eye camera, and set the radius of the imaged limbus to 200 pixels (corneal imaging camera) and 230 pixels (SLR). Figure 9 shows the results. The left plots show the projections of grid points in an eye image to a scene image, where red, green, and blue markers indicate the projections for different asphericities (Q = 0.0, −0.2, −0.4). The right graphs show the relation between the angle from the projection center and the angular difference of the projected point between asphericities (Q = −0.2 vs. −0.4 and Q = 0.0 vs. −0.4). The difference in projection increases with the angle from the projection center. The maximum difference between the projections for Q = 0.0 and −0.4 is 1.23 [deg] (33 [pixels]) with the corneal imaging camera and 2.1 [deg] (334 [pixels]) with the SLR. From these results, we can observe that the asphericity causes up to a 2.1 [deg] difference in our configuration, which indicates that the proposed ellipsoid model has the potential to increase the registration accuracy.

D. Accuracy of the Registration

Next, we evaluate the registration accuracy using the dataset. Using the images where the initial registration succeeded, we chose 10–15 key points in the scene and evaluated the error between the original and synthesized scene images.

Table 2 and Fig. 8 show the experimental results. In Fig. 8, green markers in the original scene images indicate key points. Yellow markers and lines in the synthesized scene images indicate the corresponding key points and the errors to the original key points. As seen in Table 2 and Fig. 10, the second fine registration is effective in increasing the quality of the registration; namely, this step reduces the error from 3.15 [deg] to 1.16 [deg] on average. Figure 11 shows the residual error with respect to the location of the key points, where the x-axis indicates the angle from the image center to the key points and the y-axis is the angular error. From this figure, we can see that the residual error of the initial registration is larger when the key point is located near the image boundary. However, the fine registration step decreases the error over the entire image region.

Table 2. Experimental Results (Accuracy)

In order to observe the effectiveness of the proposed aspherical model, we set three asphericity parameters (Q = 0.0, −0.2, −0.4) and performed the registration. The residual errors after the fine registration were 1.05 [deg] (Q = −0.4), 1.18 [deg] (Q = −0.2), and 1.27 [deg] (Q = 0.0) on average. Therefore, though the difference is relatively small, the ellipsoid model (Q = −0.2 or −0.4) performed better than the spherical model (Q = 0.0).

E. Robustness of the Registration

We evaluate the robustness of the initial registrations of the proposed approach (one-point RANRESAC), naïve one-point registration, two-point RANSAC, and two-point optimal RANSAC [36]. In this experiment, we use images taken by the corneal imaging camera, as it can take multiple images from each subject, and assume a spherical corneal model (Q=0.0).

Naïve one-point registration: We find a rotation matrix and a warping function in the same way as with the proposed RANRESAC approach. All other pairs of points are used for verification by counting the number of points that are correctly transformed using the warping function. Namely, if the point in a corneal image is transformed close to the corresponding point in a scene image, it is assumed to be a correct pair. We set the distance threshold to 40 pixels (about 4% of the scene image size). This is performed for all pairs to find the best one. If multiple best pairs exist, we choose the pair with the smallest mean error over the inliers. Since this approach uses all pairs of points, it finds the optimal solution under the scheme in which a single pair of points generates the hypothesis and the distance between the paired points verifies it.

Two-point RANSAC: We randomly choose two pairs of points from the local feature correspondences and estimate a rotation matrix and a warping function. The verification is performed in the same way as naïve one-point registration. This is iterated 500 times for each frame to find the best warping function.

Two-point optimal RANSAC: Similar to the two-point RANSAC, we perform the optimal RANSAC using two pairs of points. Regarding the parameters, we set maxDataTrials, which defines the maximum number of attempts to select a non-degenerate dataset, as 100, and maxTrials, which defines the maximum number of iterations, as 1000.

We count the number of frames where the algorithms could choose correct pairs of points. Table 3 lists the results, specifically, the number of frames where the initial registration succeeded. The proposed RANRESAC approach had a success rate of 86.0%, which is superior to the other three methods.

Table 3. Experimental Results (Registration Robustness), Showing the Number of Frames Where the Registration Succeeds

8. APPLICATIONS

The proposed algorithm enables passive matching of eye reflection and scene images, which significantly expands the potential applications of corneal imaging techniques. In this section, we describe two showcases that are enabled from our approach.

A. Point of Gaze Estimation from Uncalibrated Images

Conventional EGT systems require a calibrated and fixed geometric relation between an eye camera and a scene camera, to obtain the mapping from the gaze direction in the eye camera frame to the PoG in the scene. This involves tedious calibration procedures and manual parallax compensation when the relation changes due to common practical conditions, such as arbitrary non-planar scene geometry, varying scene distance, and drift of components in wearable devices. In contrast, the proposed technique calculates a direct mapping between an eye reflection and a scene image at each frame, which eliminates the restrictions and supports arbitrary scene and camera configurations in wearable and remote setups. This results in an uncalibrated, non-intrusive, and instantly available EGT system.

Figure 12 shows the algorithm flow. First, we compute the reflection of the PoG in the eye image. Here, we apply the forward-projection method in [4] to calculate the gaze reflection point (GRP) uGRP, where the light from the PoG reflects at the corneal surface into the eye camera. Then, applying the warping function from the registration process allows us to calculate the corresponding PoG in a scene image at W(uGRP). The advantage of this strategy is that it only requires image registration, without any geometric calibration as in conventional EGT systems. Moreover, it supports a dynamic relation between eye and scene cameras, which allows for a stationary scene camera or even previously captured image and video.

Fig. 12. PoG estimation using the proposed approach.

Figure 13 shows a scene image with an overlaid trajectory of estimated PoGs. Here, we combined an eye camera and a stationary scene camera with a fisheye lens, where the eye reflection images are matched to a previously captured scene image, such as one from the Google Streetview database. This opens up new potential applications and configurations of EGT systems, such as a small hardware implementation embedded in head-mounted displays (HMDs) or mobile systems.

Fig. 13. PoG estimation result in an outdoor city scene. Green squares show PoGs and red lines show the gaze trajectories.

B. Peripheral Vision Estimation

Similar to the concept of PoG estimation, we can obtain a peripheral vision map overlaid on a scene image using the warping function from the proposed method. For each point in an eye image, we obtain the 3D light ray incident to the eye using Eq. (4). This allows us to compute the angle between the gaze direction and the image point $u$ in an eye reflection image as $\arccos\!\big((E(u_{\mathrm{GRP}}),\,E(u))\big)$, which is then mapped to the scene image using the warping function $W(u)$. Figure 14 shows an example with a scene image captured by an SLR camera with a wide-angle lens (Nikon D800E, Sigma 15 mm, F2.8). The peripheral vision map shows that the eye surface reflects incident light from a range of 200 deg, which is larger than the perceived field of view of the human eye.
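
A short sketch of this computation, with the incident-ray function E and the warping function W passed in as callables (e.g., the components from Sections 4 and 6); it is an illustration of the angle-and-warp mapping described above, not the authors' code.

```python
import numpy as np

def peripheral_angle_map(eye_points, u_grp, E_fn, warp):
    """Visual angle from the gaze direction for each eye-image point,
    together with its location in the scene image.

    E_fn : function returning the unit incident-light ray E(u) for a pixel u
    warp : the registration warping function W(u) -> scene-image point
    """
    e_gaze = E_fn(u_grp)
    out = []
    for u in eye_points:
        cos_a = np.clip(np.dot(e_gaze, E_fn(u)), -1.0, 1.0)
        ang = np.degrees(np.arccos(cos_a))
        out.append((warp(u), ang))      # (scene position, viewing angle in deg)
    return out
```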

Fig. 14. Peripheral vision estimation results for two scenes. The pictures show the viewing angles overlaid to an eye reflection image (left) and a scene image (right). The center of the circular contours marks the PoG at 0 deg, from which contours are drawn at 10 deg increments.

C. Other Applications: Biometrics, Virtual and Mixed Realities

The proposed technique expands the potential of other eye imaging applications. A promising task is the separation of iris texture and corneal reflections to increase the reliability of iris biometrics [7]. This will enable novel identification and security scenarios, such as those involving surveillance cameras and consumer devices (smartphones).

Another potential application will be in virtual and mixed reality, namely, the calibration of optical see-through HMDs or head-up displays. Currently these systems require users to perform eye-display calibrations for accurate registration of scene and virtual images. Our image registration algorithm can resolve this issue by knowing the relation between the virtual and scene images through corneal reflections. As a result, a system can overlay virtual images or texts onto a scene object using our technique.

9. DISCUSSION

Registration accuracy: Through the two-step registration, the average error after fine registration is 1.05 [deg] when the asphericity Q is −0.4. This corresponds to about 14 pixels at the image center and 28 pixels at the image boundary under our experimental setup using the corneal imaging camera {(H,V)=(1280,1024) [pixels]}. According to the comparison across different asphericities, the ellipsoid model (Q = −0.2 and −0.4) performed slightly better than the spherical model (Q = 0.0). The difference between the best performer (Q = −0.4) and the spherical model is 0.22 [deg] on average, which is about 3 pixels at the image center and 6 pixels at the image boundary. This result is consistent with the asphericity of the true human corneal surface, namely Q = −0.33 ± 0.1.

However, finding optimal asphericity in the fine registration step is quite difficult due to (1) dependency on the other parameters, such as eye pose and distance; and (2) sensitivity to the errors of pairs of points between eye and scene images. Thus, we fixed the asphericity in the fine registration step and compared the registration error in the experiment.

Eye pose estimation: Although it is not the main focus of this paper, eye pose estimation is necessary for our registration algorithm. We used a simple edge detection and RANSAC-based ellipse fitting approach; however, it is not accurate when the corneal boundary is largely occluded by the eyelids. Thus, our system can be used with other state-of-the-art methods, such as the Starburst algorithm [37], active IR lighting [4], machine learning [38], and algorithms that achieve real-time eye pose estimation [39–41].

Effects of iris textures and pupils: Iris texture and darker pupils are expected to have an effect on the registration performance. However, according to the experimental results, there is no significant difference between individuals, even though we took data from subjects with a variety of eye colors and transparencies. Thus, the iris texture and pupil images do not seriously affect the registration, for two reasons. First, the image features of iris texture and pupil edges are quite different from those of scenes, so few pairs of points connecting iris texture or pupil edges with scene points are found. Second, even if wrong pairs of points are chosen, RANRESAC rejects them, since it evaluates the global similarity of local feature points between the corneal reflection and scene images by using newly resampled points.

Unusual eye shapes: The asphericity of the typical corneal shape is −0.33 ± 0.1, which deviates only slightly from a sphere [Fig. 2(c)]. However, it is known that subjects with keratoconus have a much sharper, more strongly conical corneal shape. Although we could not gather any data from such subjects in this experiment, we believe our aspherical model can support this case by adjusting the asphericity parameter Q to express sharper surfaces.

10. CONCLUSION

In this paper, we presented a novel approach for the robust and accurate registration of an eye reflection and a scene image in a fully automated way. Since it is difficult to obtain multiple correct correspondence pairs from noisy reflection images, our initial registration achieves robustness by (1) reducing the number of estimated parameters by treating the problem as the alignment of spherical environment maps, (2) requiring only a single correct correspondence pair to compute the transformation, and (3) verifying transformations using a RANRESAC strategy. In addition, we performed a second, fine registration step that decreases the registration error using dense pairs of points taken between the original and synthesized scene images. As a result, our algorithm achieves an average accuracy of 1.05 [deg] when we use an ellipsoid corneal model whose asphericity Q is set to −0.4, which is better than the result using the spherical cornea model.

Funding

Japan Society for the Promotion of Science (JSPS) KAKENHI (26280058, 26249029, 15H02738).

REFERENCES

1. K. W. Bowyer, K. Hollingsworth, and P. J. Flynn, “Image understanding for iris biometrics: a survey,” Comput. Vis. Image Underst. 110, 281–307 (2008). [CrossRef]  

2. L. Ma, T. Tan, S. Member, Y. Wang, and D. Zhang, “Personal identification based on iris texture analysis,” IEEE Trans. Pattern Anal. Mach. Intell. 25, 1519–1533 (2003). [CrossRef]  

3. K. Nishino and S. K. Nayar, “Corneal imaging system: environment from eyes,” Int. J. Comput. Vis. 70, 23–40 (2006). [CrossRef]  

4. A. Nakazawa and C. Nitschke, “Point of gaze estimation through corneal surface reflection in an active illumination environment,” in Proceedings of European Conference on Computer Vision (ECCV) (Springer-Verlag, 2012), pp. 159–172.

5. K. Nishino, P. N. Belhumeur, and S. K. Nayar, “Using eye reflections for face recognition under varying illumination,” in Proceedings of IEEE International Conference on Computer Vision (ICCV) (2005), pp. 519–526.

6. C. Nitschke and A. Nakazawa, “Super-resolution from corneal images,” in Proceedings of British Machine Vision Conference (BMVC) (BMVA, 2012), pp. 22.1–22.12.

7. J. G. Daugman, “High confidence visual recognition of persons by a test of statistical independence,” IEEE Trans. Pattern Anal. Mach. Intell. 15, 1148–1161 (1993). [CrossRef]  

8. S. Lovegrove and A. J. Davison, “Real-time spherical mosaicing using whole image alignment,” in Computer Vision—ECCV 2010 (Springer, 2010), pp. 73–86.

9. M. V. S. Sakharkar and S. Gupta, “Image stitching techniques-an overview,” Int. J. Comput. Sci. Appl. 6, 324–330 (2013).

10. H.-Y. Shum and R. Szeliski, “Construction of panoramic image mosaics with global and local alignment,” in Panoramic Vision (Springer, 2001), pp. 227–268.

11. Z. He, T. Tan, Z. Sun, and X. Qiu, “Toward accurate and fast iris segmentation for iris biometrics,” IEEE Trans. Pattern Anal. Mach. Intell. 31, 1670–1684 (2009). [CrossRef]  

12. T. Tan, Z. He, and Z. Sun, “Efficient and robust segmentation of noisy iris images for non-cooperative iris recognition,” Image Vis. Comput. 28, 223–230 (2010). [CrossRef]  

13. H. Wang, S. Lin, X. Liu, and S. B. Kang, “Separating reflections in human iris images for illumination estimation,” in Proceedings of International Conference on Computer Vision (ICCV) (2005), Vol. 2, pp. 1691–1698.

14. M. Backes, T. Chen, M. Dürmuth, H. P. A. Lensch, and M. Welk, “Tempest in a teapot: Compromising reflections revisited,” in Proceedings of IEEE Symposium on Security and Privacy (SP) (2009), pp. 315–327.

15. C. Nitschke, A. Nakazawa, and H. Takemura, “Corneal imaging revisited: an overview of corneal reflection analysis and applications,” IPSJ Trans. Comput. Vis. Appl. 5, 1–18 (2013). [CrossRef]  

16. P. Sturm, S. Ramalingam, J.-P. Tardif, S. Gasparini, and J. Barreto, “Camera models and fundamental concepts used in geometric computer vision,” Found. Trends Comput. Graph. Vis. 6, 1–183 (2011). [CrossRef]  

17. D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” Int. J. Comput. Vis. 60, 91–110 (2004).

18. H. Bay, T. Tuytelaars, and L. V. Gool, “SURF: speeded up robust features,” in Proceedings of European Conference on Computer Vision (ECCV), A. Leonardis, H. Bischof, and A. Pinz, eds., Vol. 3951 in Lecture Notes in Computer Science (Springer, 2006), pp. 404–417.

19. J. Matas, O. Chum, M. Urban, and T. Pajdla, “Robust wide baseline stereo from maximally stable extremal regions,” in Proceedings of British Machine Vision Conference (BMVC) (BMVA, 2002), pp. 36.1–36.10.

20. K. Fujiwara, K. Nishino, J. Takamatsu, B. Zheng, and K. Ikeuchi, “Locally rigid globally non-rigid surface registration,” in Proceedings of IEEE International Conference on Computer Vision (ICCV) (2011), pp. 1527–1534.

21. C. Liu, J. Yuen, A. Torralba, J. Sivic, and W. T. Freeman, “SIFT flow: dense correspondence across different scenes,” in Proceedings of European Conference on Computer Vision (ECCV) (Springer-Verlag, 2008), pp. 28–42.

22. M. A. Fischler and R. C. Bolles, “Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography,” Commun. ACM 24, 381–395 (1981). [CrossRef]  

23. O. Chum and J. Matas, “Matching with PROSAC-progressive sample consensus,” in IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR) (IEEE, 2005), Vol. 1, pp. 220–226.

24. D. Nistér, “Preemptive RANSAC for live structure and motion estimation,” Mach. Vis. Appl. 16, 321–329 (2005). [CrossRef]  

25. J. Civera, O. G. Grasa, A. J. Davison, and J. Montiel, “1-point RANSAC for extended Kalman filtering: application to real-time structure from motion and visual odometry,” J. Field Rob. 27, 609–631 (2010). [CrossRef]  

26. D. Scaramuzza, F. Fraundorfer, and R. Siegwart, “Real-time monocular visual odometry for on-road vehicles with 1-point RANSAC,” in IEEE International Conference on Robotics and Automation (ICRA) (IEEE, 2009), pp. 4293–4299.

27. J. J. McAuley, T. S. Caetano, and M. S. Barbosa, “Graph rigidity, cyclic belief propagation, and point pattern matching,” IEEE Trans. Pattern Anal. Mach. Intell. 30, 2047–2054 (2008). [CrossRef]  

28. E. Ask, O. Enqvist, and F. Kahl, “Optimal geometric fitting under the truncated l2-norm,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2013), pp. 1722–1729.

29. E. Ask, O. Enqvist, L. Svarm, F. Kahl, and G. Lippolis, “Tractable and reliable registration of 2d point sets,” in Computer Vision—ECCV 2014, D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars, eds., Vol. 8689 of Lecture Notes in Computer Science (Springer, 2014), pp. 393–406.

30. A. Nakazawa, "Noise stable image registration using random resample consensus," in Proceedings of International Conference on Pattern Recognition (ICPR) (IAPR, 2016).

31. B. D. Lucas and T. Kanade, “An iterative image registration technique with an application to stereo vision,” in International Joint Conference on Artificial Intelligence (1981), Vol. 81, pp. 674–679.

32. M. Guillon, D. P. Lydon, and C. Wilson, “Corneal topography: a clinical model,” Ophthal. Physiol. Opt. 6, 47–56 (1986). [CrossRef]  

33. J. Ying, B. Wang, and M. Shi, “Anterior corneal asphericity calculated by the tangential radius of curvature,” J. Biomed. Opt. 17, 0750051 (2012). [CrossRef]  

34. J. J. Moré, “The Levenberg-Marquardt algorithm: implementation and theory,” in Numerical Analysis (Springer, 1978), pp. 105–116.

35. Z. Zhang, “A flexible new technique for camera calibration,” IEEE Trans. Pattern Anal. Mach. Intell. 22, 1330–1334 (2000). [CrossRef]  

36. A. Hast and J. Nysjö, “Optimal RANSAC-towards a repeatable algorithm for finding the optimal set,” J. WSCG 21, 21–30 (2013).

37. D. Li, D. Winfield, and D. J. Parkhurst, “Starburst: a hybrid algorithm for video-based eye tracking combining feature-based and model-based approaches,” in 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR)-Workshops (IEEE, 2005), pp. 79.

38. E. Wood, T. Baltruaitis, X. Zhang, Y. Sugano, P. Robinson, and A. Bulling, “Rendering of eyes for eye-shape registration and gaze estimation,” in 2015 IEEE International Conference on Computer Vision (ICCV) (IEEE, 2015), pp. 3756–3764.

39. M. Barbosa and A. C. James, “Joint iris boundary detection and fit: a real-time method for accurate pupil tracking,” Biomed. Opt. Express 5, 2458–2470 (2014). [CrossRef]  

40. W. Zhang, M. L. Smith, L. N. Smith, and A. Farooq, “Eye center localization and gaze gesture recognition for human-computer interaction,” J. Opt. Soc. Am. A 33, 314–325 (2016). [CrossRef]  

41. A. Nakazawa, C. Nitschke, and T. Nishida, “Non-calibrated and real-time human view estimation using a mobile corneal imaging camera,” in International Conference on Multimedia & Expo Workshops (ICMEW) (IEEE, 2015), pp. 1–6.
