A two-stage keypoint registration approach is proposed to achieve frame-rate performance while maintaining high accuracy under large perspective and scale variations. First, an agglomerative clustering algorithm based on an effective edge significance measure derives corresponding regions for keypoint detection. Next, a light-weight detector and a compact descriptor obtain the exact locations of the keypoints. In conjunction with the point transferring method, the proposed approach can perform registration robustly in textureless regions. Experiments demonstrate that the approach can handle real-time tracking tasks.
© 2009 OSA
Real-time keypoint registration is a crucial component of many practical applications in computer vision, ranging from video retrieval to augmented reality (AR). Although current local feature algorithms, such as SIFT [1], GLOH [2] and SURF-128 [3], achieve notable performance in static image registration under large viewpoint and illumination changes, the additional processing they require to eliminate second-order effects (including skew and anisotropic scaling) in the feature detection and description stages incurs considerable computational cost and reduces the number of correspondences. Moreover, these algorithms are too time-consuming to attain frame-rate performance. Lepetit et al. advocated formulating wide-baseline matching as a classification problem [4,5], training the classifier by generating numerous synthetic views of the keypoints as they would appear under different perspective or scale distortions. This shifts much of the computational burden to the training stage without sacrificing registration performance, and is thus sufficiently fast for real-time applications. However, this type of algorithm requires large memory for building view-sets and randomized trees, and accordingly the training stage is usually quite costly in run time.
Pursuing a good compromise among accuracy, robustness and time cost, we explore a two-stage keypoint registration approach, designated R3KR, an acronym for ‘Region Restricted Rapid Keypoint Registration’. First, feature regions with high distinctiveness are detected in two images using an effective edge significance measure, and their correspondences are then obtained. Next, a light-weight local feature registration algorithm obtains the exact locations of the keypoints within the two corresponding regions. Note that keypoint extraction in the second stage is performed on the matched feature regions, which are derived with invariance to affine transformation. The influences of scale, illumination, contrast and rotation are also taken into consideration in the registration process of the second stage to further increase robustness. Thus, the proposed approach is theoretically affine invariant. Furthermore, benefiting from the two-stage framework, from region correspondence to keypoint correspondence, the time cost is substantially reduced. We have applied the proposed algorithm to markerless tracking in an AR environment and obtained desirable results.
The remainder of this paper is organized as follows. Section 2 introduces the implementation details of the proposed algorithm. In Section 3, experimental results are presented. An application for markerless AR tracking is also introduced in this section. Lastly in Section 4, conclusions are presented.
2. Region restricted rapid keypoint registration
2.1 Region correspondence stage
As mentioned above, this stage derives the corresponding regions for keypoint detection. A straightforward way to realize it is to use image segmentation methods. The highly recognized maximally stable extremal region (MSER) detector proposed by Matas et al. [6] employs a watershed-like algorithm to derive the feature regions and yields remarkable performance [7]. Forssén further extended MSER to color space utilizing an effective edge significance measure [8]. Inspired by their work, we obtain the candidate regions using the following procedures.
As is known, most image acquisition devices are photon counters. When a large number of photons hit the surface of a charge-coupled device (CCD), the image noise can be modeled by a discrete Poisson distribution, which is well approximated by a Gaussian distribution [9]. Given the expected intensity μ, the probability of the measured intensity I can be defined as

p(I | μ) ≈ (1 / √(2πμ)) exp(−(I − μ)² / (2μ))
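As a quick numerical sanity check of this Gaussian approximation to the Poisson photon-count model (a standalone sketch; the expected intensity of 100 is an arbitrary choice):

```python
import math

def poisson_pmf(k, mu):
    # P(I = k) for Poisson-distributed photon counts with mean mu
    return math.exp(-mu) * mu**k / math.factorial(k)

def gaussian_pdf(x, mu):
    # Gaussian approximation with matching mean mu and variance mu
    return math.exp(-(x - mu)**2 / (2.0 * mu)) / math.sqrt(2.0 * math.pi * mu)

mu = 100.0  # illustrative expected intensity
for k in (90, 100, 110):
    print(f"I={k}: Poisson={poisson_pmf(k, mu):.5f}  Gaussian={gaussian_pdf(k, mu):.5f}")
```

For intensities of this magnitude the two curves agree to within about 1e-3, which justifies treating the sensor noise as Gaussian downstream.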
Defining two pixels as belonging to the same contiguous region if their edge significance is smaller than a certain threshold d thr, an image evolution process is obtained by varying d thr as the time step t increases. However, the region area grows faster in the beginning and slower towards the end. In order to make the image evolution approximately proportional to the time steps, Ref. [8] introduced an inverse conversion of the CDF instead of directly measuring the distance threshold. Thus, it can be constructed as
Consequently, the distance threshold at time step t can be computed as
The expansion rate r at time step t is defined as
Only the largest of the nested feature regions in an evolution process is reserved. Regions whose areas differ by less than 15% are defined as overlapping regions and are also pruned. We also restrict the number of feature regions by cardinality in order to reduce redundant computational cost. Each remaining region is then approximated by an ellipse whose centroid and covariance are derived from the region's raw moments [10].
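The moment-based ellipse fit can be sketched as follows (a standalone illustration, not the paper's exact implementation; the region is given as a plain list of pixel coordinates):

```python
import math

def region_ellipse(pixels):
    """Fit an ellipse to a pixel region via its raw moments.
    pixels: iterable of (x, y) coordinates belonging to the region."""
    m00 = m10 = m01 = m20 = m11 = m02 = 0.0
    for x, y in pixels:
        m00 += 1; m10 += x; m01 += y
        m20 += x*x; m11 += x*y; m02 += y*y
    cx, cy = m10 / m00, m01 / m00
    # Central second moments (covariance of the pixel coordinates)
    cxx, cxy, cyy = m20/m00 - cx*cx, m11/m00 - cx*cy, m02/m00 - cy*cy
    # Eigenvalues of the 2x2 covariance matrix
    tr, det = cxx + cyy, cxx*cyy - cxy*cxy
    disc = math.sqrt(max(tr*tr/4 - det, 0.0))
    l1, l2 = tr/2 + disc, max(tr/2 - disc, 0.0)
    # For a uniformly filled ellipse the semi-axes are twice the std. dev.
    a, b = 2*math.sqrt(l1), 2*math.sqrt(l2)
    theta = 0.5 * math.atan2(2*cxy, cxx - cyy)  # major-axis orientation
    return (cx, cy), (a, b), theta
```

For example, a horizontal strip of pixels yields a centroid at its middle, orientation zero, and a major axis proportional to its length.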
As can be observed from Fig. 1, though the number of non-overlapping feature regions is limited, the reserved regions cover almost the entire image.
The remaining elliptical regions are then warped into circular regions of consistent radius to obtain affine invariance. Next, the standard 128-dimensional SIFT descriptor [1] and the hybrid spill tree (SP-Tree) [12] are adopted in the description and matching sections.
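The ellipse-to-circle warp can be sketched as a 2×2 affine normalization (a standalone illustration; the fixed target radius of 16 is an arbitrary choice here, and the ellipse is given by its semi-axes and orientation):

```python
import math

def normalize_ellipse(a, b, theta, radius=16.0):
    """Return the 2x2 matrix mapping an ellipse (semi-axes a, b,
    orientation theta) onto a circle of the given radius:
    A = S * R(-theta), i.e. rotate the axes upright, then scale each."""
    c, s = math.cos(theta), math.sin(theta)
    return [[radius / a * c,  radius / a * s],
            [-radius / b * s, radius / b * c]]
```

Applying the returned matrix to points of the ellipse maps them onto the circle; e.g. an axis-aligned ellipse with semi-axes 8 and 4 is scaled by 2 and 4 along x and y respectively.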
2.2 Keypoint correspondence stage
After going through the processes mentioned above in the region correspondence stage, the corresponding feature regions are built and will be used in the second stage for fine matching.
In order to guarantee a sufficient number of candidate keypoints, modern detectors tend to locate keypoints by simply examining the intensities of certain pixels around tentative locations. Following a methodology similar to the algorithms in [4,13–16], we consider only the intensities along a circle of 16 pixels around each candidate keypoint.
We locate a candidate keypoint at P if the intensities of at least n contiguous pixels on the circle are all above (negative) or all below (positive) the intensity of P by more than a certain threshold, as illustrated in Fig. 2. Varying n from 10 to 15, the best performance in our experiments is achieved when n is set to 12. This may be because a keypoint tends to be located in a uniform area or along an edge when n is smaller than 12; conversely, when n is larger than 12, the criterion is so strict that some ‘good’ keypoints are discarded. Usually, a featureless candidate can be rejected very quickly without scanning the entire circle.
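The segment test can be sketched as follows (a minimal standalone version; the Bresenham circle offsets and the threshold value of 20 are illustrative assumptions, with images represented as plain nested lists):

```python
# Offsets of a 16-pixel Bresenham circle of radius 3 around the candidate
CIRCLE = [(0, 3), (1, 3), (2, 2), (3, 1), (3, 0), (3, -1), (2, -2), (1, -3),
          (0, -3), (-1, -3), (-2, -2), (-3, -1), (-3, 0), (-3, 1), (-2, 2), (-1, 3)]

def is_keypoint(img, x, y, thresh=20, n=12):
    """Segment test: n contiguous circle pixels all brighter or all darker
    than the centre by more than thresh. img is indexed as img[y][x]."""
    p = img[y][x]
    # Classify each circle pixel: +1 brighter, -1 darker, 0 similar
    s = []
    for dx, dy in CIRCLE:
        q = img[y + dy][x + dx]
        s.append(1 if q > p + thresh else (-1 if q < p - thresh else 0))
    # Look for a run of n equal, non-zero labels on the (wrapped) circle
    for sign in (1, -1):
        run = 0
        for v in s + s[:n - 1]:  # append a prefix so runs may wrap around
            run = run + 1 if v == sign else 0
            if run >= n:
                return True
    return False
```

A dark pixel surrounded by a bright neighborhood passes the test, while any pixel in a uniform area is rejected.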
Once a keypoint has been located, the intensities of every three contiguous pixels on the discrete circle R are summed with weights, yielding a total of 16 sums. These 16 sums are utilized to construct a compact descriptor D. Let Ri be a pixel on the circle R, and let Ricw and Riccw denote its neighboring pixels in the clockwise and counter-clockwise directions. Each element of the descriptor D can be described as
This makes the description more stable than using a single pixel intensity as an element of the descriptor. In fact, the intensity I is substituted by the gradient G from the point on the circle to its center P, in order to resist illumination change. Thus Eq. (9) can be rewritten as
The largest sum is chosen as the first element of the descriptor, and the vector is filled with the remaining sums of the circle in a clockwise direction. If more than one sum has the same largest value, all the possible descriptors are stored. This simple process is sufficient for rectifying the detected keypoint with respect to 2D rotation. Moreover, it considerably reduces the computational cost required in calculating the orientation of the small image patch and sorting the descriptor elements using the mean squared deviation (MSD). In addition, the descriptor vector is normalized in order to remove the variance of contrast.
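The descriptor construction above can be sketched as follows (a standalone illustration; equal weights within each three-pixel sum are an assumption, since the paper's weighting is not reproduced here):

```python
import math

# Offsets of a 16-pixel Bresenham circle of radius 3 around the keypoint
CIRCLE = [(0, 3), (1, 3), (2, 2), (3, 1), (3, 0), (3, -1), (2, -2), (1, -3),
          (0, -3), (-1, -3), (-2, -2), (-3, -1), (-3, 0), (-3, 1), (-2, 2), (-1, 3)]

def compact_descriptor(img, x, y):
    """Build the compact descriptor: 16 sums over three contiguous circle
    pixels, computed from gradients to the centre, rotated so the largest
    sum comes first, then L2-normalised to remove contrast variance."""
    p = img[y][x]
    g = [img[y + dy][x + dx] - p for dx, dy in CIRCLE]  # gradient to centre
    n = len(g)
    # Weighted sum of each pixel with its cw/ccw neighbours (equal weights assumed)
    sums = [g[(i - 1) % n] + g[i] + g[(i + 1) % n] for i in range(n)]
    # Rotate so the largest sum is the first element (2D rotation rectification)
    k = max(range(n), key=lambda i: sums[i])
    d = sums[k:] + sums[:k]
    # Normalise to unit length
    norm = math.sqrt(sum(v * v for v in d)) or 1.0
    return [v / norm for v in d]
```

For a perfectly symmetric neighborhood all 16 elements come out equal after normalization, which also illustrates why ties in the largest sum require storing multiple descriptors.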
The polarity of every descriptor is also recorded such that the keypoints can be categorized as positive or negative. This partition is efficient for searching, since positive features need not be compared with negative ones.
Lastly, all the descriptors are stored in a list, ranked by the sum of all elements in a descriptor. It is known that the density of keypoints in a region depends on image content: highly-textured regions usually generate more candidate keypoints. In order to make the distribution of keypoints more uniform, the searching range is set to be proportional to the cardinality of the region (the density is limited to below 1 keypoint per 12 pixels).
In the matching section of the second stage, the hybrid SP-Tree is again employed for its high efficiency. Finally, RANSAC, an effective way of eliminating spurious matches, is applied.
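RANSAC's inlier selection can be illustrated with a deliberately simplified model (a pure 2D translation instead of the projective transform the system actually estimates; all names and parameter values here are illustrative):

```python
import random

def ransac_translation(matches, n_iter=200, tol=3.0, seed=0):
    """Toy RANSAC sketch. matches: list of ((x1, y1), (x2, y2)) keypoint
    pairs. Repeatedly fit a translation from one random pair and keep the
    model that explains the most matches within tolerance tol."""
    rng = random.Random(seed)
    best_inliers = []
    for _ in range(n_iter):
        (x1, y1), (x2, y2) = rng.choice(matches)
        dx, dy = x2 - x1, y2 - y1  # minimal-sample model
        inliers = [m for m in matches
                   if abs(m[1][0] - m[0][0] - dx) <= tol
                   and abs(m[1][1] - m[0][1] - dy) <= tol]
        if len(inliers) > len(best_inliers):
            best_inliers = inliers
    return best_inliers
```

With a set of consistent matches polluted by mutually inconsistent outliers, the consensus set recovers exactly the consistent subset; the real system fits projective matrices the same way, only with larger minimal samples.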
3. Experimental details
We have evaluated the proposed algorithm against two criteria, namely, the number of correspondences and the reconstruction similarity (RS). We advocate evaluating the performance of a registration algorithm using the RS metric, as it reflects the degree of matching error well. More details can be found in Ref. [17].
The evaluations are conducted on the standard test library [11] provided by the Visual Geometry Group (test images are resized to 800 × 600). The experimental results on the ‘Graffiti’ sequence are shown in Fig. 3. As can be observed from Fig. 3, the performance of R3KR does not deteriorate much under small rotation and viewpoint changes compared with SIFT and GLOH, and it even outperforms them under large perspective distortions. Moreover, the total registration time of R3KR (0.476 s) is only 20.59% of that of SIFT (2.312 s) and 18.99% of that of GLOH (2.507 s). These data are collected on an ordinary laptop with a 2.4 GHz Intel® Core™2 Duo CPU and 2 GB RAM.
Figure 4 shows an application of indoor markerless AR tracking. It can be seen that R3KR can track the object robustly under various viewing conditions, even when the virtual object is located in a textureless region.
The video sequence is acquired by a Sony IPELA CM120 camera at a frame size of 320 × 240. Real-time performance (about 12.9 fps) is achieved through multi-threading: one thread handles the region correspondence stage on the current frame, while the other runs the keypoint correspondence stage on the previous frame.
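A minimal sketch of this producer/consumer arrangement (the stage functions are placeholders standing in for the actual R3KR stages):

```python
import threading
import queue

def region_stage(frame):
    # Placeholder for the region correspondence stage on a frame
    return f"regions({frame})"

def keypoint_stage(regions):
    # Placeholder for the keypoint correspondence stage on matched regions
    return f"keypoints({regions})"

def run_pipeline(frames):
    """One thread runs the region stage on the current frame while the
    other consumes the previous frame's regions for the keypoint stage."""
    q = queue.Queue(maxsize=1)
    results = []

    def producer():
        for f in frames:
            q.put(region_stage(f))
        q.put(None)  # sentinel: no more frames

    def consumer():
        while True:
            regions = q.get()
            if regions is None:
                break
            results.append(keypoint_stage(regions))

    t1 = threading.Thread(target=producer)
    t2 = threading.Thread(target=consumer)
    t1.start(); t2.start()
    t1.join(); t2.join()
    return results
```

Because the two stages overlap across consecutive frames, the slower of the two stages, rather than their sum, bounds the achievable frame rate.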
It should be noted that currently no existing computer-vision algorithm can directly perform tracking in a textureless region, because current keypoint detection is intrinsically texture-based corner detection. To address this problem, we have proposed the point transferring method, and this effective strategy is integrated into the application. By estimating the projective matrices and transferring the keypoints, the location of the virtual teapot can be found.
In conclusion, we have presented a feasible keypoint registration approach using a two-stage framework. Our experiments show that this approach not only runs relatively fast, but also tolerates perspective distortion and lighting changes, and handles partial occlusions. Thus, the proposed approach is applicable to keypoint tracking and on-line image registration.
This research is supported by the National High-Tech Research and Development Plan of China (2007AA01Z423), and by the National Basic Research Project of the ‘Eleventh Five-Year-Plan’ of China (C10020060355). Portions of the research were conducted at CIPMAS AR Laboratory at National University of Singapore.
References and links
1. D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” Int. J. Comput. Vis. 60(2), 91–110 (2004). [CrossRef]
3. H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool, “Speeded-up robust features (SURF),” Comput. Vis. Image Underst. 110(3), 346–359 (2008). [CrossRef]
5. M. Özuysal, P. Fua, and V. Lepetit, “Fast keypoint recognition in ten lines of code,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Minneapolis, United States, 2007), pp. 1–8.
6. J. Matas, O. Chum, M. Urban, and T. Pajdla, “Robust wide-baseline stereo from maximally stable extremal regions,” Image Vis. Comput. 22(10), 761–767 (2004). [CrossRef]
7. K. Mikolajczyk, T. Tuytelaars, C. Schmid, A. Zisserman, J. Matas, F. Schaffalitzky, T. Kadir, and L. Van Gool, “Comparison of affine region detectors,” Int. J. Comput. Vis. 65(1-2), 43–72 (2005). [CrossRef]
8. P. E. Forssén, “Maximally stable colour regions for recognition and matching,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Minneapolis, United States, 2007), pp. 1220–1227.
9. C. Boncelet, “Image noise models,” in Handbook of Image and Video Processing, 2nd ed., A. C. Bovik, ed. (Elsevier Academic Press, San Diego, United States, 2005).
10. P. E. Forssén and A. Moe, “View matching with blob features,” Image Vis. Comput. 27(1-2), 99–107 (2009). [CrossRef]
11. Visual Geometry Group, “Affine covariant regions datasets,” http://www.robots.ox.ac.uk/~vgg/data/.
12. T. Liu, A. W. Moore, A. Gray, and K. Yang, “An investigation of practical approximate nearest neighbor algorithms,” in Advances in Neural Information Processing Systems, L. K. Saul, Y. Weiss, and L. Bottou, eds. (MIT Press, Cambridge, 2005).
13. T. Ojala, M. Pietikäinen, and T. Mäenpää, “Multiresolution gray-scale and rotation invariant texture classification with local binary patterns,” IEEE Trans. Pattern Anal. Mach. Intell. 24(7), 971–987 (2002). [CrossRef]
14. V. Lepetit and P. Fua, “Towards recognizing feature points using classification trees,” Technical Report IC/2004/74 (EPFL, 2004).
15. E. Rosten and T. Drummond, “Fusing points and lines for high performance tracking,” in Proceedings of the IEEE International Conference on Computer Vision (Beijing, China, 2005), pp. 1508–1515.
16. E. Rosten and T. Drummond, “Machine learning for high-speed corner detection,” in Proceedings of the European Conference on Computer Vision (Graz, Austria, 2006), pp. 430–443.
17. Z. Li, W. Gong, A. Y. C. Nee, and S. K. Ong, “The effectiveness of detector combinations,” Opt. Express 17(9), 7407–7418 (2009), http://www.opticsinfobase.org/oe/abstract.cfm?uri=oe-17-9-7407. [CrossRef] [PubMed]