
Tracking registration of optical see-through augmented reality based on the Riemannian manifold constraint

Open Access

Abstract

Two three-dimensional (3-D) tracking registration methods that incorporate Riemannian manifold object constraints are proposed to address the low accuracy and instability of 3-D tracking registration in sparse and complex scenes. The factors that affect registration accuracy in such scenes are analyzed, and a deep convolutional neural network is used to extract 3-D instance objects from the scene. The 3-D tracking registration model is established according to the Riemannian manifold constraint relationship between instance objects in different states. The stability of the 3-D tracking registration algorithm is improved by fusing inertial sensors, and the cumulative error is optimized using instance object labels to improve robustness. The proposed algorithm effectively improves the accuracy of 3-D tracking registration. It can improve the performance of augmented reality systems and be applied to power system navigation, medicine, and other fields.

© 2022 Optica Publishing Group under the terms of the Optica Open Access Publishing Agreement

1. Introduction

With the rapid development of augmented reality (AR) in recent years, AR equipment has been preliminarily applied in various fields around three emerging application concepts: hands-free operation, efficient interaction, and decision-making. AR technology superimposes computer-generated three-dimensional (3-D) models, images, text, and other virtual information onto the real scene in real time, enhancing the real world. AR can effectively improve the user's perception of the real scene, and virtual-real interaction has become a new way for people to communicate with the environment. An AR system needs to meet the following characteristics [1]:

  • (1) Three-dimensional (3-D) tracking registration
  • (2) Virtual-reality fusion display
  • (3) Real-time interaction

Augmented reality can provide users with additional environmental information through touch, smell, hearing, and vision. Among these senses, vision is the most direct way of obtaining information and plays an important role in AR research. In optical see-through AR systems, virtual information is projected to the human eye through optical helmets, AR glasses, and other display devices. The eye thus observes the real environment fused with the virtual information and can intuitively understand the scene. This display method ensures a clear viewpoint and provides a wide field of view, with a simple structure, high scene resolution, and strong realism. Optical see-through augmented reality can therefore be applied in many fields, including medicine, where surgical operations can be planned and performed effectively by presenting vital sign information or a virtual 3-D model during surgery. It can also be applied to the assisted maintenance of indoor or outdoor power grid equipment, automatic 3-D scanning inspection, and aviation and automobile assisted driving, where it improves driving safety because the driver's line of sight need not shift frequently between the instrument panel and the road [2–5].

The complete workflow of an optical see-through augmented-reality system is as follows. First, the initial pose of the equipment is determined and the content information in the scene, such as feature points and classified objects, is obtained. Then, 3-D tracking registration uses this information to calculate the attitude changes of the equipment between states. Finally, the scene content and attitude change information are fused by analyzing the occlusion relationship between the virtual information and the objects in the real scene. Although these problems have been extensively investigated, persistent issues in optical see-through augmented-reality systems remain to be solved, such as unstable 3-D tracking registration results, vulnerability to scene information, and susceptibility to drift when the equipment moves quickly.

In this article, we focus on the theoretical analysis and key technology research of 3-D tracking registration in sparse and complex scenes and mainly solve the problem of using scene information to realize stable and real-time 3-D tracking registration. We add instance object constraints to the 3-D tracking registration. By constructing the instance constraint registration model in sparse and complex scenes, the accuracy and stability of 3-D tracking registration are improved.

The key contributions of this article are summarized as follows.

  • (1) The instance object deep convolutional neural network is introduced into the 3-D tracking registration process. Four-quadrant weight processing is performed on the object point cloud to reduce the error matching rate. This method can improve the accuracy and stability of 3-D tracking registration.
  • (2) To improve the accuracy of 3-D tracking registration, Riemannian manifold theory is introduced to eliminate the registration error when establishing the registration model. The 3-D instance point cloud information is incorporated into the model, and a covariance matrix is constructed on the Riemannian manifold. Object constraints for the k classes of instances address sparse scenarios, and the angle difference between the normal vectors of corresponding 3-D points is introduced into the 3-D tracking registration model. The registration accuracy and stability in sparse environments are thus improved.
  • (3) The gray error expression under the constraint of the instance object is established. By adding the optimal weight value, the 3-D tracking registration error is reduced. Thus, the registration problem in complex scenes is solved and the registration accuracy is improved.

The remainder of this article is organized as follows. Section 2 reviews related work on 3-D tracking registration. In Section 3, the 3-D tracking registration models with instance object constraints in sparse and complex scenes are derived. Section 4 gives the implementation details of the experiments. Conclusions are drawn in Section 5.

2. Related works

3-D tracking registration aims to estimate the motion posture of the equipment in different states according to the scene information and to obtain the two-dimensional projection at a specified time. At present, the scene content can be obtained by extracting objects in the scene, such as markers, feature points, or complete objects, to provide environmental information for 3-D tracking registration. Many 3-D tracking registration methods have been proposed according to the different feature extraction methods.

In traditional 3-D tracking registration algorithms, markers are usually placed and identified, and virtual objects are then superimposed into the scene [6]. Although these methods are simple and convenient, fusing virtual and real scenes is impossible if the marker leaves the field of view. Markerless 3-D tracking registration algorithms can solve this problem [7]: feature points existing in the scene are extracted to obtain the scene information and the posture of the equipment. Key points are usually detected, a description vector is extracted to construct a local feature descriptor, and feature matching is carried out. Mur-Artal et al. [8] extracted, tracked, mapped, and repositioned feature points from the images of a monocular camera and then performed loop detection on them. This method, based on simultaneous localization and mapping (SLAM), obtains improved 3-D tracking registration results. The University of Science and Technology of China proposed a 3-D tracking registration method based on tracking-learning-detection (TLD) [9], which effectively improves the registration accuracy and stability of object tracking in an AR system. Although feature points provide effective scene information when texture features are evident, few feature points are extracted under weak texture, which lowers the robustness of 3-D tracking registration. Moreover, feature points are easily mismatched during feature matching; redundant points can be removed, but some feature information is lost. To obtain useful scene information and improve the stability of 3-D tracking registration, researchers have extracted various objects in the scene and separated them from the rest of the image by region extraction and edge detection. However, the computation is complex, the algorithm must be designed in advance according to the scene, and such algorithms can usually only register local regions [10].

Many scholars have proposed object detection algorithms with the development of deep learning. These algorithms use deep convolutional neural networks to output classification and detection results directly through learning, without extracting features in advance. However, the output of object detection is a bounding region that cannot be used directly as input for attitude calculation. Semantic segmentation can solve this problem: by adding an upsampling stage to the network, it outputs pixel-level segmentation results while obtaining the classification information of the scene objects. Long et al. [11] proposed the fully convolutional network (FCN) in 2015, which realizes semantic segmentation by replacing fully connected layers with convolutional layers and fusing information at different scales. A series of semantic segmentation algorithms followed, such as U-Net [12], SegNet [13], SCN [14], and PSPNet [15]. Nanchang University of Aeronautics and Astronautics of China proposed the DeepLab V3+ algorithm with a dual attention mechanism to realize the segmentation of large-scale objects [16].

Semantic segmentation can provide scenes with object labels, supply constraint functions between scene frames, and introduce semantic object constraints at the optimization stage. Mapping pixels with semantic labels into 3-D space produces labeled semantic maps, which are conducive to understanding the real environment and to the virtual-real interaction of AR systems. Bowman et al. [17] calculated the object center through a probability model and reprojected it onto the image; the object center lies close to the center of the detection frame, the data association is weighted accordingly, and the attitude estimation result is obtained by optimizing the error. Dong et al. [18] used semantic mapping to represent the topological structure and spatial relationships of the scene and constructed a semantic map. Beijing University of Aeronautics and Astronautics proposed a semantic optical flow SLAM algorithm that optimizes the features hidden in semantic and geometric information, obtains semantic segmentation results via SegNet, and uses them as the mask of the semantic optical flow to obtain the attitude estimation results [19]. Tsinghua University runs five threads (tracking, semantic segmentation, local mapping, loop closure, and dense semantic mapping) in parallel to generate a dense semantic octree, which effectively improves the accuracy of pose estimation [20].

Vision-based 3-D tracking registration algorithms can optimize scene information to obtain the registration results. However, pure vision methods are easily disturbed by image occlusion and moving objects in practical applications. Tracking registration methods based on hardware equipment [21] offer fast response and immunity to imaging quality. Inertial and magnetic sensors can track pedestrians' gestures in real time, with applications in healthcare and augmented reality [22]. Carotenuto et al. [23] proposed an indoor ultrasonic system for positioning mobile units (MUs) with new relevant features; the system uses a lightweight closed-form positioning algorithm to calculate their positions. However, some sensors suffer from zero bias, low precision, and drift, so this kind of method inevitably accumulates error during long-term tracking registration and easily loses scene content information.

Tracking registration algorithms based on hardware devices and on vision have complementary properties, and researchers have therefore proposed hybrid hardware-vision tracking registration algorithms. Mur-Artal et al. [24] estimated the pose information of multiframe images using a monocular camera, provided the initial value for the nonlinear estimation system, estimated the scale and sensor deviation, and finally obtained the projection of the gravity vector in the visual coordinate system to align the world coordinate system with the visual coordinate system. Northeast University proposed an algorithm that fuses the direct method with an inertial measurement unit and optimizes the cost function using a sliding window to ensure accurate prior information, correct the visual pose tracking results, and improve the estimation accuracy [25].

Several point cloud registration methods have also been proposed. He et al. [26] proposed a new ICP registration method that constructs a feature descriptor combining density curvature and normal angle; by capturing the internal relationship between point clouds, the accuracy and speed of the algorithm are effectively improved. You et al. [27] proposed a point cloud registration algorithm based on a 3-D neighborhood point feature histogram (3DNPFH) descriptor, which handles registration of large point clouds and maintains accuracy and speed on multiple data sets.

According to current research progress, scene texture, the number of objects, structural complexity, and other factors affect 3-D tracking registration, and the stability of optical see-through augmented-reality systems still needs improvement. This study focuses on the theoretical analysis and key technologies of 3-D tracking registration in sparse and complex scenes and mainly addresses how to use scene information to realize stable, real-time 3-D tracking registration. Because the environment is variable and diverse, it is difficult for an augmented reality system to provide the required feature information while understanding the scene content, estimate the state of the equipment in real time, and ensure the stability and accuracy of 3-D tracking registration. The human eye can infer camera motion in an image by identifying the motion of different objects. Limited features and information in sparse scenes reduce the accuracy of 3-D tracking registration. Although semantic segmentation can obtain pixel-level classification results for certain objects in the scene, it fails to distinguish individual instances, so the label information it provides is insufficiently accurate for 3-D tracking registration in complex scenes. In addition, pure-vision 3-D tracking registration is susceptible to drift when the image is occluded or the camera moves rapidly. Therefore, hardware equipment must be combined with the analysis of scene characteristics to establish the attitude estimation model and obtain the 3-D tracking registration results.

3. Methodology

3.1 Basic principle of the algorithm

Relevant theoretical analysis and method research are carried out and effective solutions are put forward in this study to solve the current problems of 3-D tracking registration and to meet the stability, real-time, and accuracy requirements of optical see-through augmented-reality systems. The basic principle and research scheme of the algorithm are shown in Fig. 1.

Fig. 1. Basic principle framework of the algorithm.

First, at the theoretical level, the factors affecting the accuracy of 3-D tracking registration are analyzed. For each application scenario, a deep convolutional neural network is used to obtain a 3-D instance point cloud. The 3-D instance segmentation results of the scene are then used to establish an attitude estimation model with instance object constraints between object point clouds. The influencing factors of 3-D tracking registration vary across scenarios, and the feature information of the scene directly affects the registration results. Three-dimensional tracking registration models are therefore established for sparse-feature and complex-structure scenes to optimize the cumulative error of attitude estimation and to meet real-time and stability requirements. In scenes with sparse features, the instance object segmentation results provide effective information and are used to optimize the accumulated error of the attitude estimation process; in scenes with a certain structural complexity, the robustness of the algorithm is improved. In this way, the problem of 3-D tracking registration in different scenes is solved.

3.2 Scene information extraction and point cloud data processing

Stable 3-D tracking registration first requires extracting feature information from the scene and then estimating the attitude change of the equipment. The 3-D tracking registration method based on feature points has limitations and easily loses features in sparse scenes. Semantic segmentation can obtain pixel-level segmentation results of the scene but fails to distinguish different objects of the same category and ignores the constraints of instance objects in the estimation. Therefore, this work extracts the instance objects in the scene, which saves the cost of manual calibration and improves the accuracy of 3-D tracking registration by providing accurate labels.

An instance segmentation deep convolutional neural network usually adds a fully convolutional network (FCN) branch to an object detection network, classifies pixel categories while accurately predicting the edge contour of each instance, and uses a two-dimensional (2-D) image as the processing object. To establish the attitude estimation model for 3-D tracking registration, the 2-D coordinate information must be transformed into 3-D point cloud information, which requires the depth value of each pixel. A binocular or TOF depth camera is generally used to obtain the depth information of the scene, and the 3-D coordinates are obtained through coordinate transformation. The 3-D instance segmentation in this work is based on the SGPN network [28], whose front end adopts the PointNet or PointNet++ point cloud processing framework and represents the 3-D instance segmentation results as a similarity matrix. The data sets used to train the network differ according to the application scenario: data sets are produced with image annotation tools, and after training, the segmentation results of the 3-D instance object point cloud are obtained to meet the needs of different scenarios.
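The conversion of masked pixels and their depth values into 3-D camera-frame points follows the standard pinhole back-projection relation. The sketch below illustrates this step; the function name, intrinsic parameters, and mask format are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def backproject_to_point_cloud(depth, mask, fx, fy, cx, cy):
    """Convert the depth values of masked pixels into 3-D camera-frame points.

    depth : (H, W) depth map in metres, e.g. from a TOF or stereo camera.
    mask  : (H, W) boolean instance mask produced by the segmentation network.
    fx, fy, cx, cy : pinhole intrinsics (focal lengths and principal point).
    """
    v, u = np.nonzero(mask)              # pixel rows (v) and columns (u)
    z = depth[v, u]
    valid = z > 0                        # drop pixels without a depth reading
    u, v, z = u[valid], v[valid], z[valid]
    x = (u - cx) * z / fx                # standard pinhole back-projection
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)   # (N, 3) instance point cloud
```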

The 3-D instance object point cloud output by the network may include dynamic objects, which affect the accuracy of 3-D tracking registration. Therefore, dynamic objects are removed according to the classification results. Outliers are then detected in each class of object point cloud and removed to reduce the estimation error they would cause in tracking registration.
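The paper does not specify the outlier detector; a common choice, shown here as a minimal sketch, is statistical outlier removal based on the mean distance to the k nearest neighbours. The parameter values k and std_ratio are illustrative.

```python
import numpy as np
from scipy.spatial import cKDTree

def remove_statistical_outliers(points, k=20, std_ratio=2.0):
    """Drop points whose mean distance to their k nearest neighbours is more
    than std_ratio standard deviations above the cloud-wide mean distance."""
    tree = cKDTree(points)
    # query the k+1 nearest points; the first one is the point itself
    dists, _ = tree.query(points, k=k + 1)
    mean_d = dists[:, 1:].mean(axis=1)
    keep = mean_d < mean_d.mean() + std_ratio * mean_d.std()
    return points[keep]
```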

3.3 3-D tracking registration model in sparse scene

In order to solve the problem of 3-D tracking registration in sparse scenarios, this section proposes a registration method integrating instance object constraints.

The camera attitude needs to be estimated between image frames after the 3-D tracking registration model is established to obtain scene information. The principle is shown in Fig. 2.

Fig. 2. Diagram of attitude estimation.

We assume that the rotation between the two states of the device in Fig. 2 is R and the translation is T. A set of point clouds is obtained at times t and t + 1 using the 3-D instance segmentation algorithm. When point clouds are matched, the corresponding object must be found. Typically, a threshold is set and the Euclidean distance between the centers of gravity of two point clouds is computed to decide whether they belong to the same object. The instance object point cloud has the advantage that the category and segmentation label of the point cloud are obtained directly, so no exhaustive object matching is required: during matching, we only search for the point cloud with the corresponding label to obtain the matching result. Adding the object instance constraint to the pose model retains additional hierarchical features and also adds a structural prior of the scene, so it can adapt to scenes with sparse features and complex structures.
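A minimal sketch of this label-driven matching, assuming each segmented instance is stored under a (class label, instance id) key; the centroid-distance threshold is a hypothetical value.

```python
import numpy as np

def match_instances(prev_objects, curr_objects, max_centroid_dist=0.5):
    """Pair instance point clouds across two frames.

    prev_objects / curr_objects : dict mapping (class_label, instance_id) -> (N, 3) array.
    The search is restricted to clouds sharing the same class label, and the pair
    whose centroids are closest is accepted if within max_centroid_dist (metres).
    """
    matches = []
    for key_p, cloud_p in prev_objects.items():
        c_p = cloud_p.mean(axis=0)
        best, best_d = None, max_centroid_dist
        for key_c, cloud_c in curr_objects.items():
            if key_c[0] != key_p[0]:          # labels must agree; no exhaustive search
                continue
            d = np.linalg.norm(cloud_c.mean(axis=0) - c_p)
            if d < best_d:
                best, best_d = key_c, d
        if best is not None:
            matches.append((key_p, best))
    return matches
```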

An IMU and the camera are fused to improve the stability of the 3-D tracking registration algorithm. If the IMU and camera are calibrated in advance, then the gyroscope angular velocity and accelerometer increments at a given time during movement are sufficient to obtain the IMU pose estimate. Visual and IMU estimation are two independent modules: the output of each module is fused, the sensor noises do not affect each other, and the 3-D tracking registration results are robust. Applied to an optical see-through augmented-reality system, this forms a high-frame-rate positioning scheme. Because every estimate contains errors, the cumulative error grows over time and would eventually make the algorithm fail, so it must be eliminated. The cumulative error is quantified using the near and far distribution of the segmented regions of the instance objects, which helps to optimize the error, because the same 3-D point should yield consistent detection and segmentation results after reprojection.
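The paper treats the visual and IMU estimators as independent modules whose outputs are fused. One simple loosely coupled fusion, sketched below under the assumption of a fixed trust factor, blends the two pose estimates directly; the weighting scheme is illustrative, not the paper's exact fusion rule.

```python
import numpy as np
from scipy.spatial.transform import Rotation, Slerp

def loosely_couple(R_vis, T_vis, R_imu, T_imu, alpha=0.8):
    """Fuse independent visual and IMU pose estimates of the same motion.

    Rotations are blended by spherical linear interpolation and translations
    by a weighted average; alpha is the trust placed in the visual estimate
    (a hypothetical fixed value; in practice it would come from the noise
    models of the two sensors)."""
    rots = Rotation.from_matrix(np.stack([R_imu, R_vis]))
    slerp = Slerp([0.0, 1.0], rots)
    R_fused = slerp([alpha]).as_matrix()[0]        # alpha=0 -> IMU, alpha=1 -> visual
    T_fused = alpha * np.asarray(T_vis) + (1.0 - alpha) * np.asarray(T_imu)
    return R_fused, T_fused
```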

The camera motion attitude (R, T) is estimated from the relationship between image frames; the following calculation of (R, T) is designed for the attitude estimation model of 3-D tracking registration. A 3-D tracking registration model with instance object constraints is established for sparse-feature scenes. Two sets of instance object point clouds are obtained from two images captured by the camera. Two point clouds containing the same object have similar surface curvatures, and the point cloud surface features include the normal vector features. Therefore, when establishing the error term, the angle difference between the normal vectors of corresponding points is considered in addition to the Euclidean distance between them, yielding a geometric error expression with instance object constraints. If k classes of objects exist in the scene, then the error expression must include the error results of all k classes. The object constraint model is established to improve the robustness of the 3-D tracking algorithm in sparse-feature scenes. After obtaining the two sets of instance object point clouds, let ${p_d}$ be a point in the current image frame and ${p_f}$ the corresponding point in the matched point cloud, with normal vector ${n_f}$. The angle difference between the normal vectors is ${\omega _f}$ and the 3-D tracking estimation error is ${\varepsilon _d}$. The geometric error expression with instance object constraint is expressed as follows:

$${\varepsilon _d} = ({{p_d} - {p_f}} )\cdot {n_f} + {\omega _f}. $$
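A direct reading of this error expression, sketched in Python for a batch of matched points; the normal of the current-frame point (n_d) is assumed to be available from the point cloud so that the normal-angle term ω_f can be computed.

```python
import numpy as np

def geometric_residuals(p_d, p_f, n_f, n_d):
    """Per-correspondence error eps_d = (p_d - p_f) . n_f + omega_f,
    where omega_f is the angle between the two surface normals.

    p_d, p_f : (N, 3) matched points in the current and previous frame.
    n_d, n_f : (N, 3) unit surface normals at those points.
    """
    point_to_plane = np.einsum('ij,ij->i', p_d - p_f, n_f)   # (p_d - p_f) . n_f
    cos_angle = np.clip(np.einsum('ij,ij->i', n_d, n_f), -1.0, 1.0)
    omega = np.arccos(cos_angle)                              # normal-angle difference
    return point_to_plane + omega
```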

In instance segmentation, the segmentation accuracy cannot reach 100%, so the segmentation result is uncertain: the lower the accuracy of pixel classification, the higher the uncertainty of the subsequent tracking registration. The Riemannian manifold is an important concept in Riemannian geometry; a manifold is a topological space that is locally homeomorphic to Euclidean space, and a Riemannian manifold is a special differentiable manifold. From the geometric point of view of the object point cloud, a Riemannian manifold provides a measure of the internal relationship between feature point clouds in manifold space. Based on Riemannian manifold theory, this paper establishes a feature covariance matrix on the Riemannian manifold, combines it with the uncertainty of instance segmentation, obtains the manifold relationship between objects, analyzes the instance regions of dynamic objects, and selects features. The modeling process of the feature covariance matrix is shown in Fig. 3.

Fig. 3. Construction of the feature covariance matrix.

Since the features now carry depth information, the 3-D feature covariance matrix H can be constructed. Between two adjacent key frames, the feature covariance matrix of the same object changes by $\Delta H$. To account for the uncertainty of instance segmentation, an uncertainty factor $\mu $ is introduced to build the following instance object pixel feature selection model:

$$P({X_i}) = {\mu _i} \cdot \Delta {H_i}$$
where $P({X_i})$ is the confidence level of the feature selection result, which determines whether the feature is selected.
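A minimal sketch of this feature selection model: the covariance matrix of each instance point cloud is computed and the change ΔH between key frames is combined with the uncertainty factor μ. The paper does not state how the matrix change is reduced to a scalar; the Frobenius norm used here is an assumption.

```python
import numpy as np

def feature_covariance(points):
    """3-D feature covariance matrix H of one instance point cloud (N, 3)."""
    centered = points - points.mean(axis=0)
    return centered.T @ centered / max(len(points) - 1, 1)

def selection_confidence(points_prev, points_curr, mu):
    """Confidence P(X_i) = mu_i * Delta H_i, collapsing the change of the
    covariance matrix between two key frames to a scalar via the Frobenius
    norm (one possible choice; the paper does not fix the scalarisation)."""
    delta_h = np.linalg.norm(feature_covariance(points_curr)
                             - feature_covariance(points_prev), ord='fro')
    return mu * delta_h
```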

If k-class instance objects exist in the scene, then the established error expression shall include the following error results of k-class instance objects:

$$E = \arg \min \sum\limits_{i = 1}^k {{{||{P({X_i}) \cdot {\varepsilon_{{d_k}}}} ||}^2}} $$

An instance object constraint model is established to improve the robustness of the 3-D tracking algorithm in sparse-feature scenes, where ${\varepsilon _{{d_k}}}$ is the geometric error of the k-th class, and the tracking registration parameters are obtained. The Gauss-Newton iterative method can be used to solve the equation. By fusing the result with the attitude estimate obtained from the IMU, the 3-D tracking registration results in sparse scenes are obtained.
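The model above is minimized with Gauss-Newton, as stated. The sketch below shows one such solver for the weighted point-to-plane part of the error, using a small-angle pose update; treating the normal-angle term as constant in the Jacobian and using a fixed iteration count are simplifications of this sketch, not the paper's exact solver.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def gauss_newton_pose(p_d, p_f, n_f, weights, iters=10):
    """Weighted Gauss-Newton solver for the point-to-plane part of the error,
    sum_i w_i * ((R p_d_i + T - p_f_i) . n_f_i)^2, with a small-angle update."""
    weights = np.asarray(weights, dtype=float)
    R, T = np.eye(3), np.zeros(3)
    for _ in range(iters):
        q = p_d @ R.T                                # rotated current-frame points
        r = np.einsum('ij,ij->i', q + T - p_f, n_f)  # point-to-plane residuals
        # Jacobian rows: d r_i / d[omega, t] = [ (q_i x n_i)^T , n_i^T ]
        J = np.hstack([np.cross(q, n_f), n_f])
        H = J.T @ (weights[:, None] * J)             # weighted normal equations
        g = J.T @ (weights * r)
        delta = -np.linalg.solve(H + 1e-9 * np.eye(6), g)
        R = Rotation.from_rotvec(delta[:3]).as_matrix() @ R
        T = T + delta[3:]
    return R, T
```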

3.4 3-D tracking registration model in complex scene

We solve the problem of 3-D tracking registration in complex scenes by the following methods.

The gray-invariance assumption posits that the gray value of a pixel corresponding to the same spatial point should theoretically be the same in each image of a complex structured scene. This is a strong assumption, and a single pixel has no discriminative power. Because the gray distributions of different objects vary, the instance object constraint is added when establishing the model to ensure the accuracy of the algorithm: pixel blocks are distinguished using the object segmentation results, and a gray error expression under the instance object constraint is established. Let the number of obtained instance objects be j, each with a corresponding gray residual ${\varepsilon _g}$ and geometric residual ${\varepsilon _d}$. The gray residual is determined by the surface texture of the instance object, and the geometric residual by the geometry of the instance object point cloud. The optimization weights of the model are determined according to the texture and the variance of the geometric surface. Let ${\omega _g}$ be the gray optimization weight of the instance object and ${\omega _d}$ the geometric optimization weight. The objective function to be optimized is expressed as follows:

$$E = \arg \min \sum\limits_{i = 1}^j {{{||{P({X_i}) \cdot ({\omega_g}{\varepsilon_g} + {\omega_d}{\varepsilon_d})} ||}^2}} $$
${\varepsilon _g}$ and ${\varepsilon _d}$ contain the variables R and T. A loosely coupled method is adopted for the fusion with the IMU, and the cumulative error of the attitude estimation process is optimized to improve the accuracy of 3-D tracking registration in complex scenes. As before, the Gauss-Newton iterative method is used to solve the above model.
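The objective combines, per instance, the gray and geometric residuals with their weights and the confidence P(X_i). A minimal sketch of the weighted residual assembly is given below; the residual vectors and weights are assumed to be precomputed.

```python
import numpy as np

def combined_residuals(eps_g, eps_d, w_g, w_d, confidence):
    """Weighted residual of one instance object for the complex-scene model:
    P(X_i) * (w_g * eps_g + w_d * eps_d).  Summing the squared result over all
    instances gives the objective E minimised by Gauss-Newton.

    eps_g, eps_d : photometric and geometric residual vectors of the instance.
    w_g, w_d     : gray and geometric optimisation weights.
    confidence   : P(X_i) from the Riemannian-manifold feature selection model.
    """
    return confidence * (w_g * np.asarray(eps_g) + w_d * np.asarray(eps_d))

# objective value for a set of instances:
# E = sum(np.sum(combined_residuals(eg, ed, wg, wd, p) ** 2)
#         for eg, ed, wg, wd, p in instances)
```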

4. Algorithm experiment

The optical see-through augmented-reality system usually includes an optical display system, tracking camera, IMU, and computer under laboratory conditions. The TOF depth camera is used in this experiment to obtain the virtual information display position of the optical display system and determine the initial attitude. The attitude change of the entire system is calculated from the information obtained by the camera and IMU.

4.1 Scene information extraction

In the experiment, scene information is first extracted, and the SGPN network [28] is trained on sparse and complex scenes. Outliers in the point cloud are removed, and the 3-D point cloud segmentation results are verified on the data sets. One frame of the 3-D instance point cloud is extracted via the SGPN network, and the 3-D instance point cloud of the next key frame is extracted in the same way. The classification results of the two groups of point clouds are substituted into the 3-D tracking registration model, the results are compared with the calibration values in the data set, and the algorithm accuracy is calculated while the 3-D scene with instance information is recovered [29–31]. Figure 4 shows the initial point cloud and the segmented point cloud. The experiment shows that the point cloud is effectively segmented on several data sets, which provides the point cloud data required for the subsequent 3-D tracking registration.

Fig. 4. Point cloud segmentation results.

4.2 Determination of object weight

To analyze the influence of the weights on the pixel mismatch rate and to determine their values, we select a series of images in the scene sequence and calculate the mismatch rate with and without the added weights. Both a gray-scale weight and a geometric weight must be determined. The gray weight is determined from the gray image collected by the camera: in tracking registration, large gray errors concentrate on object edges, that is, where the gray gradient changes significantly, so pixels on object edges that coincide with obvious gradient changes are given a smaller weight value. For the geometric weight, objects are divided into four quadrants according to geometric shape, distinguishing rigid from flexible and complex from non-complex objects, and the classified objects in the scene are recorded in the four quadrants, as shown in Fig. 5 (a sketch of this quadrant lookup is given below). Because of their complicated shapes, complex objects produce larger instance segmentation errors than non-complex objects, and rigid objects resist deformation, so their segmentation accuracy is higher than that of flexible objects. In the experiment, the influence of the weights on the result is observed by measuring the pixel mismatch rate between frames: ten frames of images form a group, two groups of data are selected, the average mismatch rate of each group is calculated, and the pixel mismatch rate before adding the weights is compared with that after adding them.
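An illustrative lookup for the four-quadrant geometric weight; the class-to-quadrant assignment and the numerical weights below are hypothetical values for demonstration only and would be tuned per scene.

```python
# Hypothetical class-to-quadrant assignment (complexity, rigidity).
QUADRANT = {
    'pedestrian': ('complex', 'flexible'),
    'tree':       ('complex', 'flexible'),
    'car':        ('complex', 'rigid'),
    'building':   ('uncomplicated', 'rigid'),
}

# Hypothetical weights: more reliable segmentation gets a larger weight.
GEOMETRIC_WEIGHT = {
    ('uncomplicated', 'rigid'):    1.0,   # most reliable segmentation
    ('complex',       'rigid'):    0.8,
    ('uncomplicated', 'flexible'): 0.6,
    ('complex',       'flexible'): 0.4,   # least reliable segmentation
}

def geometric_weight(class_label):
    """Return the geometric weight of an instance from its four-quadrant cell."""
    return GEOMETRIC_WEIGHT[QUADRANT[class_label]]
```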

Fig. 5. Weight determination based on four quadrants.

Two scenarios are selected for the experiment: Fig. 5(a) shows one scenario and Fig. 5(b) the other. In the figure, the ordinate is the average mismatch rate over ten frames of images, and the abscissa is the trustworthiness of the point cloud pixel segmentation. The coordinate plane is divided into four areas: complex rigid body, uncomplicated rigid body, complex soft body, and uncomplicated soft body. Pedestrians, cars, buildings, trees, and other objects are classified to compute the mismatch rate and reliability. It can be seen that the mismatch rate is significantly reduced after adding the weights and that the reliability is correspondingly increased.

Figure 6 shows the point cloud processing results after adding weights. It can be seen that the pixels on the edge of the point cloud are well segmented.

Fig. 6. Point cloud after weight processing.

4.3 Registration error comparison experiment

To test the accuracy of the tracking registration algorithm, we measured the registration error on the standard KITTI [29], ApolloScape [30], and ONCE [31] datasets, using 50 frames of data. As shown in Fig. 7, after adding the Riemannian constraint, the algorithm keeps the error at about 2 pixels on these datasets, indicating that it can effectively meet the requirements of the system.

Fig. 7. Interframe error in different datasets.

To compare the tracking registration accuracy in sparse and complex scenarios, we captured a set of image data. Because application scenarios differ, the network may fail to identify some objects in a given scenario. The method is used to evaluate the 3-D tracking registration accuracy of the system, and the interframe error along the x-axis is calculated in sparse and complex scenes (Figs. 8 and 9). In the experiment, four groups of data are selected, each group containing 50 frames of images, and the error before and after adding the Riemannian constraint is calculated. Figures 8 and 9 show the error comparison in the two scenarios. In complex scenes, the error after adding the Riemannian constraint stays at about 2 pixels; in sparse scenes, it stays below 3 pixels. The experiment shows that the method meets the requirements of an optical see-through head-up display system.

Fig. 8. Interframe error in complex scene.

Fig. 9. Interframe error in sparse scene.

4.4 Real value error comparison experiment

The correlation distance is used to further measure the similarity between the registered and real-value data and to evaluate the error between them. The real and registered values are $t = ({t_1},{t_2},\ldots ,{t_{600}})$ and ${r_k} = ({r_1},{r_2},\ldots ,{r_{600}})$, respectively, where k = 1, 2, …, 5 indexes the compared methods. The correlation distance between t and ${r_k}$ is expressed as follows:

$$\begin{array}{l} {D_{t{r_k}}} = 1 - {\rho _{t{r_k}}}\\ = 1 - \left|{\frac{{\sigma_{t{r_k}}^2}}{{{\sigma_t} \cdot {\sigma_{{r_k}}}}}} \right|\\ = 1 - \left|{\frac{{\sum {(t - \bar{t})({r_k} - {{\bar{r}}_k})} }}{{\sqrt {\sum {{{(t - \bar{t})}^2}} } \sqrt {\sum {{{({r_k} - {{\bar{r}}_k})}^2}} } }}} \right|\end{array}, $$
where $\overline t$ and $\overline {{r_k}}$ are the mean values of t and ${r_k}$, respectively; $\sigma _{t{r_k}}^2$ is the covariance of t and ${r_k}$; and ${\sigma _t}$ and ${\sigma _{{r_k}}}$ are the standard deviations of t and ${r_k}$. The errors between the various methods and the real values are obtained by comparing this algorithm with the others. All methods show some error relative to the real value; the difference is caused by the estimation error of the algorithm itself and by jitter during motion. The truth error is calculated on the different datasets, and the correlation distances of the algorithms are as follows:
$$\begin{array}{l} {D_{t{r_1}}} = 1 - 0.8871 = 0.1129 > 0.1\\ {D_{t{r_2}}} = 1 - 0.9057 = 0.0943 < 0.1\\ {D_{t{r_3}}} = 1 - 0.9089 = 0.0911 < 0.1\\ {D_{t{r_4}}} = 1 - 0.9217 = 0.0783 < 0.1\\ {D_{t{r_5}}} = 1 - 0.9456 = 0.0544 < 0.1 \end{array},$$
where ${D_{t{r_1}}}$ corresponds to the feature point method [8], ${D_{t{r_2}}}$ to the homography-matrix-based method [32], ${D_{t{r_3}}}$ to the point cloud matching method [33], ${D_{t{r_4}}}$ to the sparse-scene algorithm in this study, and ${D_{t{r_5}}}$ to the complex-scene algorithm. The real and registered values are highly similar when the correlation distance is within 0.1; a larger correlation distance indicates lower similarity. The comparison shows that the proposed algorithms are close to the ground-truth data, and the registration error lies within the allowable range, meeting the requirements of registration accuracy.
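A direct transcription of the correlation distance formula above; the function assumes two equal-length 1-D trajectories.

```python
import numpy as np

def correlation_distance(t, r):
    """Correlation distance D = 1 - |corr(t, r)| between the ground-truth
    trajectory t and a registered trajectory r (1-D arrays of equal length)."""
    t, r = np.asarray(t, float), np.asarray(r, float)
    num = np.sum((t - t.mean()) * (r - r.mean()))
    den = np.sqrt(np.sum((t - t.mean()) ** 2)) * np.sqrt(np.sum((r - r.mean()) ** 2))
    return 1.0 - abs(num / den)
```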

In the experiment, the error of each method was calculated ten times. Figure 10 shows the real-value error comparison on the different datasets. The ONCE dataset has fewer features than ApolloScape and KITTI, so the sparse-scene registration method is used for it, while ApolloScape and KITTI are compared with the complex-scene registration method. The 3-D tracking registration accuracy of the feature point method is lower than that of the other methods, and the accuracy of the feature point cloud matching algorithm is easily affected by specific scene conditions. The proposed registration methods show the best stability because the sparse- and complex-scene methods in this study retain the majority of pixels in the scene with minimal information loss. They also achieve satisfactory stability because the geometric information and other useful scene information are combined when establishing the estimation model of the transformation matrix.

Fig. 10. Real value error comparison result.

4.5 Algorithm speed comparison experiment

To evaluate the running speed of the algorithms, the proposed algorithm is compared with other similar methods, and the virtual information is fused with the real scene using the matrix obtained via the 3-D tracking registration algorithm. Table 1 compares the average time consumed by each method for scene and virtual information fusion over 600 images.

Table 1. Comparison of average time of virtual reality fusion (ms)

The virtual-real fusion of the method based on feature point cloud matching is fast because feature-matching steps are unnecessary: the correspondence between image frames is established directly and the transformation matrix is estimated. Compared with the feature point method and the homography-matrix-based method, the approaches that use point clouds in the scene are therefore faster. The feature point cloud matching method is similar to the two algorithms in this study in that the time-consuming feature-matching step is absent, but the scene information used to establish the model differs, and the run time grows with the amount of information considered. As shown in Table 1, all of the above methods meet the accuracy requirements.

Compared with other methods, the proposed algorithm ensures both the calculation accuracy and the average speed of virtual-real fusion while remaining stable in long-term 3-D tracking registration. The proposed algorithm processes each image frame at up to 22 FPS, which meets the real-time requirements.

4.6 Performance summary

According to the experiments in this section, the following innovative results are obtained:

  • (1) The four-quadrant weight method is used to process the weights of rigid and soft objects, which effectively reduces the mismatch rate.
  • (2) The Riemannian constraint is introduced into the tracking registration calculation. In instance segmentation, the segmentation accuracy cannot reach 100%: the lower the accuracy of pixel classification, the higher the uncertainty. The Riemannian manifold measures the intrinsic relationship between feature point clouds in manifold space from the geometric point of view of the target point cloud. By establishing the feature covariance matrix, the manifold relationship between objects is obtained, which improves the 3-D tracking registration accuracy. The experiments show that adding the Riemannian manifold constraint effectively reduces the error of 3-D tracking registration.

5. Conclusion

The 3-D tracking registration with instance object constraints is a key step toward stable virtual-reality fusion: instability and low accuracy of the 3-D tracking registration algorithm directly cause virtual-reality fusion to fail, and stable, real-time 3-D tracking registration depends on scene information. The segmentation and extraction of 3-D instance objects are conducive to a deeper understanding of the scene, retain additional content information, and add multilevel constraint information to the 3-D tracking registration model and the subsequent error optimization. The data set used for 3-D instance segmentation and the network parameter settings have a certain impact on the output. The error balancing and estimation method in the 3-D tracking registration model is a new augmented reality approach from the perspective of technological innovation and requires further investigation. The method has a wide range of applications: it can significantly improve the safety of power operations, provide new scientific and technological support for the intelligent power industry and other related industries, and deliver high-level, high-quality services for power operators. Moreover, it challenges the traditional power operation mode and further promotes the development of smart grids and other emerging augmented reality applications.

Funding

Natural Science Foundation of Jilin Province (YDZJ202201ZYTS428).

Disclosures

The authors declare no conflicts of interest.

Data Availability

Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request.

References

1. R. Azuma, “A survey of augmented reality,” Presence: Teleoperators and Virtual Environments 6(4), 355–385 (1997). [CrossRef]

2. A. Liccardo, P. Arpaia, F. Bonavolonta, E. Caputo, and R. Moriello, “An Augmented Reality Approach to Remote Controlling Measurement Instruments for Educational Purposes During Pandemic Restrictions,” IEEE Trans. Instrum. Meas. 70(1), 1–20 (2021). [CrossRef]  

3. P. Arpaia, L. Duraccio, N. Moccaldi, and S. Rossi, “Wearable Brain–Computer Interface Instrumentation for Robot-Based Rehabilitation by Augmented Reality,” IEEE Trans. Instrum. Meas. 69(9), 6362–6371 (2020). [CrossRef]  

4. G. Pasquale, S. Graziani, A. Pollicino, and C. Trigona, “Performance Characterization of a Biodegradable Deformation Sensor Based on Bacterial Cellulose,” IEEE Trans. Instrum. Meas. 69(5), 2561–2569 (2020). [CrossRef]  

5. S. Murano, C. Pérez-Rubio, D. Gualda, F. J. Álvarez, T. Aguilera, and C. D. Marziani, “Evaluation of Zadoff–Chu, Kasami, and Chirp-Based Encoding Schemes for Acoustic Local Positioning Systems,” IEEE Trans. Instrum. Meas. 69(8), 5356–5368 (2020). [CrossRef]  

6. D. Khan, S. Ullah, D. Yanming, I. Rabbi, P. Richard, T. Hoang, M. Billinghurst, and X. Zhang, “Robust Tracking Through the Design of High Quality Fiducial Markers: An Optimization Tool for ARToolKit,” IEEE Access 6(99), 22421–22433 (2018). [CrossRef]

7. T. Hayashi, H. Uchiyama, J. Pilet, and H. Saito, “An Augmented Reality Setup with an Omnidirectional Camera Based on Multiple Object Detection,” International Conference on Pattern Recognition (2010).

8. R. Mur-Artal, J. Montiel, and J. Tardos, “ORB-SLAM: a versatile and accurate monocular SLAM system,” IEEE Trans. Robot. 31(5), 1147–1163 (2015). [CrossRef]  

9. Y. Li and D. Yin, “AR tracking and registration method based-on TLD algorithm,” Journal of system simulation. 26(9), 2062–2067 (2014). [CrossRef]  

10. Z. Zhang, Z. Min, A. Zhang, J. Wang, S Song, and Q. Meng, “Reliable hybrid mixture model for generalized point set registration,” IEEE Trans. Instrum. Meas. 70, 1–10 (2021). [CrossRef]  

11. E. Shelhamer, J. Long, and T. Darrell, “Fully convolutional networks for semantic Segmentation,” IEEE Trans. Pattern Anal. Mach. Intell. 39(4), 640–651 (2017). [CrossRef]  

12. R. Olaf, F. Philipp, and B. Thomas, “U-Net: convolutional networks for biomedical image segmentation,” International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer International Publishing, (2015).

13. V. Badrinarayanan, A. Kendall, and R. Cipolla, “SegNet: A deep convolutional encoder-decoder architecture for scene segmentation,” IEEE Transactions on Pattern Analysis & Machine Intelligence. 6(99), 1 (2017). [CrossRef]  

14. D. Lin, R. Zhang, Y. Ji, P. Li, and H. Huang, “SCN: switchable context network for semantic segmentation of RGB-D images,” IEEE Transactions on Cybernetics (2020).

15. H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, “Pyramid scene parsing network,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017).

16. W. Liu, Y. Shu, X. Tang, and J. Liu, “Remote sensing image segmentation using dual attention mechanism Deeplabv3+ algorithm,” Tropical Geography 3(23), 1–17 (2020). [CrossRef]

17. S. Bowman, N. Atanasov, K. Daniilidis, and G. Pappas, “Probabilistic data association for semantic SLAM,” 2017 IEEE International Conference on Robotics and Automation (ICRA). IEEE,(2017). [CrossRef]  

18. W. Dong, Y. Chu, and S. Hong, “Semantic mapping and navigation with visual planar landmarks,” International Conference on Ubiquitous Robots and Ambient Intelligence(URAI).Daejeon,(2012).

19. L. Cui and C. Ma, “SOF-SLAM: A semantic visual SLAM for dynamic environments,” IEEE Access. 7(99), 166528 (2019). [CrossRef]  

20. C. Yu, Z. Liu, X. Liu, F. Xie, Y. Yang, Q. Wei, and Q. Fei, “DS-SLAM: A semantic visual SLAM towards dynamic environments,” 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).IEEE, (2018).

21. J. Liu, “Overview of simultaneous positioning and map creation of ranging sensors,” Journal of intelligent systems. 10(5), 655–662 (2015).

22. Z. Zhang and X. Meng, “Use of an inertial/magnetic sensor module for pedestrian tracking during normal walking,” IEEE Trans. Instrum. Meas. 64(3), 776–783 (2015). [CrossRef]  

23. R. Carotenuto, M. Merenda, D. Iero, and F. G. Della Corte, “An indoor ultrasonic system for autonomous 3-D positioning,” IEEE Trans. Instrum. Meas. 68(7), 2507–2518 (2019). [CrossRef]  

24. R. Mur-Artal and J. Tardos, “Visual-inertial monocular SLAM with map reuse,” IEEE Robot. Autom. Lett. 2(2), 796–803 (2017). [CrossRef]  

25. Y. Liu, Y. Zhou, L. Rong, H. Jiang, and Y. Deng, “Visual odometry based on the direct method and the inertial measurement unit,” Robot. 41(5), 683–689 (2019). [CrossRef]  

26. Y. He, J. Yang, X. Hou, S. Pang, and J. Chen, “ICP registration with DCA descriptor for 3D point clouds,” Opt. Express 29(13), 20423–20439 (2021). [CrossRef]  

27. B. You, H. Chen, and J. Li, “Fast point cloud registration algorithm based on 3DNPFH descriptor,” Photonics 9(6), 414 (2022). [CrossRef]  

28. W. Wang, R. Yu, Q. Huang, and U. Neumann, “SGPN: Similarity Group Proposal Network for 3D Point Cloud Instance Segmentation,” (2017).

29. A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision meets robotics: The KITTI Dataset,” IJRR, (2013). https://www.cvlibs.net/datasets/kitti/raw_data.php.

30. X. Huang, P. Wang, X. Cheng, D. Zhou, Q. Geng, and R. Yang, “The ApolloScape open dataset for autonomous driving and its application,” IEEE Transactions on Pattern Analysis and Machine Intelligence (2019). https://apolloscape.auto/index.html

31. J. Mao, M. Niu, C. Jiang, H. Liang, X. Liang, Y. Li, C. Ye, W. Zhang, Z. Li, and J. Yu, “One million scenes for autonomous driving: ONCE Dataset,” GitHub (2021). https://once-for-auto-driving.github.io/index.html

32. T. Fei, X. Liang, Z. He, T. Fei, and G. Hua, “A registration method based on nature feature with KLT tracking algorithm for wearable computers,” Proceedings of 2009 International Conference on Cyberworlds (2009).

33. Z. An, X. Xu, J. Yang, Y. Liu, and Y. Yan, “A real-time three-dimensional tracking and registration method in the AR-HUD system,” IEEE Access 6(1), 43749–43757 (2018). [CrossRef]  
