
Tightly-coupled fusion of iGPS measurements in optimization-based visual SLAM

Open Access

Abstract

Monocular visual Simultaneous Localization and Mapping (SLAM) can achieve accurate and robust pose estimation with excellent perceptual ability. However, image errors accumulated over time cause excessive trajectory drift in GPS-denied indoor environments lacking global positioning constraints. In this paper, we propose a novel optimization-based SLAM fusing rich visual features and indoor GPS (iGPS) measurements, obtained by the workshop Measurement Position System (wMPS), to tackle the problem of trajectory drift associated with visual SLAM. We first calibrate the spatial shift and temporal offset of the two types of sensors using multi-view alignment and pose optimization bundle adjustment (BA) algorithms, respectively. Then, we initialize camera poses and map points in a unified world frame by iGPS-aided monocular initialization and PnP algorithms. Finally, we employ a tightly-coupled fusion of iGPS measurements and visual observations using a pose optimization strategy for high-accuracy global localization and mapping. In the experiments, public datasets and self-collected sequences are used to evaluate the performance of our approach. The proposed system improves the absolute trajectory error from the current state-of-the-art 19.16mm (ORB-SLAM3) to 5.87mm in the public dataset and from 31.20mm to 5.85mm in the real-world experiment. Furthermore, the proposed system also shows good robustness in the evaluations.

© 2023 Optica Publishing Group under the terms of the Optica Open Access Publishing Agreement

1. Introduction

Advanced intelligent manufacturing and the assembly of large structures urgently need accurate and robust 6-DoF pose estimation of components and mobile industrial robots for machining, assembly, and quality inspection. Although industrial robots have been widely used for automated welding and pick-and-place operations in the automotive industry [1], these repetitive tasks are accomplished with specific mechanical structures and predesigned motion trajectories. The traditional teaching method is confronted with difficulties, such as significant production cycle time for non-repetitive tasks. Unfortunately, industrial robots also lack the ability to automatically perceive, plan, and position with high absolute precision for mass customization and small-scale production [2].

Supported by various indoor positioning methods, external measurement and tracking services based on wireless, ultrasonic, laser, and vision technologies have been introduced to autonomous mobile robots, making automated manufacturing less labor-intensive and more worker-friendly. Bluetooth low energy (BLE) and Wi-Fi usually determine location via rather unreliable received signal strength indicators, which limits positioning accuracy to the meter level. Therefore, they are applied in scenarios that do not need extreme precision, such as warehouses. UWB, a wireless ranging technology based on measuring round-trip times between emitters and receivers, has been applied to indoor navigation of Unmanned Aerial Vehicles (UAVs) and achieves a positioning accuracy of 20cm with 95% probability in an area of 7m by 7m by 3m with 8 UWB nodes [3].

Ultrasonic devices emit signals from beacons with known positions to a receiver and measure range using the time-of-arrival method. This strategy is low-cost and simple in structure and computation, and can achieve approximately centimeter-level accuracy [4–6]. However, the measurement accuracy of these approaches is limited and unable to fulfill the requirements of millimeter-level absolute positioning in industrial environments [7]. Recently, rotating or solid-state LiDAR can rapidly measure target ranges beyond 100 meters by recording the time of flight (ToF) of a large number of emitted laser pulses, generate a rich point cloud map of the environment, and localize by matching the current map against a known one. It is widely used for autonomous driving with the help of a high-definition map (HD map) [8]. Nevertheless, there are some challenges in indoor positioning, such as limited semantic information and measurement accuracy. Vision-based localization methods are usually used for 6DoF pose estimation and 3D scene reconstruction based on the rich features of images. Among them, motion capture is a representative exteroceptive sensing technique that captures object poses using fixed cameras and is capable of sub-millimeter accuracy at hundreds of frames per second, but its measurement range is limited and determined by the number of cameras.

Among proprioceptive vision localization technologies, visual odometry (VO) and visual SLAM (vSLAM) have attracted wide research interest thanks to their low cost, small size, and easy integration with many mobile robotic platforms, especially for the monocular camera. Since the camera records the rich features and color information of a scene in the image and has excellent perceptual ability, the estimated pose and generated map can be used for real-time control and navigation of the robot. However, the monocular system also has inherent disadvantages, such as scale ambiguity, scale drift, and instability in reflective or featureless environments [9]. In addition, monocular VO/vSLAM approaches accumulate errors in position and attitude over time due to sensor noise and modeling errors.

To tackle these difficulties in monocular vSLAM, a viable solution is to incorporate at least one additional sensor, such as an IMU or GNSS. Although combining the camera and IMU can recover metric scale effectively and is more robust to transient visual tracking loss, the cumulative drift still cannot be eliminated. GNSS is able to provide global positioning information in the world coordinate frame with time-independent accuracy, helping to eliminate cumulative error and correct scale drift. Nevertheless, this global positioning information cannot be used in indoor environments due to the multipath effect and signal obstruction.

In this work, we consider the indoor high-precision localization and mapping problem by introducing the iGPS measurement device, wMPS [10], to provide global pose constraints for the monocular SLAM and jointly solve a bundle adjustment problem to compute millimeter-level absolute camera poses, a metric-scale map, and the time offset between sensors. The scalability of the iGPS device enables observation of the same target from multi-station transmitters, avoiding tracking loss due to blockage of the line of sight and enhancing the robustness of localization. On the other hand, due to their complementary properties, the visual information can make up for the lack of perception ability of iGPS devices in mobile robot navigation. The main contributions of this work are the following:

  • 1. A tightly-coupled sensor fusion method combining monocular camera and iGPS sensor to achieve high-accuracy global localization and mapping simultaneously.
  • 2. A detailed demonstration of spatial and online temporal offset calibration methods between sensors to ensure the accuracy of the proposed system.
  • 3. An iGPS-aided monocular initialization method and a PnP algorithm with pose constraints to improve the results of global optimization and real-time efficiency.

The rest of this paper is structured as follows: Section 2 reviews the works related to the topic. In Section 3, an overview of the system is presented. Next, we discuss the spatial-temporal calibration method and joint optimization in Section 4. The simulated and real-world experimental results are shown in Section 5. Finally, we summarize the performance of the proposed method and possible future research directions in Section 6.

2. Related works

2.1 VO/vSLAM

In recent years, VO/vSLAM has attracted a great deal of attention since it can estimate camera poses and a map of the environment simultaneously. Many vSLAM solutions are based on perspective camera models, such as ORB-SLAM [11]. On the basis of the monocular method, many multi-camera systems have been proposed to expand the field of view and estimate the metric scale. To further expand the field of view, omnidirectional cameras, such as fisheye/catadioptric cameras, have been introduced to improve system performance [12]. Wang et al. propose the LF-VIO framework for large field-of-view cameras with a negative plane to perform pose estimation [13]. As for other camera types, rolling shutter (RS) cameras are low-cost but suffer from high motion distortion; Lang et al. introduce Ctrl-VIO [14] for visual-inertial estimation and online exposure time calibration of RS cameras. To overcome the drawbacks of vision sensors, multi-modal sensor fusion methods are used to improve the perception ability and positioning accuracy. The sensors used for localization mainly include the IMU and GNSS, as mentioned above. In the same way, our study is mainly based on the visual SLAM framework. However, due to the inherent disadvantages of visual SLAM, we propose a new sensor fusion method for accurate localization, metric scale estimation, and avoidance of accumulated drift by introducing iGPS measurements. Since the measurement principle of iGPS differs from that of the traditional sensors in the SLAM community, it can provide external pose measurement information when used for sensor fusion, which is conducive to improving the performance of the system.

2.2 Sensor fusion method for VO/vSLAM

Sensor fusion has been proven to be more effective and robust for state estimation [15–17] and can be divided into loose coupling and tight coupling. The former processes all measurements separately and then fuses the results, reducing the computational cost, so it is suitable for systems with limited resources. Different from the loosely coupled sensor fusion approach, which loses measurement information, MSCKF and MSCKF 2.0 present a tightly-coupled extended Kalman filter (EKF) visual-inertial odometry (VIO) estimator that fuses visual and inertial measurements to perform online estimation of the system states and the camera-to-IMU calibration parameters [18,19]. Since then, tightly-coupled VIO and VI-SLAM have been further improved with exciting results [20]. For example, building on many excellent VI-SLAM algorithms, Campos et al. proposed ORB-SLAM3, which improved components such as feature extraction, multi-map management, and loop closure detection, obtaining centimeter-level indoor positioning accuracy [21]. However, VIO can only ensure relative positioning accuracy over a short time, and the accumulated drift seriously affects the long-term operation of the robot in large-scale environments. Even VI-SLAM algorithms with loop closure detection face this challenge. To solve this problem, many autonomous driving approaches fuse 6-DoF VIO poses estimated from a tightly-coupled visual-inertial system and 3-DoF global positions obtained by GNSS to determine the global 6-DoF poses [22,23]. Although these loosely-coupled multi-sensor fusion methods can directly modify the VIO poses to improve the positioning accuracy, the measurement information is not complementary enough because each estimate is computed independently from its own measurements, resulting in a limited effect. Recent research has shown that optimization-based tightly-coupled approaches fusing visual-inertial data with GNSS raw measurements achieve better positioning accuracy outdoors than using VIO, GNSS, or RTK alone [24,25].

2.3 Spatial and temporal calibration

Since the calibration results affect the localization accuracy and normal operation of the system, many multi-sensor and multi-model fusion approaches assume that the sensors have been accurately calibrated (including intrinsic, extrinsic, and time offset parameters) [26–29]. With the aim of meeting the requirements of precise sensor fusion, many excellent calibration methods have been proposed. For example, Furgale et al. [30] provide the open-source camera/IMU Kalibr toolbox to estimate the spatial and temporal alignment of the camera with respect to the IMU based on a continuous-time batch estimation algorithm, which is widely adopted by VIO and VI-SLAM systems. Rehder et al. [31] extend the Kalibr algorithm and further derive a method for spatially calibrating multiple IMUs in a single estimator. Qiu et al. [32] propose an IMU-centric scheme for heterogeneous multi-sensor fusion that uses a high-frequency IMU as the calibration reference. All the above-mentioned methods operate in offline mode. Although they separate calibration from operation and significantly reduce the complexity of the SLAM algorithm, they require calibration with a fixed planar pattern, which increases the complexity of system operation, and they are inferior to online calibration methods in terms of long-term operational stability. For online systems, Kelly et al. [33] present an online localization and mapping algorithm based on an unscented Kalman filter that accurately calibrates sensor-to-sensor transformations without using a calibration target. Li et al. [34] used camera poses with compensation of the time deviation, based on the EKF algorithm, to update feature observations and estimate the time offset between visual and inertial measurements, which is also adopted by [35]. Qin et al. [36] modeled the speed of the extracted features in each image and calibrated the temporal offset by jointly optimizing the temporal offset, camera and IMU states, and feature positions, which makes online time calibration simpler.

3. Overview

3.1 iGPS setup

In the proposed visual-iGPS system, the iGPS setup, based on an optical measurement principle, is not only a static metrology instrument that detects the 3D coordinates of receivers with portability, non-contact operation, sub-millimeter measurement precision, and a large measurement range [37], but is also capable of real-time millimeter-level tracking applications such as recording industrial robot movements.

The iGPS tracking system includes at least two transmitters at the transmitting end and a hand-held probe at the receiving end, as shown in Fig. 1. One of the transmitters is usually defined as the origin of the iGPS tracking coordinate frame, and the extrinsic parameters of the other transmitters are calibrated in this unified tracking frame. When the system works properly, each transmitter emits laser light omnidirectionally through a pulsed laser and two line-laser modules, and the 6 receivers on the probe receive these laser signals. Once these receivers capture laser light emitted by at least two transmitters, they can calculate their current coordinates w.r.t the iGPS tracking frame through the built-in processor of the probe. The reader is referred to [38] for a more detailed measurement principle.

Fig. 1. The iGPS system. (a) A transmitter. (b) The iGPS portable hand-held measurement probe with 6 receivers and processor.

When used for online pose estimation of robot arms in industrial production, the iGPS tracking system is able to keep track of an arbitrary number of probes and receivers over a wide range of environments and updates measurements at a rate of 30Hz with an average latency of 50ms [39]. However, in practice, the dynamic tracking performance of the iGPS is not satisfactory due to synchronization errors between multiple transmitters, and the dynamic error increases with motion speed [40]. Therefore, the performance of dynamic tracking is greatly inferior to that of static measurement.

3.2 System workflow

The workflow of the proposed monocular visual-iGPS tightly-coupled estimator is shown in Fig. 2. At first, the system preprocesses the camera images and iGPS measurement data, respectively. ORB features in the image are extracted for subsequent image matching. The coordinates of the six receivers of the probe are used to calculate the probe pose. It is noted that the receiver signal is unstable under fast motion and occlusion, and thus the current state needs to be strictly checked. Next, the iGPS-aided visual SFM is introduced to recover the metric scale and determine the transformation matrix between the local camera pose and the unified global one, which is discussed in Section 4.3. Then, the current camera poses and feature velocities are estimated in the tracking procedure. In the joint optimization process, the visual-iGPS bundle adjustment tries to find the optimal state estimate by minimizing the pose error and reprojection error with the time offset. Finally, global BA and map updates for loops are performed to further eliminate accumulated drift and errors in the map.

Fig. 2. The pipeline of the proposed estimator. The system consists of data preprocessing, tracking, joint optimization and loop closure, which tightly fuses feature observations and iGPS measurements to estimate system states and time offset.

3.3 Notations

The symbols $\left ( \cdot \right ) ^w$, $\left ( \cdot \right ) ^c$ and $\left ( \cdot \right ) ^i$ represent a quantity in the world coordinate frame $\left \{ W \right \}$, the camera coordinate frame $\left \{ C \right \}$ at the initial time instant, and the probe coordinate frame $\left \{ I \right \}$ at the initial time instant, respectively. The notation $\boldsymbol {T}_{c}^{w}\in SE\left ( 3 \right )$ represents the transformation matrix of frame $\left \{ C \right \}$ with respect to frame $\left \{ W \right \}$ and includes a rotation $\boldsymbol {R}_{c}^{w}\in SO\left ( 3 \right )$ and a translation $\boldsymbol {t}_{c}^{w}\in \mathbb {R}^3$. The rotation matrix can also be expressed as a quaternion vector $\boldsymbol {q}_{c}^{w}=\left [ \begin {matrix} q_w & q_x & q_y & q_z\\ \end {matrix} \right ] ^{\text {T}}$. $\otimes$ denotes the multiplication operation between quaternions. $\boldsymbol {T}_{c_k}^{w}$ and $\boldsymbol {T}_{i_k}^{w}$ are the moving camera pose and the moving iGPS probe pose w.r.t the world coordinate frame at time $k$. The 3D point $\boldsymbol {p}_{j}^{w}$ is the $j$th point in $\left \{ W \right \}$. Estimated variables are denoted by the hat superscript $\left ( ^{\wedge } \right )$.
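As a concrete illustration of the scalar-first quaternion convention used later in the attitude residual of Section 4.4, a minimal sketch of the $\otimes$ operation is given below. The Hamilton product shown here is an assumption, since the paper only fixes the component ordering, and the helper names are illustrative.

```python
import numpy as np

def quat_mul(q, p):
    """Hamilton product q (x) p with scalar-first ordering [qw, qx, qy, qz]."""
    qw, qx, qy, qz = q
    pw, px, py, pz = p
    return np.array([
        qw*pw - qx*px - qy*py - qz*pz,
        qw*px + qx*pw + qy*pz - qz*py,
        qw*py - qx*pz + qy*pw + qz*px,
        qw*pz + qx*py - qy*px + qz*pw,
    ])

def quat_inv(q):
    """Inverse of a unit quaternion (its conjugate)."""
    return np.array([q[0], -q[1], -q[2], -q[3]])
```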

4. Methodology

4.1 Spatial alignment

In this section, we derive a multi-view alignment method for spatially calibrating the extrinsic parameter matrix between the camera and the iGPS probe, as shown in Fig. 3. The camera is mounted on the probe, and its intrinsic parameters are calibrated. All extrinsic parameters of the iGPS transmitters are calibrated, and one transmitter defines the origin of the world frame and the iGPS tracking frame.

Fig. 3. Calibration procedure based on the multi-view alignment method.

The calibration process requires the use of markers with known coordinates $\boldsymbol {p}_{j}^{w}$, $j\in 1,2,\ldots,N$, which can be detected by both the camera and the iGPS tracking system. One of these points is used as the origin to create an object coordinate frame $\left \{ O \right \}$, and $\boldsymbol {p}_{j}^{o}$ represents the $j$th point in this frame. These markers are identified for many camera poses, and their pixel coordinates are extracted from each image. Then we solve each camera pose $\boldsymbol {T}_{c_k}^{w}$ by adopting the PnP (Perspective-n-Points) algorithm [41]:

$$\boldsymbol{T}_{c_k}^{w*}=\underset{\boldsymbol{T}_{c_k}^{w}}{\arg\min}\frac{1}{2}\sum_{j=1}^N{\lVert \boldsymbol{z}_{j}^{c_k}-\pi \left( \boldsymbol{R}_{w}^{c_k}\boldsymbol{p}_{j}^{w}+\boldsymbol{t}_{w}^{c_k} \right) \rVert ^2}$$
where $\boldsymbol {z}_{j}^{c_k}$ denotes the visual observation of the $j$th marker in the $k$th pose. Considering that the probe measurements are the coordinates of all receivers in $\left \{ W \right \}$, we collect the mean coordinates of the detected receivers for a static probe pose over a long period, and build a rigid-body 3D model and the probe frame $\left \{ I \right \}$. The coordinates of the six probe receivers in $\left \{ I \right \}$ are defined as $P^i=\left \{ p_{s}^{i} \right \} _{s=1}^{s=6}$. The point sets of the probe receivers in $\left \{ W \right \}$ are defined as $\mathcal {P}^w = \left \{ P^w_1, P^w_2, \ldots, P^w_m\right \}$, where the coordinates of the receivers in the $k$th pose are defined as $P^w_k = \left \{p^w_s\right \}_{s=1}^{s=6}$. Then the probe poses $\boldsymbol {T}_{i_k}^{w}$ can be calculated by applying the Singular Value Decomposition (SVD) method to the corresponding point sets $P^i$ and $P^w_k$ inside a RANSAC scheme [42–44].
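A minimal sketch of this probe-pose step is given below; it assumes numpy, uses the standard Kabsch/Umeyama closed form for the SVD fit, and wraps it in a simple RANSAC loop over the six receiver correspondences. The function names and the inlier threshold are illustrative, not taken from the authors' implementation.

```python
import numpy as np

def rigid_transform_svd(P_src, P_dst):
    """Least-squares R, t such that P_dst ~= R @ P_src + t (Kabsch/Umeyama, no scale).
    P_src, P_dst: (N, 3) arrays of corresponding points."""
    mu_s, mu_d = P_src.mean(axis=0), P_dst.mean(axis=0)
    H = (P_src - mu_s).T @ (P_dst - mu_d)
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # guard against reflections
    R = Vt.T @ D @ U.T
    return R, mu_d - R @ mu_s

def probe_pose_ransac(P_i, P_w, iters=100, thresh=0.002):
    """Estimate R_i^w, t_i^w from receiver coordinates in {I} and {W}.
    P_i, P_w: (6, 3) receiver coordinates; thresh is an assumed inlier bound in metres."""
    rng = np.random.default_rng(0)
    best, best_count = None, 0
    for _ in range(iters):
        idx = rng.choice(len(P_i), size=3, replace=False)        # minimal 3-point sample
        R, t = rigid_transform_svd(P_i[idx], P_w[idx])
        err = np.linalg.norm(P_i @ R.T + t - P_w, axis=1)
        inliers = err < thresh
        if inliers.sum() >= 3 and inliers.sum() > best_count:    # refit on all inliers
            best_count = inliers.sum()
            best = rigid_transform_svd(P_i[inliers], P_w[inliers])
    return best  # (R_i^w, t_i^w), or None if no consensus was found
```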

The above two procedures are repeated for different camera and probe poses. We note that the matrix $\boldsymbol {T}_{o}^{w}$ stays the same when the camera and probe change poses. Therefore, the extrinsic parameter matrix $\boldsymbol {T}_{c}^{i}$ can be determined based on this invariant:

$$\boldsymbol{T}_{i_{k+1}}^{w}\boldsymbol{T}_{c}^{i}\boldsymbol{T}_{w}^{c_{k+1}}\boldsymbol{T}_{o}^{w}=\boldsymbol{T}_{i_k}^{w}\boldsymbol{T}_{c}^{i}\boldsymbol{T}_{w}^{c_k}\boldsymbol{T}_{o}^{w}$$

Rearranging Eq. (2), we have

$$\boldsymbol{AT}_{c}^{i}=\boldsymbol{T}_{c}^{i}\boldsymbol{B}$$
where,
$$\left\{ \begin{array}{c} \boldsymbol{A}=\left( \boldsymbol{T}_{i_k}^{w} \right) ^{{-}1}\boldsymbol{T}_{i_{k+1}}^{w}\\ \boldsymbol{B}=\boldsymbol{T}_{o}^{c_k}\left( \boldsymbol{T}_{o}^{c_{k+1}} \right) ^{{-}1}\\ \end{array} \right.$$

This is a hand-eye calibration problem $\boldsymbol {A}\boldsymbol {X}=\boldsymbol {X}\boldsymbol {B}$, where $\boldsymbol {A}$ and $\boldsymbol {B}$ are the pose changes observed by two fixed sensors, and the $\boldsymbol {X}$ represents the extrinsic parameter $\boldsymbol {T}_{c}^{i}$. By solving this equation, the $\boldsymbol {X}$ provides the rigid transformation between the probe and the camera [45]. Therefore, the camera poses at any time can be deduced by $\boldsymbol {T}_{c_k}^{w} =\boldsymbol {T}_{i_k}^{w}\boldsymbol {T}_{c}^{i}$.
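To make the $\boldsymbol{A}\boldsymbol{X}=\boldsymbol{X}\boldsymbol{B}$ step concrete, the following sketch solves it with a common two-step (Tsai/Park-style) closed form rather than the specific solver of [45]: the rotation is recovered by aligning the rotation vectors of the relative motions, and the translation by stacked linear least squares. It assumes numpy and scipy, and the function name is illustrative.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def hand_eye_AX_XB(A_list, B_list):
    """Solve A X = X B for X = T_c^i, given lists of 4x4 relative poses:
    A_k from the probe track and B_k from the camera track (same motion pairs)."""
    # 1) Rotation: rotation axes satisfy alpha_k = R_X beta_k; align them with Kabsch.
    alpha = np.stack([Rotation.from_matrix(A[:3, :3]).as_rotvec() for A in A_list])
    beta = np.stack([Rotation.from_matrix(B[:3, :3]).as_rotvec() for B in B_list])
    H = beta.T @ alpha
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R_X = Vt.T @ D @ U.T
    # 2) Translation: (R_A - I) t_X = R_X t_B - t_A, stacked over all motion pairs.
    M = np.vstack([A[:3, :3] - np.eye(3) for A in A_list])
    d = np.concatenate([R_X @ B[:3, 3] - A[:3, 3] for A, B in zip(A_list, B_list)])
    t_X, *_ = np.linalg.lstsq(M, d, rcond=None)
    X = np.eye(4)
    X[:3, :3], X[:3, 3] = R_X, t_X
    return X
```

If a library routine is preferred, OpenCV's calibrateHandEye offers comparable closed-form solvers for the same problem.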

4.2 Online temporal alignment

In practice, the timestamp of each sensor typically suffers from triggering and transmission delays, leading to temporal misalignment (time offsets) among different sensors. Such a time offset dramatically affects the performance of sensor fusion. In the proposed system, images and iGPS measurements are recorded at frequencies of 15Hz and 30Hz, respectively, and the maximum offset between matched data is about 15ms. The iGPS setup cannot receive an external trigger signal due to its special rotary-laser scanning measurement principle. However, internal clocks in the iGPS system accurately record the timestamps at which the laser planes sweep past the iGPS receivers, so the moment at which the 3D coordinate measurements are generated can be determined. To avoid dedicated hardware for synchronizing camera and iGPS timestamps, software-based time synchronization between the camera and iGPS needs to be considered.

We adopt an online temporal calibration to make the sensor data streams temporally consistent [36]. In the tracking process, we calculate the velocity of each feature point in the image, as illustrated in Fig. 4. Each image is aligned with the global iGPS measurement having the closest timestamp. It is noted that when the receivers capture laser signals from only one transmitter, or no signal at all, no timestamp or measurement point is generated.

Fig. 4. The online temporal alignment method. (a) The velocity of feature points matched by two consecutive images. (b) The reprojection process with and without time offset compensation. The blue lines represent the original visual observation, while the black solid lines define the observation after the compensation. The red dashed lines are the iGPS pose constraint for each camera pose, and the black dashed lines are observed compensation of feature points.

The camera motion from $\left \{ C_k \right \}$ to $\left \{ C_{k+1} \right \}$ over a short time can be approximated as uniform motion at a constant speed. A feature matched in two images at $\left ( u_{j}^{k},v_{j}^{k} \right )$ and $\left ( u_{j}^{k+1},v_{j}^{k+1} \right )$ can therefore be considered to have a constant velocity $\boldsymbol {V}_{j}^{k}$, which can be calculated by

$$\boldsymbol{V}_{j}^{k}=\left[ \begin{array}{c} u_{j}^{k+1}-u_{j}^{k}\\ v_{j}^{k+1}-v_{j}^{k}\\ \end{array} \right] /\left( t_{k+1}-t_k \right)$$

Since the camera poses are determined by visual observations and probe pose constraints, once a time offset exists, the visual and probe pose constraints become inconsistent, as shown in Fig. 4(b). We use the visual observation shifted by the time offset as the modified observation, which can be defined as

$$\boldsymbol{z}_{j}^{k}\left( t_d \right) =\left[ \begin{matrix} u_{j}^{k} & v_{j}^{k}\\ \end{matrix} \right] ^T+t_d\boldsymbol{V}_{j}^{k}$$
where $\boldsymbol {z}_{j}^{k}$ is the observation of feature $j$ in image $k$, and $t_d$ is the time offset for aligning the measurements. Through the joint optimization in Section 4.4, the time offset can be estimated, and the timestamps of the visual observations can be appropriately adjusted to match the iGPS constraints.
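A minimal sketch of Eqs. (5) and (6) is given below, assuming numpy arrays of matched pixel coordinates; in the actual joint optimization $t_d$ is a variable, so the compensation is applied inside the residual rather than as a fixed preprocessing step. The function names are illustrative.

```python
import numpy as np

def feature_velocity(uv_k, uv_k1, t_k, t_k1):
    """Eq. (5): constant pixel velocity of features matched across two consecutive frames.
    uv_k, uv_k1: (N, 2) pixel coordinates of the same features at times t_k and t_k1."""
    return (uv_k1 - uv_k) / (t_k1 - t_k)

def compensated_observation(uv_k, vel_k, t_d):
    """Eq. (6): shift the visual observations by the current time-offset estimate t_d."""
    return uv_k + t_d * vel_k
```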

4.3 iGPS-aided visual initialization

The initialization process of the traditional monocular solution is structure from motion (SFM) without metric scale. The up-to-scale poses and map points are established in the local camera frame $\left \{ C \right \}$ according to the epipolar constraint:

$$\boldsymbol{x}_2^{T}\boldsymbol{t}^{\land}\boldsymbol{R}\boldsymbol{x}_1=0$$
where $\boldsymbol {t}^{\land }$ indicates the skew-symmetric matrix corresponding to the vector $\boldsymbol {t}$. Therefore, the initial two camera positions can be defined as $\boldsymbol {t}_{c_0}^{c}= \boldsymbol {0}^{T}$ and $\boldsymbol {t}_{c_1}^{c}= \boldsymbol {t}$, respectively.

In order to unify the coordinate frames, the metric scale $s$ and the extrinsic parameters $\boldsymbol {T}_{c}^{w}$ need to be determined to transform the local map into the frame $\left \{ W \right \}$. We first estimate the metric scale using the initial two camera positions and the corresponding two global positions

$$s=\frac{\lVert \boldsymbol{t}_{c_1}^{w}-\boldsymbol{t}_{c_0}^{w} \rVert}{\lVert \boldsymbol{t}_{c_1}^{c}-\boldsymbol{t}_{c_0}^{c} \rVert} = \frac{\lVert \left( \boldsymbol{R}_{i_1}^{w}-\boldsymbol{R}_{i_0}^{w} \right) \boldsymbol{t}_{c}^{i}+\boldsymbol{t}_{i_1}^{w}-\boldsymbol{t}_{i_0}^{w} \rVert}{\lVert \boldsymbol{t}_{c_1}^{c} \rVert}$$
where $\boldsymbol {t}_{c_1}^{w}$ and $\boldsymbol {t}_{c_0}^{w}$ are the global camera positions deduced from the probe poses. Then, the global camera poses and map can be scaled. The transformation matrix from the frame $\left \{ C \right \}$ to the frame $\left \{ W \right \}$ is given by $\boldsymbol {T}_{c}^{w}=\boldsymbol {T}_{i_0}^{w} \boldsymbol {T}_{c}^{i}$.
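The scale recovery of Eq. (8) and the frame alignment can be sketched as follows, assuming 4x4 homogeneous probe poses for the first two keyframes and the up-to-scale monocular translation; the helper names are illustrative.

```python
import numpy as np

def metric_scale(T_i0_w, T_i1_w, t_c_i, t_c1_c):
    """Eq. (8): metric scale from the first two probe poses T_i0^w, T_i1^w (4x4),
    the probe-to-camera translation t_c^i, and the up-to-scale translation t_c1^c
    of the monocular SFM (with t_c0^c = 0)."""
    R0, t0 = T_i0_w[:3, :3], T_i0_w[:3, 3]
    R1, t1 = T_i1_w[:3, :3], T_i1_w[:3, 3]
    global_disp = (R1 - R0) @ t_c_i + t1 - t0      # camera displacement expressed in {W}
    return np.linalg.norm(global_disp) / np.linalg.norm(t_c1_c)

def local_to_world(T_i0_w, T_c_i):
    """T_c^w = T_i0^w T_c^i: transformation from the local camera frame {C} to {W}."""
    return T_i0_w @ T_c_i
```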

If the time offset between sensors is known, these parameters and the initial map can be further refined by using the global BA method with pose constraint. The accurate metric scale and extrinsic matrix are beneficial to avoid converging to a local minimum in the tracking procedure.

4.4 Global pose optimization for visual-iGPS

In visual-iGPS SLAM, the system state estimation problem can be formulated under a probabilistic framework and be inferred by maximizing the posterior probability given all the measurements, which can be outlined as the factor graph shown in Fig. 5. The maximum a posteriori (MAP) estimation problem seeks the most likely system state $\mathcal {X}$ that best matches all measurements:

$$\begin{aligned} \mathcal{X}^{*} & = \underset{\mathcal{X}}{\text{argmax}}p(\mathcal{X} | \mathbf{z})\\ & \propto\underset{\mathcal{X}}{\text{argmax}}p(\mathcal{X})p\left(\mathbf{z}|\mathcal{X} \right)\\ & =\underset{\mathcal{X}}{\text{argmax}}p(\mathcal{X})\prod_{i=1}^n p\left(\mathbf{z}_i | \mathcal{X}\right) \end{aligned}$$
where $\mathbf {z}$ represents all sensor measurements. The estimated state $\mathcal {X}$ includes the camera poses $\mathbf {x_k}$ and map points $l_n$, along with the time offset parameter $t_d$, and is defined as:
$$\mathcal{X}=\left\{ \mathbf{x_0},\mathbf{x_1} \ldots \mathbf{x_m}, l_0,l_1 \ldots l_n ,t_d \right\}$$
$$\mathbf{x_k}=\left\{ \boldsymbol{R}_{c_k}^{w},\boldsymbol{t}_{c_k}^{w} \right\},k \in[0, m]$$

Fig. 5. Factor graph of the proposed approaches. The colored circles represent the camera pose nodes $x_m$, landmark nodes $l_n$, iGPS measurements node $\mathcal {G}$ and time offset node $\mathcal {T}$, respectively. The black circles denote measurement factors (pose $g$ and visual reprojection $\pi$) which are connected to nodes.

Assuming that all noises are Gaussian distributed and independent of each other, the maximum a posteriori (MAP) inference is equivalent to solving a non-linear least-squares optimization problem:

$$\begin{aligned} \mathcal{X}^{*} & =\underset{\mathcal{X}}{\text{argmin}}-\log \prod_{i=1}^N p\left(\mathbf{z}_i | \mathcal{X}\right)\\ & =\underset{\mathcal{X}}{\text{argmin}}\left( \sum_{k=0}^{m}\sum_{j=0}^{n}{\rho\left( \lVert \boldsymbol{e}_{j}^{k} \rVert _{\Sigma _v^{{-}1}}^{2} \right)}+\sum_{k=0}^{m}{\rho\left( \lVert \boldsymbol{e}^k_p \rVert _{\Sigma _p^{{-}1}}^{2} \right)}+\sum_{k=0}^{m}{\rho\left( \lVert \boldsymbol{e}^k_o \rVert _{\Sigma _o^{{-}1}}^{2} \right)} \right) \end{aligned}$$
where the robust Huber kernel function $\rho \left (\cdot \right )$ is utilized to diminish the effect of outliers [46], such as mismatches in the visual reprojection and wrong probe poses caused by fast motion and measurement instability. The prior $p\left (\mathcal {X}\right )$ is ignored because there is no prior information about the system state. The Mahalanobis norm is used to normalize the residuals. $\varSigma _v^{-1}$, $\varSigma _p^{-1}$ and $\varSigma _o^{-1}$ are the inverses of the covariance matrices of the visual reprojection, position, and attitude residuals, respectively. Eq. (12) is an objective function containing the following components (a brief sketch of both residual terms is given after this list):
  • 1. visual reprojection residual term with feature velocity.
    $$\boldsymbol{e}_{j}^{k}=\boldsymbol{z}_{j}^{k}\left( t_d \right) -\pi \left( \boldsymbol{R}_{w}^{c_k}\boldsymbol{P}_{j}^{w}+\boldsymbol{t}_{w}^{c_k} \right)$$
    where $\pi \left ( \cdot \right )$ is the reprojection function of the camera model, which projects a 3D map point into the image. $\boldsymbol {z}_{j}^{k}\left ( t_d \right )$ is the modified visual observation obtained by Eq. (6).
  • 2. pose residual term which includes position residual and attitude residual.
    $$\left\{ \begin{array}{c} \boldsymbol{e}_p=\boldsymbol{\hat{t}}_{c_k}^{w}-\left(\boldsymbol{R}_{i_k}^{w}\boldsymbol{t}_{c}^{i}+\boldsymbol{t}_{i_k}^{w} \right)\\ \boldsymbol{e}_a=\left( \boldsymbol{\hat{q}}_{c_k}^{w} \right) ^{{-}1}\otimes \left(\boldsymbol{q}_{i_k}^{w}\boldsymbol{q}_{c}^{i}\right)\\ \end{array} \right.$$
    where $\boldsymbol {\hat {q}}_{c_k}^{w}$ and $\boldsymbol {\hat {t}}_{c_k}^{w}$ are the estimated global camera rotation and position. This term significantly eliminates scale drift and avoids inaccurate positioning caused by poor image quality and wrong feature matching. By minimizing the residuals of each error term, accurate global states can be calculated.
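A minimal sketch of the two residual terms in Eqs. (13) and (14) follows, assuming numpy/scipy and a pinhole intrinsic matrix K; the attitude residual is expressed here as the rotation vector of the error quaternion rather than the quaternion itself, and the information-matrix weighting of Eq. (12) is omitted for brevity.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def reprojection_residual(uv_obs, vel, t_d, R_w_c, t_w_c, p_w, K):
    """Eq. (13): reprojection residual with the time-offset-compensated observation.
    uv_obs, vel: observation and pixel velocity of one feature; K: 3x3 intrinsics."""
    p_c = R_w_c @ p_w + t_w_c                 # map point expressed in the camera frame
    uv_proj = (K @ p_c)[:2] / p_c[2]          # pi(.): perspective projection
    return (uv_obs + t_d * vel) - uv_proj

def pose_residual(R_hat_c_w, t_hat_c_w, R_i_w, t_i_w, R_c_i, t_c_i):
    """Eq. (14): position and attitude residuals against the iGPS-derived camera pose."""
    e_p = t_hat_c_w - (R_i_w @ t_c_i + t_i_w)
    R_err = Rotation.from_matrix(R_hat_c_w).inv() * Rotation.from_matrix(R_i_w @ R_c_i)
    e_a = R_err.as_rotvec()                   # rotation-vector form of the error quaternion
    return e_p, e_a
```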

4.5 PnP with additional pose constraints

The optimization method proposed in the last section needs initial seeds to converge to accurate results and satisfy real-time requirements. On the other hand, even though the poses and map are well corrected after global pose optimization, accumulated errors are still inevitable due to few image features and mismatching, which lead to outliers in robot localization. Thus, a pose constraint is introduced into the PnP algorithm to improve the localization accuracy. The localization problem at time $k$ is simplified as

$$\mathbf{x_k}^{*}= \underset{\mathbf{x_k}}{\text{argmin}}\left( \sum_{j=0}^{n}{\rho\left( \lVert \boldsymbol{e}_{j}^{k} \rVert _{\Sigma _v^{{-}1}}^{2} \right)}+{\rho\left( \lVert \boldsymbol{e}^k_p \rVert _{\Sigma _p^{{-}1}}^{2} \right)}+{\rho\left( \lVert \boldsymbol{e}^k_o \rVert _{\Sigma _o^{{-}1}}^{2} \right)} \right)$$

The reprojection residual term in Eq. (13) is also simplified without time offset estimation

$$\boldsymbol{e}_{j}^{k}=\boldsymbol{z}_{j}^{k}-\pi \left( \boldsymbol{R}_{w}^{c_k}\boldsymbol{P}_{j}^{w}+\boldsymbol{t}_{w}^{c_k} \right)$$

This optimization procedure not only obtains an optimal estimate of the current state but also provides good initial seeds for the global BA optimization.
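As an illustration of Eq. (15), the single-frame refinement can be written as a small robust least-squares problem; the sketch below assumes scipy's least_squares with a Huber loss, collapses the covariance weighting into a single scalar weight, and uses illustrative names throughout, so it is not the authors' implementation.

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def pnp_with_pose_constraint(uv_obs, p_w, K, T_prior_c_w, w_pose=1.0):
    """Refine a single camera pose by mixing reprojection residuals with a
    position/attitude prior derived from the iGPS probe (Eq. (15), no time offset).
    uv_obs: (N, 2) observations, p_w: (N, 3) map points, T_prior_c_w: 4x4 prior pose."""
    def residuals(x):
        R_c_w = Rotation.from_rotvec(x[:3])
        t_c_w = x[3:6]
        R_w_c = R_c_w.inv().as_matrix()
        t_w_c = -R_w_c @ t_c_w
        p_c = p_w @ R_w_c.T + t_w_c                          # points in the camera frame
        uv_proj = (p_c @ K.T)[:, :2] / p_c[:, 2:3]           # pinhole projection
        e_v = (uv_obs - uv_proj).ravel()                     # visual term
        e_p = t_c_w - T_prior_c_w[:3, 3]                     # position term
        R_err = R_c_w.inv() * Rotation.from_matrix(T_prior_c_w[:3, :3])
        e_o = R_err.as_rotvec()                              # attitude term
        return np.concatenate([e_v, w_pose * e_p, w_pose * e_o])

    x0 = np.concatenate([Rotation.from_matrix(T_prior_c_w[:3, :3]).as_rotvec(),
                         T_prior_c_w[:3, 3]])                # seed from the iGPS-derived pose
    sol = least_squares(residuals, x0, loss='huber', f_scale=1.0)
    return Rotation.from_rotvec(sol.x[:3]).as_matrix(), sol.x[3:6]   # refined R_c^w, t_c^w
```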

4.6 Robustness to tracking loss

In practice, camera occlusion, few image features, and fast motion can cause the monocular system to lose tracking. For short-term loss, the current camera pose estimated from the iGPS measurements is used to reproject all map points and recover tracking and localization. For long-term loss, new maps are created and initialized. Owing to the unified world coordinate frame, all maps are represented in the same reference frame, which avoids map separation.

Moreover, the iGPS tracking system also produces outliers or missing data due to occlusion, fast motion, and large incidence angles of the laser light at the receivers. In the absence of iGPS data, vision-only global pose estimation is conducted until iGPS data are captured again. Thus, our visual-iGPS system is more robust in practice than a pure visual SLAM or a pure iGPS tracking system.

5. Experimental results

This section presents the experimental results of the proposed system. A series of experiments are carried out to estimate trajectories and benchmark how the addition of the iGPS contributes to the results. In terms of metrics, we adopt the absolute trajectory error (ATE) and the absolute pose error (APE) to evaluate position and orientation quantitatively [47]. Our algorithm is evaluated on both public and real-world datasets, which are split into:

  • 1. The indoor dataset EuRoC [48]: 5 sequences in industrial scenario (machine hall) based on MAV platform.
  • 2. The more challenging real-life experiments: 6 sequences where a hand-held camera completes several circuits around an indoor industrial environment.
Then, we compare the performance of the proposed method with state-of-the-art monocular methods. The estimated results are aligned with the ground truth (GT) using an SE(3) transformation for the proposed method and a Sim(3) transformation for the monocular methods.
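For reference, the evaluation metric can be sketched as below: an Umeyama alignment of the estimated positions to the GT, with the scale fixed to 1 for SE(3) (the proposed method) or estimated for Sim(3) (monocular baselines), followed by the RMS ATE. This is only an illustration of the metric, assuming numpy and time-associated position pairs, not the evaluation tool actually used.

```python
import numpy as np

def rms_ate(est, gt, with_scale=False):
    """RMS ATE after Umeyama alignment of estimated positions to ground truth.
    est, gt: (N, 3) associated positions; with_scale=True gives Sim(3)-style alignment."""
    mu_e, mu_g = est.mean(axis=0), gt.mean(axis=0)
    E, G = est - mu_e, gt - mu_g
    U, S, Vt = np.linalg.svd(E.T @ G / len(est))
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T                                   # rotation aligning est to gt
    s = np.trace(np.diag(S) @ D) / np.mean(np.sum(E**2, axis=1)) if with_scale else 1.0
    t = mu_g - s * R @ mu_e
    err = gt - (s * (est @ R.T) + t)
    return np.sqrt(np.mean(np.sum(err**2, axis=1)))      # RMS ATE
```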

5.1 EuRoC machine hall datasets

The EuRoC machine hall datasets were recorded for the aerial inspection of industrial facilities and contain stereo images, IMU measurements, and GT recorded by a Leica total station. The experiments are performed using the monocular images from the left camera at 20Hz. Due to the lack of real iGPS measurements, one virtual iGPS transmitter is simulated at the origin of the frame $\left \{ W \right \}$. Then, we simulate probe measurements according to the GT, corrupting them with zero-mean Gaussian noise with standard deviations of 8.66mm (5mm in each direction) in translation and 0.1° in rotation. The sequence name prefix is MH_, and the difficulty increases with the trailing number. A total of 5 sequences are tested, and the results are compared with the relevant state-of-the-art visual SLAM systems DSO [49] and ORB-SLAM3 in Table 1.

Table 1. Performance comparison of translation in the EuRoC Dataset (RMS ATE in mm). All the results in the trajectory are compared with the GT.

We perform 10 executions for each sequence. In terms of algorithm execution time, DSO takes the longest. Our method meets real-time requirements, just like ORB-SLAM3, as shown in Table 2. Note that the execution times are recorded by running all experiments on a desktop PC with an Intel i9-10900F at 2.80 GHz and 32 GB of memory. In terms of accuracy, the RMS ATE of the trajectories shows that the proposed method more than doubles the accuracy of DSO and ORB-SLAM3. In the simplest MH_01 sequence, our method improves the current best result from 19.16mm to 5.87mm. In difficult sequences with fast motion and dark scenes, DSO and ORB-SLAM3 accumulate significant errors due to low-quality visual observations. Although the errors of the proposed method also increase with the difficulty of the sequences, they all remain within an acceptable range. Moreover, the proposed method also outperforms the others in terms of result stability, which demonstrates the good performance of the system.

Table 2. The algorithm execution times (second) of EuRoC Machine Hall Datasets.

From the experiments, we can find that the proposed method is a real-time high-accuracy SLAM system, which is superior to state-of-the-art visual systems in terms of accuracy and also has good robustness.

5.2 Real-life experiments

In order to test the algorithm performance in real-life industrial applications, we used the monocular camera and iGPS sensors to collect all the measurements, as shown in Fig. 6. The hardware consists of an IDS UI-5280 GigE industrial camera (resolution: 2456 $\times$ 2054 pixels) that records monocular images at 15Hz, 4 iGPS transmitters, and a probe that outputs the coordinates of 6 channels at 30Hz. GT is provided by a Leica T-Mac, which outputs 6D poses at 100Hz in the world frame and measures at distances up to 30 meters.

Fig. 6. The sensor system that was used to record sensor measurements in real-life experiments. (a) The iGPS transmitters and Leica AT960. (b) The camera, iGPS 6D sensor and Leica T-Mac.

We performed 6 real-life experiments based on monocular images and iGPS measurements to test how the proposed system deals with different scenarios and how it compares with state-of-the-art visual approaches. A few sequences in the datasets do not contain loops, so the accumulated drift can be assessed. The camera trajectories of representative hand-held sequences are recorded in an 8 m $\times$ 6 m $\times$ 5 m indoor working volume, as shown in Fig. 7. The lengths of the trajectories are about 22.9m (130.8s), 31.9m (154.7s), 15.2m (64.2s), 68.1m (227.3s), 36.6m (173.0s), and 51.1m (223.1s). Thus, the average speed is nearly 0.232 m/s.

Fig. 7. Representative hand-held camera trajectory in real experiments (Ground Truth).

For the EuRoC dataset, the features in the images are evenly distributed, numerous, and rarely occluded. However, the real dataset we collected is much more challenging. Firstly, due to the hand-held camera, the feature points extracted from the images are concentrated in the middle of the image, and only a small portion can be extracted from the ceiling, so the feature points are unevenly distributed. Then, for real application needs, workers have to move close to some instruments rather than keep a certain distance as in the EuRoC datasets, so fewer image features are available, which easily leads to tracking loss and trajectory drift. Finally, our scene contains many weakly textured areas, such as white walls, which also easily cause tracking loss due to the scarcity of feature points.

By comparing the RMS ATE of the trajectories in Table 3, we found that the existing state-of-the-art visual methods cannot maintain good results in this challenging industrial scenario. Only in the simplest Lab_01 sequence do DSO and ORB-SLAM3 both achieve about 3.1cm, because it only involves forward and backward movement within the same local area. For DSO, Lab_02 and Lab_04 suffer large accumulated drift due to tracking loss, and the trajectory errors of the remaining sequences all exceed 50cm. The results of ORB-SLAM3 are significantly better than those of DSO, but the trajectory errors in the Lab_03, Lab_05, and Lab_06 sequences all exceed 10cm. The RMS ATE of our proposed method reaches 11.59mm only in Lab_04, while the other sequences achieve localization within 7mm. For the Lab_01 sequence, the proposed method also reduces the trajectory error from 31.20mm to 5.85mm. Moreover, Table 4 shows that the proposed method is better than ORB-SLAM3 and more than doubles the accuracy of DSO in terms of RPE. iGPS is as good as our proposed method in terms of ATE and RPE, but it often fails to obtain measurement data due to occlusion and fast receiver motion. We use the term "trajectory integrity" to describe the completeness of tracking; the integrity of the iGPS trajectories is about 70%, while that of the other methods is 100%. Thus, our proposed method is superior to the iGPS measurement method in terms of robustness. Moreover, limited by its rotary mechanical structure, the iGPS works at a limited sampling frequency (less than 30Hz), and the synchronization between multiple transmitters and the dynamic performance are seriously restricted. When the receiver moves fast, substantial dynamic measurement errors occur, and the overall accuracy is further reduced.

Table 3. Localization results on the real-life experiments (RMS ATE in mm).

Table 4. Performance comparison in the real-life experiments (RMS RPE in degree).

To compare the performance in more detail, the trajectories estimated for the representative sequence Lab_06 are illustrated in Fig. 8. DSO is a visual odometry system without long-term data association and loop closures, so its accumulated drift is significant. ORB-SLAM3 does not even perform as well as many traditional positioning techniques and therefore cannot meet the needs of practical applications. The comparison with these results further demonstrates the superior performance of the proposed method.

Fig. 8. The trajectory of representative sequence Lab06 in real experiments.

6. Conclusion and future work

In this paper, we propose a joint optimization approach that fuses visual and iGPS probe measurements to obtain optimal global pose estimates and map reconstruction in a unified world frame, as well as the time offset between sensors. The tightly-coupled visual-iGPS sensor fusion approach fulfills the application requirement of millimeter-level pose estimation and mapping in unstructured indoor industrial environments. Our approach makes the scale observable and significantly eliminates accumulated drift. More importantly, the perceptual ability is improved by introducing visual observations into the traditional field of large-volume metrology. The experimental results show that the average trajectory error is reduced from 46.40mm to 9.61mm on the public EuRoC dataset and from 99.04mm to 6.42mm in the more challenging real-life experiments, which fully proves that the proposed system outperforms the state-of-the-art monocular visual SLAM methods in terms of accuracy and has good applicability and robustness in industrial scenarios.

The accuracy of the proposed method is still affected by poor dynamic performance associated with iGPS when receivers move fast. In the future, we will investigate how to accurately fuse image features and a scanning optical signal using only one transmitter station rather than two or more. An additional aspect for future research will be to improve the positioning robustness in a weak texture scene by fusing deep learning features.

Funding

National Natural Science Foundation of China (52075382, 51835007, 52127810).

Disclosures

The authors declare that they have no conflicts of interest related to this work.

Data Availability

Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request.

References

1. R. H. Schmitt, W. Kimmelmann, S. Quinders, C. Storm, and F. Oberhansberg, “Investigation of the applicability of an igps metrology system to automate the assembly in motion of a truck cabin,” in Robotics and Automated Production Lines, vol. 840 of Applied Mechanics and Materials (Trans Tech Publications Ltd, 2016), pp. 58–65.

2. R. Schmitt, A. Schoenberg, and B. Damm, “Indoor-gps based robots as a key technology for versatile production,” in ISR 2010 (41st International Symposium on Robotics) and ROBOTIK 2010 (6th German Conference on Robotics), (2010), pp. 1–7.

3. J. Tiemann, F. Schweikowski, and C. Wietfeld, “Design of an uwb indoor-positioning system for uav navigation in gnss-denied environments,” in 2015 International Conference on Indoor Positioning and Indoor Navigation (IPIN), (2015), pp. 1–7.

4. D. Gualda, J. M. Villadangos, J. Ureña, A. R. J. Ruiz, F. S. Granja, and Á. Hernández, “Indoor positioning in large environments: ultrasonic and uwb technologies,” in IPIN, (2019).

5. S. J. Kim and B. K. Kim, “Accurate hybrid global self-localization algorithm for indoor mobile robots with two-dimensional isotropic ultrasonic receivers,” IEEE Trans. Instrum. Meas. 60(10), 3391–3404 (2011). [CrossRef]  

6. J. Qi and G.-P. Liu, “A robust high-accuracy ultrasound indoor positioning system based on a wireless sensor network,” Sensors 17(11), 2554 (2017). [CrossRef]  

7. N. El-Sheimy and Y. Li, “Indoor navigation: state of the art and future trends,” Satell. Navig. 2(1), 7–23 (2021). [CrossRef]  

8. R. Liu, J. Wang, and B. Zhang, “High definition map for automated driving: Overview and analysis,” J. Navig. 73(2), 324–341 (2020). [CrossRef]  

9. H. Strasdat, J. M. M. Montiel, and A. J. Davison, “Scale drift-aware large scale monocular slam,” in In Proceedings of Robotics: Science and Systems, (2010).

10. Z. Huang, J. Zhu, L. Yang, B. Xue, J. Wu, and Z. Zhao, “Accurate 3-d position and orientation method for indoor mobile robot navigation based on photoelectric scanning,” IEEE Trans. Instrum. Meas. 64(9), 2518–2529 (2015). [CrossRef]  

11. R. Mur-Artal, J. M. M. Montiel, and J. D. Tardós, “ORB-SLAM: A versatile and accurate monocular slam system,” IEEE Trans. Robot. 31(5), 1147–1163 (2015). [CrossRef]  

12. C. Forster, Z. Zhang, M. Gassner, M. Werlberger, and D. Scaramuzza, “Svo: Semidirect visual odometry for monocular and multicamera systems,” IEEE Trans. Robot. 33(2), 249–265 (2017). [CrossRef]  

13. Z. Wang, K. Yang, H. Shi, P. Li, F. Gao, and K. Wang, “Lf-vio: A visual-inertial-odometry framework for large field-of-view cameras with negative plane,” in 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), (2022), pp. 4423–4430.

14. X. Lang, J. Lv, J. Huang, Y. Ma, Y. Liu, and X. Zuo, “Ctrl-vio: Continuous-time visual-inertial odometry for rolling shutter cameras,” arXiv, 2208.12008 (2022). [CrossRef]  

15. M. Serviéres, V. Renaudin, A. Dupuis, and N. Antigny, “Visual and visual-inertial slam: State of the art, classification, and experimental benchmarking,” J. Sens. 2021, 1–26 (2021). [CrossRef]  

16. H. Chen, K. Wang, W. Hu, K. Yang, R. Cheng, X. Huang, and J. Bai, “Palvo: visual odometry based on panoramic annular lens,” Opt. Express 27(17), 24481–24497 (2019). [CrossRef]  

17. D. Wang, J. Wang, Y. Tian, K. Hu, and M. Xu, “Pal-slam: a feature-based slam system for a panoramic annular lens,” Opt. Express 30(2), 1099–1113 (2022). [CrossRef]  

18. A. I. Mourikis and S. I. Roumeliotis, “A multi-state constraint kalman filter for vision-aided inertial navigation,” in Proceedings 2007 IEEE International Conference on Robotics and Automation, (2007), pp. 3565–3572.

19. M. Li and A. Mourikis, “High-precision, consistent ekf-based visual-inertial odometry,” The Int. J. Robotics Res. 32(6), 690–711 (2013). [CrossRef]  

20. S. Leutenegger, S. Lynen, M. Bosse, R. Siegwart, and P. Furgale, “Keyframe-based visual-inertial odometry using nonlinear optimization,” The Int. J. Robotics Res. 34(3), 314–334 (2015). [CrossRef]  

21. C. Campos, R. Elvira, J. J. G. Rodríguez, J. M. M. Montiel, and J. D. Tardós, “ORB-SLAM3: An accurate open-source library for visual, visual-inertial, and multimap SLAM,” IEEE Trans. Robot. 37(6), 1874–1890 (2021). [CrossRef]  

22. R. Mascaro, L. Teixeira, T. Hinzmann, R. Siegwart, and M. Chli, “Gomsf: Graph-optimization based multi-sensor fusion for robust uav pose estimation,” in 2018 IEEE International Conference on Robotics and Automation (ICRA), (2018), pp. 1421–1428.

23. T. Qin, S. Cao, J. Pan, and S. Shen, “A general optimization-based framework for global pose estimation with multiple sensors,” arXiv, 1901.03642 (2019). [CrossRef]  

24. S. Cao, X. Lu, and S. Shen, “Gvins: Tightly coupled gnss-visual-inertial fusion for smooth and consistent state estimation,” IEEE Trans. Robot. 38(4), 2004–2021 (2022). [CrossRef]  

25. B. Han, Z. Xiao, S. Huang, and T. Zhang, “Multi-layer vi-gnss global positioning framework with numerical solution aided map initialization,” arXiv, 2201.01561 (2022). [CrossRef]  

26. J. Kaiser, A. Martinelli, F. Fontana, and D. Scaramuzza, “Simultaneous state initialization and gyroscope bias calibration in visual inertial aided navigation,” IEEE Robot. Autom. Lett. 2(1), 18–25 (2017). [CrossRef]  

27. P. An, Y. Gao, T. Ma, K. Yu, B. Fang, J. Zhang, and J. Ma, “Lidar-camera system extrinsic calibration by establishing virtual point correspondences from pseudo calibration objects,” Opt. Express 28(12), 18261–18282 (2020). [CrossRef]  

28. P. An, T. Ma, K. Yu, B. Fang, J. Zhang, W. Fu, and J. Ma, “Geometric calibration for lidar-camera system fusing 3d-2d and 3d-3d point correspondences,” Opt. Express 28(2), 2122–2141 (2020). [CrossRef]  

29. Z. Lai, Y. Wang, S. Guo, X. Meng, J. Li, W. Li, and S. Han, “Laser reflectance feature assisted accurate extrinsic calibration for non-repetitive scanning lidar and camera systems,” Opt. Express 30(10), 16242–16263 (2022). [CrossRef]  

30. P. Furgale, J. Rehder, and R. Siegwart, “Unified temporal and spatial calibration for multi-sensor systems,” in 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems, (2013), pp. 1280–1286.

31. J. Rehder, J. Nikolic, T. Schneider, T. Hinzmann, and R. Siegwart, “Extending kalibr: Calibrating the extrinsics of multiple imus and of individual axes,” in 2016 IEEE International Conference on Robotics and Automation (ICRA), (2016), pp. 4304–4311.

32. K. Qiu, T. Qin, J. Pan, S. Liu, and S. Shen, “Real-time temporal and rotational calibration of heterogeneous sensors using motion correlation analysis,” IEEE Trans. Robot. 37(2), 587–602 (2021). [CrossRef]  

33. J. Kelly and G. S. Sukhatme, “Visual-inertial simultaneous localization, mapping and sensor-to-sensor self-calibration,” in 2009 IEEE International Symposium on Computational Intelligence in Robotics and Automation - (CIRA), (2009), pp. 360–368.

34. M. Li and A. I. Mourikis, “Online temporal calibration for camera-imu systems: Theory and algorithms,” Int. J. Robotics Res. 33(7), 947–964 (2014). [CrossRef]  

35. P. Geneva, K. Eckenhoff, W. Lee, Y. Yang, and G. Huang, “Openvins: A research platform for visual-inertial estimation,” in 2020 IEEE International Conference on Robotics and Automation (ICRA), (2020), pp. 4666–4672.

36. T. Qin and S. Shen, “Online temporal calibration for monocular visual-inertial systems,” in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), (2018), pp. 3662–3669.

37. Z. Liu, J. Zhu, L. Yang, S. Ye, J. Wu, and B. Xue, “Real-time position and orientation measurement with occlusion handling for distributed optical large-scale metrology systems,” Opt. Eng. 52(11), 114101 (2013). [CrossRef]  

38. F. Franceschini, M. Galetto, D. Maisano, L. Mastrogiacomo, and B. Pralio, “Distributed large-scale dimensional metrology. New insights,” (2011).

39. L. Stadelmann, T. Sandy, A. Thoma, and J. Buchli, “End-effector pose correction for versatile large-scale multi-robotic systems,” IEEE Robot. Autom. Lett. 4(2), 546–553 (2019). [CrossRef]  

40. Z. Huang, L. Yang, Y. Zhang, Y. Guo, Y. Ren, J. Lin, and J. Zhu, “Photoelectric scanning-based method for positioning omnidirectional automatic guided vehicle,” Opt. Eng. 55(3), 034105 (2016). [CrossRef]  

41. V. Lepetit, F. Moreno-Noguer, and P. Fua, “Epnp: An accurate o(n) solution to the pnp problem,” Int. J. Comput. Vis. 81(2), 155–166 (2009). [CrossRef]  

42. S. Umeyama, “Least-squares estimation of transformation parameters between two point patterns,” IEEE Trans. Pattern Anal. Machine Intell. 13(4), 376–380 (1991). [CrossRef]  

43. D. Eggert, A. Lorusso, and R. Fisher, “Estimating 3-d rigid body transformations: A comparison of four major algorithms,” Mach. Vis. Appl. 9(5-6), 272–290 (1997). [CrossRef]  

44. R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision, 2nd ed (Cambridge University Press, 2004).

45. R. Ishikawa, T. Oishi, and K. Ikeuchi, “Lidar and camera calibration using motions estimated by sensor fusion odometry,” in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), (2018), pp. 7342–7349.

46. J. T. Barron, “A general and adaptive robust loss function,” in 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), (2019), pp. 4326–4334.

47. J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers, “A benchmark for the evaluation of rgb-d slam systems,” in 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, (2012), pp. 573–580.

48. M. Burri, J. Nikolic, P. Gohl, T. Schneider, J. Rehder, S. Omari, M. W. Achtelik, and R. Siegwart, “The EuRoC micro aerial vehicle datasets,” The Int. J. Robotics Res. 35(10), 1157–1163 (2016). [CrossRef]  

49. J. Engel, V. Koltun, and D. Cremers, “Direct sparse odometry,” IEEE Trans. Pattern Anal. Mach. Intell. 40(3), 611–625 (2018). [CrossRef]  
