Feature matching for texture-less endoscopy images via superpixel vector field consistency

Shiyuan Liu; Jingfan Fan; Jingfan Fan; Danni Ai; Hong Song; Hong Song; Tianyu Fu; Yongtian Wang; Yongtian Wang; Jian Yang

doi:10.1364/BOE.450259

1. Introduction

Endoscopy is widely used in minimally invasive surgery (MIS). Depending on experience, doctors estimate the spatial relationship of the surgical environment and measure the distance between surgical instruments and the operating surface [1]. However, the tunnel vision in the endoscopy image may require surgeons to perform multiple observations to obtain information about the same scene, which increases the duration and the risk of the operation [2]. In addition, the lack of depth information in the 2D image makes it difficult for doctors to accurately grasp the movement posture of surgical instruments [3]. Therefore, finding the correct feature matching from the continuous frame endoscopy images plays a pivotal role for doctors to recognize the 3D structure during the operation. A considerable amount of work has been made on the 3D reconstruction of the surgical environment in MIS [4]. Most of this work can be divided into two categories: the active structured light projection method and the passive visual feature method. Under the active method framework, the specific light signal is projected to the tissue surface, which is collected by a camera. The entire 3D space restoration is based on calculating the position and depth information of the object determined by the changes of the light signal. Researchers designed a structured light with three monochromatic modes to scan the surface of the abdominal cavity. This method requires accurate calibration of the position of the monocular camera and the light source, but the calibrated system needs to remain fixed during the entire reconstruction process. The surface structure of the object can be accurately calculated by the posture relationship between the light source and the imaging sensor [4]. However, the hardware instruments of the structured light projection cannot be passed due to the narrow path of the real surgical environment. The passive visual feature method relies on visual tracking of moving images, which is called structure from motion (SfM) or simultaneous localization and mapping (SLAM) [5]. The 3D position information of feature points obtained by the principle of triangulation and tracking the feature points in the stereo image. By using the characteristics, [6] proposed a singularity optimization method for endoscopy image matching. [7] obtained the posture information of an instrument through multi-level feature aggregation. Optical flow technology was used to recover the depth information of the surface structure from the 2D endoscopy images [8]. This method applies the motion hierarchy structure to the motion estimation process of endoscopy and generates the 3D structure by calculating the depth information. By contrast, the visual feature-based method is more flexible in adapting to the surgical environment.

The cold light source that comes with endoscopy can reduce the contrast detail of the internal environment, which impairs the performance of the visual feature point detection method [9]. Consequently, the performance of feature point detection can be improved by enhancing the contrast of a single image [10]. The study of verified that the denoising and grey space conversion of the image can improve the reconstruction of dense 3D point clouds [11]. Furthermore, [12] proposed a method to highlight the detailed information of the target object with texture-less images. Given that the image may contrast distortion as enhanced texture, [13] proposed a hue mapping method to adaptively amplify local contrast. At the same time, [14] proposed an intensity histogram equalization method to preserve color information as much as possible. Therefore, researchers usually adopt enhanced quality to solve texture-less images. [15] focused on detail improvement in the dark region of the image to improve the 3D reconstruction effect of objects. In addition, the number of feature points and the matching accuracy are affected dramatically by the specular reflections and texture-less regions in smooth and moist tissue surfaces [16]. These factors eventually lead to voids and topological errors in the point cloud, which are insufficient to support the reconstruction of the 3D environment [17]. In general, increasing the number of feature points extracted and enhancing the matching performance of feature points can make the 3D model closer to the real scene [18]. At present, the existing feature extraction and matching algorithms can help generate 3D environment models with high accuracy from the data with textured images. In practice, texture-less soft tissues make feature detection and matching extremely challenging. Obtaining the 3D surface structure of the patient without significant visual features is one of the major sticking points in MIS.

In recent years, the image reconstruction method is a hotspot in 3D vision system research. However, surface reconstruction of texture-less endoscopy images remains a largely underexplored domain. The main challenge faced by many 3D reconstructions of soft tissue surfaces is the accurate and robust extraction and matching of feature points. For example, [19–20] extracts the contour information of the object in the target image for the feature matching. [21–23] took full advantage of image gradient changes to find feature points. In reaction to the scale, rotation or affine change of the image, high-dimensional features such as [24–26] with more robustness are presented. In addition, the computation speed is emphasized in many application environments. [27–28] used a descriptor set method to achieve fast detection and matching. [29–31] proposed a binary representation feature descriptor. Fast feature matching is achieved through Hamming distance [32–33].

The feature matching algorithm finds the accurate correspondence between the sparse distribution of feature points in two images [34]. The quality of feature matching mainly increases with the improvement of the salience of features [35]. The descriptor vector of information storing geometric information of each key point (e.g. position, scale and rotation) [36]. Thereafter, a set of initial matches is determined by the similarity of feature points in the nearest neighbor distance. However, these matches may contain a large number of outliers. Although the strict outlier constraint can reduce the number of false matches, it diminishes a greater number of true matches [37]. Currently, probabilistic methods with the maximum likelihood framework can construct the correspondence matrix between matches and eliminate the negative influence of missing partial true matches [38]. Moreover, performs well for rigid movements with the weighting optimization method, they are not robust enough to estimate deformation or viewpoint changes and usually obtain few matches and suffer from large computational complexity [39]. The Grid-based motion statistics correspondence (GMS) algorithm can distinguish between right and wrong matching more efficiently in the coherent constrained feature point dense region [40].

In the present study, we proposed a neighborhood consensus-based method for texture-less endoscopy image feature matching. By checking the motion difference of superpixel space coherence against the estimated geometric field model in consecutive frames, we distinguished the underlying correspondences into true and false matches. Then, we retained the true matches for the reconstruction of the surface. The contributions of this study are threefold. Firstly, we constructed an adaptive gradient-preserving constraint condition with a texture-less endoscopy image degradation model, which is robust to most texture-less endoscopy images. Secondly, the local motion similarity strategy was introduced, which converts the comparison of feature descriptors into superpixel information entropy matching and obtains the initial matching set. Thirdly, we estimated the geometric feature motion consistency in the initial matching set. The underfitting of the local sports field was optimized by extending the 2D spatial domain with geometric features. A series of evaluations was conducted on different types of endoscopy images. Our method can obtain a greater number of feature points and more accurate matches in texture-less endoscopy images compared with other methods.

2. Methods

The proposed feature matching for the texture-less endoscopy image process includes three major steps: enhancing the texture-less endoscopy image, spatial motion constraint and geometric motion feature smoothing, as shown in Fig. 1. For enhancing the texture-less endoscopy image, we eliminated the negative effect of spatial motion on the matching algorithm with specular region detection method. Then, we increased the number of visual feature point extraction with an adaptive gradient-preserving method. For the spatial motion constraint, the consecutive frame image features were detected with the oriented fast and rotated brief (ORB) algorithm to obtain a hypothetical match. We constructed the spatial constraints according to the information entropy similarity between the continuous frame superpixel blocks of the hypothetical matches. For the geometric motion feature smoothing, the local subtle motions in continuous frame image feature matching were preserved with the geometric motion field. The exact final matches were determined by maintaining the consistency of local feature motion.

Fig. 1. Flowchart of the proposed neighborhood geometric consistency-based texture-less endoscopy image matching method. The major steps are in the bold line rectangles.

Download Full Size | PDF

2.1 Texture-less image adaptive gradient-preserving

The intraoperative images obtained show tissues of human organs, which may contain specular highlights as bright spots due to light reflection on the glossy surface. Specular reflection may lead to vision perception error, and it may affect the feature point extraction of endoscopy images. We initially detected the specular region of endoscopy images. The high saturation through a single channel of an endoscopy image can be considered as the color channel intensity drift [41]. The color balance ratio can be defined as follows:

(1)$$\begin{array}{c} {{r_{GE}} = \frac{{{P_{95}}({{c_G}} )}}{{{P_{95}}({{c_E}} )}},\; {r_{BE}} = \frac{{{P_{95}}({{c_B}} )}}{{{P_{95}}({{c_E}} )}}} \end{array}$$

where, ${P_{95}}(\cdot )$ is 95%, ${c_G}$ and ${c_B}$ for pixel component of blue and green channel. We set the grayscale intensity ${c_E}$ to: ${c_E} = 0.3\cdot {c_R} + 0.6\cdot {c_G} + 0.1\cdot {c_B}$, ${c_R}$ is the red channel component. Then, we determined the highlight region and saved it in a binary mask by analyzing the highlight region of the whole image. After that, the missing information in the specular region were completed by the image inpainting method [42].

We used the general linear model $q = aI + b$ in the guide filter to preserve the gradient of the texture-less endoscopy image. The pixel output of the guided image is represented as a weighted average adjacent pixel by local linear relation of the guided filtering [43]. The filtered result has a linear relationship with the guiding image in the filtering window. As shown in Eq. (2), I is the input image, p is the output image, a and b are the linear coefficients of pixel k in the corresponding window W. To solve the a and b, the minimize cost function is used:

(2)$$\begin{array}{c} {F({{a_k},{b_k}} )= \mathop \sum \limits_{i = {w_k}} ({{{({{a_k}{I_i} + {b_k} - {p_i}} )}^2} + \varepsilon a_k^2} )} \end{array}$$

For the regularization parameter $\varepsilon $, a smaller value will result in image sharpening and preservation of more gradients, but noise will be introduced. On the contrary, the image is smooth, and the gradient is reduced. Given that the degree of texture degradation shows greater difference between the images collected during surgery, we will flexibly use the regularization parameters according to the measure of texture-less endoscopy images. Therefore, we introduced a texture-less gradient retention factor $\Gamma (\cdot )$ to adjust $\varepsilon $ value adaptively. As shown in Eq. (3):

(3)$$\begin{array}{c} {\Gamma (\cdot )= \frac{1}{{|{{W_k}} |}}\mathop \sum \limits_{i = {W_k}} {{({{\nabla^2}{I_i} - \overline {{\nabla^2}{I_k}} } )}^2}} \end{array}$$

(4)$$\begin{array}{c} {F({{a_k},{b_k}} )= \mathop \sum \limits_{i = {w_k}} ({{{({{a_k}{I_i} + {b_k} - {p_i}} )}^2} + \Gamma (\cdot )\varepsilon a_k^2} )} \end{array}$$

(5)$$\begin{array}{c} {{a_k} = \frac{{\frac{1}{{|{{W_k}} |}}\mathop \sum \nolimits_{i = {W_k}} {{({{I_i}{p_i} - {\mu_k}{{\bar{p}}_k}} )}^2}}}{{\sigma _k^2 + \Gamma (\cdot )\varepsilon }}} \end{array}$$

(6)$$\begin{array}{c} {{b_k} = {{\bar{p}}_k} - {a_k}{\mu _k}} \end{array}$$

where $|{{W_k}} |$ is the number of pixels in the window, ${\nabla ^2}{I_i}$ is the second-order at the image pixel point ${I_i}$, and $\overline {{\nabla ^2}{I_k}} $ is the second-order differential mean of pixels in ${W_k}$. In Eq. (4), the gradient amplitude and average residual of each pixel in the window ${W_k}$ can reflect the difference between local pixels and avoid the introduction of significant differences between individuals. In Eqs. (5) and (6), ${\mu _k}$ and $\sigma _k^2$ are the sample mean and variance, respectively, and ${\bar{p}_k}$ is the mean of ${p_i}$ in the window ${W_k}$. We used a smaller $\varepsilon $ in images with a high measurement of texture-less to reduce the binding force on ${a_k}$ and obtain better gradient retention results. A larger $\varepsilon $ in the image with a lower measurement of texture-less to obtain a smaller ${a_k}$ and avoid introducing too much high-frequency noise.

Furthermore, we quantified the textured level of endoscopy images by the pixel spatial intensity similarity. The monochrome cold light source in the endoscope may decrease the color gradient of the surface of the cavity, which further weakens the texture of image measurement [44,45]. The gradient square function is constructed by any point $P(x\textrm{, } y )$ and neighbouring pixels in image I. The texture-less endoscopy image measurement $T(I )$ is defined as Eq. (7):

(7)$$\begin{array}{c} {T(I )= \frac{1}{N}\mathop \sum \limits_{(u\textrm{, } v )\le {W_{(\textrm{u}\textrm{, } v )}}} \left( {\frac{{{\partial^2}I({u\textrm{, }v} )}}{{{\partial^2}u}} + \frac{{{\partial^2}I({u\textrm{, }v} )}}{{{\partial^2}v}}} \right)} \end{array}$$

where N is the number of pixels in the window, ${W_{(\textrm{u}\textrm{, } v )}}$ is a window centred on pixel $(u\textrm{, } v )$, $I(u\textrm{, } v )$ is the greyscale value of a pixel in a greyscale image, and $\bar{I}(u\textrm{, } v )$ is the average value of pixels within the window range. We divided the texture-less endoscopy measurement into five levels from textured to texture-less. $T \in ({0.8 - 1} )$ represents strong texture, $T \in ({0.6 - 0.8} )$ represents strong texture, $T \in ({0.4 - 0.6} )$ represents medium texture, $T \in ({0.2 - 0.4} )$ represents texture-less, and $T \in ({0 - 0.2} )$ represents texture-less.

2.2 Superpixel space constraints

The texture-less endoscopy image can improve the acquisition of visual feature information by adaptive gradient preservation. For a pair of continuous frame texture-less endoscopy images ${I_1}$ and ${I_2}$, their corresponding sets of image features ${\mathrm{{\cal F}}_1}$ and ${\mathrm{{\cal F}}_2}$ are extracted using the ORB [26]. Then, a set of putative matches is determined by the similarity of the descriptor vector. We reduced the outliers in the putative match set by using the superpixel space constraint method. Assuming that the motion field of the two continuous frames satisfies the slow-and-smooth model, the two successive frames of images are divided into the same number of superpixel blocks $S_1^i$ and $S_2^i$ by the pixel spatial clustering method [46]. Our purpose is to locate the spatial position of each feature point into the corresponding superpixel block and obtain the feature point set within each superpixel block.

Thereafter, we estimated a robust energy function $\mathrm{{\rm E}}(\cdot )$ to distinguish the differences between superpixel blocks in an image. The visual motion satisfies the random event that the mapping can be considered as a probability distribution space. For two consecutive frames, we matched the superpixel blocks by comparing the similarity of information entropy of the superpixel blocks in two images. Thus, the spatial motion field of the superpixel block is estimated to reflect the coherent motion field of the actual feature points. For the superpixel block, we used RGB channel information to reflect the aggregation feature of image grey distribution and the effective evaluation of spatial characteristic entropy of grey distribution on each superpixel block, as shown in Eq. (8):

(8)$$\begin{array}{c} {H_{{S^i}}^c ={-} \mathop \sum \limits_m \mathop \sum \limits_n p(x )\log p(x )} \end{array}$$

(9)$$\begin{array}{c} {p(x )= \frac{{f({m,n} )}}{{{N_{{S^i}}}}},\; p(x )\in {S^i}} \end{array}$$

where H denotes the information entropy and $H_{{S^i}}^c$ is the information entropy extended to three color channels ($c = r,g,b$). We used $p(x )$ to represent the information contained in the aggregation feature of pixel distribution in the image. m represents the component of pixel c channel in the superpixel block ${S^i}$. n represents the mean value of the channel neighborhood component of this pixel. $f({m,n} )$ represents the frequency of the occurrence of two-tuple feature $({m,n} )$. ${N_{{S^i}}}$ represents the number of pixels of the superpixel block ${S^i}$.

(10)$$\begin{array}{c} {{\rm E}({{H_{{S^i}}}} )= \frac{{\sqrt {H{{_{{S^i}}^r}^2} + H{{_{{S^i}}^g}^2} + H{{_{{S^i}}^b}^2}} }}{{\min ({H_{{S^i}}^j} )}}} \end{array}$$

(11)$$\begin{array}{c} {\omega \le |{E({{H_{S_1^i}}} )- E({{H_{S_2^i}}} )} |} \end{array}$$

Given that the intensity deviation of endoscopy image color space leads to the uneven distribution characteristics of RGB channel information entropy, we calculated the distance between the superpixel blocks and maximized the difference of the energy function $\mathrm{{\rm E}}({{H_{{S^i}}}} )$ by using Eq. (10). $\min ({H_{{S^i}}^c} )$ is the minimum value of information entropy in three channels. We compared the similarity of the energy function of superpixel blocks between adjacent frames of images and defined the two superpixel block matches when the distance is less than the matching threshold $\omega $ as shown in Eq. (11).

By preserving the feature matching contained in the matched superpixel block $S_1^i \to S_2^i$ as a correct match, we obtained the initial matching set ${\mathrm{{\cal S}}_1} = \{{f({\mathrm{{\cal F}}_1^\mathrm{^{\prime}} \to \mathrm{{\cal F}}_2^\mathrm{^{\prime}}} )} \},{\mathrm{{\cal S}}_1} \subseteq {\mathrm{{\cal S}}_0}$ as shown in Fig. 2. The column (a) shows the consecutive frame of endoscopy images, the column (b) shows the results of superpixel segmentation, and the column (c) shows the mapping of information entropy distribution. In the mapping, different colors indicate the difference of information entropy. Superpixel blocks of the same color are regarded as matching. A feature matching will be regarded as an outlier when the superpixel block containing it has no matching relation.

Fig. 2. Superpixel Region Information Entropy Matching

Download Full Size | PDF

2.3 Vector field smoothing

The matching of information entropy describes the spatial motion relationship between consecutive frames, which is helpful to maintain the overall spatial connectivity of points in different regions during the matching process [47]. However, the difference of feature description vectors corresponding to local regional points is not obvious for the continuous frames of texture-less endoscopy images [48]. Therefore, we constructed a smooth feature vector field on the basis of the initial matching with local similarity constraints to further optimize the matching results of feature points.

In our problem, with the existence of outliers in an initial matching set ${\mathrm{{\cal S}}_1} = \{{f({\mathrm{{\cal F}}_1^\mathrm{^{\prime}} \to \mathrm{{\cal F}}_2^\mathrm{^{\prime}}} )} \}_i^{{N_1}}$ of ${N_1}$ feature matches, the neighborhood relationship between feature points is fixed. Therefore, the continuous frame endoscopy image pairs can be regarded as a simple rigid transformation to ensure accurate feature correspondence. Thus, an optimal solution of the transformation relation on the set of inliers ${\mathrm{{\cal S}}_2}$ can be obtained, as shown in Eq. (12):

(12)$$\begin{array}{c} {{S^\ast } = \arg minC({{\mathrm{{\cal S}}_2},{\mathrm{{\cal S}}_1},\lambda } )} \end{array}$$

(13)$$\begin{array}{c} {C({{\mathrm{{\cal S}}_2},{\mathrm{{\cal S}}_1},\lambda } )= \mathop \sum \limits_{i \in {\mathrm{{\cal S}}_2}} \mathop \sum \limits_{j \in {\mathrm{{\cal S}}_2}} {{({d({{x_i},{x_j}} )- d({{y_i},{y_j}} )} )}^2} + \; \lambda ({{N_1} - |{{\mathrm{{\cal S}}_2}} |} )} \end{array}$$

For the cost function C, $d(, )$ is the Euclidean distance, and $|\cdot |$ is the cardinality of the set. The first term describes the outlier distance, and the second term balances the outlier. $\lambda > 0$ denotes the parameter to balance the tradeoff between the two terms of the cost function. Our goal is to discuss the relatively complex non-rigid body transformation in the real case. Given that the local neighborhood structure between feature points remains unchanged for some outliers in the surrounding region, we can construct the cost function of Eq. (13) into:

(14)$$\begin{array}{c} {C({{\mathrm{{\cal S}}_2},{\mathrm{{\cal S}}_1},\lambda } )= \mathop \sum \limits_{i \in {\mathrm{{\cal S}}_2}} \left( {\mathop \sum \limits_{j\mathrm{\mid }{x_j} \in {N_{{x_i}}}} {{({d({{x_i},{x_j}} )- d({{y_i},{y_j}} )} )}^2} + \mathop \sum \limits_{j\mathrm{\mid }{y_j} \in {N_{{y_i}}}} {{({d({{x_i},{x_j}} )- d({{y_i},{y_j}} )} )}^2}} \right) + \lambda ({{N_1} - |{{\mathrm{{\cal S}}_2}} |} )} \end{array}$$

where ${N_{{x_i}}}$ and ${N_{{y_i}}}$ denote the neighborhood of the feature point ${x_i}$ and ${y_i}$. In general, the neighborhood between feature points is not defined. The nearest K feature points are searched as the feature neighborhood of this feature point. Then, a binary vector P is used to combine the correct matching in the initial matching set. To address the problem of outlier removal, we searched for the P with the minimum cost function.

(15)$$\begin{array}{c} {C({P,{\mathrm{{\cal S}}_1},\lambda } )= \mathop \sum \limits_{i = 1}^{{N_1}} {P_i}({{C_i} - \lambda } )+ \lambda {N_1}} \end{array}$$

(16)$$\begin{array}{c} {{P_i} = \left\{ {\begin{array}{c} {1\; {C_i} \le \lambda }\\ {0\; {C_i} > \lambda } \end{array}} \right.\; ,i = 1, \ldots ,{N_1}} \end{array}$$

In the outliers, the neighborhood structure cannot be consistent between the two images as a large cost ${C_i}$. On the other hand, in the inliers, the neighborhood contains a small number of outliers, and the main component is inliers. Then, the cost ${C_i}$ is still going to be small. Finally, we obtained the final inlier set ${S^{\ast }} = \{{i,{P_i} = 1,i = 1, \ldots ,{N_1}} \}$.

3. Experiments

3.1 Datasets and implementation details

In this study, the feature matching methods that we evaluated in public and undisclosed datasets are described in Table 1. The public dataset includes 10 binocular endoscopy images sequences from Hamlyn [49] dataset, and 7 sets of animal laparoscopy monocular image data from the MICCAI 2019 challenge [50]. The undisclosed dataset includes 6 groups of monocular rhinoscopic images, 4 groups of monocular animal rhinoscopic images, 4 groups of monocular pyeloscope images, and 4 groups of medical model images under different preset lighting conditions that we collected in the clinical experiment. These images include actual surgical environments such as pathological tissue, specular reflection, motion blurring, surgical instruments and texture-less images. In the experimental verification, the binocular endoscopy image data have left and right view transformation matrices. The MICCAI 2019 challenge data have camera position information, which helps us to obtain the rotation translation matrix of monocular endoscopy images. We can obtain the matching of corresponding pixel points, which serves as the ground truth of matching points for a group of images by the transformation matrix. In the evaluation of the experiment, the public dataset was primarily used to present quantitative estimates. The undisclosed dataset was primarily used to present qualitative estimates because the continuous frame image extracted from recorded video lacks the corresponding ground truth information. The reliability of our matches was also evaluated by importing them to a SfM system. The evaluations were conducted using an Intel i7-8700 computer with 16 GB RAM and NVIDIA GTX 1060 graphics card.

Table 1. Dataset details of texture-less medical image

View Table | View all tables in this article

As the visual features of texture-less endoscopy images are scarce, the generally used feature detection methods can only detect few feature points in these images with the original image. We used the adaptive gradient-preserving method, which can effectively increase the number of feature points in the texture-less region. Then, the ORB algorithm can efficiently detect sufficient feature points, and it computes the potential matches by nearest-neighbor matching. The commonly used feature matching acceptance threshold is set to a small fixed value. In this study, the total pixels of images in different types of datasets are inconsistent. We set the number parameter of feature extraction as a certain proportion of the number of pixel points in the image, that is, $0.05\ast \mathrm{N}$. The experiments show that this value can guarantee the extraction of sufficient feature points. Furthermore, the commonly used feature matching acceptant threshold is set to $\mathrm{\gamma } = 0.4\sim 0.6$; that is, a match is accepted if its nearest neighbour difference is less than a certain threshold $\mathrm{\gamma }$. However, for texture-less endoscopy images, this strict threshold will make the visual feature incredibly rare and consequently weaken the matching performance. In this study, in order to find more putative matches, we can calculate the putative matches with a relaxed threshold $\mathrm{\gamma } = 0.8$ to obtain more substantial true matches.

3.2 Results of adaptive gradient-preserving

The image richness of texture analysis using the texture-less measurement (TLM). Experiments were conducted on endoscopy images for 9 sets of public data and 11 sets of undisclosed data. A total of 600 endoscopy images were selected randomly and divided into 60 groups of 10 each. Then, the number of feature points in each group of images was averaged. Additionally, the feature density (we counted the number of feature points on the ${p^2} = 100{\; }pixels{\ast }100{\; }pixels$ region) was used to measure the extraction effect of feature points as the image resolution varies considerably in different sets.

As shown in Fig. 3, the horizontal axis represents the TLM, whilst the vertical axis represents the feature density. We used heat maps from blue to red to represent endoscopy image textures from strong to weak. In the figure, each dot represents the average value of a group of images. The blue dots represent dense texture features and rich visual feature points in the image presentation, whilst red dots represent unobvious texture features in the image presentation, rare visual feature points, and small feature point density. The experiment shows that the density of feature points decreases with a lower TLM value. The insufficient number of image data matches represented in the red dot region will result in missing most of the surface in the reconstruction.

Fig. 3. Correspondence between texture-less measurement and feature density.

Download Full Size | PDF

We evaluated the texture enhancement capability of the adaptive gradient-preserving method in two typical surgical environment datasets: Fig. 4(a) is normal surgical environment and Fig. 4(b) is solution filled environment. Figure 4(1) is the original image and the corresponding TLM map; in TLM map the white region indicates that the image has obvious gradient information, while the black region is smooth and texture-less; the T is the TLM value of the image. Figure 4(2) shows the image by the guided filtering and Fig. 4(3) shows the image by our method. The results of the experiment indicate that the original image visual feature information is blurred, and the TLM value is extremely small, some textures in the image can be enhanced by the guided filtering. Our method solves the problem of image degradation caused by internal wet environment and can produce robust enhancement effect on various texture-less images, The TLM value increases significantly.

Fig. 4. Texture-less measurement and texture details.

Download Full Size | PDF

In addition to visual verification, we also compared the feature point extraction ability of the original image with that of the adaptive gradient-preserving image on different texture-less endoscopy datasets. We validated six types of endoscopy datasets, including rhinoscopy, animal rhinoscopy, gastroscopy, laparoscopy, colonoscopy and intestinal model images. Feature point density was used instead of the total number of extracted images due to the large difference in image size between different datasets. The TLM is proposed in Section 2.1 as a reference for texture richness. As shown in Table 2, it is difficult to extract feature points from the original image. The number of extractions by the three methods is significantly increased by the adaptive gradient preservation. SIFT and BRISK extract fewer feature points than the ORB algorithm. The TLM index is positively correlated with the number of feature points extracted from the image. As can be seen from the results, our method obtained excellent results in all types of endoscopy images.

Table 2. Comparison of texture-less measurement on different datasets

View Table | View all tables in this article

3.3 Evaluation on matching performance

The visual feature points of texture-less endoscopy images are sparse and unevenly distributed. In clinical surgery, camera movement and tissue deformation will reduce the number of matches and affect the accuracy of matches. Therefore, we simulated the image scene transformation that might be encountered in clinical surgery to verify the robustness of the proposed method. As shown in Fig. 5, animal rhinoscopy images collected clinically were used for simulation. We used the adaptive gradient-preserved image as the target image (Fig. 5(a1)). Specifically, four kinds of transformation are included: scale changing (Fig. 5(a2)), where the scaling transform coefficient is 0.4; rotation for 45° (Fig. 5(b2)); random non-rigid (Fig. 5(c2)); and random affine transformation (Fig. 5(d2)). Each kind of transformation is simulated with 10 images from 4 clinical datasets. In total, 160 samples of affine transformation are simulated.

Fig. 5. Evaluation results on clinical endoscopy.

Download Full Size | PDF

In addition, we evaluated the matching ability of the proposed method for the simulated dataset using different matching methods. The ground truth for the exact pixel corresponding to each matching pair is obtained from a known transformation of the simulated dataset. Figure 5(3) demonstrates the matching results on the paired images. The motion vector of matches obtained by connecting the matched feature points on the deformed image is shown in Fig. 5(4). We set the threshold of pixel distance as 4 to distinguish true and false matches. As shown in Table 3, in addition to the number of matches and times obtained from algorithms, we computed the recall, accuracy, precision and F1-score for a general evaluation of their performance. The statistical data of the six different feature matching methods were used for comparison. The motion consensus and global bilateral regression (MCGB) [17] was our previous work. As can be seen from the table, the brute force (BF) and random sample consensus (RANSAC) [40] algorithm has absolute advantages in the number of matches; however, it exhibits a long matching time and low precision. The GMS, vector field consensus (VFC) [17] algorithm has a short matching time and the locality preserving matching (LPM) [39] algorithm has the highest recall. Our algorithm achieves the highest accuracy, precision and F1-score, which are 92.6%, 89.2% and 91.5%, respectively. The comprehensive analysis showed that our algorithm has good matching ability in simulated datasets.

Table 3. Simulation datasets transformation of different feature matching methods

View Table | View all tables in this article

Compared with textured endoscopy images, reliable matching is difficult to find in texture-less endoscopy images during camera movement. Therefore, we evaluated the improvement of feature matching performance of continuous frame endoscopy images by using the adaptive gradient-preserving method from multiple endoscopy datasets. As in Fig. 6. We demonstrate one-tenth of the matching results on the paired images. The column (1) is the original image pair, the column (2) is the matching result with the ORB algorithm, and the column (3) is the matching result presented in this paper. As shown in the public datasets of the row (a), the ground truth comes from binocular endoscopy image pixel matching relationships and camera pose. The green connectors represent positive group matches, whilst the red connectors represent false matches. As shown in the undisclosed datasets of the row (b), since there is no ground truth, we use different colors to represent, the yellow connectors represent the matching results of the ORB algorithm, whilst the blue connectors represent the matching results retained by our algorithm. From the experimental results, the adaptive gradient-preserved image increased the number of matches but also increased the number of outliers, especially in the texture-less region. The method proposed in this study can effectively eliminate outliers. The obtained feature matching results are smooth and reliable and show strong robustness for regions with similar structures and pixel intensity.

Fig. 6. Continuous frame image feature matching performance.

Download Full Size | PDF

At the same time, we selected the public dataset in Fig. 6 to compare the feature matching performance using other matching methods in the experiment: BF, GMS, RANSAC, LPM, VFC and MCGB. The evaluation indices are divided into four groups: accuracy, F1-score, precision and recall. As shown in Fig. 7, the dash is the maximum and minimum value of each method in the middle of the box chart, the small circular represents the mean value of the data, and the straight line represents the median. In the accuracy index, the BF algorithm has the optimal value, but the GMS and VFC algorithm has a higher mean value. Our algorithm can achieve a mean value comparable to the GMS algorithm, but more concentrated. In the Precision index, the RANSAC algorithm has a large range of results, but the LPM and MCGB algorithm achieves a better optimal value. However, our algorithm has a higher mean value. In the precision and recall indices, the four algorithms can obtain relatively concentrated results. Our algorithm is excellent in the maximum value and average value. The results also imply that our algorithm can achieve excellent performance in all four indices. Of note, our algorithm is adaptable to different scenes and robust to outliers.

Fig. 7. Comparison of different matching methods respect to different index. Each group shows the boxplot of seven different methods.

Download Full Size | PDF

The evaluation index proposed in [51] was used to measure the feature matching performance. The matching score $MS = \frac{{\# inlier{\; }matches}}{{\# features}}$ is calculated as the ratio of the number of correct matches to the number of original features. The matching score is limited by the measurement of texture-less image and matching criteria. The texture-less images or the restriction of matching methods will discard potentially effective matching and reduced matching scores. As shown in Table 4, four public datasets from gastroscopy, colonoscopy, laparoscopy and thoracoscopy were used to evaluate the matching score of the original image and the adaptive gradient-preserved image. As can be seen from the table, the matching scores of the adaptive gradient-preserved image were significantly higher than those of the original images. The GMS method obtained the best results in the endoscopy images and colonoscopy dataset. However, our method obtained the best matching results in most datasets.

Table 4. Public datasets matching performance in MS

View Table | View all tables in this article

3.4 Verification on 3D reconstruction

Doctors change endoscopic with different diameters according to different operating conditions, image resolution cannot be unified. We use two types of commonly used resolution images to verify the time efficiency of the method presented in this study. As shown in Fig. 8, The solid curve represents 380*240 resolution and the hollow curve represents 1280*1024 resolution. To ensure the traceability of image content, 20 pairs of continuous frames were selected from two datasets for feature matching. In the time efficiency comparison of the two resolutions, the matching time of the SIFT algorithms is an order of magnitude higher than that of GMS and our algorithm. In the initial phase of continuous matching, there have little difference in operation time between our method and GMS. However, when the number of images was increased, the GMS algorithm exhibited an advantage in matching time.

Fig. 8. Comparison of computation efficiency of endoscope images with different resolutions.

Download Full Size | PDF

To evaluate the undisclosed dataset matching, we used the reprojection error to indicate the matching quality. The reprojection error is a geometric error corresponding to the image distance between projected and measured points. The error is obtained by comparing the pixel coordinates (observed projection position) with the projection of a 3D point. Thus, the reprojection error can be used to evaluate the matching of clinical datasets that have no ground truth. Three types of matching methods, including BF, GMS and MCGB, were used. As shown in Table 5, the lower resolution of gastroscopy images makes the reprojection error slightly different in the four methods, whilst the higher resolution of laparoscopy and rhinoscopy images makes the reprojection error change dramatically. The MCGB method obtained the best results in the laparoscopic dataset, and the proposed method in this study obtained the best results in the other datasets. The results showed that the reprojection errors of our method show good accuracy in the three datasets, which verifies that our matching method can obtain accurate matching correspondence.

Table 5. Reprojection errors (pixel) of different methods on undisclosed datasets

View Table | View all tables in this article

To validate the effectiveness of the feature matching of our method, we input the final feature matching results into an end-to-end visual SfM. Figure 9(1) shows the laparoscopy dataset, Fig. 9(2) shows the other laparoscopy datasets, Fig. 9(3) shows the gastroscopy dataset, Fig. 9(4) shows the pyeloscopy dataset, and Fig. 9(5) shows the colonoscopy dataset. In the experiment, Fig. 9(a) is the original image of each dataset; Fig. 9(b) shows the point cloud structure preservation. Thereafter, the feature points were detected with the ORB algorithm, and the outliers were eliminated using the superpixel space constraints, which can effectively identify reliable matches from the original feature matching set with high noise and make up for the structural deficiency caused by the lack of image information. Finally, we transformed the motion information of feature points into motion constraint conditions. This process can further filter out the assumed obtained. Then, the surface structure is generated by Poisson surface reconstruction, as shown in Fig. 9(c); Fig. 9(d) is the texture map (different views of the same result are presented) to obtain the final 3D reconstruction result. The experiments demonstrate that the obtained feature matches are reliable and robust on the texture-less endoscopy image and that our method can construct high-quality 3D tissue structures.

Fig. 9. 3D reconstruction process.

Download Full Size | PDF

We selected the rhinoscopy data, animal rhinoscopy data and intestinal model data for 3D reconstruction to verify the adaptability of the matching results of our proposed method in various continuous frame image data for 3D reconstruction. As shown in Table 6, we compared the specific 3D reconstruction information of the original image and the adaptive gradient-preserved image (bold is the result of the adaptive gradient-preserved image). The statistics include the number of vertices, the number of mesh and the reconstructed surface integrity (surface integrity is the visual ratio of the reconstructed 3D surface to the actual size in the image). The data showed that the original texture-less endoscopy image lacks feature matching. In the process of 3D reconstruction, the number of spatial projection vertices and the number of triangulated facets are insufficient, which leads to the failure of surface restoration of the scene. The effective reconstruction region of the surface in the three types of scenes is not more than 30%. However, the algorithm proposed in this study can extract more visual feature points from the region of the same texture-less image, improve the accuracy of feature matching, increase the number of spatial projection vertices and the number of triangulated facets, and thus reconstruct more surface information. The effective reconstruction region of the surface in the three types of scenes is more than 50%. Moreover, in the narrow scenes (such as the intestinal model), part of the internal environment surface is lost by the limitation of camera angle and light source.

Table 6. Comparison of different datasets in 3D reconstruction

View Table | View all tables in this article

4. Conclusion and discussion

In this study, we proposed a novel neighborhood geometric consistency-based method for feature matching of texture-less endoscopy images. On the basis of the characteristics of the texture-less gradient of the image itself, we defined a quantitative standard of texture-less measure for endoscopy images. The versatility of texture-less images in feature extraction was dramatically improved by the adaptive gradient preservation. Thereafter, the feature points were detected with the ORB algorithm, and the outliers were eliminated using the superpixel space constraints, which can effectively identify reliable matches from the original feature matching set with high noise and make up for the structural deficiency caused by the lack of image information. Finally, we transformed the motion information of feature points into motion constraint conditions. This process can further filter out the assumed matching to improve the accuracy of feature matching and perfect the visualization effect of 3D reconstruction.

We evaluated the TLM of public and undisclosed datasets. In addition, we compared the number of feature points extracted and the TLM from the original image and the adaptive gradient-preserved image. The experiments demonstrated that our proposed method can effectively retain the image texture information and greatly increase the number of feature points extracted. Thereafter, we evaluated the proposed method with four other image transformations, namely, scaling, rotating, non-rigid and affine transformation. The results demonstrated that our method outperforms other methods in terms of precision and F1-score. The reprojection error of our method maintained good accuracy on three types of endoscopy datasets. Finally, our method showed superior performance compared with other recent methods. Therefore, this method has great potential for feature matching and 3D reconstruction in texture-less endoscopy image data. In future research, this algorithm should overcome the low matching accuracy of continuous frame endoscopy images due to the single angle of view of the image and large depth range of the scene.

Funding

National Natural Science Foundation of China (61901031, 61971040, 62025104, 62071048); Beijing Nova Program (Z201100006820004); Beijing Institute of Technology Research Fund Program for Young Scholars (2020CX04075).

Disclosures

The authors declare that there are no conflicts of interest related to this article.

Data availability

Data underlying the results presented in this paper are available in Ref. [49] and [50]. The other dataset may be restricted for privacy reasons.

References

1. J. Song, J. Wang, L. Zhao, S. Huang, and G. Dissanayake, “Mis-SLAM: Real-time large-scale dense deformable slam system in minimal invasive surgery based on heterogeneous computing,” IEEE Robot. Autom. Lett. 3(4), 4068–4075 (2018). [CrossRef]

2. A. Marmol, A. Banach, and T. Peynot, “Dense-ArthroSLAM: dense intra-articular 3-D reconstruction with robust localization prior for arthroscopy,” IEEE Robot. Autom. Lett. 4(2), 918–925 (2019). [CrossRef]

3. C. Sui, J. Wu, Z. Wang, G. Ma, and Y.-H. Liu, “A real-time 3D laparoscopic imaging system: design, method, and validation,” IEEE Trans. Biomed. Eng. 67(9), 2683–2695 (2020). [CrossRef]

4. C. Shi, X. Luo, P. Qi, T. Li, S. Song, Z. Najdovski, T. Fukuda, and H. Ren, “Shape sensing techniques for continuum robots in minimally invasive surgery: A survey,” IEEE Trans. Biomed. Eng. 64(8), 1665–1678 (2017). [CrossRef]

5. S. Bernhardt, S. A. Nicolau, L. Soler, and C. Doignon, “The status of augmented reality in laparoscopic surgery as of 2016,” Med. Image Anal. 37, 66–90 (2017). [CrossRef]

6. L. Qian, J. Y. Wu, S. P. DiMaio, N. Navab, and P. Kazanzides, “A review of augmented reality in robotic-assisted surgery,” IEEE Trans. Med. Robot. Bionics 2(1), 1–16 (2020). [CrossRef]

7. Y. Chu, X. Yang, H. Li, D. Ai, Y. Ding, J. Fan, H. Song, and J. Yang, “Multi-level feature aggregation network for instrument identification of endoscopic images,” Phys. Med. Biol. 65(16), 165004 (2020). [CrossRef]

8. J. Li, P. Duan, and J. Wang, “Binocular stereo vision calibration experiment based on essential matrix,” in 2015 IEEE International Conference on Computer and Communications (ICCC) (IEEE, 2015), 250–254.

9. L. Maier-Hein, A. Groch, A. Bartoli, S. Bodenstedt, G. Boissonnat, P.-L. Chang, N. Clancy, D. S. Elson, S. Haase, and E. Heim, “Comparative validation of single-shot optical techniques for laparoscopic 3-D surface reconstruction,” IEEE Trans. Med. Imaging 33(10), 1913–1930 (2014). [CrossRef]

10. G. Lu, L. Nie, S. Sorensen, and C. Kambhamettu, “Large-scale tracking for images with few textures,” IEEE Trans. Multimedia 19(9), 2117–2128 (2017). [CrossRef]

11. A. Ballabeni, F. I. Apollonio, M. Gaiani, and F. Remondino, “Advances in image pre-processing to improve automated 3D reconstruction,” International Archives of the Photogrammetry, Remote Sensing Spatial Information Sciences XL-5/W4, 315–323 (2015). [CrossRef]

12. A. Ley, R. Hänsch, and O. Hellwich, “Reconstructing white walls: multi-view, multi-shot 3Dreconstruction of textureless surfaces,” International Archives of the Photogrammetry, Remote Sensing Spatial Information Sciences 3, 91–98 (2016). [CrossRef]

13. N. H. Aldeeb and O. Hellwich, “Reconstructing textureless objects-image enhancement for 3D reconstruction of weakly-textured surfaces,” in VISIGRAPP (5: VISAPP), (2018), pp. 572–580.

14. A. Ballabeni and M. Gaiani, “Intensity histogram equalisation, a colour-to-grey conversion strategy improving photogrammetric reconstruction of urban architectural heritage,” J. Int. Colour Assoc. 16, 2–23 (2016).

15. K. L. Lurie, A. Roland, D. V. Zlatev, J. C. Liao, and B. Ellerbee, “3D reconstruction of cystoscopy videos for comprehensive bladder records,” Biomed. Opt. Express 8(4), 2106 (2017). [CrossRef]

16. Y. Bo, L. Chao, W. Zheng, L. Shan, and K. Huang, “Reconstructing a 3D heart surface with stereo-endoscope by learning eigen-shapes,” Biomed. Opt. Express 9(12), 6222–6236 (2018). [CrossRef]

17. Y. Chu, H. Li, X. Li, Y. Ding, X. Yang, D. Ai, X. Chen, Y. Wang, and J. Yang, “Endoscopic image feature matching via motion consensus and global bilateral regression,” Comput. Methods Programs in Biomed. 190, 105370 (2020). [CrossRef]

18. M. J. Islam, Y. Xia, and J. Sattar, “Fast underwater image enhancement for improved visual perception,” IEEE Robot. Autom. Lett. 5(2), 3227–3234 (2020). [CrossRef]

19. M. E. Deetjen and D. Lentink, “Automated calibration of multi-camera-projector structured light systems for volumetric high-speed 3D surface reconstructions,” Opt. Express 26(25), 33278–33304 (2018). [CrossRef]

20. F. Orujov, R. Maskeliūnas, R. Damaševičius, and W. Wei, “Fuzzy based image edge detection algorithm for blood vessel detection in retinal images,” Appl. Soft Computing 94, 106452 (2020). [CrossRef]

21. S. Dawn, S. Tulsyan, S. Bhattarai, S. Gopal, and V. Saxena, “An efficient approach to image indexing and retrieval using Haar cascade and perceptual similarity index,” in 2020 6th International Conference on Signal Processing and Communication (ICSC) (IEEE, 2020), pp. 108–113.

22. D. Singh, V. Kumar, and M. Kaur, “Single image dehazing using gradient channel prior,” Appl. Intelligence 49(12), 4276–4293 (2019). [CrossRef]

23. E. Pelanis, A. Teatini, B. Eigl, A. Regensburger, A. Alzaga, R. P. Kumar, T. Rudolph, D. L. Aghayan, C. Riediger, and N. Kvarnstr, “Evaluation of a novel navigation platform for laparoscopic liver surgery with organ deformation compensation using injected fiducials,” Med. Image Anal. 69, 101946 (2021). [CrossRef]

24. O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, and M. Bernstein, “Imagenet large scale visual recognition challenge,” Int. J. Comput. Vis. 115(3), 211–252 (2015). [CrossRef]

25. R. B. Shi, S. Mirza, D. Martinez, C. Douglas, J. Cho, J. C. Irish, D. A. Jaffray, and R. A. Weersink, “Cost-function testing methodology for image-based registration of endoscopy to CT images in the head and neck,” Phys. Med. Biol. 65(20), 205011 (2020). [CrossRef]

26. R. Mur-Artal and J. D. Tardós, “Orb-slam2: an open-source slam system for monocular, stereo, and rgb-d cameras,” IEEE Trans. Robot. 33(5), 1255–1262 (2017). [CrossRef]

27. M. Yang, D. He, M. Fan, B. Shi, X. Xue, F. Li, E. Ding, and J. Huang, “Dolg: Single-stage image retrieval with deep orthogonal fusion of local and global features,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, (2021), 11772–11781.

28. J. Ma, X. Jiang, A. Fan, J. Jiang, and J. Yan, “Image matching from handcrafted to deep features: a survey,” Int. J. Comput. Vis. 129(1), 23–79 (2021). [CrossRef]

29. C. Campos, R. Elvira, J. J. G. Rodríguez, J. M. Montiel, and J. D. Tardós, “ORB-SLAM3: an accurate open-source library for visual, visual–inertial, and multimap SLAM,” IEEE Transactions on Robotics 37, 1874–1890 (2021). [CrossRef]

30. R. Hoffmann, M. Kaar, A. Bathia, A. Bathia, A. Lampret, W. Birkfellner, J. Hummel, and M. Figl, “A navigation system for flexible endoscopes using abdominal 3D ultrasound,” Phys. Med. Biol. 59(18), 5545–5558 (2014). [CrossRef]

31. A. Khare, B. R. Mounika, and M. Khare, “Keyframe extraction using binary robust invariant scalable keypoint features,” in Twelfth International Conference on Machine Vision (ICMV 2019) (International Society for Optics and Photonics, 2020), 1143308.

32. D. Schlegel, G. Grisetti, and A. Letters, “HBST: A hamming distance embedding binary search tree for feature-based visual place recognition,” IEEE Robot. Autom. Lett. 3(4), 3741–3748 (2018). [CrossRef]

33. J. Song, H. T. Shen, J. Wang, Z. Huang, N. Sebe, and J. Wang, “A distance-computation-free search scheme for binary code databases,” IEEE Trans. Multimedia 18(3), 484–495 (2016). [CrossRef]

34. R. Mur-Artal, J. M. M. Montiel, and J. D. Tardos, “ORB-SLAM: a versatile and accurate monocular SLAM system,” IEEE Trans. Robot. 31(5), 1147–1163 (2015). [CrossRef]

35. S. Lowry, N. Sünderhauf, P. Newman, J. J. Leonard, D. Cox, P. Corke, and M. Milford, “Visual place recognition: a survey,” IEEE Trans. Robot. 32(1), 1–19 (2016). [CrossRef]

36. J. Ma, J. Zhao, J. Tian, A. L. Yuille, and Z. Tu, “Robust point matching via vector field consensus,” IEEE Trans. on Image Process. 23(4), 1706–1721 (2014). [CrossRef]

37. W. Y. Lin, F. Wang, M. M. Cheng, S. K. Yeung, P. H. S. Torr, M. N. Do, and J. Lu, “CODE: coherence based decision boundaries for feature correspondence,” IEEE Trans. Pattern Anal. Mach. Intell. 40(1), 34–47 (2018). [CrossRef]

38. M. Allan, S. Ourselin, D. J. Hawkes, J. D. Kelly, and D. Stoyanov, “3-D pose estimation of articulated instruments in robotic minimally invasive surgery,” IEEE Trans. Med. Imaging 37(5), 1204–1213 (2018). [CrossRef]

39. X. Li, D. Ai, Y. Chu, J. Fan, H. Song, Y. Gu, and J. Yang, “Locality preserving based motion consensus for endoscopic image feature matching,” in Proceedings of the 2020 4th International Conference on Digital Signal Processing (2020), pp. 117–121.

40. J. Bian, W.-Y. Lin, Y. Matsushita, S.-K. Yeung, T.-D. Nguyen, and M.-M. Cheng, “GMS: Grid-based motion statistics for fast, ultra-robust feature correspondence,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017), pp. 4181–4190.

41. S. M. Alsaleh, A. I. Aviles, P. Sobrevilla, A. Casals, and J. K. Hahn, “Adaptive segmentation and mask-specific Sobolev inpainting of specular highlights for endoscopic images,” in 2016 38th Annual International Conference of the IEEE Engineering In Medicine and Biology Society (EMBC) (IEEE, 2016), pp. 1196–1199.

42. I. M. Artinescu and C. R. Boldea, “An image inpainting technique based on parallel projection methods,” in 2018 20th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing (SYNASC) (IEEE, 2018), pp. 95–98.

43. K. He, J. Sun, and X. Tang, “Guided image filtering,” IEEE Trans. Pattern Anal. Mach. Intell. 35(6), 1397–1409 (2013). [CrossRef]

44. Y. Cong, D. Tian, Y. Feng, B. Fan, and H. Yu, “Speedup 3-D texture-less object recognition against self-occlusion for intelligent manufacturing,” IEEE Trans. Cybern. 49(11), 3887–3897 (2019). [CrossRef]

45. Y. Chu, J. Yang, S. Ma, D. Ai, W. Li, H. Song, L. Li, D. Chen, L. Chen, and Y. Wang, “Registration and fusion quantification of augmented reality based nasal endoscopic surgery,” Med. Image Anal. 42, 241–256 (2017). [CrossRef]

46. S. Crommelinck, R. Bennett, M. Gerke, M. Koeva, M. Yang, and G. Vosselman, “SLIC superpixels for object delineation from UAV data,” ISPRS Ann. Photogramm. Remote Sens. Spatial Inf. Sci. IV-2/W3, 9–16 (2017). [CrossRef]

47. F. Shao, Z. Liu, and J. An, “Feature matching based on minimum relative motion entropy for image registration,” IEEE Transactions on Geoscience Remote Sensing 60, 5603712 (2021). [CrossRef]

48. F. Shao, Z. Liu, and J. An, “A discriminative point matching algorithm based on local structure consensus constraint,” IEEE Geosci. Remote Sensing Lett. 18(8), 1366–1370 (2021). [CrossRef]

49. D. Hong, W. Tavanapong, J. Wong, J. Oh, and P. C. De Groen, “3D reconstruction of colon segments from colonoscopy images,” in 2009 Ninth IEEE International Conference on Bioinformatics and BioEngineering (IEEE, 2009), 53–60.

50. Stereo Correspondence and Reconstruction of Endoscopic Data Sub-Challenge, MICCAI, 2019. <https://endovissub2019-scared.grand-challenge.org/>.

51. J. Heinly, E. Dunn, and J.-M. Frahm, “Comparative evaluation of binary features,” in European Conference on Computer Vision, (Springer, 2012), pp. 759–773.

Dataset	Category	Image Pairs	Ground truth	Describe
Hamlyn [49]	Gastroscopy	954	Transformation matrix	Contains 10 binocular endoscopy images sequences.
Hamlyn [49]	Colonoscopy etc.	954	Transformation matrix	Contains 10 binocular endoscopy images sequences.
MIC [50]	Laparoscope	350	Camera pose	Contains 7 different scenario datasets
Undisclosed dataset by our clinical test	Rhinoscopy	1500	–	The image we collected from clinical trials and the medical model, contains 14 Monocular endoscope images sequences.
	Rhinoscopy (dog)
	Pyeloscopy

Dataset	Original				Gradient-preserving
Dataset	TLM	SIFT	BRISK	ORB	TLM	SIFT	BRISK	ORB
Rhinoscopy	0.09	18	9	32	0.51	852	667	1367
Rhinoscopy (animal)	0.05	7	5	10	0.47	1409	996	1982
Gastroscopy	0.23	224	140	299	0.58	2752	1897	3489
Laparoscopy	0.15	83	54	116	0.69	2028	1773	3628
Colonoscopy	0.32	225	166	330	0.62	1803	1019	2159
Intestinal Model	0.04	13	13	15	0.32	178	165	227

Algorithm	Matches	Time (s)	Recall	Accuracy	Precision	F1-Score
BF	5360	15.44	0.392	0.432	0.466	0.442
GMS	2749	1.705	0.899	0.885	0.876	0.883
RANSAC	4896	19.465	0.248	0.366	0.375	0.351
LPM	2887	2.775	0.908	0.892	0.838	0.887
VFC	2776	2.036	0.892	0.905	0.829	0.895
MCGB	2050	6.334	0.870	0.902	0.861	0.904
Ours	2554	7.347	0.887	0.926	0.892	0.915

Dataset	Original (%)			Gradient-preserving (%)
Dataset	ORB	GMS	Our	ORB	GMS	Ours
Gastroscopy	38.77	89.13	88.06	48.04	90.17	92.08
Colonoscopy	44.15	87.63	88.04	54.09	90.92	90.87
Laparoscopy	40.87	89.79	90.06	51.48	91.89	93.47
Thoracoscopy	41.27	90.41	92.88	58.12	88.74	91.02

Method	Gastroscopy	Laparoscopy	Rhinoscopy	Mean
BF	6.8 ± 1.5	36.8 ± 25.2	15.4 ± 10.5	21.6
GMS	5.6 ± 1.7	28.1 ± 15.2	10.8 ± 9.3	16.8
MCGB	5.3 ± 2.2	3.7 ± 1.9	5.9 ± 5.1	6.8
Ours	5.1 ± 1.8	4.3 ± 1.4	2.5 ± 1.2	3.6

Feature matching for texture-less endoscopy images via superpixel vector field consistency

Abstract

1. Introduction

2. Methods

2.1 Texture-less image adaptive gradient-preserving

2.2 Superpixel space constraints

2.3 Vector field smoothing

3. Experiments

3.1 Datasets and implementation details

3.2 Results of adaptive gradient-preserving

3.3 Evaluation on matching performance

3.4 Verification on 3D reconstruction

4. Conclusion and discussion

Funding

Disclosures

Data availability

References

Data availability

Cited By

Figures (9)

Tables (6)

Equations (16)

Biomedical Optics Express

Type	Vertices number		Mesh number		Effective reconstruction region (%)
Type	Origin	Ours	Origin	Ours	Origin	Ours
Rhinoscopy	1154	19062	2217	37837	30%	85%
Rhinoscopy (dog)	729	9383	1447	18394	20%	70%
Intestinal model	131	2532	416	5002	15%	55%