Efficient relative pose estimation with an RGB-D camera in indoor environment
Ang Sha, Yaqing Ding
Proceedings Volume 12506, Third International Conference on Computer Science and Communication Technology (ICCSCT 2022); 125063G (28 December 2022) https://doi.org/10.1117/12.2661835
Event: International Conference on Computer Science and Communication Technology (ICCSCT 2022), 2022, Beijing, China
Abstract
In this paper, we propose an efficient framework for estimating the motion of a robot navigating indoors. In particular, we focus on the case where the robot is equipped with an RGB-D camera and the environment contains many planar structures such as the floor, ceiling and walls. The novelties of our method are threefold. First, we propose an efficient normal vector extraction method that fully exploits these planar structures: without fitting planes, we only need the inverse-depth induced histograms to obtain the dominant planar structures and their normals. Second, since the robot roams indoors, we assume that the environment satisfies the weak Manhattan world constraint, i.e., the floor and ceiling are perpendicular to the walls, which holds in most real indoor scenes. Based on this assumption, we calculate the relative camera motion by aligning the current local frame with the world coordinate frame. Third, we present real-world RGB-D datasets that satisfy the weak Manhattan world constraint, comprising 5,737 images, as a contribution to the community. Extensive experimental results show very promising performance of our method in terms of accuracy, robustness and efficiency, especially in large-scale low-textured scenes.

1. INTRODUCTION

Relative pose estimation is a popular topic in computer vision and robotics. It typically uses environmental information captured by a camera to compute a 6-degree-of-freedom pose estimate. Recently, many camera motion estimation methods have been proposed using different kinds of sensors, such as monocular cameras, stereo cameras, LiDAR and RGB-D cameras, to meet various application scenarios. Among them, the RGB-D camera has attracted particular attention from researchers since it provides both depth and visual information.

2. RELATED WORK

Visual Simultaneous Localization and Mapping (SLAM) and relative pose estimation are tightly related topics; both recover 3D information from visual sensor data, usually in real time. Feature-based methods [1-4] generally estimate camera poses by tracking keypoints and minimizing the geometric projection error. Among them, ORB-SLAM [4] is a popular point-based monocular SLAM system. However, in low-textured environments, where feature points are scarce, such point-based methods may fail.

To address these problems, sensor fusion based methods with an RGB-D camera have been proposed [5, 6]. The availability of low-cost RGB-D cameras has enabled a large number of methods for fundamental computer vision problems over the last decade. ORB-SLAM2 [7] further extends the system to stereo and RGB-D sensors. Reference [8] corrects the depth data to make it more reliable, and Reference [9] reduces the noise in depth data using a per-pixel polynomial calibration model. Both improve localization accuracy.

However, traditional geometric methods tend to drift when only point features are used for camera pose estimation. This limitation can be mitigated by combining point, line and plane features. Reference [10] uses points and lines and extracts 2D and 3D information from them to build a geometric constraint model. In ORB-SLAM3 [11], lines and planes are both used to obtain better performance in low-textured scenes. Reference [12] also uses points, lines and planes as structural features to estimate the pose.

In addition, the Manhattan-world constraint is an effective way to reduce rotation drift in indoor SLAM. In Reference [13], the authors argue that drift is mostly caused by inaccurate rotation estimates, so improving the accuracy of rotation estimation is the key to reducing drift. For example, Reference [14] exploits environmental regularity to track keypoints and estimate the translation, and Reference [15] proposes a framework that robustly exploits the Manhattan world structure. Thanks to the Manhattan assumption, such approaches accumulate only small rotational errors.

In this paper, focusing on the case where the robot moves in indoor environments, we propose an efficient pose estimation system with an RGB-D camera under the weak Manhattan world assumption. Indoor environments usually contain many planar structures, such as the floor, ceiling and walls, and the weak Manhattan world assumption states that the floor and ceiling are perpendicular to the walls, which holds in most real indoor scenes. In addition, we assume the indoor floor is flat, so the roll and pitch angles remain constant while the robot moves. In this case, the 6-degree-of-freedom (DOF) problem reduces to a 3-DOF problem. With the above assumptions, we propose a new method for robot motion estimation, and the novelties of this paper are:

  • (1) An efficient method for extracting the normal vectors of planes. We align the current local coordinate frame with the world coordinate frame so that the estimated rotation is drift-free. One of the main contributions of our work is that we do not need any plane segmentation: instead, we directly extract the normals of 3D planes by fitting 2D lines in the inverse-depth induced histograms, which makes the subsequent relative pose estimation suitable for real-time applications. This is a key difference between our work and previous methods.

  • (2) Since the robot roams in indoor environments, we assume that at least one local Manhattan coordinate frame can be extracted from the detected planar structures, i.e., there is at least one ground plane and one vertical plane. Whether we obtain one or several local Manhattan frames, we can align the robot's coordinate frame to the dominant Manhattan frame so that the relative rotation can be found. Once the rotation is known, the translation can be calculated from a single 3D point correspondence.

  • (3) To demonstrate the benefits of our method, we record real-world RGB-D datasets of almost 6,000 images with a Kinect V2 in indoor environments. We compare our method with ORB-RGBD SLAM [7] on both public datasets and our own. Experimental results show that our method performs better on these datasets, which satisfy the weak Manhattan world assumption. Our datasets will be made publicly available as a contribution to the community.

3. PROPOSED METHOD

In our framework (see the overview in Figure 1), since the RGB-D camera is rigidly mounted on the robot, we can estimate the normal vector of the floor from the depth information and align the y-axis of the camera with it (illustrated in Figure 2a). In this case, the camera pose is simplified and the motion model can be parameterized efficiently. The inverse-depth histogram image is used to extract the normal vectors and estimate the relative rotation. We align the local camera coordinate frame with the world coordinate frame so that we can calculate a drift-free rotation. Once the rotation is obtained, the translation can be found from a single point correspondence.
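
Concretely, once the $y$-axis is aligned with the floor normal, the remaining motion can be written as a yaw rotation about the $y$-axis plus a translation in the ground plane. This is our explicit spelling-out of the 3-DOF parameterization implied above, not a formula taken from the paper:

$$R(\theta) = \begin{bmatrix} \cos\theta & 0 & \sin\theta \\ 0 & 1 & 0 \\ -\sin\theta & 0 & \cos\theta \end{bmatrix}, \qquad \mathbf{t} = (t_x,\ 0,\ t_z)^T.$$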

Figure 1. Overview of the proposed motion estimation framework.

Figure 2. (a): Aligning the $y$-axis of the camera with the normal vector of the floor; (b): A view of a corridor map consisting of several Manhattan structures, where different colors represent different local Manhattan structures; (c): Walls in different Cartesian frames; for example, the red Manhattan frame aligns with Walls 1 and 3, and the blue Manhattan frame aligns with Wall 2.

We extract and match keypoints in the RGB image and fit lines in the inverse-depth histogram images. We then use the u-intercept induced motion estimation described in Section 3.3.

3.1 Aligning the camera with a known direction

With this processing, the rotation between frames is reduced to 1-DOF. Let $\mathbf{n}$ be the unit normal vector of the floor expressed in the robot's camera coordinate frame. In this stage, our goal is to find a rotation matrix $R_n$ that aligns the $y$-axis of the robot's camera with the normal vector of the floor, i.e., $R_n\mathbf{n} = (0\ \ 1\ \ 0)^T$. The matrix $R_n$ can be obtained from the Euler-Rodrigues formula and is calculated as:

$$R_n = I + [\mathbf{v}]_\times + \frac{1}{1+c}\,[\mathbf{v}]_\times^2, \tag{1}$$

with $\mathbf{v} = \mathbf{n} \times \mathbf{e}_y$ and $c = \mathbf{n} \cdot \mathbf{e}_y$, where $\mathbf{e}_y = (0\ \ 1\ \ 0)^T$ and $[\mathbf{v}]_\times$ denotes the skew-symmetric matrix of $\mathbf{v}$. Our relative pose estimation framework is based on this initialization.
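
As an illustration, the following is a minimal NumPy sketch of this alignment step, assuming the floor normal has already been estimated as in Section 3.2; the implementation details are ours, not the authors' code.

```python
import numpy as np

def align_y_axis_with_normal(n):
    """Rotation R_n such that R_n @ n = (0, 1, 0)^T, i.e. equation (1).

    n : unit normal vector of the floor, expressed in the camera frame.
    """
    n = np.asarray(n, dtype=float)
    n = n / np.linalg.norm(n)
    e_y = np.array([0.0, 1.0, 0.0])
    v = np.cross(n, e_y)                    # rotation axis (unnormalized)
    c = float(np.dot(n, e_y))               # cosine of the rotation angle
    if np.isclose(c, -1.0):                 # degenerate case: n = -e_y
        return np.diag([1.0, -1.0, -1.0])   # 180-degree rotation about the x-axis
    vx = np.array([[0.0, -v[2], v[1]],      # skew-symmetric matrix [v]_x
                   [v[2], 0.0, -v[0]],
                   [-v[1], v[0], 0.0]])
    return np.eye(3) + vx + vx @ vx / (1.0 + c)

# Example: a slightly tilted floor normal observed by the camera.
n_floor = np.array([0.05, 0.99, 0.10])
R_n = align_y_axis_with_normal(n_floor)
print(R_n @ (n_floor / np.linalg.norm(n_floor)))    # ~ [0, 1, 0]
```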

3.2 Estimating the normal of planes via inverse-depth

Inspired by the u-v disparity representation used in stereo vision, we build inverse-depth induced histograms and propose an efficient method to directly extract the normal of a horizontal plane. A horizontal plane in the $i$th camera coordinate frame and the pinhole camera model can be written as:

$$Y = a_i Z + b_i, \qquad Z\begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}\begin{bmatrix} X \\ Y \\ Z \end{bmatrix}, \tag{2}$$

where $[X\ \ Y\ \ Z]^T$ is a 3D point and $[u\ \ v\ \ 1]^T$ is its projection in the image.

Using equation (2), we obtain the relationship between $v$ and $Y$. Letting $l_h = 1/Z$ denote the inverse depth (up to scale), we finally obtain:

$$v = f_y\frac{Y}{Z} + c_y, \qquad v = (f_y a_i + c_y) + f_y b_i\, l_h. \tag{3}$$

Equation (3) (right) represents a straight line in $(l_h, v)$, i.e., in the horizontal-histogram image shown in Figure 3b. Letting $s_i$ and $v_i$ denote the slope and $v$-intercept of this line, we have the following relation:

$$s_i = f_y b_i, \qquad a_i = \frac{v_i - c_y}{f_y}. \tag{4}$$

Using equations (2) (left) and (4) (right), the normal $\mathbf{n}^i_h$ of a horizontal plane in the $i$th frame can be written as:

$$\mathbf{n}^i_h = \begin{bmatrix} 0 & 1 & -\dfrac{v_i - c_y}{f_y} \end{bmatrix}^T. \tag{5}$$

Once the $y$-axis of the robot's camera is aligned with the normal vector of the floor, a vertical plane in the $i$th camera coordinate frame can be defined as $X = a_i Z + b_i$. In a similar way, the normal of a vertical plane in the $i$th frame can be written as $\mathbf{n}^i_v = (1\ \ 0\ \ -a_i)^T$.
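
For completeness, repeating the substitution of equations (2)-(4) with the $u$ coordinate (our own spelling-out of the step the text only summarizes) gives

$$u = f_x\frac{X}{Z} + c_x = (f_x a_i + c_x) + f_x b_i\, l_v, \qquad a_i = \frac{u_i - c_x}{f_x},$$

so the vertical-plane normal depends only on the $u$-intercept $u_i$ of the line fitted in the vertical-histogram image.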

Figure 3. (a): An RGB image; (b): The horizontal histograms of inverse depth; (c): The line fitting result using the RANSAC algorithm; (d): The corresponding depth image; (e): The vertical histograms of inverse depth; (f): The line fitting result using the RANSAC algorithm.

Since the camera is calibrated, $f_x$, $f_y$, $c_x$, $c_y$ are known, so the plane normal depends only on the $v$-intercept (horizontal plane) or the $u$-intercept (vertical plane). Thus, we can extract the horizontal normal directly by RANSAC line fitting in the horizontal-histogram image (Figures 3b and 3c), and the vertical normal likewise in the vertical-histogram image (Figures 3e and 3f). In addition, this scheme eliminates the impact of moving objects, e.g., the people in Figure 3a. Once the normal of a vertical plane is computed, we obtain a local Manhattan frame.
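
To make this step concrete, the following NumPy sketch builds the row-wise (horizontal) inverse-depth histogram and recovers the slope and $v$-intercept of the dominant line with a small RANSAC routine. The bin count, thresholds and the synthetic floor used for the demonstration are our own illustrative choices, not values from the paper; the vertical-histogram case is handled analogously with image columns and the $u$-intercept.

```python
import numpy as np

def horizontal_inverse_depth_histogram(depth, num_bins=512, min_depth=0.3, max_depth=10.0):
    """Row-wise histograms of inverse depth (the 'horizontal-histogram image').

    depth : (H, W) array of metric depths; pixels outside [min_depth, max_depth]
    or non-finite are ignored. Returns the (H, num_bins) histogram and the
    inverse-depth value at each bin centre.
    """
    H, _ = depth.shape
    valid = np.isfinite(depth) & (depth > min_depth) & (depth < max_depth)
    inv = np.zeros_like(depth, dtype=float)
    inv[valid] = 1.0 / depth[valid]
    edges = np.linspace(1.0 / max_depth, 1.0 / min_depth, num_bins + 1)
    hist = np.zeros((H, num_bins), dtype=np.int32)
    for v in range(H):                               # one histogram per image row
        hist[v], _ = np.histogram(inv[v][valid[v]], bins=edges)
    centres = 0.5 * (edges[:-1] + edges[1:])
    return hist, centres

def fit_line_ransac(l, v, iters=300, tol=4.0, seed=0):
    """RANSAC fit of v = s * l + v0 to the populated histogram cells,
    followed by a least-squares refit on the inliers of the best hypothesis."""
    rng = np.random.default_rng(seed)
    best_mask, best_count = None, -1
    for _ in range(iters):
        i, j = rng.choice(len(l), size=2, replace=False)
        if np.isclose(l[i], l[j]):
            continue
        s = (v[j] - v[i]) / (l[j] - l[i])
        v0 = v[i] - s * l[i]
        mask = np.abs(v - (s * l + v0)) < tol
        if mask.sum() > best_count:
            best_mask, best_count = mask, mask.sum()
    s, v0 = np.polyfit(l[best_mask], v[best_mask], 1)
    return s, v0                                     # slope s_i and v-intercept v_i of eq. (3)

# Synthetic demo: a tilted floor Y = a*Z + b seen by a 960x540-like camera.
H, W, fy, cy = 540, 960, 540.0, 270.0
a, b = 0.02, 1.2                                     # plane parameters of equation (2), left
depth = np.zeros((H, W))
rows = np.arange(330, H)                             # image rows that see the floor
depth[rows, :] = (fy * b / (rows - cy - fy * a))[:, None]    # invert equation (3)

hist, centres = horizontal_inverse_depth_histogram(depth)
vs, bins = np.nonzero(hist > 50)                     # strongly populated cells
s_i, v_i = fit_line_ransac(centres[bins], vs.astype(float))
print("a_i =", (v_i - cy) / fy, "(ground truth", a, ")")     # normal (0, 1, -a_i), eq. (5)
```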

3.3 The u-intercept induced motion estimation

Instead of directly using plane-to-plane registration, we align the current local coordinate frame with the world coordinate frame to obtain the orientation and location. When there is only one local Manhattan coordinate frame, let $R_i$ and $R_{i+1}$ denote the orientations with respect to the world coordinate frame at times $t_i$ and $t_{i+1}$, respectively. The relative rotation between $t_i$ and $t_{i+1}$ can then simply be written as $R_{i,i+1} = R_{i+1}^{T} R_i$.

Once the rotation is known, the relative translation can be found from a single point correspondence: $P' = R_{i,i+1} P + T$, where $P'$ and $P$ are the current and previous 3D points, respectively. We implement a 1-point RANSAC routine to reject outliers.
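
A compact sketch of the 1-point translation RANSAC follows; it is our own illustrative implementation, and the inlier threshold and iteration count are arbitrary choices rather than values from the paper.

```python
import numpy as np

def translation_1pt_ransac(P_prev, P_curr, R, iters=100, tol=0.05, seed=0):
    """1-point RANSAC for the translation T, given the relative rotation R.

    Each correspondence P' = R P + T yields one hypothesis T = P' - R P;
    the hypothesis with the most inliers is refined by averaging its inliers.
    P_prev, P_curr : (N, 3) matched 3D points in the previous / current frame.
    """
    rng = np.random.default_rng(seed)
    rotated = P_prev @ R.T                          # R applied to every previous point
    best_mask = None
    for idx in rng.integers(0, len(P_prev), size=iters):
        T = P_curr[idx] - rotated[idx]              # hypothesis from a single correspondence
        mask = np.linalg.norm(P_curr - (rotated + T), axis=1) < tol
        if best_mask is None or mask.sum() > best_mask.sum():
            best_mask = mask
    T = np.mean(P_curr[best_mask] - rotated[best_mask], axis=0)
    return T, best_mask

# Synthetic check: planar motion (yaw about the y-axis) plus 10% gross outliers.
theta = np.deg2rad(5.0)
R = np.array([[np.cos(theta), 0.0, np.sin(theta)],
              [0.0, 1.0, 0.0],
              [-np.sin(theta), 0.0, np.cos(theta)]])
T_true = np.array([0.10, 0.0, 0.25])
P_prev = np.random.default_rng(1).uniform(-3.0, 3.0, size=(200, 3))
P_curr = P_prev @ R.T + T_true
P_curr[:20] += 1.0                                  # corrupted correspondences
T_est, inliers = translation_1pt_ransac(P_prev, P_curr, R)
print(T_est, int(inliers.sum()))                    # ~ T_true, ~180 inliers
```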

However, when the robot moves around a corner, there may be multiple walls corresponding to different Manhattan frames. In this case, we consider the global structure to consist of several local Manhattan frames (Figure 2b). We then obtain multiple candidate relative rotations, since there are multiple orientations with respect to the different local Manhattan frames. To find the right solution, we use the 1-point RANSAC to select the rotation and translation that yield the most inliers.
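
The same machinery extends to several candidate rotations; the short sketch below (again with our own illustrative threshold and iteration count) keeps the candidate whose 1-point translation hypotheses explain the most correspondences.

```python
import numpy as np

def select_manhattan_motion(P_prev, P_curr, candidate_Rs, iters=50, tol=0.05, seed=0):
    """Among candidate relative rotations (one per local Manhattan frame),
    keep the rotation and 1-point translation hypothesis with the most inliers."""
    rng = np.random.default_rng(seed)
    best_R, best_T, best_count = None, None, -1
    for R in candidate_Rs:
        rotated = P_prev @ R.T
        for idx in rng.integers(0, len(P_prev), size=iters):
            T = P_curr[idx] - rotated[idx]
            count = int(np.sum(np.linalg.norm(P_curr - (rotated + T), axis=1) < tol))
            if count > best_count:
                best_R, best_T, best_count = R, T, count
    return best_R, best_T, best_count
```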

4. EXPERIMENTS

We test our method on large-scale real data. The real-world RGB-D datasets that satisfy the weak Manhattan constraint were collected by the authors using a robot mounted with a Kinect V2 (Figures 4a and 4c); the resolution of these images is 960 × 540. The other dataset is from Reference [9], called Lee's data (Figure 4b), at a resolution of 640 × 480. These real-world datasets do not have ground truth; although some public datasets provide ground truth, most of them focus on general motion and do not apply to our application.

Figure 4. Comparing our method with ORB-RGBD SLAM in three large-scale indoor sequences. (a): Basement (200 m); (b): Laboratory building (190 m); (c): School building (300 m). First column: example images. Second column: reconstructed point cloud. Third column: trajectories of ORB-RGBD SLAM. Fourth column: trajectories of ours.

4.1 Performance with large-scale real data

To illustrate the practicality of our relative pose estimation framework, we test three large-scale sequences taken in three different indoor corridor-like environments. These sequences were selected to reflect the difficulties of relative pose estimation in real applications. We compare our method with the state-of-the-art ORB-RGBD SLAM [7]. Since we do not have ground truth, we perform a qualitative analysis on these datasets; for scenes that contain a closed loop, we measure the distance between the starting point and the end point.

4.1.1 Basement

This sequence is challenging for visual methods since it contains illumination variations, low-textured walls and visually similar structures. Figure 4a (third and fourth columns) shows the trajectories of ORB-RGBD SLAM and of our method. ORB-RGBD SLAM fails at the second corner due to the low-textured walls, and tracking is then lost. Note the scale of the y-axis: both methods are run on the same sequence, even though the two paths look quite different.

4.1.2 Laboratory Building

Figure 4b (third column) shows that ORB-RGBD SLAM loses tracking in the corners. Our method successfully closes both loops, with an error of 1.22 m between the starting point and the end point.

4.1.3 School Building

ORB-RGBD SLAM fails in this sequence (Figure 4c, third column) because the scene is nearly empty and features are very sparse, and it then relocalizes to the wrong place. The distance between the start point and the end point of our method is 0.72 m.

It is also worth mentioning that our framework achieves this performance with only frame-to-frame relative pose estimation, without any loop closure detection or non-linear refinement.

4.2 Timings

The experiments run on a desktop computer with an Intel Core i5-9400F CPU, without GPU acceleration. The results are averaged over almost 6,000 images. Taking a sequence with a resolution of 960 × 540 as an example, the average time for line fitting in a vertical-histogram image is 12 ms, and extracting and matching ORB features [16] takes 20 ms. Note that the line fitting and the feature matching can be executed in parallel. From Table 1, we can see that our algorithm has a large speed advantage while maintaining accuracy and robustness, and is efficient enough for real applications.
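
As an illustration of this parallelism, the sketch below runs the two independent stages in separate threads using OpenCV's ORB implementation [16]. The thread layout, the placeholder line-fitting function and the random stand-in frames are our own assumptions, since the paper does not describe its implementation beyond the sentence above.

```python
import concurrent.futures as cf
import cv2
import numpy as np

def orb_extract_and_match(img_prev, img_curr, n_features=1000):
    """Extract ORB features [16] in both frames and match them (brute force, Hamming)."""
    orb = cv2.ORB_create(n_features)
    kp1, des1 = orb.detectAndCompute(img_prev, None)
    kp2, des2 = orb.detectAndCompute(img_curr, None)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    return kp1, kp2, matcher.match(des1, des2)

def fit_histogram_lines(depth_curr):
    """Placeholder for the inverse-depth histogram line fitting of Section 3.2."""
    return None   # see the sketch in Section 3.2

# Stand-in frames (random noise) just to make the snippet self-contained.
rng = np.random.default_rng(0)
img_prev = rng.integers(0, 255, (540, 960), dtype=np.uint8)
img_curr = rng.integers(0, 255, (540, 960), dtype=np.uint8)
depth_curr = rng.uniform(0.5, 10.0, (540, 960))

# The two stages are independent, so they can run in separate threads.
with cf.ThreadPoolExecutor(max_workers=2) as pool:
    f_lines = pool.submit(fit_histogram_lines, depth_curr)
    f_feats = pool.submit(orb_extract_and_match, img_prev, img_curr)
    normals = f_lines.result()
    kp_prev, kp_curr, matches = f_feats.result()
print(len(matches), "ORB matches")
```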

Table 1. Comparison with plane fitting based methods.

Resolution   Ours    Plane fitting [17]   Plane fitting [18]
960 × 540    26 ms   100 ms               150 ms
640 × 480    20 ms   60 ms                100 ms

5. CONCLUSION

We propose a novel relative pose estimation framework for an indoor robot with an RGB-D camera. Unlike previous work, we estimate the relative camera motion without fitting planes, using planar structures extracted directly from the inverse-depth induced histograms. By aligning the current local coordinate frame with the world coordinate frame, the rotation estimation is drift-free. Furthermore, we present datasets that satisfy the weak Manhattan world assumption in indoor environments. Experiments on both public data and the data recorded by the authors suggest that our method performs very well in challenging indoor scenarios.

REFERENCES

[1] Davison, A. J., Reid, I. D., Molton, N. D. and Stasse, O., "MonoSLAM: Real-time single camera SLAM," IEEE Transactions on Pattern Analysis & Machine Intelligence, 1052–1067 (2007). https://doi.org/10.1109/TPAMI.2007.1049
[2] Montemerlo, M., Thrun, S., Koller, D. and Wegbreit, B., "FastSLAM: A factored solution to the simultaneous localization and mapping problem," AAAI/IAAI (2002).
[3] Klein, G. and Murray, D., "Parallel tracking and mapping for small AR workspaces," 6th IEEE and ACM Inter. Symp. on Mixed and Augmented Reality, 225–234 (2007).
[4] Mur-Artal, R., Montiel, J. M. M. and Tardós, J. D., "ORB-SLAM: A versatile and accurate monocular SLAM system," IEEE Transactions on Robotics, 31(5), 1147–1163 (2015). https://doi.org/10.1109/TRO.2015.2463671
[5] Kim, S. and Kim, J., "Occupancy mapping and surface reconstruction using local Gaussian processes with Kinect sensors," IEEE Transactions on Cybernetics, 43(5), 1335–1346 (2013). https://doi.org/10.1109/TCYB.2013.2272592
[6] Han, J., Shao, L., Xu, D. and Shotton, J., "Enhanced computer vision with Microsoft Kinect sensor: A review," IEEE Transactions on Cybernetics, 43(5), 1318–1334 (2013). https://doi.org/10.1109/TCYB.2013.2265378
[7] Mur-Artal, R. and Tardós, J. D., "ORB-SLAM2: An open-source SLAM system for monocular, stereo, and RGB-D cameras," IEEE Transactions on Robotics, 33(5), 1255–1262 (2017). https://doi.org/10.1109/TRO.2017.2705103
[8] Thabet, A. K., Lahoud, J., Asmar, D. and Ghanem, B., "3D aware correction and completion of depth maps in piecewise planar scenes," Asian Conf. on Computer Vision, 226–241 (2014).
[9] Yang, L., Dryanovski, I., Valenti, R. G., Wolberg, G. and Xiao, J., "RGB-D camera calibration and trajectory estimation for indoor mapping," Autonomous Robots, 44(8), 1485–1503 (2020). https://doi.org/10.1007/s10514-020-09941-w
[10] Zhang, C., "PL-GM: RGB-D SLAM with a novel 2D and 3D geometric constraint model of point and line features," IEEE Access, 9, 9958–9971 (2021). https://doi.org/10.1109/Access.6287639
[11] Campos, C., Elvira, R., Rodríguez, J. J. G., Montiel, J. M. and Tardós, J. D., "ORB-SLAM3: An accurate open-source library for visual, visual-inertial, and multimap SLAM," IEEE Transactions on Robotics (2021). https://doi.org/10.1109/TRO.2021.3075644
[12] Guo, Z., Yu, Q., Guo, R. and Lu, H., "Structural features based visual odometry for indoor textureless environments," 2020 Chinese Automation Congress (CAC), 3984–3989 (2020). https://doi.org/10.1109/CAC51589.2020
[13] Straub, J., Bhandari, N., Leonard, J. J. and Fisher, J. W., "Real-time Manhattan world rotation estimation in 3D," 2015 IEEE/RSJ Inter. Conf. on Intelligent Robots and Systems (IROS), 1913–1920 (2015).
[14] Kim, P., Coltin, B. and Kim, H. J., "Low-drift visual odometry in structured environments by decoupling rotational and translational motion," 2018 IEEE Inter. Conf. on Robotics and Automation (ICRA), 7247–7253 (2018).
[15] Yunus, R., Li, Y. and Tombari, F., "ManhattanSLAM: Robust planar tracking and mapping leveraging mixture of Manhattan frames," IEEE Inter. Conf. on Robotics and Automation (ICRA) (2021). https://doi.org/10.1109/ICRA48506.2021.9562030
[16] Rublee, E., Rabaud, V., Konolige, K. and Bradski, G., "ORB: An efficient alternative to SIFT or SURF," 2011 IEEE Inter. Conf. on Computer Vision, 2564–2571 (2011).
[17] Hou, Z., Ding, Y., Wang, Y., Yang, H. and Kong, H., "Visual odometry for indoor mobile robot by recognizing local Manhattan structures," Asian Conf. on Computer Vision, 168–182 (2018).
[18] Le, P. H. and Kosecka, J., "Dense piecewise planar RGB-D SLAM for indoor environments," IEEE/RSJ Inter. Conf. on Intelligent Robots and Systems (IROS), 4944–4949 (2017).
© 2022 Society of Photo-Optical Instrumentation Engineers (SPIE).