1. INTRODUCTION

Relative pose estimation is a popular topic in computer vision and robotics. It typically uses environmental information captured by a camera to compute the 6-degree-of-freedom pose. Recently, many camera motion estimation methods have been proposed using different kinds of sensors, such as monocular cameras, stereo cameras, Lidar, and RGB-D cameras, meeting various application scenarios. Among them, the RGB-D camera has attracted particular attention from researchers since it provides both depth and visual information.

2. RELATED WORK

Visual Simultaneous Localization and Mapping (SLAM) and relative pose estimation are two tightly related topics; both recover 3D information from visual sensor data, usually in real time. Feature-based methods [1-4] generally estimate camera poses by tracking keypoints and minimizing the geometric projection error. Among them, ORB-SLAM [4] is a popular point-based monocular SLAM system. However, such point-based methods may fail in low-textured environments where feature points are scarce. To address this problem, sensor-fusion methods with an RGB-D camera have been proposed [5, 6]. The availability of RGB-D cameras at low cost has facilitated a large number of methods for solving fundamental problems in computer vision over the last decade. ORB-SLAM2 [7] further extends to stereo and RGB-D sensors. Reference [8] corrects the depth data to make it more reliable, and Reference [9] reduces the noise in depth data using a per-pixel polynomial calibration model; both improve localization accuracy. However, traditional geometric methods tend to drift if only point features are used for camera pose estimation. Using point, line, and plane features together can partly overcome this limitation. Reference [10] used points and lines, extracting 2D and 3D information from them to build a geometric constraint model.
In ORB-SLAM3 [11], lines and planes are both used to obtain better performance in low-textured scenes. Reference [12] also uses points, lines, and planes as structural features to estimate the pose. In addition, the Manhattan-world constraint is a good way to avoid rotation drift in SLAM algorithms for indoor scenes. In Reference [13], the authors argue that drift is mostly caused by inaccurate rotation estimation, so improving the accuracy of rotation estimation is the key to producing lower drift. For example, Reference [14] utilized environmental regularity to track keypoints and estimate the translation, and Reference [15] proposed a framework that robustly exploits the Manhattan-world (MW) structure. By leveraging the Manhattan assumption, many approaches achieve a small accumulated rotational error.

In this paper, focusing on the case where a robot moves in indoor environments, we propose an efficient pose estimation system with an RGB-D camera under the weak Manhattan-world assumption. Indoor environments usually contain many planar structures, such as floors, ceilings, and walls; the weak Manhattan-world assumption states that the floor and ceiling are perpendicular to the walls, which is practical in the real world. In addition, we assume the indoor floor is flat, so the roll and pitch angles remain constant while the robot is moving. In this case, the 6-degree-of-freedom (DOF) problem reduces to a 3-DOF problem. With the above assumptions, we propose a new method for robot motion estimation, whose novelties are detailed in the following sections.
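The reduction from 6-DOF to 3-DOF under the flat-floor assumption can be made concrete with a minimal sketch: the remaining degrees of freedom are the in-plane position (x, z) and the yaw angle about the floor normal. The helper name `pose_3dof` is ours, for illustration only, not from the paper:

```python
import numpy as np

def pose_3dof(x, z, yaw):
    """Build a 4x4 homogeneous pose from the reduced 3-DOF state.

    With the floor flat and the camera y-axis aligned with the floor
    normal, roll, pitch, and height stay constant, so only (x, z, yaw)
    vary while the robot moves.
    """
    c, s = np.cos(yaw), np.sin(yaw)
    T = np.eye(4)
    # Rotation about the y-axis (yaw only).
    T[:3, :3] = np.array([[  c, 0.0,   s],
                          [0.0, 1.0, 0.0],
                          [ -s, 0.0,   c]])
    T[0, 3] = x   # translation stays in the floor plane
    T[2, 3] = z
    return T

# Composing two planar motions keeps the pose in the same 3-DOF family.
T = pose_3dof(1.0, 2.0, np.pi / 2) @ pose_3dof(0.5, 0.0, 0.0)
```

Any sequence of such frame-to-frame motions therefore never accumulates roll or pitch error, which is one motivation for the alignment step of the proposed method.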
3. PROPOSED METHOD

In our framework (see the overview in Figure 1), since the RGB-D camera is rigidly mounted on the robot, we can estimate the normal vector of the floor from the depth information and align the y-axis of the camera with that normal (illustrated in Figure 2a). In this case, the camera pose can be simplified and the motion model can be parameterized efficiently. The inverse-depth histogram image is used to extract the normal vector and estimate the relative rotation. We align the local camera coordinate with the world coordinate so that we can compute a drift-free rotation. Once the rotation is obtained, the translation can be found from a 1-point correspondence. We extract and match keypoints in the RGB image, fit lines in the inverse-depth image, and then apply the u-intercept induced motion estimation.

3.1 Aligning the camera with a known direction

With this processing, the rotation between frames is simplified to 1-DOF. Let n be the unit normal vector of the floor in the robot's camera coordinate. In this stage, our goal is to find a rotation matrix Rn that aligns the y-axis of the robot's camera with the floor normal: Rn n = ey, where ey = (0 1 0)T. As is well known, Rn can be represented by the Euler-Rodrigues formula. Finally, the matrix we look for can be calculated as

Rn = I + sin θ [k]x + (1 − cos θ) [k]x^2,   (1)

with k = (n × ey) / ||n × ey||, θ = arccos(n · ey), and [k]x the skew-symmetric matrix of k. Our relative pose estimation framework is based on this initialization.

3.2 Estimating the normal of planes via inverse-depth

Enlightened by the u-v disparity of stereo vision, inverse-depth induced histograms arise, and we propose an efficient method based on them to directly extract the normal of a horizontal plane. A horizontal plane in the ith camera coordinate and the pinhole camera model can be defined as follows:

Y = ciZ + di,   u = fx X/Z + cx,   v = fy Y/Z + cy,   (2)

where [X Y Z]T and [u v 1]T are the 3D point and the corresponding image point acquired by the camera, respectively. Using equation (2), we can get the relationship between v and Y.
Letting lh = 1/Z, which is defined as the inverse depth up to scale, we finally obtain

v = fy Y/Z + cy = (fy ci + cy) + fy di lh.   (3)

We can see that the right-hand side of equation (3) represents a straight line in (lh, v) in the horizontal-histogram image, as shown in Figure 3b. Letting si and vi represent the slope and v-intercept of the line, we have the following relation:

si = fy di,   vi = fy ci + cy.   (4)

Using equations (2) (left) and (4) (right), the normal nih of a horizontal plane in the ith frame can be written as

nih = (0 1 −(vi − cy)/fy)T.   (5)

Once the y-axis of the robot's camera is aligned with the normal vector of the floor, a vertical plane in the ith camera coordinate can be defined as X = aiZ + bi. In a similar way, the normal niv of a vertical plane in the ith frame can be written as niv = (1 0 −ai)T. Since the camera is calibrated, fy, cy, fx, and cx are known, so the normal of a plane depends only on the v- or u-intercept. Thus, we can extract the horizontal normal directly by RANSAC line fitting (Figure 3c) in the horizontal-histogram image (Figure 3b), and likewise the vertical normal (Figures 3e and 3f). In addition, this scheme eliminates the impact of moving objects, e.g., the people in Figure 3a. Once we compute the normal of a vertical plane, we obtain a local Manhattan frame.

3.3 The u-intercept induced motion estimation

Instead of directly using plane-to-plane registration, we align the current local coordinate with the world coordinate to obtain the orientation and location. When there is only one local Manhattan coordinate, assume that the orientations with respect to the world coordinate at times ti and ti+1 are Ri and Ri+1, respectively. Then the relative rotation between ti and ti+1 can simply be written as Ri,i+1 = Ri+1T Ri. Once the rotation is known, the relative translation can be found from a single point correspondence: P′ = Ri,i+1 P + T, where P′ and P are the current and previous 3D points, respectively. We implement a 1-point RANSAC routine to reject outliers.
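The line-fitting step above, which recovers the horizontal-plane normal from the slope and v-intercept in the (lh, v) domain, can be sketched as follows. The intrinsics (fy, cy) and plane parameters here are invented for illustration, and a plain least-squares fit stands in for the RANSAC line fitting used in the paper:

```python
import numpy as np

fy, cy = 540.0, 270.0   # assumed intrinsics, for illustration only
c, d = 0.2, -1.5        # ground-truth horizontal plane: Y = c*Z + d

# Synthesize pixel rows seen on the plane; v = fy*Y/Z + cy reduces to
# v = (fy*c + cy) + fy*d*(1/Z), i.e. a line in the inverse depth lh = 1/Z.
rng = np.random.default_rng(0)
Z = rng.uniform(1.0, 5.0, 500)
lh = 1.0 / Z
v = fy * (c * Z + d) / Z + cy + rng.normal(0.0, 0.3, Z.size)

# Fit the straight line v = s*lh + vi in the (lh, v) histogram domain.
A = np.stack([lh, np.ones_like(lh)], axis=1)
(s, vi), *_ = np.linalg.lstsq(A, v, rcond=None)

# Recover the plane normal from the v-intercept alone: n = (0, 1, -(vi - cy)/fy).
n = np.array([0.0, 1.0, -(vi - cy) / fy])
n /= np.linalg.norm(n)
```

Only the fitted v-intercept is needed to recover the normal direction; the slope encodes the plane offset, which the method does not require.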
However, when the robot moves around a corner, there are sometimes multiple walls corresponding to different Manhattan frames. In this case, we consider the global structure to consist of several local Manhattan frames (Figure 2b). We then obtain multiple candidate relative rotations, since there are multiple orientations with respect to the different local Manhattan frames. To find the right solution, we use the 1-point RANSAC to select the rotation and translation that yield the most inliers.

4. EXPERIMENTS

We test our method on large-scale real data. The real-world RGB-D datasets that satisfy the weak Manhattan constraints were collected by the authors. We used a robot mounted with a Kinect V2 to capture the scenes (Figures 4a and 4c); the resolution of these images is 960 × 540. The other dataset is from Reference [9], called Lee's data (Figure 4b), at a resolution of 640 × 480. These real-world datasets do not have ground truth; although some public datasets contain ground truth, most of them focus on general motion and do not apply to our application.

4.1 Performance with large-scale real data

To illustrate the practicability of our relative pose estimation framework, three large-scale scenes taken in three different indoor corridor-like environments were tested. These sequences were selected to reflect the difficulties of relative pose estimation in real applications. We compare our method with the state-of-the-art ORB-RGBD SLAM [7]. Since we do not have ground truth, we only perform a qualitative analysis on these datasets. For scenes that contain a closed loop, we measure the distance between the starting point and the end point.

4.1.1 Basement. This sequence is challenging for visual-based methods since it contains illumination variations, low-textured walls, and similar structures. Figure 4a (third column) shows the trajectories of ORB-RGBD SLAM and ours.
As we can see, ORB-RGBD SLAM fails at the second corner due to the low-textured walls, and tracking is then lost. Note the values on the y-axis: the two paths look different even though they were produced on the same dataset.

4.1.2 Laboratory Building. Figure 4b (third column) shows that ORB-RGBD SLAM's tracking gets lost in the corners. Our method produces an error of 1.22 m between the starting point and the end point and succeeds with two loop closures.

4.1.3 School Building. ORB-RGBD SLAM fails (Figure 4c, third column) because the scene is empty and features are very sparse, and it then relocalizes to the wrong place. The distance between the start point and the end point for our method is 0.72 m. It is also worth mentioning that our framework achieves this performance with only frame-to-frame relative pose estimation, without any loop closure detection algorithm or non-linear refinement.

4.2 Timings

The experiments were run on a desktop computer with an Intel Core i5-9400F CPU, without GPU acceleration. The results are averaged over almost 6,000 images. Taking a sequence with a resolution of 960 × 540 as an example, the average time for line fitting in a vertical-histogram image is 12 ms, and extracting and matching ORB features [16] takes 20 ms. Note that the line fitting and the feature matching can be executed in parallel. From Table 1, we can see that our algorithm shows a large advantage in speed while maintaining accuracy and robustness; it is efficient enough for real applications.

Table 1. Comparison with plane-fitting based methods.
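The remark above that line fitting and feature matching can run in parallel can be sketched with Python's standard thread pool. The two task functions here are placeholders standing in for the paper's stages, not its actual implementation:

```python
from concurrent.futures import ThreadPoolExecutor

def fit_histogram_lines(depth_frame):
    # Placeholder for RANSAC line fitting in the inverse-depth histogram (~12 ms).
    return {"frame": depth_frame, "lines": []}

def match_orb_features(rgb_frame):
    # Placeholder for ORB extraction and matching (~20 ms).
    return {"frame": rgb_frame, "matches": []}

def process_frame(rgb_frame, depth_frame):
    """Run the two independent per-frame stages concurrently, so the overall
    latency is bounded by the slower stage rather than by their sum."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        lines = pool.submit(fit_histogram_lines, depth_frame)
        matches = pool.submit(match_orb_features, rgb_frame)
        return lines.result(), matches.result()

lines, matches = process_frame("rgb_0001.png", "depth_0001.png")
```

With the stage timings reported above, the per-frame cost is then dominated by the 20 ms feature-matching stage rather than the 32 ms sum.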
5. CONCLUSION

We propose a novel relative pose estimation framework for an indoor robot with an RGB-D camera. Unlike previous work, we estimate the relative camera motion without fitting planes, using planar structures extracted directly from the inverse-depth induced histograms. By aligning the current local coordinate with the world coordinate, the rotation estimation is drift-free. Furthermore, we present datasets that satisfy the weak Manhattan-world assumption in indoor environments. The experiments on both public data and the authors' recorded data show that our method performs well in very challenging indoor scenarios.

REFERENCES

[1] Davison, A. J., Reid, I. D., Molton, N. D. and Stasse, O., "MonoSLAM: Real-time single camera SLAM," IEEE Transactions on Pattern Analysis & Machine Intelligence, 1052–1067 (2007). https://doi.org/10.1109/TPAMI.2007.1049
[2] Montemerlo, M., Thrun, S., Koller, D., Wegbreit, B., et al., "FastSLAM: A factored solution to the simultaneous localization and mapping problem," AAAI/IAAI (2002).
[3] Klein, G. and Murray, D., "Parallel tracking and mapping for small AR workspaces," 6th IEEE and ACM Inter. Symp. on Mixed and Augmented Reality, 225–234 (2007).
[4] Mur-Artal, R., Montiel, J. M. M. and Tardós, J. D., "ORB-SLAM: A versatile and accurate monocular SLAM system," IEEE Transactions on Robotics 31(5), 1147–1163 (2015). https://doi.org/10.1109/TRO.2015.2463671
[5] Kim, S. and Kim, J., "Occupancy mapping and surface reconstruction using local Gaussian processes with Kinect sensors," IEEE Transactions on Cybernetics 43(5), 1335–1346 (2013). https://doi.org/10.1109/TCYB.2013.2272592
[6] Han, J., Shao, L., Xu, D. and Shotton, J., "Enhanced computer vision with Microsoft Kinect sensor: A review," IEEE Transactions on Cybernetics 43(5), 1318–1334 (2013). https://doi.org/10.1109/TCYB.2013.2265378
[7] Mur-Artal, R. and Tardós, J. D., "ORB-SLAM2: An open-source SLAM system for monocular, stereo, and RGB-D cameras," IEEE Transactions on Robotics 33(5), 1255–1262 (2017). https://doi.org/10.1109/TRO.2017.2705103
[8] Thabet, A. K., Lahoud, J., Asmar, D. and Ghanem, B., "3D aware correction and completion of depth maps in piecewise planar scenes," Asian Conf. on Computer Vision, 226–241 (2014).
[9] Yang, L., Dryanovski, I., Valenti, R. G., Wolberg, G. and Xiao, J., "RGB-D camera calibration and trajectory estimation for indoor mapping," Autonomous Robots 44(8), 1485–1503 (2020). https://doi.org/10.1007/s10514-020-09941-w
[10] Zhang, C., "PL-GM: RGB-D SLAM with a novel 2D and 3D geometric constraint model of point and line features," IEEE Access 9, 9958–9971 (2021). https://doi.org/10.1109/Access.6287639
[11] Campos, C., Elvira, R., Rodríguez, J. J. G., Montiel, J. M. and Tardós, J. D., "ORB-SLAM3: An accurate open-source library for visual, visual-inertial, and multimap SLAM," IEEE Transactions on Robotics (2021). https://doi.org/10.1109/TRO.2021.3075644
[12] Guo, Z., Yu, Q., Guo, R. and Lu, H., "Structural features based visual odometry for indoor textureless environments," 2020 Chinese Automation Congress (CAC), 3984–3989 (2020). https://doi.org/10.1109/CAC51589.2020
[13] Straub, J., Bhandari, N., Leonard, J. J. and Fisher, J. W., "Real-time Manhattan world rotation estimation in 3D," 2015 IEEE/RSJ Inter. Conf. on Intelligent Robots and Systems (IROS), 1913–1920 (2015).
[14] Kim, P., Coltin, B. and Kim, H. J., "Low-drift visual odometry in structured environments by decoupling rotational and translational motion," 2018 IEEE Inter. Conf. on Robotics and Automation (ICRA), 7247–7253 (2018).
[15] Yunus, R., Li, Y. and Tombari, F., "ManhattanSLAM: Robust planar tracking and mapping leveraging mixture of Manhattan frames," (2021). https://doi.org/10.1109/ICRA48506.2021.9562030
[16] Rublee, E., Rabaud, V., Konolige, K. and Bradski, G., "ORB: An efficient alternative to SIFT or SURF," 2011 IEEE Inter. Conf. on Computer Vision, 2564–2571 (2011).
[17] Hou, Z., Ding, Y., Wang, Y., Yang, H. and Kong, H., "Visual odometry for indoor mobile robot by recognizing local Manhattan structures," Asian Conf. on Computer Vision, 168–182 (2018).
[18] Le, P. H. and Kosecka, J., "Dense piecewise planar RGB-D SLAM for indoor environments," IEEE/RSJ Inter. Conf. on Intelligent Robots and Systems (IROS), 4944–4949 (2017).