Image-Based Tracking

Abstract
In Mixed Reality (MR) environments, the user’s view is augmented with virtual, artificial objects. To visualize virtual objects, the position and orientation of the user’s view, i.e. the camera, is needed. Tracking the user’s viewpoint is an essential task in MR applications, especially for interaction and navigation. In present systems, the initialization is often complex. For this reason, we introduce a new method for fast initialization of markerless object tracking. This method is based on Speeded Up Robust Features (SURF) and, paradoxically, on a traditional marker-based library. Most markerless tracking algorithms can be divided into two parts: an offline and an online stage. The focus of this documentation is the optimization of the offline stage, which is often time-consuming.

Introduction
This documentation describes the idea of measuring arbitrary 2D objects based on common markers in order to track them. The developed feature-based markerless system tracks textured planar objects. The focus lies on a fast and uncomplicated initialization step, which comprises object detection followed by the estimation of the relative position between the textured object and the user’s view, i.e. the camera. In the offline stage the user can choose between two provided methods:

  1. Initialization with reference: In this case the textured object is measured by a traditional marker-based system. Hence, the estimated transformation between the real object and the user’s viewpoint is exact.
  2. Initialization without reference: If there is no reference, or the reference cannot be placed next to the object, it is still possible to initialize the tracker. In this case, a marker is generated automatically after the user has selected the real object. The transformation between the textured object and the user’s viewpoint can then be computed only up to a scale factor. However, the object can still be augmented without any problems.

The result of either method is called a keyframe. This keyframe, the extracted features, and the camera position and orientation calculated by the marker system are stored together as one reference frame. In the online stage, SURF (Speeded Up Robust Features) features are extracted from every frame. From the detected features and the corresponding features of the reference frame, the transformation (homography) between the current camera image and the reference frame can be calculated. Subsequently, the camera position and orientation can be computed by a marker-based library from the warped keyframe. In a final step, the environment can be augmented with virtual objects (compare figure 1d), or the tracked object can be used for interaction and navigation.
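The text does not specify how a reference frame is stored; the following is a minimal Python/NumPy sketch of such a record, where all field names are illustrative assumptions rather than the authors’ actual data layout:

    from dataclasses import dataclass
    from typing import Sequence
    import numpy as np
    import cv2

    # Illustrative sketch only: the actual layout is not given in the text.
    @dataclass
    class ReferenceFrame:
        keyframe: np.ndarray                 # captured reference image (grayscale)
        keypoints: Sequence[cv2.KeyPoint]    # SURF keypoints found in the keyframe
        descriptors: np.ndarray              # SURF descriptors, one row per keypoint
        pose: np.ndarray                     # 4x4 camera pose from the marker library

Grouping these four items in one record mirrors the paper’s statement that keyframe, features, and marker-derived pose are stored together and queried as a unit in the online stage.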

Implementation
The method introduced in this documentation only needs an image of the scene, with or without a common marker, for initialization. Based on this reference image, the position and orientation of the user’s view relative to the real object are computed. This section describes the algorithm.

Offline stage: In the offline stage, a reference frame has to be captured, which may, but does not have to, include a marker. Additionally, the user has to select the 2D object in the current frame. SURF features are then extracted and stored automatically. For stable tracking, features need special properties (scale and rotation invariance); the recognition rate of the features is also important. An additional image for calculating the object’s pose with a marker-based tracking library is also generated automatically (see figure 1c). The marker is placed at the center of the selected object; this image is used for background processing and is not visible to the user.
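As a rough illustration of this step, the following Python/OpenCV sketch extracts SURF features restricted to a user-selected rectangle. SURF_create lives in the opencv-contrib xfeatures2d module (a nonfree build is required); the function name, ROI format, and threshold are assumptions, not the authors’ code:

    import numpy as np
    import cv2

    def build_reference_features(frame_gray, roi):
        """Extract SURF features only inside the selected object region."""
        x, y, w, h = roi                          # user-selected object rectangle
        mask = np.zeros(frame_gray.shape, dtype=np.uint8)
        mask[y:y + h, x:x + w] = 255              # detect only on the object
        surf = cv2.xfeatures2d.SURF_create(hessianThreshold=400)
        keypoints, descriptors = surf.detectAndCompute(frame_gray, mask)
        return keypoints, descriptors

Masking the detector to the selected region keeps background features out of the reference frame, which otherwise would pollute the matching in the online stage.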

Online stage: In the online stage, the SURF features of each frame are extracted. For matching these features against the reference frame, the user can select one of two methods: Random Sample Consensus (RANSAC) or Least Median of Squares (LMedS). Both algorithms can robustly estimate a model despite a large number of outliers. From the resulting feature correspondences a transformation (homography) is calculated. The homography is an invertible transformation that describes a plane-to-plane mapping: x_i' = H * x_i, where x_i' are the matched feature points in the current frame, x_i are the points of the reference frame, and H is the sought transformation. The automatically generated marker image (see figure 1c) is then warped using the determined homography. The position and orientation between the camera and the object are then estimated by a conventional marker library using the warped image.
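A hedged Python/OpenCV sketch of this online step is shown below; it substitutes a brute-force matcher with Lowe’s ratio test for the unspecified matching scheme, and all names are illustrative (passing cv2.LMEDS instead of cv2.RANSAC to findHomography mirrors the second robust method):

    import numpy as np
    import cv2

    def track_frame(frame_gray, surf, ref_kp, ref_des, marker_img):
        """Match SURF features to the reference frame, estimate the homography
        robustly, and warp the generated marker image into the current view."""
        kp, des = surf.detectAndCompute(frame_gray, None)
        if des is None:
            return None
        matches = cv2.BFMatcher(cv2.NORM_L2).knnMatch(ref_des, des, k=2)
        good = [m[0] for m in matches
                if len(m) == 2 and m[0].distance < 0.7 * m[1].distance]
        if len(good) < 4:
            return None                           # a homography needs >= 4 points
        x_ref = np.float32([ref_kp[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
        x_cur = np.float32([kp[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
        # x_cur = H * x_ref; use cv2.LMEDS here for the Least-Median variant
        H, inliers = cv2.findHomography(x_ref, x_cur, cv2.RANSAC, 5.0)
        if H is None:
            return None
        h, w = frame_gray.shape
        warped = cv2.warpPerspective(marker_img, H, (w, h))  # input to marker library
        return H, warped

The warped image returned at the end is what a conventional marker library would consume to recover the camera’s position and orientation, as described above.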

Results
The major impact on performance is caused by feature extraction and the matching of feature descriptors. Therefore, two implementations were evaluated. On a single dual-core CPU (Intel Core2Duo T7200), the tracker achieves 4-5 fps, depending on the number of extracted features. In the second implementation, a GPU-based library for feature detection was used. This configuration improves performance to 14-16 fps (nVidia GeForce 8800 GTX). Concerning accuracy and range of movement, the tracker equals common marker-based tracking libraries, apart from some jitter and infrequent failures of the position estimation. The introduced system is derived from a marker-based system, but in the online stage no marker is visible or needed (compare figure 1). The next step would be to not only track a selected object, but also to measure and track unknown environments, as other Structure-from-Motion methods do.