Tracking (Introduction to Video and Image Processing) Part 2

Good Features to Track

Instead of focusing only on position when tracking objects, we can also include the features used to classify the different objects. This means we are combining the matching problem described above with the feature classification problem discussed in Chap. 7. In practice we base the matching on the approach from Sect. 7.3 and simply add the x- and y-positions of the object as two additional features. The binary table in Fig. 9.8(b) is then replaced by a table where each entry indicates the distance from a predicted object to a detected object. The uncertainties related to the predicted and detected objects could, and ideally should, be incorporated as weights, as discussed in Sect. 7.4. To binarize this new table each entry is thresholded, after which we can apply the same matching mechanisms as described above.
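The distance table and its thresholding can be sketched as follows. This is a minimal illustration, not the method from Sect. 7.3 or 7.4: the weighted Euclidean distance, the feature layout (x, y, area), the weights, and the threshold are all assumptions chosen for the example.

```python
import math

def distance_table(predicted, detected, weights):
    """Weighted Euclidean distance from every predicted object to every detected object."""
    table = []
    for p in predicted:
        row = []
        for d in detected:
            squared = sum(w * (pi - di) ** 2 for w, pi, di in zip(weights, p, d))
            row.append(math.sqrt(squared))
        table.append(row)
    return table

def binarize(table, threshold):
    """1 marks a possible match (distance at or below the threshold), 0 marks no match."""
    return [[1 if dist <= threshold else 0 for dist in row] for row in table]

# Hypothetical feature vectors: (x, y, area). Position is simply two extra features.
predicted = [(10.0, 20.0, 100.0), (50.0, 60.0, 400.0)]
detected = [(12.0, 21.0, 110.0), (48.0, 63.0, 390.0)]
weights = (1.0, 1.0, 0.01)  # assumed weights; area is down-weighted to balance its larger range

table = distance_table(predicted, detected, weights)
binary = binarize(table, threshold=5.0)
print(binary)
```

The resulting binary table has the same shape as the one in Fig. 9.8(b), so it can be passed to whatever matching step is already in place.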

When tracking objects we can of course use any of the features described in Chap. 7. But when it comes to tracking multiple objects we usually require more detailed features. Below we describe two approaches, namely color-based and texture-based.

Fig. 9.9 A color histogram with ten bins and how an object will be represented using the color histogram bins as features

The average color of an object can be a strong feature, as it is relatively independent of how the shape and size of the object change. Also, if a color space is used in which intensity and chromaticity are separated, the color feature is relatively robust to changes in the lighting. Sometimes an object contains multiple colors, and the average may not be the best way to represent such an object. Instead a color histogram can be used. Whichever color space is used, the histograms of the different color components are concatenated, resulting in one histogram. The histogram is then normalized so that the sum of all bins is equal to one. This makes the color histogram invariant to the scale of the object. To reduce the number of features, the resolution of the histogram bins is usually coarse. An example of a color histogram with ten bins, i.e., ten features, can be seen in Fig. 9.9.
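A concatenated, normalized color histogram could be computed roughly as below. The function name, the choice of four bins per channel, and the representation of an object as a list of (R, G, B) tuples are assumptions made for this sketch; the pixel list is assumed to be non-empty.

```python
def color_histogram(pixels, bins_per_channel=4, max_value=256):
    """Concatenate one coarse histogram per color channel and normalize the result to sum to one."""
    n_channels = len(pixels[0])
    hist = [0.0] * (n_channels * bins_per_channel)
    bin_width = max_value / bins_per_channel
    for pixel in pixels:
        for c, value in enumerate(pixel):
            b = min(int(value / bin_width), bins_per_channel - 1)
            hist[c * bins_per_channel + b] += 1
    total = sum(hist)
    return [h / total for h in hist]

# The same object seen at two sizes: the pixel count differs, the color mix does not.
small = [(255, 0, 0), (250, 10, 5), (0, 0, 255), (10, 5, 240)]
large = small * 25

h_small = color_histogram(small)
h_large = color_histogram(large)
print(len(h_small), sum(h_small))
```

Because of the normalization, `h_small` and `h_large` describe the object with the same feature values even though one contains 25 times as many pixels.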

While a color histogram is a better representation than the average color, it does not contain any information about the spatial distribution of the different colors. Another approach is therefore to divide the object into a number of regions (usually with horizontal dividers) and then represent each region by its average color (or color histogram). This approach is obviously sensitive to object rotation, and care should therefore be taken before applying it.
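The region-based representation might look like this in code. It is a sketch under assumptions: the object is given as a list of pixel rows, the regions are equal-height horizontal bands, and the object has at least as many rows as regions.

```python
def region_mean_colors(rows, n_regions=3):
    """Split the object into horizontal bands and return the average color of each band."""
    height = len(rows)
    features = []
    for r in range(n_regions):
        start = r * height // n_regions
        end = (r + 1) * height // n_regions
        band = [pixel for row in rows[start:end] for pixel in row]
        n_channels = len(band[0])
        mean = tuple(sum(p[c] for p in band) / len(band) for c in range(n_channels))
        features.append(mean)
    return features

# Hypothetical object: red upper half, blue lower half (e.g. a person's shirt and trousers).
upright = [[(200, 0, 0)] * 2] * 2 + [[(0, 0, 200)] * 2] * 2
flipped = upright[::-1]  # the same object rotated by 180 degrees

print(region_mean_colors(upright, n_regions=2))
print(region_mean_colors(flipped, n_regions=2))
```

The two printed feature lists contain the same colors in opposite order, which illustrates the rotation sensitivity noted above: a plain color histogram of the two versions would be identical.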

As mentioned above, the framerate will often be high compared to the movement of the object, and it can therefore be assumed that the object does not change significantly from image to image. Inspired by this notion we can simply represent the object by its pixels and try to find the object again in the next image using template matching, see Sect. 5.2.1. For this to work the object (or a part of it) needs to be represented by a rectangle, but more importantly it is assumed that this rectangle is unique compared to the surroundings. Uniqueness here means the rectangle contains texture (the more the better) which is not repeated in the background. The level of textureness can be investigated by looking at the amount of edges in the rectangle. If many strong edges are present with different orientations, then there is a high likelihood that the rectangle is unique and can be found again in the next image. One concrete way of measuring this is to correlate the rectangle with the two Sobel kernels from Sect. 5.2.2. This will produce two edge images. For the first edge image, the absolute value of each edge pixel is found, all these values are summed, and the sum is compared with a threshold value. We do the same for the other edge image, and if both sums are above the threshold value the rectangle is concluded to contain a high level of textureness and hence to be a good template to track.
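The textureness test described above can be sketched in a few lines. The standard 3x3 Sobel kernels are used; the function names, the choice to skip the one-pixel border, and the threshold value are assumptions made for this example.

```python
SOBEL_X = ((-1, 0, 1), (-2, 0, 2), (-1, 0, 1))    # responds to vertical edges
SOBEL_Y = ((-1, -2, -1), (0, 0, 0), (1, 2, 1))    # responds to horizontal edges

def edge_sum(patch, kernel):
    """Correlate a grayscale patch with a 3x3 kernel and sum the absolute responses."""
    height, width = len(patch), len(patch[0])
    total = 0
    for y in range(1, height - 1):          # border pixels are skipped
        for x in range(1, width - 1):
            response = sum(kernel[j][i] * patch[y + j - 1][x + i - 1]
                           for j in range(3) for i in range(3))
            total += abs(response)
    return total

def is_good_template(patch, threshold):
    """Both edge sums must exceed the threshold for the patch to count as textured."""
    return edge_sum(patch, SOBEL_X) > threshold and edge_sum(patch, SOBEL_Y) > threshold

# Three hypothetical 6x6 patches: uniform, a single vertical edge, and a corner.
flat = [[100] * 6 for _ in range(6)]
stripe = [[255 if x >= 3 else 0 for x in range(6)] for y in range(6)]
corner = [[255 if x >= 3 and y >= 3 else 0 for x in range(6)] for y in range(6)]

for name, patch in (("flat", flat), ("stripe", stripe), ("corner", corner)):
    print(name, is_good_template(patch, threshold=1000))
```

Only the corner patch passes. The stripe patch has strong edges but in one orientation only, so it could slide along the edge without the match changing, which is why both sums are required to be above the threshold.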

No matter which of these features are applied in tracking, care should be taken when combining them with the position and/or other features in order to ensure the different features are scaled properly, see Sect. 7.3. Another important issue is that the model for a particular object is very likely to change over time and should therefore be updated from image to image. The simple solution is to replace the model with the detection, but this is dangerous since the detection could be incorrect. A gradual update scheme, like in Eq. 9.3, is therefore suggested.
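A gradual update is commonly written as a weighted blend of the old model and the new detection. The sketch below assumes that form; Eq. 9.3 is not reproduced in this part, so the exact expression, the name `alpha`, and its value are assumptions.

```python
def update_model(model, detection, alpha=0.1):
    """Blend the new detection into the model; alpha=1 would replace the model outright."""
    return [(1.0 - alpha) * m + alpha * d for m, d in zip(model, detection)]

# Hypothetical two-bin color histogram model and a detection that differs strongly from it.
model = [0.5, 0.5]
detection = [1.0, 0.0]

updated = update_model(model, detection, alpha=0.2)
print(updated)
```

With a small `alpha`, a single incorrect detection only shifts the model slightly, while a real, persistent change in appearance is still absorbed over a number of images. If both inputs are normalized histograms, the blended result also sums to one.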

Further Information

An excellent way of implementing the predict-match-update framework is through the Kalman filter [19]. It does not cover the detection and matching blocks, but it has built-in mechanisms for updating the state based on the detections, the predictions and the related uncertainties. When it comes to tracking noisy detections, a family of methods exists which do not only predict where the objects are most likely to be, but also predict a number of likely hypotheses and maintain those over time. Such methods are known as Particle Filters, the Condensation algorithm, Sequential Monte Carlo filtering, or Multiple Hypothesis filters. One place to start a journey into such methods is [11].
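To give a feel for how a Kalman filter weighs a prediction against a detection, here is a deliberately reduced sketch: a single scalar state (for example one coordinate) with a static motion model. A practical tracker would normally use a vector state with velocity; the function names, noise variances `q` and `r`, and the measurements are assumptions for illustration only.

```python
def kalman_predict(x, p, q):
    """Static-position prediction: the estimate is kept, its variance grows by the process noise q."""
    return x, p + q

def kalman_update(x, p, z, r):
    """Correct the prediction with detection z, whose measurement noise variance is r."""
    k = p / (p + r)              # gain: how much the detection is trusted relative to the prediction
    x_new = x + k * (z - x)
    p_new = (1.0 - k) * p
    return x_new, p_new

# Hypothetical noisy detections of a position close to 1.0, starting from an uncertain guess of 0.0.
x, p = 0.0, 1.0
for z in (1.2, 0.9, 1.1):
    x, p = kalman_predict(x, p, q=0.01)
    x, p = kalman_update(x, p, z, r=0.5)
    print(round(x, 3), round(p, 3))
```

The gain `k` plays the role of the uncertainty-based weighting mentioned above: an uncertain prediction (large `p`) moves the estimate strongly towards the detection, while a noisy detection (large `r`) is largely ignored.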

Color features can be improved by also including information about position. One such method is the color correlogram [20]. But when it comes to more advanced tracking, texture is often preferred over color. A good tracking framework based on texture is the KLT-tracker [16]. It finds candidate rectangles containing a high level of texture and tracks these rectangles over time. The rectangles are small and a number of these should therefore be used to track a large object. The tracker detects when the texture of a particular rectangle has changed too much compared to when it was initiated and the tracker then reinitializes a new rectangle to be tracked.

If the texture changes too much between two images, template matching-based methods will not suffice and more advanced methods are required. A good example is the SIFT algorithm [13]. It represents the pixels in a rectangle by their gradient information. This is done in a clever way making the representation invariant to rotation and scale. In Fig. 9.10 an example is shown where the object is standing still, but the camera is moving. This is equivalent to when the camera is fixed and the object is moving. The SIFT algorithm is here used to find and track 100 points between two images. Note that such approaches often refer to the process of relocating features as finding the correspondence rather than tracking.
