Segmentation in Video Data (Introduction to Video and Image Processing) Part 2

Background Subtraction

Background subtraction is a simple yet efficient method of extracting an object in a scene. This is especially true if the background can be designed to be uniform. In indoor and controlled setups this is indeed realistic, but for more complicated scenarios other methods might be necessary. Even in the case of a controlled setup, two issues must be considered:

1.    Is the background really constant?

2.    How do we define the threshold value used to binarize the difference image?

When you point a video camera at a static scene, for example a wall, the images seem the same. Very often, however, they are not. The primary reason is that artificial lighting seldom produces constant illumination. Furthermore, if sunlight enters the scene, it too contributes to the non-constant illumination due to the randomness associated with the incoming light rays. The effect is illustrated in Fig. 8.5. To the left, an image from a static scene is shown. To the right, two histograms are shown: the first is based on the pixel values at position #1 over a few seconds, and the second likewise for position #2. If the images were actually the same, each histogram would contain only one non-zero bin. As can be seen, this is not the case, and in general no such thing as a static background exists.
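To see this effect for yourself, the sketch below records one pixel position over a few seconds of video and histograms its values. OpenCV is assumed for capture; the file name, pixel position, and frame count are arbitrary placeholders, not values from the text.

```python
import cv2
import numpy as np

cap = cv2.VideoCapture("static_scene.avi")  # hypothetical file name
px, py = 120, 80                            # pixel position to observe
values = []

for _ in range(150):                        # roughly 5 seconds at 30 fps
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    values.append(int(gray[py, px]))

cap.release()
hist, _ = np.histogram(values, bins=256, range=(0, 256))
print("non-zero bins:", np.count_nonzero(hist))  # > 1 even for a 'static' scene
```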

Say that the pixel at position #2 in the first image of the video sequence has a value of 80 (not very likely according to the histogram, but nevertheless possible). If the first image is used as the reference image, then typical background values (around 100 according to the histogram) will result in a difference of around 20. Depending on the threshold value, this could actually be interpreted as an object in the scene, since it seems different from the reference image. This is obviously not desirable, and each pixel in the reference image should therefore be calculated as the mean of the first N images. The reference image at this particular position will then be around 100, which is much more appropriate according to the histogram. So, to make background subtraction more robust, the first few seconds of processing should be spent calculating a good reference image.
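A minimal sketch of this reference-image estimate, assuming the first N empty-scene gray-scale frames have been collected into an (N, H, W) numpy array (the name and shape are this sketch's assumptions, not fixed by the text):

```python
import numpy as np

def mean_reference(frames):
    """Per-pixel mean of the first N frames, captured while the scene is empty.
    frames: (N, H, W) uint8 array; returns a float32 reference image."""
    return frames.astype(np.float32).mean(axis=0)
```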


Sometimes the background changes during processing, for example due to the changing position of the sun during the day, or because the illumination sources change, e.g., they are accidentally moved. In such situations a new reference image should be calculated. But how do we detect that this has happened? One way is, of course, if we can see that the performance of the system degrades. An automatic way is to gradually change the value of each pixel in the reference image in the following way:

$$r(x, y) \leftarrow \alpha \cdot r(x, y) + (1 - \alpha) \cdot f(x, y)$$

where r(x, y) is the reference image, f(x, y) is the current image, and α is a weighting factor that defines how fast the reference image is updated. The value of α depends on the application, but a typical value is α = 0.95.
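A direct translation of this update rule into numpy might look as follows (the float32 gray-scale array types are this sketch's assumptions):

```python
import numpy as np

ALPHA = 0.95  # typical weighting factor from the text

def update_reference(r, f, alpha=ALPHA):
    """Gradual update: r <- alpha * r + (1 - alpha) * f.
    r: reference image, f: current image, both float32 gray-scale arrays."""
    return alpha * r + (1.0 - alpha) * f
```

With α close to 1, old observations dominate and the reference adapts slowly; smaller values let it track illumination changes faster.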

Defining the Threshold Value

As for any other threshold operation, defining the actual threshold value is a trade-off between false positives and false negatives.

It is important to notice that Eq. 8.3 is actually based on the assumption that the histograms for different pixel positions are similar and only differ in their mean values. That is, it is assumed that the variation in the histograms is similar. In order to understand the implications of this assumption, let us have a closer look at the bottom histogram in Fig. 8.5 together with Eq. 8.3. Say we set the threshold value to 25. This means that an object in an image needs to have a value below 75 or above 125 in order to be segmented as an object pixel and not a background pixel. This seems fine. But then have a look at the top histogram in Fig. 8.5. Clearly this histogram has a larger variation, and applying a threshold of 25 will result in incorrect segmentation of pixel values in the intervals [150, 175] and [225, 255].

In many situations different histograms will occur simply because different parts of the scene are exposed to different illumination conditions, which yields histograms with different variations. For example, some parts of the background might move slightly (due to a draft, for instance), and this will create a larger variation. So to sum up the above, the problem is that each position in the image is associated with the same global threshold value.

The solution to this potential problem is to have a unique threshold value for each pixel position! Finding these manually is not realistic, simply due to the number of pixels, and the threshold values are therefore found automatically by use of the standard deviation for each pixel position. So when the mean of each pixel is calculated, so is the standard deviation. Equation 8.3 is therefore reformulated as

$$g(x, y) = \begin{cases} 1, & \text{if } |f(x, y) - r(x, y)| > \beta \cdot \sigma(x, y) \\ 0, & \text{otherwise} \end{cases}$$

where g(x, y) is the binarized output, β is a scaling factor, and σ(x, y) is the standard deviation at the position (x, y). Since β is the same for every position, we have no more parameters to define than above, but now the thresholding is done with respect to the actual data, hence a local threshold.
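Putting the training step and the per-pixel threshold together, a sketch could look like this; the function names and the (N, H, W) training array are this sketch's assumptions, and β = 3.0 is an illustrative choice, not a value prescribed by the text:

```python
import numpy as np

def train_model(frames):
    """Per-pixel mean and standard deviation from N empty-scene frames.
    frames: (N, H, W) uint8 array of training images."""
    f = frames.astype(np.float32)
    return f.mean(axis=0), f.std(axis=0)

def segment(f, r, sigma, beta=3.0):
    """Object pixel where |f - r| > beta * sigma(x, y).
    A small floor on sigma avoids a zero threshold at completely flat pixels."""
    diff = np.abs(f.astype(np.float32) - r)
    return (diff > beta * np.maximum(sigma, 1e-3)).astype(np.uint8) * 255
```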

Image Differencing

If the assumption of a static background is violated significantly, then background subtraction will produce incorrect results. In such situations we can apply image differencing to detect changes in a scene. As stated above, image differencing operates like background subtraction, except that the reference image is now a previous image.
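A minimal sketch of image differencing, where the previous frame serves as the reference (the int16 casts avoid uint8 wrap-around when subtracting; T = 25 is an illustrative choice):

```python
import numpy as np

def image_difference(current, previous, T=25):
    """Binarized absolute difference between the current and a previous frame."""
    diff = np.abs(current.astype(np.int16) - previous.astype(np.int16))
    return (diff > T).astype(np.uint8) * 255
```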

Image differencing is simple and can efficiently measure changes in the image. Unfortunately the method has two problems. The first is an inability to detect new objects that are not in motion. Say a new object enters the scene. As long as the object moves, image differencing detects this in the image subtraction process, but if the object stops moving, the reference image will be equal to the current image and hence nothing is detected. This is a clear weakness compared to background subtraction, which is indifferent to whether the new object is moving or not, as long as the appearance of the object differs from the background.

The other problem associated with image differencing is the notion of ghost objects illustrated in Fig. 8.6. The figure contains artificial images from a sequence where an object is moving horizontally through a scene. To make it simple, the object is a square with uniform gray-scale value. What can be seen is that the image differencing produces two segments (smaller objects). One originates from the current object and the other one from the object in the reference image—where the object was. This latter segment type is denoted a ghost object, since no object is present. A ghost object can also be seen in Fig. 8.4.

If the goal is only to obtain the coarse motion in the image, then this does not matter. If, however, we are interested in the position of the object in the current image, then we need to remove ghost objects. One approach is to use the moving direction of the object, if known: we can then infer which segment is the object and which is the ghost. Another approach applies if we know that the object is always brighter than the background: the pixels belonging to the ghost will then have negative values after the image subtraction.
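A sketch of this second approach, keeping only the positive side of the difference under the stated brighter-than-background assumption (function name and T = 25 are this sketch's choices):

```python
import numpy as np

def remove_ghosts(current, reference, T=25):
    """Keep only true object pixels, assuming the object is brighter than the
    background: ghost pixels (object missing where it used to be) produce
    negative differences and are discarded by the one-sided threshold."""
    diff = current.astype(np.int16) - reference.astype(np.int16)
    return (diff > T).astype(np.uint8) * 255
```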


Fig. 8.6 Image differencing. The effect of changing the reference image

It should also be noticed that when the object overlaps in the reference and current image, we only detect a part of the object, as seen in Fig. 8.6. If we know the size and speed of the object, we can calculate how much time there should be between the reference image and the current image to avoid overlap. In other words, the reference image need not be the previous image (T = −1); it can also be, for example, T = −5, see Fig. 8.6.
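A sketch of differencing against such a delayed reference, using a small frame buffer; the generator structure is this sketch's choice, and the delay of 5 frames mirrors the T = −5 example:

```python
from collections import deque
import numpy as np

def differencing_with_delay(frame_iter, delay=5, T=25):
    """Yield binary change masks, comparing each frame against the frame
    `delay` steps back instead of the immediately previous one."""
    buf = deque(maxlen=delay)
    for f in frame_iter:
        f = f.astype(np.int16)
        if len(buf) == delay:
            ref = buf[0]  # oldest buffered frame, i.e. T = -delay
            yield (np.abs(f - ref) > T).astype(np.uint8) * 255
        buf.append(f)
```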

Further Information

Video compression has for a long time been a cornerstone in video acquisition and has allowed for the transmission and storage of video data. Video compression is a research field in its own right and contains many more aspects than the basics presented in this topic. As hardware and software have matured, it has become possible to capture, transmit and store larger and larger amounts of video data. But even with today's fast computers, clever transmission systems, and huge storage facilities, the handling of video data can still be too demanding, and a reduced frame rate/resolution/quality is necessary. To appreciate this fact, just imagine the amount of video data captured, transmitted and stored in a surveillance setup with, for example, 100 cameras.

Background subtraction can be a powerful ally when it comes to segmenting objects in a scene. The method, however, has some built-in limitations that are exposed especially when processing video of outdoor scenes. First of all, the method requires the background to be empty when learning the background model for each pixel. This can be a challenge in a natural scene where moving objects may always be present. One solution is to median filter all training samples for each pixel. This will eliminate samples where an object is moving through the scene, and the resulting model of the pixel will be a true background pixel. An extension is to first order all training pixels (as done in the median filter) and then calculate the average of the pixels closest to the median. This provides both a mean and a variance per pixel. Such approaches assume that each pixel is covered by objects less than half the time in the training period.
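Both the median model and its ordered-average extension are straightforward to sketch in numpy, assuming the training frames form an (N, H, W) array; keep = 0.5, i.e. averaging the central half of the sorted samples, is an assumed choice of fraction:

```python
import numpy as np

def median_background(frames):
    """Per-pixel median of the training frames: objects covering a pixel
    less than half the time are suppressed."""
    return np.median(frames.astype(np.float32), axis=0)

def median_mean_background(frames, keep=0.5):
    """Extension from the text: sort each pixel's samples and average the
    fraction closest to the median, giving a mean and a standard deviation
    per pixel."""
    x = np.sort(frames.astype(np.float32), axis=0)
    n = x.shape[0]
    lo = int(round(n * (0.5 - keep / 2.0)))
    hi = max(lo + 1, int(round(n * (0.5 + keep / 2.0))))
    central = x[lo:hi]
    return central.mean(axis=0), central.std(axis=0)
```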

Another problem that is especially apparent when processing outdoor video is the fact that a pixel may cover more than one background. Say we have a background pixel from a gray road. Imagine now that the wind sometimes blows so that a leaf covers the same pixel. This will result in two very different backgrounds for this pixel: a greenish color and a grayish color. If we find the mean for this pixel, we will end up with something in between green and gray, with a huge variance. This will result in poor segmentation of this pixel during background subtraction. A better approach is therefore to define two different background models for this pixel, one for the leaf and one for the road; see [12, 18] for specific examples and [9] for a general discussion.
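As a much simplified illustration of this multi-modal idea (the real methods in [12, 18] are considerably more elaborate), a pixel could be tested against each of its stored modes; the means and standard deviations per mode are assumed to have been learnt separately, e.g., by clustering the training samples:

```python
import numpy as np

def is_background(value, mode_means, mode_stds, beta=3.0):
    """A pixel is background if it is close to any of its stored modes
    (e.g., road and leaf); beta as in the per-pixel threshold above."""
    mode_means = np.asarray(mode_means, dtype=np.float32)
    mode_stds = np.asarray(mode_stds, dtype=np.float32)
    return bool(np.any(np.abs(value - mode_means) <= beta * mode_stds))

# Illustrative numbers only: road mode around 120, leaf mode around 60
print(is_background(118, [120.0, 60.0], [5.0, 8.0]))  # True (matches road)
print(is_background(90,  [120.0, 60.0], [5.0, 8.0]))  # False -> object pixel
```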

Yet another problem in outdoor video is shadows due to strong sunlight. Such shadow pixels can easily appear different from the learnt background model and hence be incorrectly classified as object pixels. Different approaches can be followed in order to avoid such misclassifications. First of all, a background pixel in shadow tends to have the same color as when not in shadow, just darker. A more detailed version of this idea is based on the notion that when a pixel is in shadow it often means that it is not exposed to direct sunlight, but rather illuminated by the sky. And since the sky tends to be more bluish, the color of a background pixel in shadow can be expected to be more bluish too. Secondly, one can group neighboring object pixels together and analyze the layout of the edges within that region. If that layout is similar to the layout of the edges in the background model, then the region is likely to be a shadow and not an object. For more information please refer to [6, 15].
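A crude per-pixel shadow test along these lines might combine the darker-but-same-color cue with the bluish cue. All thresholds below are assumed, illustrative values, not taken from [6, 15]:

```python
import numpy as np

def shadow_mask(frame_bgr, background_bgr, dark_lo=0.4, dark_hi=0.9):
    """Heuristic shadow test: a shadow pixel is darker than the background
    by a bounded factor, keeps roughly the same chromaticity, and does not
    become less blue. Inputs: float32 (H, W, 3) arrays in BGR order."""
    eps = 1e-6
    # darker than the background, but not black (which suggests an object)
    ratio = frame_bgr.sum(axis=2) / (background_bgr.sum(axis=2) + eps)
    darker = (ratio > dark_lo) & (ratio < dark_hi)
    # chromaticity (color independent of brightness) should change little
    chrom_f = frame_bgr / (frame_bgr.sum(axis=2, keepdims=True) + eps)
    chrom_b = background_bgr / (background_bgr.sum(axis=2, keepdims=True) + eps)
    same_color = np.abs(chrom_f - chrom_b).sum(axis=2) < 0.1
    # shadows lit by the sky tend to be slightly more bluish (BGR channel 0)
    more_blue = chrom_f[..., 0] >= chrom_b[..., 0] - 0.02
    return darker & same_color & more_blue
```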
