neighboring pixels, and long-range contextual information [3]. In order to model the local appearance at a pixel, filter banks and visual descriptors are applied to the neighborhood around the pixel, and their responses are used as the input of a classifier to predict the object label. The filter banks, visual descriptors, and classifiers have to be carefully designed to achieve a good balance between high discriminative power and invariance to noise, clutter, and changes in viewpoint and illumination. In order to obtain smooth segmentation results, the label consistency between neighboring pixels needs to be considered. For the segmentation to be consistent with object boundaries, the algorithm should encourage two neighboring pixels to take the same object label if there is no strong edge between them. In addition to smoothness, the likelihood of two object classes being neighbors should also be considered for local consistency. For example, it is more likely for a cup to sit on top of a desk than on a tree. Considering only the appearance of an image patch leads to ambiguities when deciding its class label. For example, a flat white patch could come from a wall, a car, or an airplane. Long-range contextual information of the image may help resolve such ambiguities to some extent. For example, some object classes, such as horses and grass, are likely to coexist in the same images. If it is known that the image shows an outdoor scene, it is more likely to observe sky, grass, and cars than computers, desks, and floors in that image. Local appearance, local consistency, and long-range context can be incorporated in a Conditional Random Field (CRF) model [4], which is widely used in semantic object segmentation.
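To make this concrete, a common (though by no means unique) form of the CRF energy over pixel labels combines a unary term from the local-appearance classifier with a pairwise term for smoothness and class co-occurrence; the potentials below are an illustrative sketch rather than the formulation of any particular method cited here:

E(\mathbf{y}\mid\mathbf{x}) \;=\; \sum_{i}\psi_u(y_i\mid\mathbf{x}) \;+\; \sum_{(i,j)\in\mathcal{N}}\psi_p(y_i,y_j\mid\mathbf{x}),
\qquad \psi_u(y_i\mid\mathbf{x}) = -\log P(y_i\mid f_i),

\psi_p(y_i,y_j\mid\mathbf{x}) = [y_i\neq y_j]\,\Bigl(\theta_{y_i y_j} + \lambda\exp\bigl(-\beta\,\|x_i-x_j\|^2\bigr)\Bigr),

where f_i denotes the filter-bank or descriptor responses at pixel i, P(y_i | f_i) is the output of the local classifier, the contrast-sensitive pairwise term discourages label changes between neighboring pixels i and j unless a strong edge (a large color difference \|x_i - x_j\|) separates them, and \theta_{y_i y_j} encodes how unlikely classes y_i and y_j are to be adjacent. The segmentation is obtained by minimizing E over all labelings, typically with graph cuts or message passing.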
The approaches to semantic object segmentation can be supervised or unsupervised. The supervision at the training stage can be provided at three different levels (a schematic of the corresponding annotation structures is sketched after the list):
Pixel-level: each pixel in an image is manually labeled as one of the object classes.
Mask-level: an object in an image is located by a bounding box and assigned to an object class.
Image-level: the object classes present in an image are annotated without locating or segmenting the objects.
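As a rough illustration of the data each level of supervision provides, the sketch below shows hypothetical annotation structures for a single training image; the array shapes, field names, and class names are purely illustrative and do not correspond to any standard annotation format.

import numpy as np

H, W = 240, 320                                    # hypothetical image size

# Pixel-level: a dense label map with one class index per pixel
# (e.g., 0 = sky, 1 = grass, 2 = horse).
pixel_level = np.zeros((H, W), dtype=np.int32)

# Mask-level: each object is a bounding box plus a class label.
mask_level = [
    {"bbox": (40, 60, 180, 200), "label": "horse"},   # (x_min, y_min, x_max, y_max)
    {"bbox": (0, 150, 319, 239), "label": "grass"},
]

# Image-level: only the set of classes present, with no localization.
image_level = {"horse", "grass", "sky"}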
Most discriminative object segmentation approaches, including CRF-based ones, need pixel-level or mask-level labeling for training, and they can learn the models of object classes more accurately and efficiently. However, with the rapid growth of images and videos in applications such as web-based image and video search, there is an increasing number of object classes to be modeled, and the workload of pixel-level and mask-level labeling is heavy and impractical for a very large number of object classes. In recent years, some generative models, such as topic models borrowed from language processing, have become popular in semantic object segmentation. They are able to learn the models of object classes from a collection of images and videos without supervision, or supervised only by data labeled at the image level, whose labeling cost is much lower. It is also possible to combine CRFs and topic models to integrate the strengths of both types of approaches.
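As one way to see how such a generative model can operate with little or no supervision, the sketch below applies latent Dirichlet allocation (here via scikit-learn) to bag-of-visual-words histograms, treating each image as a document and each quantized local descriptor as a word; a patch can then be assigned to the topic that best explains its visual word, which serves as a coarse, unsupervised grouping. The data and parameters are placeholders, and this is not the specific model of any approach discussed here.

import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Assume each image has already been converted to a bag of visual words:
# word_counts[d, w] = number of occurrences of visual word w in image d.
n_images, vocab_size, n_topics = 500, 1000, 20
rng = np.random.default_rng(0)
word_counts = rng.poisson(0.5, size=(n_images, vocab_size))    # placeholder counts

lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
doc_topic = lda.fit_transform(word_counts)     # P(topic | image), shape (n_images, n_topics)

# topic_word[k, w] is proportional to P(word w | topic k); each topic groups visual
# words that co-occur across images and often corresponds to an object class.
topic_word = lda.components_ / lda.components_.sum(axis=1, keepdims=True)

# A patch carrying visual word w in image d is assigned to the topic maximizing
# P(topic | image d) * P(word w | topic), yielding a coarse label for that patch.
patch_topic = np.argmax(doc_topic[:, :, None] * topic_word[None, :, :], axis=1)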
A typical pipeline of semantic object segmentation is shown in Fig. 3.2. Filter banks or visual descriptors are first applied to images to capture the local appearance
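As one possible realization of this first stage (the filters, scales, and number of clusters below are illustrative choices rather than those of Fig. 3.2), a small bank of Gaussian, Gaussian-derivative, and Laplacian filters can be applied at every pixel and the stacked responses quantized into visual words (textons) with k-means:

import numpy as np
from scipy import ndimage
from sklearn.cluster import KMeans

def filter_bank_responses(gray):
    """Stack responses of a small, illustrative filter bank at every pixel."""
    responses = []
    for sigma in (1.0, 2.0, 4.0):
        responses.append(ndimage.gaussian_filter(gray, sigma))                  # smoothed intensity
        responses.append(ndimage.gaussian_filter(gray, sigma, order=(0, 1)))    # horizontal derivative
        responses.append(ndimage.gaussian_filter(gray, sigma, order=(1, 0)))    # vertical derivative
        responses.append(ndimage.gaussian_laplace(gray, sigma))                 # blob/edge response
    return np.stack(responses, axis=-1)            # shape (H, W, n_filters)

gray = np.random.rand(120, 160)                    # placeholder grayscale image
resp = filter_bank_responses(gray)
feat = resp.reshape(-1, resp.shape[-1])            # one descriptor per pixel
textons = KMeans(n_clusters=32, n_init=10, random_state=0).fit_predict(feat).reshape(gray.shape)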