neighboring pixels, and long-range contextual information [3]. In order to model the local appearance at a pixel, filter banks and visual descriptors are applied to the neighborhood around the pixel, and their responses are used as the input of a classifier to predict the object label. The filter banks, visual descriptors, and classifiers have to be carefully designed to achieve a good balance between high discriminative power and invariance to noise, clutter, and changes in viewpoint and illumination. In order to obtain smooth segmentation results, the label consistency between neighboring pixels needs to be considered. For the segmentation to be consistent with object boundaries, the algorithm should encourage two neighboring pixels to take the same object label if there is no strong edge between them. In addition to smoothness, the likelihood of two object classes being neighbors should also be considered for local consistency. For example, it is more likely for a cup to sit on top of a desk than on a tree. Considering only the appearance of an image patch leads to ambiguities when deciding its class label. For example, a flat white patch could come from a wall, a car, or an airplane. Long-range contextual information of the image may help resolve such ambiguities to some extent. For example, some object classes, such as horses and grass, are likely to coexist in the same images. If it is known that the image shows an outdoor scene, it is more likely to observe sky, grass, and cars than computers, desks, and floors in that image. Local appearance, local consistency, and long-range context can be incorporated in a Conditional Random Field (CRF) model [4], which is widely used in semantic object segmentation.
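To make this concrete, a common (though by no means unique) form of the CRF energy over pixel labels combines a unary term from the local-appearance classifier with a pairwise term for smoothness and class co-occurrence; the potentials below are an illustrative sketch rather than the formulation of any particular method cited here:

E(\mathbf{y}\mid\mathbf{x}) \;=\; \sum_{i}\psi_u(y_i\mid\mathbf{x}) \;+\; \sum_{(i,j)\in\mathcal{N}}\psi_p(y_i,y_j\mid\mathbf{x}),
\qquad \psi_u(y_i\mid\mathbf{x}) = -\log P(y_i\mid f_i),

\psi_p(y_i,y_j\mid\mathbf{x}) = [y_i\neq y_j]\,\Bigl(\theta_{y_i y_j} + \lambda\exp\bigl(-\beta\,\|x_i-x_j\|^2\bigr)\Bigr),

where f_i denotes the filter-bank or descriptor responses at pixel i, P(y_i | f_i) is the output of the local classifier, the contrast-sensitive pairwise term discourages label changes between neighboring pixels i and j unless a strong edge (a large color difference \|x_i - x_j\|) separates them, and \theta_{y_i y_j} encodes how unlikely classes y_i and y_j are to be adjacent. The segmentation is obtained by minimizing E over all labelings, typically with graph cuts or message passing.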
The approaches to semantic object segmentation can be supervised or unsupervised. The supervision at the training stage can be provided at three different levels (a schematic of the corresponding annotation structures is sketched after the list):
Pixel-level: each pixel in an image is manually labeled as one of the object classes.
Mask-level: an object in an image is located by a bounding box and assigned to an object class.
Image-level: the object classes present in an image are annotated without locating or segmenting the objects.
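As a rough illustration of the data each level of supervision provides, the sketch below shows hypothetical annotation structures for a single training image; the array shapes, field names, and class names are purely illustrative and do not correspond to any standard annotation format.

import numpy as np

H, W = 240, 320                                    # hypothetical image size

# Pixel-level: a dense label map with one class index per pixel
# (e.g., 0 = sky, 1 = grass, 2 = horse).
pixel_level = np.zeros((H, W), dtype=np.int32)

# Mask-level: each object is a bounding box plus a class label.
mask_level = [
    {"bbox": (40, 60, 180, 200), "label": "horse"},   # (x_min, y_min, x_max, y_max)
    {"bbox": (0, 150, 319, 239), "label": "grass"},
]

# Image-level: only the set of classes present, with no localization.
image_level = {"horse", "grass", "sky"}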
Most discriminative object segmentation approaches, including CRF-based ones, need pixel-level or mask-level labeling for training, and they can learn the models of object classes more accurately and efficiently. However, with the rapid growth of images and videos in applications such as web-based image and video search, there is an increasing number of object classes to be modeled, and the workload of pixel-level and mask-level labeling is heavy and impractical for a very large number of object classes. In recent years, some generative models, such as topic models borrowed from language processing, have become popular in semantic object segmentation. They are able to learn the models of object classes from a collection of images and videos without supervision, or supervised only by data labeled at the image level, whose labeling cost is much lower. It is also possible to combine CRFs and topic models to integrate the strengths of both types of approaches.
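As one way to see how such a generative model can operate with little or no supervision, the sketch below applies latent Dirichlet allocation (here via scikit-learn) to bag-of-visual-words histograms, treating each image as a document and each quantized local descriptor as a word; a patch can then be assigned to the topic that best explains its visual word, which serves as a coarse, unsupervised grouping. The data and parameters are placeholders, and this is not the specific model of any approach discussed here.

import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Assume each image has already been converted to a bag of visual words:
# word_counts[d, w] = number of occurrences of visual word w in image d.
n_images, vocab_size, n_topics = 500, 1000, 20
rng = np.random.default_rng(0)
word_counts = rng.poisson(0.5, size=(n_images, vocab_size))    # placeholder counts

lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
doc_topic = lda.fit_transform(word_counts)     # P(topic | image), shape (n_images, n_topics)

# topic_word[k, w] is proportional to P(word w | topic k); each topic groups visual
# words that co-occur across images and often corresponds to an object class.
topic_word = lda.components_ / lda.components_.sum(axis=1, keepdims=True)

# A patch carrying visual word w in image d is assigned to the topic maximizing
# P(topic | image d) * P(word w | topic), yielding a coarse label for that patch.
patch_topic = np.argmax(doc_topic[:, :, None] * topic_word[None, :, :], axis=1)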
A typical pipeline of semantic object segmentation is shown in Fig. 3.2. Filter banks or visual descriptors are first applied to images to capture the local appearance
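As one possible realization of this first stage (the filters, scales, and number of clusters below are illustrative choices rather than those of Fig. 3.2), a small bank of Gaussian, Gaussian-derivative, and Laplacian filters can be applied at every pixel and the stacked responses quantized into visual words (textons) with k-means:

import numpy as np
from scipy import ndimage
from sklearn.cluster import KMeans

def filter_bank_responses(gray):
    """Stack responses of a small, illustrative filter bank at every pixel."""
    responses = []
    for sigma in (1.0, 2.0, 4.0):
        responses.append(ndimage.gaussian_filter(gray, sigma))                  # smoothed intensity
        responses.append(ndimage.gaussian_filter(gray, sigma, order=(0, 1)))    # horizontal derivative
        responses.append(ndimage.gaussian_filter(gray, sigma, order=(1, 0)))    # vertical derivative
        responses.append(ndimage.gaussian_laplace(gray, sigma))                 # blob/edge response
    return np.stack(responses, axis=-1)            # shape (H, W, n_filters)

gray = np.random.rand(120, 160)                    # placeholder grayscale image
resp = filter_bank_responses(gray)
feat = resp.reshape(-1, resp.shape[-1])            # one descriptor per pixel
textons = KMeans(n_clusters=32, n_init=10, random_state=0).fit_predict(feat).reshape(gray.shape)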