Semantic Object Segmentation - Video Segmentation and Its Applications


Fig. 3.2 Typical steps of semantic object segmentation. They are done over image pixels, patches, or oversegmented superpixels

objects. Their responses are typically quantized into textons or visual words according to codebooks learned in a supervised or unsupervised way. The histograms of textons or visual words are used as input to a classifier that predicts object class labels. To capture local consistency and long-range contextual information, CRFs or generative models are combined with the local classifiers. These steps can be carried out on image pixels, patches, or oversegmented superpixels. Many different techniques have been developed to improve each of the three steps. We will review these techniques and discuss the major challenges for these steps. In recent years, benchmark databases such as PASCAL VOC 2007 [5], PASCAL VOC 2008 [6], PASCAL VOC 2009 [1], LabelMe [7], LHI [8], and MSRC21 [2] were published to evaluate the performance of different semantic object segmentation approaches.
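The quantization and histogram steps described above can be illustrated with a minimal sketch. It assumes the codebook has already been learned (for example by clustering), uses nearest-neighbor assignment under squared Euclidean distance, and works on toy two-dimensional descriptors. The function names `nearest_codeword` and `visual_word_histogram` are hypothetical and chosen only for this illustration; real systems differ in descriptor type, codebook size, and assignment rule.

```python
from typing import List, Sequence


def nearest_codeword(descriptor: Sequence[float],
                     codebook: Sequence[Sequence[float]]) -> int:
    """Return the index of the codebook entry closest to the descriptor.

    Closeness is squared Euclidean distance (an assumption of this sketch).
    """
    best_index, best_dist = 0, float("inf")
    for index, codeword in enumerate(codebook):
        dist = sum((d - c) ** 2 for d, c in zip(descriptor, codeword))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def visual_word_histogram(descriptors: Sequence[Sequence[float]],
                          codebook: Sequence[Sequence[float]]) -> List[float]:
    """Quantize each descriptor to a visual word and return a normalized histogram."""
    counts = [0] * len(codebook)
    for descriptor in descriptors:
        counts[nearest_codeword(descriptor, codebook)] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * len(codebook)
    return [count / total for count in counts]


# Toy codebook with three visual words (hypothetical values).
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]

# Toy filter responses collected from one patch or superpixel.
region_descriptors = [[0.1, 0.0], [0.9, 1.1], [0.8, 0.9], [0.1, 0.9]]

# This histogram is the feature vector that would be passed to a classifier.
histogram = visual_word_histogram(region_descriptors, codebook)
print(histogram)
```

The resulting histogram has one bin per visual word and sums to one, so regions of different sizes yield comparable feature vectors.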

In video segmentation, Markov random fields (MRFs) and CRFs are the two main frameworks. Statistically, video segmentation formulates and maximizes the posterior probability of the labels given the observation data. When there is no labeled data, or only a small amount, heuristic or prior-knowledge-based distributions can be selected to describe the observation data. Based on the selected distributions and the prior on labels modeled in an MRF, MRF approaches formulate the posterior via likelihoods and priors using Bayes' rule. In contrast, CRFs model the posterior directly, which improves predictive performance when large quantities of training data are available. In CRFs, the model of the observation data is obtained by learning from the training data using classifiers. Compared to MRFs, CRFs relax the assumption of data independence, but they require larger amounts of more expensive labeled data.
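The contrast between the two frameworks can be written out as a sketch in a commonly used pairwise form. The notation is an assumption made for illustration and does not come from the chapter: x denotes the labels, y the observation data, i a site (pixel, patch, or superpixel), (i, j) a pair of neighboring sites, V and psi potential functions, and Z(y) the normalizing constant. The site-wise factorization of the likelihood in the MRF case is one common way of expressing the data independence assumption mentioned above.

```latex
% MRF: posterior built from a likelihood and a label prior via Bayes' rule
P(x \mid y) \propto P(y \mid x)\, P(x), \qquad
P(y \mid x) = \prod_i P(y_i \mid x_i), \qquad
P(x) \propto \exp\Big( -\sum_{(i,j)} V_{ij}(x_i, x_j) \Big)

% CRF: posterior modeled directly; every potential may depend on all of y
P(x \mid y) = \frac{1}{Z(y)} \exp\Big( -\sum_i \psi_i(x_i, y)
              - \sum_{(i,j)} \psi_{ij}(x_i, x_j, y) \Big)

% In both cases the segmentation is the maximum a posteriori labeling
\hat{x} = \arg\max_x P(x \mid y)
```

In the CRF form, the unary term psi_i is typically derived from the output of a trained local classifier, which is why this framework depends on labeled training data.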

This chapter is organized as follows. Section 3.2 introduces different types of filter banks and visual descriptors that capture local appearance, and different techniques to quantize their responses into textons or visual words. Some popular classifiers on local appearance are reviewed in Sect. 3.3.1. Section 3.3.2 introduces CRFs and different approaches to using them for semantic object segmentation. Section 3.4 first introduces two classical topic models, Probabilistic Latent Semantic Analysis (pLSA) [9] and Latent Dirichlet Allocation (LDA) [10], which were directly borrowed from language processing and applied to semantic object segmentation. Both pLSA and LDA ignore the spatial distribution of image patches. Spatial Latent Dirichlet Allocation [11], an extension of LDA, and other topic models that incorporate the spatial structure of objects are introduced in Sects. 3.4.2 and 3.4.3. Approaches to object segmentation in videos are discussed in Sect. 3.5. Finally, a summary is given in Sect. 3.6.
