object labels. Support Vector Machines (SVM) and Boosting are widely used to
model the appearance of object classes. Marsalek and Schmid [ 35 ] estimated the
shape mask of an object and its object category using a nonlinear SVM with the χ²
distance. The appearance of the object within the shape mask was represented by
a histogram of visual words. Shotton et al. [ 32 ] used texton histograms and region
priors of image regions, computed from their proposed semantic texton forests, as
input to a one-vs-others SVM classifier that assigns image regions to different
object classes. Gould et al. [ 36 ] used a boosting classifier to predict
the label of each pixel. Tahir et al. [ 25 ] used Spectral Regression Kernel Discrim-
inant Analysis (SRKDA) [ 37 ] and achieved better results than SVM on PASCAL
VOC 2008 [ 6 ]. It was also much more efficient than Kernel Discriminant Analysis
(KDA). Aldavert et al. [ 38 ] proposed an integral linear classifier, which used integral
images to efficiently calculate the outputs of linear classifiers based on histograms
of visual words at the pixel level.
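As a rough illustration of the appearance classifiers discussed above, the sketch below trains an SVM with a precomputed χ² kernel on bag-of-visual-words histograms using scikit-learn. The histograms and labels are random placeholders, and the vocabulary size and kernel parameter are arbitrary; this is a minimal example of the technique, not the exact setup of [ 35 ] or [ 32 ].

```python
import numpy as np
from sklearn.metrics.pairwise import chi2_kernel
from sklearn.svm import SVC

# Placeholder data: 200 visual-word histograms over a 500-word vocabulary
# with binary object/background labels. A real system would build these
# histograms from quantized local descriptors extracted from image regions.
rng = np.random.default_rng(0)
X_train = rng.random((200, 500))
X_train /= X_train.sum(axis=1, keepdims=True)   # L1-normalize histograms
y_train = rng.integers(0, 2, size=200)

X_test = rng.random((20, 500))
X_test /= X_test.sum(axis=1, keepdims=True)

# Precompute the chi-squared kernel matrix and train an SVM on it.
K_train = chi2_kernel(X_train, gamma=0.5)
svm = SVC(kernel="precomputed", C=1.0)
svm.fit(K_train, y_train)

# At test time the kernel is evaluated between test and training histograms.
K_test = chi2_kernel(X_test, X_train, gamma=0.5)
predictions = svm.predict(K_test)
```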
3.3.2 Conditional Random Fields
Although classifiers such as SVM and Boosting can predict the object label of a
pixel based on the appearance within its neighborhood, they cannot capture local
consistency or other contextual relationships, such as the fact that "sky" appears
above buildings but not the other way around. Local appearance, local consistency,
and contextual features can be well incorporated under a Conditional Random
Field (CRF) framework.
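For concreteness, a commonly used pairwise CRF formulation (a generic sketch, not the specific model of any method cited here) scores a labeling Z of image sites given the image X as

P(Z | X) ∝ exp( − Σ_i ψ_u(z_i, X) − Σ_(i,j)∈E ψ_p(z_i, z_j, X) ),

where the unary potentials ψ_u encode local appearance (e.g., a classifier score for site i), and the pairwise potentials ψ_p over neighboring sites E encourage local consistency and can encode contextual relations such as the sky-above-building example.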
3.3.2.1 Multiscale Conditional Random Fields
He et al. [ 39 ] were the first to use CRF for semantic object segmentation. Their
proposed CRF framework is described as follows. Suppose X = {x_i} are image
patches and Z = {z_i} are their object class labels. In [ 39 ], the conditional
distribution over Z given the input X was defined by multiplicatively combining
component conditional distributions,

P(Z | X) ∝ P_C(Z | X) · P_R(Z | X) · P_G(Z | X).    (3.1)
P_C, P_R, and P_G capture statistical structure at three different spatial scales: a local
classifier, regional features, and global features (see Fig. 3.6 ).
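As a minimal illustration of the multiplicative combination in Eq. (3.1), the sketch below sums component outputs in log-space and renormalizes. For simplicity it treats every component as producing a per-patch class distribution, which is only literally true of the local classifier; the random inputs are placeholders, not the actual regional or global models of [ 39 ].

```python
import numpy as np

def combine_log_scores(log_p_local, log_p_regional, log_p_global):
    """Combine component scores as in Eq. (3.1): the product of the component
    conditional distributions becomes a sum in log-space. Each argument has
    shape (num_patches, num_classes); each output row is renormalized to a
    valid distribution."""
    log_joint = log_p_local + log_p_regional + log_p_global
    log_joint -= log_joint.max(axis=1, keepdims=True)   # numerical stability
    p = np.exp(log_joint)
    return p / p.sum(axis=1, keepdims=True)

# Placeholder component outputs for 4 patches and 3 classes.
rng = np.random.default_rng(0)
scores = [np.log(rng.dirichlet(np.ones(3), size=4)) for _ in range(3)]
posterior = combine_log_scores(*scores)
print(posterior.round(3))
```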
The local classifier P_C produces a distribution over the label z_i given its image
patch x_i as input,

P_C(Z | X, λ) = ∏_i P_C(z_i | x_i, λ),    (3.2)

where λ is the parameter of the local classifier. A 3-layer multilayer perceptron
(MLP) was used in [ 39 ].
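The sketch below uses scikit-learn's MLPClassifier as a stand-in for the local classifier P_C in Eq. (3.2). The patch features, label set, and hidden-layer size are arbitrary placeholders; this is not the exact network of [ 39 ], only an illustration of a per-patch classifier that outputs class distributions.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Placeholder patch features: 1000 patches described by 64-dim vectors,
# each labeled with one of 5 object classes. Real features would be
# extracted from the image patches x_i.
rng = np.random.default_rng(0)
features = rng.random((1000, 64))
labels = rng.integers(0, 5, size=1000)

# One hidden layer gives an input/hidden/output (3-layer) perceptron,
# loosely mirroring the local classifier described above.
mlp = MLPClassifier(hidden_layer_sizes=(128,), max_iter=300, random_state=0)
mlp.fit(features, labels)

# predict_proba returns a class distribution per patch, playing the role of
# P_C(z_i | x_i, lambda); the product over patches gives P_C(Z | X, lambda).
patch_distributions = mlp.predict_proba(features[:3])
print(patch_distributions.round(3))
```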