3.2 Collective Classification: Notation and Problem Definition
Collective classification is a combinatorial optimization problem, in which we are given a set of documents, or nodes, V = {V_1, ..., V_n} and a neighborhood function N, where N_i ⊆ V \ {V_i}, which describes the underlying network structure. Each node in V is a random variable that can take a value from an appropriate domain. V is further divided into two sets of nodes: X, the nodes for which we know the correct values (observed variables), and Y, the nodes whose values need to be determined. Our task is to label the nodes Y_i ∈ Y with one of a small number of labels, L = {L_1, ..., L_q}; we'll use the shorthand y_i to denote the label of node Y_i.
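The notation above can be written down directly as data structures. The following sketch is illustrative only: the node indices, neighborhoods, and label values are invented placeholders, not taken from the chapter.

```python
# Label domain L = {L_1, ..., L_q} (placeholder labels)
L = {"Student", "Faculty"}

# Nodes V = {V_1, ..., V_n}, identified here by index
V = {1, 2, 3, 4}

# Neighborhood function N: N_i is a subset of V \ {V_i}
N = {1: {2, 3}, 2: {1, 4}, 3: {1}, 4: {2}}

# V splits into observed nodes X (correct values known) ...
X = {3: "Student", 4: "Faculty"}
# ... and unobserved nodes Y, whose labels y_i must be predicted
Y = {1, 2}

# Sanity checks implied by the definitions: X and Y partition V,
# and no node is its own neighbor
assert set(X) | Y == V and set(X).isdisjoint(Y)
assert all(i not in N[i] and N[i] <= V for i in V)
```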
We explain the notation further using a document classification example shown in Figure 3.1. In this example, we will use the words (and phrases) contained in the documents as local attributes. Each document is indicated by a box, the corresponding topic of the webpage is indicated by an ellipse inside the box, and each word that appears in the document is represented by a circle inside the box. The observed random variables X are shaded, whereas the unobserved ones Y are not. We will assume that the domain of the unobserved label variables is L. Figure 3.1 shows a network with two unobserved variables (Y_1 and Y_2), which require prediction, and seven observed variables (X_3, X_4, X_5, X_6, X_7, X_8, and X_9). Note that some of the observed variables happen to be labels of webpages (X_6 and X_8) for which we know the correct values.
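The observed/unobserved split described for Figure 3.1 can be encoded in the same notation. Only the node indices stated in the text are used here; the figure's edges and word attributes are not reproduced.

```python
Y = {1, 2}                       # unobserved label variables Y_1 and Y_2
X = {3, 4, 5, 6, 7, 8, 9}        # observed variables X_3 through X_9
page_labels = {6, 8}             # observed variables that are webpage labels
word_attrs = X - page_labels     # remaining observed word attributes

# The observed and unobserved sets are disjoint
assert page_labels <= X and Y.isdisjoint(X)
```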
As mentioned in the introduction, due to the large body of work in this area of research, we have a number of approaches to collective classification. At a broad level of abstraction, these approaches can be divided into two distinct types: the first, where we use a collection of unnormalized local conditional classifiers, and the second, where we define the collective classification problem as one global objective function to be optimized. We next describe these two approaches and, for each approach, two approximate inference algorithms.
3.3 Approximate Inference Algorithms for Approaches Based on Local Conditional Classifiers
Two of the most commonly used approximate inference algorithms follow-
ing this approach are the iterative classification algorithm (ICA) and Gibbs
sampling (GS), and we next describe these in turn.
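As a preview of the first of these, here is a minimal sketch of the iterative classification idea: repeatedly re-classify each unobserved node from its neighbors' current labels until the assignment stabilizes. A toy majority-vote rule stands in for the trained local conditional classifier that a real implementation would learn from the labeled nodes and local attributes; all names and the example graph are assumptions for illustration.

```python
from collections import Counter

def ica(V, N, observed, default, max_iters=10):
    """Toy ICA: keep observed values, iteratively infer the rest."""
    y = dict(observed)
    # Bootstrap: give every unobserved node an initial label
    for v in V:
        y.setdefault(v, default)
    for _ in range(max_iters):
        changed = False
        for v in V:
            if v in observed:
                continue
            # Re-classify v using its neighbors' current labels
            # (majority vote stands in for a local classifier)
            votes = Counter(y[u] for u in N[v])
            new = votes.most_common(1)[0][0] if votes else default
            if new != y[v]:
                y[v], changed = new, True
        if not changed:  # converged: no label changed this sweep
            break
    return y

# Tiny example: node 1 is unobserved; both its neighbors are observed as "A"
V = {1, 2, 3}
N = {1: {2, 3}, 2: {1}, 3: {1}}
print(ica(V, N, observed={2: "A", 3: "A"}, default="B"))
```

In the example, node 1 starts with the default label and flips to "A" after one sweep, at which point the algorithm detects convergence and stops.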
 