Information Technology Reference
In-Depth Information
Hence the corpus becomes a matrix
m
x
n
, which have
m
rows and
n
cols with
respect to
m
document and
n
terms.
D
i
is called document vector.
The essence of document classification is to use supervised learning algorithms
in order to classify corpus into groups of documents; each group is labeled. In this
chapter we apply two methods namely support vector machine, decision tree and
neural network for document classification.
2 Document Classification Based on Support Vector Machine
2.1 Support Vector Machine
Support vector machine (SVM) [Cristianini, Shawe-Taylor 2000] is a supervised
learning algorithm for classification and regression. Given a set of
n-
dimensional
vectors in vector space, SVM finds the separating hyper-plane that splits vector
space into sub-set of vector; each separated sub-set (so-called data set) is assigned
one class. There is the constraint for this separating hyper-plane: “it must
maximize the margin between two sub-sets”.
Fig. 1
Separating hyper-planes
Suppose we have some
n-
dimensional vectors; each of them belongs to one of
two classes. We can find many
n-1
dimensional hyper-planes that classify such
vectors but there is only one hyper-plane that maximizes the margin between two
classes. In other words, the nearest between a point in one side of this hyper-plane
and other side of this hyper-plane is maximized. Such hyper-plane is called
maximum-margin hyper-plane and it is considered as maximum-margin classifier.
Let {
X
1
, X
2
,…, X
n
} be the training set of vectors and let
y
i
= {1, -1}be the class
label of vector
X
i
. It is necessary to determine the maximum-margin hyper-plane
that separates vectors belonging to
y
i
=1 from vectors belonging to
y
i
= -1. This
hyper-plane is written as the set of point satisfying: