Discovering User Interests by Document Classification - Mining and Analyzing Social Networks

Information Technology Reference

In-Depth Information

Hence the corpus becomes a matrix m x n , which have m rows and n cols with

respect to m document and n terms. D i is called document vector.

The essence of document classification is to use supervised learning algorithms

in order to classify corpus into groups of documents; each group is labeled. In this

chapter we apply two methods namely support vector machine, decision tree and

neural network for document classification.

2 Document Classification Based on Support Vector Machine

2.1 Support Vector Machine

Support vector machine (SVM) [Cristianini, Shawe-Taylor 2000] is a supervised

learning algorithm for classification and regression. Given a set of n- dimensional

vectors in vector space, SVM finds the separating hyper-plane that splits vector

space into sub-set of vector; each separated sub-set (so-called data set) is assigned

one class. There is the constraint for this separating hyper-plane: “it must

maximize the margin between two sub-sets”.

Fig. 1 Separating hyper-planes

Suppose we have some n- dimensional vectors; each of them belongs to one of

two classes. We can find many n-1 dimensional hyper-planes that classify such

vectors but there is only one hyper-plane that maximizes the margin between two

classes. In other words, the nearest between a point in one side of this hyper-plane

and other side of this hyper-plane is maximized. Such hyper-plane is called

maximum-margin hyper-plane and it is considered as maximum-margin classifier.

Let { X 1 , X 2 ,…, X n } be the training set of vectors and let y i = {1, -1}be the class

label of vector X i . It is necessary to determine the maximum-margin hyper-plane

that separates vectors belonging to y i =1 from vectors belonging to y i = -1. This

hyper-plane is written as the set of point satisfying:

Search WWH ::

Custom Search

Home