Semantically Enabled Knowledge Technologies (IST-1-506826-IP) and PAS-
CAL Network of Excellence (IST-2002-506778). This publication only reflects
the authors' views.
2.9 Appendix A: Support Vector Machines
The support vector machine (SVM) is a family of algorithms that has gained wide recognition in recent years as one of the state-of-the-art machine learning methods for tasks such as classification and regression. In the basic formulation, an SVM tries to separate two sets of training examples by a hyperplane that maximizes the margin (the distance between the hyperplane and the closest points). In addition, one usually permits a few training examples to be misclassified; this is known as the soft-margin SVM. The linear SVM is known to be one of the best performing methods for text categorization, e.g., in (2).
The linear SVM model can also be used for feature selection. In (13), the hyperplane's normal vector is used for ranking the features. In this paper we use this approach to find which features (in our case words) are the most important for a news article being classified into one of the two outlets.
2.10 Appendix B: Bag of Words and Vector Space Models
The classic representation of a text document in Information Retrieval is the Bag of Words (a bag is a set in which repetitions are allowed), also known as the Vector Space Model, since a bag can be represented as a (column) vector recording the number of occurrences of each word of the dictionary in the document at hand.
In the vector-space model, a document is represented by a column vector d indexed by all the elements of the dictionary (the i-th element of the vector is the frequency TF_i of the i-th term in the document). A corpus is represented by a matrix D whose columns are indexed by the documents and whose rows are indexed by the terms, D = (d_1, ..., d_N). We also call the data matrix D the term-document matrix.
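The construction above can be sketched in plain Python: fix a dictionary, turn each document into a column of term frequencies, and assemble the columns into the term-document matrix D. The two-document corpus is hypothetical.

```python
# Sketch of the bag-of-words / term-document matrix construction.
# The example corpus is hypothetical.
from collections import Counter

corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
]

# Dictionary: all distinct terms of the corpus, in a fixed order.
dictionary = sorted({term for doc in corpus for term in doc.split()})

# Each document becomes a column vector d of term frequencies TF_i.
def to_vector(doc):
    counts = Counter(doc.split())
    return [counts[term] for term in dictionary]

# Term-document matrix D = (d_1, ..., d_N): rows = terms, columns = documents.
columns = [to_vector(doc) for doc in corpus]
D = [[col[i] for col in columns] for i in range(len(dictionary))]

for term, row in zip(dictionary, D):
    print(f"{term:>7}: {row}")
```

Note that repeated words are counted, not merely recorded: "the" occurs twice in each document, so its row of D is [2, 2].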
Since not all terms are equally important for determining similarity between documents, we introduce term weights. A term weight corresponds to the importance of the term for the given corpus, and each element of the document vector is multiplied by the respective term weight. The most widely used weighting is TFIDF.
 