Semantically Enabled Knowledge Technologies (IST-1-506826-IP) and PAS-
CAL Network of Excellence (IST-2002-506778). This publication only reflects
the authors' views.
2.9 Appendix A: Support Vector Machines
The support vector machine (SVM) is a family of algorithms that has gained wide recognition in recent years as one of the state-of-the-art machine learning methods for tasks such as classification and regression. In the basic formulation, an SVM tries to separate two sets of training examples by a hyperplane that maximizes the margin (the distance between the hyperplane and the closest points). In addition, one usually permits a few training examples to be misclassified; this is known as the soft-margin SVM. The linear SVM is known to be one of the best performing methods for text categorization, e.g., in (2).
The linear SVM model can also be used for feature selection. In (13), the hyperplane's normal vector is used for ranking the features. In this paper we use this approach to find which features (in our case words) are the most important for a news article being classified into one of the two outlets.
2.10 Appendix B: Bag of Words and Vector Space Models
The classic representation of a text document in Information Retrieval is the Bag of Words (a bag is a set in which repetitions are allowed), also known as the Vector Space Model, since a bag can be represented as a (column) vector recording the number of occurrences of each word of the dictionary in the document at hand.
In the vector-space model, a document is represented by a column vector d indexed by all the elements of the dictionary (the i-th element of the vector is the frequency TF_i of the i-th term in the document). A corpus is represented by a matrix D whose columns are indexed by the documents and whose rows are indexed by the terms, D = (d_1, ..., d_N). We also call the data matrix D the term-document matrix.
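The construction above can be sketched in plain Python: fix a dictionary, turn each document into a column of term frequencies, and assemble the columns into the term-document matrix D. The two-document corpus is hypothetical.

```python
# Sketch of the bag-of-words / term-document matrix construction.
# The example corpus is hypothetical.
from collections import Counter

corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
]

# Dictionary: all distinct terms of the corpus, in a fixed order.
dictionary = sorted({term for doc in corpus for term in doc.split()})

# Each document becomes a column vector d of term frequencies TF_i.
def to_vector(doc):
    counts = Counter(doc.split())
    return [counts[term] for term in dictionary]

# Term-document matrix D = (d_1, ..., d_N): rows = terms, columns = documents.
columns = [to_vector(doc) for doc in corpus]
D = [[col[i] for col in columns] for i in range(len(dictionary))]

for term, row in zip(dictionary, D):
    print(f"{term:>7}: {row}")
```

Note that repeated words are counted, not merely recorded: "the" occurs twice in each document, so its row of D is [2, 2].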
Since not all terms are equally important for determining similarity between documents, we introduce term weights. A term weight corresponds to the importance of the term for the given corpus, and each element of the document vector is multiplied by the respective term weight. The most widely used weighting is TFIDF.
 