Detection of Bias in Media Outlets with Statistical Learning Methods - Text Mining: Classification, Clustering, and Applications


that they represent different descriptions of the same events. We will now use

various techniques from pattern analysis to extract information about any

systematic differences found between the two outlets.

2.4 News Outlet Identification

Given this dataset of 816 pairs of news-items, we can test the hypothesis

that each outlet has its own bias in describing the events, which is reflected

in the choice of words for the article. We will use Support Vector Machines

(SVM) [see Appendix A] to learn a linear classifier capable of identifying the

outlet of a news item by just looking at its content. If this is possible in a

statistically significant way, then clearly the two outlets' documents are distinguishable,

or can be modeled as having been generated from different probability

distributions. Differences between the distributions underlying the two news

outlets will be the focus of our investigation.

We trained an SVM on a subset of the data and tested it on the remaining

data. The task of the classifier was to guess if a given news article came from

CNN or from Al Jazeera. We used ten-fold cross-validation to evaluate the

classifiers. The data were randomly split into 10 folds of equal size and in

each turn one fold was held out. A classifier was trained on the remaining 9

folds and then evaluated on the fold that was held out. This was repeated for

all 10 folds and the results were averaged over these 10 iterations.
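The ten-fold procedure described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name and the `train_and_score` callback are hypothetical, standing in for whatever trains the linear SVM on nine folds and returns its score on the held-out fold.

```python
import random


def ten_fold_cross_validation(items, train_and_score, n_folds=10, seed=0):
    """Average a score over n_folds held-out folds.

    train_and_score(train_items, test_items) is a hypothetical callback
    (an assumption of this sketch) that trains a classifier, e.g. a
    linear SVM, and returns its score on test_items.
    """
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)  # random split, reproducible via seed
    # Fold i takes every n_folds-th shuffled index, so fold sizes
    # differ by at most one item.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [items[j] for j in indices if j not in held]
        test = [items[j] for j in held_out]
        scores.append(train_and_score(train, test))
    # Report the mean over all folds.
    return sum(scores) / len(scores)
```

Each item is held out exactly once, so every document contributes to the evaluation without ever being scored by a classifier that saw it during training.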

The performance in the task was measured by calculating the break-even

point (BEP), a hypothetical point where precision (the ratio of positive

documents among retrieved ones) and recall (the ratio of retrieved positive

documents among all positive documents) meet when varying the threshold. Other

measures are possible, and can be justified, in this context. Our choice of BEP

has advantages when we have imbalanced negative and positive sets, which is

the case when we try to assign a news item to a large set of possible outlets,

and hence negative examples are more frequent than positive ones.
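One common way to compute the BEP, shown here as a sketch under assumptions rather than the procedure used in the chapter, relies on the fact that precision and recall coincide when the number of retrieved documents equals the number of positive documents: both then equal the number of true positives divided by the number of positives. The function name and the toy scores below are hypothetical.

```python
def break_even_point(scores, labels):
    """Precision (= recall) when as many documents are retrieved as
    there are positive documents.

    scores: classifier outputs, higher means "more likely positive".
    labels: True for positive documents, False for negative ones.
    """
    n_pos = sum(1 for y in labels if y)
    if n_pos == 0:
        raise ValueError("BEP is undefined without positive documents")
    # Rank documents by decreasing score and retrieve the top n_pos.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    true_pos = sum(1 for _, y in ranked[:n_pos] if y)
    # retrieved == n_pos, so precision == recall == true_pos / n_pos.
    return true_pos / n_pos


# Toy example (made-up scores): 3 positives, and 2 of the top 3 ranked
# documents are positive, so the BEP is 2/3.
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.2]
toy_labels = [True, False, True, True, False]
toy_bep = break_even_point(toy_scores, toy_labels)
```

Because the measure depends only on how the positive documents are ranked, it is not inflated by a large pool of correctly rejected negatives, which is why it suits imbalanced sets.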

Before using the 816 pairs that we selected by the matching process, we

first tried the whole set of 9185 CNN and Al Jazeera news

articles, using ten-fold cross-validation to evaluate the linear SVM classifier

trained on the set.

We obtained 91% BEP, a very high score showing that it is indeed very

easy to separate the two outlets. This high score is to be expected, since CNN

and Al Jazeera cover different topics (e.g., CNN covers the whole world while Al Jazeera

mostly focuses on topics regarding the Middle East). This means that

the outlet of an item can be identified more easily as a result of its topic.

In order to isolate the effect of term-choice bias, we will have to restrict our

analysis only to comparable news-items: those 816 news items that have been
