that they represent different descriptions of the same events. We will now use
various techniques from pattern analysis to extract information about any
systematic differences found between the two outlets.
2.4 News Outlet Identification
Given this dataset of 816 pairs of news-items, we can test the hypothesis
that each outlet has its own bias in describing the events, which is reflected
in the choice of words for the article. We will use Support Vector Machines
(SVM) [see Appendix A] to learn a linear classifier capable of identifying the
outlet of a news item by just looking at its content. If this is possible in a
statistically significant way, then clearly the two documents are distinguishable,
or can be modeled as having been generated from different probability
distributions. Differences between the distributions underlying the two news
outlets will be the focus of our investigation.
We trained an SVM on a subset of the data, and tested it on the remaining
data. The task of the classifier was to guess if a given news article came from
CNN or from Al Jazeera. We used ten-fold cross-validation to evaluate the
classifiers. The data were randomly split into 10 folds of equal size and in
each turn one fold was held out. A classifier was trained on the remaining 9
folds and then evaluated on the fold that was held out. This was repeated for
all 10 folds and the results were averaged over these 10 iterations.
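The ten-fold procedure described above can be sketched in a few lines of Python. This is only an illustration, not the code used for the experiments: the names k_fold_indices, cross_validate, train_fn and score_fn are hypothetical, and the classifier is passed in as a pair of functions so that any learner (such as a linear SVM) could be plugged in.

```python
import random


def k_fold_indices(n_items, k=10, seed=0):
    """Shuffle the item indices and split them into k folds of near-equal size."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Taking every k-th index gives folds whose sizes differ by at most one.
    return [indices[i::k] for i in range(k)]


def cross_validate(items, labels, train_fn, score_fn, k=10, seed=0):
    """Hold out each fold once, train on the other k-1 folds, average the scores.

    train_fn(train_items, train_labels) -> model          (assumed interface)
    score_fn(model, test_items, test_labels) -> float     (assumed interface)
    """
    folds = k_fold_indices(len(items), k, seed)
    scores = []
    for held_out in folds:
        held = set(held_out)
        train_idx = [i for i in range(len(items)) if i not in held]
        model = train_fn([items[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        scores.append(score_fn(model,
                               [items[i] for i in held_out],
                               [labels[i] for i in held_out]))
    return sum(scores) / len(scores)
```

Here score_fn would compute whatever performance measure is being reported for the held-out fold, and the returned value is the average over the k iterations.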
The performance in the task was measured by calculating the break-even-
point (BEP) which is a hypothetical point where precision (ratio of positive
documents among retrieved ones) and recall (ratio of retrieved positive docu-
ments among all positive documents) meet when varying the threshold. Other
measures are possible, and can be justified, in this context. Our choice of BEP
has advantages when we have imbalanced negative and positive sets, which is
the case when we try to assign a news item to a large set of possible outlets,
and hence negative examples are more frequent than positive ones.
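As an illustration of the measure, the sketch below computes a BEP from classifier scores. The text does not say how the BEP was computed in the experiments, so this is one common approach under stated assumptions: documents are ranked by score, labels are 1 for positive and 0 for negative, and the threshold is placed so that the number of retrieved documents equals the number of positive documents. At that point precision and recall share the same numerator and denominator, so they coincide. The function names are hypothetical.

```python
def precision_recall_at(scores, labels, k):
    """Precision and recall when the k highest-scoring documents are retrieved."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    true_positives = sum(1 for _, label in ranked[:k] if label == 1)
    n_positive = sum(1 for label in labels if label == 1)
    precision = true_positives / k          # positives among retrieved
    recall = true_positives / n_positive    # retrieved among all positives
    return precision, recall


def break_even_point(scores, labels):
    """Value at which precision equals recall (assumes labels are 0 or 1).

    Retrieving exactly as many documents as there are positives makes
    precision and recall equal; tied scores are kept in input order.
    """
    n_positive = sum(1 for label in labels if label == 1)
    if n_positive == 0:
        raise ValueError("BEP is undefined without positive examples")
    precision, _recall = precision_recall_at(scores, labels, n_positive)
    return precision


# Toy example: five documents, three of them positive.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5]
toy_labels = [1, 0, 1, 1, 0]
toy_bep = break_even_point(toy_scores, toy_labels)
```

In the toy example, two of the three top-ranked documents are positive, so precision and recall are both 2/3 at the break-even threshold.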
Before using the 816 pairs selected by the matching process, we first
tried the whole set of 9185 CNN and Al Jazeera news articles, training a
linear SVM classifier on this set and evaluating it with ten-fold
cross-validation.
We obtained a BEP of 91%, a very high score showing that it is indeed very
easy to separate the two outlets. This high score can be expected since CNN
and Al Jazeera cover different topics (e.g., CNN covers the whole world while
Al Jazeera mostly focuses on topics regarding the Middle East). This means that
the outlet of an item can be identified more easily as a result of its topic.
In order to isolate the effect of term-choice bias, we will have to restrict our
analysis only to comparable news-items: those 816 news items that have been
 