For the first two experiments, we constructed a paired corpus of news items,
much as is done in cross-language content analysis, where each pair is formed
by one item from AJ and one item from CNN reporting on the same story.
The corpus was created by extracting the text of each story from its HTML
page using a support vector machine (SVM), and the resulting items were later
paired using an algorithm developed for this purpose. The SVM was necessary
because we described each portion of text in an HTML page with a set of
features, and we needed to classify these feature vectors in order to identify
the portion corresponding to the actual article content. Starting from 9185
news items gathered over a period of 13 months in 2005 and 2006 from these
two news outlets, 816 pairs were obtained in this way, most of which turned
out to be related to Middle East politics and events.
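The content-extraction step above can be sketched in code. The original work used an SVM over feature vectors describing each block of text in a page; the sketch below substitutes a simple perceptron so the example runs with no dependencies, and the features it uses (block length, link density, comma density) are illustrative assumptions, not the feature set of the study.

```python
# Sketch: separating article content from page boilerplate by classifying a
# feature vector computed for each text block. A perceptron stands in for the
# SVM used in the original work; the features are illustrative assumptions.

def features(block):
    """Map a (text, number_of_links) block to a small feature vector."""
    text, n_links = block
    words = text.split()
    n = max(len(words), 1)
    return [
        len(words) / 100.0,   # length: long blocks tend to be article content
        n_links / n,          # link density: high in navigation menus
        text.count(",") / n,  # comma density: high in running prose
    ]

def train_perceptron(data, epochs=50, lr=0.1):
    """Learn a linear separator from (block, label) pairs; label is +1/-1."""
    w, b = [0.0, 0.0, 0.0], 0.0
    for _ in range(epochs):
        for block, label in data:
            x = features(block)
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if label * score <= 0:  # misclassified: nudge the separator
                w = [wi + lr * label * xi for wi, xi in zip(w, x)]
                b += lr * label
    return w, b

def is_content(block, w, b):
    x = features(block)
    return sum(wi * xi for wi, xi in zip(w, x)) + b > 0

# Toy training data: (text, number_of_links), +1 = content, -1 = boilerplate.
train = [
    (("Home News Sport Weather", 4), -1),
    (("Contact About Privacy Terms", 4), -1),
    ((" ".join(["word"] * 120) + ", indeed, a long article paragraph,", 0), +1),
    ((" ".join(["text"] * 90) + ", more running prose, with commas,", 1), +1),
]
w, b = train_perceptron(train)
```

Once trained, the classifier is applied to every block of a page, and the blocks scored as content are concatenated to recover the article text.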
The first task for the learning algorithm was to identify the outlet in which a
given news item had appeared, based only on its content. Furthermore, it was
possible to isolate a subset of words that are crucial in informing this
decision: words that are used in different ways by the two outlets. In other
words, the choice of terms is biased in the two outlets, and these keywords
are the most polarized ones. For example, CNN prefers terms such as
'insurgency,' 'militants,' and 'terrorists' in describing the same stories in
which Al Jazeera prefers the words 'resistance,' 'fighters,' and 'rebels.'
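One simple way to surface such polarized terms can be sketched as follows. The study derived its keywords from the trained classifier itself; the sketch below instead ranks words by a smoothed log-odds ratio of their relative frequencies in the two outlets, which is an illustrative stand-in rather than the original method.

```python
# Sketch: rank words by how strongly their usage differs between two corpora,
# using a smoothed log-odds ratio. Illustrative stand-in for the study's
# classifier-derived keyword selection.
from collections import Counter
from math import log

def polarized_terms(docs_a, docs_b, k=3):
    """Return (terms characteristic of A, terms characteristic of B)."""
    ca = Counter(w for d in docs_a for w in d.lower().split())
    cb = Counter(w for d in docs_b for w in d.lower().split())
    na, nb = sum(ca.values()), sum(cb.values())
    vocab = set(ca) | set(cb)
    # Add-one smoothing; positive score -> overused in A, negative -> in B.
    score = {w: log((ca[w] + 1) / (na + len(vocab)))
                - log((cb[w] + 1) / (nb + len(vocab)))
             for w in vocab}
    ranked = sorted(vocab, key=lambda w: score[w])
    return ranked[-k:], ranked[:k]

# Toy example with invented snippets in the spirit of the text.
docs_a = ["militants attacked the city", "insurgency grows as militants regroup"]
docs_b = ["fighters attacked the city", "resistance grows as fighters regroup"]
top_a, top_b = polarized_terms(docs_a, docs_b)
```

On this toy input, 'militants' and 'insurgency' surface as characteristic of the first corpus and 'fighters' and 'resistance' of the second, mirroring the kind of lexical polarization described above.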
For the last set of experiments, involving the generation of Maps, we used
the full corpus. Obtained with the same techniques and over the same time
interval, it contains 21552 news items: 2142 for AJ, 6840 for CNN, 2929 for
DN, and 9641 for IHT. The two news outlets with a more regional focus (AJ
and DN) have the smallest sets of news items, as well as the smallest
intersection, so few stories were covered by all four newspapers; those that
were are mostly related to the Middle East.
2.3 Data Collection and Preparation
The dataset used in all three experiments was gathered between March 31st,
2005 and April 14th, 2006 from the websites of AJ, CNN, DN, and IHT. A
subset of matching item pairs was then identified for each pair of news
outlets. The acquisition and matching algorithms are described below. For
CNN and Al Jazeera, 816 pairs were determined to be matching and were used
in the first two experiments. Not surprisingly, these referred mostly to
Middle East events.
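The pairing step can be illustrated with a minimal sketch, assuming a simple criterion: greedily match each item from one outlet to its most lexically similar unused item from the other, keeping only matches above a similarity threshold. This is one plausible approach for illustration, not necessarily the matching algorithm used in the study.

```python
# Sketch: pair items from two outlets that report the same story, using cosine
# similarity over bag-of-words vectors. Illustrative; not necessarily the
# study's matching algorithm.
from collections import Counter
from math import sqrt

def cosine(a, b):
    """Cosine similarity between the word-count vectors of two texts."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = (sqrt(sum(v * v for v in ca.values()))
            * sqrt(sum(v * v for v in cb.values())))
    return dot / norm if norm else 0.0

def match_pairs(items_a, items_b, threshold=0.3):
    """Greedily pair each item in A with its most similar unused item in B."""
    pairs, used = [], set()
    for i, a in enumerate(items_a):
        best, best_sim = None, threshold
        for j, b in enumerate(items_b):
            if j in used:
                continue
            sim = cosine(a, b)
            if sim > best_sim:
                best, best_sim = j, sim
        if best is not None:
            pairs.append((i, best))
            used.add(best)
    return pairs

# Toy headlines standing in for full news items.
items_a = ["explosion rocks baghdad market killing dozens",
           "election results announced in cairo"]
items_b = ["cairo announces election results today",
           "dozens killed as explosion rocks baghdad market"]
pairs = match_pairs(items_a, items_b)
```

In practice one would also restrict candidate matches to items published within a short date window of each other, which both prunes the search and reduces false matches between recurring story types.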
 