For the first two experiments, we constructed a paired corpus of news items,
much as is done in cross-language content analysis, where each pair is formed
by one item from AJ and one item from CNN reporting on the same story.
The corpus was created by extracting the text of each story from its HTML
page using a support vector machine (SVM), and the resulting items were later
paired using an algorithm developed for this purpose. The SVM was necessary
because we described each portion of text in an HTML page with a set of
features, and we needed to classify these feature vectors in order to identify
the portion corresponding to the actual article content. Starting from 9185
news items gathered over a period of 13 months in 2005 and 2006 from these
two news outlets, 816 pairs were obtained in this way, most of which turned
out to be related to Middle East politics and events.
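The content-extraction step above can be sketched in code. The original work used an SVM over feature vectors describing each block of text in a page; the sketch below substitutes a simple perceptron so the example runs with no dependencies, and the features it uses (block length, link density, comma density) are illustrative assumptions, not the feature set of the study.

```python
# Sketch: separating article content from page boilerplate by classifying a
# feature vector computed for each text block. A perceptron stands in for the
# SVM used in the original work; the features are illustrative assumptions.

def features(block):
    """Map a (text, number_of_links) block to a small feature vector."""
    text, n_links = block
    words = text.split()
    n = max(len(words), 1)
    return [
        len(words) / 100.0,   # length: long blocks tend to be article content
        n_links / n,          # link density: high in navigation menus
        text.count(",") / n,  # comma density: high in running prose
    ]

def train_perceptron(data, epochs=50, lr=0.1):
    """Learn a linear separator from (block, label) pairs; label is +1/-1."""
    w, b = [0.0, 0.0, 0.0], 0.0
    for _ in range(epochs):
        for block, label in data:
            x = features(block)
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if label * score <= 0:  # misclassified: nudge the separator
                w = [wi + lr * label * xi for wi, xi in zip(w, x)]
                b += lr * label
    return w, b

def is_content(block, w, b):
    x = features(block)
    return sum(wi * xi for wi, xi in zip(w, x)) + b > 0

# Toy training data: (text, number_of_links), +1 = content, -1 = boilerplate.
train = [
    (("Home News Sport Weather", 4), -1),
    (("Contact About Privacy Terms", 4), -1),
    ((" ".join(["word"] * 120) + ", indeed, a long article paragraph,", 0), +1),
    ((" ".join(["text"] * 90) + ", more running prose, with commas,", 1), +1),
]
w, b = train_perceptron(train)
```

Once trained, the classifier is applied to every block of a page, and the blocks scored as content are concatenated to recover the article text.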
The first task for the learning algorithm was to identify the outlet in which a
given news item had appeared, based only on its content. Furthermore, it was
possible to isolate a subset of words that are crucial in informing this
decision: words that are used in different ways by the two outlets. In other
words, the choice of terms is biased in the two outlets, and these keywords
are the most polarized ones. For example, CNN prefers terms such as
'insurgency,' 'militants,' and 'terrorists' in describing the same stories in
which Al Jazeera prefers the words 'resistance,' 'fighters,' and 'rebels.'
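One simple way to surface such polarized terms can be sketched as follows. The study derived its keywords from the trained classifier itself; the sketch below instead ranks words by a smoothed log-odds ratio of their relative frequencies in the two outlets, which is an illustrative stand-in rather than the original method.

```python
# Sketch: rank words by how strongly their usage differs between two corpora,
# using a smoothed log-odds ratio. Illustrative stand-in for the study's
# classifier-derived keyword selection.
from collections import Counter
from math import log

def polarized_terms(docs_a, docs_b, k=3):
    """Return (terms characteristic of A, terms characteristic of B)."""
    ca = Counter(w for d in docs_a for w in d.lower().split())
    cb = Counter(w for d in docs_b for w in d.lower().split())
    na, nb = sum(ca.values()), sum(cb.values())
    vocab = set(ca) | set(cb)
    # Add-one smoothing; positive score -> overused in A, negative -> in B.
    score = {w: log((ca[w] + 1) / (na + len(vocab)))
                - log((cb[w] + 1) / (nb + len(vocab)))
             for w in vocab}
    ranked = sorted(vocab, key=lambda w: score[w])
    return ranked[-k:], ranked[:k]

# Toy example with invented snippets in the spirit of the text.
docs_a = ["militants attacked the city", "insurgency grows as militants regroup"]
docs_b = ["fighters attacked the city", "resistance grows as fighters regroup"]
top_a, top_b = polarized_terms(docs_a, docs_b)
```

On this toy input, 'militants' and 'insurgency' surface as characteristic of the first corpus and 'fighters' and 'resistance' of the second, mirroring the kind of lexical polarization described above.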
For the last set of experiments, involving the generation of Maps, we used
the full corpus. Obtained with the same techniques and over the same time
interval, it contains 21552 news items: 2142 for AJ, 6840 for CNN, 2929 for
DN, and 9641 for IHT. The two news outlets with a more regional focus (AJ
and DN) have the smallest sets of news items, as well as the smallest
intersection, so few stories were covered by all four newspapers; those that
were are mostly related to the Middle East.
2.3 Data Collection and Preparation
The dataset used in all three experiments was gathered between March 31st,
2005 and April 14th, 2006 from the websites of AJ, CNN, DN, and IHT. A
subset of matching item pairs was then identified for each pair of news
outlets. The acquisition and matching algorithms are described below. For
CNN and Al Jazeera, 816 pairs were determined to be matching and were used
in the first two experiments. Not surprisingly, these referred mostly to
Middle East events.
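The pairing step can be illustrated with a minimal sketch, assuming a simple criterion: greedily match each item from one outlet to its most lexically similar unused item from the other, keeping only matches above a similarity threshold. This is one plausible approach for illustration, not necessarily the matching algorithm used in the study.

```python
# Sketch: pair items from two outlets that report the same story, using cosine
# similarity over bag-of-words vectors. Illustrative; not necessarily the
# study's matching algorithm.
from collections import Counter
from math import sqrt

def cosine(a, b):
    """Cosine similarity between the word-count vectors of two texts."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = (sqrt(sum(v * v for v in ca.values()))
            * sqrt(sum(v * v for v in cb.values())))
    return dot / norm if norm else 0.0

def match_pairs(items_a, items_b, threshold=0.3):
    """Greedily pair each item in A with its most similar unused item in B."""
    pairs, used = [], set()
    for i, a in enumerate(items_a):
        best, best_sim = None, threshold
        for j, b in enumerate(items_b):
            if j in used:
                continue
            sim = cosine(a, b)
            if sim > best_sim:
                best, best_sim = j, sim
        if best is not None:
            pairs.append((i, best))
            used.add(best)
    return pairs

# Toy headlines standing in for full news items.
items_a = ["explosion rocks baghdad market killing dozens",
           "election results announced in cairo"]
items_b = ["cairo announces election results today",
           "dozens killed as explosion rocks baghdad market"]
pairs = match_pairs(items_a, items_b)
```

In practice one would also restrict candidate matches to items published within a short date window of each other, which both prunes the search and reduces false matches between recurring story types.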
 