carried a given news item; it is possible to decompose the space of documents
into topics, and detect the most polarizing ones; it is possible to recognize
which terms contribute the most to the bias; these quantities can also be used
to design two independent measures of similarity between news outlets, one
capturing their topic-choice bias, the other capturing their term-choice bias.
Maps of the media system could be created based on these metrics, and since
every step of this analysis has been done automatically, these could scale up
to very large sizes.
This chapter is organized as follows: in the next section we give an
overview of the experiments we performed; in Section 3 we describe how
we obtained and prepared the data, including the method we used to identify
news-items covering the same story in different outlets; in Section 4 we
describe the outlet identification experiments using SVMs; in Section 5 we
describe the kCCA experiments that isolate the topics in which polarization is
most present; in Section 6 we show how similarity measures between
outlets can be designed based on the previous experiments; and in Section 7
we discuss the results and, importantly, several recent results that are
closely related to this study, including work on detecting an author's
perspective from the contents of a document.
2.2 Overview of the Experiments
An automatic system based on learning algorithms has been used to create a
corpus of news-items that appeared in the online versions of 4 international
news outlets between 31st March 2005 and 14th April 2006. We have
performed three experiments on this dataset, aimed at extracting patterns
from the news content that relate to a bias in lexical choice when reporting
the same events, or a bias in choosing the events to cover.
The first experiment, using Support Vector Machines (4) and limited to
CNN and AJ, demonstrates how it is possible to identify the outlet of a news
item based on its content, and identifies the terms that are most helpful in this
discrimination. The second experiment, using Canonical Correlation Analysis
(14), identifies topics in the CNN/AJ part of the corpus, and then identifies
words that are discriminative for the two outlets in each topic. Finally, we
have generated maps reflecting the distance separating the 4 outlets, based
both on topic-choice and on lexical-choice features.
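The idea behind the first experiment, training a linear classifier to tell two outlets apart and then reading the most discriminative terms off its weight vector, can be illustrated with a small sketch. It is not the setup used in the experiments: the text above does not specify the features, kernel, or parameters, so the sketch assumes raw word counts, a linear SVM without a bias term trained by Pegasos-style subgradient descent, and hypothetical function names. The four short documents are invented toy data, not items from the corpus.

```python
import re
from collections import Counter, defaultdict


def bag_of_words(text):
    """Lower-case the text and count its word tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def train_linear_svm(docs, labels, epochs=50, lam=0.01):
    """Linear SVM (hinge loss + L2 penalty), Pegasos-style subgradient descent.

    docs: list of term-count dicts; labels: +1 or -1 per document.
    epochs and lam are illustrative values, not taken from the source.
    """
    weights = defaultdict(float)
    step = 0
    for _ in range(epochs):
        for x, y in zip(docs, labels):
            step += 1
            eta = 1.0 / (lam * step)  # decaying step size
            margin = y * sum(weights[t] * c for t, c in x.items())
            # Shrink every weight: the gradient step for the L2 penalty.
            for t in weights:
                weights[t] *= 1.0 - 1.0 / step
            # Hinge-loss step, only when the margin constraint is violated.
            if margin < 1.0:
                for t, c in x.items():
                    weights[t] += eta * y * c
    return weights


def score(weights, x):
    """Signed decision value: positive means outlet A, negative outlet B."""
    return sum(weights.get(t, 0.0) * c for t, c in x.items())


def top_terms(weights, k=2):
    """Return the k most positive and the k most negative terms."""
    ranked = sorted(weights, key=weights.get)
    return ranked[-k:][::-1], ranked[:k]


# Invented toy documents: the same two stories worded differently.
texts = [
    "lawmakers passed the bill on friday",       # outlet A (+1)
    "legislators passed the measure on friday",  # outlet B (-1)
    "the bill divides lawmakers",                # outlet A (+1)
    "the measure divides legislators",           # outlet B (-1)
]
labels = [1, -1, 1, -1]
docs = [bag_of_words(t) for t in texts]

weights = train_linear_svm(docs, labels)
outlet_a_terms, outlet_b_terms = top_terms(weights)
print("terms pointing to outlet A:", outlet_a_terms)
print("terms pointing to outlet B:", outlet_b_terms)
```

Words shared by both versions of a story carry no information about the outlet and end up with weights near zero, while words used by only one outlet receive weights of large magnitude. This is the sense in which a linear classifier exposes the terms that are most helpful in the discrimination.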
In order to separate the two effects (choice of topics and of lexicon) we
developed an algorithm to identify corresponding news-items in different outlets
(based on a combination of date and bag-of-words similarity). This means
that any patterns in lexical difference we identify are obtained by comparing
different versions of the same stories.
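A matching step of this kind can be sketched as follows. The text states only that the algorithm combines date and bag-of-words similarity. The cosine measure, the one-day window, the 0.5 threshold, and all function and parameter names below are assumptions made for illustration, and the two short item lists are invented toy data.

```python
import math
import re
from collections import Counter
from datetime import date


def bag_of_words(text):
    """Lower-case the text and count its word tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_items(outlet_a, outlet_b, max_days=1, min_sim=0.5):
    """Pair each item of outlet_a with its most similar item of outlet_b.

    Items are (publication_date, text) tuples. A candidate is considered
    only if it was published within max_days of the item, and a pair is
    kept only if its cosine similarity is at least min_sim (both values
    are assumed, not taken from the source).
    Returns a list of (index_a, index_b, similarity) tuples.
    """
    bags_b = [bag_of_words(text) for _, text in outlet_b]
    pairs = []
    for i, (day_a, text_a) in enumerate(outlet_a):
        bag_a = bag_of_words(text_a)
        best_j, best_sim = None, min_sim
        for j, (day_b, _) in enumerate(outlet_b):
            if abs((day_a - day_b).days) > max_days:
                continue  # date filter: too far apart to be the same story
            sim = cosine(bag_a, bags_b[j])
            if sim >= best_sim:
                best_j, best_sim = j, sim
        if best_j is not None:
            pairs.append((i, best_j, best_sim))
    return pairs


# Invented toy items for two hypothetical outlets.
outlet_a = [
    (date(2005, 4, 1), "Earthquake strikes coastal city, hundreds feared dead"),
    (date(2005, 4, 3), "Parliament votes on new budget"),
]
outlet_b = [
    (date(2005, 4, 2), "Hundreds feared dead after earthquake strikes coastal city"),
    (date(2005, 5, 20), "Parliament votes on new budget"),
]

pairs = match_items(outlet_a, outlet_b)
print(pairs)
```

In this toy run the two earthquake reports are paired, while the budget items are not, because their publication dates lie far outside the window even though their wording is identical. The sketch does not enforce one-to-one matching, so an item in the second outlet could be paired with several items in the first; a real system would need to decide how to handle that case.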
 