Detection of Bias in Media Outlets with Statistical Learning Methods - Text Mining: Classification, Clustering, and Applications

carried a given news item; it is possible to decompose the space of documents

into topics, and detect the most polarizing ones; it is possible to recognize

which terms contribute the most to the bias; these quantities can also be used

to design two independent measures of similarity between news outlets, one

capturing their topic-choice bias, the other capturing their term-choice bias.

Maps of the media system could be created based on these metrics, and since

every step of this analysis has been done automatically, these could scale up

to very large sizes.

This chapter is organized as follows: in the next section we will give an

overview of the experiments we performed; in Section 3 we will describe how

we obtained and prepared the data, including the method we used to identify

news-items covering the same story in different outlets; in Section 4 we will

describe the outlet identification experiments using SVMs; in Section 5 we will

describe the kCCA experiments to isolate the topics in which polarization is

most present; in Section 6 we will show how similarity measures between

outlets can be designed based on the previous experiments; and in Section 7

we will discuss the results and, importantly, various recent results that are

closely related to this study, including work on detecting an author's perspective

based on the contents of a document.

2.2 Overview of the Experiments

An automatic system based on learning algorithms has been used to create a

corpus of news-items that appeared in the online versions of the four international

news outlets between 31st March 2005 and 14th April 2006. We have

performed three experiments on this dataset, aimed at extracting patterns

from the news content that relate to a bias in lexical choice when reporting

the same events, or a bias in choosing the events to cover.

The first experiment, using Support Vector Machines (4) and limited to

CNN and AJ, demonstrates how it is possible to identify the outlet of a news

item based on its content, and identifies the terms that are most helpful in this

discrimination. The second experiment, using Canonical Correlation Analysis

(14), identifies topics in the CNN/AJ part of the corpus, and then identifies

words that are discriminative for the two outlets in each topic. Finally, we

have generated maps reflecting the distance separating the four outlets, based

both on topic-choice and on lexical-choice features.
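
The idea behind the first experiment (learning to tell two outlets apart from content, then reading off the most discriminative terms) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the setup used in this study: the bag-of-words features, the plain subgradient solver for a linear SVM, the hyperparameters, and the toy sentences attributed to two hypothetical outlets, A (+1) and B (-1).

```python
from collections import Counter
import re


def tokens(text):
    """Lowercase a text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def score(w, b, text):
    """Signed decision value: positive means outlet A, negative means outlet B."""
    return sum(w[t] * c for t, c in Counter(tokens(text)).items()) + b


def train_linear_svm(docs, labels, epochs=50, lam=0.01, lr=0.1):
    """Fit a linear SVM on bag-of-words counts by subgradient descent.

    Each step shrinks the weights (L2 regularisation with strength lam) and,
    when the example violates the margin, moves the weights towards it
    (hinge loss). Labels must be +1 or -1. Hyperparameters are illustrative.
    """
    w, b = Counter(), 0.0
    feats = [Counter(tokens(d)) for d in docs]
    for _ in range(epochs):
        for x, y in zip(feats, labels):
            margin = y * (sum(w[t] * c for t, c in x.items()) + b)
            for t in list(w):
                w[t] *= 1 - lr * lam  # regularisation shrink
            if margin < 1:  # hinge loss is active for this example
                for t, c in x.items():
                    w[t] += lr * y * c
                b += lr * y
    return w, b


# Toy corpus: invented sentences for two hypothetical outlets.
docs = [
    "troops entered the city",   # outlet A
    "troops left the city",      # outlet A
    "forces entered the city",   # outlet B
    "forces left the city",      # outlet B
]
labels = [1, 1, -1, -1]

w, b = train_linear_svm(docs, labels)
ranked = sorted(w, key=w.get)  # most B-like term first, most A-like term last
print("most indicative of B:", ranked[0])
print("most indicative of A:", ranked[-1])
```

In a linear model of this kind, the terms with the largest positive and negative weights are the ones that contribute most to the discrimination, which is the sense in which a classifier can also point to the helpful terms.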

In order to separate the two effects (choice of topics and of lexicon) we developed

an algorithm to identify corresponding news-items in different outlets

(based on a combination of date and bag-of-words similarity). This means

that any patterns in lexical difference we identify are obtained by comparing

different versions of the same stories.
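
A matching step that combines date and bag-of-words similarity might look roughly like the following sketch. It is only an illustration under stated assumptions: the cosine similarity on raw word counts, the one-day date window (`max_days`), the similarity threshold (`min_sim`), the function names, and the example headlines are all hypothetical and are not taken from the procedure used in this study.

```python
from collections import Counter
from datetime import date
import math
import re


def bag_of_words(text):
    """Lowercase a text and count its word tokens."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def match_items(items_a, items_b, max_days=1, min_sim=0.3):
    """Pair items from two outlets that appear to cover the same story.

    Each item is a (publication_date, text) tuple. An item from outlet A is
    paired with its most similar item from outlet B, provided the two were
    published within max_days of each other and their cosine similarity is
    at least min_sim. Returns a list of (index_a, index_b, similarity).
    """
    bags_b = [bag_of_words(text) for _, text in items_b]
    pairs = []
    for i, (day_a, text_a) in enumerate(items_a):
        bag_a = bag_of_words(text_a)
        best_j, best_sim = None, min_sim
        for j, (day_b, _) in enumerate(items_b):
            if abs((day_a - day_b).days) > max_days:
                continue  # too far apart in time to be the same story
            sim = cosine(bag_a, bags_b[j])
            if sim >= best_sim:
                best_j, best_sim = j, sim
        if best_j is not None:
            pairs.append((i, best_j, best_sim))
    return pairs


# Invented example items for two hypothetical outlets.
outlet_a = [
    (date(2005, 4, 1), "Earthquake strikes coastal city, hundreds feared dead"),
    (date(2005, 4, 2), "Parliament votes on new budget plan"),
]
outlet_b = [
    (date(2005, 4, 1), "Hundreds feared dead after earthquake strikes city on the coast"),
    (date(2005, 4, 20), "Parliament votes on new budget plan"),
]

pairs = match_items(outlet_a, outlet_b)
print([(i, j) for i, j, _ in pairs])
```

The date window keeps textually similar items from different periods apart (the two budget items above are identical in wording but published 18 days apart), while the similarity threshold prevents unrelated items from the same day being paired.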
