Recently, significant attention has been paid to various aspects of text analysis that are relevant to the task of automating media content analysis. Opinion analysis, sentiment analysis, and topic categorization have all reached a reliable level of performance, and most major outlets now offer a free digital version over the internet. This creates the opportunity to automate a large part of the media-content analysis process.
From the technical point of view, coding by means of a questionnaire is akin to what machine learning researchers call “pattern matching”: the detection of a pre-specified property or pattern in a set of data. In classical content analysis, this is often done by matching keywords in certain positions. What is increasingly becoming possible, however, is the transition to “pattern discovery”: the detection of interesting properties in the data that do not belong to a pre-compiled list. In other words, if high-quality annotated data is available, the questionnaire used by human coders could be replaced by statistical patterns discovered by a machine learning algorithm.
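To make the distinction concrete, the following Python fragment contrasts the two regimes: a hand-coded keyword rule (pattern matching) versus a classifier trained on annotated examples (pattern discovery). It is only an illustrative sketch; the keyword list, toy articles, and labels are invented here and do not come from the study.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Toy annotated data, standing in for a corpus coded by human annotators.
articles = ["troops deployed near the border", "markets rallied after the vote"]
labels = ["conflict", "economy"]

# Pattern matching: detect a pre-specified property with a fixed keyword list.
CONFLICT_KEYWORDS = {"troops", "attack", "ceasefire"}
def mentions_conflict(text):
    return any(word in CONFLICT_KEYWORDS for word in text.lower().split())

# Pattern discovery: let a learning algorithm infer discriminative patterns
# from the annotated examples, instead of relying on a pre-compiled list.
vectorizer = TfidfVectorizer()
classifier = LinearSVC().fit(vectorizer.fit_transform(articles), labels)

print(mentions_conflict(articles[0]))
print(classifier.predict(vectorizer.transform(["soldiers on patrol"])))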
In this Chapter, we present a case study where subtle biases are detected in the content of four online media outlets: CNN, Al Jazeera (AJ), the International Herald Tribune (IHT), and the Detroit News (DN). We focus on two types of bias, corresponding to two degrees of freedom available to the outlets: the choice of stories to cover, and the choice of terms when reporting on a given story. We will show how algorithms from statistical learning theory (in this case, particularly kernel-based methods) can be combined with ideas from traditional statistics in order to detect and validate the presence of systematic biases in the content of news outlets.
We will ask the following questions: can we identify which outlet has written a given news-item? If so, after correcting for topic-choice bias, we would be able to claim that patterns in the language are responsible for this identification. Another, orthogonal, question we will address is: which news-items are more likely to be carried by a given outlet? Technically, we address this question by devising a measure of statistical similarity between two outlets, based on how much they overlap in their choice of stories to cover. Finally, we use a technique from cross-language text analysis to automatically decompose the set of topics covered in our corpus, in order to find the most polarizing topics, that is, those topics where term-choice bias is most evident.
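As an illustration of the topic-choice side, a simple overlap score can be computed from the sets of stories each outlet chooses to cover. The sketch below uses a Jaccard-style overlap on invented story sets; this is an assumption made for illustration and is not necessarily the exact similarity measure used in the chapter.

# Hypothetical coverage data: for each outlet, the set of story identifiers
# it chose to cover (invented here purely for illustration).
coverage = {
    "CNN": {"s1", "s2", "s3", "s5"},
    "AJ":  {"s2", "s3", "s4"},
    "IHT": {"s1", "s2", "s5"},
    "DN":  {"s1", "s6"},
}

def story_overlap(a, b):
    # Jaccard-style overlap between the story sets of two outlets.
    return len(a & b) / len(a | b)

outlets = sorted(coverage)
for i, u in enumerate(outlets):
    for v in outlets[i + 1:]:
        print(u, v, round(story_overlap(coverage[u], coverage[v]), 2))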
This case study will demonstrate the application of Support Vector Machines (SVM), kernel Canonical Correlation Analysis (kCCA), and Multidimensional Scaling (MDS) in the context of media content analysis. After reporting the results of our experiments, and their p-values, we will also speculate about possible interpretations of these results. While the first aspect will contain objective information, the interpretation will necessarily be subjective, and we will alert the reader to this fact.
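The sketch below shows, under stated assumptions, how two of these tools could be applied with scikit-learn: a linear SVM estimates how well the outlets can be told apart from term choice alone, and metric MDS embeds a hypothetical pairwise outlet-dissimilarity matrix in two dimensions. The corpus and the dissimilarity values are placeholders invented for illustration, and kCCA is omitted because scikit-learn provides only a linear CCA; none of the numbers correspond to the actual experiments.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score
from sklearn.manifold import MDS

# Placeholder corpus: in the real study, (text, outlet) pairs would come
# from the collected news items.
texts = ["article text one", "article text two", "article text three",
         "article text four", "article text five", "article text six"]
outlets = ["CNN", "AJ", "CNN", "IHT", "AJ", "IHT"]

# Term-choice bias: can a linear SVM recover the outlet from the text alone?
X = TfidfVectorizer().fit_transform(texts)
print("outlet identification accuracy:",
      cross_val_score(LinearSVC(), X, outlets, cv=2).mean())

# Topic-choice bias: embed a hypothetical outlet-dissimilarity matrix in 2-D
# with metric MDS, for visual inspection of how the outlets relate.
dissimilarity = np.array([[0.0, 0.6, 0.3, 0.7],
                          [0.6, 0.0, 0.5, 0.8],
                          [0.3, 0.5, 0.0, 0.6],
                          [0.7, 0.8, 0.6, 0.0]])
coords = MDS(n_components=2, dissimilarity="precomputed",
             random_state=0).fit_transform(dissimilarity)
print(dict(zip(["CNN", "AJ", "IHT", "DN"], coords.round(2).tolist())))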
While the emphasis of this Chapter is to demonstrate a new use of Statistical Learning technology, the experimental results are of interest in their own right, and can be summarized as follows: it is possible to identify which news outlet has written a given news-item.