Recently, significant attention has been paid to various aspects of text analysis that are relevant to the task of automating media content analysis. Opinion analysis, sentiment analysis, and topic categorization have all reached a reliable level of performance, and most major outlets now offer a free digital version over the internet. This creates the opportunity to automate a large part of the media-content analysis process.
From the technical point of view, coding by means of a questionnaire is akin to what machine learning researchers call “pattern matching”: the detection of a pre-specified property or pattern in a set of data. In classical content analysis, this is often done by matching keywords in certain positions. What is increasingly becoming possible, however, is the transition to “pattern discovery”: the detection of interesting properties in the data that do not belong to a pre-compiled list. In other words, if high-quality annotated data is available, the questionnaire used by human coders could be replaced by statistical patterns discovered by a machine learning algorithm.
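To make the distinction concrete, the following Python fragment contrasts the two regimes: a hand-coded keyword rule (pattern matching) versus a classifier trained on annotated examples (pattern discovery). It is only an illustrative sketch; the keyword list, toy articles, and labels are invented here and do not come from the study.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Toy annotated data, standing in for a corpus coded by human annotators.
articles = ["troops deployed near the border", "markets rallied after the vote"]
labels = ["conflict", "economy"]

# Pattern matching: detect a pre-specified property with a fixed keyword list.
CONFLICT_KEYWORDS = {"troops", "attack", "ceasefire"}
def mentions_conflict(text):
    return any(word in CONFLICT_KEYWORDS for word in text.lower().split())

# Pattern discovery: let a learning algorithm infer discriminative patterns
# from the annotated examples, instead of relying on a pre-compiled list.
vectorizer = TfidfVectorizer()
classifier = LinearSVC().fit(vectorizer.fit_transform(articles), labels)

print(mentions_conflict(articles[0]))
print(classifier.predict(vectorizer.transform(["soldiers on patrol"])))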
In this Chapter, we present a case study where subtle biases are detected in the content of four online media outlets: CNN, Al Jazeera (AJ), the International Herald Tribune (IHT), and the Detroit News (DN). We focus on two types of bias, corresponding to two degrees of freedom available to the outlets: the choice of stories to cover, and the choice of terms when reporting on a given story. We will show how algorithms from statistical learning theory (in this case, particularly kernel-based methods) can be combined with ideas from traditional statistics in order to detect and validate the presence of systematic biases in the content of news outlets.
We will ask the following questions: can we identify which outlet has written a given news-item? If so, after correcting for topic-choice bias, we would be able to claim that patterns in the language are responsible for this identification. Another, orthogonal, question we will address is: which news-items are more likely to be carried by a given outlet? Technically, we address this question by devising a measure of statistical similarity between two outlets, based on how much they overlap in their choice of stories to cover. Finally, we use a technique from cross-language text analysis to automatically decompose the set of topics covered in our corpus, in order to find the most polarizing topics, that is, those topics where term-choice bias is most evident.
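As an illustration of the topic-choice side, a simple overlap score can be computed from the sets of stories each outlet chooses to cover. The sketch below uses a Jaccard-style overlap on invented story sets; this is an assumption made for illustration and is not necessarily the exact similarity measure used in the chapter.

# Hypothetical coverage data: for each outlet, the set of story identifiers
# it chose to cover (invented here purely for illustration).
coverage = {
    "CNN": {"s1", "s2", "s3", "s5"},
    "AJ":  {"s2", "s3", "s4"},
    "IHT": {"s1", "s2", "s5"},
    "DN":  {"s1", "s6"},
}

def story_overlap(a, b):
    # Jaccard-style overlap between the story sets of two outlets.
    return len(a & b) / len(a | b)

outlets = sorted(coverage)
for i, u in enumerate(outlets):
    for v in outlets[i + 1:]:
        print(u, v, round(story_overlap(coverage[u], coverage[v]), 2))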
This case study will demonstrate the application of Support Vector Machines (SVM), kernel Canonical Correlation Analysis (kCCA), and Multidimensional Scaling (MDS) in the context of media content analysis. After reporting the results of our experiments, and their p-values, we will also speculate about possible interpretations of these results. While the first aspect will contain objective information, the interpretation will necessarily be subjective, and we will alert the reader to this fact.
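The sketch below shows, under stated assumptions, how two of these tools could be applied with scikit-learn: a linear SVM estimates how well the outlets can be told apart from term choice alone, and metric MDS embeds a hypothetical pairwise outlet-dissimilarity matrix in two dimensions. The corpus and the dissimilarity values are placeholders invented for illustration, and kCCA is omitted because scikit-learn provides only a linear CCA; none of the numbers correspond to the actual experiments.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score
from sklearn.manifold import MDS

# Placeholder corpus: in the real study, (text, outlet) pairs would come
# from the collected news items.
texts = ["article text one", "article text two", "article text three",
         "article text four", "article text five", "article text six"]
outlets = ["CNN", "AJ", "CNN", "IHT", "AJ", "IHT"]

# Term-choice bias: can a linear SVM recover the outlet from the text alone?
X = TfidfVectorizer().fit_transform(texts)
print("outlet identification accuracy:",
      cross_val_score(LinearSVC(), X, outlets, cv=2).mean())

# Topic-choice bias: embed a hypothetical outlet-dissimilarity matrix in 2-D
# with metric MDS, for visual inspection of how the outlets relate.
dissimilarity = np.array([[0.0, 0.6, 0.3, 0.7],
                          [0.6, 0.0, 0.5, 0.8],
                          [0.3, 0.5, 0.0, 0.6],
                          [0.7, 0.8, 0.6, 0.0]])
coords = MDS(n_components=2, dissimilarity="precomputed",
             random_state=0).fit_transform(dissimilarity)
print(dict(zip(["CNN", "AJ", "IHT", "DN"], coords.round(2).tolist())))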
While the emphasis of this Chapter is to demonstrate a new use of Statistical Learning technology, the experimental results are of interest in their own right, and can be summarized as follows: it is possible to identify which news outlet has written a given news-item.