2.3.1 Article Extraction from HTML Pages
We implemented a system that automatically retrieves news items every day
from different news outlets on the web. Part of this work was to automatically
recognize the article content within an HTML page. This component is also
based on SVMs, so as to obtain a general-purpose extractor that works with
any outlet; due to space limitations, it will not be described here in detail.
By running the crawler every day for more than a year over the four outlets
mentioned above, and extracting titles and contents from the HTML pages, we
obtained a total of more than 21,000 news items, most of which concern
Middle East politics and events. For each news item the outlet, date, title,
and content are known. Table 2.1 gives a precise description of the corpus
we created. Further filtering of the news stories will take place at a later
stage, since the matching algorithm discards all the news items that cannot
be paired reliably.
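The SVM-based extractor itself is not described in the chapter. As a rough illustration of the HTML-extraction step only, the sketch below pulls a title and paragraph text out of a page using the standard library; the `ArticleExtractor` name and the keep-`<title>`-and-`<p>` heuristic are our own simplification, not the authors' method.

```python
from html.parser import HTMLParser

class ArticleExtractor(HTMLParser):
    """Illustrative stand-in for the SVM-based extractor: it simply
    keeps the <title> text and the text of every <p> element."""
    def __init__(self):
        super().__init__()
        self.title = ""
        self.paragraphs = []
        self._in_title = False
        self._in_p = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "p":
            self._in_p = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "p":
            self._in_p = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._in_p:
            self.paragraphs.append(data.strip())

# Hypothetical page, standing in for a crawled news article.
page = ("<html><head><title>Talks resume</title></head>"
        "<body><p>Officials met today.</p></body></html>")
ex = ArticleExtractor()
ex.feed(page)
# ex.title is "Talks resume"; ex.paragraphs is ["Officials met today."]
```

A real extractor must of course separate article text from navigation, advertising, and comment markup, which is why the authors train a classifier rather than rely on fixed tags.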
TABLE 2.1: Number of news items collected
from different outlets.

outlet                          No. of news
Al Jazeera                             2142
CNN                                    6840
Detroit News                           2929
International Herald Tribune           9641
The news collection on which we performed the first part of our analysis
consisted of just two outlets, Al Jazeera and CNN, while in the second part
of our experiments we used all four news outlets to construct a map of outlets
based on topic similarity and a map based on vocabulary bias.
2.3.2 Data Preparation
The 21,552 documents generated by the algorithm described above are
plain text files. As part of data preparation we removed stop words and
replaced the remaining words with their stems, using a list of 523 stop
words and the Porter stemmer. After this initial cleaning we extracted a
list of words, bigrams, and trigrams (or terms in short) that appear at least
five times in the news collection. We used the extracted list of terms to define
the dimensions in the bag-of-words space (see Appendix B). For visualization
of results at the end of the pipeline, we also replaced each stemmed word
with the most frequent word from the news collection having the same stem.
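The preparation steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `toy_stem` is a crude stand-in for the Porter stemmer, the six-word stop list stands in for the 523-word list, and all function names are our own.

```python
from collections import Counter

STOP_WORDS = {"the", "a", "of", "in", "and", "to"}  # stand-in for the 523-word list

def toy_stem(word):
    # Crude stand-in for the Porter stemmer: strip a few common suffixes.
    for suf in ("ing", "ed", "s"):
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[: -len(suf)]
    return word

def preprocess(doc):
    # Remove stop words, then replace each remaining word with its stem.
    tokens = [w for w in doc.lower().split() if w not in STOP_WORDS]
    return [toy_stem(w) for w in tokens]

def extract_terms(docs, min_count=5):
    # Collect words, bigrams, and trigrams ("terms") that appear at
    # least min_count times in the whole collection.
    counts = Counter()
    for doc in docs:
        toks = preprocess(doc)
        for n in (1, 2, 3):
            counts.update(" ".join(toks[i:i + n])
                          for i in range(len(toks) - n + 1))
    return {t for t, c in counts.items() if c >= min_count}

def stem_to_surface(docs):
    # Map each stem back to its most frequent surface form, so that
    # results at the end of the pipeline stay human-readable.
    surface = Counter(w for doc in docs for w in doc.lower().split()
                      if w not in STOP_WORDS)
    best = {}
    for word, _ in surface.most_common():
        best.setdefault(toy_stem(word), word)
    return best

# Tiny hypothetical collection (repeated so terms pass the threshold).
docs = ["troops moved to the region"] * 5
terms = extract_terms(docs)       # contains e.g. "troop", "troop mov region"
best = stem_to_surface(docs)      # maps "mov" back to "moved"
```

The surviving terms define the dimensions of the bag-of-words space; each document then becomes a vector of term counts over those dimensions.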
The implementations of text mining and machine learning algorithms for