2.3.1 Article Extraction from HTML Pages
We implemented a system that automatically retrieves news items every day
from different news outlets on the web. Part of this work was to automatically
recognize the article content within an HTML page. This component is also
based on SVMs, so as to obtain a general-purpose extractor that works with
any outlet; due to space limitations, it will not be described here in detail.
By running the crawler every day for more than a year over the four outlets
mentioned above, and extracting titles and contents from the HTML pages, we
obtained a total of more than 21,000 news items, most of which concern
Middle East politics and events. For each news item the outlet, date, title,
and content are known. Table 2.1 gives a precise description of the corpus
we created. Further filtering of the news stories will take place at a later
stage, since the matching algorithm discards all the news items that cannot
be paired reliably.
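The SVM-based extractor itself is not described in the chapter. As a rough illustration of the HTML-extraction step only, the sketch below pulls a title and paragraph text out of a page using the standard library; the `ArticleExtractor` name and the keep-`<title>`-and-`<p>` heuristic are our own simplification, not the authors' method.

```python
from html.parser import HTMLParser

class ArticleExtractor(HTMLParser):
    """Illustrative stand-in for the SVM-based extractor: it simply
    keeps the <title> text and the text of every <p> element."""
    def __init__(self):
        super().__init__()
        self.title = ""
        self.paragraphs = []
        self._in_title = False
        self._in_p = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "p":
            self._in_p = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "p":
            self._in_p = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._in_p:
            self.paragraphs.append(data.strip())

# Hypothetical page, standing in for a crawled news article.
page = ("<html><head><title>Talks resume</title></head>"
        "<body><p>Officials met today.</p></body></html>")
ex = ArticleExtractor()
ex.feed(page)
# ex.title is "Talks resume"; ex.paragraphs is ["Officials met today."]
```

A real extractor must of course separate article text from navigation, advertising, and comment markup, which is why the authors train a classifier rather than rely on fixed tags.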
TABLE 2.1: Number of news items collected
from different outlets.

outlet                          No. of news
Al Jazeera                             2142
CNN                                    6840
Detroit News                           2929
International Herald Tribune           9641
The news collection on which we performed the first part of our analysis
consisted of just two outlets, Al Jazeera and CNN, while in the second part
of our experiments we used all four news outlets to construct a map of outlets
based on topic similarity and a map based on vocabulary bias.
2.3.2 Data Preparation
The 21,552 documents generated by the algorithm described above are
plain text files. As part of data preparation we removed stop words and
replaced the remaining words with their stems, using a list of 523 stop
words and the Porter stemmer. After this initial cleaning we extracted a
list of words, bigrams, and trigrams (or terms in short) that appear at least
five times in the news collection. We used the extracted list of terms to define
the dimensions in the bag-of-words space (see Appendix B). For visualization
of results at the end of the pipeline, we also replaced each stemmed word
with the most frequent word from the news collection having the same stem.
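The preparation steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `toy_stem` is a crude stand-in for the Porter stemmer, the six-word stop list stands in for the 523-word list, and all function names are our own.

```python
from collections import Counter

STOP_WORDS = {"the", "a", "of", "in", "and", "to"}  # stand-in for the 523-word list

def toy_stem(word):
    # Crude stand-in for the Porter stemmer: strip a few common suffixes.
    for suf in ("ing", "ed", "s"):
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[: -len(suf)]
    return word

def preprocess(doc):
    # Remove stop words, then replace each remaining word with its stem.
    tokens = [w for w in doc.lower().split() if w not in STOP_WORDS]
    return [toy_stem(w) for w in tokens]

def extract_terms(docs, min_count=5):
    # Collect words, bigrams, and trigrams ("terms") that appear at
    # least min_count times in the whole collection.
    counts = Counter()
    for doc in docs:
        toks = preprocess(doc)
        for n in (1, 2, 3):
            counts.update(" ".join(toks[i:i + n])
                          for i in range(len(toks) - n + 1))
    return {t for t, c in counts.items() if c >= min_count}

def stem_to_surface(docs):
    # Map each stem back to its most frequent surface form, so that
    # results at the end of the pipeline stay human-readable.
    surface = Counter(w for doc in docs for w in doc.lower().split()
                      if w not in STOP_WORDS)
    best = {}
    for word, _ in surface.most_common():
        best.setdefault(toy_stem(word), word)
    return best

# Tiny hypothetical collection (repeated so terms pass the threshold).
docs = ["troops moved to the region"] * 5
terms = extract_terms(docs)       # contains e.g. "troop", "troop mov region"
best = stem_to_surface(docs)      # maps "mov" back to "moved"
```

The surviving terms define the dimensions of the bag-of-words space; each document then becomes a vector of term counts over those dimensions.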
The implementations of text mining and machine learning algorithms for