Detection of Bias in Media Outlets with Statistical Learning Methods - Text Mining: Classification, Clustering, and Applications


reduce the search space for nearest-neighbors and to eliminate false positives

from the detected matches).

Note that by using a nearest-neighbor list with n > 1, one news article can have multiple mates. For example, let A be an article from outlet 1, let B and C be articles from outlet 2, and let n ≥ 2. If B and C are on A's nearest-neighbor list and A is on both B's and C's nearest-neighbor lists, then both the pair (A, B) and the pair (A, C) are selected as mates.
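The mutual nearest-neighbor rule described here can be illustrated with a minimal Python sketch. It assumes the ranked nearest-neighbor lists have already been computed; the function name `select_mates` and the dictionary layout are hypothetical and are not taken from the original text.

```python
from typing import Dict, List, Set, Tuple


def select_mates(
    neighbors_1: Dict[str, List[str]],
    neighbors_2: Dict[str, List[str]],
    n: int,
) -> Set[Tuple[str, str]]:
    """Return pairs (a, b) such that b is among a's n nearest neighbors
    and a is among b's n nearest neighbors.

    neighbors_1 maps each article of outlet 1 to its ranked neighbors in
    outlet 2; neighbors_2 maps the other way (hypothetical data layout).
    """
    mates = set()
    for a, ranked in neighbors_1.items():
        for b in ranked[:n]:
            # Keep the pair only if the relation holds in both directions.
            if a in neighbors_2.get(b, [])[:n]:
                mates.add((a, b))
    return mates


# The example from the text: A is from outlet 1, B and C are from outlet 2.
neighbors_outlet_1 = {"A": ["B", "C"]}
neighbors_outlet_2 = {"B": ["A"], "C": ["A"]}

print(sorted(select_mates(neighbors_outlet_1, neighbors_outlet_2, n=2)))
print(sorted(select_mates(neighbors_outlet_1, neighbors_outlet_2, n=1)))
```

With n = 2 both (A, B) and (A, C) are selected, whereas with n = 1 only the top-ranked neighbor B can be paired with A.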

The result is a small subset of news items for each outlet for which we are reasonably sure there is a matching item in the other news outlet. Of course, by tuning the parameter n one can create larger subsets, at the expense of more noise in the matching process. As expected, CNN started with more stories and focuses on more global issues, so only a small fraction of those are also present in Al Jazeera. In turn, Al Jazeera has a more regional focus and a smaller set of news, so a larger fraction of its stories are found to have a mate in CNN.


TABLE 2.2: Number of discovered news pairs and the percentage of the articles from each news outlet that appear in at least one pair. AJ stands for Al Jazeera.

n       1     2     3     4     5     6     7     8     9     10
pairs   421   816   1101  1326  1506  1676  1865  2012  2169  2339
CNN     6%    9%    13%   14%   16%   17%   18%   19%   20%   21%
AJ      20%   33%   35%   39%   42%   45%   48%   51%   53%   56%

Table 2.2 shows the number of discovered pairs as a function of the parameter n. The last two rows give the percentage of news articles from each of the two outlets that appear in at least one pair. To evaluate the discovered pairs, we randomly selected a subset of 100 pairs for n = 1 and n = 2 and evaluated them by close inspection. The precision for n = 1 was found to be 96% and the precision for n = 2 was found to be 86%.

The number of discovered pairs increases significantly as the nearest-neighbor list size n grows. Using the estimated precision, we can approximate that for n = 1 the algorithm found around 400 correct pairs and for n = 2 around 700 correct pairs. From this we can see that increasing the nearest-neighbor list size to n = 2 lowers the precision of the discovered pairs by 10 percentage points, but at the same time the recall increases significantly. We cannot give an accurate estimate of recall since we do not have a complete list of matchings for our data.
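The estimates of roughly 400 and 700 correct pairs follow from multiplying the number of discovered pairs in Table 2.2 by the precision measured on the inspected sample. A small sketch of that arithmetic is given below; the helper name `estimated_correct_pairs` is hypothetical, and the result is only as reliable as the sampled precision.

```python
def estimated_correct_pairs(num_pairs: int, precision: float) -> float:
    """Approximate the number of correct pairs as pairs * sampled precision."""
    return num_pairs * precision


# (n, discovered pairs from Table 2.2, precision measured on the sample)
observations = [(1, 421, 0.96), (2, 816, 0.86)]

for n, pairs, precision in observations:
    estimate = round(estimated_correct_pairs(pairs, precision))
    print(f"n = {n}: about {estimate} correct pairs")
```

Going from n = 1 to n = 2 thus adds roughly 300 estimated correct pairs, which is why recall rises even though precision falls.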

By further increasing the parameter n, eventually each news item from CNN would be matched with each of the news items from Al Jazeera within the time window (15 days). Since we are interested in a large yet still accurate set of
