reduce the search space for nearest-neighbors and to eliminate false positives
from the detected matches).
Note that by using a nearest-neighbor list with n > 1, one news article
can have multiple mates. For example, let A be an article from outlet 1, let
B and C be articles from outlet 2, and let n = 2. If B and C are on A's
nearest-neighbor list and A is on the nearest-neighbor lists of both B and C,
then both the pair (A, B) and the pair (A, C) are selected as mates.
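The mutual nearest-neighbor rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name mutual_nn_pairs, the similarity matrix sim, and its toy values are invented for the example, and the sketch omits the time window and the other filters used in the actual pipeline.

```python
def mutual_nn_pairs(sim, n):
    """Return index pairs (i, j) that are on each other's top-n lists.

    sim[i][j] is an assumed similarity score between article i of
    outlet 1 and article j of outlet 2 (higher means more similar).
    """
    rows = len(sim)
    cols = len(sim[0]) if rows else 0
    # Top-n outlet-2 neighbors of every outlet-1 article.
    nn1 = [set(sorted(range(cols), key=lambda j: -sim[i][j])[:n])
           for i in range(rows)]
    # Top-n outlet-1 neighbors of every outlet-2 article.
    nn2 = [set(sorted(range(rows), key=lambda i: -sim[i][j])[:n])
           for j in range(cols)]
    # A pair is kept only if the neighbor relation holds in both directions.
    return [(i, j) for i in range(rows) for j in sorted(nn1[i]) if i in nn2[j]]


# Toy data (invented): outlet 1 has articles A, D; outlet 2 has B, C, E.
sim = [
    [0.9, 0.8, 0.1],  # A vs. B, C, E
    [0.2, 0.3, 0.7],  # D vs. B, C, E
]
pairs_n1 = mutual_nn_pairs(sim, 1)
pairs_n2 = mutual_nn_pairs(sim, 2)
```

With n = 1 each article has at most one mate, while with n = 2 article A (index 0) is paired with both B and C, which mirrors the example in the text.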
The result is a small subset of news items for each outlet for which we are
reasonably sure there is a matching item in the other news outlet. Of course,
by tuning the parameter n one can create larger subsets, at the expense of
more noise in the matching process. As expected, CNN started with more
stories and focuses on more global issues, so only a small fraction of those
are also present in Al Jazeera. In turn, Al Jazeera has a more regional focus
and a smaller set of news, so a larger fraction of its stories are found to
have a mate in CNN.
TABLE 2.2: Number of discovered news pairs and the percentage of
the articles from each news outlet that appear in at least one pair. AJ
stands for Al Jazeera.
n       1     2     3     4     5     6     7     8     9     10
pairs   421   816   1101  1326  1506  1676  1865  2012  2169  2339
CNN     6%    9%    13%   14%   16%   17%   18%   19%   20%   21%
AJ      20%   33%   35%   39%   42%   45%   48%   51%   53%   56%
Table 2.2 shows the number of discovered pairs as a function of the
parameter n. The last two rows are the percentage of news articles from each
of the two outlets that appear in at least one pair. To evaluate the
discovered pairs we randomly selected a subset of 100 pairs for n = 1, 2 and
evaluated them by close inspection. The precision for n = 1 was found to be
96% and the precision for n = 2 was found to be 86%.
The number of discovered pairs increases significantly as the
nearest-neighbor list size n grows. Using the estimated precision, we can
approximate that for n = 1 the algorithm found around 400 correct pairs and
for n = 2 around 700 pairs. From this we can see that by increasing the
nearest-neighbor list size to n = 2 the precision of discovered pairs drops
by 10 percentage points, but at the same time the recall increases
significantly. We cannot give an accurate estimate of recall since we do not
have a complete list of matchings for our data.
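The approximation of correct pairs is the number of discovered pairs multiplied by the precision measured on the 100-pair sample. A minimal sketch of that arithmetic, using the figures from Table 2.2 and the reported precisions (the variable names are illustrative only):

```python
# Discovered pairs from Table 2.2 and the sampled precision for n = 1, 2.
pairs = {1: 421, 2: 816}
precision = {1: 0.96, 2: 0.86}

# Estimated number of correct pairs: discovered pairs times precision.
estimated_correct = {n: round(pairs[n] * precision[n]) for n in pairs}

# Ratio of correct pairs found at n = 2 relative to n = 1.
gain = estimated_correct[2] / estimated_correct[1]

print(estimated_correct, gain)
```

The estimates come out near 400 and 700, in line with the text. Since the precision is based on a sample of only 100 pairs per setting, these are rough figures rather than exact counts.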
By further increasing the parameter n, eventually each news item from CNN
would be matched with each news item from Al Jazeera within the time
window (15 days). Since we are interested in a large while still accurate set of