text preprocessing, Support Vector Machine, Kernel Canonical Correlation
Analysis, and Multidimensional scaling which were used in the experiments
were all taken from the Text Garden (8) software library.
2.3.3 Detection of Matching News Items
We are interested in investigating how different outlets report the same
events. To this end, the first step is to identify items from two news outlets, for
example Al Jazeera and CNN, that do refer to the same event. We call them
“mates,” and we call the problem of finding them the “matching problem.”
Here is an example of two mate articles; the first is from CNN and the
second is from Al Jazeera:
UK soldiers cleared in Iraqi death - Seven British soldiers
were acquitted on Thursday of charges of beating an innocent
Iraqi teenager to death with rifle butts. A judge at
a specially convened military court in eastern England ordered
the adjudicating panel to return 'not guilty' verdicts
against the seven because he did not believe there was sufficient
evidence against them, the Ministry of Defence said.
...
British murderers in Iraq acquitted - The judge at a court-martial
on Thursday dismissed murder charges against seven
soldiers, from the 3rd Battalion, the Parachute Regiment,
who're accused of murdering Iraqi teenager; claiming there's
insufficient evidence to secure a conviction, The Associated
Press reported Thursday. ...
To find matching news items, we used a method similar to the one used in
bioinformatics to detect homologous genes, called Best Reciprocal Hit (BRH).
Two genes are homologous (respectively, two articles are mates)
if they belong to different organisms (respectively, news outlets) and are each
other's nearest neighbor (under some appropriate similarity metric).
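The strict reciprocal criterion can be sketched as follows. This is a minimal illustration, not the Text Garden implementation: it assumes cosine similarity over lowercase, whitespace-split word counts as the similarity metric, and the function names (`cosine`, `best_hit`, `best_reciprocal_hits`) and the toy headlines are hypothetical.

```python
from __future__ import annotations

import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_hit(query: Counter, candidates: list[Counter]) -> int:
    """Index of the candidate most similar to the query (first one on ties)."""
    return max(range(len(candidates)), key=lambda j: cosine(query, candidates[j]))


def best_reciprocal_hits(outlet_a: list[str], outlet_b: list[str]) -> list[tuple[int, int]]:
    """Return pairs (i, j) where outlet_a[i] and outlet_b[j] are each other's best hit."""
    bags_a = [Counter(text.lower().split()) for text in outlet_a]
    bags_b = [Counter(text.lower().split()) for text in outlet_b]
    if not bags_a or not bags_b:
        return []
    a_to_b = [best_hit(bag, bags_b) for bag in bags_a]  # best match in B for each A item
    b_to_a = [best_hit(bag, bags_a) for bag in bags_b]  # best match in A for each B item
    # Keep a pair only when the best hit is reciprocated.
    return [(i, j) for i, j in enumerate(a_to_b) if b_to_a[j] == i]


# Toy data, invented for illustration only.
items_a = ["soldiers acquitted of beating teenager", "markets rally on rate cut"]
items_b = [
    "central bank rate cut lifts markets",
    "judge dismisses charges against soldiers",
    "storm hits coast",
]
mates = best_reciprocal_hits(items_a, items_b)
print(mates)
```

On this toy input the third item of `items_b` has no reciprocal partner and is left unmatched. Note that the sketch applies no minimum similarity, so two items with zero overlap could still be paired if each happens to be the other's first-ranked hit; a real system would likely add a threshold.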
We represented the documents as bags of words, and used the cosine between
the resulting vectors as the similarity measure. We also
relaxed the method somewhat: our algorithm operates on a list of the top n
nearest neighbors for each news item. The nearest neighbors for a particular
news item are selected only from the opposite news outlet and within a 15-day
time window around the news item. If two articles appear in each other's
nearest-neighbor lists and appeared in the news at most one day
apart, then the two articles are selected as mates. This ensures that the
documents have both word similarity and date similarity (we take advantage
of the fact that each news item has an assigned date and use the date to