matched by the algorithm described above.
The top 20 most important words for determining the outlet, when using
the full corpus, are:
Keywords for CNN: ap, insurgency, militants, national, police,
troops, china, vote, terrorists, authorities, united, united state, percent,
million, protests, suicide, years, allegations, program, day
Keywords for Al Jazeera: iraq, israel, iraqis, israeli, occupation,
americans, nuclear, aljazeera, palestinians, resistance, claim, withdraw,
attacks, guantanamo, mr, gaza stripped, war, shia, stripped, iranian
From the keywords we can see that topics about the Middle East ('iraq',
'israel', 'gaza') are more significant for Al Jazeera, while business ('percent',
'million'), elections ('vote'), and topics about other parts of the world ('china')
are more significant for CNN. We can also see some differences in the
vocabulary, for example 'insurgency', 'militants', and 'terrorists' versus 'resistance'.
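One common way to obtain keyword lists of this kind is to rank terms by the weight a linear classifier assigns to them. The sketch below assumes such a linear model, with positive weights favouring one outlet and negative weights the other. The function name `top_keywords` and the toy weights are hypothetical and are not taken from the experiment described here.

```python
def top_keywords(weights, k=20):
    """Return the k strongest terms for each side of a linear classifier.

    `weights` maps term -> weight. By assumption, positive weights favour
    outlet A and negative weights favour outlet B.
    """
    # Most positive weights first.
    outlet_a = sorted((t for t in weights if weights[t] > 0),
                      key=weights.get, reverse=True)[:k]
    # Most negative weights first.
    outlet_b = sorted((t for t in weights if weights[t] < 0),
                      key=weights.get)[:k]
    return outlet_a, outlet_b


# Toy weights, invented for illustration only.
toy_weights = {"militants": 0.9, "insurgency": 1.2, "resistance": -1.1,
               "occupation": -0.8, "said": 0.0}
a_terms, b_terms = top_keywords(toy_weights, k=2)
print(a_terms)
print(b_terms)
```

Terms with zero weight carry no evidence for either outlet, so the sketch leaves them out of both lists.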
These keywords are the result of using the full corpus. As mentioned above,
we want to isolate the effect due to lexical bias from the effect due to topic bias,
by focussing only on those stories that are covered by both outlets.
For this new comparison of the two news outlets we used the set of news
pairs which we obtained automatically with the news matching algorithm.
Identifying the correct news outlet for these articles is now a much harder task
since we remove any clues due to topic-choice, and we force the system to rely
solely on term-choice bias for distinguishing the two outlets. If we can train
a classifier which is better than random, then we can confidently state that
there is a significant and consistent difference in the vocabulary used by the
news outlets.
Results for ten-fold cross-validation on the news pairs are given in Tables
2.3 and 2.4. We can see that the BEP slowly increases to 87% as n
increases, and decreases to 79% as the time window increases. This matches our
observations from the previous section that increasing n also increases noise in
the data, while increasing the window size decreases noise.
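To make the evaluation measure concrete, the following is a minimal sketch that assumes BEP denotes the precision-recall break-even point, that is, the value at which precision equals recall. It further assumes the classifier outputs a real-valued score per article, and places the cut-off so that exactly as many articles are predicted positive as there are truly positive ones. The function name and the toy data are illustrative assumptions, not the code used in the experiment.

```python
def break_even_point(scores, labels):
    """Precision (= recall) when the top-P scored items are predicted
    positive, where P is the number of truly positive items.

    scores: real-valued classifier outputs, higher means "more positive".
    labels: True for positive items, False for negative ones.
    """
    n_pos = sum(1 for y in labels if y)
    if n_pos == 0:
        raise ValueError("need at least one positive example")
    # Rank items from most to least confidently positive.
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    true_pos = sum(1 for _, y in ranked[:n_pos] if y)
    # With n_pos predictions and n_pos positives, precision equals recall.
    return true_pos / n_pos


# Toy data, invented for illustration only.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [True, True, False, True, False, False]
bep = break_even_point(scores, labels)
print(f"BEP = {bep:.2f}")
```

In a ten-fold cross-validation setting, a value like this would be computed on each held-out fold and then averaged.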
TABLE 2.3: Results for outlet identification of a news item, using
different sizes of nearest-neighbor list. Time window size is fixed to 15
days.

n     1     2     3     4     5     6     7     8     9     10
BEP   73%   81%   84%   84%   85%   85%   85%   86%   87%   87%

The high result for low values of n and large sizes of time window indicates
that there is a bias in the choice of vocabulary used by the two news outlets
when covering the same events. To assess the significance of the results from
 