Table 6. Top 10 features ranked by χ² statistic for four Digg topics. The respective topic corpora consist of 50,000 documents each.

     Technology           World & Business           Entertainment        Sports
#    Term       χ²        Term             χ²        Term       χ²        Term       χ²
1    Apple      1215.88   Bush             2532.04   RIAA       1457.53   Amazing    1264.88
2    Windows    1080.11   president        1456.52   Movie      1346.79   Nfl        1181.59
3    Linux      995.02    Iraq             1182.86   movies     1015.56   game       1158.17
4    Google     945.72    house            1016.15   Industry   940.50    baseball   1146.77
5    Firefox    919.13    years            974.60    Time       727.12    time       1040.09
6    Just       784.83    administration   964.33    Just       672.92    history    975.46
7    Digg       773.09    congress         872.78    Show       671.09    year       972.57
8    Mac        756.55    officials        841.12    Says       647.37    just       894.85
9    OS         715.56    war              27.25     film       635.63    Top        892.82
10   check      643.97    federal          18.37     Lost       628.37    player     877.17
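As a side note (not part of the original study), χ² scores of the kind reported in Table 6 can be reproduced in spirit with a few lines of scikit-learn. The sketch below is illustrative only; the documents and topic labels are hypothetical placeholders used to show the computation.

```python
# Minimal sketch (not the authors' original pipeline): ranking terms by the
# chi-squared statistic for one topic versus the rest, using scikit-learn.
# `docs` and `labels` are hypothetical placeholders.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import chi2

docs = [
    "Apple ships a new Mac OS update",        # Technology
    "Linux and Firefox dominate Digg today",  # Technology
    "Bush addresses congress on Iraq",        # World & Business
    "Officials discuss the war budget",       # World & Business
]
labels = [1, 1, 0, 0]  # 1 = Technology, 0 = other topics

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)      # document-term count matrix
scores, _ = chi2(X, labels)             # chi-squared score for each term

# Rank terms by descending chi-squared score, as in Table 6
ranked = sorted(zip(vectorizer.get_feature_names_out(), scores),
                key=lambda pair: pair[1], reverse=True)
print(ranked[:10])
```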
Table 7. Popularity prediction results, namely Precision (P), Recall (R) and F-measure (F). Classifier abbreviations: NB → Naïve Bayes, SVM → Support Vector Machines, C4.5 → Quinlan's decision trees. Tests were carried out using 500 features on a randomly selected corpus of 50,000 stories, with 10-fold cross-validation to obtain the reported measures.

                             Popular                 Non-popular
Classifier    Accuracy (%)   P      R      F         P      R      F
NB (CHI)      67.21          0.160  0.426  0.233     0.903  0.705  0.791
NB (DF)       88.32          1.000  0.001  0.003     0.883  1.000  0.938
SVM (CHI)     88.10          0.130  0.003  0.006     0.883  0.997  0.937
C4.5 (CHI)    88.18          0.222  0.004  0.008     0.883  0.998  0.937
Although in terms of accuracy the combination of Naïve Bayes with DF-selected features appears to perform best, a closer examination of the Precision and Recall measures obtained separately for the Popular and Non-popular classes provides a different insight. Specifically, all classifiers have trouble achieving decent classification performance when the input stories are Popular. In other words, the classifiers can accurately predict that a story will remain Non-popular (when that is indeed the case), but they usually fail to identify Popular stories. The combination of Naïve Bayes with CHI-selected features is better in that respect. This problem is related to the well-recognized class imbalance problem in machine learning (Japkowicz, 2000). Indeed, in the 50,000 stories of the evaluation dataset the ratio of Popular to Non-popular stories is 0.132, i.e. there is barely more than one Popular story for every ten Non-popular ones.
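For illustration only, the sketch below mirrors the evaluation setup just described: χ²-based selection of 500 features feeding a Naïve Bayes classifier, with precision, recall and F-measure reported per class under stratified 10-fold cross-validation. The synthetic counts and the scikit-learn pipeline are assumptions standing in for the original Digg data and code, which are not reproduced here.

```python
# Sketch of the evaluation described above (chi-squared feature selection +
# Naive Bayes, per-class precision/recall under 10-fold cross-validation).
# The synthetic counts below are placeholders, not the original Digg data.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.metrics import classification_report

rng = np.random.default_rng(0)
n_stories, n_terms = 5000, 2000
y = (rng.random(n_stories) < 0.13).astype(int)     # ~13% Popular, imbalanced
X = rng.poisson(1.0, size=(n_stories, n_terms))    # synthetic term counts

model = Pipeline([
    ("select", SelectKBest(chi2, k=500)),   # keep the 500 highest-scoring features
    ("nb", MultinomialNB()),
])

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
y_pred = cross_val_predict(model, X, y, cv=cv)

# Precision, recall and F-measure reported separately per class, which is
# what exposes the poor performance on the minority (Popular) class.
print(classification_report(y, y_pred, target_names=["Non-popular", "Popular"]))
```

On data this imbalanced, such a report typically shows high recall for the majority (Non-popular) class and very low recall for the minority (Popular) class, which is the pattern visible in Table 7.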
Similar results are obtained when the feature selection and classification process is applied separately per topic; Table 8 provides the respective evidence.
 