all terms from the Digg stories, it was possible to
create the two-way contingency matrix of Table 4
for each term t. In this matrix, A is the number
of popular stories containing term t, B the number
of non-popular stories containing t, and C and D
are the numbers of popular and non-popular stories,
respectively, that do not contain term t.
Then, the χ 2 statistic is calculated based on the
following equation:
χ²(t) = N × (A×D − C×B)² / [(A+C) × (B+D) × (A+B) × (C+D)]   (8)

where N = A + B + C + D is the total number of stories.
The χ² statistic naturally takes a value of zero if
term t and the class of Popular stories are independent.
The measure is only problematic when any of the
contingency table cells is lightly populated, which
is the case for low-frequency terms. For that reason,
we filter out low-frequency terms prior to the
calculation of Equation 8. In addition, stop words
and numeric strings were also filtered out of the
feature selection process. In order to keep the
experiments simple, no stemming or other term
normalization was carried out. Table 5 lists the
top 30 terms of this story set along with their χ²
scores. Although not very informative to the human
inspector, these keywords can be considered the most
appropriate (from a text feature perspective) for
distinguishing between Popular and Non-popular stories.

In order to get a more fine-grained view of such
keywords per topic, we also calculate the χ² scores
on independent corpora that contain stories only from
particular topics. Table 6 provides such a
topic-specific χ²-based ranking of terms.

After ranking the terms of each corpus by their
class separation ability, it is possible to select
the top K of them and use them in an automatic text
classification scheme. Table 7 presents the results
achieved by such a classification scheme, i.e. the
success of predicting the popularity of Digg stories,
where three classifiers are compared: a Naïve Bayes
classifier, an SVM, and a C4.5 decision tree. The
dataset used consists of 50,000 randomly selected
stories, and the performance metrics were calculated
using 10-fold cross-validation (i.e. repeatedly
splitting the dataset into 10 parts, using nine of
them for training and one for testing).
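The 10-fold cross-validation procedure described above (split the dataset into 10 parts, train on nine, test on one, repeat) can be sketched in a few lines of Python. The function and callback names below are our own, not the chapter's; the classifier itself is abstracted behind a hypothetical train_and_eval callback.

```python
import random

def ten_fold_cv(items, train_and_eval, k=10, seed=0):
    """Estimate performance by k-fold cross-validation.

    train_and_eval(train_idx, test_idx) is a hypothetical callback that
    fits a classifier on the training indices and returns a score
    (e.g. accuracy) measured on the test indices.
    """
    idx = list(range(len(items)))
    random.Random(seed).shuffle(idx)       # random split, reproducible
    folds = [idx[i::k] for i in range(k)]  # k roughly equal parts
    scores = []
    for i in range(k):
        test_idx = folds[i]
        # the remaining k-1 folds form the training set
        train_idx = [j for f in folds[:i] + folds[i + 1:] for j in f]
        scores.append(train_and_eval(train_idx, test_idx))
    return sum(scores) / k                 # mean score over the k folds
```

Each story ends up in the test set exactly once, so the averaged score uses every labeled example for evaluation without ever testing on training data.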
Table 4. Two-way contingency table of term t

                        Popular    Non-popular
  Term exists              A            B
  Term doesn't exist       C            D
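Given the four cell counts of Table 4, the χ² score of Equation 8 is a one-line computation. A minimal Python sketch (the function name and the zero-denominator guard are our own additions):

```python
def chi_squared(a, b, c, d):
    """Chi-squared statistic of Equation 8 for a term's contingency table.

    a: popular stories containing the term
    b: non-popular stories containing the term
    c: popular stories not containing the term
    d: non-popular stories not containing the term
    """
    n = a + b + c + d  # total number of stories
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:     # degenerate table, e.g. a term in every story
        return 0.0
    return n * (a * d - c * b) ** 2 / denom
```

When the term and the Popular class are independent, A×D equals C×B and the score is zero, matching the behavior described in the text; the lightly-populated-cell problem shows up here as a tiny denominator inflating the score.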
Table 5. Top 30 text features based on χ² statistic

   #  Term        χ² (×10⁶)
   1  see            61.5
   2  drive          60.4
   3  japanese       28.5
   4  video          17.2
   5  google         12.9
   6  long           11.7
   7  cool           11.0
   8  term            9.9
   9  look            8.8
  10  high            7.9
  11  seen            7.7
  12  nintendo        7.6
  13  program         5.6
  14  way             5.4
  15  gets            5.2
  16  computer        4.9
  17  need            4.8
  18  want            4.7
  19  play            4.7
  20  job             4.3
  21  news            4.2
  22  making          4.1
  23  breaking        4.0
  24  amazing         3.5
  25  say             3.4
  26  coolest         3.4
  27  release         3.3
  28  right           2.9
  29  xbox            2.8
  30  looks            .5
 