all terms from the Digg stories, it was possible to
create the two-way contingency matrix of Table 4
for each term t. In this matrix, A is the number
of popular stories containing term t, B the number
of non-popular stories containing t, and C and D
are the numbers of popular and non-popular stories,
respectively, that do not contain term t.
Then, the χ 2 statistic is calculated based on the
following equation:
χ²(t) = N × (A×D − C×B)² / [(A+C) × (B+D) × (A+B) × (C+D)]   (8)

where N = A + B + C + D is the total number of stories.
The χ² statistic naturally takes a value of zero if
term t and the class of Popular stories are independent.
The measure is only problematic when any of the
contingency table cells is lightly populated, which
is the case for low-frequency terms. For that reason,
we filter out low-frequency terms prior to the
calculation of Equation 8. In addition, stop words
and numeric strings were also filtered out of the
feature selection process. In order to keep the
experiments simple, no stemming or other term
normalization was carried out. Table 5 lists the
top 30 terms of this story set along with their χ²
scores. Although not very informative to the human
inspector, these keywords can be considered the most
appropriate (from a text feature perspective) for
distinguishing between Popular and Non-popular stories.

In order to get a more fine-grained view of such
keywords per topic, we also calculate the χ² scores
on independent corpora that contain stories only from
particular topics. Table 6 provides such a
topic-specific χ²-based ranking of terms.

After ranking the terms of each corpus by their
class separation ability, it is possible to select
the top K of them and use them in an automatic text
classification scheme. Table 7 presents the results
achieved by such a classification scheme, i.e. the
success of predicting the popularity of Digg stories,
where three classifiers are compared: a Naïve Bayes
classifier, an SVM, and a C4.5 decision tree. The
dataset used consists of 50,000 randomly selected
stories, and the performance metrics were calculated
using 10-fold cross-validation (i.e. repeatedly
splitting the dataset into 10 parts, using nine of
them for training and one for testing).
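The 10-fold cross-validation procedure described above (split the dataset into 10 parts, train on nine, test on one, repeat) can be sketched in a few lines of Python. The function and callback names below are our own, not the chapter's; the classifier itself is abstracted behind a hypothetical train_and_eval callback.

```python
import random

def ten_fold_cv(items, train_and_eval, k=10, seed=0):
    """Estimate performance by k-fold cross-validation.

    train_and_eval(train_idx, test_idx) is a hypothetical callback that
    fits a classifier on the training indices and returns a score
    (e.g. accuracy) measured on the test indices.
    """
    idx = list(range(len(items)))
    random.Random(seed).shuffle(idx)       # random split, reproducible
    folds = [idx[i::k] for i in range(k)]  # k roughly equal parts
    scores = []
    for i in range(k):
        test_idx = folds[i]
        # the remaining k-1 folds form the training set
        train_idx = [j for f in folds[:i] + folds[i + 1:] for j in f]
        scores.append(train_and_eval(train_idx, test_idx))
    return sum(scores) / k                 # mean score over the k folds
```

Each story ends up in the test set exactly once, so the averaged score uses every labeled example for evaluation without ever testing on training data.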
Table 4. Two-way contingency table of term t

                        Popular    Non-popular
  Term exists              A            B
  Term doesn't exist       C            D
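Given the four cell counts of Table 4, the χ² score of Equation 8 is a one-line computation. A minimal Python sketch (the function name and the zero-denominator guard are our own additions):

```python
def chi_squared(a, b, c, d):
    """Chi-squared statistic of Equation 8 for a term's contingency table.

    a: popular stories containing the term
    b: non-popular stories containing the term
    c: popular stories not containing the term
    d: non-popular stories not containing the term
    """
    n = a + b + c + d  # total number of stories
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:     # degenerate table, e.g. a term in every story
        return 0.0
    return n * (a * d - c * b) ** 2 / denom
```

When the term and the Popular class are independent, A×D equals C×B and the score is zero, matching the behavior described in the text; the lightly-populated-cell problem shows up here as a tiny denominator inflating the score.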
Table 5. Top 30 text features based on χ² statistic

   #  Term        χ² (×10⁶)
   1  see            61.5
   2  drive          60.4
   3  japanese       28.5
   4  video          17.2
   5  google         12.9
   6  long           11.7
   7  cool           11.0
   8  term            9.9
   9  look            8.8
  10  high            7.9
  11  seen            7.7
  12  nintendo        7.6
  13  program         5.6
  14  way             5.4
  15  gets            5.2
  16  computer        4.9
  17  need            4.8
  18  want            4.7
  19  play            4.7
  20  job             4.3
  21  news            4.2
  22  making          4.1
  23  breaking        4.0
  24  amazing         3.5
  25  say             3.4
  26  coolest         3.4
  27  release         3.3
  28  right           2.9
  29  xbox            2.8
  30  looks            .5
 