plain inspection of their textual content. In other words, one could cast the problem of popularity prediction as a text classification task, comprising a feature extraction step and a training step (carried out by some machine learning algorithm).
Text classification typically proceeds by processing a corpus of documents (bookmarked content items in our case), extracting all possible terms from them and treating these terms as the dimensions of the vector space in which the classification task will be carried out. This is commonly referred to as the Vector Space Model (VSM) in the Information Retrieval literature (Salton et al., 1975). However, due to the extremely large vocabulary of large corpora (millions of unique terms), the dimensionality of such a vector space is frequently prohibitive for direct application of the model. The problem becomes even worse when combinations of more than one term are considered as text features. For that reason, feature selection (or reduction) techniques are crucial in order to end up with a manageable set of dimensions that is sufficient for the classification task at hand.
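As a rough illustration of the VSM, the sketch below (in Python, assuming scikit-learn is available; the documents are hypothetical placeholders) builds a document-term matrix with one dimension per unique term. The vocabulary size printed at the end is exactly the dimensionality that feature selection must keep manageable.

from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical bookmarked items; in practice these would be the texts
# of the bookmarked articles.
documents = [
    "breaking news story on social bookmarking trends",
    "a tutorial on machine learning for text classification",
    "social news sites and the prediction of story popularity",
]

vectorizer = CountVectorizer()           # one dimension per unique term
X = vectorizer.fit_transform(documents)  # sparse document-term matrix

print(X.shape)                             # (3 documents, vocabulary size)
print(vectorizer.get_feature_names_out())  # the dimensions of the VSM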
In (Yang & Pedersen, 1997), a series of measures were evaluated with respect to their effectiveness in selecting the "proper" features (i.e., those features that would result in higher classification performance). The simplest of these measures is the Document Frequency (DF), i.e. the number of content items in which the particular feature (term) appears. Information Gain (IG) is a more sophisticated feature selection measure: it quantifies the number of bits of information obtained for class prediction (popular vs. non-popular) from knowing the presence or absence of a term in a document. Further, Mutual Information (MI) is an additional criterion for quantifying the importance of a feature in a classification problem; however, Yang & Pedersen (1997) found this measure ineffective for selecting discriminative features. In contrast, they found that the χ² statistic (CHI), which measures the lack of independence between a term and a class, was quite effective in that respect. Finally, another interesting feature selection measure evaluated in the aforementioned paper is Term Strength (TS), which estimates term importance based on how likely a term is to appear in closely related documents.
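To make two of these measures concrete, the sketch below computes Document Frequency and χ² scores over a small hypothetical labelled corpus and keeps only the highest-scoring terms. It assumes scikit-learn's chi2 and SelectKBest utilities and illustrates only the selection step, not the setup of the original study.

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

# Hypothetical bookmarked items and popularity labels (1 = popular).
docs = [
    "viral story about cats goes popular",
    "obscure technical report on compilers",
    "popular viral video of cats",
    "niche compilers paper with few readers",
]
labels = [1, 0, 1, 0]

# Binary term occurrences; summing the columns gives the Document Frequency.
vec = CountVectorizer(binary=True)
X = vec.fit_transform(docs)
df = np.asarray(X.sum(axis=0)).ravel()

# Chi-squared: lack of independence between each term and the class.
chi2_scores, _ = chi2(X, labels)

# Keep only the 5 terms with the highest chi-squared score.
selector = SelectKBest(chi2, k=5)
X_reduced = selector.fit_transform(X, labels)

print(dict(zip(vec.get_feature_names_out(), df)))  # DF per term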
Assuming that a subset of terms from the corpus under study has been selected as features, each content item is processed so that a feature vector (corresponding to the selected feature space) is extracted from it. Then, a variety of machine learning techniques can be applied in order to create a model that permits the classification of unknown pieces of text into one of the predefined classes. Previous efforts in the area of sentiment classification (Dave et al., 2003; Pang et al., 2002) have employed Support Vector Machines (SVM), Naïve Bayes, as well as Maximum Entropy classifiers to tackle this problem.
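A minimal sketch of this training and prediction step is given below, assuming scikit-learn implementations: MultinomialNB for Naïve Bayes, LinearSVC for the SVM, and LogisticRegression as a stand-in for a Maximum Entropy classifier (the two coincide in this setting). The tiny corpus and its labels are hypothetical placeholders.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Hypothetical training items and popularity labels.
train_docs = [
    "viral story about cats",
    "technical compiler report",
    "popular viral video",
    "obscure research notes",
]
train_labels = ["popular", "non-popular", "popular", "non-popular"]

# Feature vectors over the selected feature space (here: all tf-idf terms).
vec = TfidfVectorizer()
X_train = vec.fit_transform(train_docs)

models = [MultinomialNB(), LinearSVC(), LogisticRegression()]
for model in models:
    model.fit(X_train, train_labels)

# Classify an unseen piece of text into one of the predefined classes.
X_new = vec.transform(["another viral story about cats"])
for model in models:
    print(type(model).__name__, model.predict(X_new))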
In our case study, we investigated the potential of popularity prediction based purely on textual features. We conducted a series of feature extraction and text classification experiments on a corpus of 50,000 bookmarked articles. The feature selection measures used to reduce the dimensionality of the feature space were DF and CHI. For the classification task, three standard methods were used: Naïve Bayes (Duda et al., 2001), SVM (Cristianini & Shawe-Taylor, 2000) and C4.5 decision trees (Quinlan, 1993).
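To convey the flavour of this setup, the sketch below wires a DF threshold (min_df) and χ² selection into a scikit-learn pipeline feeding each of the three classifiers. DecisionTreeClassifier is used here as an approximation of C4.5 (scikit-learn implements CART rather than C4.5), and the toy corpus merely stands in for the 50,000 bookmarked articles of the case study.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Tiny placeholder corpus labelled as popular (1) or non-popular (0).
docs = [
    "viral cats story",
    "compiler internals report",
    "popular viral video of cats",
    "obscure technical notes",
    "breaking viral news story",
    "dry technical compiler report",
]
labels = [1, 0, 1, 0, 1, 0]

classifiers = {
    "NaiveBayes": MultinomialNB(),
    "SVM": LinearSVC(),
    "DecisionTree": DecisionTreeClassifier(),  # stand-in for C4.5
}

for name, clf in classifiers.items():
    pipeline = Pipeline([
        ("vec", CountVectorizer(min_df=2)),  # DF threshold: drop rare terms
        ("chi2", SelectKBest(chi2, k=3)),    # keep top terms by chi-squared
        ("clf", clf),
    ])
    pipeline.fit(docs, labels)
    print(name, pipeline.predict(["another viral story"]))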
Social Impact on SBS Usage
It was previously argued that story popularity depends on the textual content of the particular story. However, it is widely recognized that readers do not select their content in isolation. Users of social bookmarking applications form online relations and are constantly made aware of the preferences and content consumption patterns of other SBS users. Therefore, one would expect the emergence of viral phenomena in online content consumption within an SBS. The value of understanding and exploiting viral phenomena has already been acknowledged in online knowledge