[ 33 ] to classify documents from different datasets. The micro-averaged and
macro-averaged precision-recall break-even points (BEP) were used to compare
classification performance against the baseline approach. The proposed approach
improved on the baseline method by 2-5%.
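The difference between the two averaging schemes can be illustrated with a minimal sketch. It assumes that per-category counts of true positives, false positives, and false negatives are already available; the function names and the example counts are hypothetical and do not come from the cited work. The sketch shows only the averaging step, not the threshold sweep needed to locate the point where precision equals recall.

```python
from typing import Dict, Tuple

# Hypothetical per-category counts: (true positives, false positives, false negatives).
Counts = Dict[str, Tuple[int, int, int]]


def precision_recall(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    """Return (precision, recall), using 0.0 when a denominator is zero."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def macro_average(counts: Counts) -> Tuple[float, float]:
    """Average the per-category scores, so every category weighs the same."""
    pairs = [precision_recall(*c) for c in counts.values()]
    n = len(pairs)
    return sum(p for p, _ in pairs) / n, sum(r for _, r in pairs) / n


def micro_average(counts: Counts) -> Tuple[float, float]:
    """Pool the counts over all categories, so every decision weighs the same."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return precision_recall(tp, fp, fn)


if __name__ == "__main__":
    # Illustrative counts only: one large category and one small one.
    example = {"sports": (8, 2, 2), "politics": (1, 1, 3)}
    print("macro:", macro_average(example))
    print("micro:", micro_average(example))
```

Macro-averaging lets small categories influence the score as much as large ones, whereas micro-averaging is dominated by the categories with the most documents.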
A new classification model based on Wikipedia information was proposed by
Schönhofen in [ 45 ]. The relatedness between a document and a category of the
Wikipedia taxonomy was computed by evaluating the similarity between that
document and the titles of articles classified under that category. Since each article
can belong to several categories, relevance statistics were used to rank the
categories. The method was tested on the bodies of Wikipedia articles, which were not
used to build the model, and on news datasets. The best results were achieved by combining
Wikipedia categorization with the top terms identified by tf-idf. For example, the
accuracy achieved on the news dataset was around 89%.
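The general idea of ranking categories through title matches can be sketched as follows. This is a deliberately simplified illustration, not the weighting scheme of [ 45 ]: it assumes a hypothetical mapping from article titles to their categories, treats a title as matched when all of its words occur in the document, and scores each category by its number of matched titles.

```python
import re
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank_categories(document: str,
                    title_to_categories: Dict[str, List[str]],
                    top_k: int = 3) -> List[str]:
    """Rank categories by the number of their article titles found in the document."""
    doc_tokens = set(tokenize(document))
    scores: Counter = Counter()
    for title, categories in title_to_categories.items():
        title_tokens = tokenize(title)
        # Assumption: a title matches when every one of its words occurs in the document.
        if title_tokens and all(t in doc_tokens for t in title_tokens):
            for category in categories:
                scores[category] += 1
    return [category for category, _ in scores.most_common(top_k)]


if __name__ == "__main__":
    # Hypothetical title-to-category mapping, for illustration only.
    titles = {
        "Neural network": ["Machine learning", "Artificial intelligence"],
        "Gradient descent": ["Machine learning", "Optimization"],
        "Football": ["Sports"],
    }
    text = "We train a neural network with gradient descent."
    print(rank_categories(text, titles, top_k=1))
```

A fuller implementation would weight the matches, for instance by term frequency or by how specific each title is, rather than counting them equally.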
An improvement over Schönhofen's approach was suggested in [ 46 ]. The
authors proposed exploiting the words appearing in both the article titles and the
hyperlinks, since the hyperlinks better characterize the content of the article.
Empirical results on a subset of Wikipedia articles showed an improvement in
precision and recall with respect to Schönhofen's method. Using only the top-3
Wikipedia categories returned by the method, the improvement in precision and
recall was around 15% and 35%, respectively.
A different text categorization approach based on an RDF ontology extracted
from Wikipedia Infoboxes was presented in [ 47 ]. The method focuses on news
documents of varying themes. For each document, the authors manually selected
the Wikipedia category which best relates to its topic. A text document is then
converted into a “thematic graph” of entities occurring in the document. Since the
thematic graph can include uncorrelated entities, only its most dominant
component is retained. Finally, the text is assigned to the class that best covers
the entities belonging to the graph. The accuracy achieved by this approach
on two different document collections is worse than that of a Naïve Bayes classifier
[ 33 ] based on BOW representation. One reason for the misclassifications may
be the manual mapping of Wikipedia categories to the document topics. Moreover,
news documents, unlike encyclopedia content, may be biased to reflect the
interests of the readers. Nevertheless, an interesting feature of the ontology-based
categorization approach is that it does not require a training phase, since all information
about categories is stored in the ontology.
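The graph-based steps described above can be sketched under some simplifying assumptions that are not taken from [ 47 ]: the "most dominant component" is interpreted as the largest connected component of the thematic graph, and the "best coverage" class as the class attached to the largest number of entities in that component. All names and data below are hypothetical.

```python
from collections import Counter, deque
from typing import Dict, Iterable, List, Optional, Set, Tuple


def largest_component(nodes: Iterable[str],
                      edges: Iterable[Tuple[str, str]]) -> Set[str]:
    """Return the largest connected component of an undirected graph (BFS)."""
    adjacency: Dict[str, Set[str]] = {n: set() for n in nodes}
    for a, b in edges:
        if a in adjacency and b in adjacency:
            adjacency[a].add(b)
            adjacency[b].add(a)
    seen: Set[str] = set()
    best: Set[str] = set()
    for start in adjacency:
        if start in seen:
            continue
        component = {start}
        seen.add(start)
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    component.add(neighbour)
                    queue.append(neighbour)
        if len(component) > len(best):
            best = component
    return best


def classify(entities: List[str],
             relations: List[Tuple[str, str]],
             entity_classes: Dict[str, List[str]]) -> Optional[str]:
    """Pick the class covering the most entities of the dominant component."""
    component = largest_component(entities, relations)
    coverage: Counter = Counter()
    for entity in component:
        for cls in entity_classes.get(entity, ()):
            coverage[cls] += 1
    if not coverage:
        return None
    # Ties are broken alphabetically so that the result is deterministic.
    return max(sorted(coverage), key=coverage.get)


if __name__ == "__main__":
    # Hypothetical entities, ontology relations, and class labels.
    entities = ["President", "Senate", "Election", "Striker"]
    relations = [("President", "Senate"), ("President", "Election")]
    classes = {"President": ["Politics"], "Senate": ["Politics"],
               "Election": ["Politics"], "Striker": ["Sports"]}
    print(classify(entities, relations, classes))
```

No training step appears in the sketch: the class labels come entirely from the ontology lookup, which mirrors the property highlighted in the text.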
2.6.2.2 Search Engine
Several search engines based on Wikipedia have been developed to retrieve
documents which are highly correlated with the keywords typed by the user. As shown in
the taxonomy depicted in Fig. 2.4, the proposed approaches can be classified
according to the Wikipedia information employed in the query analysis.