2.6.2 Wikipedia
The large volume of data available in Wikipedia articles represents a valuable
media collection that can be used to improve automatic document understanding
and retrieval, as shown in Fig. 2.4. An overview of results achieved in the document
categorization domain is presented in Sect. 2.6.2.1, while Sect. 2.6.2.2 discusses
some research efforts targeted at developing novel and efficient search engines.
2.6.2.1 Document Classification
The categorization task is usually performed by building models based on the
statistical properties of small document collections. The variety of Wikipedia
articles, the hyperlink graph structure, and the taxonomy of categories have been
employed in different studies to build automatic categorization approaches or
improve the performance of existing models. According to the representation they use, we
can divide the discussed works into three categories, as depicted in the taxonomy
in Fig. 2.4: (a) bag-of-words, (b) Wikipedia taxonomy analysis, and (c) Wikipedia
graph analysis.
In [43], Gabrilovich and Markovitch presented one of the first works employing
Wikipedia as an external resource for the document categorization task. The idea is
to improve document representation by using the knowledge stored in the encyclopedia.
A feature generator identifies the most relevant encyclopedia articles for
each document. Then, the titles of the articles are used as new features to augment
the bag-of-words (BOW) representation of the document. In the BOW representation,
a document or a sentence is represented as an unordered collection of words,
disregarding the structure of the text. This representation is usually associated with
statistical measures such as tf-idf (term frequency-inverse document frequency).
The tf-idf weight measures how important a word is to a document within
a collection: the higher the value, the more representative the word. Empirical
evaluation shows that, using background knowledge stored in Wikipedia, classification
performance on short and long documents drawn from different datasets can
be improved with respect to traditional classification approaches based only on the
BOW representation.
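The BOW representation, tf-idf weighting, and title-based feature augmentation described above can be illustrated with a minimal sketch. It is not the implementation from [43]: the function names, the `WIKI:` feature prefix, the toy documents, and the article titles are all hypothetical, and the tf-idf variant used here (relative term frequency multiplied by log(N/df)) is only one common definition, chosen as an assumption because the text does not specify one.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Unordered word counts; word order and text structure are discarded."""
    return Counter(text.lower().split())


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    Assumed variant: tf = count / document length, idf = log(N / df).
    """
    bows = [bag_of_words(d) for d in docs]
    n_docs = len(bows)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for bow in bows for term in bow)
    weights = []
    for bow in bows:
        length = sum(bow.values())
        weights.append(
            {t: (c / length) * math.log(n_docs / df[t]) for t, c in bow.items()}
        )
    return weights


def augment_with_titles(bow, titles):
    """Add article titles as extra features, kept distinct from plain words."""
    augmented = Counter(bow)
    for title in titles:
        augmented["WIKI:" + title] += 1  # hypothetical prefix for title features
    return augmented


docs = [
    "the bank raised interest rates",
    "the river bank flooded",
    "the team won the match",
]
weights = tf_idf(docs)

# Hypothetical output of a feature generator for the first document.
augmented = augment_with_titles(
    bag_of_words(docs[0]), ["Central bank", "Interest rate"]
)
```

In this toy collection, "the" occurs in every document and therefore receives a weight of zero, while "interest", which occurs in only one document, is weighted higher than "bank", which occurs in two. The augmented representation keeps the original word counts and simply adds one feature per retrieved title.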
A similar idea was presented in [44], where the authors automatically constructed
a thesaurus of concepts from Wikipedia. The thesaurus was extracted
using redirect and disambiguation pages and the hyperlink graph of Wikipedia
articles. As in the previous approach, the authors search for candidate concepts
mentioned in each document; they then add the synonyms, hyponyms, and correlated
concepts of these candidates as new features to enrich the BOW
representation. This extended knowledge can be leveraged to relate documents
that did not originally share common terms. As a result, such documents are shifted
closer to each other in the new representation. The effectiveness of this approach
was empirically demonstrated by means of a linear Support Vector Machine (SVM)