Community-Contributed Media Collections: Knowledge at Our Fingertips - Community-Built Databases: Research and Development


2.6.2 Wikipedia

The huge amount of data available in Wikipedia articles represents an interesting
media collection that can be used to improve automatic document understanding
and retrieval, as shown in Fig. 2.4. An overview of results achieved in the document
categorization domain is presented in Sect. 2.6.2.1, while Sect. 2.6.2.2 discusses
some research efforts targeted at developing novel and efficient search engines.

2.6.2.1 Document Classification

The categorization task is usually performed by building models based on the
statistical properties of small document collections. The variety of Wikipedia
articles, the hyperlink graph structure, and the taxonomy of categories have been
employed in different studies to build automatic categorization approaches or to
improve the performance of existing models. Based on the document representation
they use, the works discussed here fall into three categories, as depicted in the
taxonomy in Fig. 2.4: (a) bag-of-words, (b) Wikipedia taxonomy analysis, and
(c) Wikipedia graph analysis.

In [43], Gabrilovich and Markovitch presented one of the first works employing
Wikipedia as an external resource for the document categorization task. The idea is
to improve document representation by using the knowledge stored in the
encyclopedia. A feature generator identifies the most relevant encyclopedia articles
for each document. Then, the titles of the articles are used as new features to augment
the bag-of-words (BOW) representation of the document. In the BOW representation,
a document or a sentence is represented as an unordered collection of words,
disregarding the structure of the text. This representation is usually associated with
statistical measures such as tf-idf (term frequency-inverse document frequency).
The tf-idf measure evaluates how important a word is to a document in a collection:
the higher the value, the more representative the word is of that document. Empirical
evaluation shows that, by using the background knowledge stored in Wikipedia,
classification performance on short and long documents drawn from different datasets
can be improved with respect to traditional classification approaches based only on
the BOW representation.
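
The pipeline described above (BOW representation, tf-idf weighting, and
augmentation with article titles) can be illustrated with a minimal sketch. It is
not the feature generator of [43]: the three toy "articles", the cosine-similarity
ranking, the `WIKI:` feature prefix, and the names `tf_idf` and `augment` are all
assumptions made for illustration, and tf-idf is computed in one common variant
(relative term frequency times log(N / df)).

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    tf is the relative frequency of the term in the document; idf is
    log(N / df), where df counts the documents that contain the term.
    """
    bags = [Counter(tokenize(d)) for d in docs]  # unordered word counts (BOW)
    n_docs = len(bags)
    df = Counter(term for bag in bags for term in bag)
    weights = []
    for bag in bags:
        total = sum(bag.values())
        weights.append(
            {t: (c / total) * math.log(n_docs / df[t]) for t, c in bag.items()}
        )
    return weights


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical toy stand-in for encyclopedia articles (title -> text).
ARTICLES = {
    "Jaguar (animal)": "the jaguar is a large cat species native to the americas",
    "Jaguar Cars": "jaguar is a british manufacturer of luxury cars and vehicles",
    "Stock market": "a stock market is where shares of companies are bought and sold",
}


def augment(document, articles=ARTICLES, top_k=1):
    """Return the document's BOW counts plus the titles of its top_k articles."""
    titles = list(articles)
    vectors = tf_idf([articles[t] for t in titles] + [document])
    doc_vec = vectors[-1]
    scored = [(cosine(doc_vec, vec), title)
              for title, vec in zip(titles, vectors[:-1])]
    scored.sort(reverse=True)
    features = Counter(tokenize(document))
    for score, title in scored[:top_k]:
        if score > 0:  # skip articles that share no weighted term
            features["WIKI:" + title] = 1
    return features


if __name__ == "__main__":
    print(augment("Jaguar unveils new luxury cars"))
```

In this sketch the generated title features sit in the same feature space as the
original words, so any classifier trained on BOW vectors can consume them
unchanged. A term that occurs in every document receives an idf of zero and
therefore does not influence the ranking.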

A similar idea was presented in [44], where the authors automatically constructed
a thesaurus of concepts from Wikipedia. The thesaurus was extracted
using redirect and disambiguation pages and the hyperlink graph of Wikipedia
articles. As in the previous approach, the authors search for candidate concepts
mentioned in each document; they then add the synonyms, hyponyms, and correlated
concepts of these candidates as new features to enrich the BOW
representation. This extended knowledge can be leveraged to relate documents
that did not originally share common terms. As a result, such documents move
closer to each other in the new representation. The effectiveness of this approach
was empirically demonstrated by means of a linear Support Vector Machine (SVM)
