Retrieving Wiki Content Using an Ontology - Mining and Analyzing Social Networks

4.2 Universe of Analyzed Documents

Through the interface of the retrieval tool, the user can configure which topics in the wiki site should be considered in a retrieval process. This is done by selecting the wiki URLs that are considered most interesting, using a drag-and-drop mechanism. The user chooses the wiki and browses its content. If a page is considered interesting, the user selects its URL and drags it to an icon that represents the retrieval tool. The icon was drawn to look like a dark hole.

Considering either a whole topic part or only its latest revisions makes it possible to identify the recent contributions within a given time period. To decide which topics, or parts of them, are recent, the creation and modification dates are stored in a database along with the corresponding topic part.
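
The recency check described above can be sketched as follows. This is only an illustration: the `TopicPart` record, its field names, and the `recent_parts` helper are hypothetical, and an in-memory list stands in for the database mentioned in the text.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class TopicPart:
    """Hypothetical record for one stored topic part and its dates."""
    url: str
    text: str
    created: datetime
    modified: datetime


def recent_parts(parts, start, end):
    """Return the topic parts created or modified within [start, end]."""
    return [
        part for part in parts
        if start <= part.created <= end or start <= part.modified <= end
    ]
```

For example, calling `recent_parts(parts, datetime(2009, 1, 1), datetime(2009, 12, 31))` keeps only the parts that were created or revised during 2009.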

Each wiki topic is equivalent to two queries in the classic vector model, and a distance is calculated between the topic and each class family in the ontology. Each ontological family is equivalent to one document in the collection considered in the classic model.
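
The mapping above (topic as query, class family as document) can be sketched as below. The text does not say which distance is used, so this sketch assumes cosine similarity, the usual measure in the classic vector model. It scores one query vector at a time, and the function names and the dictionary representation of weight vectors are illustrative choices.

```python
import math


def cosine_similarity(query, document):
    """Cosine of the angle between two sparse weight vectors (dicts)."""
    dot = sum(w * document.get(term, 0.0) for term, w in query.items())
    norm_q = math.sqrt(sum(w * w for w in query.values()))
    norm_d = math.sqrt(sum(w * w for w in document.values()))
    if norm_q == 0 or norm_d == 0:
        return 0.0  # an empty vector is treated as unrelated to everything
    return dot / (norm_q * norm_d)


def rank_families(topic_vector, family_vectors):
    """Score every class family (the 'documents') against one topic (the 'query')."""
    scores = {
        name: cosine_similarity(topic_vector, vector)
        for name, vector in family_vectors.items()
    }
    # Highest similarity first, i.e. the closest class family.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

A class family that shares no concept with the topic scores zero, while a family whose vector points in the same direction as the topic scores close to one.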

4.3 Semantic Weight

The new semantic scenario requires a new weighting scheme to quantify the relevance of term equivalents (ontology concept elements) in each class family against each considered topic part. The new weight is named semantic weight and is calculated according to the location of each term equivalent in the hierarchy of each class family, or through the Relevancy Adjust property. Let $k$ denote the $k$th concept in the equivalent of a term vector, $depth_{k,cf_k}$ its depth in the class family $cf_k$ where it appears, and $maxdepth_{cf_k}$ the greatest depth among the ontology class families. The formula for the semantic weight $sw_k$ of $k$ is presented in (1).

$$ sw_k = \frac{depth_{k,\,cf_k}}{maxdepth_{cf_k}} \qquad (1) $$
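
A minimal sketch of formula (1) follows. It rests on several assumptions that the text does not state: a class family is represented as a hypothetical child-to-parent map, the root is counted as depth 1, and the maximum depth is taken within that one family. The Relevancy Adjust property is not modeled.

```python
def depth_of(concept, parent):
    """Depth of a concept in its class family; the root counts as depth 1 (assumption)."""
    depth = 1
    while parent[concept] is not None:
        concept = parent[concept]
        depth += 1
    return depth


def semantic_weight(concept, parent):
    """Formula (1): depth of the concept divided by the greatest depth in the family."""
    max_depth = max(depth_of(c, parent) for c in parent)
    return depth_of(concept, parent) / max_depth


# Hypothetical class family: Vehicle -> Car -> SportsCar
family = {"Vehicle": None, "Car": "Vehicle", "SportsCar": "Car"}
```

With this family, the deepest concept `SportsCar` gets weight 1, and `Car` gets 2/3, so more specific concepts weigh more than general ones.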

4.4 Inverse Document Frequency

The inverse document frequency (idf) is an indicator in the vector model that benefits documents containing terms whose frequency is relatively low across the total document set. It also prevents highly frequent terms from dominating relevance calculations. To avoid severe numeric distortions, the logarithm function is used. The original formula to compute the idf is shown in (2).

$$ idf_k = \log\left(\frac{N}{n_k}\right) \qquad (2) $$

where $N$ is the number of elements in the document set and $n_k$ is the number of documents in which the $k$th term occurs, ignoring the number of its occurrences in each document.
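
Formula (2) can be sketched as below. The representation of each document as a set of terms is an assumption made for illustration, and so is the natural logarithm, since the text does not name the base.

```python
import math


def idf(term, documents):
    """Formula (2): log(N / n_k), with each document given as a set of terms."""
    total = len(documents)  # N
    containing = sum(1 for doc in documents if term in doc)  # n_k
    if containing == 0:
        raise ValueError(f"term {term!r} occurs in no document")
    return math.log(total / containing)


docs = [{"wiki", "ontology"}, {"wiki"}, {"wiki", "retrieval"}]
rare = idf("ontology", docs)  # occurs in 1 of 3 documents, so log(3)
```

Because documents are sets, repeated occurrences of a term inside one document do not change the count, which matches the definition of $n_k$ above.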

The formula in (2) has a drawback. Relevant terms that appear in all considered documents end up having no positive influence on the calculation results, because the obtained idf is zero. Since the idf index is used in other formulas as a multiplying factor, the corresponding results will also be zero.
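
As a quick check of this drawback, take a term that occurs in every document, so that $n_k = N$ in (2). Here $w$ is only a placeholder for whatever factor the other formulas multiply by the idf; this section does not specify it.

$$ idf_k = \log\left(\frac{N}{N}\right) = \log 1 = 0 \quad\Longrightarrow\quad w \cdot idf_k = 0 \ \text{for any } w $$

The result holds for any logarithm base, so changing the base does not remove the problem.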
