4.2 Universe of Analyzed Documents
Through the interface of the retrieval tool, the user can configure which topics of
the wiki site should be considered in a retrieval process. This is done by selecting
the wiki URLs that are considered most interesting, using a drag-and-drop
mechanism. The user chooses the wiki and browses its content. If a page is
considered interesting, the user selects its URL and drags it to an icon that
represents the retrieval tool. The icon was drawn to look like a dark hole.
Topics can be considered either as whole topic parts or only through their latest
revisions, which makes it possible to identify the recent contributions within a
given time period. To decide which topics, or parts of them, are recent, the creation
and modification dates are stored in a database along with the corresponding topic part.
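As a minimal sketch of how stored dates could be used to select recent contributions, the following Python fragment filters topic parts by a time period. The `TopicPart` record, its field names, and the sample data are hypothetical; the text only states that creation and modification dates are stored with each topic part.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class TopicPart:
    # Hypothetical record layout; not taken from the described tool.
    topic: str
    part_id: int
    created: datetime
    modified: datetime


def recent_parts(parts, start, end):
    """Return the parts created or modified within [start, end]."""
    return [
        p for p in parts
        if start <= p.created <= end or start <= p.modified <= end
    ]


# Illustrative data only.
parts = [
    TopicPart("Ontology", 1, datetime(2008, 1, 10), datetime(2008, 1, 10)),
    TopicPart("Ontology", 2, datetime(2008, 1, 12), datetime(2008, 3, 5)),
    TopicPart("VectorModel", 1, datetime(2008, 3, 1), datetime(2008, 3, 1)),
]

recent = recent_parts(parts, datetime(2008, 3, 1), datetime(2008, 3, 31))
print([(p.topic, p.part_id) for p in recent])
```

In a real deployment the same selection would more likely be expressed as a date-range query against the database.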
Each wiki topic is equivalent to two queries in the classic vector model, and a
distance is calculated between it and each class family in the ontology. Each
ontological family is equivalent to one document of the collection considered in
the classic model.
4.3 Semantic Weight
The new semantic scenario requires a new weighting scheme to quantify the relevance
of term equivalents (ontology concept elements) in each class family
against each considered topic part. The new weight is named semantic weight and
is calculated according to the location of each term equivalent in the hierarchy of
each class family or through the Relevancy Adjust property. Considering k as the
k-th concept in the equivalent of a term vector, depth_{k,cf_k} as its depth in the
class family cf_k where it appears, and maxdepth_{cf_k} as the greatest depth among
the ontology class families, the semantic weight sw_k for k is given in (1).
sw_k = depth_{k,cf_k} / maxdepth_{cf_k}    (1)
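Formula (1) can be illustrated with a short Python sketch. The class family, its concept names, and the convention that the root has depth 1 are assumptions made for the example; the Relevancy Adjust property mentioned above is not modeled.

```python
def semantic_weight(depth: int, max_depth: int) -> float:
    """Semantic weight per formula (1): concept depth over the greatest depth."""
    if max_depth <= 0:
        raise ValueError("max_depth must be positive")
    if not 0 <= depth <= max_depth:
        raise ValueError("depth must lie between 0 and max_depth")
    return depth / max_depth


# Hypothetical class family: concept -> depth in the hierarchy (root = 1, assumed).
family_depths = {"Vehicle": 1, "Car": 2, "SportsCar": 4}
max_depth = max(family_depths.values())

weights = {
    concept: semantic_weight(depth, max_depth)
    for concept, depth in family_depths.items()
}
print(weights)
```

Under this reading, deeper (more specific) concepts receive weights closer to 1, while concepts near the root receive smaller weights.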
4.4 Inverse Document Frequency
The inverse document frequency (idf) is an indicator in the vector model that
favors documents containing terms whose frequency is relatively low across the
whole document set. It also prevents highly frequent terms from dominating the
relevance calculations. The logarithm function is used to avoid severe numeric
distortions. The original formula for computing the idf is
shown in (2).
idf_k = \log(N / n_k)    (2)
where N is the number of elements in the document set and n_k is the number of
documents in which the k-th term occurs, regardless of how many times it occurs
in each document.
Formula (2) has a drawback. Relevant terms that appear in all considered documents
have no positive influence on the calculation results, because in that case
n_k = N and the resulting idf is log(1) = 0. Because the idf is used as a
multiplying factor in other formulas, the corresponding results will also be zero.
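The drawback can be reproduced with a small Python sketch of formula (2). The document set below is invented for illustration, and the natural logarithm is an assumption, since the text does not state the base.

```python
import math


def idf(n_docs: int, doc_freq: int) -> float:
    """Inverse document frequency per formula (2): log(N / n_k)."""
    if not 0 < doc_freq <= n_docs:
        raise ValueError("doc_freq must be in the range 1..n_docs")
    return math.log(n_docs / doc_freq)  # natural log (assumed base)


def document_frequency(term: str, docs) -> int:
    """Count the documents containing the term, ignoring repeated occurrences."""
    return sum(1 for doc in docs if term in doc)


# Hypothetical collection: each document is reduced to its set of terms.
documents = [
    {"wiki", "ontology", "retrieval"},
    {"wiki", "ontology"},
    {"wiki", "vector"},
    {"wiki"},
]
N = len(documents)

idf_values = {
    term: idf(N, document_frequency(term, documents))
    for term in ("wiki", "ontology", "retrieval")
}
print(idf_values)

# "wiki" occurs in every document, so its idf is 0 and any weight
# multiplied by it vanishes, however relevant the term may be.
print(0.8 * idf_values["wiki"])
```

Here the term that occurs in every document contributes nothing to any product that includes its idf, which is the behavior described above.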