where poly(i) is the polysemy (number of senses) of i. For example, the word music has five senses in
WordNet, so the probability that it is used to express a specific meaning is equal to 1/5.
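As a minimal sketch of this centrality computation, assuming the polysemy counts have already been looked up in WordNet (the `POLYSEMY` table and the `centrality` function are hypothetical names; only the count for music comes from the text):

```python
# Hypothetical polysemy table; only the entry for "music" is taken from the text.
POLYSEMY = {"music": 5}

def centrality(term, polysemy=POLYSEMY):
    """Return 1 / poly(term): the probability that term expresses one specific sense."""
    poly = polysemy[term]
    if poly < 1:
        raise ValueError("polysemy must be at least 1")
    return 1.0 / poly

print(centrality("music"))  # 0.2
```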
Therefore, we build a representation of the retrieved Web pages using the DSN; each word in the
page which matches any of the terms in the DSN is a component of the document representation and
the links between them are the relations in the DSN.
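The construction above might be sketched as follows. The layout of the DSN (a mapping from term pairs to relation labels), the function name `represent_document`, and the toy terms are all assumptions made for illustration; multi-word terms are not handled.

```python
import re

def represent_document(text, dsn_relations):
    """Keep the page words that match DSN terms, plus the DSN links among them.

    dsn_relations maps a (term, term) pair to a relation label. This layout is
    an assumption for illustration, not the structure used by the authors.
    """
    dsn_terms = {term for pair in dsn_relations for term in pair}
    words = set(re.findall(r"[a-z]+", text.lower()))
    matched = words & dsn_terms
    # Keep only the relations whose two endpoints both occur in the page.
    links = {pair: rel for pair, rel in dsn_relations.items()
             if pair[0] in matched and pair[1] in matched}
    return matched, links

# Toy DSN with made-up terms and relation labels.
toy_dsn = {("music", "jazz"): "hyponym", ("music", "opera"): "hyponym"}
terms, links = represent_document("Jazz is a genre of music.", toy_dsn)
```

Here `terms` holds the matched words and `links` holds only the relation between them, since "opera" does not occur in the page.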
Similarity Metric
Given a conceptual domain, identifying the pages of interest with a DSN requires a grading system
that assigns a score to each document on the basis of its Syntactic and Semantic content. To measure
the relevance of a given document, we therefore consider the Semantic relatedness between terms and,
using relevance feedback techniques, statistical information about them.
The proposed measure considers two types of information: syntactic information, based on the
concepts of word frequency and term centrality, and Semantic information, calculated on each set of
words in the document. The relevance feedback techniques we used take into account two types of
feedback: explicit and blind.
Explicit feedback is performed after the first presentation of results. Using the ranking metric
described below, the system presents a result list to the user and shows, for each result, the two
top-ranked sentences from the corresponding page. These sentences are found by applying the system
metric to each sentence in the document and sorting the sentences by score. With this information the
user can manually choose the relevant documents or open the whole page.
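The sentence selection step can be sketched as below. The scoring function is passed in as a parameter because the actual system metric is defined later; `toy_score`, `domain_terms`, and the sample page are stand-ins invented for illustration.

```python
import re

def top_sentences(text, score, n=2):
    """Return the n highest-scoring sentences of text, best first."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    return sorted(sentences, key=score, reverse=True)[:n]

# Stand-in scorer: counts occurrences of made-up domain terms.
domain_terms = {"music", "jazz"}

def toy_score(sentence):
    return sum(w in domain_terms for w in re.findall(r"[a-z]+", sentence.lower()))

page = "Jazz is music. The weather is fine. Music matters."
best = top_sentences(page, toy_score)
```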
With the blind approach, the user lets the system automatically perform relevance feedback on a
predefined number of documents.
The first contribution to the metric is called the Syntactic-Semantic grade (SSG). In this chapter we
propose a new approach to calculating the SSG and compare it with the one proposed in Albanese,
Picariello & Rinaldi (2004); the metric proposed there serves as our standard metric. We can define the relevance
of a word in a given conceptual domain and, if the feedback functions are chosen, in the set of selected
documents. Therefore we use a hybrid approach exploiting both statistical and Semantic information.
The statistical information is obtained by applying the relevance feedback technique described in Weiss,
Vélez & Sheldon (1996), and it is enriched with the Semantic information provided by computing the
centrality of the terms (Equation 1). In this way we divide the terms into classes, on the basis of their
centrality:
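\[
SSG_{i,k} = \frac{\left(0.5 + 0.5\,\dfrac{TF_{i,k}}{TF_{\max,k}}\right)\varpi_i}{\sqrt{\displaystyle\sum_{i \in k}\left(0.5 + 0.5\,\dfrac{TF_{i,k}}{TF_{\max,k}}\right)^{2}\varpi_i^{2}}} \qquad (2)
\]
where k is the k-th document, i is the i-th term, TF_{i,k} is the term frequency of i in k, TF_{max,k} is the
maximum term frequency in k, and \varpi_i is the centrality of i.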
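A minimal sketch of an SSG computation for one document follows. It assumes that Equation 2 weights the augmented term frequency (0.5 + 0.5 TF/TF_max) by the term's centrality and normalizes by the Euclidean norm of these weights over the document's terms. The function name `ssg` and the toy values are illustrative; only the centrality of music (1/5) comes from the text.

```python
import math

def ssg(term_freqs, centrality):
    """Compute the SSG of every term of one document.

    term_freqs: term -> raw frequency of the term in the document.
    centrality: term -> centrality of the term (1 / polysemy).
    """
    tf_max = max(term_freqs.values())
    # Augmented term frequency weighted by centrality (the numerator).
    weighted = {t: (0.5 + 0.5 * tf / tf_max) * centrality[t]
                for t, tf in term_freqs.items()}
    # Euclidean norm over all terms of the document (the denominator).
    norm = math.sqrt(sum(w * w for w in weighted.values()))
    return {t: w / norm for t, w in weighted.items()}

# Toy values: the frequencies and the centrality of "jazz" are made up.
scores = ssg({"music": 4, "jazz": 2}, {"music": 0.2, "jazz": 0.5})
```

Under this reading the squared grades of a document sum to 1, so a rare but unambiguous term (high centrality) can outrank a frequent but highly polysemous one.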
We use this approach to improve the precision of the model of the domain of interest and to
overcome the lack of very specific terms in WordNet (e.g. computer science specific terminology). Thus,
the use of relevance feedback re-weights and expands the Semantic network by adding new terms -not