$$\kappa(d_1, d_2) = \phi(d_1)\, P P^{\top} \phi(d_2)^{\top} \qquad (1.3)$$
which corresponds to representing a document by a less sparse vector $\phi(d)P$
that has non-zero entries for all terms that are semantically similar to those
present in the document $d$.
The matrix $PP^{\top}$ encodes the semantic strength among terms.
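As a rough illustration of the kernel in (1.3), the following sketch maps each term vector through a proximity matrix and takes the inner product of the results. The three-word vocabulary, the entries of `P`, and the helper names are assumptions made up for this example; they are not taken from the text.

```python
# Toy sketch: semantic kernel kappa(d1, d2) = phi(d1) P P^T phi(d2)^T.
# The vocabulary, the matrix P and the documents are made-up illustrations.

def vec_mat(v, M):
    """Row vector times matrix: returns v M as a list."""
    return [sum(v[i] * M[i][k] for i in range(len(v))) for k in range(len(M[0]))]

def dot(u, v):
    """Plain inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

def semantic_kernel(phi1, phi2, P):
    """Inner product of the mapped vectors phi(d1) P and phi(d2) P."""
    return dot(vec_mat(phi1, P), vec_mat(phi2, P))

# Vocabulary order: ["car", "automobile", "banana"]
phi_d1 = [1, 0, 0]   # document containing only "car"
phi_d2 = [0, 1, 0]   # document containing only "automobile"

# With P equal to the identity, the kernel reduces to the standard VSM.
identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]

# Assumed proximity: "car" and "automobile" are strongly related (0.8).
P = [[1.0, 0.8, 0.0],
     [0.8, 1.0, 0.0],
     [0.0, 0.0, 1.0]]

plain = semantic_kernel(phi_d1, phi_d2, identity)   # no shared terms
smoothed = semantic_kernel(phi_d1, phi_d2, P)       # related terms contribute
```

With the identity matrix the two documents share no terms and the kernel is zero; with the assumed `P` the mapped vectors overlap and the kernel becomes positive.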
We can expand equation (1.3) by substituting $PP^{\top}$ with $Q$:
$$\kappa(d_1, d_2) = \sum_{i,j} \phi(d_1)_i \, Q_{i,j} \, \phi(d_2)_j$$
so that we can view $Q_{ij}$ as encoding the amount of semantic relation between
terms $i$ and $j$. Note that defining the similarity by inferring $Q$ directly
requires the additional constraint that $Q$ be positive semi-definite, suggesting
that defining $P$ will in general be more straightforward. A simple example of a
semantic similarity mapping is stemming, which consists of removing inflections
from words.
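To make the stemming example concrete, one possible reading is that $P$ maps each term to its stem, so that $Q = PP^{\top}$ relates exactly those terms that share a stem. The sketch below follows that reading. The function `crude_stem` is a hypothetical suffix stripper written only for this illustration, not a real stemming algorithm, and the vocabulary is invented.

```python
# Toy sketch: stemming expressed as a term-to-stem matrix P.
# crude_stem is a hypothetical suffix stripper, not a real stemming algorithm.

def crude_stem(word):
    """Strip one of a few common suffixes if enough of the word remains."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

vocab = ["walk", "walks", "walked", "walking", "talk"]
stems = sorted({crude_stem(w) for w in vocab})

# P[i][s] = 1 if term i reduces to stem s, else 0.
P = [[1 if crude_stem(w) == s else 0 for s in stems] for w in vocab]

# Q = P P^T: Q[i][j] = 1 exactly when terms i and j share a stem.
Q = [[sum(P[i][s] * P[j][s] for s in range(len(stems)))
      for j in range(len(vocab))]
     for i in range(len(vocab))]

def kernel_Q(phi1, phi2, Q):
    """Expanded form: sum over i, j of phi(d1)_i * Q_ij * phi(d2)_j."""
    return sum(phi1[i] * Q[i][j] * phi2[j]
               for i in range(len(phi1)) for j in range(len(phi2)))

phi_d1 = [0, 1, 0, 0, 0]   # document containing only "walks"
phi_d2 = [0, 0, 0, 1, 0]   # document containing only "walking"
phi_d3 = [0, 0, 0, 0, 1]   # document containing only "talk"
```

Because `Q` is built here as $PP^{\top}$, it is positive semi-definite by construction, which is the practical advantage of specifying $P$ rather than $Q$.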
1.3.2.1 Designing the Proximity Matrix
Extracting semantic information about terms from documents is still an open
issue in information retrieval (IR), and many techniques have been developed in
the last few years. In this part of the chapter, we introduce different methods
to compute the matrix $P$, learning the relationships directly from a corpus or
a set of documents. Though we present the algorithms in a term-based
representation, we will in many cases show how to implement them in dual form,
hence avoiding the explicit computation of the matrix $P$.
Semantic Information from a Semantic Network. Wordnet (9) is a well-known
example of a freely available semantic network. It contains semantic
relationships between terms in a hierarchical structure, in which more general
terms occur higher in the tree. A semantic proximity matrix can be obtained
from the distance between two terms in the hierarchical tree provided by
Wordnet, by setting the entry $P_{ij}$ to reflect the semantic proximity between
the terms $i$ and $j$.
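The text does not fix a formula for turning tree distance into proximity, so the following is only one possible sketch. It uses a small hypothetical taxonomy (not Wordnet data) and assumes the weighting $P_{ij} = 1 / (1 + \text{dist}(i, j))$, where the distance is the number of edges between the two terms in the tree.

```python
# Toy sketch: proximity matrix from path distances in a small hypothetical taxonomy.
# The tree and the 1 / (1 + distance) weighting are illustrative assumptions;
# they are not taken from Wordnet itself.

parent = {
    "entity": None,
    "animal": "entity",
    "vehicle": "entity",
    "dog": "animal",
    "cat": "animal",
    "car": "vehicle",
}

def path_to_root(term):
    """List of nodes from the term up to the root, inclusive."""
    path = []
    while term is not None:
        path.append(term)
        term = parent[term]
    return path

def tree_distance(a, b):
    """Number of edges between a and b via their lowest common ancestor."""
    depth_in_b = {node: k for k, node in enumerate(path_to_root(b))}
    for steps_a, node in enumerate(path_to_root(a)):
        if node in depth_in_b:
            return steps_a + depth_in_b[node]
    raise ValueError("terms are not in the same tree")

terms = ["dog", "cat", "car"]

# Closer terms in the tree get a larger proximity; identical terms get 1.0.
P = [[1.0 / (1.0 + tree_distance(a, b)) for b in terms] for a in terms]
```

In this toy tree "dog" and "cat" meet at "animal", while "dog" and "car" meet only at the root, so the first pair receives the larger proximity.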
Generalized VSM. The generalized VSM (GVSM) is a variation of the
classical VSM in which semantic similarity between terms is used. The main
idea of this approach is that two terms are semantically related if they
frequently co-occur in the same documents. This implies that two documents
can be considered similar even if they do not share any terms, provided the
terms they contain co-occur in other documents. Whereas the VSM represents a
document as a bag of words, the GVSM represents a document as a vector of its
similarities with the different documents in the corpus. A document is
represented by
$$\tilde{\phi}(d) = \phi(d) D^{\top},$$