Analysis of Text Patterns Using Kernel Methods - Text Mining: Classification, Clustering, and Applications


$$\kappa(d_1, d_2) = \phi(d_1) P P' \phi(d_2)' \qquad (1.3)$$

which corresponds to representing a document by a less sparse vector $\phi(d) P$ that has non-zero entries for all terms that are semantically similar to those present in the document $d$.
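As a minimal sketch of equation (1.3), the snippet below compares the plain inner product of two documents with the kernel obtained after mapping each document through a proximity matrix. The four-term vocabulary, the term counts, and the entries of `P` are all hypothetical values chosen for illustration; they are not taken from the text.

```python
import numpy as np

# Hypothetical 4-term vocabulary: ["car", "automobile", "engine", "banana"].
# Row vectors phi(d) hold term counts for two documents that share no terms.
phi_d1 = np.array([1.0, 0.0, 0.0, 0.0])  # mentions only "car"
phi_d2 = np.array([0.0, 1.0, 1.0, 0.0])  # mentions "automobile" and "engine"

# Illustrative proximity matrix P (an assumption, not from the source):
# P[i, j] is the weight that term i contributes to term j after the mapping.
P = np.array([
    [1.0, 0.8, 0.3, 0.0],
    [0.8, 1.0, 0.3, 0.0],
    [0.3, 0.3, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
])

def semantic_kernel(x, y, P):
    """kappa(d1, d2) = phi(d1) P P' phi(d2)' for row vectors x and y."""
    return float((x @ P) @ (y @ P))

plain = float(phi_d1 @ phi_d2)                 # ordinary VSM inner product
smoothed = semantic_kernel(phi_d1, phi_d2, P)  # kernel of equation (1.3)
mapped_d1 = phi_d1 @ P                         # the less sparse vector phi(d) P

print(plain, smoothed, mapped_d1)
```

With these toy values the plain inner product is zero because the documents share no terms, while the mapped vector for the first document gains non-zero entries for the related terms, so the kernel value becomes positive.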

The matrix $P P'$ encodes the semantic strength among terms. We can expand equation (1.3) by substituting $Q$ for $P P'$:

$$\kappa(d_1, d_2) = \sum_{i,j} \phi(d_1)_i \, Q_{ij} \, \phi(d_2)_j$$

so that we can view $Q_{ij}$ as encoding the amount of semantic relation between terms $i$ and $j$. Note that defining the similarity by inferring $Q$ directly requires the additional constraint that $Q$ be positive semi-definite, whereas $P P'$ is positive semi-definite for any choice of $P$; defining $P$ will therefore in general be more straightforward. A simple example of a semantic similarity mapping is stemming, which consists of removing inflection from words.
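The following sketch checks the two claims above numerically, under stated assumptions: the matrix `P` and the vectors `x` and `y` are random stand-ins for a proximity matrix and two document vectors, and `Q_bad` is a hand-picked symmetric matrix used only to show that a directly specified $Q$ need not be positive semi-definite. The helper `is_psd` is a hypothetical name introduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
n_terms = 5

P = rng.random((n_terms, n_terms))  # any real matrix P is admissible
Q = P @ P.T                         # Q = P P'

x = rng.random(n_terms)             # stands in for phi(d1)
y = rng.random(n_terms)             # stands in for phi(d2)

# Matrix form of the kernel and the explicit double sum over i, j.
k_matrix = float(x @ Q @ y)
k_sum = sum(x[i] * Q[i, j] * y[j]
            for i in range(n_terms) for j in range(n_terms))

def is_psd(M, tol=1e-10):
    """Return True if the symmetric matrix M has no eigenvalue below -tol."""
    return bool(np.all(np.linalg.eigvalsh(M) >= -tol))

# A symmetric matrix written down directly can fail the constraint:
# this one has eigenvalues 3 and -1.
Q_bad = np.array([[1.0, 2.0],
                  [2.0, 1.0]])

print(k_matrix, k_sum, is_psd(Q), is_psd(Q_bad))
```

The two kernel values agree up to rounding, `Q = P P'` passes the check, and `Q_bad` does not, which is why building the kernel from $P$ sidesteps the constraint.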

1.3.2.1 Designing the Proximity Matrix

Extracting semantic information among terms in documents is still an open issue in IR, although a number of techniques have been developed in the last few years. In this part of the chapter, we introduce different methods to compute the matrix $P$, learning the relationships directly from a corpus or a set of documents. Though we present the algorithms in a term-based representation, we will in many cases show how to implement them in dual form, hence avoiding the explicit computation of the matrix $P$.

Semantic Information from a Semantic Network. WordNet (9) is a well-known example of a freely available semantic network. It contains semantic relationships between terms in a hierarchical structure, in which more general terms occur higher in the tree. A semantic proximity matrix can be obtained from the distance between two terms in the hierarchical tree provided by WordNet, by setting the entry $P_{ij}$ to reflect the semantic proximity between the terms $i$ and $j$.
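A minimal sketch of this idea follows, using a small hand-built hierarchy rather than WordNet itself. The `parent` table, the term list, and the choice of $1 / (1 + \text{distance})$ as the mapping from tree distance to proximity are all assumptions made for illustration; the text does not prescribe a particular formula.

```python
import numpy as np

# Hypothetical toy hierarchy (child -> parent); this is not WordNet data.
parent = {
    "entity": None,
    "vehicle": "entity",
    "fruit": "entity",
    "car": "vehicle",
    "truck": "vehicle",
    "apple": "fruit",
}

def path_to_root(term):
    """List of nodes from term up to the root, inclusive."""
    path = []
    while term is not None:
        path.append(term)
        term = parent[term]
    return path

def tree_distance(a, b):
    """Number of edges on the path between a and b in the tree."""
    depth_in_b = {node: depth for depth, node in enumerate(path_to_root(b))}
    for steps_a, node in enumerate(path_to_root(a)):
        if node in depth_in_b:  # first common ancestor
            return steps_a + depth_in_b[node]
    raise ValueError("terms are not in the same tree")

terms = ["car", "truck", "apple"]

# Assumed mapping from distance to proximity: closer terms get larger entries.
P = np.array([[1.0 / (1.0 + tree_distance(a, b)) for b in terms]
              for a in terms])

print(P)
```

Here "car" and "truck" sit two edges apart and receive a larger entry than "car" and "apple", which are four edges apart. Any other decreasing function of the distance could be substituted, since $P P'$ is positive semi-definite whatever $P$ is.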

Generalized VSM. The generalized VSM (GVSM) is a variation of the classical VSM in which semantic similarity between terms is taken into account. The main idea of this approach is that two terms are semantically related if they frequently co-occur in the same documents. This implies that two documents can be considered similar even if they do not share any terms, as long as the terms they contain co-occur in other documents. Whereas the VSM represents a document as a bag of words, the GVSM represents a document as a vector of its similarities with the different documents in the corpus. A document is represented by

$$\tilde{\phi}(d) = \phi(d) D',$$
