where $D$ is the document-term matrix, equivalent to taking $P = D^\top$. This definition does not make immediately clear that it implements a semantic similarity, but if we compute the corresponding kernel
$$\kappa(d_1, d_2) = \phi(d_1)\, D^\top D\, \phi(d_2)^\top,$$
we can observe that the matrix $D^\top D$ has a non-zero $(i,j)$-th entry if and only if there is a document in the corpus in which the $i$-th and $j$-th terms co-occur, since
$$(D^\top D)_{ij} = \sum_{d} \mathrm{tf}(i,d)\,\mathrm{tf}(j,d).$$
The strength of the semantic relationship between two terms that co-occur in a document is measured by the frequency and number of their co-occurrences. This approach can also be used to reduce the dimension of the space: if we have fewer documents than terms, we map from the vectors indexed by terms to a lower-dimensional space indexed by the documents of the corpus.
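As a concrete sketch, the co-occurrence kernel above can be computed directly from a term-frequency matrix. The toy corpus and values below are illustrative assumptions, not data from the text:

```python
import numpy as np

# Toy document-term matrix D: rows = documents, columns = terms.
# Entry D[d, t] is the term frequency tf(t, d).
D = np.array([
    [2, 1, 0, 0],   # doc 1: terms 0 and 1 co-occur
    [0, 1, 1, 0],   # doc 2: terms 1 and 2 co-occur
    [0, 0, 1, 1],   # doc 3: terms 2 and 3 co-occur
], dtype=float)

# Term-by-term co-occurrence matrix: (D^T D)_{ij} = sum_d tf(i,d) * tf(j,d).
DtD = D.T @ D

# Kernel between two documents given by their term-frequency vectors:
# kappa(d1, d2) = phi(d1) D^T D phi(d2)^T.
def gvsm_kernel(phi1, phi2):
    return phi1 @ DtD @ phi2

# Terms 0 and 3 never co-occur in any document, so the entry is zero.
print(DtD[0, 3])                 # → 0.0
print(gvsm_kernel(D[0], D[1]))   # → 7.0 (documents 1 and 2 share term 1)
```

Note that two documents can have a non-zero kernel value even with no terms in common, as long as their terms co-occur somewhere in the corpus; this is what makes the similarity "semantic".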
Latent Semantic Kernels.
Another approach based on the use of co-occurrence information is Latent Semantic Indexing (LSI) (7). This method is very close to GSVM; the main difference is that it uses singular value decomposition (SVD) to extract the semantic information from the co-occurrences. The SVD of a matrix considers the first $k$ columns of the left and right singular vector matrices $U$ and $V$, corresponding to the $k$ largest singular values.
Thus, the word-by-document matrix $D$ is factorized as $D = U \Sigma V^\top$, where $U$ and $V$ are unitary matrices whose columns are the eigenvectors of $D D^\top$ and $D^\top D$ respectively. LSI now projects the documents into the space spanned by the first $k$ columns of $U$, using these new $k$-dimensional vectors for subsequent processing:
$$\phi(d) \mapsto \phi(d)\, U_k,$$
where $U_k$ is the matrix containing the first $k$ columns of $U$. The eigenvectors define the subspace that minimizes the sum of squared differences between the points and their projections, so it defines the subspace with minimal sum-squared residuals. Hence, the eigenvectors for a set of documents can be viewed as concepts described by linear combinations of terms, chosen in such a way that the documents are described as accurately as possible using only $k$ such concepts. The aim of SVD is to extract a few highly correlated dimensions/concepts able to approximately reconstruct the whole feature vector.
The new kernel can be defined as
$$\kappa(d_1, d_2) = \phi(d_1)\, U_k U_k^\top\, \phi(d_2)^\top.$$
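A minimal sketch of this latent semantic kernel, using NumPy's SVD on a toy word-by-document matrix (the corpus and the choice of `k` are illustrative assumptions):

```python
import numpy as np

# Word-by-document matrix D: rows = terms, columns = documents (toy corpus).
D = np.array([
    [2, 0, 0],
    [1, 1, 0],
    [0, 1, 1],
    [0, 0, 1],
], dtype=float)

# SVD: D = U @ diag(S) @ Vt, where the columns of U are eigenvectors of
# D D^T and the columns of V (rows of Vt) are eigenvectors of D^T D.
U, S, Vt = np.linalg.svd(D, full_matrices=False)

# Keep the first k columns of U, i.e. the k largest singular values.
k = 2
Uk = U[:, :k]

# LSI projection phi(d) -> phi(d) U_k, giving the kernel
# kappa(d1, d2) = phi(d1) U_k U_k^T phi(d2)^T.
def lsi_kernel(phi1, phi2):
    return (phi1 @ Uk) @ (Uk.T @ phi2)

phi1 = D[:, 0]   # document 1 as a term-frequency vector
phi2 = D[:, 1]
print(lsi_kernel(phi1, phi2))
```

With $k$ equal to the rank of $D$ the kernel reduces to the plain inner product between term-frequency vectors; truncating to a smaller $k$ smooths the similarity through the dominant concepts.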