Analysis of Text Patterns Using Kernel Methods - Text Mining: Classification, Clustering, and Applications

Database Reference

In-Depth Information

and the proximity matrix P is hence equal to U k . Note that there is a

correspondence between the LSI and PCA in the feature space.

1.3.3 String Kernels

A document can be seen in different ways. Often it is modelled as a sequence

of paragraphs, or a sequence of sentences. Reducing the granularity, it can be

seen as a sequence of words or a string of symbols. In the previous section, we

have seen a document viewed as a bag of words. Now we consider a document

as a string of letters. This new representation allows different kinds of analysis.

In this section, we introduce several kernels that provide an interesting way

to compare documents working on substrings.

The p -spectrum Kernel. Perhaps the most natural way to compare two

strings in many applications is to count how many (contiguous) substrings

of length p they have in common. We define the spectrum of order p (or p-

spectrum ) of a sequence s as the histogram of frequencies of all its (contiguous)

substrings of length p . We define the p -spectrum kernel (15) as the inner

product of the p -spectra. Formally, the feature space F associated with the

p -spectrum kernel is indexed by I =Σ p , where Σ denotes the alphabet, and

Σ p is the set of all finite strings of length p , with the embedding given by

φ u ( s )=

Σ p

( v 1 ,v 2 ): s = v 1 uv 2 }|

, u

∈

and the associated p -spectrum kernel between sequences s and t is defined as

φ p ( s ) ,φ p ( t )

φ u ( s ) φ u ( t ) .

κ p ( s, t )=

∈

Σ p

The Mismatch Kernel. When isolated substitutions are likely to occur in

a sequence, the p -spectrum kernel might be too stringent to result in a useful

similarity measure. In those cases, it makes sense to use a modification of the

p -spectrum, where the feature of a sequence s associated to the substring u

is equal to the number of contiguous substrings in s that differ by no more

than a maximal number m of characters from u . For two substrings u and v

of equal length, we use d ( u, v ) to denote the number of characters in which u

and v differ. The mismatch kernel (16) κ p,m is defined by the feature mapping

φ p,m

( s )=

( v 1 ,v 2 ): s = v 1 vv 2 :

= p, d ( u, v )

≤

The associated mismatch kernel is defined as

u∈ Σ p

φ p,m ( s ) ,φ p,m ( t )

φ p,m

( s ) φ p,m

κ p,m ( s, t )=

( t ) .

Text Mining: Classification, Clustering, and Applications

Search WWH ::

Custom Search

Home