Database Reference
In-Depth Information
and the proximity matrix P is hence equal to U k . Note that there is a
correspondence between the LSI and PCA in the feature space.
1.3.3 String Kernels
A document can be seen in different ways. Often it is modelled as a sequence
of paragraphs, or a sequence of sentences. Reducing the granularity, it can be
seen as a sequence of words or a string of symbols. In the previous section, we
have seen a document viewed as a bag of words. Now we consider a document
as a string of letters. This new representation allows different kinds of analysis.
In this section, we introduce several kernels that provide an interesting way
to compare documents working on substrings.
The p -spectrum Kernel. Perhaps the most natural way to compare two
strings in many applications is to count how many (contiguous) substrings
of length p they have in common. We define the spectrum of order p (or p-
spectrum ) of a sequence s as the histogram of frequencies of all its (contiguous)
substrings of length p . We define the p -spectrum kernel (15) as the inner
product of the p -spectra. Formally, the feature space F associated with the
p -spectrum kernel is indexed by I p , where Σ denotes the alphabet, and
Σ p is the set of all finite strings of length p , with the embedding given by
φ u ( s )=
Σ p
|{
( v 1 ,v 2 ): s = v 1 uv 2 }|
, u
and the associated p -spectrum kernel between sequences s and t is defined as
=
u
φ p ( s ) p ( t )
φ u ( s ) φ u ( t ) .
κ p ( s, t )=
Σ p
The Mismatch Kernel. When isolated substitutions are likely to occur in
a sequence, the p -spectrum kernel might be too stringent to result in a useful
similarity measure. In those cases, it makes sense to use a modification of the
p -spectrum, where the feature of a sequence s associated to the substring u
is equal to the number of contiguous substrings in s that differ by no more
than a maximal number m of characters from u . For two substrings u and v
of equal length, we use d ( u, v ) to denote the number of characters in which u
and v differ. The mismatch kernel (16) κ p,m is defined by the feature mapping
φ p,m
u
( s )=
|{
( v 1 ,v 2 ): s = v 1 vv 2 :
|
u
|
|
v
|
= p, d ( u, v )
m
}|
.
=
The associated mismatch kernel is defined as
=
u∈ Σ p
φ p,m ( s ) p,m ( t )
φ p,m
u
( s ) φ p,m
u
κ p,m ( s, t )=
( t ) .
Search WWH ::




Custom Search