Database Reference
In-Depth Information
and the proximity matrix
P
is hence equal to
U
k
. Note that there is a
correspondence between the LSI and PCA in the feature space.
1.3.3 String Kernels
A document can be seen in different ways. Often it is modelled as a sequence
of paragraphs, or a sequence of sentences. Reducing the granularity, it can be
seen as a sequence of words or a string of symbols. In the previous section, we
have seen a document viewed as a bag of words. Now we consider a document
as a string of letters. This new representation allows different kinds of analysis.
In this section, we introduce several kernels that provide an interesting way
to compare documents working on substrings.
The
p
-spectrum Kernel.
Perhaps the most natural way to compare two
strings in many applications is to count how many (contiguous) substrings
of length
p
they have in common. We define the
spectrum of order p
(or
p-
spectrum
) of a sequence
s
as the histogram of frequencies of all its (contiguous)
substrings of length
p
. We define the
p
-spectrum kernel (15) as the inner
product of the
p
-spectra. Formally, the feature space
F
associated with the
p
-spectrum kernel is indexed by
I
=Σ
p
, where Σ denotes the alphabet, and
Σ
p
is the set of all finite strings of length
p
, with the embedding given by
φ
u
(
s
)=
Σ
p
|{
(
v
1
,v
2
):
s
=
v
1
uv
2
}|
,
u
∈
and the associated
p
-spectrum kernel between sequences
s
and
t
is defined as
=
u
φ
p
(
s
)
,φ
p
(
t
)
φ
u
(
s
)
φ
u
(
t
)
.
κ
p
(
s, t
)=
∈
Σ
p
The Mismatch Kernel.
When isolated substitutions are likely to occur in
a sequence, the
p
-spectrum kernel might be too stringent to result in a useful
similarity measure. In those cases, it makes sense to use a modification of the
p
-spectrum, where the feature of a sequence
s
associated to the substring
u
is equal to the number of contiguous substrings in
s
that differ by no more
than a maximal number
m
of characters from
u
. For two substrings
u
and
v
of equal length, we use
d
(
u, v
) to denote the number of characters in which
u
and
v
differ. The mismatch kernel (16)
κ
p,m
is defined by the feature mapping
φ
p,m
u
(
s
)=
|{
(
v
1
,v
2
):
s
=
v
1
vv
2
:
|
u
|
|
v
|
=
p, d
(
u, v
)
≤
m
}|
.
=
The associated mismatch kernel is defined as
=
u∈
Σ
p
φ
p,m
(
s
)
,φ
p,m
(
t
)
φ
p,m
u
(
s
)
φ
p,m
u
κ
p,m
(
s, t
)=
(
t
)
.
Search WWH ::
Custom Search