Database Reference
In-Depth Information
The Trie-based Implementation. Direct implementations of these ker-
nels would be very slow to evaluate due to the potentially large dimensionality
of the feature space, which is exponential in p . Fortunately, however, much
faster implementations of string kernels can be obtained by exploiting an ef-
ficient data structure known as a 'trie.' A trie over an alphabet Σ is a tree
whose edges are labeled with a symbol from Σ. A complete trie of depth p is a
trie containing the maximal number of nodes consistent with the depth of the
tree being p , each parent node having a downward branch for each alphabet
symbol from Σ.
In a complete trie there is a one to one correspondence between the nodes
at depth k and the strings of length k , the correspondence being between the
node and the string on the path to that node from the root. (The string
associated with the root node is the empty string ε .) Hence, we will refer to
the nodes of a trie by their associated string. The key observation behind
the trie-based approach is that one can regard the leaves of the complete
trie of depth p as the indices of the feature space indexed by the set Σ p of
strings of length p . So the coordinates of the vector φ ( s ) (corresponding to
the dimensions of the feature space F ) are conveniently organized in the trie,
which can be used to obtain an ecient search strategy.
The use of this data structure reduces the computational cost of the p -
spectrum kernel and mismatch kernel. In both of the kernels, the implemen-
tation is based on the traversal of the trie in a depth-first fashion, each time
attaching to the explored node a list of substrings of s that match to the
substring corresponding to that node. The key difference between p -spectrum
kernel and mismatch kernel trie implementation is that in the mismatch ker-
nel when we process a substring it can be added to lists associated with more
than one child node. We have an overall complexity of O ( p (
|
s
|
|
t
|
+
))for the
p -spectrum kernel and O p m +1
m ( |s| + |t| ) for the mismatch kernel. In
this chapter we do not go deep into the implementation of these two kernels
using the trie data structure; for more details see ( 17 ) and ( 27 ).
| Σ |
Computing an Entire Kernel Matrix. Instead of maintaining a list for
two strings s and t at each internal node of the trie, we can maintain a list for
a whole set of strings between which we want to compute the kernel functions.
Whenever a leaf node is reached, all these kernel functions can be incremented
based on the feature values of each of the strings corresponding to that leaf
node. This can be carried out eciently; the traversal of the trie remains
linear in the sum of the lengths of all strings, and only the operations at the
leaf nodes, where the kernel values are incremented, are inherently quadratic
in the number of strings. The result is the full kernel matrix K , containing
the kernel function between the i th and j th sequences at position ( i, j )and
symmetrically at position ( j, i ). Normalized kernel and distance matrices can
then promptly be computed from it.
Search WWH ::




Custom Search