Analysis of Text Patterns Using Kernel Methods - Text Mining: Classification, Clustering, and Applications

Database Reference

In-Depth Information

The Trie-based Implementation. Direct implementations of these ker-

nels would be very slow to evaluate due to the potentially large dimensionality

of the feature space, which is exponential in p . Fortunately, however, much

faster implementations of string kernels can be obtained by exploiting an ef-

ficient data structure known as a 'trie.' A trie over an alphabet Σ is a tree

whose edges are labeled with a symbol from Σ. A complete trie of depth p is a

trie containing the maximal number of nodes consistent with the depth of the

tree being p , each parent node having a downward branch for each alphabet

symbol from Σ.

In a complete trie there is a one to one correspondence between the nodes

at depth k and the strings of length k , the correspondence being between the

node and the string on the path to that node from the root. (The string

associated with the root node is the empty string ε .) Hence, we will refer to

the nodes of a trie by their associated string. The key observation behind

the trie-based approach is that one can regard the leaves of the complete

trie of depth p as the indices of the feature space indexed by the set Σ p of

strings of length p . So the coordinates of the vector φ ( s ) (corresponding to

the dimensions of the feature space F ) are conveniently organized in the trie,

which can be used to obtain an ecient search strategy.

The use of this data structure reduces the computational cost of the p -

spectrum kernel and mismatch kernel. In both of the kernels, the implemen-

tation is based on the traversal of the trie in a depth-first fashion, each time

attaching to the explored node a list of substrings of s that match to the

substring corresponding to that node. The key difference between p -spectrum

kernel and mismatch kernel trie implementation is that in the mismatch ker-

nel when we process a substring it can be added to lists associated with more

than one child node. We have an overall complexity of O ( p (

|

s

|

t

|

+

))for the

p -spectrum kernel and O p m +1

m ( |s| + |t| ) for the mismatch kernel. In

this chapter we do not go deep into the implementation of these two kernels

using the trie data structure; for more details see ( 17 ) and ( 27 ).

| Σ |

Computing an Entire Kernel Matrix. Instead of maintaining a list for

two strings s and t at each internal node of the trie, we can maintain a list for

a whole set of strings between which we want to compute the kernel functions.

Whenever a leaf node is reached, all these kernel functions can be incremented

based on the feature values of each of the strings corresponding to that leaf

node. This can be carried out eciently; the traversal of the trie remains

linear in the sum of the lengths of all strings, and only the operations at the

leaf nodes, where the kernel values are incremented, are inherently quadratic

in the number of strings. The result is the full kernel matrix K , containing

the kernel function between the i th and j th sequences at position ( i, j )and

symmetrically at position ( j, i ). Normalized kernel and distance matrices can

then promptly be computed from it.

Text Mining: Classification, Clustering, and Applications

Search WWH ::

Custom Search

Home