Database Reference
In-Depth Information
Searching the trie for an exact match is easy. Given a query string
v = σ 1 ···σ the trie is traversed from the root toward the leaves by
sequentially following the nodes labeled σ 1 through σ . If the search
stops before all characters have been examined, then v does not exist
in the data. Clearly, the trie has excellent performance (linear in the
length of the query) for finding exact matches or prefixes of the query.
The trie cannot be used eciently for substring matching (i.e., for
finding data strings that have v as a substring). Notice that the fanout
at every level of the trie can, in the worst case, be equal to
.
Several algorithms have been proposed for extending tries to answer
string similarity queries using edit distance, as will be discussed in
Section 8.
|
Σ
|
5.2.2
Sux Trees
A sux tree is a data structure based on tries that indexes all the
suxes of all the strings s
σ m has a total of m
suxes σ 1 ···σ m 2 ···σ m ,...,σ m . By building a trie over all suxes in
S a sux trie is produced.
The advantage of the sux trie is that it can be used to find sub-
string matches very eciently. Given a query v , a traversal of the sux
trie will produce all strings s
S . A string s = σ 1 ···
S that contain v as a substring. The
disadvantage is that the sux trie consumes significantly more space
than a simple trie. Moreover, the construction algorithm of the su x
trie has complexity O ( N 2 ), where N is the total length of the strings
in S (i.e., N = s∈S |
), which is prohibitively large for long strings.
The size of the su x trie can be reduced by collapsing all paths
consisting of nodes with only one child into a single node, similar to a
Patricia trie. The resulting Patricia sux trie is more commonly called
a sux tree. An example is shown in Figure 5.3.
The sux tree has two advantages. First, the overall size of the
structure can be reduced to O ( N ), or linear in the length of the
strings to be indexed. Second, the sux tree construction algorithm
also requires O ( N ) time and space, provided that the tree fits entirely
in main memory. (Sux tries are more expensive to construct since
one node for every character of every sux of every string has to be
s
|
Search WWH ::




Custom Search