Database Reference
In-Depth Information
5
Index Structures
There are three major data structures used for indexing strings in order
to answer similarity queries: the Inverted Index, Tries, and B-trees.
Most algorithms are based on these three data structures that are
explained in detail next.
5.1
Inverted Indexes
Consider a collection of strings. Each string is represented as a sequence
of tokens. An example is shown in Table 5.1. Clearly, one can easily
invert this representation by constructing one set per token (instead
of one set per string), where each token set consists of all the strings
containing this token at a specified position.
Definition 5.1 (Inverted List). Given a token λ
Λ the inverted
list L ( λ ) is the set of all strings s
S s.t. λ
s .
An inverted index is a collection of inverted lists, one list per token
in Λ. Storing the actual strings in the lists results in increased space
requirements, especially for long strings, since each string is duplicated
292
 
Search WWH ::




Custom Search