1 Introduction
Arguably, the string is one of the most important primitive data types
in modern data processing. Short strings make up the largest percentage
of data in relational database systems, while long strings are used to
represent proteins and DNA sequences in biological applications, as
well as HTML and XML documents on the Web. In fact, this very monograph
is safely stored in multiple formats (HTML, PDF, TeX, etc.) as a
collection of very long strings. Searching through string datasets is a
fundamental operation in almost every application domain: for example,
in SQL query processing, information retrieval on the Web, genomic
research on DNA sequences, product search in eCommerce applications,
and local business search on online maps. Hence, a plethora of
specialized indexes, algorithms, and techniques have been developed
for searching through strings.
Due to the complexity of collecting, storing, and managing strings,
string datasets almost always contain representational inconsistencies,
spelling mistakes, and a variety of other errors. For example, a
representational inconsistency occurs when the query string is
'Doctors Without Borders' and the data entry is stored as
'Doctors w/o Borders'. A spelling mistake occurs when the user
mistypes the query as 'Doctors