Introduction - Approximate String Processing


1 Introduction

Strings are arguably one of the most important primitive data types in modern data processing. Short strings comprise the largest percentage of data in relational database systems, while long strings are used to represent proteins and DNA sequences in biological applications, as well as HTML and XML documents on the Web. In fact, this very monograph is safely stored in multiple formats (HTML, PDF, TeX, etc.) as a collection of very long strings. Searching through string datasets is a fundamental operation in almost every application domain, for example in SQL query processing, information retrieval on the Web, genomic research on DNA sequences, product search in eCommerce applications, and local business search on online maps. Hence, a plethora of specialized indexes, algorithms, and techniques have been developed for searching through strings.

Due to the complexity of collecting, storing, and managing strings, string datasets almost always contain representational inconsistencies, spelling mistakes, and a variety of other errors. For example, a representational inconsistency occurs when the query string is 'Doctors Without Borders' and the data entry is stored as 'Doctors w/o Borders'. A spelling mistake occurs when the user mistypes the query as 'Doctors
