Biomedical Engineering Reference
In-Depth Information
of characters in the search string but then backs up to compare characters if a possible match exists.
Approximate Searches
Algorithms that efficiently locate exact matches have many applications in bioinformatics, including
searching for data in PubMed or some other bibliographic reference database by a specific disease or
author, searching a clinical database by a specific disease or patient identification number, or
searching any database where data are indexed by a known, controlled vocabulary. However, search
algorithms that look for approximate matches are more useful in one of the most computationally
challenging tasks in bioinformatics—that of searching sequence databases for homologies of
particular sequences. Approximate match algorithms vary from the use of templates, to the use of a
distance function, and the use of how words sound when spoken.
String search algorithms based on templates use metacharacters to specify the range of permissible
strings that must be matched exactly. For example, the UNIX utility "grep" (general regular
expression parser) uses metacharacters such as "*", "\", "$", "+", and "^" to perform a brute-force
search. As such, applications such as grep don't do true approximate searches. Similarly, the Find
function in the Windows operating system allows a search string such as "*research.doc" to locate
Microsoft Word documents that include "MyResearch.doc", "DNAResearch.doc", and
"ProteinResearch.doc". However, the search wouldn't locate documents such as "MyReserch.doc",
"DNAResaerch.doc", or "ProteinRsch.doc", because of missing or transposed characters in the file
names compared to the search string.
True approximate search algorithms allow approximate matches, permit the transposition of adjacent
characters, substitution of characters, and assign different weights to different types of errors. An
approximate match algorithm for nucleotide sequences should be able to locate nucleotide sequences
despite the presence of single nucleotide polymorphism, for example. Searching with an approximate
match algorithm for a nucleotide sequence that contains the string "AAGGTTAA" should be able to
locate the sequence "ATGGTTAA", where the second "A" in the first string is replaced by a "T" in the
second string.
Phonetic comparison algorithms, typified by Soundex and Metaphone, are examples of true
approximate search algorithms that have application in bioinformatics. For example, they can be
used to search bibliographic databases by author name when the exact spelling of the author's name
may be unknown, or search a taxonomy database by phonetically spelling a species name. The
Soundex approximate search algorithm addresses the problem of uncertain spelling by indexing and
searching databases by an encoded string. These encoded strings are created by dropping vowels
and silent consonants and assigning one of six values to the remaining consonants (see Table 4-3 ).
Table 4-3. Soundex Codes. Vowels and silent consonants are dropped from
the word and the consonants are converted to a three-digit numeric codes,
headed by the first character in the word. The Soundex algorithm is
especially useful in performing approximate searches for names of authors,
taxonomies, and other text strings that can be pronounced.
Characters
Value
AEIOUHWY
Dropped
BFPV
1
CGJKQSXZ
2
DT
3
Search WWH ::




Custom Search