Search Engines - Bioinformatics Computing

Biomedical Engineering Reference

In-Depth Information

of characters in the search string but then backs up to compare characters if a possible match exists.

Approximate Searches

Algorithms that efficiently locate exact matches have many applications in bioinformatics, including

searching for data in PubMed or some other bibliographic reference database by a specific disease or

author, searching a clinical database by a specific disease or patient identification number, or

searching any database where data are indexed by a known, controlled vocabulary. However, search

algorithms that look for approximate matches are more useful in one of the most computationally

challenging tasks in bioinformatics—that of searching sequence databases for homologies of

particular sequences. Approximate match algorithms vary from the use of templates, to the use of a

distance function, and the use of how words sound when spoken.

String search algorithms based on templates use metacharacters to specify the range of permissible

strings that must be matched exactly. For example, the UNIX utility "grep" (general regular

expression parser) uses metacharacters such as "*", "\", "$", "+", and "^" to perform a brute-force

search. As such, applications such as grep don't do true approximate searches. Similarly, the Find

function in the Windows operating system allows a search string such as "*research.doc" to locate

Microsoft Word documents that include "MyResearch.doc", "DNAResearch.doc", and

"ProteinResearch.doc". However, the search wouldn't locate documents such as "MyReserch.doc",

"DNAResaerch.doc", or "ProteinRsch.doc", because of missing or transposed characters in the file

names compared to the search string.

True approximate search algorithms allow approximate matches, permit the transposition of adjacent

characters, substitution of characters, and assign different weights to different types of errors. An

approximate match algorithm for nucleotide sequences should be able to locate nucleotide sequences

despite the presence of single nucleotide polymorphism, for example. Searching with an approximate

match algorithm for a nucleotide sequence that contains the string "AAGGTTAA" should be able to

locate the sequence "ATGGTTAA", where the second "A" in the first string is replaced by a "T" in the

second string.

Phonetic comparison algorithms, typified by Soundex and Metaphone, are examples of true

approximate search algorithms that have application in bioinformatics. For example, they can be

used to search bibliographic databases by author name when the exact spelling of the author's name

may be unknown, or search a taxonomy database by phonetically spelling a species name. The

Soundex approximate search algorithm addresses the problem of uncertain spelling by indexing and

searching databases by an encoded string. These encoded strings are created by dropping vowels

and silent consonants and assigning one of six values to the remaining consonants (see Table 4-3 ).

Table 4-3. Soundex Codes. Vowels and silent consonants are dropped from

the word and the consonants are converted to a three-digit numeric codes,

headed by the first character in the word. The Soundex algorithm is

especially useful in performing approximate searches for names of authors,

taxonomies, and other text strings that can be pronounced.

Characters

Value

AEIOUHWY

Dropped

BFPV

1

CGJKQSXZ

2

DT

3

Bioinformatics Computing

Search WWH ::

Custom Search

Home