Biomedical Engineering Reference
In-Depth Information
L
4
MN
5
R
6
Soundex encoding uses the first letter of the word, followed by up to three codes, depending on the
length of the word. Double consonants and repetitions of the same consonant group are dropped. For
example, "protein" is converted to "P635", and "Pill", "Phil", and "Philly" are converted to "P4".
Soundex errs on the side of sensitivity instead of specificity, in that it tends to pick up strings that
are only vaguely similar to the search string. The major limitations of Soundex are that the first letter
of a word must match exactly and that the string must be intended to be spoken. That is, Soundex
isn't intended to work with a text string such as "ATTAATTGGA". Similarly, a search for a word that
begins with "ph" won't find a word that actually begins with "f", even though the words may sound
identical.
A major improvement on the Soundex approach of encoding search and index strings is the
Metaphone algorithm. Like Soundex, Metaphone disposes of vowels—unless the word begins with a
vowel. However, Metaphone encoding is based on diphthongs rather than consonants. For example,
the Metaphone algorithm transforms "X" to "KS" before encoding the text string. As a result, a search
using Metaphone is generally more specific than Soundex. However, the Metaphone encoding scheme
doesn't overcome the limitation of being unable to encode and search for unpronounceable text
strings, such as nucleotide sequence data.
When it comes to searching sequence databases for sequence homologues, the gold standard is to
use a search engine based on a dynamic programming algorithm. Dynamic programming is a
computationally expensive but thorough search algorithm that recursively searches through a
database for a sequence that approximates the search string. Not only does a dynamic programming
algorithm search a database from beginning to end, but it keeps the results of previous match
attempts in memory. As a result, running a search for a sequence only a few dozen nucleotides long
against a database such as the human genome database can take hours, even with a dedicated
supercomputer.
Because the routine use of dynamic programming search techniques is unreasonably expensive in
time and computer resources, modified versions of dynamic programming, such as the FASTA
algorithm, are more practical. FASTA makes a dynamic programming approach to string search
tenable by limiting the area of the sequence database that is searched. The downside to an algorithm
such as FASTA is that it's possible to miss potential matches because the search isn't exhaustive.
When it comes to performing approximate searches on sequence data, by far the most popular
algorithm is the Basic Local Alignment Search Tool (BLAST), which achieves computational efficiency
by using heuristics that are weighted toward local sequence alignments. The BLAST heuristic
assumes that sections of protein are often conserved without gaps, so that the gaps can be ignored.
As such, it's able to detect relationships among sequences that share only isolated regions of
similarity. BLAST is used by virtually all of the major bioinformatics centers, including NCBI. Using
BLAST over the Internet, sequence searches against the full human genome can be completed in only
a few seconds, even when the system is being used by multiple users.
Because of the statistical techniques used to narrow the focus of the BLAST algorithm, it can miss
potential matches in a nucleotide database. To extend the capabilities of BLAST so that it finds
additional matches, NCBI developed Position-Specific Iterated BLAST (PSI-BLAST) that extends the
original BLAST algorithm using a position-specific scoring matrix that is capable of detecting subtle
nucleotide sequence similarities. NCBI and other research centers have similarly created specialized
versions of BLAST that are tuned to specific problems or areas. As described in more detail in Chapter
8 , "Pattern Matching," there are versions of BLAST that are optimized for human, microbial, and
malaria genomes, vector contamination, and immunoglobins.
Search WWH ::




Custom Search