Biomedical Engineering Reference
In-Depth Information
Computational Methods
A common activity in bioinformatics work is to search through a database and locate a substring—a
nucleotide sequence, for example—that matches a target string. Some computational methods
provide results faster, allowing a researcher to check the results of an experiment frequently, at the
expense of more false positive matches. Later, more selective, albeit slower techniques can be used
to verify the results of these quick screening techniques. The performance and selectivity of a search
are also a function of the search string and how the data are represented in the database to be
searched.
Most programming languages provide string search capabilities, but these tend to be unacceptably
inefficient when performed on large data sets typical of nucleotide sequence databases, and most
don't support approximate match capabilities. Because the built-in algorithms tend to use brute-force
methods that don't make use of heuristics or algorithmic tricks to increase efficiency or to accelerate
the search process, special-purpose algorithms have to be used with computationally intensive tasks
such as sequence searching.
Search Algorithms
Regardless of their intended purpose or design, search algorithms can be characterized by setup
time, running time, and need for backtracking. The first two elements, setup time and running time,
are related by the function:
search time = setup time + (comparisons x characters)
In this relationship, comparisons is the number of comparisons made per character in the text body
to be searched, and characters is the number of characters in the body of text or sequence data that
is to be searched. Setup time is the time invested prior to the actual search, and includes
programming to establish lookup tables and other data that can be used to simplify or accelerate the
search once it's initiated.
A search algorithm is considered inefficient if, in the process of a search, the number of comparisons
made is equal to or greater than the number of characters in the source text. For example, a search
algorithm in which the number of comparisons per character is greater than one is considered
inefficient. Conversely, an efficient algorithm is one that makes less than one comparison per
character to complete a search. Making less than one comparison per character might seem
counterintuitive at first, but this is possible through the use of heuristics that involve skipping
characters and repeating characters. Heuristic techniques use previous information to identify a
solution.
To illustrate the first heuristic, skipping characters, consider the search scenario in Figure 4-11 where
a search for the string "RNA" progresses from left to right in the search text "The synthesis of protein
is by way of an RNA intermediary". At the initial stage of this example, the search position is at the
right end of the first word, as indicated by the underline "e" in "The". Now, consider the progression
of the marker as the search string is compared with subsequent characters in the search text.
Figure 4-11. Character Skipping. The search, indicated by the marker box,
progresses from left to right, skipping ahead up to three characters at a
time until a character in the search string "RNA" is found in the skipped-to
region of the search text.
 
 
Search WWH ::




Custom Search