Search Engines - Bioinformatics Computing

Biomedical Engineering Reference

In-Depth Information

Computational Methods

A common activity in bioinformatics work is to search through a database and locate a substring—a

nucleotide sequence, for example—that matches a target string. Some computational methods

provide results faster, allowing a researcher to check the results of an experiment frequently, at the

expense of more false positive matches. Later, more selective, albeit slower techniques can be used

to verify the results of these quick screening techniques. The performance and selectivity of a search

are also a function of the search string and how the data are represented in the database to be

searched.

Most programming languages provide string search capabilities, but these tend to be unacceptably

inefficient when performed on large data sets typical of nucleotide sequence databases, and most

don't support approximate match capabilities. Because the built-in algorithms tend to use brute-force

methods that don't make use of heuristics or algorithmic tricks to increase efficiency or to accelerate

the search process, special-purpose algorithms have to be used with computationally intensive tasks

such as sequence searching.

Search Algorithms

Regardless of their intended purpose or design, search algorithms can be characterized by setup

time, running time, and need for backtracking. The first two elements, setup time and running time,

are related by the function:

search time = setup time + (comparisons x characters)

In this relationship, comparisons is the number of comparisons made per character in the text body

to be searched, and characters is the number of characters in the body of text or sequence data that

is to be searched. Setup time is the time invested prior to the actual search, and includes

programming to establish lookup tables and other data that can be used to simplify or accelerate the

search once it's initiated.

A search algorithm is considered inefficient if, in the process of a search, the number of comparisons

made is equal to or greater than the number of characters in the source text. For example, a search

algorithm in which the number of comparisons per character is greater than one is considered

inefficient. Conversely, an efficient algorithm is one that makes less than one comparison per

character to complete a search. Making less than one comparison per character might seem

counterintuitive at first, but this is possible through the use of heuristics that involve skipping

characters and repeating characters. Heuristic techniques use previous information to identify a

solution.

To illustrate the first heuristic, skipping characters, consider the search scenario in Figure 4-11 where

a search for the string "RNA" progresses from left to right in the search text "The synthesis of protein

is by way of an RNA intermediary". At the initial stage of this example, the search position is at the

right end of the first word, as indicated by the underline "e" in "The". Now, consider the progression

of the marker as the search string is compared with subsequent characters in the search text.

Figure 4-11. Character Skipping. The search, indicated by the marker box,

progresses from left to right, skipping ahead up to three characters at a

time until a character in the search string "RNA" is found in the skipped-to

region of the search text.

Search WWH ::

Custom Search

Home