Search Engine Technology
Working with the Entrez system illustrates several points regarding search engine technology. The
first is that the state of the art in search engine integration provides only partial, high-level
integration with the growing number of rapidly expanding molecular biology databases. As a result,
most intra- and inter-database links are database-specific. Furthermore, the granularity or depth of
integration depends on the features that front-end or portal developers have the time and resources
to implement.
Even a well-designed system such as Entrez is a compromise from the perspective of the user interface.
One purpose of a user interface is to hide the complexity of the underlying data structures and
database systems. However, Entrez requires users to have some low-level knowledge of the
databases included in the system. For example, the limits options available depend on the database
selected, and it's up to the user to know which options apply to which database. A relatively
inexperienced user can therefore attempt a search that fails because he or she assumes that what
works in the search of one database will also work in every other. As a result, to make the best use
of Entrez or any other Internet-based, link-integrated database system, users should be familiar with
the underlying databases.
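The mismatch described above can be sketched as a small pre-flight check. This is only an illustration: the database names and field tags below are invented for the example and are not a list of what Entrez actually supports, and `unsupported_fields` is a hypothetical helper.

```python
import re

# Hypothetical databases and field tags, invented for illustration only.
SUPPORTED_FIELDS = {
    "literature": {"AUTHOR", "TITLE", "JOURNAL"},
    "protein": {"ORGANISM", "MOLWT", "SEQLEN"},
}


def unsupported_fields(database, query):
    """Return the bracketed field tags in `query` that `database` does not list."""
    tags = {tag.upper() for tag in re.findall(r"\[([A-Za-z]+)\]", query)}
    return sorted(tags - SUPPORTED_FIELDS[database])


# A query written with the literature database in mind...
query = "kinase[TITLE] AND human[ORGANISM]"
# ...uses a tag that the (hypothetical) protein database does not offer.
problems = unsupported_fields("protein", query)
```

A front end that ran a check of this kind could warn the user before the search fails, rather than relying on the user to remember which options belong to which database.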
The popular Entrez system also illustrates that the links available through specialized search engines,
like those of general-purpose systems, yield results of varying quality. A researcher will quickly discard many
results. Furthermore, data contained in so-called secondary databases are calculated from data
contained in primary databases. Entrez supports searches on molecular weight, for example, based
on molecular weights calculated from the amino acid sequence data. As a result, errors in the
primary databases propagate to secondary databases in a way that may not be obvious from examining
the secondary database alone, because its data remain internally consistent. Furthermore, errors may not be
discovered until the data are validated by a wet lab experiment months or years later. The point is
that data validation isn't ensured simply because databases are integrated at some level. In contrast,
the process of creating a central integrated database, such as PubMed Central (PMC), necessarily
involves the validation of data during the integration process. PMC integrates life-science
journal literature in a common format and in a single repository, providing a single, unified access
portal to scientific literature instead of a combination of links to disparate databases, each with its
own idiosyncrasies in vocabularies and infrastructures.
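The molecular weight example above can be made concrete with a minimal sketch of how a secondary value might be derived from primary sequence data. The residue masses are approximate average values assumed for this illustration, the two sequences are made up, and nothing here is taken from how Entrez itself performs the calculation.

```python
# Approximate average residue masses in daltons (assumed values for illustration).
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER_MASS = 18.01528  # one water is added back for the free chain termini


def molecular_weight(sequence):
    """Sum the residue masses of a one-letter amino acid sequence, plus water."""
    return sum(RESIDUE_MASS[residue] for residue in sequence.upper()) + WATER_MASS


# Made-up sequences: the second differs only in a mis-entered final residue.
correct_sequence = "MKTAYIAKQR"
erroneous_sequence = "MKTAYIAKQG"

weight_from_correct = molecular_weight(correct_sequence)
weight_from_error = molecular_weight(erroneous_sequence)
```

Both derived weights are perfectly consistent with the sequence they were computed from, so inspecting the secondary value alone gives no hint that one of the underlying sequences is wrong.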
Working with the Entrez system also reveals several knowledge management challenges beyond
data validation. These include what to do with search results, how to update
databases so that propagation of errors is controlled and traceable, how to determine who is
responsible for maintenance, and how to communicate information to users on database updates and
corrections. For the databases included in the Entrez system, third parties provide the maintenance.
However, for private and commercial databases, these and other knowledge management activities
must be assigned, monitored, and assessed.
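One way to keep error propagation traceable, sketched here under simple assumptions, is to store with each derived value the identifier and version of the primary record it came from. All class and function names below are hypothetical and do not describe how any Entrez database is maintained.

```python
from dataclasses import dataclass


@dataclass
class PrimaryRecord:
    record_id: str
    sequence: str
    version: int = 1


@dataclass
class DerivedRecord:
    source_id: str        # which primary record the value was computed from
    source_version: int   # which version of that record was used
    value: float


def derive(primary, compute):
    """Compute a derived value and remember exactly what it was derived from."""
    return DerivedRecord(primary.record_id, primary.version, compute(primary.sequence))


def apply_correction(primary, new_sequence):
    """Correct a primary record and bump its version so dependents can be found."""
    primary.sequence = new_sequence
    primary.version += 1


def is_stale(derived, primary):
    """A derived value is stale if its source has been corrected since."""
    return derived.source_id == primary.record_id and derived.source_version != primary.version
```

With provenance recorded this way, a correction to a primary record can be followed by a scan for stale derived records, which also gives maintainers a concrete list of what to recompute and what to announce to users.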
In addition to the shortcomings of link-based database integration, Entrez also highlights the benefits
of a high-level database search system. Without a system like Entrez or a related system like the
NCBI Discovery Space that is designed to facilitate Single Nucleotide Polymorphism (SNP) research,
users would have to log in to each database in turn and copy and paste, or otherwise transfer, results
from one search to the input of another. Entrez saves users time and reduces the errors made when
transferring data from one database to another. Unfortunately, creating systems such as Entrez is a
major endeavor. Most search engines simply create dynamic links to content that last for the
duration of the session, or that at best can be saved for future reference.
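The chaining of one search into another can also be done programmatically. The sketch below assumes NCBI's E-utilities web interface, which the text does not discuss, and only builds the request URLs without sending them. The helper names and the record identifiers are made up for illustration.

```python
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def esearch_url(database, term):
    """Build a URL that searches one database for a query term."""
    return EUTILS_BASE + "esearch.fcgi?" + urlencode({"db": database, "term": term})


def elink_url(source_db, target_db, ids):
    """Build a URL that asks for records in target_db linked to ids in source_db."""
    params = {"dbfrom": source_db, "db": target_db, "id": ",".join(ids)}
    return EUTILS_BASE + "elink.fcgi?" + urlencode(params)


# Step 1: search the protein database.
search = esearch_url("protein", "insulin")
# Step 2: feed identifiers from step 1 (placeholders here) into a link request.
links = elink_url("protein", "pubmed", ["123", "456"])
```

Passing identifiers directly from one request to the next replaces the manual copy-and-paste step, which is where transcription mistakes tend to creep in.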
Intelligent Agents
As illustrated in Table 4-2, search engine technology isn't limited to dynamically inter-linking
databases, but includes a range of capabilities that apply to bioinformatics work. One particularly
active area of R&D is in the area of intelligent agents—search engines with advanced pattern-
 
 