Biomedical Engineering Reference
In-Depth Information
of the form:
SELECT molecular_wt FROM protein_database
WHERE protein = 'hemoglobin'
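As a minimal sketch of how a generated query of this form might be executed, the Python snippet below runs it against an in-memory SQLite database. The table layout follows the query above, but the stored row and its approximate molecular weight are illustrative placeholders rather than data from the text, and the helper name lookup_molecular_wt is hypothetical. The protein name is passed as a bound parameter, so the string value needs no manual quoting.

```python
import sqlite3


def lookup_molecular_wt(conn, protein_name):
    """Return the molecular weight stored for protein_name, or None if absent."""
    row = conn.execute(
        "SELECT molecular_wt FROM protein_database WHERE protein = ?",
        (protein_name,),  # bound parameter instead of a hand-quoted literal
    ).fetchone()
    return row[0] if row else None


# Illustrative in-memory table; the value is an approximate placeholder.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE protein_database (protein TEXT PRIMARY KEY, molecular_wt REAL)"
)
conn.execute("INSERT INTO protein_database VALUES ('hemoglobin', 64500.0)")

print(lookup_molecular_wt(conn, "hemoglobin"))
```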
In addition to NLP, a number of other technologies are useful for locating textual and graphic
data in very large databases. One of them is image-based query by example, where the user
selects from a library of images to create and then refine a search. Using this technology, the user
selects an image of a protein structure and then either selects the closest fit or a representative of
additional image libraries, depending on the extent of the database. The same approach is often used
in commercial search engines, where the user is able to specify a search for "more like these." The
system takes the exemplars and creates a search that may include terms and constraints that may
not have been included in the user's initial search. The advantage of a search-by-example tool is that
refining a search is relatively painless and doesn't require any particular knowledge of vocabulary,
database contents, or other low-level details. However, the disadvantage of most query-by-example
systems is that the search query that is actually generated is hidden from the user. As a result, an
expert may not be able to manually refine the search even further. The ability to override a computer-
generated search, such as the utility provided in Entrez where a user can edit the search criteria
generated through the use of pull-down menus, may or may not be an issue, depending on the
expertise of the user.
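The idea of building a search from exemplars, and of keeping the generated query visible so an expert can edit it, can be sketched as follows. This is only an illustration under stated assumptions: each exemplar is reduced to a set of keywords, the function names expand_query and render_query are hypothetical, and the rule "add every keyword shared by all chosen exemplars" is a simple stand-in for whatever a real query-by-example system does.

```python
from collections import Counter


def expand_query(initial_terms, exemplars, min_share=1.0):
    """Append keywords that occur in at least min_share of the exemplars."""
    counts = Counter(term for keywords in exemplars for term in set(keywords))
    needed = min_share * len(exemplars)
    added = sorted(
        term
        for term, n in counts.items()
        if n >= needed and term not in initial_terms
    )
    return list(initial_terms) + added


def render_query(terms):
    """Show the generated query as plain text so the user can inspect or edit it."""
    return " AND ".join(terms)


# Hypothetical keyword sets for two items the user marked as "more like these".
exemplars = [
    {"globin", "oxygen transport", "heme"},
    {"globin", "heme", "muscle"},
]
terms = expand_query(["globin"], exemplars)
print(render_query(terms))
```

Exposing the output of render_query, rather than hiding it, is what would let an expert refine the search by hand.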
One of the advantages of using NLP or query by example is that it frees the user from having to learn
a controlled vocabulary. An NLP engine can map concepts and use the appropriate synonyms that the
underlying database management systems expect in order to provide optimum search results.
However, the power of an NLP engine or an ability to manually override a search query lies in the
granularity of the vocabulary used to index the data originally. For example, if all genes dealing with
the heart are indexed under "cardiac," without distinguishing between normal and diseased
conditions, then a researcher won't be able to narrow a search to the normal heart alone.
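The dependence on indexing granularity can be made concrete with a small sketch. Everything here is hypothetical: the synonym table, the gene labels, and the "concept/qualifier" key scheme are invented for illustration and do not come from any real vocabulary. The same synonym mapping is applied to both indexes; only the finer-grained index can answer the narrowed search.

```python
# Hypothetical synonym table mapping user words to one controlled term.
SYNONYMS = {"heart": "cardiac", "cardiac": "cardiac", "myocardial": "cardiac"}

# Coarse index: every heart-related gene is filed under a single term.
COARSE_INDEX = {"cardiac": {"gene_A", "gene_B", "gene_C"}}

# Finer index: the same genes, split by condition.
FINE_INDEX = {
    "cardiac/normal": {"gene_A"},
    "cardiac/diseased": {"gene_B", "gene_C"},
}


def search(index, user_term, qualifier=None):
    """Map user_term to its controlled term, then look it up in the index."""
    concept = SYNONYMS.get(user_term.lower(), user_term.lower())
    if qualifier is None:
        # No qualifier: collect everything filed under the concept.
        hits = set()
        for key, genes in index.items():
            if key == concept or key.startswith(concept + "/"):
                hits |= genes
        return hits
    return set(index.get(f"{concept}/{qualifier}", set()))


print(search(COARSE_INDEX, "heart"))            # all three genes
print(search(COARSE_INDEX, "heart", "normal"))  # empty: the index cannot narrow
print(search(FINE_INDEX, "heart", "normal"))    # only the gene indexed as normal
```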
The optimum condition exists when the controlled vocabulary is made available to users during the
search process. For example, PubMed is indexed using the Medical Subject Headings (MeSH)
vocabulary, maintained by the U.S. National Library of Medicine. Knowing this, a researcher can use
the online MeSH browser to identify the most appropriate search terms to use to retrieve the data of
interest.
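Once suitable MeSH terms have been identified, they can be used directly in a PubMed search. The sketch below only builds a search URL and does not contact any server. It assumes the NCBI E-utilities esearch endpoint and the [MeSH Terms] field tag, neither of which is described in the text above; the function name mesh_search_url and the two example headings are chosen purely for illustration.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint (an assumption, not named in the text).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def mesh_search_url(mesh_terms, retmax=20):
    """Build a PubMed search URL that restricts each term to the MeSH field."""
    term = " AND ".join(f'"{t}"[MeSH Terms]' for t in mesh_terms)
    params = {"db": "pubmed", "term": term, "retmax": retmax}
    return ESEARCH_URL + "?" + urlencode(params)


url = mesh_search_url(["Hemoglobins", "Polymorphism, Single Nucleotide"])
print(url)
```

Tagging each term with the MeSH field asks the search to match on the controlled vocabulary used for indexing rather than on free text.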
For a research group establishing an internal database, MeSH may not be the most appropriate
controlled vocabulary for indexing and searching. Even within the relatively narrow domain of clinical
medicine, there are several popular controlled vocabulary systems in use. In addition to MeSH, there
are the Unified Medical Language System (UMLS), the Read Classification System (RCS), the Systematized
Nomenclature of Human and Veterinary Medicine (SNOMED), the International Classification of Diseases
(ICD-10), and Current Procedural Terminology (CPT). Each system has its strengths, weaknesses,
and primary purpose. For example, SNOMED is optimized for accessing and indexing clinical
information in human and veterinary medicine databases, whereas CPT is optimized to identify
medical procedures.
The advantage of using one of these public controlled vocabularies is that the vocabulary is
immediately available. Time-consuming tasks such as removing redundancies in the vocabulary,
which ultimately limit scalability, have already been performed by someone else, presumably experts in the
field. Another advantage is that a group whose database is indexed with a public controlled vocabulary can more
readily share the database with others without having to distribute the indexing vocabulary. For
example, if an academic research center wants to publish its research on SNPs and drug responses
on the Internet, it can provide a simple keyword search interface to the database and simply list the
appropriate search vocabulary, such as MeSH.
The major disadvantage of using a public controlled vocabulary, or its given representation, is that its
granularity may not exactly fit the needs of the laboratory. Another limitation is that the public