case for leveraging templates to systematically enhance otherwise unstructured
documents with structure to support subsequent uses for the document (e.g., for
knowledge discovery).
5.2.2 Natural Language Understanding
Appreciating that a significant component of the biomedical literature will likely
continue to be represented in narrative form, there will be continued demand for the
development and use of computational approaches to identify the knowledge
potentially embedded within it. The sheer volume of biomedical literature that needs
to be analyzed will perpetually necessitate the use of computational approaches. As
mentioned earlier, MEDLINE consists of more than 20 million citations. Even more
impressive is the growth rate of MEDLINE - currently exceeding 1.5 million articles
a year, up from 500,000 articles a year less than a decade ago. It is not inconceivable
that, with the growth of biomedical data generation, the interpretations embodied in
the biomedical literature will drive continued growth in annual MEDLINE entries.
This volume of text represents a challenge that will increasingly depend on
automated approaches to elicit the knowledge sequestered in textual form.
Natural language processing systems are built around algorithms that mediate
between unstructured data and human understanding [ 18 ]. Natural language
processing systems come in two flavors: (1) Natural Language Understanding
(NLU); and (2) Natural Language Generation (NLG). Both types of systems are rife
with challenges. The combination of NLU and NLG systems in fact embodies the
ultimate Turing test - where a human communicates with a computer in natural
language without being able to detect that the interlocutor is not human. For the
present discussion, we will focus on NLU systems, since they aim to extract
information from unstructured data such as that embodied in the biomedical
literature.
NLU systems are generally built on a combination of linguistic heuristics that
approximate human interpretation of concept recognition, grammar, and ultimately
the meaning connoted by text. At a high level, there are three major aspects of NLU:
(1) Lexical Analysis - identification of named concepts that can be matched to a
dictionary of terms; (2) Syntactic Analysis - identification of the syntax used to
encode grammar in the context of identified terms; and (3) Semantic Analysis -
identification of the concepts represented by identified terms. NLU systems have
been developed that focus on one or a combination of these major areas. The
inherent variety afforded by the power of natural language is also what continues to
drive the need for advanced research in the development of NLU systems.
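To make the three aspects concrete, the following is a minimal sketch in Python of a dictionary-based pipeline; the lexicon entries, semantic types, and function names are hypothetical illustrations, not part of any real NLU system, and a production system would use far richer grammars and terminologies (e.g., the UMLS).

```python
import re

# Toy dictionary of terms mapped to semantic types (hypothetical entries).
LEXICON = {
    "aspirin": "Pharmacologic Substance",
    "headache": "Sign or Symptom",
    "inhibits": "Functional Relation",
}

def lexical_analysis(text):
    """Lexical Analysis: match tokens against a dictionary of named concepts."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [(t, LEXICON[t]) for t in tokens if t in LEXICON]

def semantic_analysis(matches):
    """Semantic Analysis (crude): assemble a subject-relation-object triple
    from the matched concepts, using the relation term as the pivot."""
    subj = rel = obj = None
    for term, semtype in matches:
        if semtype == "Functional Relation":
            rel = term
        elif rel is None:
            subj = term
        else:
            obj = term
    return (subj, rel, obj)

matches = lexical_analysis("Aspirin inhibits headache in most patients.")
print(matches)  # matched concepts with their semantic types
print(semantic_analysis(matches))  # ('aspirin', 'inhibits', 'headache')
```

Syntactic analysis is elided here; in practice a parser would establish which grammatical roles the matched terms play before any semantic triple is asserted.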
The challenges faced by NLU systems notwithstanding, the potential to leverage
automated routines for extracting information from large volumes of text addresses
a key issue in leveraging potentially available knowledge. A recent exposition of
artificial intelligence supported by NLU is the Watson system developed by IBM.