domain. Further, we give an overview of related work in this field, and finally we describe the design of the IE module in our system.
IE is a relatively new discipline within the more general field of Natural Language Processing (NLP). IE is not Information Retrieval (IR), in which keywords are used to select relevant documents from a large collection, e.g. the Internet. In contrast, IE extracts relevant information from documents and can be used to post-process the IR output. As usual with emerging technologies, there are a number of definitions of information extraction. We understand IE as a procedure that selects, extracts and combines data from unstructured text in order to produce structured information. This structured information can then easily be transformed into a database record.
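To make this notion concrete, the following minimal sketch shows how a single hand-written pattern could turn an unstructured sentence into a structured record. The pattern, the field names and the example sentence are illustrative assumptions, not the rules of the system described here.

import re

# Illustrative pattern: "<gene> interacts with <gene>". Pattern and field
# names are assumptions for this sketch, not the actual system's rules.
PATTERN = re.compile(r"(?P<agent>\w+) interacts with (?P<target>\w+)")

def extract_record(sentence):
    """Select, extract and combine data from unstructured text into a record."""
    match = PATTERN.search(sentence)
    if match is None:
        return None
    # The resulting dict maps directly onto a database record.
    return {
        "agent": match.group("agent"),
        "target": match.group("target"),
        "evidence": sentence,
    }

print(extract_record("Rad51 interacts with Brca2 in vivo."))
# -> {'agent': 'Rad51', 'target': 'Brca2',
#     'evidence': 'Rad51 interacts with Brca2 in vivo.'}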
Much of the work in this field has been influenced by the Message Understanding Conferences (MUCs) instituted by the Defense Advanced Research Projects Agency (DARPA) in the late 1980s. The MUCs were created as a platform for the evaluation of IE systems. For this purpose, a set of five tasks, described below, was defined which the competing systems had to fulfil. Altogether, seven conferences were held, each with a different focus and additional tasks. MUC-7 was concerned with newspaper articles about space vehicle and missile launches. The defined tasks have also been adopted, more or less, by most IE systems implemented in domains other than the one specified by MUC-7. The tasks can be defined as follows:
Named Entity Recognition (NE):
The NE task deals with the identification of predefined entities (e.g. names, organizations, locations). In our case these entities are gene or protein names (a minimal sketch of this step follows the list).

Coreference Resolution (CO):
This task requires connecting all references to "identical" entities. This includes variant forms of name expressions (e.g. Paul/he/…).

Template Element Filling (TE):
The required information should be extracted from the text and filled into predefined templates which reflect a structured representation of the wanted information.

Template Relations (TR):
TR covers the identification of relationships between template elements.

Scenario Template (ST):
The ST task fits the TE and TR results into specified event scenarios.
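As an illustration of the NE step in our domain, the sketch below tags gene/protein names by simple dictionary lookup. The gazetteer entries, the tagging scheme and the example sentence are assumptions made for illustration; real systems typically combine such lookups with contextual rules or statistical models.

# Minimal NE sketch: dictionary-based tagging of gene/protein names.
# The gazetteer entries are illustrative assumptions, not a real resource.
GENE_GAZETTEER = {"p53", "BRCA1", "Rad51", "MDM2"}

def tag_entities(tokens):
    """Label each token as GENE or O (outside) - the simplest form of NE."""
    return [(tok, "GENE" if tok in GENE_GAZETTEER else "O") for tok in tokens]

tokens = "MDM2 binds the tumour suppressor p53 .".split()
print(tag_entities(tokens))
# -> [('MDM2', 'GENE'), ('binds', 'O'), ('the', 'O'), ('tumour', 'O'),
#     ('suppressor', 'O'), ('p53', 'GENE'), ('.', 'O')]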
A more detailed description of the tasks may be found in the MUC proceedings (http://www.itl.nist.gov/iaui/894.02/related_projects/muc/proceedings/muc_7_toc.html). To evaluate the systems with a simple metric, two scores were defined: recall (3) and precision (4). The calculation of the scores is very similar for the different tasks, except for the CO task. As an example we show the calculation of the scores for the TE task, taken from the MUC Scoring Software User's Manual. Two filled templates are compared: one filled manually, which contains the keys, and one filled by the software, which contains the responses.
Given the definitions:
COR Correct - the two single fills are considered identical,
INC Incorrect - the two single fills are not identical,
PAR Partially Correct - the two single fills are not identical, but partial credit should still be given,
MIS Missing - a key object has no response object aligned with it,
SPU Spurious - a response object has no key object aligned with it,
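the scores can then be computed from these counts. The formulas below are a reconstruction of the standard MUC definitions (this excerpt does not reproduce equations (3) and (4) themselves, so the notation in the manual may differ slightly); partially correct fills receive half credit in both scores:

\text{recall} = \frac{COR + 0.5 \cdot PAR}{COR + INC + PAR + MIS} \tag{3}

\text{precision} = \frac{COR + 0.5 \cdot PAR}{COR + INC + PAR + SPU} \tag{4}

Intuitively, recall measures how much of the information in the key templates was found, while precision measures how much of the information in the response templates is correct.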