A Logical Framework for Template Creation
and Information Extraction
David Corney¹, Emma Byrne², Bernard Buxton¹, and David Jones¹
¹ Department of Computer Science, University College London, Gower Street,
London WC1E 6BT, UK
D.Corney@ucl.ac.uk, B.Buxton@cs.ucl.ac.uk, D.Jones@cs.ucl.ac.uk
² School of Primary Care and Population Sciences, University College London,
Highgate Hill, London N19 5LW, UK
emma.byrne@ucl.ac.uk
Summary. Information extraction is the process of automatically identifying facts
of interest from pieces of text, and so transforming free text into a structured
database. Past work has often been successful but ad hoc, and in this paper we
propose a more formal basis from which to discuss information extraction. We
introduce a framework which will allow researchers to compare their methods as
well as their results, and will help to reveal new insights into information
extraction and text mining practices.
One problem in many information extraction applications is the creation of
templates, which are textual patterns used to identify information of interest. Our
framework describes formally what a template is and covers other typical information
extraction tasks. We show how common search algorithms can be used to create and
optimise templates automatically, using sequences of overlapping templates, and we
develop heuristics that make this search feasible. Finally, we demonstrate a
successful implementation of the framework and apply it to a typical biological
information extraction task.
1 Introduction
Information extraction (IE) [7] has developed over recent decades with
applications analysing text from news sources [8], financial sources [6], and
biological research papers [1,5,12]. Competitions such as MUC and TREC have been
promoted as using real text sources to highlight problems in the real world,
and more recently TREC has included a genomics track [11], again highlighting
biology and medicine as growing areas of IE research. It has long been
recognised that there is a need to share resources between research groups
in order to allow a fair comparison of their different systems and to motivate
and direct further research. We strongly feel that there is also a need
to provide a theoretical framework within which these information extraction