candidate solutions; a means of generating new candidate solutions; and an
algorithm for guiding the search (including starting and stopping). Any useful
framework describing IE must provide a way to define and create templates,
and our framework proposes using these AI search methods, an idea we expand
in Sect. 6.2, where we “grow” useful templates from given seed phrases.
One alternative to using templates is co-occurrence analysis [16]. This
identifies pieces of text (typically sentences, abstracts or entire documents)
that mention two entities, and assumes that this implies that the two entities
are in some way related. Within our framework, this can be seen as a special
case of a template, albeit a very simple one, as we show in Sect. 2.3.
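As a rough illustration of the co-occurrence idea described above, the following Python sketch treats each sentence as a tuple of words and keeps those sentences that mention both entities. The function name `cooccurrences` and the third sample sentence are hypothetical, introduced only for this example; the sketch does not reproduce the template formulation of Sect. 2.3.

```python
def cooccurrences(sentences, entity_a, entity_b):
    """Return the sentences (tuples of words) that mention both entities."""
    return [s for s in sentences if entity_a in s and entity_b in s]


# Toy sentences; the third one is invented purely for illustration.
sentences = [
    ("the", "cat", "sat", "on", "the", "mat"),
    ("a", "mouse", "ran", "up", "the", "clock"),
    ("the", "cat", "chased", "a", "mouse"),
]

# Only sentences containing both "cat" and "mouse" are kept.
hits = cooccurrences(sentences, "cat", "mouse")
```

Note that the sketch records only that the two entities appear together; it says nothing about the nature of the relationship, which is exactly the assumption that co-occurrence analysis makes.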
The framework itself is presented in Sects. 2-5, with the subsequent sections
discussing various implementation issues.
Section 2 defines various concepts formally, moving from words and documents
to templates and information extraction. Section 3 describes how templates
can be ordered according to how specific or general they are, as a
precursor to template creation and optimisation. Section 4 discusses how to
modify a template to make it more general. Section 5 gives formal definitions
of recall and precision within our framework and discusses how they might
be estimated in practice. Section 6 discusses heuristic search algorithms and
their implementation and includes a detailed example, before a concluding
discussion.
A shorter form of this work is published in [4].
2 Basic Definitions
In this section, we define several terms culminating in a formal definition of
information extraction templates.
Definition 1. A literal λ is a word in the form of an ordered list of characters.
We assume implicitly a fixed alphabet of characters.
Examples: “cat”, “jumped”, “2,5-dihydroxybenzoic”.
Definition 2. A document d is a tuple (ordered list) of literals:
$d = \langle \lambda_1, \lambda_2, \ldots, \lambda_{|d|} \rangle$.
Examples: $d_1 = \langle \text{the, cat, sat, on, the, mat} \rangle$,
$d_2 = \langle \text{a, mouse, ran, up, the, clock} \rangle$.
Definition 3. A corpus D is a set of documents:
$D = \{ d_1, d_2, \ldots, d_{|D|} \}$.
Example: $D_1 = \{ d_1, d_2 \}$.
Definition 4. A lexicon Λ is the set of all literals found in all documents in
a corpus: $\Lambda_D = \{ \lambda \mid \lambda \in d \text{ and } d \in D \}$.
Example: $\Lambda_{D_1} = \{ \text{the, cat, sat, on, mat, a, mouse, ran, up, clock} \}$.
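Definitions 1 to 4 map fairly directly onto built-in Python types, as the minimal sketch below suggests. It assumes that a literal is a string, a document is a tuple of strings, and a corpus is a set of such tuples; the function name `lexicon` is chosen here for illustration and is not part of the framework.

```python
# Documents are tuples (ordered lists) of literals, as in Definition 2.
d1 = ("the", "cat", "sat", "on", "the", "mat")
d2 = ("a", "mouse", "ran", "up", "the", "clock")

# A corpus is a set of documents (Definition 3); tuples are hashable,
# so they can be stored in a Python set.
corpus = {d1, d2}


def lexicon(corpus):
    """Return the set of all literals found in any document of the corpus."""
    return {literal for document in corpus for literal in document}


lex = lexicon(corpus)
```

Because the lexicon is a set, the literal "the" appears in it only once even though it occurs three times across the two documents, which matches the ten-element example given for Definition 4.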