A Logical Framework for Template Creation and Information Extraction - Data Mining: Foundations and Practice


candidate solutions; a means of generating new candidate solutions; and an

algorithm for guiding the search (including starting and stopping). Any useful

framework describing IE must provide a way to define and create templates,

and our framework proposes using these AI search methods, an idea we expand

in Sect. 6.2, where we “grow” useful templates from given seed phrases.

One alternative to using templates is co-occurrence analysis [16]. This

identifies pieces of text (typically sentences, abstracts or entire documents)

that mention two entities, and assumes that this implies that the two entities

are in some way related. Within our framework, this can be seen as a special

case of a template, albeit a very simple one, as we show in Sect. 2.3.
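
As a rough illustration of co-occurrence analysis as described above, the following sketch reports which pairs of entities are mentioned together in the same piece of text. It is not the method of [16]: the function name `cooccurrences`, the whitespace tokenisation, the restriction to single-word lowercase entities, and the sample sentences are all assumptions made for the example.

```python
from itertools import combinations


def cooccurrences(texts, entities):
    """Map each entity pair to the indices of the texts that mention both.

    Assumption: entities are single lowercase words and each text is
    tokenised by splitting on whitespace.
    """
    hits = {}
    for index, text in enumerate(texts):
        words = set(text.lower().split())
        # Sort so that each pair has one canonical ordering.
        present = sorted(e for e in entities if e in words)
        for pair in combinations(present, 2):
            hits.setdefault(pair, []).append(index)
    return hits


# Hypothetical sentences, used only for illustration.
sentences = [
    "the cat chased the mouse",
    "a mouse ran up the clock",
    "the cat sat on the mat",
]
pairs = cooccurrences(sentences, {"cat", "mouse", "clock"})
```

Here the first sentence yields the pair (cat, mouse) and the second yields (clock, mouse). The sketch says nothing about how the entities are related, only that they appear together, which is exactly the assumption that co-occurrence analysis makes.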

The framework itself is presented in Sects. 2-5, with the subsequent

sections discussing various implementation issues.

Section 2 defines various concepts formally, moving from words and documents

to templates and information extraction. Section 3 describes how templates

can be ordered according to how specific or general they are, as a

precursor to template creation and optimisation. Section 4 discusses how to

modify a template to make it more general. Section 5 gives formal definitions

of recall and precision within our framework and discusses how they might

be estimated in practice. Section 6 discusses heuristic search algorithms and

their implementation and includes a detailed example, before a concluding

discussion.

A shorter form of this work is published in [4].

2 Basic Definitions

In this section, we define several terms culminating in a formal definition of

information extraction templates.

Definition 1. A literal λ is a word in the form of an ordered list of characters.

We assume implicitly a fixed alphabet of characters.

Examples: “cat”, “jumped”, “2,5-dihydroxybenzoic”.

Definition 2. A document d is a tuple (ordered list) of literals: $d = \langle \lambda_1, \lambda_2, \ldots, \lambda_{|d|} \rangle$.

Examples: $d_1 = \langle \text{the, cat, sat, on, the, mat} \rangle$, $d_2 = \langle \text{a, mouse, ran, up, the, clock} \rangle$.

Definition 3. A corpus D is a set of documents: $D = \{d_1, d_2, \ldots, d_{|D|}\}$.

Example: $D_1 = \{d_1, d_2\}$.

Definition 4. A lexicon Λ is the set of all literals found in all documents in a corpus: $\Lambda_D = \{\lambda \mid \lambda \in d \text{ and } d \in D\}$.

Example: $\Lambda_{D_1} = \{\text{the, cat, sat, on, mat, a, mouse, ran, up, clock}\}$.
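
Definitions 1 to 4 map quite directly onto built-in data structures. The sketch below is one possible encoding in Python, using the example documents given above. The choice of `str`, `tuple` and `set` and the function name `lexicon` are assumptions made for illustration, not part of the framework itself.

```python
# Definition 1: a literal is a word, modelled here as a Python str.
# Definition 2: a document is an ordered tuple of literals.
d1 = ("the", "cat", "sat", "on", "the", "mat")
d2 = ("a", "mouse", "ran", "up", "the", "clock")

# Definition 3: a corpus is a set of documents (tuples are hashable,
# so they can be members of a set).
D1 = {d1, d2}


def lexicon(corpus):
    """Definition 4: the set of all literals found in any document."""
    return {literal for document in corpus for literal in document}


lexicon_D1 = lexicon(D1)
```

Because a lexicon is a set, the repeated literal "the" is counted once, so `lexicon_D1` holds the ten distinct literals listed in the example for Definition 4.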
