A Logical Framework for Template Creation and Information Extraction - Data Mining: Foundations and Practice


the core of information extraction. The words that are matched define the

information that is to be extracted.

Example: Given a template τ_1 = <{DT}, {ANIMAL}, {VBD}> and a corpus D_1 (as defined after Definition 3), then µ(τ_1, D_1) = {<the, cat, sat>, <a, mouse, ran>}.
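The matching step in the example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the lexicon, the toy corpus, and the names `LEXICON`, `element_matches` and `match` are hypothetical, and the corpus is not the chapter's D_1. The sketch also assumes that a template element matches a word when every term in the element applies to that word, and that a template matches contiguous fragments within a sentence.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical lexicon: the terms (POS tags, semantic classes) that apply to each word.
LEXICON: Dict[str, Set[str]] = {
    "the": {"DT"},
    "a": {"DT"},
    "cat": {"NN", "ANIMAL"},
    "mouse": {"NN", "ANIMAL"},
    "sat": {"VBD"},
    "ran": {"VBD"},
    "on": {"IN"},
    "mat": {"NN"},
}

Template = List[Set[str]]


def element_matches(element: Set[str], word: str) -> bool:
    """Assumed rule: every term in the element must apply to the word."""
    return element <= LEXICON.get(word, set())


def match(template: Template, corpus: List[List[str]]) -> Set[Tuple[str, ...]]:
    """Return the set of contiguous fragments matched by the template."""
    n = len(template)
    found: Set[Tuple[str, ...]] = set()
    for sentence in corpus:
        # Slide a window of the template's length across the sentence.
        for start in range(len(sentence) - n + 1):
            fragment = tuple(sentence[start:start + n])
            if all(element_matches(e, w) for e, w in zip(template, fragment)):
                found.add(fragment)
    return found


# Toy corpus, chosen for illustration only (not the chapter's D_1).
corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["a", "mouse", "ran"],
]
tau_1: Template = [{"DT"}, {"ANIMAL"}, {"VBD"}]
matches = match(tau_1, corpus)
```

On this toy corpus, `matches` contains the two fragments `("the", "cat", "sat")` and `("a", "mouse", "ran")`, mirroring the result quoted in the example.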

2.3 Co-occurrence Analysis

Co-occurrence analysis assumes that two entities in the same piece of text

are related, without attempting a more sophisticated linguistic analysis of

the text. In our framework, this can be represented by a template (or set of

templates) that defines the two entities with a series of wildcards between

them.

For example, suppose we use co-occurrence analysis to discover every sentence in a corpus that mentions two entities, as matched by template elements T_i and T_j. Let us assume that all sentences to be considered are finite, with a maximum length of Q words. Then we could define two templates,
τ_1 = <T_1, ..., T_i, ..., T_j, ..., T_Q> and τ_2 = <T_1, ..., T_j, ..., T_i, ..., T_Q>.
Two templates are required if we wish to allow for sentences with the two entities in either order. We replace every template element except for T_i and T_j with the wildcard element “?” so as to match any sentence containing words that match our terms. With three or more entities, larger sets of templates may be required.
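As a rough sketch of this construction, the following Python builds the two co-occurrence templates for a given pair of positions. The function name `cooccurrence_templates`, the terms `GENE` and `DISEASE`, and the use of 0-based positions are assumptions made for illustration; they do not come from the chapter.

```python
from typing import FrozenSet, List, Tuple

Element = FrozenSet[str]
Template = List[Element]

# The wildcard element "?" stands for "match any word".
WILDCARD: Element = frozenset({"?"})


def cooccurrence_templates(
    t_i: Element, t_j: Element, q: int, i: int, j: int
) -> Tuple[Template, Template]:
    """Build two length-q templates with the entities at positions i and j.

    Positions are 0-based here. Every other element is a wildcard.
    The second template swaps the two entities to cover the reverse order.
    """
    if not 0 <= i < j < q:
        raise ValueError("need 0 <= i < j < q")
    first: Template = [WILDCARD] * q
    first[i], first[j] = t_i, t_j
    second: Template = [WILDCARD] * q
    second[i], second[j] = t_j, t_i
    return first, second


# Hypothetical entity terms, for illustration only.
gene: Element = frozenset({"GENE"})
disease: Element = frozenset({"DISEASE"})
tau_1, tau_2 = cooccurrence_templates(gene, disease, q=5, i=1, j=3)
```

Here `tau_1` places `GENE` before `DISEASE` and `tau_2` places them the other way round, with wildcards in the remaining three positions.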

3 Template Ordering

One motivation for creating this framework is to enable the use of common

search algorithms for template creation. To do this effectively, we must define

an ordering over the templates, which we can then use to develop practical

search heuristics.

For any given document, each template matches a certain number of frag-

ments. A template that matches every possible fragment is useless, as is one

that matches no fragments at all. Somewhere between these two extremes of

generic templates and specific templates lie useful templates that match the

interesting fragments only, so the aim of template creation is to find a suit-

able trade-off between the generic and the specific. We therefore suggest that

a useful ordering is one based on the number of fragments that a template is

likely to match. We can use such an order to search across a range of templates and explore the trade-off. For unseen text, it is impossible to predict in advance how much information will be extracted, so we instead develop a heuristic ordering that approximates it.
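To make the generic-to-specific trade-off concrete, the sketch below ranks a few templates by how many fragments each one matches in a small sample corpus. This is an empirical count on known text, not the heuristic ordering that the chapter goes on to define. The lexicon, corpus, the term `PLANT`, and the function names are all hypothetical, and the matching rule (every term in an element must apply to the word) is an assumption.

```python
from typing import Dict, List, Set

WILDCARD = "?"

# Hypothetical lexicon: the terms that apply to each word.
LEXICON: Dict[str, Set[str]] = {
    "the": {"DT"}, "a": {"DT"},
    "cat": {"NN", "ANIMAL"}, "mouse": {"NN", "ANIMAL"},
    "sat": {"VBD"}, "ran": {"VBD"}, "fell": {"VBD"},
    "on": {"IN"}, "mat": {"NN"},
}


def element_matches(element: Set[str], word: str) -> bool:
    """The wildcard matches any word; otherwise all terms must apply."""
    if WILDCARD in element:
        return True
    return element <= LEXICON.get(word, set())


def count_matches(template: List[Set[str]], corpus: List[List[str]]) -> int:
    """Count the fragment positions in the corpus that the template matches."""
    n = len(template)
    return sum(
        all(element_matches(e, w) for e, w in zip(template, sentence[s:s + n]))
        for sentence in corpus
        for s in range(len(sentence) - n + 1)
    )


# Toy sample corpus, for illustration only.
corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["a", "mouse", "ran"],
    ["the", "mat", "fell"],
]

templates: Dict[str, List[Set[str]]] = {
    "<DT, ANIMAL, VBD>": [{"DT"}, {"ANIMAL"}, {"VBD"}],
    "<?, ?, ?>": [{"?"}, {"?"}, {"?"}],
    "<DT, PLANT, VBD>": [{"DT"}, {"PLANT"}, {"VBD"}],
    "<DT, ?, VBD>": [{"DT"}, {"?"}, {"VBD"}],
}

counts = {name: count_matches(t, corpus) for name, t in templates.items()}
# Most generic (most matches) first, most specific (fewest matches) last.
ranked = sorted(counts, key=counts.get, reverse=True)
```

In this sample the all-wildcard template matches every three-word fragment and the `PLANT` template matches none, so both sit at the useless extremes described above; the two templates in between are the candidates worth exploring.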

In this section, we define possible orderings of terms and templates. In

the next section, we define algorithms that use these orderings to modify
