the core of information extraction. The words that are matched define the
information that is to be extracted.
Example: Given a template τ_1 = <{DT}, {ANIMAL}, {VBD}> and a corpus D_1
(as defined after Definition 3), then
µ(τ_1, D_1) = {<the, cat, sat>, <a, mouse, ran>}.
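The example above can be illustrated with a minimal Python sketch. It rests on assumptions that are not taken from the source: the lexicon, the helper names (element_matches, match_template) and the toy corpus are hypothetical stand-ins, since Definition 3 and the actual contents of D_1 are not reproduced here. A template element is treated as a set of attributes that a word must carry in order to match.

```python
# Hypothetical lexicon: each word maps to the attributes it carries.
# These tags are illustrative only and are not taken from the source.
LEXICON = {
    "the": {"DT"},
    "a": {"DT"},
    "cat": {"NN", "ANIMAL"},
    "mouse": {"NN", "ANIMAL"},
    "mat": {"NN"},
    "sat": {"VBD"},
    "ran": {"VBD"},
    "on": {"IN"},
    "away": {"RB"},
}


def element_matches(element, word):
    """Assumed rule: an element matches a word that has all its attributes."""
    return element <= LEXICON.get(word, set())


def match_template(template, corpus):
    """Return the set of fragments (word tuples) matched by the template."""
    n = len(template)
    fragments = set()
    for sentence in corpus:
        # Slide a window of the template's length over the sentence.
        for start in range(len(sentence) - n + 1):
            window = sentence[start:start + n]
            if all(element_matches(e, w) for e, w in zip(template, window)):
                fragments.add(tuple(window))
    return fragments


# tau_1 = <{DT}, {ANIMAL}, {VBD}>
tau_1 = [{"DT"}, {"ANIMAL"}, {"VBD"}]

# Hypothetical stand-in for the corpus D_1.
D_1 = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["a", "mouse", "ran", "away"],
]

matches = match_template(tau_1, D_1)
print(matches)
```

With this toy corpus, the matched fragments are <the, cat, sat> and <a, mouse, ran>, which agrees with the example.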
2.3 Co-occurrence Analysis
Co-occurrence analysis assumes that two entities in the same piece of text
are related, without attempting a more sophisticated linguistic analysis of
the text. In our framework, this can be represented by a template (or set of
templates) that defines the two entities with a series of wildcards between
them.
For example, suppose we use co-occurrence analysis to discover every
sentence in a corpus that mentions two entities, as matched by template
elements T_i and T_j. Let us assume that all sentences to be considered are
finite, with a maximum length of Q words. Then we could define two templates,
τ_1 = <T_1, ..., T_i, ..., T_j, ..., T_Q> and τ_2 = <T_1, ..., T_j, ..., T_i, ..., T_Q>.
Two templates are required if we wish to allow for sentences with the two
entities in either order. We replace every template element except for T_i
and T_j with the wildcard element “?” so as to match any sentence containing
words that match our terms. With three or more entities, larger sets of
templates may be required.
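One way to read the construction above is sketched below in Python. The sketch makes assumptions beyond the source: an entity is modelled as a set of surface words, the helper names are hypothetical, and the "series of wildcards" is expanded into one template per gap length (0 to Q - 2 wildcards) in each of the two orders, so it builds a set of templates, not exactly two.

```python
WILDCARD = "?"


def cooccurrence_templates(entity_a, entity_b, max_len):
    """Build templates with 0 .. max_len - 2 wildcards between two entities.

    Both orders (a before b, b before a) are generated for every gap length.
    """
    templates = []
    for gap in range(max_len - 1):
        wild = [WILDCARD] * gap
        templates.append([entity_a] + wild + [entity_b])
        templates.append([entity_b] + wild + [entity_a])
    return templates


def element_matches(element, word):
    """A wildcard matches any word; an entity matches any of its words."""
    return element == WILDCARD or word in element


def sentence_matches(template, sentence):
    """True if the template matches some contiguous window of the sentence."""
    n = len(template)
    return any(
        all(element_matches(e, w) for e, w in zip(template, sentence[s:s + n]))
        for s in range(len(sentence) - n + 1)
    )


def cooccurring_sentences(entity_a, entity_b, corpus, max_len):
    """Return the sentences in which both entities occur, in either order."""
    templates = cooccurrence_templates(entity_a, entity_b, max_len)
    return [s for s in corpus if any(sentence_matches(t, s) for t in templates)]


# Hypothetical entities and corpus, for illustration only.
cats = {"cat", "cats"}
mice = {"mouse", "mice"}
corpus = [
    ["the", "cat", "chased", "a", "mouse"],
    ["a", "mouse", "ran", "from", "the", "cat"],
    ["the", "cat", "sat"],
]

found = cooccurring_sentences(cats, mice, corpus, max_len=6)
print(found)
```

Here the first two sentences are returned and the third is not, because it mentions only one of the two entities.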
3 Template Ordering
One motivation for creating this framework is to enable the use of common
search algorithms for template creation. To do this effectively, we must define
an ordering over the templates, which we can then use to develop practical
search heuristics.
For any given document, each template matches a certain number of frag-
ments. A template that matches every possible fragment is useless, as is one
that matches no fragments at all. Somewhere between these two extremes of
generic and specific templates lie useful templates that match only the
interesting fragments, so the aim of template creation is to find a
suitable trade-off between the generic and the specific. We therefore suggest that
a useful ordering is one based on the number of fragments that a template is
likely to match. We can use such an order to search across a range of tem-
plates and explore the trade-off. For unseen text, it is impossible to predict
the amount of information to be extracted in advance, so instead, we develop
a heuristic ordering that approximates it.
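To make the idea of ordering by match count concrete, the following Python sketch ranks a few templates by the number of fragments they match in a small sample corpus. This is only an illustration on already-seen text; it is not the heuristic ordering developed in this paper, and the templates, corpus and function names are hypothetical. Template elements are modelled here as sets of words, with "?" as the wildcard.

```python
WILDCARD = "?"


def count_matches(template, corpus):
    """Count the fragments (contiguous windows) matched by the template."""
    n = len(template)
    count = 0
    for sentence in corpus:
        for s in range(len(sentence) - n + 1):
            window = sentence[s:s + n]
            if all(e == WILDCARD or w in e for e, w in zip(template, window)):
                count += 1
    return count


# Hypothetical sample corpus and templates, for illustration only.
sample = [
    ["the", "cat", "sat"],
    ["a", "mouse", "ran"],
    ["the", "dog", "sat"],
]
templates = {
    "generic": ["?", "?", "?"],                 # matches every fragment
    "any_any_sat": ["?", "?", {"sat"}],
    "the_cat_sat": [{"the"}, {"cat"}, {"sat"}],
    "no_match": [{"the"}, {"fish"}, {"swam"}],  # matches nothing
}

# Order from most generic (most matches) to most specific (fewest matches).
ranked = sorted(
    templates,
    key=lambda name: count_matches(templates[name], sample),
    reverse=True,
)
print(ranked)
```

The two ends of the ranking correspond to the two useless extremes described above: the all-wildcard template matches every fragment and the last template matches none.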
In this section, we define possible orderings of terms and templates. In
the next section, we define algorithms that use these orderings to modify