A Logical Framework for Template Creation and Information Extraction - Data Mining: Foundations and Practice


7 Discussion and Extensions

We now briefly discuss a few of the possible extensions to the framework and

its implementation.

We could introduce other wildcards, such as a wildcard which matches an

entire phrase, which could itself be defined as a series of terms, much like a

template. This would allow optional subclauses, such as subordinate clauses,

to be matched. Let ?τ1 designate an optional wildcard that matches a sequence
of literals defined by template τ1 or matches nothing at all. Then if
τ1 = <{which}, {*}, {*}>, the template τ2 = <{DT}, {ANIMAL}, {?τ1}, {sat}>
would match the fragment <the, cat, which, was, black, sat> as well as
<the, cat, sat>.
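The optional wildcard can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `LEXICON`, `OptionalTemplate`, `element_matches` and `matches` are hypothetical, the category memberships are invented for the example, and the optional wildcard is modelled as a template element in its own right rather than as an attribute inside a set.

```python
from dataclasses import dataclass

# Hypothetical lexicon (an assumption): category name -> words in that category.
LEXICON = {"DT": {"the", "a"}, "ANIMAL": {"cat", "dog"}}


@dataclass
class OptionalTemplate:
    """Optional wildcard: matches its sub-template or nothing at all."""
    template: tuple


def element_matches(element, word):
    """An element is a set of attributes; '*' matches any single word."""
    for attr in element:
        if attr == "*" or attr == word or word in LEXICON.get(attr, ()):
            return True
    return False


def matches(template, fragment):
    """Return True if the template matches the whole fragment."""
    if not template:
        return not fragment
    head, rest = template[0], template[1:]
    if isinstance(head, OptionalTemplate):
        # Either skip the optional part, or expand the sub-template in place.
        return matches(rest, fragment) or matches(head.template + rest, fragment)
    return (bool(fragment)
            and element_matches(head, fragment[0])
            and matches(rest, fragment[1:]))


tau1 = ({"which"}, {"*"}, {"*"})
tau2 = ({"DT"}, {"ANIMAL"}, OptionalTemplate(tau1), {"sat"})

print(matches(tau2, ("the", "cat", "which", "was", "black", "sat")))  # True
print(matches(tau2, ("the", "cat", "sat")))                           # True
print(matches(tau2, ("the", "cat", "which", "sat")))                  # False
```

Expanding the sub-template in place keeps the matcher a simple recursion; the third call fails because the sub-template needs exactly three words once it is used.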

An additional approach to finding good templates is to repeatedly merge

useful templates to produce more general templates [18]. Our framework could

easily be extended to allow this, by ensuring that the product of merging two

templates matches every fragment that either template matches. This could

be achieved by considering each pair of template elements in turn, and either

performing a simple set union if they belong to the same category, or else

generalising them both to the same category before such a union. In either

case, the new template would match the union of the true-positives matched

by the two parents, and the union of the false-positives, allowing the lower
bounds on each to be calculated.
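The element-wise merge described above might be sketched as follows. This is an illustration under stated assumptions, not the procedure of [18]: the three-level ordering `LEVELS`, the toy lookup `POS_OF`, and the functions `generalise`, `merge_elements` and `merge_templates` are all hypothetical, and both templates are assumed to have the same length.

```python
# Hypothetical ordering of element categories, most specific first.
LEVELS = ["literal", "pos", "wildcard"]

# Toy part-of-speech lookup, assumed purely for illustration.
POS_OF = {"the": "DT", "a": "DT", "cat": "NN", "dog": "NN",
          "sat": "VBD", "slept": "VBD"}


def generalise(element, target_kind):
    """Lift an element (kind, values) up to the more general target_kind."""
    kind, values = element
    values = frozenset(values)
    while LEVELS.index(kind) < LEVELS.index(target_kind):
        if kind == "literal":
            kind, values = "pos", frozenset(POS_OF[v] for v in values)
        else:  # pos -> wildcard
            kind, values = "wildcard", frozenset({"*"})
    return kind, values


def merge_elements(a, b):
    """Union two elements, first generalising both to the same category."""
    target = max(a[0], b[0], key=LEVELS.index)
    _, values_a = generalise(a, target)
    _, values_b = generalise(b, target)
    return target, values_a | values_b


def merge_templates(t1, t2):
    """Merge two equal-length templates element by element."""
    if len(t1) != len(t2):
        raise ValueError("this sketch only merges templates of equal length")
    return [merge_elements(a, b) for a, b in zip(t1, t2)]


t1 = [("literal", {"the"}), ("literal", {"cat"}), ("literal", {"sat"})]
t2 = [("literal", {"a"}), ("pos", {"NN"}), ("literal", {"slept"})]
merged = merge_templates(t1, t2)
```

Here the first and third elements share a category and are simply unioned, while the literal `cat` is generalised to its assumed tag `NN` before the union, so the merged template still accepts every fragment either parent accepts.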

So far, we have considered templates that exist in isolation, whereas in

practical systems, it is more common to apply a set of templates together.

Our framework can be extended to include this by using a sequential covering

algorithm. Suppose we have a template τ that matches some true positives
and some false positives. We could reduce the number of false positives by
creating a second template τ′ that is optimised to match just the false positive
fragments matched by τ. This could be achieved by defining two new versions
of D+ and DN based on the fragments matched by τ, and using these to
guide the search for τ′. We could then apply τ and τ′ together, predicting
interesting fragments as µ(τ, D) \ µ(τ′, D) (i.e. fragments matched by τ but
not by τ′). In many practical applications, more than one template will be

applied to a set of documents, each designed to match a different piece of

information, or a different way of expressing that information.
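The combined prediction in the sequential covering step is a set difference, which can be sketched in a few lines of Python. The sketch makes simplifying assumptions: templates are stood in for by plain predicates, fragments are tuples of words, and the names `mu`, `predict`, `tau` and `tau_prime` as well as the example fragments are hypothetical.

```python
def mu(template, fragments):
    """Return the set of fragments in D that the template matches."""
    return {f for f in fragments if template(f)}


def predict(tau, tau_prime, fragments):
    """Fragments matched by tau but not by the corrective template tau_prime."""
    return mu(tau, fragments) - mu(tau_prime, fragments)


# Stand-in templates (assumptions): simple predicates over word tuples.
def tau(fragment):
    return "binds" in fragment


def tau_prime(fragment):
    return "not" in fragment


D = {
    ("A", "binds", "B"),
    ("A", "does", "not", "binds", "B"),
    ("A", "is", "B"),
}

print(predict(tau, tau_prime, D))  # {('A', 'binds', 'B')}
```

The second template only removes fragments, so the combined prediction can never contain more false positives than τ alone, although it may also discard true positives that τ′ matches by mistake.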

We have assumed that we do not have a set of annotated examples, i.e.
fragments known in advance to be positive or negative. Creating and annotating
large sets of examples is extremely time-consuming for a user, although
giving a yes/no response to automatic annotations is simpler [20]. One
enhancement to our system would therefore be to start with the estimates of
true positives and false positives as outlined above, search for a good
template, and then use this template to annotate a number of fragments and
present these to the user. The user then marks each fragment as interesting
or not interesting, and this could then be used to improve the quality of
