A Logical Framework for Template Creation and Information Extraction - Data Mining: Foundations and Practice

Databases Reference

In-Depth Information

positive and one false positive. By the definition of superset generalisation,

we know that every template derived from this second template will have at

least one true positive and one false positive. So if, at one stage of a search,

we only want to consider templates with no false positives, then we need not

consider any descendent of this template. The two descendents shown ( < the,

∗

> ) need not be evaluated explicitly therefore; they can be

annotated as having at least one false positive, although in the figure, the

fully evaluated scores are shown.

Now consider the third row of templates in Fig. 4, i.e. those created after

two generalisation functions. From the left of the graph, three of these have

two true positives ( < the, FELINE, VBD > , < the, FELINE,

,

∗

> and <

∗

,

∗

,

∗

> and < the,

ANIMAL, VBD > ) and one has only one true positive, so we focus the search

on the first three. These have zero, one and one false positives respectively,

making the first one look most promising: < the, FELINE, VBD > .Thisisthe

best unevaluated template so far. Several further generalisations are possible

from this template, including applying generalise( τ,κ Ω , 3) to form < the, FE-

LINE,

∗

> . But this has already been evaluated, so need not be considered

again. Applying generalise( τ,κ Π , 1) forms < DT, FELINE, VBD > which has

three true positives, and still no false positives. Given the two very small doc-

ument sets, this is an optimal template, so we stop the search here. In a more

realistic application, the search would continue until a stopping criterion was

reached, but would not find any superior template.

One modification to the algorithm would require more memory but should

lead to a faster convergence to a (possibly) superior solution. This is to expand

the unevaluated template set O by repeated applications of the generalise

function until it contains every template that could possibly be derived from

the seed phrase, or as many as is practically possible. We would then proceed

with a best-first search, but with the advantage that after each evaluation,

a large number of unevaluated templates will have the lower bounds of their

true- and false-positive estimates updated, because their ancestry would be

explicitly represented on the search graph. This should improve the selection

of the best template at each stage.

The algorithm above does not prune any part of the search graph, but

merely tends to search promising areas first. In practice, this is likely to lead

to very large memory requirements, which can be avoided through pruning.

Pruning search graphs is a lot more di cult than pruning trees because each

node can have multiple parents, and so after a node is pruned, it may reappear

as part of a different path. Although algorithms such as A* are inappropriate

here, 6 recent variations may provide useful pruning methods, such as SMA*

[24, p. 104] and Sweep A* [27].

∗

6 We have no notion of a path cost, and are only interested in the final solution

rather than the path to it.

Data Mining: Foundations and Practice

Search WWH ::

Custom Search

Home