Databases Reference
In-Depth Information
positive and one false positive. By the definition of superset generalisation,
we know that every template derived from this second template will have at
least one true positive and one false positive. So if, at one stage of a search,
we only want to consider templates with no false positives, then we need not
consider any descendent of this template. The two descendents shown ( < the,
> ) need not be evaluated explicitly therefore; they can be
annotated as having at least one false positive, although in the figure, the
fully evaluated scores are shown.
Now consider the third row of templates in Fig. 4, i.e. those created after
two generalisation functions. From the left of the graph, three of these have
two true positives ( < the, FELINE, VBD > , < the, FELINE,
,
> and <
,
,
> and < the,
ANIMAL, VBD > ) and one has only one true positive, so we focus the search
on the first three. These have zero, one and one false positives respectively,
making the first one look most promising: < the, FELINE, VBD > .Thisisthe
best unevaluated template so far. Several further generalisations are possible
from this template, including applying generalise( τ,κ , 3) to form < the, FE-
LINE,
> . But this has already been evaluated, so need not be considered
again. Applying generalise( τ,κ Π , 1) forms < DT, FELINE, VBD > which has
three true positives, and still no false positives. Given the two very small doc-
ument sets, this is an optimal template, so we stop the search here. In a more
realistic application, the search would continue until a stopping criterion was
reached, but would not find any superior template.
One modification to the algorithm would require more memory but should
lead to a faster convergence to a (possibly) superior solution. This is to expand
the unevaluated template set O by repeated applications of the generalise
function until it contains every template that could possibly be derived from
the seed phrase, or as many as is practically possible. We would then proceed
with a best-first search, but with the advantage that after each evaluation,
a large number of unevaluated templates will have the lower bounds of their
true- and false-positive estimates updated, because their ancestry would be
explicitly represented on the search graph. This should improve the selection
of the best template at each stage.
The algorithm above does not prune any part of the search graph, but
merely tends to search promising areas first. In practice, this is likely to lead
to very large memory requirements, which can be avoided through pruning.
Pruning search graphs is a lot more di cult than pruning trees because each
node can have multiple parents, and so after a node is pruned, it may reappear
as part of a different path. Although algorithms such as A* are inappropriate
here, 6 recent variations may provide useful pruning methods, such as SMA*
[24, p. 104] and Sweep A* [27].
6 We have no notion of a path cost, and are only interested in the final solution
rather than the path to it.
Search WWH ::




Custom Search