A Logical Framework for Template Creation and Information Extraction - Data Mining: Foundations and Practice

Databases Reference

In-Depth Information

1. begin

2. C ←∅ , O ←∅

3. initialise( f )= τ root → O

4. while not finished do

a) find estimated best template τ in O

b) evaluate τ

c) delete τ from O

d) add τ to C

e) expand τ by adding to O all templates that can be created by superset

generalisation of τ

f) update lower bounds on TP and FP for all templates in O

5. return best template from C .

6. end

Fig. 3. A best-first algorithm. C is the “closed” set of evaluated templates and O

is the “open” set of unevaluated templates. See Sect. 6.4 for further details

Figure 3 gives the algorithm. After initialisation, we take the best template

that has not yet been evaluated, and evaluate it, i.e. calculate its true posi-

tive and false positive scores. We then generalise it in every way possible to

create child templates, and update the lower bounds on the true and false

positive scores for these children. Because there are several ways that each

template could be created, some children will have more than one parent. In

these cases, the lower bounds are the maximum of the lower bounds of all

the parents.

By “best template” (Fig. 3, Steps 4a and 5), we mean select the template

with the highest lower-bound on the number of true positives, as inherited

from each template's ancestors. If more than one template has the same max-

imum true positive lower-bound, then we choose between them by selecting

the template with the smallest lower-bound on false positives. If this still se-

lects more than one template, we can either pick one randomly; use them all

successively; or use all the selected templates together in the subsequent steps

(i.e. evaluate and generalise more than one template in one pass through the

main loop).

In Step 4f we could choose to either update the bounds of just the descen-

dents of τ in O , or else update the bounds of the descendents of every template

in C . The latter would be more computationally expensive, but would lead to

better estimates of the lower bounds. For each new template created, it would

require a search through C to find all the other possible ancestors, besides τ ,

in order to calculate the new lower bounds.

Finally, we terminate the search when some pre-specified criterion is sat-

isfied. Possible criteria include terminating: when O is empty (in which case

every possible template has been evaluated); or when a limit on the number of

evaluations is reached (e.g. stop after 1,000 templates have been evaluated);

or when a certain number of true positives (or false positives) are matched by

the current best template. It would be quite possible for the user to stop the

search and consider the current best template, before re-starting the search if

necessary.

Data Mining: Foundations and Practice

Search WWH ::

Custom Search

Home