A Logical Framework for Template Creation and Information Extraction - Data Mining: Foundations and Practice

Databases Reference

In-Depth Information

When comparing two templates, we can say that one is certainly better

than the other if either: a) it matches more true positive fragments and no

more false positives; or b) it matches fewer false positive fragments and no

fewer true positives. Unfortunately, we cannot be certain of whether it is

better when c) it matches more true positives and more false positives. We

need some way of exploring this trade-off, an issue we return to later.

We can now generate a series of templates of varying generality and com-

pare them with each other in order to guide our search for useful templates.

We discuss this further in the following section.

6.2 Search Algorithms

As stated in the introduction, heuristic searching requires: definitions of can-

didate solutions; a means to generate and evaluate candidate solutions; and a

suitable search algorithm. We have now defined the candidate solutions (i.e.

templates, Definition 13) and means to generate (Sect. 4) and evaluate them

(using estimates of true positive and false positive scores given in Sect. 5,

Definitions 27-28). We now turn to the search algorithms themselves. First,

we define an exhaustive search over all possible templates, given a particular

seed phrase. We will then show that in general, this is computationally infea-

sible, owing to the combinatorial growth of the search space with respect to

the length of the template. We then introduce heuristics to make the search

feasible while still (we hope) producing good templates, and define and dis-

cuss a simple best-first search algorithm. We also discuss possible merits of

evolutionary algorithms.

In all cases, we assume that we are given a seed fragment f .Theroot

node of the search corresponds to a template consisting of a tuple of template

elements, each containing a single literal: τ root = initialise( f ) (Definition 18).

From this, we can modify each element in the template by a single application

of the generalise( τ,κ,x ) function (Definition 19). We can estimate the num-

ber of true positives and false positives for each of these new templates, and

then decide which template to explore and modify next. The exact number

of templates produced at each stage depends on the words themselves, be-

cause each generalise( τ,κ,x ) function will return 0, 1 or more templates (see

Definition 17 and the accompanying discussion). We must also decide when

to terminate the search, as we do not know a priori the quality of the best

possible template, as we are assuming that we do not have a fully labelled

corpus.

A simple exhaustive search method would be to start with a literal tem-

plate created from the seed phrase using the initialise function. For each ele-

ment in this template, we then apply the generalisation function, using each

category in turn, so that each application generates a new template. We need

some fixed order over the categories, but the actual order is not critical here 5 .

5 One example would be to use the order: literals, stems, gazetteers, parts-of-speech

and wildcards. This order reflects the typical size of the membership functions.

Data Mining: Foundations and Practice

Search WWH ::

Custom Search

Home