Databases Reference
In-Depth Information
When comparing two templates, we can say that one is certainly better
than the other if either: a) it matches more true positive fragments and no
more false positives; or b) it matches fewer false positive fragments and no
fewer true positives. Unfortunately, we cannot be certain of whether it is
better when c) it matches more true positives and more false positives. We
need some way of exploring this trade-off, an issue we return to later.
We can now generate a series of templates of varying generality and com-
pare them with each other in order to guide our search for useful templates.
We discuss this further in the following section.
6.2 Search Algorithms
As stated in the introduction, heuristic searching requires: definitions of can-
didate solutions; a means to generate and evaluate candidate solutions; and a
suitable search algorithm. We have now defined the candidate solutions (i.e.
templates, Definition 13) and means to generate (Sect. 4) and evaluate them
(using estimates of true positive and false positive scores given in Sect. 5,
Definitions 27-28). We now turn to the search algorithms themselves. First,
we define an exhaustive search over all possible templates, given a particular
seed phrase. We will then show that in general, this is computationally infea-
sible, owing to the combinatorial growth of the search space with respect to
the length of the template. We then introduce heuristics to make the search
feasible while still (we hope) producing good templates, and define and dis-
cuss a simple best-first search algorithm. We also discuss possible merits of
evolutionary algorithms.
In all cases, we assume that we are given a seed fragment f .Theroot
node of the search corresponds to a template consisting of a tuple of template
elements, each containing a single literal: τ root = initialise( f ) (Definition 18).
From this, we can modify each element in the template by a single application
of the generalise( τ,κ,x ) function (Definition 19). We can estimate the num-
ber of true positives and false positives for each of these new templates, and
then decide which template to explore and modify next. The exact number
of templates produced at each stage depends on the words themselves, be-
cause each generalise( τ,κ,x ) function will return 0, 1 or more templates (see
Definition 17 and the accompanying discussion). We must also decide when
to terminate the search, as we do not know a priori the quality of the best
possible template, as we are assuming that we do not have a fully labelled
corpus.
A simple exhaustive search method would be to start with a literal tem-
plate created from the seed phrase using the initialise function. For each ele-
ment in this template, we then apply the generalisation function, using each
category in turn, so that each application generates a new template. We need
some fixed order over the categories, but the actual order is not critical here 5 .
5 One example would be to use the order: literals, stems, gazetteers, parts-of-speech
and wildcards. This order reflects the typical size of the membership functions.
Search WWH ::




Custom Search