These are very crude estimates, as they assume that every fragment matched
by $\tau$ in $D^{+}$ contains information of interest, and that every fragment
matched by $\tau$ in $D^{N}$ contains no information of interest. Thus they
cannot be used to estimate the precision or recall scores⁴, but they are
sufficient to guide the search for useful templates. The estimates are
unreliable, especially at the start of the search process, but as a successful
search progresses and the quality of the template improves, they become more
accurate.
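
As a rough illustration of how such estimates could be computed, the sketch below counts matches in the positive and neutral corpora. The representation is an assumption made only for this example: a fragment is a tuple of tokens, a template is a tuple of sets of allowed tokens, and the names `matches`, `mu` and `estimate_tp_fp` are hypothetical. It also assumes that the estimates are simply the match counts, which is what the two assumptions stated above would imply.

```python
from typing import FrozenSet, List, Set, Tuple

# Toy representation (an assumption for this sketch, not the book's):
# a fragment is a tuple of tokens, a template is a tuple of allowed-token sets.
Fragment = Tuple[str, ...]
Template = Tuple[FrozenSet[str], ...]


def matches(template: Template, fragment: Fragment) -> bool:
    """A template matches when every token lies in the corresponding slot."""
    return len(template) == len(fragment) and all(
        token in slot for slot, token in zip(template, fragment)
    )


def mu(template: Template, corpus: List[Fragment]) -> Set[Fragment]:
    """Set of fragments in the corpus matched by the template."""
    return {fragment for fragment in corpus if matches(template, fragment)}


def estimate_tp_fp(
    template: Template, d_pos: List[Fragment], d_neu: List[Fragment]
) -> Tuple[int, int]:
    """Crude estimates: every match in d_pos is counted as a true positive,
    every match in d_neu as a false positive."""
    return len(mu(template, d_pos)), len(mu(template, d_neu))


d_pos = [
    ("protein", "binds", "dna"),
    ("enzyme", "binds", "rna"),
    ("protein", "cleaves", "dna"),
]
d_neu = [("protein", "binds", "water"), ("author", "binds", "book")]
tau = (
    frozenset({"protein", "enzyme"}),
    frozenset({"binds"}),
    frozenset({"dna", "rna", "water"}),
)

est_tp, est_fp = estimate_tp_fp(tau, d_pos, d_neu)
print(est_tp, est_fp)
```

The counts say nothing about whether a matched fragment in the positive corpus is really of interest, which is why they can guide a search but not stand in for precision or recall.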
Let us consider some other properties of templates generated using superset
generalisation. Suppose we have two templates, $\tau_1$ and $\tau_2$, and that
$\tau_1 >_s \tau_2$. Then by definition,

$$|\mu(\tau_1, D)| < |\mu(\tau_2, D)|.$$

If $\tau_2$ is a generalisation of $\tau_1$ derived using superset
generalisation, then $\mu(\tau_1, D) \subset \mu(\tau_2, D)$. This relative
specificity relation holds for any set of documents, so if one template
matches fewer fragments than another in one corpus, then it will in any other
corpus as well. Thus given two corpora $D_a$ and $D_b$:

$$|\mu(\tau_1, D_a)| < |\mu(\tau_2, D_a)| \iff |\mu(\tau_1, D_b)| < |\mu(\tau_2, D_b)|.$$
This property will be useful in developing heuristic search methods, because
it allows us to rationally prune search graphs, as we discuss in Sect. 6.2.
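
The containment property can be checked on a toy example. The sketch below is not the book's implementation: it assumes a template is a tuple of sets of allowed tokens and that superset generalisation replaces one of those sets with a superset. The helper names `matches`, `mu` and `generalise` are hypothetical.

```python
from typing import FrozenSet, Iterable, List, Set, Tuple

# Toy representation (assumed for illustration only).
Fragment = Tuple[str, ...]
Template = Tuple[FrozenSet[str], ...]


def matches(template: Template, fragment: Fragment) -> bool:
    """A template matches when every token lies in the corresponding slot."""
    return len(template) == len(fragment) and all(
        token in slot for slot, token in zip(template, fragment)
    )


def mu(template: Template, corpus: List[Fragment]) -> Set[Fragment]:
    """Set of fragments in the corpus matched by the template."""
    return {fragment for fragment in corpus if matches(template, fragment)}


def generalise(template: Template, slot: int, extra: Iterable[str]) -> Template:
    """Widen one slot to a superset of its allowed tokens."""
    widened = template[slot] | frozenset(extra)
    return template[:slot] + (widened,) + template[slot + 1:]


tau1 = (frozenset({"protein"}), frozenset({"binds"}), frozenset({"dna"}))
tau2 = generalise(tau1, 0, {"enzyme"})  # tau2 is more general than tau1

d_a = [
    ("protein", "binds", "dna"),
    ("enzyme", "binds", "dna"),
    ("enzyme", "cuts", "dna"),
]
d_b = [("enzyme", "binds", "dna"), ("lipid", "binds", "dna")]

# Whatever the corpus, everything tau1 matches is also matched by tau2.
subset_holds = all(mu(tau1, d) <= mu(tau2, d) for d in (d_a, d_b))
print(subset_holds)
```

In this toy model the containment follows directly from the construction: widening a slot can only add fragments to the matched set, never remove them, so the result does not depend on which corpus is used.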
We defined terms such as “true positive” and “false positive” above. Now
we can say that if template $\tau_1$ is at least as specific as template
$\tau_2$, then the number of true positives returned by $\tau_1$ is no more
than the number returned by $\tau_2$, and similarly for the other scores.
That is, if $\tau_1 \geq_s \tau_2$ then:

$$|\mathrm{TP}(\tau_1, D)| \leq |\mathrm{TP}(\tau_2, D)|.$$

$$|\mathrm{FP}(\tau_1, D)| \leq |\mathrm{FP}(\tau_2, D)|.$$

$$|\mathrm{TN}(\tau_1, D)| \geq |\mathrm{TN}(\tau_2, D)|.$$

$$|\mathrm{FN}(\tau_1, D)| \geq |\mathrm{FN}(\tau_2, D)|.$$

The equivalent inequalities also hold for the estimates defined above:

$$|\mathrm{TP}(\tau_1, D)| \leq |\mathrm{TP}(\tau_2, D)|.$$

$$|\mathrm{FP}(\tau_1, D)| \leq |\mathrm{FP}(\tau_2, D)|.$$

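
A small worked example may make the direction of each inequality easier to see. The sketch below assumes a toy corpus in which the interesting fragments are known, and the same hypothetical representation of templates as tuples of allowed-token sets. The function name `confusion` is an assumption for this example.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

# Toy representation (assumed for illustration only).
Fragment = Tuple[str, ...]
Template = Tuple[FrozenSet[str], ...]


def matches(template: Template, fragment: Fragment) -> bool:
    """A template matches when every token lies in the corresponding slot."""
    return len(template) == len(fragment) and all(
        token in slot for slot, token in zip(template, fragment)
    )


def confusion(
    template: Template, corpus: List[Fragment], interesting: Set[Fragment]
) -> Dict[str, int]:
    """Count TP, FP, TN and FN for a template against known labels."""
    matched = {f for f in corpus if matches(template, f)}
    everything = set(corpus)
    return {
        "TP": len(matched & interesting),
        "FP": len(matched - interesting),
        "FN": len(interesting - matched),
        "TN": len(everything - matched - interesting),
    }


d = [
    ("protein", "binds", "dna"),
    ("enzyme", "binds", "dna"),
    ("enzyme", "binds", "water"),
    ("author", "writes", "book"),
]
interesting = {d[0], d[1]}

# tau1 is the more specific template; every slot of tau2 is a superset.
tau1 = (frozenset({"protein"}), frozenset({"binds"}), frozenset({"dna"}))
tau2 = (
    frozenset({"protein", "enzyme"}),
    frozenset({"binds"}),
    frozenset({"dna", "water"}),
)

c1 = confusion(tau1, d, interesting)
c2 = confusion(tau2, d, interesting)
print(c1)
print(c2)
```

Because the more specific template matches a subset of what the general one matches, its positives (true and false) can only shrink, and its negatives (true and false) can only grow.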
As these relationships hold for any set of documents $D$, we can predict some
properties of templates without fully evaluating them. We can use these
properties to guide heuristic searches.
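
One way these inequalities could guide a search is sketched below. This is only a guess at the kind of pruning they permit, not the method of Sect. 6.2. It assumes a general-to-specific search over the toy template representation used for illustration (tuples of allowed-token sets), a hypothetical `specialisations` operator that removes one token from one slot, and a hypothetical threshold `min_tp` on the estimated true positives. If a template falls below the threshold, every specialisation of it must too, so that whole branch can be skipped.

```python
from typing import FrozenSet, Iterator, List, Tuple

# Toy representation (assumed for illustration only).
Fragment = Tuple[str, ...]
Template = Tuple[FrozenSet[str], ...]


def matches(template: Template, fragment: Fragment) -> bool:
    """A template matches when every token lies in the corresponding slot."""
    return len(template) == len(fragment) and all(
        token in slot for slot, token in zip(template, fragment)
    )


def count(template: Template, corpus: List[Fragment]) -> int:
    """Number of fragments in the corpus matched by the template."""
    return sum(1 for fragment in corpus if matches(template, fragment))


def specialisations(template: Template) -> Iterator[Template]:
    """Yield templates made by dropping one token from one slot."""
    for i, slot in enumerate(template):
        if len(slot) > 1:  # never leave a slot empty
            for token in sorted(slot):
                yield template[:i] + (slot - {token},) + template[i + 1:]


def pruned_search(
    root: Template, d_pos: List[Fragment], min_tp: int
) -> Tuple[List[Template], int]:
    """Return the templates meeting min_tp and how many were evaluated."""
    kept: List[Template] = []
    seen = {root}
    stack = [root]
    evaluated = 0
    while stack:
        template = stack.pop()
        evaluated += 1
        if count(template, d_pos) < min_tp:
            continue  # prune: specialisations cannot match more fragments
        kept.append(template)
        for child in specialisations(template):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return kept, evaluated


d_pos = [
    ("protein", "binds", "dna"),
    ("protein", "binds", "rna"),
    ("enzyme", "binds", "dna"),
]
root = (
    frozenset({"protein", "enzyme"}),
    frozenset({"binds"}),
    frozenset({"dna", "rna"}),
)

kept, evaluated = pruned_search(root, d_pos, min_tp=2)
print(len(kept), evaluated)
```

In this example the full space below `root` holds nine templates, and the search evaluates fewer than that because it never expands a template that has already failed the threshold.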
If our assumptions here are correct, then the probability of finding
an interesting fragment is higher in positive documents than in neutral
documents. We can write this assumption as

$$p\big(f \in I(D) \mid f \in \mu(\tau, D^{+})\big) > p\big(f \in I(D) \mid f \in \mu(\tau, D^{N})\big).$$

⁴ Substitution into definitions 25 and 26 gives recall = precision = 1 for
every template, which is clearly optimistic.