in practice, and which should approximate the “ideal” values above. We now
consider several options.
Suppose that we had a set of documents such that every fragment was labelled as either “interesting” or “not interesting”. Then we could use standard supervised learning algorithms to construct useful templates, and directly measure the number of true positives, false positives and so on to find an optimal template. This could then be used to find further information in the same field. However, while a small number of such labelled corpora do exist (e.g. [14]), they cover only a few very precisely defined application areas and so are of little general use, since they cannot aid IE in other domains. Annotating documents in this way is also very time consuming for a domain expert, and one aim of information extraction is to reduce the time and effort required to find relevant information.
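To make this fully supervised setting concrete, the following is a minimal Python sketch. It assumes fragments have already been hand-labelled, and uses an off-the-shelf scikit-learn text classifier as a stand-in for a learned template; the example fragments and the bag-of-words representation are illustrative choices, not part of the approach described here.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# Hypothetical hand-labelled fragments: 1 = "interesting", 0 = "not interesting".
fragments = ["protein X binds protein Y",
             "the meeting was rescheduled",
             "gene A regulates gene B",
             "the weather was pleasant"]
labels = [1, 0, 1, 0]

# Learn a classifier from the labelled fragments (standing in for a template).
X = CountVectorizer().fit_transform(fragments)
clf = LogisticRegression().fit(X, labels)

# With labelled data, true and false positives can be counted directly.
tn, fp, fn, tp = confusion_matrix(labels, clf.predict(X)).ravel()
print("true positives:", tp, "false positives:", fp)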
Suppose instead that we have one set of documents in which, for each sentence, the probability that it contains a relevant piece of information is above some threshold, and a second set of documents in which that probability is below the threshold. We could then treat this as a classification problem, albeit with noisy labels on the data. This approach has been used successfully [23]; however, such a set of irrelevant documents is hard to define, and even interesting documents are likely to contain irrelevant facts such as background information.
Suppose that, instead of having irrelevant (negative) documents, we have a set of “neutral” documents, each of which may or may not contain relevant information; that is, we have no prior knowledge about relevant information in neutral documents. We can then compare the proportion of information retrieved from neutral documents with that retrieved from positive documents in order to evaluate a template. We assume that a “good” template will retrieve more information from positive documents than from neutral documents, even if we do not know in advance which pieces of information are useful, or how much useful information exists in any particular document.
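As an illustration of this comparison, here is a minimal Python sketch. The function match below is a hypothetical stand-in for the extraction machinery (it simply tests whether a template string occurs in a sentence), and the scoring rule, the difference in per-document hit rates, is one plausible way of comparing the two proportions rather than the measure developed in the text.

def match(template, document):
    # Hypothetical stand-in for extraction: return the sentences of the
    # document in which the template string occurs.
    return {frag.strip() for frag in document.split(".") if template in frag}

def hit_rate(template, documents):
    # Fraction of documents from which the template extracts at least one fragment.
    if not documents:
        return 0.0
    return sum(bool(match(template, d)) for d in documents) / len(documents)

def template_score(template, positive_docs, neutral_docs):
    # A "good" template should retrieve noticeably more from positive
    # documents than from neutral documents.
    return hit_rate(template, positive_docs) - hit_rate(template, neutral_docs)

positive_docs = ["Gene A activates gene B. The assay was repeated.",
                 "Protein X binds protein Y. Results were significant."]
neutral_docs = ["The workshop was held in May. Attendance was high."]
print(template_score("gene", positive_docs, neutral_docs))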
Let $D$ be a corpus containing $|D|$ documents. We define the set of positive documents as $D^+$ and the set of neutral documents as $D^N$, such that $D = D^+ \cup D^N$ and $D^+ \cap D^N = \emptyset$. A “positive” document is one that the user believes is likely to contain information of interest. A “neutral” document is one where the user has no reason to believe that the document does or does not contain information of interest.
We now use these two sets of documents to define estimates of the numbers of true-positive fragments and false-positive fragments matched by a template $\tau$.

Definition 27. We define an estimated true-positive set $TP(\tau, D) = \mu(\tau, D^+)$, for a template $\tau$ and a set of positive documents $D^+ \subseteq D$.

Definition 28. We define an estimated false-positive set $FP(\tau, D) = \mu(\tau, D^N)$, for a template $\tau$ and a set of neutral documents $D^N \subseteq D$.
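Read operationally, Definitions 27 and 28 pool the fragments that $\tau$ matches in the positive and in the neutral documents respectively. A minimal sketch, assuming the matching function $\mu$ is supplied as a callable that returns the fragments matched in a single document:

from typing import Callable, Iterable, Set

def estimated_tp(mu: Callable[[str, str], Set[str]],
                 template: str,
                 positive_docs: Iterable[str]) -> Set[str]:
    # Definition 27: TP(tau, D) = mu(tau, D+), pooled over the positive documents.
    return {frag for doc in positive_docs for frag in mu(template, doc)}

def estimated_fp(mu: Callable[[str, str], Set[str]],
                 template: str,
                 neutral_docs: Iterable[str]) -> Set[str]:
    # Definition 28: FP(tau, D) = mu(tau, D^N), pooled over the neutral documents.
    return {frag for doc in neutral_docs for frag in mu(template, doc)}

Passing $\mu$ in as a parameter keeps these estimates independent of any particular template representation.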