in practice, and which should approximate the “ideal” values above. We now
consider several options.
Suppose that we had a set of documents such that every fragment was labelled as either “interesting” or “not interesting”. Then we could use standard supervised learning algorithms to construct useful templates, and directly measure the number of true positives, false positives and so on to find an optimal template. This could then be used to find further information in the same field. However, while a small number of such labelled corpora do exist (e.g. [14]), they cover only a few very precisely defined application areas and so are of little general use, since they cannot aid IE in other domains. Annotating documents in this way is also very time consuming for a domain expert, and one aim of information extraction is to reduce the time and effort required to find relevant information.
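To make this fully supervised setting concrete, the following is a minimal Python sketch. It assumes fragments have already been hand-labelled, and uses an off-the-shelf scikit-learn text classifier as a stand-in for a learned template; the example fragments and the bag-of-words representation are illustrative choices, not part of the approach described here.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# Hypothetical hand-labelled fragments: 1 = "interesting", 0 = "not interesting".
fragments = ["protein X binds protein Y",
             "the meeting was rescheduled",
             "gene A regulates gene B",
             "the weather was pleasant"]
labels = [1, 0, 1, 0]

# Learn a classifier from the labelled fragments (standing in for a template).
X = CountVectorizer().fit_transform(fragments)
clf = LogisticRegression().fit(X, labels)

# With labelled data, true and false positives can be counted directly.
tn, fp, fn, tp = confusion_matrix(labels, clf.predict(X)).ravel()
print("true positives:", tp, "false positives:", fp)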
Suppose instead that we have one set of documents in which, for each sentence, the probability that it contains a relevant piece of information is above some threshold, and a second set of documents in which that probability is below the threshold. We could then treat this as a classification problem, albeit with noisy labels on the data. This approach has been used successfully [23]; however, such a set of irrelevant documents is hard to define, and even interesting documents are likely to contain irrelevant facts such as background information.
Suppose that, instead of having irrelevant (negative) documents, we have a set of “neutral” documents, each of which may or may not contain relevant information; that is, we have no prior knowledge about relevant information in neutral documents. We can then compare the proportion of information retrieved from neutral documents with that retrieved from positive documents in order to evaluate a template. We assume that a “good” template will retrieve more information from positive documents than from neutral documents, even if we do not know in advance which pieces of information are useful, or how much useful information exists in any particular document.
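As an illustration of this comparison, here is a minimal Python sketch. The function match below is a hypothetical stand-in for the extraction machinery (it simply tests whether a template string occurs in a sentence), and the scoring rule, the difference in per-document hit rates, is one plausible way of comparing the two proportions rather than the measure developed in the text.

def match(template, document):
    # Hypothetical stand-in for extraction: return the sentences of the
    # document in which the template string occurs.
    return {frag.strip() for frag in document.split(".") if template in frag}

def hit_rate(template, documents):
    # Fraction of documents from which the template extracts at least one fragment.
    if not documents:
        return 0.0
    return sum(bool(match(template, d)) for d in documents) / len(documents)

def template_score(template, positive_docs, neutral_docs):
    # A "good" template should retrieve noticeably more from positive
    # documents than from neutral documents.
    return hit_rate(template, positive_docs) - hit_rate(template, neutral_docs)

positive_docs = ["Gene A activates gene B. The assay was repeated.",
                 "Protein X binds protein Y. Results were significant."]
neutral_docs = ["The workshop was held in May. Attendance was high."]
print(template_score("gene", positive_docs, neutral_docs))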
Let $D$ be a corpus containing $|D|$ documents. We define the set of positive documents as $D^+$ and the set of neutral documents as $D^N$, such that $D = D^+ \cup D^N$ and $D^+ \cap D^N = \emptyset$. A “positive” document is one that the user believes is likely to contain information of interest. A “neutral” document is one where the user has no reason to believe that the document does or does not contain information of interest.
We now use these two sets of documents to define estimates of the numbers of true-positive fragments and false-positive fragments matched by a template $\tau$.

Definition 27. We define an estimated true-positive set $TP(\tau, D) = \mu(\tau, D^+)$, for a template $\tau$ and a set of positive documents $D^+ \subseteq D$.

Definition 28. We define an estimated false-positive set $FP(\tau, D) = \mu(\tau, D^N)$, for a template $\tau$ and a set of neutral documents $D^N \subseteq D$.
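Read operationally, Definitions 27 and 28 pool the fragments that $\tau$ matches in the positive and in the neutral documents respectively. A minimal sketch, assuming the matching function $\mu$ is supplied as a callable that returns the fragments matched in a single document:

from typing import Callable, Iterable, Set

def estimated_tp(mu: Callable[[str, str], Set[str]],
                 template: str,
                 positive_docs: Iterable[str]) -> Set[str]:
    # Definition 27: TP(tau, D) = mu(tau, D+), pooled over the positive documents.
    return {frag for doc in positive_docs for frag in mu(template, doc)}

def estimated_fp(mu: Callable[[str, str], Set[str]],
                 template: str,
                 neutral_docs: Iterable[str]) -> Set[str]:
    # Definition 28: FP(tau, D) = mu(tau, D^N), pooled over the neutral documents.
    return {frag for doc in neutral_docs for frag in mu(template, doc)}

Passing $\mu$ in as a parameter keeps these estimates independent of any particular template representation.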