A Logical Framework for Template Creation and Information Extraction - Data Mining: Foundations and Practice

These are very crude estimates, as they assume that every fragment matched by $\tau$ in $D^{+}$ contains information of interest, and that every fragment matched by $\tau$ in $D^{N}$ contains no information of interest. Thus they cannot be used to estimate the precision or recall scores⁴, but they are sufficient to guide the search for useful templates. The estimates are unreliable, especially at the start of the search process, but as a successful search progresses and the quality of the template improves, they become more accurate.
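The following is a minimal sketch of how such crude estimates could be computed. It assumes a toy template model that is not taken from the text: a template is a tuple of word sets ("slots"), and it matches a window of consecutive words when each word belongs to the slot at the same position. The names `mu`, `estimate_tp_fp`, `d_pos` and `d_neu` are illustrative only.

```python
from typing import FrozenSet, List, Set, Tuple

# Hypothetical toy model (an assumption, not the book's definition):
# a template is a tuple of word sets, one set per position.
Template = Tuple[FrozenSet[str], ...]
Match = Tuple[int, int]  # (document index, start position of the fragment)


def mu(template: Template, documents: List[str]) -> Set[Match]:
    """Return the fragments (as positions) that the template matches."""
    n = len(template)
    found: Set[Match] = set()
    for doc_index, doc in enumerate(documents):
        words = doc.split()
        for start in range(len(words) - n + 1):
            window = words[start:start + n]
            if all(word in slot for slot, word in zip(template, window)):
                found.add((doc_index, start))
    return found


def estimate_tp_fp(template: Template,
                   d_pos: List[str],
                   d_neu: List[str]) -> Tuple[int, int]:
    """Crude estimates: every match in the positive documents counts as a
    true positive, every match in the neutral documents as a false positive."""
    return len(mu(template, d_pos)), len(mu(template, d_neu))


d_pos = ["acme acquired widgetco yesterday", "globex bought initech"]
d_neu = ["she bought groceries", "acme reported earnings"]
template = (frozenset({"acquired", "bought"}),
            frozenset({"widgetco", "initech", "groceries"}))

est_tp, est_fp = estimate_tp_fp(template, d_pos, d_neu)
print(est_tp, est_fp)  # 2 1
```

In this toy data the match "bought groceries" in the neutral documents is counted as a false positive, which is the kind of signal the estimates are meant to give during a search.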

Let us consider some other properties of templates generated using superset generalisation. Suppose we have two templates, $\tau_1$ and $\tau_2$, and that $\tau_1 >_s \tau_2$. Then by definition, $|\mu(\tau_1, D)| < |\mu(\tau_2, D)|$. If $\tau_2$ is a generalisation of $\tau_1$ derived using superset generalisation, then $\mu(\tau_1, D) \subset \mu(\tau_2, D)$. This relative specificity relation holds for any set of documents, so if one template matches fewer fragments than another in one corpus, then it will in any other corpus as well. Thus given two corpora $D_a$ and $D_b$:

$$|\mu(\tau_1, D_a)| < |\mu(\tau_2, D_a)| \iff |\mu(\tau_1, D_b)| < |\mu(\tau_2, D_b)|$$

This property will be useful in developing heuristic search methods, because it allows us to rationally prune search graphs, as we discuss in Sect. 6.2.
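As an illustration of why the relation carries over from one corpus to another, the sketch below uses a hypothetical toy model (templates as tuples of word sets, which is an assumption and not the book's definition). Widening each slot to a superset can only add matches, so the inclusion holds in every corpus tried. In this toy model the inclusion is in general non-strict.

```python
from typing import FrozenSet, List, Set, Tuple

Template = Tuple[FrozenSet[str], ...]  # hypothetical: one word set per position


def mu(template: Template, documents: List[str]) -> Set[Tuple[int, int]]:
    """Positions (document index, start) of the fragments the template matches."""
    n = len(template)
    found = set()
    for doc_index, doc in enumerate(documents):
        words = doc.split()
        for start in range(len(words) - n + 1):
            if all(w in slot for slot, w in zip(template, words[start:start + n])):
                found.add((doc_index, start))
    return found


specific = (frozenset({"acquired"}), frozenset({"widgetco"}))
# Superset generalisation in this toy model: every slot is widened to a superset.
general = (frozenset({"acquired", "bought"}), frozenset({"widgetco", "initech"}))
assert all(s <= g for s, g in zip(specific, general))

corpus_a = ["acme acquired widgetco", "globex bought initech"]
corpus_b = ["she bought initech shares", "he acquired widgetco stock",
            "they bought lunch"]

for corpus in (corpus_a, corpus_b):
    # Whatever the corpus, the specific template's matches are contained
    # in the general template's matches.
    assert mu(specific, corpus) <= mu(general, corpus)
    print(len(mu(specific, corpus)), len(mu(general, corpus)))
```

Both corpora print `1 2`: the ordering of the two templates by number of matches is the same in each.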

We defined terms such as “true positive” and “false positive” above. Now we can say that if template $\tau_1$ is at least as specific as template $\tau_2$, then the number of true positives returned by $\tau_1$ is no more than the number returned by $\tau_2$, and similarly for the other scores. That is, if $\tau_1 \geq_s \tau_2$ then:

$$
\begin{aligned}
|\mathrm{TP}(\tau_1, D)| &\leq |\mathrm{TP}(\tau_2, D)| \\
|\mathrm{FP}(\tau_1, D)| &\leq |\mathrm{FP}(\tau_2, D)| \\
|\mathrm{TN}(\tau_1, D)| &\geq |\mathrm{TN}(\tau_2, D)| \\
|\mathrm{FN}(\tau_1, D)| &\geq |\mathrm{FN}(\tau_2, D)|
\end{aligned}
$$
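A small numerical check can make these four inequalities concrete. The sketch below assumes the usual set-based reading of the four scores (true positives are matched fragments that are interesting, and so on); the definitions themselves lie outside this section, so this reading, the function `confusion`, and the example sets are all illustrative assumptions. Fragments are represented simply as integers.

```python
from typing import Dict, Set


def confusion(matched: Set[int], interesting: Set[int],
              universe: Set[int]) -> Dict[str, int]:
    """Count TP, FP, TN and FN for a set of matched fragments, assuming the
    standard set-based definitions (an assumption, not quoted from the text)."""
    return {
        "TP": len(matched & interesting),
        "FP": len(matched - interesting),
        "TN": len(universe - matched - interesting),
        "FN": len(interesting - matched),
    }


universe = set(range(10))       # all fragments in the corpus
interesting = {0, 1, 2, 3}      # fragments containing information of interest
matched_specific = {0, 1, 4}    # matches of the more specific template
matched_general = {0, 1, 2, 4, 5}  # matches of its generalisation
assert matched_specific <= matched_general

specific = confusion(matched_specific, interesting, universe)
general = confusion(matched_general, interesting, universe)
print(specific)  # {'TP': 2, 'FP': 1, 'TN': 5, 'FN': 2}
print(general)   # {'TP': 3, 'FP': 2, 'TN': 4, 'FN': 1}
```

Because the specific template's matches are a subset of the general template's matches, generalising can only move fragments from FN to TP or from TN to FP, which is what the counts show.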

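The equivalent inequalities also hold for the estimates defined above (written here with a hat):

$$
\begin{aligned}
|\widehat{\mathrm{TP}}(\tau_1, D)| &\leq |\widehat{\mathrm{TP}}(\tau_2, D)| \\
|\widehat{\mathrm{FP}}(\tau_1, D)| &\leq |\widehat{\mathrm{FP}}(\tau_2, D)|
\end{aligned}
$$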

As these relationships hold for any set of documents $D$, we can predict some properties of templates without fully evaluating them. We can use these properties to guide heuristic searches.
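One way such a property could guide a search is sketched below. This is not the method of Sect. 6.2; it is a hypothetical breadth-first search over a toy template model (tuples of word sets) in which a template is generalised by adding one word to one slot. Since a generalisation can never match fewer fragments in the neutral documents, any template whose estimated false positive count exceeds a budget `max_fp` can be discarded together with all of its generalisations. All names here are illustrative.

```python
from collections import deque
from typing import FrozenSet, Iterator, List, Set, Tuple

Template = Tuple[FrozenSet[str], ...]  # hypothetical: one word set per position


def mu(template: Template, documents: List[str]) -> Set[Tuple[int, int]]:
    """Positions (document index, start) of the fragments the template matches."""
    n = len(template)
    found = set()
    for doc_index, doc in enumerate(documents):
        words = doc.split()
        for start in range(len(words) - n + 1):
            if all(w in slot for slot, w in zip(template, words[start:start + n])):
                found.add((doc_index, start))
    return found


def generalisations(template: Template, vocabulary: Set[str]) -> Iterator[Template]:
    """Yield every template obtained by adding one word to one slot."""
    for i, slot in enumerate(template):
        for word in sorted(vocabulary - slot):
            yield template[:i] + (slot | {word},) + template[i + 1:]


def search(seed: Template, d_pos: List[str], d_neu: List[str],
           vocabulary: Set[str], max_fp: int) -> Tuple[Template, int]:
    """Return the template with the highest estimated TP count found,
    never expanding a template whose estimated FP count exceeds max_fp."""
    best, best_tp = seed, len(mu(seed, d_pos))
    queue, seen = deque([seed]), {seed}
    while queue:
        current = queue.popleft()
        for candidate in generalisations(current, vocabulary):
            if candidate in seen:
                continue
            seen.add(candidate)
            if len(mu(candidate, d_neu)) > max_fp:
                # Prune: every generalisation of this candidate matches at
                # least as many neutral fragments, so none can be within budget.
                continue
            tp = len(mu(candidate, d_pos))
            if tp > best_tp:
                best, best_tp = candidate, tp
            queue.append(candidate)
    return best, best_tp


d_pos = ["acme acquired widgetco", "globex bought initech", "acme bought initech"]
d_neu = ["she bought groceries", "he acquired skills"]
vocabulary = {"acquired", "bought", "widgetco", "initech", "groceries", "skills"}
seed = (frozenset({"acquired"}), frozenset({"widgetco"}))

best, best_tp = search(seed, d_pos, d_neu, vocabulary, max_fp=0)
print(best_tp)  # 3
```

The pruning step never evaluates a discarded template's generalisations on the positive documents, which is the saving the monotonicity property makes possible.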

If our assumptions here are correct, then the probability of finding an interesting fragment is higher in positive documents than in neutral documents. We can write this assumption as

$$p\bigl(f \in I(D) \mid f \in \mu(\tau, D^{+})\bigr) > p\bigl(f \in I(D) \mid f \in \mu(\tau, D^{N})\bigr).$$

⁴ Substitution into definitions 25 and 26 gives recall ≡ precision ≡ 1 for every template, which is clearly optimistic.
