Automatic Evaluation of Ontologies - Natural Language Processing and Text Mining

Thus, if $h$ is small, $\tanh(\beta h)$ is close to 0, whereas for a large $h$ it becomes close to 1. It is reasonable to treat the case when the two concepts are the same, i.e., when $U_i = U_j$ and thus $l = 0$, as a special case, and define $\delta(0, h) = 1$ in that case, to prevent $\delta_U(U_i, U_i)$ from being dependent on the depth of the concept $U_i$.
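
The special case can be made concrete with a minimal sketch. It assumes that eq. (11.3), which is not reproduced in this excerpt, has the form $\delta(l, h) = e^{-\alpha l} \tanh(\beta h)$; the function name `delta` and the default values of `alpha` and `beta` are illustrative choices, not values from the text.

```python
import math


def delta(l: int, h: int, alpha: float = 0.3, beta: float = 0.2) -> float:
    """Similarity of two concepts, given the path length l between them
    and the depth h of their deepest common ancestor (root at depth 0).

    Assumption: eq. (11.3) has the form exp(-alpha * l) * tanh(beta * h).
    The default alpha and beta are hypothetical, for illustration only.
    """
    if l == 0:
        # Same concept: fix the value at 1 so that it does not depend
        # on how deep the concept lies in the ontology.
        return 1.0
    return math.exp(-alpha * l) * math.tanh(beta * h)


# Without the special case, delta(0, h) would equal tanh(beta * h),
# which varies with the depth h of the concept itself.
same_shallow = delta(0, 1)
same_deep = delta(0, 10)
different = delta(2, 3)
```

With the special case in place, `same_shallow` and `same_deep` are both 1, whereas `different` lies strictly between 0 and 1.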

Incidentally, if we set $\alpha$ to 0 (or close to 0) and $\beta$ to some large value, $\delta(l, h)$ will be approximately 0 for $h = 0$ and approximately 1 for $h > 0$. Thus, in the sum used to define the OntoRand index (11.1), each pair of instances contributes the value of 1 if they have some common ancestor besides the root in one ontology but not in the other; otherwise it contributes the value of 0. Thus, the OntoRand index becomes equivalent to the ordinary Rand index computed over the partitions of instances implied by the second-level concepts of the two ontologies (i.e., the immediate subconcepts of the root concept). This can be taken as a warning that $\alpha$ should not be too small and $\beta$ not too large; otherwise the OntoRand index will ignore the structure of the lower levels of the ontologies.
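
This limiting behaviour can be checked on a toy example. The sketch below rests on two assumptions, since neither equation appears in this excerpt: eq. (11.3) is taken to be $e^{-\alpha l} \tanh(\beta h)$, and eq. (11.1) is taken to be 1 minus the average of $|\delta_U - \delta_V|$ over all instance pairs. The two small ontologies, the instance names, and all function names are made up for illustration.

```python
import math
from itertools import combinations


def delta(l, h, alpha, beta):
    # Assumed form of eq. (11.3), with the special case delta(0, h) = 1.
    if l == 0:
        return 1.0
    return math.exp(-alpha * l) * math.tanh(beta * h)


def path_to_root(parent, c):
    # Concepts from c up to the root (the root has parent None).
    path = [c]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path


def l_and_h(parent, a, b):
    pa, pb = path_to_root(parent, a), path_to_root(parent, b)
    h = len(set(pa) & set(pb)) - 1  # depth of deepest common ancestor
    l = (len(pa) - 1 - h) + (len(pb) - 1 - h)  # path length between a and b
    return l, h


def onto_rand(parent_u, assign_u, parent_v, assign_v, alpha, beta):
    # Assumed form of eq. (11.1): 1 - mean |delta_U - delta_V| over pairs.
    pairs = list(combinations(sorted(assign_u), 2))
    total = 0.0
    for i, j in pairs:
        du = delta(*l_and_h(parent_u, assign_u[i], assign_u[j]), alpha, beta)
        dv = delta(*l_and_h(parent_v, assign_v[i], assign_v[j]), alpha, beta)
        total += abs(du - dv)
    return 1.0 - total / len(pairs)


def second_level(parent, c):
    # The ancestor of c that is an immediate subconcept of the root.
    path = path_to_root(parent, c)
    return path[-2] if len(path) > 1 else path[-1]


def rand_index(labels_a, labels_b):
    # Ordinary Rand index: fraction of pairs on which two partitions agree.
    pairs = list(combinations(sorted(labels_a), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)


# Two hypothetical ontologies over the same five instances.
parent_u = {"root": None, "A": "root", "B": "root",
            "A1": "A", "A2": "A", "B1": "B"}
parent_v = {"root": None, "X": "root", "Y": "root",
            "X1": "X", "Y1": "Y", "Y2": "Y"}
assign_u = {"o1": "A1", "o2": "A2", "o3": "A1", "o4": "B1", "o5": "B"}
assign_v = {"o1": "X1", "o2": "Y1", "o3": "X", "o4": "Y2", "o5": "Y1"}

onto = onto_rand(parent_u, assign_u, parent_v, assign_v,
                 alpha=1e-9, beta=50.0)
rand = rand_index(
    {o: second_level(parent_u, c) for o, c in assign_u.items()},
    {o: second_level(parent_v, c) for o, c in assign_v.items()},
)
print(onto, rand)
```

Under these assumptions the two printed values agree up to a tiny numerical difference, because everything below the second level of each ontology no longer affects $\delta$.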

The overlap-based version of $\delta_U$ from eq. (11.2) can also be defined in terms of $h$ and $l$. If the root is taken to be at depth 0, then the intersection of $A(U_i)$ and $A(U_j)$ contains $h + 1$ concepts, and the union of $A(U_i)$ and $A(U_j)$ contains $h + l + 1$ concepts. Thus, we see that eq. (11.2) is equivalent to defining

$$\delta(l, h) = \frac{h + 1}{h + l + 1}. \tag{11.4}$$
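
The counting argument can be verified on a small tree. The sketch below assumes that $A(U_i)$ contains $U_i$ itself together with all of its ancestors up to the root, which is what the counts $h + 1$ and $h + l + 1$ imply; the concept names and function names are hypothetical.

```python
def ancestors(parent, c):
    """A(c): the concept c together with all its ancestors up to the root."""
    result = {c}
    while parent[c] is not None:
        c = parent[c]
        result.add(c)
    return result


def delta_overlap(parent, a, b):
    # Overlap of the two ancestor sets: |intersection| / |union|.
    set_a, set_b = ancestors(parent, a), ancestors(parent, b)
    return len(set_a & set_b) / len(set_a | set_b)


def delta_lh(l, h):
    # The same quantity expressed through l and h, as in eq. (11.4).
    return (h + 1) / (h + l + 1)


# A hypothetical tree: root -> Science -> {Physics -> Optics, Chemistry}.
parent = {
    "root": None,
    "Science": "root",
    "Physics": "Science",
    "Optics": "Physics",
    "Chemistry": "Science",
}

# Optics and Chemistry share {Science, root}, so h = 1; the path
# Optics - Physics - Science - Chemistry has l = 3 edges.
via_sets = delta_overlap(parent, "Optics", "Chemistry")
via_formula = delta_lh(l=3, h=1)
```

Here the intersection has 2 concepts and the union has 5, so both routes give 2/5.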

By comparing equations (11.3) and (11.4), we see a notable difference between the two definitions of $\delta$: when $h = 0$, i.e., when the two instances have no common ancestor except the root, eq. (11.3) returns $\delta = 0$ while eq. (11.4) returns $\delta = 1/(l + 1) > 0$. When comparing two ontologies, it may often happen that many pairs of instances have no common ancestor (except the root) in either of the two ontologies, i.e., $h_U = h_V = 0$, but the distance between their concepts is likely to be different: $l_U \neq l_V$. In these cases, using eq. (11.3) will result in $\delta_U = \delta_V = 0$, while eq. (11.4) will result in $\delta_U \neq \delta_V$. When the resulting values $|\delta_U - \delta_V|$ are used in eq. (11.1), we see that in the case of definition (11.3), many terms in the sum will be 0 and the OntoRand index will be close to 1. For example, in our experiments with the Science subtree of dmoz.org (Sec. 11.5.3), despite the fact that the assignment of instances to concepts was considerably different between the two ontologies, approximately 81% of instance pairs had $h_U = h_V = 0$ (and only 3.2% of these additionally had $l_U = l_V$). Thus, when using the definition of $\delta$ from eq. (11.3) (as opposed to the overlap-based definition from eq. (11.4)), we must accept the fact that most of the terms in the sum (11.1) will be 0 and the OntoRand index will be close to 1. This does not mean that the resulting values of OntoRand are not useful for assessing whether, e.g., one ontology is closer to the gold standard than another ontology is, but it may nevertheless appear confusing that OntoRand is always so close to 1. In this case a possible alternative is to replace eq. (11.3) by

$$\delta(l, h) = e^{-\alpha l} \tanh(\beta h + 1). \tag{11.5}$$
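
The effect of the three definitions on a single pair with $h_U = h_V = 0$ and $l_U \neq l_V$ can be sketched as follows. Eq. (11.3) is assumed to be $e^{-\alpha l} \tanh(\beta h)$, since it is not reproduced in this excerpt; the path lengths 4 and 6, the parameter values, and the function names are arbitrary illustrative choices. No special case for $l = 0$ is applied to eq. (11.5), because the text does not state one.

```python
import math


def delta_tanh(l, h, alpha, beta):
    # Assumed form of eq. (11.3), with the special case delta(0, h) = 1.
    if l == 0:
        return 1.0
    return math.exp(-alpha * l) * math.tanh(beta * h)


def delta_overlap(l, h):
    # Eq. (11.4).
    return (h + 1) / (h + l + 1)


def delta_shifted(l, h, alpha, beta):
    # Eq. (11.5): the argument of tanh is shifted by 1, so h = 0 no
    # longer forces the value to 0.
    return math.exp(-alpha * l) * math.tanh(beta * h + 1)


alpha, beta = 0.15, 0.25  # illustrative parameter values
l_u, l_v, h = 4, 6, 0     # no common ancestor except the root

# Contribution |delta_U - delta_V| of this pair under each definition.
term_tanh = abs(delta_tanh(l_u, h, alpha, beta) - delta_tanh(l_v, h, alpha, beta))
term_overlap = abs(delta_overlap(l_u, h) - delta_overlap(l_v, h))
term_shifted = abs(delta_shifted(l_u, h, alpha, beta) - delta_shifted(l_v, h, alpha, beta))
print(term_tanh, term_overlap, term_shifted)
```

Under the assumed form of eq. (11.3) the pair contributes nothing, whereas eqs. (11.4) and (11.5) both register the difference in path lengths.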

The family of $\delta$-functions defined by (11.5) can be seen as a generalization (in a loose sense) of the $\delta$-function from formula (11.4). For example, we compared the values of $\delta$ produced by these two definitions on a set of $10^6$ random pairs of documents from the dmoz.org Science subtree. For a suitable choice of $\alpha$ and $\beta$, the definition (11.5) can be made to produce values of $\delta$ that are very closely correlated with those of definition (11.4) (e.g., correlation coefficient $= 0.995$ for $\alpha = 0.15$, $\beta = 0.25$). Similarly,
