when we compute |δ_U − δ_V| for various pairs of documents (when using eq. (11.1) to compare two ontologies in Sec. 11.5.3), definition (11.5) can yield values closely correlated to those of definition (11.4) for suitable values of α and β (e.g., correlation coefficient = 0.981 for α = 0.8, β = 1.5). However, note that the fact that the δ values of (11.5) are closely correlated with those of (11.4) for some choice of α and β does not imply that the values |δ_U − δ_V| will also be closely correlated for the same choice of α and β (or vice versa).
The need to select concrete values of α and β is one of the disadvantages of using definition (11.3) (or (11.5)) rather than the overlap-based definition (11.2) (or, equivalently, (11.4)).
Further generalizations. The distance measure (11.3) could be further generalized by taking δ(l, h) = f(l) g(h) for any decreasing function f and increasing function g. Since the values l and h are always integers and are limited by the depth of the tree (or twice the depth in the case of l), the functions f and g (or even δ(l, h) itself) could even be defined using a table of function values for all possible l and h.
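Since l and h range over small integer sets, such a table of δ(l, h) values can be precomputed once and then used for constant-time lookups. The following sketch illustrates the idea in Python; the particular choices of f and g (and all names) are our own illustrative assumptions, not the chapter's definitions.

```python
# Hypothetical sketch: tabulate delta(l, h) = f(l) * g(h) for all feasible
# integer arguments. f must be decreasing in l (path length between concepts)
# and g increasing in h (depth of the deepest common ancestor); the concrete
# f and g below are illustrative placeholders only.

def build_delta_table(max_depth):
    """Return a table T with T[l][h] = f(l) * g(h).

    l ranges over 0 .. 2*max_depth (path length in the tree),
    h ranges over 0 .. max_depth (depth of the deepest common ancestor).
    """
    f = lambda l: 1.0 / (1.0 + l)           # decreasing in l
    g = lambda h: h / float(max_depth)      # increasing in h, g(0) = 0
    return [[f(l) * g(h) for h in range(max_depth + 1)]
            for l in range(2 * max_depth + 1)]

table = build_delta_table(max_depth=5)
delta = table[3][2]   # O(1) lookup instead of recomputing f and g each time
```

Precomputing the table is cheap (O(depth²) entries) and pays off when δ is evaluated for all O(n²) instance pairs.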
Note that the main part of the OntoRand index formula, as defined in equation (11.1), i.e., the sum Σ_{1 ≤ i < j ≤ n} |δ_U(U(o_i), U(o_j)) − δ_V(V(o_i), V(o_j))|, can also be interpreted as a Manhattan (L_1-norm) distance between two vectors of n(n − 1)/2 components, one depending only on the ontology U and the other depending only on the ontology V. Thus, in effect, we have represented an ontology U by a "feature vector" in which the (i, j)-th component has the value δ_U(U(o_i), U(o_j)), describing how closely the instances o_i and o_j have been placed in that ontology. This interpretation opens the possibility of various further generalizations, such as using the Euclidean distance instead of the Manhattan distance, or even using kernel methods (cf. [13]). However, we leave such extensions for further work.
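The feature-vector reading of eq. (11.1) can be made concrete with a small sketch. Everything here is illustrative and not the chapter's own code: the toy δ (1 if two instances share a cluster, 0 otherwise) is an overlap-style stand-in for the real measure, and the function names are ours.

```python
from itertools import combinations

def feature_vector(delta, cluster_of, instances):
    """One component per unordered instance pair (i < j), in a fixed order.

    delta(c1, c2) scores how closely two clusters are related; cluster_of
    maps each instance to its cluster in this ontology.
    """
    return [delta(cluster_of[oi], cluster_of[oj])
            for oi, oj in combinations(instances, 2)]

def manhattan(u_vec, v_vec):
    """L1-norm distance between the two ontologies' feature vectors."""
    return sum(abs(a - b) for a, b in zip(u_vec, v_vec))

# Toy usage: two flat "ontologies" over three instances; delta = 1 iff the
# two instances share a cluster (an overlap-style stand-in measure).
instances = ["o1", "o2", "o3"]
U = {"o1": "A", "o2": "A", "o3": "B"}
V = {"o1": "A", "o2": "B", "o3": "B"}
same = lambda c1, c2: 1.0 if c1 == c2 else 0.0
d = manhattan(feature_vector(same, U, instances),
              feature_vector(same, V, instances))
# The two ontologies disagree on the pairs (o1, o2) and (o2, o3).
```

Swapping `manhattan` for a Euclidean distance over the same vectors is exactly the kind of generalization mentioned above.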
11.4.4 Approximation Algorithms
As can be seen from eq. (11.1), the computation of our ontology similarity measure involves a sum over all pairs of documents, (i, j) for 1 ≤ i < j ≤ n. This quadratic time complexity can be problematic when comparing ontologies with a fairly large number of instances (e.g., on the order of 100,000, as in the case of the dmoz.org "Science" subtree mentioned in Section 11.4.1). One way to speed up the computation of the similarity measure and obtain an approximate result is to use a randomly sampled subset of pairs rather than all possible pairs of documents. That is, eq. (11.1) would then contain the average value of |δ_U(U(o_i), U(o_j)) − δ_V(V(o_i), V(o_j))| over some subset of pairs instead of over all pairs.
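The sampling idea can be sketched as follows; the function name and the uniform pair-sampling scheme are our own assumptions about one straightforward implementation, not the chapter's.

```python
import random

# Sketch: estimate the average |delta_U - delta_V| over a random subset of
# instance pairs instead of summing over all n(n-1)/2 of them.

def sampled_average_diff(delta_u, delta_v, n, sample_size, rng=random):
    """Average |delta_u(i, j) - delta_v(i, j)| over `sample_size` random pairs.

    delta_u(i, j) stands in for delta_U(U(o_i), U(o_j)), and likewise for
    delta_v; n is the number of instances. Pairs are drawn uniformly.
    """
    total = 0.0
    for _ in range(sample_size):
        i, j = rng.sample(range(n), 2)   # one random unordered pair, i != j
        total += abs(delta_u(i, j) - delta_v(i, j))
    return total / sample_size

# Multiplying the estimated average by n(n-1)/2 approximates the full sum,
# at a cost that is independent of n.
```

Pass a seeded `random.Random(...)` as `rng` when reproducible estimates are needed.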
Another way towards approximate computation of the similarity measure is to try to identify pairs (i, j) for which the difference |δ_U(U(o_i), U(o_j)) − δ_V(V(o_i), V(o_j))| is not close to 0. If both ontologies classify the instances o_i and o_j into highly unrelated clusters, the values δ_U(U(o_i), U(o_j)) and δ_V(V(o_i), V(o_j)) will both be close to 0, their difference will also be close to 0, and it will not have a large effect on the sum. (In a typical dmoz-like hierarchy we can expect that a large proportion of pairs of instances will fall into such relatively unrelated clusters. As an extreme case, consider the definition of δ_U using eq. (11.3). If a pair of instances has no common ancestor concept except the root, h will be 0 and thus δ_U will be 0. If this happens in both ontologies, the pair will contribute nothing to the sum in eq. (11.1).) Thus it would be reasonable to try identifying pairs (i, j) for which o_i and o_j are in closely