when we compute |δ_U − δ_V| for various pairs of documents (when using eq. (11.1) to compare two ontologies in Sec. 11.5.3), definition (11.5) can yield values closely correlated to those of definition (11.4) for suitable values of α and β (e.g., correlation coefficient = 0.981 for α = 0.8, β = 1.5). However, note that the fact that the δ values of (11.5) are closely correlated with those of (11.4) for some choice of α and β does not imply that the values |δ_U − δ_V| will also be closely correlated for the same choice of α and β (or vice versa).
The need to select concrete values of α and β is one of the disadvantages of using definition (11.3) (or (11.5)) rather than the overlap-based definition (11.2) (or, equivalently, (11.4)).
Further generalizations. The distance measure (11.3) could be further generalized by taking δ(l, h) = f(l) g(h) for any decreasing function f and increasing function g. Since the values l and h are always integers and are limited by the depth of the tree (or twice the depth in the case of l), the functions f and g (or even δ(l, h) itself) could even be defined using a table of function values for all possible l and h.
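Since l and h range over small integer sets, such a table of δ(l, h) values can be precomputed once and then used for constant-time lookups. The following sketch illustrates the idea in Python; the particular choices of f and g (and all names) are our own illustrative assumptions, not the chapter's definitions.

```python
# Hypothetical sketch: tabulate delta(l, h) = f(l) * g(h) for all feasible
# integer arguments. f must be decreasing in l (path length between concepts)
# and g increasing in h (depth of the deepest common ancestor); the concrete
# f and g below are illustrative placeholders only.

def build_delta_table(max_depth):
    """Return a table T with T[l][h] = f(l) * g(h).

    l ranges over 0 .. 2*max_depth (path length in the tree),
    h ranges over 0 .. max_depth (depth of the deepest common ancestor).
    """
    f = lambda l: 1.0 / (1.0 + l)           # decreasing in l
    g = lambda h: h / float(max_depth)      # increasing in h, g(0) = 0
    return [[f(l) * g(h) for h in range(max_depth + 1)]
            for l in range(2 * max_depth + 1)]

table = build_delta_table(max_depth=5)
delta = table[3][2]   # O(1) lookup instead of recomputing f and g each time
```

Precomputing the table is cheap (O(depth²) entries) and pays off when δ is evaluated for all O(n²) instance pairs.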
Note that the main part of the OntoRand index formula, as defined in equation (11.1), i.e., the sum Σ_{1 ≤ i < j ≤ n} |δ_U(U(o_i), U(o_j)) − δ_V(V(o_i), V(o_j))|, can also be interpreted as a Manhattan (L_1-norm) distance between two vectors of n(n − 1)/2 components, one depending only on the ontology U and the other depending only on the ontology V. Thus, in effect, we have represented an ontology U by a "feature vector" in which the (i, j)-th component has the value δ_U(U(o_i), U(o_j)), describing how closely the instances o_i and o_j have been placed in that ontology. This interpretation opens the possibility of various further generalizations, such as using the Euclidean distance instead of the Manhattan distance, or even using kernel methods (cf. [13]). However, we leave such extensions for further work.
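The feature-vector reading of eq. (11.1) can be made concrete with a small sketch. Everything here is illustrative and not the chapter's own code: the toy δ (1 if two instances share a cluster, 0 otherwise) is an overlap-style stand-in for the real measure, and the function names are ours.

```python
from itertools import combinations

def feature_vector(delta, cluster_of, instances):
    """One component per unordered instance pair (i < j), in a fixed order.

    delta(c1, c2) scores how closely two clusters are related; cluster_of
    maps each instance to its cluster in this ontology.
    """
    return [delta(cluster_of[oi], cluster_of[oj])
            for oi, oj in combinations(instances, 2)]

def manhattan(u_vec, v_vec):
    """L1-norm distance between the two ontologies' feature vectors."""
    return sum(abs(a - b) for a, b in zip(u_vec, v_vec))

# Toy usage: two flat "ontologies" over three instances; delta = 1 iff the
# two instances share a cluster (an overlap-style stand-in measure).
instances = ["o1", "o2", "o3"]
U = {"o1": "A", "o2": "A", "o3": "B"}
V = {"o1": "A", "o2": "B", "o3": "B"}
same = lambda c1, c2: 1.0 if c1 == c2 else 0.0
d = manhattan(feature_vector(same, U, instances),
              feature_vector(same, V, instances))
# The two ontologies disagree on the pairs (o1, o2) and (o2, o3).
```

Swapping `manhattan` for a Euclidean distance over the same vectors is exactly the kind of generalization mentioned above.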
11.4.4 Approximation Algorithms
As can be seen from eq. (11.1), the computation of our ontology similarity measure involves a sum over all pairs of documents, (i, j) for 1 ≤ i < j ≤ n. This quadratic time complexity can be problematic when comparing ontologies with a fairly large number of instances (e.g., on the order of 100,000, as in the case of the dmoz.org "Science" subtree mentioned in Section 11.4.1). One way to speed up the computation of the similarity measure and obtain an approximate result is to use a randomly sampled subset of pairs rather than all possible pairs of documents. That is, eq. (11.1) would then contain the average value of |δ_U(U(o_i), U(o_j)) − δ_V(V(o_i), V(o_j))| over some subset of pairs instead of over all pairs.
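The sampling idea can be sketched as follows; the function name and the uniform pair-sampling scheme are our own assumptions about one straightforward implementation, not the chapter's.

```python
import random

# Sketch: estimate the average |delta_U - delta_V| over a random subset of
# instance pairs instead of summing over all n(n-1)/2 of them.

def sampled_average_diff(delta_u, delta_v, n, sample_size, rng=random):
    """Average |delta_u(i, j) - delta_v(i, j)| over `sample_size` random pairs.

    delta_u(i, j) stands in for delta_U(U(o_i), U(o_j)), and likewise for
    delta_v; n is the number of instances. Pairs are drawn uniformly.
    """
    total = 0.0
    for _ in range(sample_size):
        i, j = rng.sample(range(n), 2)   # one random unordered pair, i != j
        total += abs(delta_u(i, j) - delta_v(i, j))
    return total / sample_size

# Multiplying the estimated average by n(n-1)/2 approximates the full sum,
# at a cost that is independent of n.
```

Pass a seeded `random.Random(...)` as `rng` when reproducible estimates are needed.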
Another way towards approximate computation of the similarity measure is to try to identify pairs (i, j) for which the difference |δ_U(U(o_i), U(o_j)) − δ_V(V(o_i), V(o_j))| is not close to 0. If both ontologies classify the instances o_i and o_j into highly unrelated clusters, the values δ_U(U(o_i), U(o_j)) and δ_V(V(o_i), V(o_j)) will both be close to 0, their difference will also be close to 0, and it will not have a large effect on the sum. (In a typical dmoz-like hierarchy we can expect that a large proportion of pairs of instances will fall into such relatively unrelated clusters. As an extreme case, consider the definition of δ_U using eq. (11.3). If a pair of instances has no common ancestor concept except the root, h will be 0 and thus δ_U will be 0. If this happens in both ontologies, the pair will contribute nothing to the sum in eq. (11.1).) Thus it would be reasonable to try identifying pairs (i, j) for which o_i and o_j are in closely