The dataset we used for the evaluation is the dmoz.org directory as of October 21,
2005. It contains 687,333 concepts and 4,381,225 instances. The concepts are
organized into a tree; the deepest parts of the hierarchy go 15 levels deep, but in
most places it is shallower (85% of all concepts are on levels 5 through 9, and the
average node depth is 7.13). Since it would be too time-consuming to compute our
OntoRand index over all pairs of documents (there are approx. 9.6 · 10^12 such
pairs), we used a random sample of 10^6 pairs of documents.
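As a rough illustration, such a sample can be drawn by repeatedly picking two distinct instance indices uniformly at random; the constants and function below are a hypothetical sketch, not the original evaluation code:

```python
import random

NUM_INSTANCES = 4_381_225   # instances in the dmoz.org snapshot
NUM_PAIRS = 10**6           # size of the random sample of pairs

def sample_pairs(num_instances, num_pairs, seed=0):
    """Draw a set of distinct unordered pairs of instance indices."""
    rng = random.Random(seed)
    pairs = set()
    while len(pairs) < num_pairs:
        i, j = rng.randrange(num_instances), rng.randrange(num_instances)
        if i != j:
            pairs.add((min(i, j), max(i, j)))  # unordered: (i, j) == (j, i)
    return pairs

sample = sample_pairs(NUM_INSTANCES, NUM_PAIRS)
```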
In the case of the similarity measure (11.3), which is based on the tree distance
between two concepts, it is necessary to select the parameters α and β. Recall that
α is used in the term e^(−αl), where l is the length of the path from one concept to
the other. Since our hierarchy has just 15 levels, we know that l ≤ 28 for any pair of
nodes; but since most nodes are on levels 5 through 9, we can expect l to be around
10–15 for a typical random pair of unrelated concepts. We decided to use α = 0.3,
which results in e^(−αl) values from 0.74 (for l = 1) to 0.22 (for l = 5), 0.05 (for l = 10)
and 0.01 (for l = 15).
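These numbers are easy to reproduce (a minimal sketch; the variable names are ours):

```python
import math

alpha = 0.3
for l in (1, 5, 10, 15):
    print(l, round(math.exp(-alpha * l), 2))   # 0.74, 0.22, 0.05, 0.01
```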
The parameter β can be chosen using similar considerations. It is used in the
term tanh(βh), where h is the level at which the last common ancestor of the two
concepts is located. Thus in our case h will be between 0 and 14, and will be close
to 0 for two random unrelated concepts. For two very closely related concepts, h
will typically be close to the depth of these two concepts, which (as we saw above)
is on average around 7. We use β = 0.2, which results in tanh(βh) values of 0 (for
h = 0), 0.20 (for h = 1), 0.76 (for h = 5), 0.89 (for h = 7), and 0.96 (for h = 10).
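The analogous check for the depth term:

```python
import math

beta = 0.2
for h in (0, 1, 5, 7, 10):
    print(h, round(math.tanh(beta * h), 2))   # 0.0, 0.2, 0.76, 0.89, 0.96
```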
In general, the choice of values of α and β depends on the characteristics of the
ontologies we are dealing with. A more principled way of choosing α and β might
be to set explicit requirements on the value that we want e^(−αl) to take for a pair
of random (i.e., typically unrelated) documents, and on the value that we want
tanh(βh) to take for a pair of very closely related documents.
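For example, if we required e^(−αl) ≈ 0.05 for a typical unrelated pair (l ≈ 10) and tanh(βh) ≈ 0.9 for a closely related pair (h ≈ 7), the parameters would follow by inverting the two functions; the target values here are illustrative assumptions, not requirements stated elsewhere in the text:

```python
import math

# illustrative targets (assumed, not prescribed by the text)
l_unrelated, target_unrelated = 10, 0.05   # desired exp(-alpha * l) at l = 10
h_related, target_related = 7, 0.9         # desired tanh(beta * h) at h = 7

alpha = -math.log(target_unrelated) / l_unrelated   # ~0.30
beta = math.atanh(target_related) / h_related        # ~0.21
print(alpha, beta)
```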
11.5.1 Removing Lower Levels of the Tree
In this scenario we keep only the upper k levels of the tree, for various values of k .
Any concepts located at level k or deeper are discarded; instances that used to
be assigned to one of the deleted concepts are reassigned to their ancestor at level
k − 1 (i.e., the deepest level that was not deleted). We then compare the resulting
tree with the original tree. This removal of the lower levels of the tree corresponds to
a scenario in which the ontology is being constructed automatically in a top-down
manner (e.g., by hierarchical top-down clustering of instances) and some automatic
stopping criterion is used to decide when to stop partitioning the clusters; if we stop
too early, the resulting hierarchy will lack the lower levels. The chart in Figure 11.1
shows how the overlap measure (eq. 11.2) and the tree distance measure (eq. 11.3)
react to this gradual removal of lower parts of the hierarchy.
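A minimal sketch of this truncation step, assuming a representation in which each concept records its parent and level (root at level 0) and each instance records the concept it is assigned to; the names below are hypothetical, not taken from the original implementation:

```python
def truncate_to_k_levels(parent, level, assignment, k):
    """Keep only concepts at levels 0..k-1 and reassign instances of deleted
    concepts to their deepest surviving ancestor (at level k-1).

    parent:     concept -> parent concept (root maps to None)
    level:      concept -> level of the concept (root at level 0)
    assignment: instance -> concept the instance is assigned to
    """
    def surviving_ancestor(c):
        # walk up the tree until we reach a concept that is kept
        while level[c] > k - 1:
            c = parent[c]
        return c

    return {inst: surviving_ancestor(c) for inst, c in assignment.items()}
```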
We note that the overlap-based similarity measure increases monotonically as
more and more levels are kept. The increase is quick at first and slow at the end,
which is reasonable because (as has been noted above) the deeper levels of the
hierarchy contain relatively few nodes, so discarding them does not alter the hierarchy
so dramatically. For instance, if we constructed an ontology in a top-down manner
and stopped when the ontology is at most seven levels deep, the OntoRand index