The dataset we used for the evaluation is the dmoz.org directory as of October 21,
2005. It contains 687,333 concepts and 4,381,225 instances. The concepts are
organized into a tree; the deepest parts of the hierarchy go 15 levels deep, but in
most places it is shallower (85% of all concepts are on levels 5 through 9, and the
average node depth is 7.13). Since it would be too time-consuming to compute our
OntoRand index over all pairs of documents (there are approx. 9.6 · 10^12 such
pairs), we used a random sample of 10^6 pairs of documents.
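As a rough illustration, such a sample can be drawn by repeatedly picking two distinct instance indices uniformly at random; the constants and function below are a hypothetical sketch, not the original evaluation code:

```python
import random

NUM_INSTANCES = 4_381_225   # instances in the dmoz.org snapshot
NUM_PAIRS = 10**6           # size of the random sample of pairs

def sample_pairs(num_instances, num_pairs, seed=0):
    """Draw a set of distinct unordered pairs of instance indices."""
    rng = random.Random(seed)
    pairs = set()
    while len(pairs) < num_pairs:
        i, j = rng.randrange(num_instances), rng.randrange(num_instances)
        if i != j:
            pairs.add((min(i, j), max(i, j)))  # unordered: (i, j) == (j, i)
    return pairs

sample = sample_pairs(NUM_INSTANCES, NUM_PAIRS)
```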
In the case of the similarity measure (11.3), which is based on the tree distance
between two concepts, it is necessary to select the parameters α and β. Recall that
α is used in the term e^(−αl), where l is the length of the path from one concept to
the other. Since our hierarchy has just 15 levels, we know that l ≤ 28 for any pair of
nodes; but since most nodes are on levels 5 through 9, we can expect l to be around
10–15 for a typical random pair of unrelated concepts. We decided to use α = 0.3,
which results in e^(−αl) values from 0.74 (for l = 1) to 0.22 (for l = 5), 0.05 (for l = 10)
and 0.01 (for l = 15).
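These numbers are easy to reproduce (a minimal sketch; the variable names are ours):

```python
import math

alpha = 0.3
for l in (1, 5, 10, 15):
    print(l, round(math.exp(-alpha * l), 2))   # 0.74, 0.22, 0.05, 0.01
```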
The parameter β can be chosen using similar considerations. It is used in the
term tanh(βh), where h is the level at which the last common ancestor of the two
concepts is located. Thus in our case h will be between 0 and 14, and will be close
to 0 for two random unrelated concepts. For two very closely related concepts, h
will typically be close to the depth of these two concepts, which (as we saw above)
is on average around 7. We use β = 0.2, which results in tanh(βh) values of 0 (for
h = 0), 0.20 (for h = 1), 0.76 (for h = 5), 0.89 (for h = 7), and 0.96 (for h = 10).
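The analogous check for the depth term:

```python
import math

beta = 0.2
for h in (0, 1, 5, 7, 10):
    print(h, round(math.tanh(beta * h), 2))   # 0.0, 0.2, 0.76, 0.89, 0.96
```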
In general, the choice of values of α and β depends on the characteristics of the
ontologies we are dealing with. A more principled way of choosing α and β might
be to set explicit requirements on the value that we want e^(−αl) to take for a pair
of random (i.e., typically unrelated) documents, and on the value that we want
tanh(βh) to take for a pair of very closely related documents.
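For example, if we required e^(−αl) ≈ 0.05 for a typical unrelated pair (l ≈ 10) and tanh(βh) ≈ 0.9 for a closely related pair (h ≈ 7), the parameters would follow by inverting the two functions; the target values here are illustrative assumptions, not requirements stated elsewhere in the text:

```python
import math

# illustrative targets (assumed, not prescribed by the text)
l_unrelated, target_unrelated = 10, 0.05   # desired exp(-alpha * l) at l = 10
h_related, target_related = 7, 0.9         # desired tanh(beta * h) at h = 7

alpha = -math.log(target_unrelated) / l_unrelated   # ~0.30
beta = math.atanh(target_related) / h_related        # ~0.21
print(alpha, beta)
```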
11.5.1 Removing Lower Levels of the Tree
In this scenario we keep only the upper k levels of the tree, for various values of k .
Any concepts located at level k or deeper are discarded; instances that used to
be assigned to one of the deleted concepts are reassigned to their ancestor at level
k − 1 (i.e., the deepest level that was not deleted). We then compare the resulting
tree with the original tree. This removal of the lower levels of the tree corresponds to
a scenario in which the ontology is being constructed automatically in a top-down
manner (e.g., by hierarchical top-down clustering of instances) and some automatic
stopping criterion is used to decide when to stop partitioning the clusters; if we stop
too early, the resulting hierarchy will lack the lower levels. The chart in Figure 11.1
shows how the overlap measure (eq. 11.2) and the tree distance measure (eq. 11.3)
react to this gradual removal of lower parts of the hierarchy.
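A minimal sketch of this truncation step, assuming a representation in which each concept records its parent and level (root at level 0) and each instance records the concept it is assigned to; the names below are hypothetical, not taken from the original implementation:

```python
def truncate_to_k_levels(parent, level, assignment, k):
    """Keep only concepts at levels 0..k-1 and reassign instances of deleted
    concepts to their deepest surviving ancestor (at level k-1).

    parent:     concept -> parent concept (root maps to None)
    level:      concept -> level of the concept (root at level 0)
    assignment: instance -> concept the instance is assigned to
    """
    def surviving_ancestor(c):
        # walk up the tree until we reach a concept that is kept
        while level[c] > k - 1:
            c = parent[c]
        return c

    return {inst: surviving_ancestor(c) for inst, c in assignment.items()}
```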
We note that the overlap-based similarity measure increases monotonically as
more and more levels are kept. The increase is quick at first and slow at the end,
which is reasonable because (as has been noted above) the deeper levels of the
hierarchy contain relatively few nodes, so discarding them does not alter the hierarchy
so dramatically. For instance, if we constructed an ontology in a top-down manner
and stopped when the ontology is at most seven levels deep, the OntoRand index