the set of all documents belonging to that concept). We then reassign each document
to the category whose centroid is the most similar to the document vector. Thus,
the hierarchical relation between concepts remains unchanged, but the assignment of
instances to concepts may change considerably. This reassignment of instances to the
nearest concepts resembles operations that might be used in an automated ontology
construction/population approach (e.g., analogous to k-means clustering). We then
measure the similarity of the new ontology (after the reassignment of documents to
concepts) to the original one.
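The reassignment step can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: it assumes that a concept's centroid is the component-wise mean of the vectors of the documents assigned directly to that concept, and that "most similar" means highest cosine similarity. The names `centroid`, `cosine`, and `thorough_reassign` are hypothetical.

```python
import math

def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity (an assumed choice); 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def thorough_reassign(doc_vectors, assignment):
    """Move every document to the concept whose centroid is most similar to it.

    doc_vectors: dict mapping document id -> vector
    assignment:  dict mapping document id -> concept id (the original ontology)
    Concepts without any document have no centroid here and cannot receive one.
    """
    members = {}
    for doc, concept in assignment.items():
        members.setdefault(concept, []).append(doc_vectors[doc])
    centroids = {c: centroid(vs) for c, vs in members.items()}
    return {
        doc: max(centroids, key=lambda c: cosine(vec, centroids[c]))
        for doc, vec in doc_vectors.items()
    }

# Toy data: d4 is filed under "biology" but lies closer to the "physics" centroid.
docs = {"d1": [1.0, 0.0], "d2": [0.9, 0.1], "d3": [0.0, 1.0], "d4": [0.8, 0.2]}
original = {"d1": "physics", "d2": "physics", "d3": "biology", "d4": "biology"}
reassigned = thorough_reassign(docs, original)
```

Note that the concept hierarchy plays no role in this sketch: only the mapping from documents to concepts changes, which matches the observation that the hierarchical relation itself remains unchanged.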
For reasons of scalability, the experiments in this section were not performed
on the entire dmoz ontology, but only on its “Science” subtree. This consists of
11,624 concepts and 104,853 documents. We compare two reassignment strategies:
“thorough reassignment” compares each document vector to the centroids of all
concepts, while “top-down reassignment” is a greedy approach that starts with the
root concept and proceeds down the tree, always moving into the subconcept whose
centroid is the most similar to the document vector. When a leaf is reached, or when
none of the subconcept centroids is more similar to the document vector than the
current concept's centroid, the procedure stops and assigns the document to the
current concept. This is much faster than thorough reassignment, but it risks
being derailed into a less promising part of the tree by bad choices at the
upper levels.
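The greedy descent of top-down reassignment can be sketched as below. This is an illustrative sketch under stated assumptions: every concept, including the root, is assumed to have a precomputed centroid, cosine similarity is assumed as the similarity measure, and the names `top_down_assign`, `children`, and `centroids` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity (an assumed choice); 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_down_assign(vec, root, children, centroids):
    """Greedily descend from the root and return the concept where the descent stops.

    children:  dict mapping concept id -> list of subconcept ids (leaves may be absent)
    centroids: dict mapping concept id -> centroid vector
    """
    current = root
    while True:
        subs = children.get(current, [])
        if not subs:
            return current  # a leaf has been reached
        best = max(subs, key=lambda c: cosine(vec, centroids[c]))
        if cosine(vec, centroids[best]) <= cosine(vec, centroids[current]):
            return current  # no subconcept is more similar than the current concept
        current = best

# Toy hierarchy: root -> {A, B}, A -> {A1, A2}.
children = {"root": ["A", "B"], "A": ["A1", "A2"]}
centroids = {
    "root": [0.5, 0.5],
    "A": [0.9, 0.1],
    "B": [0.1, 0.9],
    "A1": [1.0, 0.0],
    "A2": [0.7, 0.3],
}
deep = top_down_assign([1.0, 0.0], "root", children, centroids)     # descends to "A1"
shallow = top_down_assign([0.6, 0.4], "root", children, centroids)  # stays at "root"
```

Each document is compared only with the centroids along one root-to-leaf path rather than with all concepts, which is where the speed advantage comes from. The second call shows the other side of the trade-off: the descent stops as soon as no subconcept beats the current concept, so concepts further down that branch are never examined.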
Fig. 11.4. Evaluation of an ontology whose instances have been reassigned to concepts
based on their natural-language descriptions. The number of reassignment steps is
used as the x-coordinate. The left chart shows the similarity of the original ontology
and the ontology after reassignment. The right chart shows the average distance (as
measured by δ_U, eq. (11.2)) between the concept containing an instance in the original
ontology and the concept to which the instance has been reassigned.
After documents are reassigned to concepts, new centroids of the concepts may
be computed (based on the new assignment of documents to concepts), and a new
reassignment step performed using the new centroids. The charts in Figure 11.4
show the results for up to five reassignment steps. The overlap-based definition of
δ_U (see eq. (11.2)) was used for both charts.
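Repeating the reassignment with recomputed centroids might look like the sketch below. It assumes mean-vector centroids and cosine similarity, as in a thorough reassignment; the names `reassign_once`, `iterate_reassignment`, and `moved` are hypothetical. The `moved` count is only a crude diagnostic for illustration and is not the ontology similarity measure or the δ_U distance plotted in the figure.

```python
import math

def cosine(u, v):
    """Cosine similarity (an assumed choice); 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def reassign_once(doc_vectors, assignment):
    """One step: recompute centroids from `assignment`, then reassign all documents."""
    groups = {}
    for doc, concept in assignment.items():
        groups.setdefault(concept, []).append(doc_vectors[doc])
    centroids = {
        c: [sum(col) / len(vs) for col in zip(*vs)] for c, vs in groups.items()
    }
    return {
        doc: max(centroids, key=lambda c: cosine(vec, centroids[c]))
        for doc, vec in doc_vectors.items()
    }

def iterate_reassignment(doc_vectors, assignment, steps=5):
    """Return the list of assignments obtained after each of `steps` steps."""
    history = []
    current = dict(assignment)
    for _ in range(steps):
        current = reassign_once(doc_vectors, current)
        history.append(current)
    return history

def moved(before, after):
    """Number of documents whose concept differs between two assignments."""
    return sum(1 for doc in before if before[doc] != after[doc])

docs = {"d1": [1.0, 0.0], "d2": [0.9, 0.1], "d3": [0.0, 1.0], "d4": [0.8, 0.2]}
original = {"d1": "physics", "d2": "physics", "d3": "biology", "d4": "biology"}
history = iterate_reassignment(docs, original, steps=5)
changes_per_step = [
    moved(prev, cur) for prev, cur in zip([original] + history, history)
]
```

On this toy data, one document moves in the first step and none afterwards, so the assignment has already stabilized after a single step.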
The left chart in Figure 11.4 shows the similarity of the ontology after each
reassignment step to the original ontology. As might be expected, top-down reassignment
of documents to concepts introduces much greater changes to the ontology than
thorough reassignment. Most of the change occurs during the first reassignment
step (which is reasonable as it would be naive to expect a simple centroid-based
 