a gold standard (see Section 11.2) with the main difference being that, while it bases
the evaluation on instances assigned to the ontology concepts, our approach does
not rely on natural-language descriptions of the concepts and instances (unlike e.g.,
the string edit distance approaches of Maedche and Staab [16]). No assumptions are
made regarding the representation of instances, only that we can distinguish one
instance from another (and that the ontology is based on the same set of instances
as the gold standard).
11.4.1 Task Description
We have tested our approach on a concrete task of evaluating a topic ontology based
on the “Science” subtree of the dmoz.org internet directory. The dmoz directory is
a topic ontology structured as a hierarchy of topics, and each topic may contain
(besides subtopics) zero or more links to external web pages. Each link includes
a title and a short description of the external web page. In the context of the
ontology learning scenario, each link to an external web page represents an instance
of the topic, in a manner similar to the approach to automatic classification of
Web documents into a topic ontology defined in [20]. In addition to classifying
documents into a topic ontology to populate an existing ontology, we can also define
a problem of learning an ontology given only a set of documents. In the case of the
“Science” subtree of dmoz.org, this means: given a total of approximately 100,000
instances, arrange them into a hierarchy of concepts. In effect, this is similar to an
unsupervised hierarchical clustering problem. The resulting hierarchy of concepts
(with each instance attached to one of the concepts) can be regarded as a simple ontology
(the hierarchical relationship between concepts can be approximately interpreted as
an “is-a” relation). One can evaluate learned ontologies by comparing them to the
“Science” subtree of the real dmoz.org directory, which will thus assume the role of
a gold standard.
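As a rough illustration of the setup described above, the following Python sketch represents a topic ontology as a tree of concepts with instances attached to them, and extracts the instance-to-concept assignment that an instance-based comparison would work from. The names `Concept`, `instance_to_concept` and `same_instance_set`, as well as the toy data, are hypothetical and not taken from the text; the sketch also assumes that concept names are unique and that instances are identified by opaque string ids.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    """A topic (concept) with directly attached instances and subtopics."""
    name: str  # assumed unique within one ontology
    instances: set[str] = field(default_factory=set)  # opaque instance ids
    children: list["Concept"] = field(default_factory=list)


def instance_to_concept(root: Concept) -> dict[str, str]:
    """Map each instance id to the name of the concept it is attached to."""
    assignment: dict[str, str] = {}
    stack = [root]
    while stack:
        node = stack.pop()
        for inst in node.instances:
            assignment[inst] = node.name
        stack.extend(node.children)
    return assignment


def same_instance_set(learned: Concept, gold: Concept) -> bool:
    """Check that both ontologies are built over the same set of instances."""
    return instance_to_concept(learned).keys() == instance_to_concept(gold).keys()


# Toy gold standard and a toy learned hierarchy (made-up data).
gold = Concept("Science", children=[
    Concept("Physics", {"d1", "d2"}),
    Concept("Biology", {"d3"}),
])
learned = Concept("root", {"d3"}, [
    Concept("c1", {"d1"}),
    Concept("c2", {"d2"}),
])
```

Note that the learned concepts carry only arbitrary labels (`c1`, `c2`): the comparison needs to tell instances apart, not to interpret concept names.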
In this evaluation task, each instance is represented by a short document of
natural-language text (i.e., the title and description of the external page, as it
appears in the dmoz.org directory). The concepts of the learned ontologies, however,
are not explicitly represented by any terms, phrases, or similar textual descriptions.
The question of how to select a good short textual representation, or perhaps a set
of keywords, for a particular learned concept could in itself be a separate task, but
is not part of the ontology learning task whose evaluation is being discussed here.
Additionally, since the number of instances (as well as concepts) is fairly large, the
evaluation must be reasonably fast and completely automated.
11.4.2 Similarity Measures on Partitions
Our approach to evaluation is based on the analogies between this ontology learning
task and traditional unsupervised clustering. In clustering, the task is to partition a
set of instances into a family of disjoint subsets. Here, the topic ontology can be seen
as a hierarchical way of partitioning the set of instances. The clustering community
has proposed various techniques for comparing two partitions of the same set of
instances, which can be used to compare the output of an automated clustering
method with a gold-standard partition. If these distance measures on traditional
“flat” partitions can be extended to hierarchical partitions, they can be used to