Table 3. Gold standard based lexical and taxonomic comparison

                        Lexical                                  Taxonomic
            ONT     LFZ     BB      DAG     Schmitz    ONT     LFZ     BB      DAG     Schmitz
Precision   0.261   0.743   0.183   0.745*  0.128      0.480*  0.077   0.434   0.123   0.329
Recall      0.240   0.006   0.244*  0.025   0.007      0.723   0.023   0.711   0.783*  0.256
F-measure   0.044   0.011   0.043   0.049*  0.014      0.577*  0.035   0.539   0.212   0.288
may miss important concepts and relationships, and a good algorithm that finds
concepts and relationships manually verified to be correct may be penalized
unfairly. We will return to this point. The full version of this paper [20] gives
the formal definitions of the measures and the detailed results; due to space
limitations, we cover only the highlights here.
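For concreteness, the sketch below illustrates the standard set-overlap form of lexical precision, recall, and F-measure against a gold standard. It is a minimal sketch: the label normalization and the toy concept sets are illustrative assumptions, not necessarily the exact definitions given in [20].

def normalize(label):
    # Lower-case and strip a concept label before comparison (an assumed normalization).
    return label.strip().lower()

def lexical_scores(learned, gold):
    # Return (precision, recall, F-measure) of learned concept labels vs. a gold standard.
    learned_n = {normalize(l) for l in learned}
    gold_n = {normalize(g) for g in gold}
    overlap = learned_n & gold_n
    precision = len(overlap) / len(learned_n) if learned_n else 0.0
    recall = len(overlap) / len(gold_n) if gold_n else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

# Hypothetical toy concept sets, not taken from the evaluation data.
learned = {"Jazz", "Piano", "Software Technology"}
gold = {"jazz", "piano", "dialect"}
print(lexical_scores(learned, gold))  # -> roughly (0.667, 0.667, 0.667)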
We looked at the 25 highest-level concepts common across the five algorithms.
Table 3 shows the results; entries marked with an asterisk (*) indicate the best performance for each measure.
ONTECTAS has the second-highest overall lexical recall and F-measure, which
shows that it did well at finding the desired concepts. While DAG had the highest
lexical precision and F-measure, and BB had the highest lexical recall, both
did very poorly on taxonomic precision, leading to low taxonomic F-measures.
LFZ had very good lexical precision; however, this was achieved by reporting a
very small number of correct concepts. ONTECTAS is superior to LFZ in terms
of all three taxonomic measures.
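The taxonomic measures compare is-a structure rather than concept labels alone. The sketch below shows one common cotopy-based formulation of taxonomic precision, in which each concept is compared through the set of its ancestors and descendants restricted to shared concepts; this is a generic formulation, not necessarily the exact definition used in [20], and the toy taxonomies are hypothetical.

def cotopy(concept, parent_of):
    # Ancestors and descendants of `concept` (plus itself) in an acyclic taxonomy
    # given as a child -> parent dictionary.
    ancestors, c = set(), concept
    while c in parent_of:
        c = parent_of[c]
        ancestors.add(c)
    children = {}
    for child, parent in parent_of.items():
        children.setdefault(parent, set()).add(child)
    descendants, stack = set(), [concept]
    while stack:
        node = stack.pop()
        for ch in children.get(node, ()):
            if ch not in descendants:
                descendants.add(ch)
                stack.append(ch)
    return {concept} | ancestors | descendants

def taxonomic_precision(learned, gold):
    # Average cotopy overlap over the concepts that appear in both taxonomies.
    common = (set(learned) | set(learned.values())) & (set(gold) | set(gold.values()))
    scores = []
    for c in common:
        lc = cotopy(c, learned) & common
        gc = cotopy(c, gold) & common
        if lc:
            scores.append(len(lc & gc) / len(lc))
    return sum(scores) / len(scores) if scores else 0.0

# Hypothetical toy taxonomies (child -> parent), not taken from the evaluation data.
learned = {"piano": "instrument", "jazz": "instrument"}
gold = {"piano": "instrument", "jazz": "music"}
print(round(taxonomic_precision(learned, gold), 3))  # -> 0.722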
Because the 25 highest-level common concepts were very uneven in size, we also
performed an analysis restricted to the 6 largest subtrees; otherwise, algorithms
would be tested against subtrees containing only one or two concepts. When we
considered only the 6 largest subtrees, ONTECTAS had the best lexical and
taxonomic F-measure.
Comparing to a gold standard shows how well algorithms do against a man-
ually created ontology. But since a gold standard ontology is static, this metric
may unfairly penalize algorithms that genuinely find correct concepts and rela-
tionships. E.g., the is-a relationship between “dialect” and “software technology”
is incorrect according to this standard. Thus, comparing algorithms should also
take into account the other components discussed above.
8 Conclusion and Future Work
We proposed an algorithm, ONTECTAS, for building ontologies of keywords
from collaborative tagging systems. ONTECTAS uses association rule mining,
bi-gram pruning, the exploitation of pairs of tags sharing the same child, and
lexico-syntactic patterns to detect relationships between tags. We also provided
a thorough analysis of ONTECTAS and of how it compares to other algorithms.
Important open problems include detecting spam users and improving the
accuracy of ontology extraction through supervised learning and the incorporation
of part-of-speech detection. Our ongoing work addresses some of these problems.
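As a rough, generic illustration of the lexico-syntactic-pattern component (and not the actual ONTECTAS implementation), the sketch below matches a few Hearst-style patterns between known tags in free text; the pattern set and the example sentence are assumptions.

import re

def extract_isa_pairs(sentence, tags):
    # Return (hyponym, hypernym) pairs between known tags found via Hearst-style
    # lexico-syntactic patterns; the patterns here are a small illustrative subset.
    tag_alt = "|".join(re.escape(t) for t in sorted(tags, key=len, reverse=True))
    patterns = [
        # (compiled regex, hypernym group, hyponym group)
        (re.compile(rf"\b({tag_alt}) such as ({tag_alt})\b", re.IGNORECASE), 1, 2),
        (re.compile(rf"\b({tag_alt}),? including ({tag_alt})\b", re.IGNORECASE), 1, 2),
        (re.compile(rf"\b({tag_alt}) is an? ({tag_alt})\b", re.IGNORECASE), 2, 1),
    ]
    pairs = set()
    for regex, hyper_g, hypo_g in patterns:
        for match in regex.finditer(sentence):
            pairs.add((match.group(hypo_g).lower(), match.group(hyper_g).lower()))
    return pairs

# Hypothetical usage on a snippet of text associated with the tags.
tags = {"jazz", "music", "piano"}
print(extract_isa_pairs("Genres of music such as jazz are widely tagged.", tags))
# -> {('jazz', 'music')}, i.e., "jazz" is-a "music"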
 