Tag-Based Experiments
The Catalogue Data Used for the Tag-Based Experiments
For the evaluation of our HTML tag-based approach, we followed Chakrabarti et al. (1998) and used
the Yahoo! catalogue because it is freely accessible, well known, and widely used. Since the Yahoo!
catalogue is well structured and organized by human experts, it also gives us a highly reliable
classification database against which to compare our approach to document classification with the
text-only approach.
The Evaluation Scenario
To evaluate the improvement achieved by our approach, we have to compare both our classification
extension, which includes HTML tags, and a classification without HTML tags against an
established and widely used classification. We imported 436 categories and 4735 documents from the
document class “Finanzen und Wirtschaft” (Business and Economy) of the German Yahoo! catalogue
into our database. To obtain a reasonable basis for classification and to avoid classes that are
too small, we aggregated all categories with fewer than six subcategories or documents into their
parent category. This yielded 191 categories, each of which has six or more documents or
subcategories.
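The aggregation step can be sketched as follows. This is a minimal illustration, not the original implementation: the `Category` class and the `aggregate` function are hypothetical names, and the sketch assumes that the size of a category is the number of its direct documents plus the number of its direct subcategories.

```python
from dataclasses import dataclass, field

MIN_SIZE = 6  # threshold from the text: six or more documents or subcategories


@dataclass
class Category:
    name: str
    documents: list = field(default_factory=list)
    children: list = field(default_factory=list)


def size(cat):
    # Assumption: direct documents plus direct subcategories.
    return len(cat.documents) + len(cat.children)


def aggregate(cat):
    """Merge undersized subcategories into `cat`, working bottom-up."""
    kept = []
    for child in cat.children:
        aggregate(child)  # process the subtree first
        if size(child) < MIN_SIZE:
            # Too small: move its documents and any surviving
            # subcategories up into the parent.
            cat.documents.extend(child.documents)
            kept.extend(child.children)
        else:
            kept.append(child)
    cat.children = kept


root = Category("Finanzen und Wirtschaft", ["r1"], [
    Category("A", ["a1", "a2"]),
    Category("B", ["b1", "b2", "b3", "b4", "b5", "b6"]),
])
aggregate(root)
```

In this toy tree, category A holds only two documents and is folded into the root, while B is kept as a class of its own.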
The category vectors in the evaluation database were set up in the following steps. First,
we used the TreeTagger tool (Schmid, 1994) and a Porter stemmer algorithm (Porter, 1980) modified
for the German language to extract more than 100,000 German key phrases and sub key phrases
from all the documents, according to the following regular expression:
key phrase = adjective* noun+
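To illustrate how such a pattern can be matched against part-of-speech tags, the sketch below runs it over an already tagged token sequence. It is only an illustration under assumptions: the tag names `ADJ` and `NOUN` are simplified placeholders rather than the tags TreeTagger emits, `extract_key_phrases` is a hypothetical helper, and stemming and sub key phrases are left out.

```python
import re


def extract_key_phrases(tagged):
    """Return phrases matching adjective* noun+ in a list of (token, tag) pairs."""
    # Encode each tag as one character so the pattern can run on the
    # tag sequence: A = adjective, N = noun, x = anything else.
    codes = "".join({"ADJ": "A", "NOUN": "N"}.get(tag, "x") for _, tag in tagged)
    phrases = []
    for match in re.finditer(r"A*N+", codes):
        words = [token for token, _ in tagged[match.start():match.end()]]
        phrases.append(" ".join(words))
    return phrases


tagged = [
    ("die", "DET"), ("deutsche", "ADJ"), ("Bank", "NOUN"),
    ("kauft", "VERB"), ("Aktien", "NOUN"),
]
phrases = extract_key_phrases(tagged)  # ["deutsche Bank", "Aktien"]
```

Matching on the encoded tag string keeps the regular expression identical to the one stated in the text; adjectives that are not followed by a noun produce no phrase.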
Following this, we propagated all key phrases of a category whose weight exceeded a predefined
threshold towards the root category. A propagated key phrase was weighted with the average weight
of the phrase in all subcategories. Thus, it is possible not only to classify documents into
categories with similar documents, but also to classify documents that fit into several categories
into their common parent category.
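One propagation step, from a set of subcategories to their parent, might look like the sketch below. The text does not fix every detail, so several points are assumptions: category vectors are modelled as dictionaries from phrase to weight, a phrase is propagated if it exceeds the threshold in at least one subcategory, and the average is taken over all direct subcategories with a missing phrase counted as weight 0. The function name `propagate` is hypothetical.

```python
def propagate(child_vectors, threshold):
    """Return the phrase weights propagated from subcategories to their parent."""
    if not child_vectors:
        return {}
    # Phrases whose weight exceeds the threshold in at least one subcategory.
    candidates = {
        phrase
        for vector in child_vectors
        for phrase, weight in vector.items()
        if weight > threshold
    }
    n = len(child_vectors)
    # Assumption: average over all subcategories, absent phrases count as 0.
    return {
        phrase: sum(vector.get(phrase, 0.0) for vector in child_vectors) / n
        for phrase in sorted(candidates)
    }


children = [{"bank": 0.8, "aktie": 0.2}, {"bank": 0.4}]
parent_weights = propagate(children, threshold=0.5)
```

Here only "bank" passes the threshold, and it reaches the parent with the average of 0.8 and 0.4. Applying the same step level by level moves phrases towards the root category.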
A cross-validation method, the leave-one-out method (Weiss & Kulikowski, 1991), was
used for performance evaluation. This method removes an arbitrary document from the
database, renews the classification rules, and classifies the document into the structure again. We
performed this for 600 documents. To obtain a comparison with a text-only approach, which does not
use the knowledge of the HTML tags, we took the same documents without HTML tagging information
and applied the same classification steps to them.
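The leave-one-out procedure described above can be sketched generically as follows. The `train` and `classify` callables are hypothetical placeholders for the classifier under test, and the majority-class functions are toy stand-ins used only to make the example runnable; they are not the classifier from the text.

```python
from collections import Counter


def leave_one_out_accuracy(documents, labels, train, classify):
    """Fraction of documents classified correctly when each is held out in turn."""
    correct = 0
    for i, (doc, label) in enumerate(zip(documents, labels)):
        # Remove document i, then rebuild the classifier from the rest.
        rest_docs = documents[:i] + documents[i + 1:]
        rest_labels = labels[:i] + labels[i + 1:]
        model = train(rest_docs, rest_labels)
        # Classify the held-out document and compare with its known category.
        if classify(model, doc) == label:
            correct += 1
    return correct / len(documents)


# Toy stand-ins: always predict the most frequent training category.
def train_majority(docs, labels):
    return Counter(labels).most_common(1)[0][0]


def classify_majority(model, doc):
    return model


docs = ["d1", "d2", "d3", "d4"]
cats = ["banks", "banks", "banks", "insurance"]
accuracy = leave_one_out_accuracy(docs, cats, train_majority, classify_majority)
```

Running the same function once on documents with HTML tag information and once on the stripped documents gives the two accuracy figures that are compared.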
Results of the Tag-Based Experiments
Our experiments have shown that the information contained in HTML tags improves the accuracy of
the classification results from 28.2% to 38.3%. This corresponds to a relative improvement of 35.8%.
More details of these results can be found in Werner et al. (2005).
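The relative improvement follows directly from the two accuracy figures. The short check below reproduces the arithmetic; `relative_improvement` is a hypothetical helper name.

```python
def relative_improvement(baseline, improved):
    """Relative gain of `improved` over `baseline`, in percent."""
    return (improved - baseline) / baseline * 100.0


# Accuracy without and with HTML tag information, in percent.
gain = round(relative_improvement(28.2, 38.3), 1)
print(gain)  # 35.8
```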