Information Technology Reference
In-Depth Information
ments, we use the metric, F 1 measure (Larsen & Aone, 1999; Zhao & Karypis, 1999), which makes
use of true class labels of Web pages, to measure the quality of clusters in a Web page dataset. The F 1
measure indicates how well a cluster solution matches the true classes in the real world (e.g., the Yahoo
directory). In general, the greater F 1 score, the better clustering solu tio n.
In our experiments we test the existing methods CH(k) , KL(k) ,
sil , φ (k) and φ T (k) (see the second
section) to discover the number of clusters k for Web page datasets. These five metrics are computed for
different k 's for a Web page dataset. However, none of them works well. Our tests results showed that for
any dataset in Table 1 their estimated k is more than 5 times different from the true number of classes
in the Web page datasets and the corresponding cluster solutions have a lower than 0.3 F 1 score.
After many trials, we find that avgInter(k) for any dataset in Table 1 reaches a common threshold
of 1.7, when the F 1 measure of the cluster solution for a dataset is greatest. The relation between the
thresholds of avgInter(k) and the F 1 scores of a cluster solution, and the relation between the thresholds
of avgInter(k) and k 's for the four Web page datasets are illustrated in Figure 6.
In Figure 6 (a-1), (b-1), (c-1) and (d-1), the F 1 scores of cluster performances for the four datasets
reach the maximal values when the threshold of avgInter is 1.7, and further increasing or reducing the
threshold of avgInter would only worsen the F 1 scores for the datasets DS1 , DS2 , DS3 and DS4 . In other
words, once the weighted average inter-cluster similarity ( avgInter ) reaches the common threshold, 1.7,
the cluster solu tio n is found to be best for a Web page dataset. This shows that, unlike other metric such as
CH(k) , KL(k) ,
k
sil , or φ T (k) , avgInter implies a common characteristic in different Web page datasets.
Figure 6 (a-2), (b-2), (c-2) and (d-2) show the k 's for four Web page datasets produced by setting dif-
ferent thresholds for avgInter . In Figure 6 (a- 2) it is shown that the avgInter method is able to find k =1
while many existing methods are unable to do so. As shown in the figure, when avgInter reaches 1.7,
the best estimated values for k is found to be 2 for DS1 , 5 for DS2 , 21 for DS3 and 40 for DS4 .
The estimated k is usually greater than the number of true classes in a Web page dataset because
outliers are found and clustered into some small clusters, and a few true classes are partitioned into
more than one cluster with finer granularity. This situation is exactly shown in Table 2, which shows
the clustering solution for the most diverse dataset, DS4 , obtained when the threshold of avgInter is
1.7. The naming for a newly formed cluster is by selecting the top three descriptive terms. The rank-
ing of descriptive terms for a cluster is conducted by sorting the '
k
tf df values of terms in the cluster
( tf lj ′ is defined to be the number of Web pages containing term t j in cluster C l and df j is the document
frequency (Yang & Pedersen, 1997) of t j ). It can be noted that for most true classes, a true class has a
dominant cluster in Table 2. For instance, the dominant clusters for true class astronomy, anatomy and
evolution are cluster C 1 , C 4 and C 5 , respectively. We can see several true classes have been partitioned
more precisely into more than one cluster; e.g., true class automotive has been separated into cluster C 17
which is more related to car and auto , and cluster C 18 more related motorcycle and bike , as indicated by
their top descriptive terms. Similar situation happens to true class agriculture , health , education and
archaeology , each of which has been partitioned into two clusters. As shown in Table 2, outliers, which
have low purity scores, can be found as cluster C 32 , C 33 , …, and C 40 .
lj
j
discovering the number of Clusters
The constant factor described in the last section can be used to estimate the number of clusters in a
clustering process. The number of clusters k for a Web page dataset is estimated to be:
Search WWH ::




Custom Search