Web Mining by Automatically Organizing Web Pages into Categories - Distributed Artificial Intelligence, Agent Technology, and Collaborative Applications

Information Technology Reference

In-Depth Information

ments, we use the metric, F 1 measure (Larsen & Aone, 1999; Zhao & Karypis, 1999), which makes

use of true class labels of Web pages, to measure the quality of clusters in a Web page dataset. The F 1

measure indicates how well a cluster solution matches the true classes in the real world (e.g., the Yahoo

directory). In general, the greater F 1 score, the better clustering solu tio n.

In our experiments we test the existing methods CH(k) , KL(k) ,

sil , φ (k) and φ T (k) (see the second

section) to discover the number of clusters k for Web page datasets. These five metrics are computed for

different k 's for a Web page dataset. However, none of them works well. Our tests results showed that for

any dataset in Table 1 their estimated k is more than 5 times different from the true number of classes

in the Web page datasets and the corresponding cluster solutions have a lower than 0.3 F 1 score.

After many trials, we find that avgInter(k) for any dataset in Table 1 reaches a common threshold

of 1.7, when the F 1 measure of the cluster solution for a dataset is greatest. The relation between the

thresholds of avgInter(k) and the F 1 scores of a cluster solution, and the relation between the thresholds

of avgInter(k) and k 's for the four Web page datasets are illustrated in Figure 6.

In Figure 6 (a-1), (b-1), (c-1) and (d-1), the F 1 scores of cluster performances for the four datasets

reach the maximal values when the threshold of avgInter is 1.7, and further increasing or reducing the

threshold of avgInter would only worsen the F 1 scores for the datasets DS1 , DS2 , DS3 and DS4 . In other

words, once the weighted average inter-cluster similarity ( avgInter ) reaches the common threshold, 1.7,

the cluster solu tio n is found to be best for a Web page dataset. This shows that, unlike other metric such as

CH(k) , KL(k) ,

k

sil , or φ T (k) , avgInter implies a common characteristic in different Web page datasets.

Figure 6 (a-2), (b-2), (c-2) and (d-2) show the k 's for four Web page datasets produced by setting dif-

ferent thresholds for avgInter . In Figure 6 (a- 2) it is shown that the avgInter method is able to find k =1

while many existing methods are unable to do so. As shown in the figure, when avgInter reaches 1.7,

the best estimated values for k is found to be 2 for DS1 , 5 for DS2 , 21 for DS3 and 40 for DS4 .

The estimated k is usually greater than the number of true classes in a Web page dataset because

outliers are found and clustered into some small clusters, and a few true classes are partitioned into

more than one cluster with finer granularity. This situation is exactly shown in Table 2, which shows

the clustering solution for the most diverse dataset, DS4 , obtained when the threshold of avgInter is

1.7. The naming for a newly formed cluster is by selecting the top three descriptive terms. The rank-

ing of descriptive terms for a cluster is conducted by sorting the '

k

tf df values of terms in the cluster

( tf lj ′ is defined to be the number of Web pages containing term t j in cluster C l and df j is the document

frequency (Yang & Pedersen, 1997) of t j ). It can be noted that for most true classes, a true class has a

dominant cluster in Table 2. For instance, the dominant clusters for true class astronomy, anatomy and

evolution are cluster C 1 , C 4 and C 5 , respectively. We can see several true classes have been partitioned

more precisely into more than one cluster; e.g., true class automotive has been separated into cluster C 17

which is more related to car and auto , and cluster C 18 more related motorcycle and bike , as indicated by

their top descriptive terms. Similar situation happens to true class agriculture , health , education and

archaeology , each of which has been partitioned into two clusters. As shown in Table 2, outliers, which

have low purity scores, can be found as cluster C 32 , C 33 , …, and C 40 .

lj

j

discovering the number of Clusters

The constant factor described in the last section can be used to estimate the number of clusters in a

clustering process. The number of clusters k for a Web page dataset is estimated to be:

Distributed Artificial Intelligence, Agent Technology, and Collaborative Applications

Search WWH ::

Custom Search

Home