Web Mining by Automatically Organizing Web Pages into Categories - Distributed Artificial Intelligence, Agent Technology, and Collaborative Applications

Information Technology Reference

Table 2. continued

Cluster   Pages   Class       F1      Purity   Top three descriptive terms
C37       3       Music               0.333    bell, slide, serial
C38       1       Sociology           1.000    relief, portrait, davi
C39       2       Music               0.500    ontario, predict, archaeolog
C40       4       Music               0.250    unix, php, headlin
overall   2524                0.740   0.698

(F1 scores are given only for 24 clusters because those clusters represent true classes in dataset DS4. The purity (Strehl et al., 2000) and the top three descriptive terms are given for each cluster.)
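As a reading aid for the purity column, the sketch below computes the purity of a single cluster as the fraction of its pages that belong to its most common class. This is the usual definition of cluster purity; the function name `cluster_purity` and the class labels in the usage example are hypothetical and are not taken from dataset DS4.

```python
from collections import Counter


def cluster_purity(labels):
    """Return (majority_class, purity) for one cluster.

    `labels` holds the true class label of every page in the cluster.
    Purity is the share of pages that belong to the most common class.
    """
    if not labels:
        raise ValueError("cluster must contain at least one page")
    majority_class, majority_count = Counter(labels).most_common(1)[0]
    return majority_class, majority_count / len(labels)


# Hypothetical three-page cluster in which every page has a different class,
# so the majority class covers only one page out of three.
majority, purity = cluster_purity(["Music", "Sociology", "Physics"])
print(majority, round(purity, 3))
```

Under this definition a one-page cluster always has purity 1.000, and a three-page cluster whose pages all differ in class has purity 0.333, which matches the pattern of the small clusters in the table.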

\[
\operatorname*{arg\,max}_{k} \bigl( \mathit{avgInter}(k) \le 1.7 \bigr), \quad \text{where } 1 \le k \le n \tag{16}
\]

The value avgInter(k) is computed for different values of k. The k whose avgInter(k) is closest to, but still below, the threshold of 1.7 is selected as the final k for a Web page dataset.
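The selection rule can be sketched in a few lines, assuming avgInter(k) has already been computed for each candidate k. The function name `select_k` and the values in the usage example are hypothetical, and the comparison uses ≤ as in Equation (16).

```python
def select_k(avg_inter_by_k, threshold=1.7):
    """Pick the k whose avgInter(k) is closest to the threshold from below.

    `avg_inter_by_k` maps each candidate k to its precomputed avgInter(k).
    Returns None when no candidate stays at or below the threshold.
    """
    candidates = {k: v for k, v in avg_inter_by_k.items() if v <= threshold}
    if not candidates:
        return None
    # Among the admissible candidates, the largest avgInter is the closest
    # one to the threshold.
    return max(candidates, key=candidates.get)


# Hypothetical avgInter values for four candidate cluster counts.
scores = {5: 1.20, 10: 1.50, 20: 1.68, 40: 1.90}
print(select_k(scores))
```

With these made-up values, k = 40 is rejected because its avgInter exceeds 1.7, and k = 20 is chosen because 1.68 is the largest remaining value.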

For our bi-directional hierarchical clustering system, we determine the number of clusters by using this constant (1.7) as the stopping criterion in the clustering process. The process starts by arranging individual Web pages into clusters, then arranging those clusters into larger clusters, and so on until the average inter-cluster similarity avgInter(k) approaches the constant. As clusters are grouped into larger clusters, the value of avgInter(k) decreases. This grouping process (the bottom-up cluster-merging phase) stops when avgInter(k) approaches 1.7, and the final number of clusters is obtained automatically as a result.
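The stopping rule of the bottom-up merging phase can be illustrated with a minimal sketch. It is not the chapter's bi-directional algorithm: it only shows a generic merge loop that halts once avgInter drops to the threshold. The similarity measure and the avgInter definition are supplied by the caller, and the `toy_similarity` and `toy_avg_inter` functions below are hypothetical stand-ins chosen only so that the example runs on plain numbers.

```python
from itertools import combinations


def merge_until_threshold(items, pair_similarity, avg_inter, threshold=1.7):
    """Merge the most similar pair of clusters until avg_inter <= threshold."""
    clusters = [[item] for item in items]  # start from singleton clusters
    while len(clusters) > 1 and avg_inter(clusters) > threshold:
        # Find the indices of the two most similar clusters.
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda ij: pair_similarity(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters


def centroid(cluster):
    return sum(cluster) / len(cluster)


def toy_similarity(a, b):
    # Hypothetical similarity in (0, 2]: closer centroids score higher.
    return 2.0 / (1.0 + abs(centroid(a) - centroid(b)) / 30.0)


def toy_avg_inter(clusters):
    # Hypothetical avgInter: mean similarity over all pairs of clusters.
    pairs = list(combinations(clusters, 2))
    if not pairs:
        return 0.0
    return sum(toy_similarity(a, b) for a, b in pairs) / len(pairs)


result = merge_until_threshold([0.0, 1.0, 3.0, 7.0], toy_similarity, toy_avg_inter)
print(len(result), sorted(sorted(c) for c in result))
```

The number of clusters is never passed in: it is simply the length of the list that remains when the loop stops, which mirrors how the final k is obtained automatically in the text.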

Conclusion and Future Research

Because the Web contains a vast amount of information, Web mining has proved to be an important area of research. In this chapter, we focused on automatically organizing Web pages into categories by clustering. Although many methods for finding the number of clusters in a dataset have been proposed, none of them is satisfactory for clustering Web page datasets. Finding the number of clusters is often treated as an ill-defined problem because it remains unclear how well defined a cluster should be. Recognizing this, we preferred hierarchical clustering methods, which let us view clusters at different levels, with coarser granularity at higher levels and finer granularity at lower levels. For Web mining in particular, our bi-directional hierarchical clustering method arranges Web pages into a hierarchy of categories that users can browse at different levels of granularity.

Besides proposing the new bi-directional hierarchical clustering algorithm, we investigated the problem of estimating the number of clusters, k, for Web page datasets. We found that the average inter-cluster similarity (avgInter) can be used as a criterion to estimate k for Web page datasets. Our experiments showed that the clustering solutions achieve the best results when the avgInter for a Web page dataset reaches a threshold of 1.7. Compared to other criteria, avgInter implies a character-
