Table 2. continued

C 37      3      Music       0.333   bell, slide, serial
C 38      1      Sociology   1.000   relief, portrait, davi
C 39      2      Music       0.500   ontario, predict, archaeolog
C 40      4      Music       0.250   unix, php, headlin
overall   2524   0.740   0.698
(F1 scores are given only for 24 clusters because those clusters represent true classes in dataset DS4. The purity (Strehl et al.,
2000) and the top three descriptive terms are given for each cluster.)
where $1 \le k < n$.

$$\operatorname*{arg\,max}_{k} \bigl( \mathit{avgInter}(k) < 1.7 \bigr) \tag{16}$$
avgInter(k) is computed for different values of k. The k whose avgInter(k) is closest to, but still less
than, the threshold 1.7 is selected as the final k for a Web page dataset.
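The selection rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `select_k`, the dictionary input, and the sample values are hypothetical, and the sketch takes the avgInter(k) values as already computed rather than showing how avgInter itself is calculated.

```python
from typing import Mapping, Optional

THRESHOLD = 1.7  # threshold stated in the text


def select_k(avg_inter_by_k: Mapping[int, float],
             threshold: float = THRESHOLD) -> Optional[int]:
    """Return the k whose avgInter(k) is closest to, but below, the threshold.

    avg_inter_by_k maps each candidate k to a precomputed avgInter(k).
    Returns None when no candidate falls below the threshold.
    """
    # Keep only the candidates that satisfy avgInter(k) < threshold.
    below = {k: v for k, v in avg_inter_by_k.items() if v < threshold}
    if not below:
        return None
    # Among those, the largest avgInter(k) is the closest one to the threshold.
    return max(below, key=below.get)


# Hypothetical avgInter(k) values, for illustration only.
candidates = {5: 1.2, 10: 1.5, 20: 1.68, 40: 1.9}
chosen_k = select_k(candidates)
```

With these made-up values, `chosen_k` is 20, because 1.68 is the largest value that stays below 1.7.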
For our bi-directional hierarchical clustering system, we determine the number of clusters by using
this constant, 1.7, as the stopping factor in the clustering process. Our hierarchical clustering process
starts by arranging individual Web pages into clusters, then arranges those clusters into larger clusters,
and so on, until the average inter-cluster similarity avgInter(k) approaches the constant. As clusters are
grouped to form larger clusters, the value of avgInter(k) decreases. This grouping process (the bottom-up
cluster-merging phase) stops when avgInter(k) approaches 1.7, and the final number of clusters is
obtained automatically as the result.
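A minimal sketch of this stopping rule follows. It is not the chapter's bi-directional algorithm: the names `merge_until_threshold`, `avg_inter`, and `pick_pair` are hypothetical, and both the similarity measure and the choice of which two clusters to merge are supplied by the caller, since neither is specified here. The sketch shows only the loop that merges until avgInter drops below 1.7.

```python
from typing import Callable, List, Sequence, Tuple

Cluster = List[int]  # a cluster is a list of page indices (assumption)


def merge_until_threshold(
    clusters: Sequence[Cluster],
    avg_inter: Callable[[List[Cluster]], float],
    pick_pair: Callable[[List[Cluster]], Tuple[int, int]],
    threshold: float = 1.7,
) -> List[Cluster]:
    """Merge clusters bottom-up until avgInter falls below the threshold.

    avg_inter: caller-supplied average inter-cluster similarity (placeholder).
    pick_pair: caller-supplied rule returning the indices of two clusters
               to merge next (placeholder).
    """
    current = [list(c) for c in clusters]
    # Merging lowers avgInter, so stop at the first value below the threshold.
    while len(current) > 1 and avg_inter(current) >= threshold:
        i, j = pick_pair(current)
        merged = current[i] + current[j]
        current = [c for idx, c in enumerate(current) if idx not in (i, j)]
        current.append(merged)
    return current
```

The number of clusters is then simply `len(result)`. Because the loop checks avgInter before every merge, it relies on the behavior described above, namely that each merge reduces avgInter.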
Conclusion and Future Research
Since the Web contains a vast amount of information, Web mining has proved to be an important
area of research. In this chapter, we focused on automatically organizing Web pages into categories by
clustering. Although many methods of finding the number of clusters for a dataset have been proposed,
none of them is satisfactory for clustering Web page datasets. Finding the number of clusters for a
dataset is often treated as an ill-defined question because it is still questionable how well a cluster
should be defined. Recognizing this, we preferred hierarchical clustering methods, which allow us
to view clusters at different levels, with coarser granularity at the higher levels and finer granularity at
the lower levels. For Web mining in particular, our bi-directional hierarchical clustering method is able
to arrange Web pages into a hierarchy of categories that allows users to browse the results at different
levels of granularity.
Besides proposing the new bi-directional hierarchical clustering algorithm, we investigated the
problem of estimating the number of clusters, k, for Web page datasets. We found that the average
inter-cluster similarity (avgInter) can be used as a criterion to estimate k for Web page datasets.
Our experiments showed that the clustering solutions achieve the best results when the avgInter for a
Web page dataset reaches a threshold of 1.7. Compared to other criteria, avgInter implies a character-
 