Information Technology Reference
In-Depth Information
ments, we use the metric,
F
1
measure (Larsen & Aone, 1999; Zhao & Karypis, 1999), which makes
use of true class labels of Web pages, to measure the quality of clusters in a Web page dataset. The
F
1
measure indicates how well a cluster solution matches the true classes in the real world (e.g., the Yahoo
directory). In general, the greater
F
1
score, the better clustering solu
tio
n.
In our experiments we test the existing methods
CH(k)
,
KL(k)
,
sil
, φ
(k)
and φ
T
(k)
(see the second
section) to discover the number of clusters
k
for Web page datasets. These five metrics are computed for
different
k
's for a Web page dataset. However, none of them works well. Our tests results showed that for
any dataset in Table 1 their estimated
k
is more than 5 times different from the true number of classes
in the Web page datasets and the corresponding cluster solutions have a lower than 0.3
F
1
score.
After many trials, we find that
avgInter(k)
for any dataset in Table 1 reaches a common threshold
of 1.7, when the
F
1
measure of the cluster solution for a dataset is greatest. The relation between the
thresholds of
avgInter(k)
and the
F
1
scores of a cluster solution, and the relation between the thresholds
of
avgInter(k)
and
k
's for the four Web page datasets are illustrated in Figure 6.
In Figure 6 (a-1), (b-1), (c-1) and (d-1), the
F
1
scores of cluster performances for the four datasets
reach the maximal values when the threshold of
avgInter
is 1.7, and further increasing or reducing the
threshold of
avgInter
would only worsen the
F
1
scores for the datasets
DS1
,
DS2
,
DS3
and
DS4
. In other
words, once the weighted average inter-cluster similarity (
avgInter
) reaches the common threshold, 1.7,
the cluster solu
tio
n is found to be best for a Web page dataset. This shows that, unlike other metric such as
CH(k)
,
KL(k)
,
k
sil
, or φ
T
(k)
,
avgInter
implies a common characteristic in different Web page datasets.
Figure 6 (a-2), (b-2), (c-2) and (d-2) show the
k
's for four Web page datasets produced by setting dif-
ferent thresholds for
avgInter
. In Figure 6 (a- 2) it is shown that the
avgInter
method is able to find
k
=1
while many existing methods are unable to do so. As shown in the figure, when
avgInter
reaches 1.7,
the best estimated values for
k
is found to be 2 for
DS1
, 5 for
DS2
, 21 for
DS3
and 40 for
DS4
.
The estimated
k
is usually greater than the number of true classes in a Web page dataset because
outliers are found and clustered into some small clusters, and a few true classes are partitioned into
more than one cluster with finer granularity. This situation is exactly shown in Table 2, which shows
the clustering solution for the most diverse dataset,
DS4
, obtained when the threshold of
avgInter
is
1.7. The naming for a newly formed cluster is by selecting the top three descriptive terms. The rank-
ing of descriptive terms for a cluster is conducted by sorting the
'
k
tf df
values of terms in the cluster
(
tf
lj
′ is defined to be the number of Web pages containing term
t
j
in cluster
C
l
and
df
j
is the document
frequency (Yang & Pedersen, 1997) of
t
j
). It can be noted that for most true classes, a true class has a
dominant cluster in Table 2. For instance, the dominant clusters for true class
astronomy, anatomy
and
evolution
are cluster
C
1
, C
4
and
C
5
, respectively. We can see several true classes have been partitioned
more precisely into more than one cluster; e.g., true class
automotive
has been separated into cluster
C
17
which is more related to
car
and
auto
, and cluster
C
18
more related
motorcycle
and
bike
, as indicated by
their top descriptive terms. Similar situation happens to true class
agriculture
,
health
,
education
and
archaeology
, each of which has been partitioned into two clusters. As shown in Table 2, outliers, which
have low purity scores, can be found as cluster
C
32
,
C
33
, …, and
C
40
.
lj
j
discovering the number of Clusters
The constant factor described in the last section can be used to estimate the number of clusters in a
clustering process. The number of clusters
k
for a Web page dataset is estimated to be:
Search WWH ::
Custom Search