Figure 5. Moving a sub-node x into its connected node with the greatest gain
[Figure: node diagram with labels v', u, v, and x]
Table 1. Compositions of four representative Web page datasets

DS1: true classes = 2, the number of Web pages = 766, dimension = 1327
  true class (the number of Web pages):
  agriculture(73) astronomy(693)

DS2: true classes = 4, the number of Web pages = 664, dimension = 1362
  astronomy(169) biology(234) alternative(119) mathematics(142)

DS3: true classes = 12, the number of Web pages = 1215, dimension = 1543
  agriculture(108) astronomy(92) evolution(74) genetics(108) health(127)
  music(103) taxes(80) religion(113) sociology(110) jewelry(108) network(101)
  sports(91)

DS4: true classes = 24, the number of Web pages = 2524, dimension = 2699
  agriculture(87) astronomy(96) anatomy(85) evolution(76) plants(124)
  genetics(106) mathematics(106) health(128) hardware(127) forestry(68)
  radio(115) music(104) automotive(109) taxes(82) government(147)
  religion(114) education(124) art(101) sociology(108) archaeology(105)
  jewelry(106) banking(72) network(88) sports(146)
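
As a quick consistency check on Table 1, the following sketch transcribes the class sizes and confirms that they add up to the stated number of classes and Web pages for each dataset. The names TABLE_1, STATED, and summarize, as well as the largest-to-smallest class size ratio, are illustrative assumptions and are not part of the original experiments.

```python
# Class sizes transcribed from Table 1: true class -> number of Web pages.
TABLE_1 = {
    "DS1": {"agriculture": 73, "astronomy": 693},
    "DS2": {"astronomy": 169, "biology": 234, "alternative": 119, "mathematics": 142},
    "DS3": {
        "agriculture": 108, "astronomy": 92, "evolution": 74, "genetics": 108,
        "health": 127, "music": 103, "taxes": 80, "religion": 113,
        "sociology": 110, "jewelry": 108, "network": 101, "sports": 91,
    },
    "DS4": {
        "agriculture": 87, "astronomy": 96, "anatomy": 85, "evolution": 76,
        "plants": 124, "genetics": 106, "mathematics": 106, "health": 128,
        "hardware": 127, "forestry": 68, "radio": 115, "music": 104,
        "automotive": 109, "taxes": 82, "government": 147, "religion": 114,
        "education": 124, "art": 101, "sociology": 108, "archaeology": 105,
        "jewelry": 106, "banking": 72, "network": 88, "sports": 146,
    },
}

# Header values stated in Table 1: (true classes, number of Web pages).
STATED = {"DS1": (2, 766), "DS2": (4, 664), "DS3": (12, 1215), "DS4": (24, 2524)}


def summarize(classes):
    """Return class count, page total, and a hypothetical imbalance measure."""
    sizes = list(classes.values())
    return {
        "k": len(sizes),
        "pages": sum(sizes),
        # Assumed measure for illustration: largest class size / smallest class size.
        "size_ratio": max(sizes) / min(sizes),
    }


for name, classes in TABLE_1.items():
    summary = summarize(classes)
    # The per-class counts should reproduce the header line of each dataset.
    assert (summary["k"], summary["pages"]) == STATED[name]
    print(f"{name}: k={summary['k']}, pages={summary['pages']}, "
          f"largest/smallest={summary['size_ratio']:.2f}")
```

Under this assumed measure, DS1 stands out with a ratio of roughly 9.5 (693 astronomy pages against 73 agriculture pages), while the other three datasets stay below 2.2.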
can be used for Web browsing, with larger and more general clusters at higher levels and smaller,
more specific clusters at lower levels.
Web Page datasets for experiments
To test our bi-directional hierarchical clustering algorithm and to discover a new constant
stopping factor, we conducted a number of experiments on Web page datasets. Here we report four
Web page datasets taken from Yahoo.com (see Table 1), which represent datasets of different sizes and
different granularity; we skip the other datasets for brevity, since their experimental results were found
to have similar quality. The first dataset, DS1, contains 766 Web pages randomly selected
from two true classes: agriculture and astronomy. This dataset is designed to show our method of
estimating the number of clusters k in a dataset that consists of clusters of widely different sizes: The
 