Latent Semantic Space for Web Clustering - Data Mining: Foundations and Practice


Fig. 6. LSS system over the search engine Google

to evaluate the clustering performance [3] based on the human expert's decisions. More than one hundred queries related to medicine have been submitted from our system to cluster the results returned from PubMed and Google, respectively. More than two hundred thousand Web pages or snippets have been returned. In general, the average entropy is around 0.14 ± 0.06 for PubMed and 0.27 ± 0.08 for Google. The difference arises because PubMed provides metadata defined by human experts for each medical article. Without these metadata, the average entropy becomes 0.21 ± 0.09. From these results, we conclude that the CONCEPTs organized by LSS yield a nearly precise semantic concept clustering of Web pages.
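As a rough illustration of the evaluation measure, the sketch below computes a size-weighted average entropy of clusters against expert-assigned category labels. The exact formula and logarithm base used in [3] are not given here, so the weighting, the natural logarithm, and the function names `cluster_entropy` and `average_entropy` are assumptions made for illustration only.

```python
import math
from collections import Counter


def cluster_entropy(labels):
    """Entropy of the expert labels inside one cluster (natural log, assumed)."""
    n = len(labels)
    counts = Counter(labels)
    # A cluster whose pages all share one label has entropy 0.
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def average_entropy(clusters):
    """Mean cluster entropy, weighted by cluster size (assumed weighting)."""
    total = sum(len(c) for c in clusters)
    return sum(len(c) / total * cluster_entropy(c) for c in clusters)
```

Each cluster is passed as a list of the expert labels of its pages. Lower values indicate that clusters agree more closely with the expert's categories, and a clustering in which every cluster contains a single label scores 0.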

6 Conclusion

Polysemy, phrases and term dependency are the limitations of search technology [12]. A single term is not able to identify a latent concept in a document; for instance, the term “Network” associated with the term “Computer”, “Traffic”, or “Neural” denotes different concepts. Discriminating term associations is undoubtedly a concrete way to distinguish one category from the others. A group of solid term associations can clearly identify a concept. The term associations (frequently co-occurring terms) of a given collection of Web pages form a simplicial complex. The complex can be decomposed into connected components at various levels (in skeletons of various levels). We believe each such connected component properly identifies a concept in a collection of Web pages.
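The idea of decomposing term associations into connected components can be sketched in a simplified form. The example below covers only the lowest level: it treats frequently co-occurring term pairs as edges (the 1-skeleton) and groups terms with a union-find structure. It is not the LSS implementation; the function names and the `min_support` threshold are hypothetical, and higher-dimensional simplices are omitted.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(docs, min_support):
    """Term pairs that co-occur in at least min_support documents."""
    counts = Counter()
    for terms in docs:
        for pair in combinations(sorted(set(terms)), 2):
            counts[pair] += 1
    return {pair for pair, c in counts.items() if c >= min_support}


def connected_components(edges):
    """Group terms linked by the given edges, using union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    groups = {}
    for term in list(parent):
        groups.setdefault(find(term), set()).add(term)
    return sorted(sorted(g) for g in groups.values())
```

Calling `connected_components(frequent_pairs(docs, min_support))` on a list of per-page term lists returns one sorted term group per component. Raising `min_support` keeps only the more solid associations, which tends to split the terms into more, smaller groups.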
