Latent Semantic Space for Web Clustering - Data Mining: Foundations and Practice


of PRIMITIVE CONCEPTs and the results are very encouraging.¹ These results can be obtained directly from search engines. All the returned results are automatically clustered into different topics. The authoritative web pages in each topic are ranked according to how closely they belong to the topic. The experimental results indicate that we have an effective way to organize the large number of results returned by a web query.

The Internet is an ocean of information. How to marshal the large number of returned web pages, paragraphs, or sentences is the key issue. Roughly speaking, we decompose (triangulate, partition, granulate) the LSS of documents (e.g., returned web pages or sentences) into a simplicial complex in combinatorial topology [23], which can be viewed as a special form of hypergraph. We should note, however, that the notion of simplicial complexes predates that of hypergraphs by about half a century, even though the latter notion is more familiar to modern computer scientists.

Let us recall some examples to illustrate the main intuition. The association that consists of “wall” and “street” denotes some financial notions that have meaning beyond the two nodes, “wall” and “street”. This is similar to the notion of an open segment (v_0, v_1), which represents a one-dimensional geometric object, a 1-simplex, that carries information beyond its two end points. In general, an r-association represents some semantics generated by a set of r keywords; it may carry more semantics than the individual keywords, or even have nothing to do with them. A mathematical structure that reflects such phenomena is the notion of simplicial complex in combinatorial topology; see Sect. 3.
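The analogy between associations and simplices can be made concrete with a small sketch. The code below is only an illustration under stated assumptions: the function names (faces, closure, dimension) and the sample keyword sets are hypothetical and do not come from the paper. It treats each association as a set of keywords and builds the smallest abstract simplicial complex containing the given associations, that is, the family of all their non-empty subsets.

```python
from itertools import combinations


def faces(simplex):
    """Return every non-empty subset (face) of a simplex given as keywords."""
    vertices = sorted(set(simplex))
    return {
        frozenset(combo)
        for size in range(1, len(vertices) + 1)
        for combo in combinations(vertices, size)
    }


def closure(simplices):
    """Smallest abstract simplicial complex containing the given simplices."""
    complex_ = set()
    for simplex in simplices:
        complex_ |= faces(simplex)
    return complex_


def dimension(simplex):
    """A simplex on q + 1 distinct vertices has dimension q."""
    return len(set(simplex)) - 1


# Hypothetical associations, chosen only for illustration.
associations = [("wall", "street"), ("data", "mining", "web")]
K = closure(associations)
```

Here the pair ("wall", "street") is a 1-simplex whose faces are the two keywords and the pair itself. Closure under taking subsets is what distinguishes a simplicial complex from a general hypergraph, whose hyperedges need not contain their sub-edges.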

The thesis of this paper is that the simplicial complex of term-associations reflects the structure of the concepts in the LSS of the documents. Based on such a conceptual structure, the documents (returned pages, paragraphs, or sentences) can be effectively clustered.

2 Key Terms and TFIDF

The notion of association rules was introduced by Agrawal et al. [1] and has been demonstrated to be useful in several domains [4, 5], such as retail sales transaction databases. In the theory, two standard measures, called support and confidence, are often used. For documents, the order of keywords and the direction of rules are not essential. Our focus will be on the support; a set of items that meets the support threshold is often referred to as a frequent itemset. We will call such sets associations (undirected association rules) to indicate that the emphasis is on their meaning rather than on the phenomenon of frequency.
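As a rough illustration of support-based associations, the sketch below counts, for each keyword set up to a given size, the number of documents that contain it and keeps the sets that meet a threshold. It is a brute-force sketch under assumptions, not the authors' procedure: the function name frequent_associations, the parameters min_support and max_size, the use of an absolute document count as the support, and the sample documents are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_associations(documents, min_support, max_size=2):
    """Map each keyword set to its document count, keeping sets with enough support.

    Support is taken here as the number of documents containing every keyword
    of the set (an assumption; a relative frequency would work equally well).
    """
    counts = Counter()
    for doc in documents:
        terms = sorted(set(doc))  # keyword order and repetition are ignored
        for size in range(1, max_size + 1):
            for combo in combinations(terms, size):
                counts[frozenset(combo)] += 1
    return {itemset: n for itemset, n in counts.items() if n >= min_support}


# Hypothetical documents, each reduced to a list of keywords.
docs = [
    ["wall", "street", "stock"],
    ["wall", "street", "bank"],
    ["wall", "paint"],
]
result = frequent_associations(docs, min_support=2)
```

With these documents, the sets that reach a support of 2 are {wall}, {street}, and {wall, street}. Because every subset of a frequent keyword set is itself frequent, the resulting family is closed under taking non-empty subsets, which is the closure condition of a simplicial complex.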

The frequency distribution of a word or phrase in a document collection is

quite different from the item frequency distribution in a retail sales transaction

database. In [14], we have shown that isomorphic relations have isomorphic

¹ The search engine's web site is at “http://ginni.bme.ntu.edu.tw/”; it runs on a Pentium IV personal computer.
