of PRIMITIVE CONCEPTs, and the results are very encouraging (footnote 1). These
results can be obtained directly from search engines. All the returned results are
automatically clustered into different topics. The authoritative web pages in
each topic are ranked by how strongly they belong to the topic.
The experimental results indicate that we have an effective way to organize
the large number of results returned by a web query.
The Internet is an ocean of information. How to marshal the large number of returned
web pages, paragraphs, or sentences is the key issue. Roughly speaking, we decompose
(triangulate, partition, granulate) the LSS of documents (e.g., returned
web pages or sentences) into a simplicial complex in combinatorial topology [23],
which can be viewed as a special form of hypergraph. However, we should
note that the notion of simplicial complexes actually predates that of hypergraphs
by about half a century, even though the latter notion is more familiar
to modern computer scientists.
Let us recall some examples to illustrate the main intuition. The association
that consists of “wall” and “street” denotes some financial notions that
have meaning beyond the two nodes, “wall” and “street”. This is similar to
the notion of an open segment (v_0, v_1), which represents a one-dimensional
geometric object, a 1-simplex, that carries information beyond its two end points.
In general, an r-association represents some semantics generated by a set of
r keywords; it may carry more semantics than the individual keywords, or even
have nothing to do with them. A mathematical structure that reflects such phenomena
is the notion of simplicial complex in combinatorial topology; see Sect. 3.
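To make the analogy concrete, the following minimal Python sketch treats a set of r keywords as an (r-1)-simplex and checks the standard closure condition of an abstract simplicial complex, namely that every non-empty subset (face) of a member is also a member. The function names `faces` and `is_simplicial_complex` and the toy keyword sets are illustrative assumptions, not taken from the paper.

```python
from itertools import combinations

def faces(simplex):
    """Return all non-empty subsets (faces) of a simplex given as a set of vertices."""
    vertices = sorted(simplex)
    return {
        frozenset(combo)
        for r in range(1, len(vertices) + 1)
        for combo in combinations(vertices, r)
    }

def is_simplicial_complex(simplices):
    """Check the closure condition: every face of every simplex is in the family."""
    family = {frozenset(s) for s in simplices}
    return all(faces(s) <= family for s in family)

# Hypothetical keyword sets: the 1-simplex {"wall", "street"} together with
# its two vertices forms a complex; the 1-simplex alone does not.
closed = [{"wall"}, {"street"}, {"wall", "street"}]
not_closed = [{"wall", "street"}]

print(is_simplicial_complex(closed))
print(is_simplicial_complex(not_closed))
```

In this representation the dimension of a simplex is its number of keywords minus one, so the pair “wall”, “street” is a 1-simplex, matching the open segment in the text.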
The thesis of this paper is that the simplicial complex of term-associations
reflects the structure of the concepts in the LSS of the documents. Based on such
a conceptual structure, the documents (returned pages, paragraphs, or sentences)
can be effectively clustered.
2 Key Terms and TFIDF
The notion of association rules was introduced by Agrawal et al. [1] and has
been demonstrated to be useful in several domains [4, 5], such as retail sales
transaction databases. In the theory, two standard measures, called support and
confidence, are often used. For documents, the order of keywords and the direction
of rules are not essential. Our focus will be on the support; a set of items that
meets the support threshold is often referred to as a frequent itemset. We will call
such sets associations (undirected association rules) to indicate that the emphasis
is on their meaning more than on the phenomenon of frequency.
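As an illustration of support-based associations, here is a minimal Python sketch. It assumes that each document plays the role of a transaction, that a document is given as a list of keywords, and that support is the number of documents containing every keyword in the set. The function name `frequent_associations`, the `max_size` limit, and the toy documents are hypothetical. The sketch enumerates keyword sets by brute force; it is not the Apriori algorithm and is only suitable for small inputs.

```python
from collections import Counter
from itertools import combinations

def frequent_associations(documents, min_support, max_size=2):
    """Return {keyword set: document count} for the keyword sets that occur
    together in at least `min_support` documents (brute-force enumeration)."""
    counts = Counter()
    for doc in documents:
        # Keyword order and repetition within a document are ignored,
        # which makes the resulting associations undirected.
        terms = sorted(set(doc))
        for r in range(1, max_size + 1):
            for combo in combinations(terms, r):
                counts[frozenset(combo)] += 1
    return {s: c for s, c in counts.items() if c >= min_support}

# Hypothetical toy collection of four documents.
docs = [
    ["wall", "street", "stock"],
    ["wall", "street", "bank"],
    ["wall", "paint"],
    ["street", "map"],
]

assoc = frequent_associations(docs, min_support=2, max_size=2)
for keywords, support in sorted(assoc.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(keywords), support)
```

With a support threshold of 2, only “wall”, “street”, and the pair of the two survive in this toy collection; the pair is the kind of 2-association discussed above.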
The frequency distribution of a word or phrase in a document collection is
quite different from the item frequency distribution in a retail sales transaction
database. In [14], we have shown that isomorphic relations have isomorphic
1 The search engine's web site is at “http://ginni.bme.ntu.edu.tw/”; it runs on a
Pentium IV personal computer.