4.2 Algorithm
As we already know, an r-simplex is an (r+1)-term association (a frequent
(r+1)-itemset). Web pages can be clustered based on maximal simplexes of any
dimension (CONCEPTS), i.e., associations. Note that Web pages clustered by
CONCEPTS may contain common lower-dimensional faces (shorter associations,
in particular 0-simplexes); this is a consequence of the a priori property.
In this sense, the methodology provides a soft approach: we allow CONCEPTS
that overlap in lower-dimensional faces to exist within different clusters.
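As a small illustration of this overlap, the sketch below enumerates the proper faces (sub-associations) of two maximal simplexes and intersects them. The helper name faces and the example terms are hypothetical and chosen only for illustration; they do not come from the text.

```python
from itertools import combinations

def faces(simplex):
    """Return every proper, nonempty face (sub-association) of a simplex."""
    terms = sorted(simplex)
    return {
        frozenset(combo)
        for size in range(1, len(terms))
        for combo in combinations(terms, size)
    }

# Two hypothetical maximal simplexes (CONCEPTS) over made-up terms.
concept_a = frozenset({"wall", "street", "stock"})
concept_b = frozenset({"wall", "paint", "brick"})

# Faces common to both CONCEPTS; here the only one is the 0-simplex {"wall"}.
shared = faces(concept_a) & faces(concept_b)
print(shared)
```

The two CONCEPTS stay distinct clusters even though they share a lower-dimensional face, which is the soft behaviour described above.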
Since the intersection of connected components has lower dimension, it is
convenient to design an efficient algorithm for document clustering in a
skeleton-by-skeleton fashion. The algorithm for finding all maximal connected
components based on a user query in a skeleton is listed as follows.
Require: V = {t_1, t_2, ..., t_n} is the vertex set of all reserved terms in a
         collection of documents.
Ensure: H is the hierarchy of connected components.
Let t_0 be the user query.
Let θ be a given minimal support.
Let S_0 = {t_0} be the root of the hierarchy.
Let Support(S_0) be the support of associations of the terms in S_0.
H ⇐ S_0
i ⇐ 0
while S_i ≠ ∅ do
    for all vertices t_j ∈ V with t_j ∉ S_i do
        S_(i+1) ⇐ S_i ∪ {t_j} if Support(S_(i+1)) is no less than θ.
        Add S_(i+1) to be the child of S_i
    end for
    Let S_(i+1)
    i ⇐ (i+1)
end while
In our notation, S_i is a skeleton of S_0. It is clear that one can get S_m
for any n and m. A simplex is constructed by including all co-occurring
terms whose support is greater than or equal to a given minimal support θ.
An external vertex is added to a simplex if the resulting support is no
less than θ.
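The procedure can be sketched in Python as follows. This is one possible reading of the pseudocode, not the authors' implementation. It assumes that each document is represented as a set of terms and that support is the fraction of documents containing every term of a candidate set (the text does not fix a definition). The names support and build_hierarchy and the toy data are hypothetical.

```python
def support(terms, documents):
    """Assumed definition: fraction of documents containing every term."""
    hits = sum(1 for doc in documents if terms <= doc)
    return hits / len(documents)

def build_hierarchy(query, vocabulary, documents, theta):
    """Grow term sets level by level, starting from the query term.

    Returns a dict mapping each term set to the list of its children.
    """
    root = frozenset({query})
    hierarchy = {root: []}
    level = [root]
    while level:                       # stop when no set can be extended
        next_level = []
        for current in level:
            for term in sorted(vocabulary - current):
                child = current | {term}
                # Add an external vertex only if support stays >= theta.
                if support(child, documents) >= theta:
                    hierarchy[current].append(child)
                    if child not in hierarchy:
                        hierarchy[child] = []
                        next_level.append(child)
        level = next_level
    return hierarchy

# Toy collection with made-up terms.
documents = [
    {"wall", "street", "stock"},
    {"wall", "street", "stock"},
    {"wall", "paint"},
    {"wall", "street"},
]
vocabulary = {"wall", "street", "stock", "paint"}
hierarchy = build_hierarchy("wall", vocabulary, documents, theta=0.5)

# Term sets without children are the maximal ones reachable from the query.
maximal = [s for s, children in hierarchy.items() if not children]
print(maximal)
```

A term set reached from two different parents is stored once and listed as a child of both, so the result is a directed acyclic graph rather than a strict tree.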
According to our algorithm, the simplex is constructed starting from one
term, that is, the user query. All the noun phrases in the Web pages returned
from remote search engines are selected for document clustering. Web pages can
be decomposed into several categories based on the PRIMITIVE CONCEPTS.
If a Web page contains a PRIMITIVE CONCEPT, the Web page closely matches
that concept; hence, by the a priori property, all the sub-associations
in the concept are also contained in this Web page. The Web page can be
classified into the category identified with that concept. A document
often contains more than one PRIMITIVE CONCEPT, in which case it can be
classified into multiple categories.
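The multi-category assignment can be illustrated with a minimal sketch, assuming that a Web page "contains" a concept when every term of the concept occurs among the page's noun phrases. The function classify, the category names, and the terms are hypothetical.

```python
def classify(page_terms, concepts):
    """Return the names of all concepts whose terms all occur in the page."""
    return [name for name, terms in concepts.items() if terms <= page_terms]

# Hypothetical PRIMITIVE CONCEPTS, each identified with a category name.
concepts = {
    "finance": frozenset({"wall", "street", "stock"}),
    "construction": frozenset({"wall", "paint", "brick"}),
}

# A page whose noun phrases cover both concepts falls into both categories.
page = {"wall", "street", "stock", "paint", "brick", "news"}
print(classify(page, concepts))
```

A page that contains only part of a concept, such as {"wall", "street"}, matches no category under this assumption.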