Database Reference
In-Depth Information
keywords) match the upcoming query. A hierarchical clustering technique is typically used in these
approaches, and different strategies for matching the query against the document hierarchy have been
proposed, most notably a top-down or a bottom-up search and their variants (Jardine & Van Rijsbergen,
1971; Van Rijsbergen & Croft, 1975; Croft, 1980; Voorhees, 1985). Similarly, search engines (e.g., Ya-
hoo) and product catalog search (e.g., eBay) use a category structure created in advance and then group
search results into separate categories. In all these approaches, if a query does not match any cluster
representation of one of the pre-defined clusters or categories, then it fails to match any documents
even if the document collection contains relevant results. It is worth noticing that this problem is not
intrinsic to clustering, but is due to the fact that keyword representation of clusters is often insufficient
to apprehend the meaning of documents in a cluster.
An alternative way of using the cluster hypothesis is in the presentation of retrieval results, that is
by presenting, in a clustered form, only documents that have been retrieved in response to the query.
This idea was first introduced in the Scatter/Gather system (Hearst & Pedersen, 1996) which is based
on a variant of the classical k-means algorithm (Hartigan & Wong, 1979). Since then, several classes of
algorithms have been proposed such as STC (Zamir & Etzioni, 1999), SHOC (Zeng, He, Chen, Ma &
Ma, 2004), EigenCluster (Cheng, Kannan, Vempala & Wang, 2006), SnakeT (Ferragina & Gulli, 2005).
Note that such algorithms introduce a noticeable time overhead to the query processing, due to the large
number of results returned by the search engine. The reader is referred to (Manning, Raghavan & Schtze,
2008) for more details on IR clustering techniques.
All the above approaches testify that there is a significant potential benefit in providing additional
structure in large answer sets.
2.2.2.2 DB Clustering Techniques
The SQL group-by operator allows grouping query results. It classifies the results of a query into groups
based on a user-selected subset of fields. However, it partitions the space only by identical values. For
instance, if there are 1000 different zip-codes, 'group-by Zip' returns 1000 groups. In the last few years,
traditional clustering techniques have been employed by the database research community to overcome
this shortcoming (Chakrabarti, Chaudhuri & Hwang, 2004; Bamba, Roy & Mohania, 2005; Li, Wang,
Lim, Wang & Chang, 2007).
Chakrabarti et al's System
In (Chakrabarti, Chaudhuri & Hwang, 2004), the authors proposed an automatic method for categorizing
query results. This method dynamically creates a navigation tree (i.e., a hierarchical category structure)
for each query q based on the contents of the tuples in the answer set T q . A hierarchical categorization
of T q is a recursive partitioning of the tuples in T q based on the data attributes and their values, i.e., the
query results are grouped into nested categories. Figure 5 shows an example of a hierarchical categoriza-
tion of the results of the query 'City = Paris AND Price [150, 300] K€'. At each level, the partitioning
is done based on a single attribute in the result set T q , and this attribute is the same for all nodes at that
level. Furthermore, once an attribute is used as a categorizing attribute at any level, it is not repeated
at a later level. For example, Price is the categorizing attribute of all nodes at Level 2. The partitions
are assigned descriptive labels and form a categorization of the result set based on that attribute. For
example, the first child of the root in Figure 5 has label 'Location = 18 th arrondissement' while its own
first child has label 'Price [200, 225] K€'.
Search WWH ::




Custom Search