Practical Approaches to the Many-Answer Problem - Advanced Database Query Systems

Database Reference

In-Depth Information

keywords) match the upcoming query. A hierarchical clustering technique is typically used in these

approaches, and different strategies for matching the query against the document hierarchy have been

proposed, most notably a top-down or a bottom-up search and their variants (Jardine & Van Rijsbergen,

1971; Van Rijsbergen & Croft, 1975; Croft, 1980; Voorhees, 1985). Similarly, search engines (e.g., Ya-

hoo) and product catalog search (e.g., eBay) use a category structure created in advance and then group

search results into separate categories. In all these approaches, if a query does not match any cluster

representation of one of the pre-defined clusters or categories, then it fails to match any documents

even if the document collection contains relevant results. It is worth noticing that this problem is not

intrinsic to clustering, but is due to the fact that keyword representation of clusters is often insufficient

to apprehend the meaning of documents in a cluster.

An alternative way of using the cluster hypothesis is in the presentation of retrieval results, that is

by presenting, in a clustered form, only documents that have been retrieved in response to the query.

This idea was first introduced in the Scatter/Gather system (Hearst & Pedersen, 1996) which is based

on a variant of the classical k-means algorithm (Hartigan & Wong, 1979). Since then, several classes of

algorithms have been proposed such as STC (Zamir & Etzioni, 1999), SHOC (Zeng, He, Chen, Ma &

Ma, 2004), EigenCluster (Cheng, Kannan, Vempala & Wang, 2006), SnakeT (Ferragina & Gulli, 2005).

Note that such algorithms introduce a noticeable time overhead to the query processing, due to the large

number of results returned by the search engine. The reader is referred to (Manning, Raghavan & Schtze,

2008) for more details on IR clustering techniques.

All the above approaches testify that there is a significant potential benefit in providing additional

structure in large answer sets.

2.2.2.2 DB Clustering Techniques

The SQL group-by operator allows grouping query results. It classifies the results of a query into groups

based on a user-selected subset of fields. However, it partitions the space only by identical values. For

instance, if there are 1000 different zip-codes, 'group-by Zip' returns 1000 groups. In the last few years,

traditional clustering techniques have been employed by the database research community to overcome

this shortcoming (Chakrabarti, Chaudhuri & Hwang, 2004; Bamba, Roy & Mohania, 2005; Li, Wang,

Lim, Wang & Chang, 2007).

Chakrabarti et al's System

In (Chakrabarti, Chaudhuri & Hwang, 2004), the authors proposed an automatic method for categorizing

query results. This method dynamically creates a navigation tree (i.e., a hierarchical category structure)

for each query q based on the contents of the tuples in the answer set T q . A hierarchical categorization

of T q is a recursive partitioning of the tuples in T q based on the data attributes and their values, i.e., the

query results are grouped into nested categories. Figure 5 shows an example of a hierarchical categoriza-

tion of the results of the query 'City = Paris AND Price [150, 300] K€'. At each level, the partitioning

is done based on a single attribute in the result set T q , and this attribute is the same for all nodes at that

level. Furthermore, once an attribute is used as a categorizing attribute at any level, it is not repeated

at a later level. For example, Price is the categorizing attribute of all nodes at Level 2. The partitions

are assigned descriptive labels and form a categorization of the result set based on that attribute. For

example, the first child of the root in Figure 5 has label 'Location = 18 th arrondissement' while its own

first child has label 'Price [200, 225] K€'.

Search WWH ::

Custom Search

Home