Practical Approaches to the Many-Answer Problem - Advanced Database Query Systems


a large number of similar results before finding the next different result. In contrast, clustering-based

techniques (Section 2.2.2) help the user clarify or refine the retrieval goal instead of trying to learn

it. They divide the query result set into homogeneous groups, allowing the user to select

and explore groups that are of interest to her/him. However, such techniques seek only to maximize

some statistical property of the resulting clusters (such as the size and compactness of each cluster and

the separation of clusters relative to each other), and therefore there is no guarantee that the resulting

clusters will match the meaningful groups that a user may expect. Furthermore, these approaches

operate on query results and therefore run at query time. Their time overhead thus remains a

critical open issue for such a posteriori tasks.

In the second part of this chapter, we focus on the Many-Answers problem, which is critical for very

large databases and decision support systems. We therefore investigate a simple but useful strategy to handle

this problem.

3. KNOWLEDGE-BASED CLUSTERING OF RESULT SET

Database systems are being increasingly used for interactive and exploratory data retrieval. In such

retrieval, user queries often return too many answers. Not all the retrieved items are relevant to the

user; typically, only a tiny fraction of the result set is relevant to her/him. Unfortunately, she/he often

needs to examine all or most of the retrieved items to find the interesting ones. As discussed in Section

2, this phenomenon (commonly referred to as 'information overload') often happens when the user

submits a 'broad' query, i.e., she/he has an ill-defined retrieval goal.

For example, consider a realtor database HouseDB with information on houses for sale in Paris,

including their Price, Size, #Bedrooms, Age, Location, etc. A user who approaches that database with a

broad query such as 'Price ∈ [150k€, 300k€]' may be overloaded with a huge list of results, since there are

many houses within this price range in Paris. A well-established theory in cognitive psychology (Miller,

1962; Mandler, 1967) contends that humans organize items into logical groups as a way of dealing with

large amounts of information. For instance, a child classifies his toys according to his favorite colors; a

direct marketer classifies his target according to a variety of geographic, demographic, and behavioral

attributes; and a real estate agent classifies his houses according to the location, the price, the size, etc.

Furthermore, the Cluster Hypothesis (Jardine & Van Rijsbergen, 1971) states that “closely associated

items tend to be relevant to the same request”. Therefore, clustering analysis (Berkhin, 2006), which

refers to partitioning data into dissimilar groups (or clusters) of similar items, is an effective technique

to overcome the problem of information overload. However, applying traditional clustering methods

directly to the results of a user's query presents two major problems:

1. the first is related to relevance. Most clustering algorithms seek only to maximize some statistical

property of the clusters (e.g., the size and compactness of each cluster and the separation of clusters

relative to each other), and therefore there is no guarantee that the resulting clusters will match the

meaningful groups that a user may expect;

2. the second is related to scalability. Clustering analysis is a time-consuming process, and doing it

on the fly (i.e., at query time) may seriously compromise the response time of the system.
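
The two problems above can be illustrated with a minimal sketch. It is not the method developed in this chapter: it assumes a toy in-memory HouseDB filled with synthetic rows, keeps only the Price and Size attributes, and uses a plain k-means as a stand-in for "traditional clustering". The function names (build_house_db, broad_query, kmeans) and all parameter values are hypothetical choices made for illustration.

```python
import random
import sqlite3


def build_house_db(n_houses=2000, seed=0):
    """Create an in-memory toy HouseDB with synthetic rows (hypothetical data)."""
    rng = random.Random(seed)
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE HouseDB (id INTEGER PRIMARY KEY, price REAL, size REAL)"
    )
    rows = [
        (i, rng.uniform(100_000, 500_000), rng.uniform(20, 200))
        for i in range(n_houses)
    ]
    conn.executemany("INSERT INTO HouseDB VALUES (?, ?, ?)", rows)
    return conn


def broad_query(conn, low=150_000, high=300_000):
    """The broad query from the text: every house priced within [low, high]."""
    cur = conn.execute(
        "SELECT price, size FROM HouseDB WHERE price BETWEEN ? AND ?",
        (low, high),
    )
    return cur.fetchall()


def kmeans(points, k=3, iterations=10, seed=0):
    """Plain k-means on min-max scaled points; returns k lists of original points."""
    rng = random.Random(seed)
    dims = range(len(points[0]))
    lo = [min(p[d] for p in points) for d in dims]
    hi = [max(p[d] for p in points) for d in dims]
    scaled = [
        tuple((p[d] - lo[d]) / ((hi[d] - lo[d]) or 1.0) for d in dims)
        for p in points
    ]
    centers = rng.sample(scaled, k)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        groups = [[] for _ in range(k)]
        for original, s in zip(points, scaled):
            nearest = min(
                range(k),
                key=lambda c: sum((s[d] - centers[c][d]) ** 2 for d in dims),
            )
            groups[nearest].append((original, s))
        # Update step: move each non-empty cluster's center to its mean.
        for c, members in enumerate(groups):
            if members:
                centers[c] = tuple(
                    sum(s[d] for _, s in members) / len(members) for d in dims
                )
    return [[original for original, _ in members] for members in groups]


conn = build_house_db()
results = broad_query(conn)          # the "many answers"
clusters = kmeans(results, k=3)      # clustering performed at query time
print(len(results), [len(c) for c in clusters])
```

The groups returned here only minimize distances in the scaled (Price, Size) space, so nothing ties them to categories a user would recognize, such as "cheap and small" (the relevance problem). Every call to kmeans also rescans the whole result set several times after the query has already run, which is the extra query-time cost behind the scalability problem.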
