a large number of similar results before finding the next different result. In contrast, clustering-based
techniques (Section 2.2.2) help the user clarify or refine the retrieval goal instead of trying to learn
it. They divide the query result set into homogeneous groups, allowing the user to select
and explore the groups that are of interest to her/him. However, such techniques seek only to maximize
some statistical property of the resulting clusters (such as the size and compactness of each cluster and
the separation of clusters relative to each other), and therefore there is no guarantee that the resulting
clusters will match the meaningful groups that a user may expect. Furthermore, these approaches operate
on query results and therefore run at query time. The resulting time overhead remains a
critical open issue for such a posteriori tasks.
In the second part of this chapter, we focus on the Many-Answers problem, which is critical for very
large databases and decision support systems. We investigate a simple but useful strategy to handle
this problem.
3. KNOWLEDGE-BASED CLUSTERING OF RESULT SET
Database systems are being increasingly used for interactive and exploratory data retrieval. In such
retrieval, user queries often return too many answers. Not all the retrieved items are relevant to the
user; typically, only a tiny fraction of the result set is relevant to her/him. Unfortunately, she/he often
needs to examine all or most of the retrieved items to find the interesting ones. As discussed in Section
2, this phenomenon (commonly referred to as 'information overload') often happens when the user
submits a 'broad' query, i.e., she/he has an ill-defined retrieval goal.
For example, consider a realtor database HouseDB with information on houses for sale in Paris,
including their Price, Size, #Bedrooms, Age, Location, etc. A user who approaches that database with a
broad query such as 'Price ∈ [150k€, 300k€]' may be overloaded with a huge list of results, since there are
many houses within this price range in Paris. A well-established theory in cognitive psychology (Miller,
1962; Mandler, 1967) contends that humans organize items into logical groups as a way of dealing with
large amounts of information. For instance, a child classifies his toys according to his favorite colors; a
direct marketer classifies his target according to a variety of geographic, demographic, and behavioral
attributes; and a real estate agent classifies his houses according to the location, the price, the size, etc.
Furthermore, the Cluster Hypothesis (Jardine & Van Rijsbergen, 1971) states that “closely associated
items tend to be relevant to the same request”. Therefore, clustering analysis (Berkhin, 2006), which
refers to partitioning data into dissimilar groups (or clusters) of similar items, is an effective technique
for overcoming the problem of information overload. However, applying traditional clustering methods
directly to the results of a user's query presents two major problems:
1. The first is related to relevance. Most clustering algorithms seek only to maximize some statistical
property of the clusters (e.g., the size and compactness of each cluster and the separation of clusters
relative to each other), and therefore there is no guarantee that the resulting clusters will match the
meaningful groups that a user may expect.
2. The second is related to scalability. Clustering analysis is a time-consuming process, and doing it
on the fly (i.e., at query time) may seriously compromise the response time of the system.
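
To make the two problems concrete, the following minimal Python sketch runs the broad price query against a toy stand-in for HouseDB and then clusters the answers a posteriori. Everything in it is an assumption made for illustration: the table schema, the synthetic rows, the neighbourhood names, and the helper names build_house_db, broad_query and kmeans_1d do not come from the chapter, and the one-dimensional k-means stands in for "a traditional clustering method" in general.

```python
import sqlite3
from collections import Counter

def build_house_db():
    """Build a hypothetical in-memory HouseDB; schema and rows are invented."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE HouseDB (id INTEGER PRIMARY KEY, price INTEGER, "
        "size INTEGER, bedrooms INTEGER, location TEXT)"
    )
    locations = ["Montmartre", "Le Marais", "Bastille", "Belleville"]
    rows = []
    for i in range(400):
        price = 100_000 + (i * 7_919) % 300_000  # deterministic spread of prices
        size = 20 + (i * 37) % 120               # square metres
        rows.append((i, price, size, 1 + size // 40, locations[i % 4]))
    conn.executemany("INSERT INTO HouseDB VALUES (?, ?, ?, ?, ?)", rows)
    return conn

def broad_query(conn, low=150_000, high=300_000):
    """The broad range query on Price from the running example."""
    cur = conn.execute(
        "SELECT id, price, size, location FROM HouseDB "
        "WHERE price BETWEEN ? AND ? ORDER BY id",
        (low, high),
    )
    return cur.fetchall()

def kmeans_1d(values, k, iterations=20):
    """Lloyd-style k-means on a single numeric attribute."""
    ordered = sorted(values)
    # Deterministic initialisation: k evenly spaced values from the sorted list.
    centroids = [ordered[(2 * j + 1) * len(ordered) // (2 * k)] for j in range(k)]
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assignment step: each value goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: abs(v - centroids[c])) for v in values]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:
                centroids[c] = sum(members) / len(members)
    return labels, centroids

conn = build_house_db()
results = broad_query(conn)  # the whole answer set the user would have to scan
# Clustering happens here, at query time, on the answers just retrieved.
labels, centroids = kmeans_1d([row[1] for row in results], k=3)
# Location mix inside each price cluster: the clusters are statistically
# compact on price, yet nothing ties them to the groups a user has in mind.
mix = {c: Counter(r[3] for r, lab in zip(results, labels) if lab == c)
       for c in range(3)}
print(len(results), "answers;", mix)
```

In this sketch the clusters are compact with respect to price only, so a user who thinks in terms of neighbourhoods gets groups that mix all locations (the relevance problem). The whole clustering loop also runs after the query returns, so its cost is added to the response time and grows with the size of the answer set (the scalability problem).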