they are attempting to discover potentially useful items. In such a situation, user queries are often very broad, resulting in too many answers, and not all of the retrieved items are relevant to the user. Unfortunately, she/he often needs to examine all or most of them to find the interesting ones. This too-many-answers phenomenon is commonly referred to as information overload: "a state in which the amount of information that merits attention exceeds an individual's ability to process it" (Schultz & Vandenbosch, 1998).
Information overload often happens when the user is not certain of what she/he is looking for, i.e., when she/he has a vague and poorly defined information need or retrieval goal. Thus, she/he generally poses a broad query at the outset to avoid excluding potentially interesting results, and then starts browsing the answers looking for something interesting. Information overload makes it hard for the user to separate the interesting items from the uninteresting ones, leading to potential decision paralysis and a waste of time and effort. The dangers of information overload are not to be underestimated; they are well illustrated by buzzwords such as Infoglut (Allen, 1992), Information Fatigue Syndrome (Lewis, 1996), TechnoStress (Weil & Rosen, 1997), Data Smog (Shenk, 1997), Data Asphyxiation (Winkle, 1998) and Information Pollution (Nielsen, 2003).
In the context of relational databases, automated ranking and clustering of query results are used to reduce information overload. Automated ranking-based techniques first seek to clarify or approximate the user's retrieval goal. They then assign each answer a score representing the extent to which it is relevant to the approximated retrieval goal. Finally, the user is provided with a list, ranked in descending order of relevance, of either all query results or only a top-k subset. In contrast, clustering-based techniques help the user clarify or refine the retrieval goal instead of trying to learn it: they divide the query result set into dissimilar groups (or clusters) of similar items, allowing users to select and explore the groups that interest them while ignoring the rest. However, both of these techniques present two major problems (a short code sketch contrasting the two styles follows the list below):
• The first is related to relevance. With regard to automated ranking-based techniques, the relevance of the results highly depends on their ability to accurately capture the user's retrieval goal, which is not an obvious task. Furthermore, such techniques also bring the disadvantage of match homogeneity, i.e., the user is often required to go through a large number of similar results before finding the next different result. With regard to clustering-based techniques, there is no guarantee that the resulting clusters will match the meaningful groups a user may expect. In fact, most clustering techniques seek only to maximize some statistical properties of the clusters (such as the size and compactness of each cluster and the separation of clusters relative to each other);
• The second is related to scalability. Both ranking and clustering are performed on query results and consequently occur at query time. Thus, the time overhead is a critical open issue for such a posteriori tasks.
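To make the contrast concrete, here is a minimal, hypothetical sketch of the two post-processing styles on a toy result set. The scoring function, its weights, and the crude price-band grouping are illustrative assumptions standing in for the real ranking and clustering techniques surveyed above, not any particular published system.

```python
from collections import defaultdict

# Toy query results: (title, price, rating) rows from a broad product query.
results = [
    ("budget laptop", 450, 3.9),
    ("gaming laptop", 1800, 4.6),
    ("ultrabook", 1200, 4.4),
    ("refurbished laptop", 300, 3.2),
    ("workstation laptop", 2500, 4.1),
]

# Ranking-based style: score each answer against an approximated retrieval
# goal (assumed here to favor high rating and low price), then return a
# top-k list in descending order of relevance.
def score(row, price_weight=0.001, rating_weight=1.0):
    _, price, rating = row
    return rating_weight * rating - price_weight * price

top_k = sorted(results, key=score, reverse=True)[:3]

# Clustering-based style: partition the result set into dissimilar groups of
# similar items (a simple price-band grouping stands in for a real clustering
# algorithm), letting the user explore only the groups of interest.
def price_band(row):
    _, price, _ = row
    return "low" if price < 600 else "mid" if price < 1500 else "high"

clusters = defaultdict(list)
for row in results:
    clusters[price_band(row)].append(row)

print("top-k:", [r[0] for r in top_k])
for band, rows in clusters.items():
    print(f"cluster '{band}':", [r[0] for r in rows])
```

Note that both steps run over the materialized result set at query time, which is exactly the scalability concern raised in the second point above.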
To go one step beyond an overview of these well-established techniques, we investigate a simple but useful strategy to alleviate the two problems above. Specifically, we present an efficient and effective algorithm coined the Explore-Select-Rearrange Algorithm (ESRA) that provides users with hierarchical clustering schemas of their query results. ESRA operates on pre-computed knowledge-based summaries of the data instead of the raw data themselves. The underlying summarization technique used in this work is the SAINTETIQ model (Raschia & Mouaddib, 2002; Saint-Paul, Raschia, & Mouaddib, 2005), a domain knowledge-based approach that enables summarization and classification of structured data
stored in a database. Each node (or summary) of the hierarchy provided by ESRA describes a subset of the result set.
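As a rough illustration of the kind of structure involved, the following hypothetical sketch builds a small pre-computed hierarchy of summary nodes, each covering a subset of tuples and described by attribute labels, and walks it while keeping only the nodes that intersect a query's result. The Summary class, the explore function, and the labels are illustrative assumptions; the actual SAINTETIQ model is a richer, fuzzy-set-based summarization technique than what is shown here.

```python
from dataclasses import dataclass, field

@dataclass
class Summary:
    labels: dict                     # attribute -> linguistic label, e.g. {"price": "cheap"}
    tuple_ids: set                   # subset of database tuples this node covers
    children: list = field(default_factory=list)

def explore(node, query_ids, depth=0):
    """Walk the summary tree, keeping only nodes whose extent intersects
    the query result; relevant summaries are selected without scanning
    the raw data at query time."""
    overlap = node.tuple_ids & query_ids
    if not overlap:
        return
    print("  " * depth + f"{node.labels} covers {len(overlap)} result tuple(s)")
    for child in node.children:
        explore(child, query_ids, depth + 1)

# Tiny pre-computed hierarchy over six tuples.
root = Summary({"price": "any"}, {1, 2, 3, 4, 5, 6}, [
    Summary({"price": "cheap"}, {1, 2, 3}, [
        Summary({"price": "cheap", "rating": "high"}, {1, 2}),
        Summary({"price": "cheap", "rating": "low"}, {3}),
    ]),
    Summary({"price": "expensive"}, {4, 5, 6}),
])

explore(root, query_ids={1, 2, 4})   # suppose the query returned tuples 1, 2 and 4
```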