In a pre-processing step, we compute knowledge-based summaries of the queried data. The underlying
summarization technique used in this paper is the SAINTETIQ model (Raschia & Mouaddib,
2002; Saint-Paul, Raschia, & Mouaddib, 2005), a domain knowledge-based approach that
enables summarization and classification of structured data stored in a database. SAINTETIQ
first transforms raw data into high-level representations (summaries) that fit the user's perception
of the domain, by means of linguistic labels (e.g., cheap, reasonable, expensive, very expensive)
defined over the data attribute domains and provided by a domain expert or even an end-user. Then
it applies a hierarchical clustering algorithm to these summaries to provide multi-resolution summaries
(i.e., a summary hierarchy) that represent the database content at different abstraction levels.
The summary hierarchy can be seen as analogous to the knowledge representation of the estate agent.
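
The first step, rewriting raw values with linguistic labels, can be illustrated with a minimal sketch. It assumes each label is modelled as a trapezoidal membership function over the attribute domain; this section does not specify the representation, and the names `Trapezoid`, `PRICE_LABELS` and `rewrite`, as well as all numeric boundaries, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from math import inf


@dataclass(frozen=True)
class Trapezoid:
    """Trapezoidal membership function: support [a, d], core [b, c]."""
    a: float
    b: float
    c: float
    d: float

    def grade(self, x: float) -> float:
        """Degree (0..1) to which x is described by this label."""
        if self.b <= x <= self.c:
            return 1.0
        if x <= self.a or x >= self.d:
            return 0.0
        if x < self.b:
            return (x - self.a) / (self.b - self.a)  # rising edge
        return (self.d - x) / (self.d - self.c)      # falling edge


# Hypothetical vocabulary over a "price" attribute (illustrative boundaries).
PRICE_LABELS = {
    "cheap": Trapezoid(0, 0, 400, 600),
    "reasonable": Trapezoid(400, 600, 900, 1100),
    "expensive": Trapezoid(900, 1100, 1500, 1800),
    "very expensive": Trapezoid(1500, 1800, inf, inf),
}


def rewrite(value: float, labels: dict) -> dict:
    """Return every label that describes `value`, with its satisfaction degree."""
    graded = {name: mf.grade(value) for name, mf in labels.items()}
    return {name: g for name, g in graded.items() if g > 0.0}


if __name__ == "__main__":
    for price in (500, 700, 2000):
        print(price, rewrite(price, PRICE_LABELS))
```

Under these assumptions a value near a boundary is described by two labels at once, each to a partial degree, so a record is not forced into a single crisp category before clustering.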
At query time, we use the summary hierarchy of the data, instead of the data itself, to quickly provide
the user with concise, useful and structured answers as a starting point for an online analysis.
This goal is achieved thanks to the Explore-Select algorithm (ESA), which extracts query-relevant
entries from the summary hierarchy. Each answer item describes a subset of the result set in a
human-readable form using linguistic labels. Moreover, the answers to a given query are nodes of the
summary hierarchy, and every subtree rooted at an answer offers a 'guided tour' of a data subset
to the user. The user then navigates this tree in a top-down fashion, exploring the summaries of
interest while ignoring the rest. Note that the database is accessed only when the user requests to
download (Upload) the original data that a potentially relevant summary describes. Hence, this
framework is intended to help the user iteratively refine her/his information need in the same way
as the estate agent does.
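
The idea of extracting query-relevant nodes from the hierarchy can be sketched as a top-down traversal. The code below is a simplified illustration, not the ESA as published: it assumes each summary stores, per attribute, the set of labels describing the records it covers, and that a query lists the acceptable labels per attribute. The names `Summary` and `explore_select` and the toy hierarchy are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Summary:
    """A node of the summary hierarchy (hypothetical, simplified structure)."""
    name: str
    labels: dict                      # attribute -> set of labels covered
    children: list = field(default_factory=list)


def explore_select(node: Summary, query: dict) -> list:
    """Return the highest nodes whose content is fully covered by the query.

    query maps an attribute to the set of labels the user accepts.
    """
    # Prune: some queried attribute shares no label with this node.
    if not all(node.labels.get(a, set()) & wanted for a, wanted in query.items()):
        return []
    # Select: every label of the node on the queried attributes is acceptable.
    # A partially matching leaf is also returned, since it cannot be refined.
    if not node.children or all(
        node.labels.get(a, set()) <= wanted for a, wanted in query.items()
    ):
        return [node]
    # Explore: the node matches only partially, so descend into its children.
    answers = []
    for child in node.children:
        answers.extend(explore_select(child, query))
    return answers


# Toy hierarchy over two attributes (illustrative labels only).
z11 = Summary("z11", {"price": {"cheap"}, "size": {"small"}})
z12 = Summary("z12", {"price": {"reasonable"}, "size": {"medium"}})
z1 = Summary("z1", {"price": {"cheap", "reasonable"},
                    "size": {"small", "medium"}}, [z11, z12])
z2 = Summary("z2", {"price": {"expensive"}, "size": {"large"}})
root = Summary("root", {"price": {"cheap", "reasonable", "expensive"},
                        "size": {"small", "medium", "large"}}, [z1, z2])

if __name__ == "__main__":
    hits = explore_select(root, {"price": {"cheap", "reasonable"}})
    print([s.name for s in hits])
```

In this sketch the traversal stops at the most general node that satisfies the query, so the subtree below each returned node remains available for the top-down navigation described above, and the underlying records are never touched.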
However, since the summary hierarchy is independent of the query, the set of starting point
answers could be large and, consequently, the dissimilarity between items is susceptible to skew. This occurs
when the summary hierarchy is not perfectly adapted to the user query. To tackle this problem, we first
propose a straightforward approach (ESA-SEQ) that uses the clustering algorithm of SAINTETIQ to optimize
the high-level answers. This optimization requires post-processing and therefore incurs a time
overhead. Thus, we finally develop an efficient and effective algorithm (ESRA, i.e., ES-Rearrange Algorithm)
that rearranges answers based on the hierarchical structure of the pre-computed summary hierarchy,
such that no post-processing task (other than the query evaluation itself) has to be performed at query time.
The rest of this section is organized as follows. First, we present the SAINTETIQ model and its
properties, and we illustrate the process with a toy example. Then, in Section 3.2, we detail the use of
SAINTETIQ outputs in query processing and describe the formulation of queries and the retrieval
of clusters. Thereafter, we discuss in Section 3.3 how such results help address the many-answers problem.
The algorithm that addresses the problem of dissimilarity (discrimination) between the starting
point answers by rearranging them is presented in Section 3.4. Section 3.5 discusses an extension of
the above process that allows every user to use her/his own vocabulary when querying the database. An
experimental study using real data is presented in Section 3.6.
3.1 Overview of the SAINTETIQ System
In this subsection, we first introduce the main ideas of SAINTETIQ (Raschia & Mouaddib, 2002;
Saint-Paul, Raschia, & Mouaddib, 2005). Then, we briefly discuss some other data clustering techniques, and
argue that SAINTETIQ is more suitable for interactive and exploratory data retrieval.