Practical Approaches to the Many-Answer Problem - Advanced Database Query Systems

where L is the number of cells of the output hierarchy and d is its average width. In the above formula, the

coefficient k_SEQ corresponds to the set of operations performed to find the best learning operator (create,

merge, or split) to apply at each level of the hierarchy, whereas log_d L is an estimate of the average

depth of this hierarchy.

The SAINTETIQ model, besides being a grid-based clustering method, has many other advantages

that are relevant for achieving the targeted objective of this chapter. First, SAINTETIQ uses prior

domain knowledge (the Knowledge Base) to guide the clustering process, and to provide clusters that

fit the user's perception of the domain. This distinctive feature differentiates it from other grid-based

clustering techniques, which attempt only to maximize some statistical property of the clusters and

therefore offer no guarantee that the resulting clusters will match the meaningful groups that a user

may expect. Second, the flexibility in the vocabulary definition of the KB leads to clustering schemas that

have two useful properties: (1) the clusters have 'soft' boundaries, in the sense that each record belongs

to each cluster to some degree, and thus undesirable threshold effects that are usually produced by crisp

(non-fuzzy) boundaries are avoided; (2) the clusters are presented in a user-friendly language (i.e.,

linguistic labels) and hence the user can determine at a glance whether a cluster's content is of interest.

Finally, SAINTETIQ applies a conceptual clustering algorithm for partitioning the incoming data in an

incremental and dynamic way. Thus, changes in the database are reflected through such an incremental

maintenance of the complete hierarchy (Saint-Paul, Raschia, & Mouaddib, 2005).
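The 'soft' boundary property can be illustrated with a minimal sketch. The two Price labels and their breakpoints (in k€) are hypothetical values chosen for illustration, not the Knowledge Base used in this chapter, and the function names are likewise assumptions.

```python
# Hypothetical trapezoidal labels on Price (in k€): (a, b, c, d) with
# support [a, d] and core [b, c]. Breakpoints are illustrative only.
PRICE_LABELS = {
    "cheap": (50, 100, 200, 250),
    "reasonable": (200, 250, 400, 450),
}


def trapezoid(x, a, b, c, d):
    """Membership degree of x in a trapezoidal fuzzy set (a <= b <= c <= d)."""
    if b <= x <= c:
        return 1.0                    # inside the core
    if a < x < b:
        return (x - a) / (b - a)      # rising edge
    if c < x < d:
        return (d - x) / (d - c)      # falling edge
    return 0.0                        # outside the support


def describe(price):
    """Degree to which a price matches each linguistic label."""
    return {label: trapezoid(price, *params)
            for label, params in PRICE_LABELS.items()}


def crisp_is_cheap(price):
    """Crisp counterpart: a hard interval [100, 200] with no gradation."""
    return 100 <= price <= 200


# A price of 220 is cheap to degree 0.6 and reasonable to degree 0.4,
# whereas the crisp interval simply rejects it.
memberships = describe(220)
```

With the crisp interval, a record at 200 is accepted and a record at 201 is rejected outright; with the fuzzy labels, the degree of membership in cheap decreases gradually over the interval [200, 250].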

Of course, for a new application, the end-user or the expert has to be consulted to create the linguistic

labels as well as the fuzzy membership functions. However, it is worth noting that, once such a knowledge

base is defined, the system does not require any further configuration. Furthermore, the issue of estimating

fuzzy membership functions has been intensively studied in the fuzzy set literature (Galindo, 2008), and

various methods based on data distribution and statistics exist to assist the user in designing trapezoidal

fuzzy membership functions.
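As a rough illustration of a data-driven starting point, the sketch below derives the four breakpoints of a trapezoid from percentiles of sample values that an expert has already associated with a label. This percentile heuristic, the function name trapezoid_from_data, and the sample values are assumptions made for this example; they are not a method taken from the cited literature.

```python
from statistics import quantiles


def trapezoid_from_data(values):
    """Suggest trapezoid breakpoints (a, b, c, d) from sample values.

    Hypothetical heuristic: the support [a, d] spans the 5th to 95th
    percentiles and the core [b, c] spans the 25th to 75th percentiles.
    """
    cuts = quantiles(values, n=20)    # 19 cut points: 5%, 10%, ..., 95%
    a, b, c, d = cuts[0], cuts[4], cuts[14], cuts[18]
    return a, b, c, d


# Illustrative prices (in k€) that an expert might have tagged as "cheap".
cheap_prices = list(range(100, 201, 5))
a, b, c, d = trapezoid_from_data(cheap_prices)
```

The resulting breakpoints are only a proposal that the user or expert can then adjust by hand.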

3.2 Querying the SAINTETIQ Summaries

In an exploratory analysis of a massive data set, users usually have only a vague idea of what they could

find in the data. They are then unable to formulate precise criteria to locate the desired information. The

querying mechanism presented here allows such users to access a database (previously summarized)

using vague requirements (e.g., cheap) instead of crisp ones (e.g., [100k€, 200k€]). In fact, users only

need to select the right criteria from an existing set of linguistic labels defined on each attribute domain

in the KB, to filter a set of clusters (summaries) that can then be browsed to find potentially interesting

pieces of information. However, choosing the linguistic labels from a controlled vocabulary compels

the user to adopt a predefined categorization materialized by the grid-cells. In Section 3.5, we deal with

user-specific linguistic labels to overcome this pitfall. In this subsection, we first introduce a toy example

that will be used throughout this chapter. Then, we present all aspects of the querying mechanism from

the expression and meaning of a query to its matching against summaries.
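The idea of filtering summaries with linguistic labels can be sketched as follows. This is a deliberately simplified illustration: the summary names (z1 to z3), their label sets, and the three-way full/partial/none classification are assumptions made for this example and do not reproduce the matching procedure defined later in the chapter.

```python
# Hypothetical summaries, each described by a set of labels per attribute.
SUMMARIES = {
    "z1": {"Price": {"cheap"}, "Size": {"small", "medium"}},
    "z2": {"Price": {"cheap", "reasonable"}, "Size": {"medium"}},
    "z3": {"Price": {"expensive"}, "Size": {"large"}},
}


def match(summary, query):
    """Classify a summary against a query as 'full', 'partial', or 'none'.

    'none'    : some queried attribute shares no label with the query
    'partial' : every queried attribute overlaps, but extra labels occur
    'full'    : every queried attribute uses only requested labels
    """
    status = "full"
    for attribute, wanted in query.items():
        present = summary.get(attribute, set())
        if not present & wanted:
            return "none"
        if not present <= wanted:
            status = "partial"
    return status


def filter_summaries(summaries, query):
    """Keep the summaries that match the query at least partially."""
    results = {name: match(desc, query) for name, desc in summaries.items()}
    return {name: status for name, status in results.items()
            if status != "none"}


# The user picks labels from the controlled vocabulary, not numeric ranges.
query = {"Price": {"cheap"}, "Size": {"small", "medium"}}
selected = filter_summaries(SUMMARIES, query)
```

Here z1 is retained as a full match, z2 as a partial match (it also covers reasonable prices), and z3 is discarded; the retained summaries are the ones a user would then browse.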

3.2.1 Running Example

To illustrate the querying mechanism, we introduce here a sample data set R with 30 records (t1-t30)

represented on three attributes: Price, Size and Location. We suppose that {cheap (ch.), reasonable (re.),

expensive (ex.), very expensive (vex.)}, {small (sm.), medium (me.), large (la.)} and {downtown (dw.),
