where L is the number of cells of the output hierarchy and d its average width. In the above formula, the
coefficient k_SEQ corresponds to the set of operations performed to find the best learning operator (create,
merge or split) to apply at each level of the hierarchy, whereas log_d L is an estimate of the average
depth of this hierarchy.
The SAINTETIQ model, besides being a grid-based clustering method, has many other advantages
that are relevant for achieving the targeted objective of this chapter. First, SAINTETIQ uses prior
domain knowledge (the Knowledge Base) to guide the clustering process, and to provide clusters that
fit the user's perception of the domain. This distinctive feature differentiates it from other grid-based
clustering techniques, which attempt only to maximize some statistical property of the clusters and
therefore offer no guarantee that the resulting clusters will match the meaningful groups that a user
may expect. Second, the flexibility in the vocabulary definition of the KB leads to clustering schemas that
have two useful properties: (1) the clusters have 'soft' boundaries, in the sense that each record belongs
to each cluster to some degree, and thus undesirable threshold effects that are usually produced by crisp
(non-fuzzy) boundaries are avoided; (2) the clusters are presented in a user-friendly language (i.e.,
linguistic labels) and hence the user can determine at a glance whether a cluster's content is of interest.
Finally, SAINTETIQ applies a conceptual clustering algorithm for partitioning the incoming data in an
incremental and dynamic way. Thus, changes in the database are reflected through such an incremental
maintenance of the complete hierarchy (Saint-Paul, Raschia, & Mouaddib, 2005).
Of course, for a new application, the end-user or the expert has to be consulted to create the linguistic
labels as well as the fuzzy membership functions. However, it is worth noting that, once such a knowledge
base is defined, the system does not require any further settings. Furthermore, the issue of estimating
fuzzy membership functions has been intensively studied in the fuzzy set literature (Galindo, 2008), and
various methods based on data distribution and statistics exist to assist the user in designing trapezoidal
fuzzy membership functions.
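
As an illustration of the trapezoidal membership functions mentioned above, the following minimal Python sketch maps a crisp value to the linguistic labels it satisfies, each with a degree in [0, 1]. The function names (trapezoid, fuzzify) and all breakpoints in PRICE_LABELS (in k€) are hypothetical and chosen only for illustration; they are not taken from the chapter's Knowledge Base.

```python
from typing import Dict, Tuple

# A trapezoid (a, b, c, d) has support [a, d] and core [b, c].
Trapezoid = Tuple[float, float, float, float]

def trapezoid(x: float, a: float, b: float, c: float, d: float) -> float:
    """Membership degree of x in the trapezoidal fuzzy set (a, b, c, d)."""
    if b <= x <= c:          # inside the core: full membership
        return 1.0
    if x <= a or x >= d:     # outside the support: no membership
        return 0.0
    if x < b:                # rising edge between a and b
        return (x - a) / (b - a)
    return (d - x) / (d - c) # falling edge between c and d

# Hypothetical vocabulary on the Price attribute (values in k€).
PRICE_LABELS: Dict[str, Trapezoid] = {
    "cheap": (0, 0, 100, 150),
    "reasonable": (100, 150, 200, 250),
    "expensive": (200, 250, 350, 400),
    "very expensive": (350, 400, float("inf"), float("inf")),
}

def fuzzify(x: float, labels: Dict[str, Trapezoid]) -> Dict[str, float]:
    """Return every label that x satisfies to a non-zero degree."""
    degrees = {name: trapezoid(x, *shape) for name, shape in labels.items()}
    return {name: mu for name, mu in degrees.items() if mu > 0.0}

if __name__ == "__main__":
    # A price of 125 k€ lies on the overlap of two labels.
    print(fuzzify(125, PRICE_LABELS))
```

Under these assumed breakpoints, a price that falls between two cores belongs partly to both neighbouring labels, which is the 'soft' boundary behaviour described earlier, as opposed to the abrupt switch a crisp interval would produce.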
3.2 Querying the SAINTETIQ Summaries
In an exploratory analysis of a massive data set, users usually have only a vague idea of what they could
find in the data. They are then unable to formulate precise criteria to locate the desired information. The
querying mechanism presented here allows such users to access a database (previously summarized)
using vague requirements (e.g., cheap) instead of crisp ones (e.g., [100k€, 200k€]). In fact, users only
need to select the right criteria from an existing set of linguistic labels defined on each attribute domain
in the KB, to filter a set of clusters (summaries) that can then be browsed to find potentially interesting
pieces of information. However, choosing the linguistic labels from a controlled vocabulary compels
the user to adopt a predefined categorization materialized by the grid-cells. In Section 3.5, we deal with
user-specific linguistic labels to overcome this pitfall. In this subsection, we first introduce a toy example
that will be used throughout this chapter. Then, we present all aspects of the querying mechanism from
the expression and meaning of a query to its matching against summaries.
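
The general idea of filtering summaries with linguistic labels can be sketched as follows. This is a deliberately simplified illustration, not the matching procedure of the chapter: it assumes each summary is described by a set of labels per attribute, and that a summary is selected when, for every queried attribute, it shares at least one label with the query. The names Summary, Query, matches and select, as well as the summaries z1 to z3, are hypothetical.

```python
from typing import Dict, List, Set

# attribute name -> linguistic labels describing the cluster (assumed shape)
Summary = Dict[str, Set[str]]
# attribute name -> labels the user is willing to accept
Query = Dict[str, Set[str]]

def matches(summary: Summary, query: Query) -> bool:
    """True if the summary shares a label with the query on every queried attribute."""
    return all(summary.get(attr, set()) & wanted for attr, wanted in query.items())

def select(summaries: Dict[str, Summary], query: Query) -> List[str]:
    """Names of the summaries that satisfy the query, in insertion order."""
    return [name for name, summary in summaries.items() if matches(summary, query)]

# Hypothetical summaries over two of the attributes used in this chapter.
SUMMARIES: Dict[str, Summary] = {
    "z1": {"Price": {"cheap"}, "Size": {"small"}},
    "z2": {"Price": {"cheap", "reasonable"}, "Size": {"medium"}},
    "z3": {"Price": {"expensive"}, "Size": {"large"}},
}

if __name__ == "__main__":
    # "cheap, and small or medium" expressed with labels from the vocabulary
    query: Query = {"Price": {"cheap"}, "Size": {"small", "medium"}}
    print(select(SUMMARIES, query))
```

The user never supplies a numeric range here; the query is built entirely from the controlled vocabulary, which is also why it inherits the predefined categorization discussed above.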
3.2.1 Running Example
To illustrate the querying mechanism, we introduce here a sample data set R with 30 records (t_1 to t_30)
described by three attributes: Price, Size and Location. We suppose that {cheap (ch.), reasonable (re.),
expensive (ex.), very expensive (vex.)}, {small (sm.), medium (me.), large (la.)} and {downtown (dw.),