Practical Approaches to the Many-Answer Problem - Advanced Database Query Systems


Figure 15. Fuzzy linguistic partition defined on the attribute Price

3.1.1 A Two-Step Process

SAINTETIQ takes tabular data as input and produces multi-resolution summaries of records through an online mapping process and a summarization process.

3.1.1.1 Mapping Service

The SAINTETIQ system relies on Zadeh's fuzzy set theory (Zadeh, 1965), and more specifically on linguistic variables (Zadeh, 1975) and fuzzy partitions (Ruspini, 1969), to represent data in a concise form. Fuzzy set theory is used to translate records in accordance with a Knowledge Base (KB) provided by a domain expert or even an end user. Basically, the operation replaces the original values of each record in the table with a set of linguistic labels defined in the KB. For instance, with a linguistic variable on the attribute Price (Figure 15), a value t.Price = 95000€ is mapped to {0.3/cheap, 0.7/reasonable}, where 0.7 is a membership grade that tells how well the label reasonable describes the value 95000. Extending this mapping to all the attributes of a relation can be seen as mapping the records to a grid-based multidimensional space. The grid is provided by the KB and corresponds to the user's perception of the domain.
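The mapping of a single value can be sketched as follows. This is only an illustration: the breakpoints of the two membership functions are hypothetical, chosen so that 95000 yields the grades quoted above, and the partition actually drawn in Figure 15 may differ. The helper names (decreasing_shoulder, trapezoid, map_value) are likewise assumptions, not part of SAINTETIQ.

```python
from typing import Callable, Dict

Membership = Callable[[float], float]


def decreasing_shoulder(full: float, zero: float) -> Membership:
    """Grade 1 up to `full`, then falling linearly to 0 at `zero`."""
    def mu(x: float) -> float:
        if x <= full:
            return 1.0
        if x >= zero:
            return 0.0
        return (zero - x) / (zero - full)
    return mu


def trapezoid(a: float, b: float, c: float, d: float) -> Membership:
    """Grade 0 outside (a, d), 1 on [b, c], linear on the two slopes."""
    def mu(x: float) -> float:
        if x <= a or x >= d:
            return 0.0
        if x < b:
            return (x - a) / (b - a)
        if x <= c:
            return 1.0
        return (d - x) / (d - c)
    return mu


# Hypothetical breakpoints in euros (assumption, not read from Figure 15).
PRICE_PARTITION: Dict[str, Membership] = {
    "cheap": decreasing_shoulder(74_000, 104_000),
    "reasonable": trapezoid(74_000, 104_000, 150_000, 180_000),
}


def map_value(value: float, partition: Dict[str, Membership]) -> Dict[str, float]:
    """Return every label with a non-zero membership grade for `value`."""
    grades = {label: mu(value) for label, mu in partition.items()}
    return {label: g for label, g in grades.items() if g > 0.0}


mapped = map_value(95_000, PRICE_PARTITION)
for label, grade in mapped.items():
    print(f"{grade:.1f}/{label}")
```

Because the two slopes cross over the same interval, the grades of adjacent labels sum to 1 there, which is the property expected of a Ruspini-style fuzzy partition.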

Thus, the tuples of Table 7 are mapped into two distinct grid-cells, denoted by c1 and c2 in Table 8. old is a fuzzy label provided a priori by the KB on the attribute Age, and it perfectly matches (with degree 1) the range [19, 24] of raw values. Moreover, 0.3/cheap says that cheap fits the data only to a small degree (0.3). That degree is computed as the maximum of the membership grades of the tuple values to cheap in c1.

Flexibility in the vocabulary definition of the KB makes it possible to express any single value with more than one fuzzy descriptor and to avoid threshold effects, thanks to the smooth transition between two descriptors. Moreover, the KB leads to the point where tuples become indistinguishable and are then grouped into grid-cells, such that there are ultimately many more records than cells. Every new (coarser) tuple stores a record count and attribute-dependent measures (min, max, mean, standard deviation, etc.). It is then called a summary.
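The grouping of mapped records into grid-cells might look like the sketch below. The three records are made up for illustration and are not the tuples of Table 7. The record layout, the build_cells helper, and the choice to count a record once in every cell it touches are simplifying assumptions. Only the rule that a cell's degree for a label is the maximum membership grade among its records comes from the text.

```python
from collections import defaultdict
from itertools import product
from statistics import mean

# Hypothetical mapped records: raw values plus label grades per attribute.
records = [
    {"raw": {"Price": 95_000, "Age": 20},
     "labels": {"Price": {"cheap": 0.3, "reasonable": 0.7}, "Age": {"old": 1.0}}},
    {"raw": {"Price": 80_000, "Age": 22},
     "labels": {"Price": {"cheap": 0.8, "reasonable": 0.2}, "Age": {"old": 1.0}}},
    {"raw": {"Price": 120_000, "Age": 24},
     "labels": {"Price": {"reasonable": 1.0}, "Age": {"old": 1.0}}},
]


def build_cells(records):
    """Group records by combination of labels, one label per attribute."""
    members = defaultdict(list)
    for rec in records:
        attrs = sorted(rec["labels"])
        # A record with several labels on an attribute falls into several cells.
        for combo in product(*(rec["labels"][a] for a in attrs)):
            members[tuple(zip(attrs, combo))].append(rec)

    cells = {}
    for key, recs in members.items():
        cells[key] = {
            # Simplification: each record counts once per cell it touches.
            "count": len(recs),
            # Degree of a label = maximum membership grade among the records.
            "degree": {a: max(r["labels"][a][lab] for r in recs) for a, lab in key},
            # Attribute-dependent measures computed on the raw values.
            "stats": {
                a: {
                    "min": min(r["raw"][a] for r in recs),
                    "max": max(r["raw"][a] for r in recs),
                    "mean": mean(r["raw"][a] for r in recs),
                }
                for a, _ in key
            },
        }
    return cells


cells = build_cells(records)
for key, cell in cells.items():
    print(dict(key), cell["count"], cell["degree"])
```

With this toy input, three records collapse into two cells, which illustrates why the number of cells stays well below the number of records.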

3.1.1.2 Summarization Service

The summarization service (SEQ) is the second and most sophisticated step of the SAINTETIQ system. It takes grid-cells as input and outputs a collection of summaries hierarchically arranged from the most generalized one (the root) to the most specialized ones (the leaves). Summaries are clusters of grid-cells, defining hyperrectangles in the multidimensional space. In the basic process, leaves are grid-cells themselves and the clustering task is performed on L cells rather than n tuples (L << n).
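One way to picture a summary as a hyperrectangle over grid-cells is to take, for each attribute, the union of the labels of the cells it covers. The sketch below assumes that representation. The two leaf cells, their counts, and the summarize helper are hypothetical and do not reproduce Table 8.

```python
def summarize(cells):
    """Build a parent summary covering the given cells.

    The intent is the per-attribute union of labels (a hyperrectangle in the
    grid), and the record count is the sum of the children's counts.
    """
    intent = {}
    count = 0
    for cell in cells:
        for attr, label in cell["labels"].items():
            intent.setdefault(attr, set()).add(label)
        count += cell["count"]
    return {"intent": intent, "count": count, "children": list(cells)}


# Hypothetical leaf cells (assumed labels and counts).
leaves = [
    {"labels": {"Price": "cheap", "Age": "old"}, "count": 2},
    {"labels": {"Price": "reasonable", "Age": "old"}, "count": 3},
]

root = summarize(leaves)
print(root["intent"], root["count"])
```

Since the clustering operates on the L cells rather than on the n underlying tuples, the cost of building such a parent depends on the size of the grid actually populated, not on the size of the table.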

From the mapping step, cells are introduced continuously into the hierarchy with a top-down approach inspired by D.H. Fisher's COBWEB, a conceptual clustering algorithm (Fisher, 1987). They are then incorporated into the best-fitting nodes while descending the tree. Three more operators, create, merge and split, can be applied to nodes depending on the partition's score. They make it possible to develop the tree and update its current state. Figure 16 represents the summary hierarchy built from the cells c1 and c2 of Table 8.
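The top-down incorporation can be sketched very roughly as below. This is not the SAINTETIQ or COBWEB procedure: the overlap score is a hypothetical stand-in for the partition score, only the incorporate and create operators are shown (merge and split are omitted), and the Node class, the threshold and the cell contents are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    intent: dict                      # attribute -> set of labels
    count: int
    children: list = field(default_factory=list)


def leaf(labels: dict, count: int) -> Node:
    """Wrap a grid-cell (one label per attribute) as a leaf node."""
    return Node({a: {lab} for a, lab in labels.items()}, count)


def absorb(node: Node, cell: Node) -> None:
    """Widen the node's hyperrectangle and count to cover the cell."""
    for attr, labels in cell.intent.items():
        node.intent.setdefault(attr, set()).update(labels)
    node.count += cell.count


def overlap(node: Node, cell: Node) -> float:
    """Hypothetical score: share of attributes whose labels the node already covers."""
    hits = sum(1 for a, labs in cell.intent.items() if labs <= node.intent.get(a, set()))
    return hits / len(cell.intent)


def incorporate(node: Node, cell: Node, threshold: float = 0.5) -> None:
    """Insert `cell` below `node`, descending into the best-fitting child."""
    if not node.children:
        if node.intent == cell.intent:        # same cell seen again
            node.count += cell.count
            return
        # The leaf becomes an internal node over its old self and the new cell.
        old = Node({a: set(s) for a, s in node.intent.items()}, node.count)
        node.children = [old, cell]
        absorb(node, cell)
        return
    absorb(node, cell)
    best = max(node.children, key=lambda child: overlap(child, cell))
    if overlap(best, cell) >= threshold:
        incorporate(best, cell, threshold)    # incorporate operator
    else:
        node.children.append(cell)            # create operator


# Hypothetical cells; the first one seeds the hierarchy.
root = leaf({"Price": "cheap", "Age": "old"}, 2)
incorporate(root, leaf({"Price": "reasonable", "Age": "old"}, 3))
print(root.intent, root.count, len(root.children))
```

After the second cell arrives, the root covers both cells and has them as its two leaves, which is the general shape one would expect for a hierarchy built from two cells.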
