Practical Approaches to the Many-Answer Problem - Advanced Database Query Systems

Database Reference

In-Depth Information

today in the context of Relational Databases. Nonetheless, in this subsection, we discuss the efficiency

and the effectiveness of ESA , ESA+SEQ and ESRA algorithms based on a real database.

3.6.1 Data Set

Through an agreement, the CIC Banking Group provided us with an excerpt of statistical data used for

behavioral studies of customers. The dataset is a collection of 33735 customers defined on 10 attributes

(e.g., age, income, occupation, etc.). On each attribute, marketing experts defined between 3 and 8 lin-

guistic labels leading to a total of 1036800 possible label combinations (i.e., 1036800 possible cells).

Note that all the experiments were done on a 1.7GHz P4-based computer with 768MB memory.

3.6.2 Results

All experiments reported in this section were conducted on a workload composed of 150 queries with a

random number of selection predicates from all attributes (i.e., each query has between 1 and 3 required

features on 1, 2, 3 or 4 attributes).

Quantitative Analysis

The CIC dataset is summarized by the SAINTETIQ system as described in Section 3.1. The dataset,

consisting of 33735 records, yields a summary tree with 13263 nodes, 6701 leaves or cells, maximum

depth of 16, average depth of 10.177, maximum width of 14 and an average width of 2.921. The data

distribution in the summary tree reveals a 0.6% ( 6701

1036800 ) occupation rate.

From the analysis of theoretical complexities, we claim that ESA and ESRA are much faster than the

post-clustering approach ESA+SEQ . That is the main result of Figure 20 that shows the performance

evolution according to the number of cells populated by query answer records. Furthermore, we plot

the number of summary nodes visited ( #VisitedNodes ) per query (right scale) and finally, the normal-

ized ESRA time cost ( t N. ESRA ) to evaluate the performance of ESRA regardless of how the query fits the

pre-clustering summary hierarchy. t N. ESRA is computed as follows:

t

TreeNodes

VisitedNodes

t

=

ESRA

N ESRA

.

As one can observe, Figure 20 verifies experimentally that ESA+SEQ is quasi-linear ( O(L·log L) ) in

the number of cells L whereas ESA , ESRA and N.ESRA are linear ( O(L) ). Besides, the time cost incurred

by rearranging query results (i.e., t ESRA -t ESA ) is insignificant compared to the search cost (i.e., t ESA ). For

instance, for L = 1006, t ESA and t ESRA are 0.235 sec and 0.287 sec , respectively. Thus, the ESRA algorithm

is able to drastically reduce the time cost of clustering query results.

Qualitative Analysis

Due to the difficulty of conducting a large-scale xix real-life user study, we discuss the effectiveness of

the ESRA algorithm based on structural properties of results provided to the end-user. It is worth notic-

Search WWH ::

Custom Search

Home