Database Reference
In-Depth Information
today in the context of Relational Databases. Nonetheless, in this subsection, we discuss the efficiency
and the effectiveness of ESA , ESA+SEQ and ESRA algorithms based on a real database.
3.6.1 Data Set
Through an agreement, the CIC Banking Group provided us with an excerpt of statistical data used for
behavioral studies of customers. The dataset is a collection of 33735 customers defined on 10 attributes
(e.g., age, income, occupation, etc.). On each attribute, marketing experts defined between 3 and 8 lin-
guistic labels leading to a total of 1036800 possible label combinations (i.e., 1036800 possible cells).
Note that all the experiments were done on a 1.7GHz P4-based computer with 768MB memory.
3.6.2 Results
All experiments reported in this section were conducted on a workload composed of 150 queries with a
random number of selection predicates from all attributes (i.e., each query has between 1 and 3 required
features on 1, 2, 3 or 4 attributes).
Quantitative Analysis
The CIC dataset is summarized by the SAINTETIQ system as described in Section 3.1. The dataset,
consisting of 33735 records, yields a summary tree with 13263 nodes, 6701 leaves or cells, maximum
depth of 16, average depth of 10.177, maximum width of 14 and an average width of 2.921. The data
distribution in the summary tree reveals a 0.6% ( 6701
1036800 ) occupation rate.
From the analysis of theoretical complexities, we claim that ESA and ESRA are much faster than the
post-clustering approach ESA+SEQ . That is the main result of Figure 20 that shows the performance
evolution according to the number of cells populated by query answer records. Furthermore, we plot
the number of summary nodes visited ( #VisitedNodes ) per query (right scale) and finally, the normal-
ized ESRA time cost ( t N. ESRA ) to evaluate the performance of ESRA regardless of how the query fits the
pre-clustering summary hierarchy. t N. ESRA is computed as follows:
t
TreeNodes
VisitedNodes
t
=
ESRA
N ESRA
.
As one can observe, Figure 20 verifies experimentally that ESA+SEQ is quasi-linear ( O(L·log L) ) in
the number of cells L whereas ESA , ESRA and N.ESRA are linear ( O(L) ). Besides, the time cost incurred
by rearranging query results (i.e., t ESRA -t ESA ) is insignificant compared to the search cost (i.e., t ESA ). For
instance, for L = 1006, t ESA and t ESRA are 0.235 sec and 0.287 sec , respectively. Thus, the ESRA algorithm
is able to drastically reduce the time cost of clustering query results.
Qualitative Analysis
Due to the difficulty of conducting a large-scale xix real-life user study, we discuss the effectiveness of
the ESRA algorithm based on structural properties of results provided to the end-user. It is worth notic-
 
Search WWH ::




Custom Search