of attributes can take), the expected number of distinct elements obtained is
$n - n \left(1 - \frac{1}{n}\right)^r$. This assumes that data are uniformly distributed. If it turns
out not to be the case and data present some skew, we will be overestimating
the size of the data cube. For instance, let us suppose a relation R(ProductKey,
CustomerKey, TimeKey). If we want to estimate the size of the aggregation
over ProductKey and CustomerKey, we should know the number of different
values of each attribute. Then, $n = |ProductKey| \times |CustomerKey|$, and $r$ is the
number of tuples in R. The main advantage of this method is its simplicity
and performance. The obvious drawback of the algorithm is that it does not
consult the database, and the results can be used only when we know that
data are uniformly distributed.
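To make the formula concrete, the following Python sketch evaluates this estimate; the attribute cardinalities and the number of tuples r are hypothetical values chosen only for illustration.

```python
def expected_distinct(n: int, r: int) -> float:
    """Expected number of distinct groups when r tuples are drawn
    uniformly over n possible attribute-value combinations."""
    return n - n * (1.0 - 1.0 / n) ** r

# Hypothetical cardinalities for R(ProductKey, CustomerKey, TimeKey):
# 1,000 products and 10,000 customers give n = 10,000,000 possible
# (ProductKey, CustomerKey) pairs; assume R holds r = 5,000,000 tuples.
n = 1_000 * 10_000
r = 5_000_000
print(f"Estimated aggregation size: {expected_distinct(n, r):,.0f} tuples")
```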
The basic idea of the sampling-based algorithm is to take a random
subset of the database and compute the cube over this subset. Let D be the
database, S the sample, and Cube(S) the size of the cube computed from S.
The size of the cube will be estimated as $\mathit{Cube}(S) \times |D| / |S|$. This method is simple
and fast, and it has been reported that it provides satisfactory results over
real-world data sets.
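As an illustration, the following Python sketch applies this idea to a single aggregation rather than the whole cube (the full method repeats the computation for every group-by combination); the relation R, its attribute names, and the sampling fraction are assumptions made for the example.

```python
import random

def estimate_aggregation_size(database, group_by, sample_fraction=0.05):
    """Estimate the number of tuples in an aggregation by computing it
    over a random sample S of the database D and scaling by |D| / |S|."""
    sample = [t for t in database if random.random() < sample_fraction]
    if not sample:
        return 0.0
    distinct_groups = {tuple(t[attr] for attr in group_by) for t in sample}
    return len(distinct_groups) * len(database) / len(sample)

# Hypothetical relation R(ProductKey, CustomerKey, TimeKey)
R = [{"ProductKey": random.randrange(1_000),
      "CustomerKey": random.randrange(10_000),
      "TimeKey": random.randrange(365)}
     for _ in range(100_000)]
print(estimate_aggregation_size(R, ("ProductKey", "CustomerKey")))
```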
The probabilistic counting algorithm is based on the following obser-
vation: suppose we want to compute the number of tuples of the aggregation
of sales by product category and shipper. We would first aggregate along the
dimension Product , to generate product categories, and count the number
of distinct shippers generated by this operation. For example, for the set of
product-shipper pairs {(p1, s1), (p2, s1), (p3, s2), (p4, s4), (p5, s4)}, if p1 and
p2 correspond to category c1 and the rest to category c2, the aggregation
will have three tuples: {(c1, s1), (c2, s2), (c2, s4)}. In other words, c1 yields
only one value of shipper, and c2 yields two distinct values of shipper. Thus,
by estimating the number of distinct elements in each group (in this case, shippers
per category), we can estimate the number of tuples in the aggregation. This idea is
used to estimate the size of a data cube by means of counting the number
of distinct elements in a multiset as proposed in a well-known algorithm
by Flajolet and Martin, performing this for all possible combinations of the
hierarchies in the cube. The algorithm estimates the sizes of the aggregations
in a cube at the cost of scanning the whole database once. However, this is
cheaper than actually computing the cube, and the error is provably bounded.
Details of this algorithm fall beyond the scope of this book.
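To give the flavor of the distinct-counting step, here is a minimal single-hash Flajolet-Martin estimator in Python; this is only a sketch (practical implementations average many independent hash functions to reduce the variance), and the hash function and constants used are illustrative.

```python
import hashlib

PHI = 0.77351  # Flajolet-Martin correction constant

def rho(x: int) -> int:
    """0-based position of the least-significant 1 bit of x."""
    return (x & -x).bit_length() - 1 if x else 32

def fm_estimate(elements) -> float:
    """Single-hash Flajolet-Martin estimate of the number of distinct
    elements in a multiset: record which bit positions rho(h(e)) occur,
    then return 2**R / PHI, where R is the lowest position never seen."""
    bitmap = 0
    for e in elements:
        h = int.from_bytes(hashlib.md5(str(e).encode()).digest()[:4], "big")
        bitmap |= 1 << rho(h)
    R = 0
    while bitmap & (1 << R):
        R += 1
    return 2 ** R / PHI

# Multiset with 1,000 distinct values, each occurring ten times
data = [i % 1_000 for i in range(10_000)]
print(f"Estimated number of distinct elements: {fm_estimate(data):.0f}")
```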
7.4.3 Partial Computation of a Data Cube
Generally speaking, three alternatives exist to implement a data warehouse:
materializing the whole data cube (as studied in the previous section),
materializing a selected portion of the cube, and materializing no
aggregations at all. Materializing the whole cube has not only the drawback of
 