of attributes can take), the expected number of distinct elements obtained is
$n - n \left(1 - \frac{1}{n}\right)^r$. This assumes that data are uniformly distributed. If it turns
out not to be the case and data present some skew, we will be overestimating
the size of the data cube. For instance, let us suppose a relation R(ProductKey,
CustomerKey, TimeKey). If we want to estimate the size of the aggregation
over ProductKey and CustomerKey, we should know the number of different
values of each attribute. Then, $n = |ProductKey| \times |CustomerKey|$, and $r$ is the
number of tuples in R. The main advantage of this method is its simplicity
and performance. The obvious drawback of the algorithm is that it does not
consult the database, and the results can be used only when we know that
data are uniformly distributed.
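To make the formula concrete, the following Python sketch evaluates this estimate; the attribute cardinalities and the number of tuples r are hypothetical values chosen only for illustration.

```python
def expected_distinct(n: int, r: int) -> float:
    """Expected number of distinct groups when r tuples are drawn
    uniformly over n possible attribute-value combinations."""
    return n - n * (1.0 - 1.0 / n) ** r

# Hypothetical cardinalities for R(ProductKey, CustomerKey, TimeKey):
# 1,000 products and 10,000 customers give n = 10,000,000 possible
# (ProductKey, CustomerKey) pairs; assume R holds r = 5,000,000 tuples.
n = 1_000 * 10_000
r = 5_000_000
print(f"Estimated aggregation size: {expected_distinct(n, r):,.0f} tuples")
```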
The basic idea of the sampling-based algorithm is to take a random
subset of the database and compute the cube over this subset. Let D be the
database, S the sample, and Cube(S) the size of the cube computed from S.
The size of the cube will be estimated as $\mathit{Cube}(S) \times |D| / |S|$. This method is simple
and fast, and it has been reported that it provides satisfactory results over
real-world data sets.
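As an illustration, the following Python sketch applies this idea to a single aggregation rather than the whole cube (the full method repeats the computation for every group-by combination); the relation R, its attribute names, and the sampling fraction are assumptions made for the example.

```python
import random

def estimate_aggregation_size(database, group_by, sample_fraction=0.05):
    """Estimate the number of tuples in an aggregation by computing it
    over a random sample S of the database D and scaling by |D| / |S|."""
    sample = [t for t in database if random.random() < sample_fraction]
    if not sample:
        return 0.0
    distinct_groups = {tuple(t[attr] for attr in group_by) for t in sample}
    return len(distinct_groups) * len(database) / len(sample)

# Hypothetical relation R(ProductKey, CustomerKey, TimeKey)
R = [{"ProductKey": random.randrange(1_000),
      "CustomerKey": random.randrange(10_000),
      "TimeKey": random.randrange(365)}
     for _ in range(100_000)]
print(estimate_aggregation_size(R, ("ProductKey", "CustomerKey")))
```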
The probabilistic counting algorithm is based on the following obser-
vation: suppose we want to compute the number of tuples of the aggregation
of sales by product category and shipper. We would first aggregate along the
dimension Product , to generate product categories, and count the number
of distinct shippers generated by this operation. For example, for the set of
product-shipper pairs {(p1, s1), (p2, s1), (p3, s2), (p4, s4), (p5, s4)}, if p1 and
p2 correspond to category c1 and the rest to category c2, the aggregation
will have three tuples: {(c1, s1), (c2, s2), (c2, s4)}. In other words, c1 yields
only one value of shipper, and c2 yields two distinct values of shipper. Thus,
by estimating the number of distinct elements in each group (in this case, shippers
per category), we can estimate the number of tuples in the aggregation. This idea is
used to estimate the size of a data cube by means of counting the number
of distinct elements in a multiset as proposed in a well-known algorithm
by Flajolet and Martin, performing this for all possible combinations of the
hierarchies in the cube. The algorithm estimates the sizes of the aggregations
in a cube at the cost of scanning the whole database once. However, this is
cheaper than actually computing the cube, and the error is provably bounded.
Details of this algorithm fall beyond the scope of this book.
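To give the flavor of the distinct-counting step, here is a minimal single-hash Flajolet-Martin estimator in Python; this is only a sketch (practical implementations average many independent hash functions to reduce the variance), and the hash function and constants used are illustrative.

```python
import hashlib

PHI = 0.77351  # Flajolet-Martin correction constant

def rho(x: int) -> int:
    """0-based position of the least-significant 1 bit of x."""
    return (x & -x).bit_length() - 1 if x else 32

def fm_estimate(elements) -> float:
    """Single-hash Flajolet-Martin estimate of the number of distinct
    elements in a multiset: record which bit positions rho(h(e)) occur,
    then return 2**R / PHI, where R is the lowest position never seen."""
    bitmap = 0
    for e in elements:
        h = int.from_bytes(hashlib.md5(str(e).encode()).digest()[:4], "big")
        bitmap |= 1 << rho(h)
    R = 0
    while bitmap & (1 << R):
        R += 1
    return 2 ** R / PHI

# Multiset with 1,000 distinct values, each occurring ten times
data = [i % 1_000 for i in range(10_000)]
print(f"Estimated number of distinct elements: {fm_estimate(data):.0f}")
```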
7.4.3 Partial Computation of a Data Cube
Generally speaking, three alternatives exist to implement a data warehouse:
materializing the whole data cube (as studied in the previous section),
materializing a selected portion of the cube, and materializing no
aggregations at all. Materializing the whole cube has not only the drawback of
 