QoS-Oriented Grid-Enabled Data Warehouses - Data Warehousing Design and Advanced Engineering Applications


der to obtain data about sales in Paris and, then, individually on each of its arrondissements.

To reduce data movement across sites, and thereby improve the system's performance, the grid-based DW tables are physically distributed across different sites. This distribution is represented in a Global Physical Schema (GPS). Ideally, the physical data distribution strategy is transparent to users, who should submit queries against a unified Logical Model (LM).

Grids are highly heterogeneous environments. Different types of resources may be available at each site (shared-nothing and shared-disk parallel machines, for example), so it is difficult to find a single intra-site allocation strategy that is optimal in every possible situation. Therefore, each site may use its own local physical allocation strategy (e.g., Multi-Dimensional Hierarchical Fragmentation, MDHF (Stöhr et al., 2000), or the Node-Partitioned Data Warehouse strategy, NPDW (Furtado, 2004)). The relations existing at each site are represented in a Local Site Physical Schema (LSPS). This assumption fits well with the idea of domain autonomy, which is one of the grid's characteristics.

In the generic grid-based DW, nodes from any site can load data into the database, but the same data cannot be loaded from distinct sites. This leads to the idea that each piece of data has a single site, which we call its Data Source Site, that is its primary source. To reduce data movement across the grid's sites (considering the abovementioned geographically related access patterns), each site should maintain a copy of the facts data it has loaded into the DW (in this chapter, we consider that tables in the LM are organized in a star schema). The result is a globally, physically partitioned facts table that uses the values of a site source attribute as its partitioning criterion.
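The partitioning just described can be illustrated with a minimal sketch. It is not the chapter's implementation: the row layout, the column name `source_site` (standing in for the site source attribute), and the site names are assumptions made only for the example.

```python
from collections import defaultdict

# Hypothetical fact rows. "source_site" stands in for the site source
# attribute; the other columns are illustrative only.
SALES_FACTS = [
    {"source_site": "paris", "product_id": 1, "amount": 120.0},
    {"source_site": "lisbon", "product_id": 2, "amount": 75.5},
    {"source_site": "paris", "product_id": 2, "amount": 30.0},
]


def partition_by_source_site(rows):
    """Group fact rows into one fragment per Data Source Site."""
    fragments = defaultdict(list)
    for row in rows:
        fragments[row["source_site"]].append(row)
    return dict(fragments)


# Each site would keep the fragment keyed by its own name.
FRAGMENTS = partition_by_source_site(SALES_FACTS)
```

Because every row carries exactly one site source value, the fragments are disjoint and together cover the whole facts table.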

Depending on the implementation, the site source attribute values may be combined with values of other existing dimensions. In fact, repartitioning each site source-based fragment of the facts table into several smaller fragments can benefit the system in several ways. For instance, each smaller fragment can then be replicated to distinct sites, which would increase the system's degree of parallelism. Besides that, depending on the selection predicate, some queries may access only a subset of the smaller fragments, which would be faster than accessing the whole original site source-based fragment. These two situations are represented in Figure 2. Therefore, even at the global level, other partitioning criteria should be used together with the site source attribute. Using the most frequently used equi-join attributes as part of the partitioning criteria for the facts table can improve performance by reducing data movement across sites when executing queries, as it does in shared-nothing parallel machines (Furtado, 2004b).
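As a rough illustration of the second benefit, the sketch below repartitions rows on the pair (site source, second attribute) and then selects only the fragments that an equality predicate can touch. The column names `source_site` and `year`, and the helper names, are hypothetical and are not taken from the chapter.

```python
from collections import defaultdict


def repartition(rows, sub_attribute):
    """Split rows into fragments keyed by (site source value, sub_attribute value)."""
    fragments = defaultdict(list)
    for row in rows:
        fragments[(row["source_site"], row[sub_attribute])].append(row)
    return dict(fragments)


def fragments_to_scan(fragments, source_site=None, sub_value=None):
    """Return the keys of fragments that may hold rows matching the
    given equality predicates; None means "no predicate on this part"."""
    return sorted(
        key
        for key in fragments
        if (source_site is None or key[0] == source_site)
        and (sub_value is None or key[1] == sub_value)
    )


# Hypothetical fact rows with an assumed "year" column.
ROWS = [
    {"source_site": "paris", "year": 2003, "amount": 10.0},
    {"source_site": "paris", "year": 2004, "amount": 20.0},
    {"source_site": "lisbon", "year": 2004, "amount": 5.0},
]
SUB_FRAGMENTS = repartition(ROWS, "year")
```

For example, `fragments_to_scan(SUB_FRAGMENTS, source_site="paris", sub_value=2004)` selects a single fragment instead of the whole Paris fragment. Replication of the smaller fragments to other sites is not modeled here.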

Besides the facts table's partitions, each site should also store dimension tables' data. Dimension tables may be fully replicated across all sites to reduce inter-site data movement during query execution and to improve data availability. This strategy is feasible when dimension tables are small (it also facilitates system management). When large dimension tables are present, however, they can be fragmented at both the intra-site and inter-site levels in order to improve performance and QoS levels. The intra-site dimension table fragmentation strategy depends on the locally chosen physical allocation strategy (which in turn depends on the type of locally available resources, as discussed earlier). Inter-site fragmentation of large dimension tables should follow a strategy similar to that of facts table fragmentation: initially, dimension data should remain at its Data Source Site, and inter-site replication is done when necessary. Derived partitioning of the facts table can also be done, improving the system's performance because join operations can be broken into subjoins that are executed in parallel at distinct sites. Although the use of derived partitioning for the facts table depends on the semantics of the stored data, this kind of partitioning should be used together with the aforementioned partitioning based on the site source attribute.
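The following sketch shows, under stated assumptions, why derived partitioning lets a join be broken into independent subjoins. The tables, the `customer_id` key, and the hash-based fragmentation of the dimension are all hypothetical. The subjoins are run one after another here, whereas in the grid each would run at a distinct site.

```python
def fragment_dimension(dim_rows, n_fragments):
    """Hash-partition a dimension table on its key (an assumed strategy)."""
    fragments = [[] for _ in range(n_fragments)]
    for row in dim_rows:
        fragments[row["customer_id"] % n_fragments].append(row)
    return fragments


def derive_fact_fragments(fact_rows, dim_fragments):
    """Derived partitioning: put each fact row in the fragment whose
    dimension fragment holds the dimension row it references."""
    owner = {
        row["customer_id"]: index
        for index, fragment in enumerate(dim_fragments)
        for row in fragment
    }
    fact_fragments = [[] for _ in dim_fragments]
    for row in fact_rows:
        fact_fragments[owner[row["customer_id"]]].append(row)
    return fact_fragments


def join(fact_rows, dim_rows):
    """Equi-join fact rows with dimension rows on customer_id."""
    by_key = {d["customer_id"]: d for d in dim_rows}
    return [
        {**f, **by_key[f["customer_id"]]}
        for f in fact_rows
        if f["customer_id"] in by_key
    ]


# Hypothetical dimension and fact data.
CUSTOMERS = [
    {"customer_id": 1, "city": "Paris"},
    {"customer_id": 2, "city": "Lisbon"},
    {"customer_id": 3, "city": "Paris"},
]
SALES = [
    {"customer_id": 1, "amount": 10.0},
    {"customer_id": 2, "amount": 20.0},
    {"customer_id": 3, "amount": 5.0},
    {"customer_id": 1, "amount": 7.5},
]

DIM_FRAGMENTS = fragment_dimension(CUSTOMERS, 2)
FACT_FRAGMENTS = derive_fact_fragments(SALES, DIM_FRAGMENTS)

# One subjoin per co-located (fact fragment, dimension fragment) pair.
SUBJOIN_RESULT = [
    row
    for facts, dims in zip(FACT_FRAGMENTS, DIM_FRAGMENTS)
    for row in join(facts, dims)
]
```

Since every fact row is stored with the dimension fragment it references, no subjoin needs rows from another fragment, and the union of the subjoin results contains the same rows as the join over the unfragmented tables.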
