Considering a parallel environment, this means that it is worthwhile to partition the central fact table into multiple pieces that can be processed in parallel, while the dimension tables are left whole. In a shared-nothing environment, this means copying the dimension relations to all nodes and dividing the fact table across the nodes. This approach is very useful for parallel processing, since most operations, including joins, can proceed in parallel (Furtado 2005).
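A minimal sketch of this allocation scheme (the node count, table contents, and function names are illustrative assumptions, not part of the original work): the fact table is hash-partitioned on a key so that each row lands on exactly one node, while every small dimension table is copied to all nodes.

```python
NUM_NODES = 4  # illustrative cluster size

def allocate(fact_rows, dimension_tables, key):
    """Hash-partition the fact rows; replicate every dimension to each node."""
    nodes = [{"fact": [], "dims": dict(dimension_tables)}
             for _ in range(NUM_NODES)]
    for row in fact_rows:
        node_id = hash(row[key]) % NUM_NODES  # partition on the key's hash
        nodes[node_id]["fact"].append(row)
    return nodes

fact = [{"prod_id": i, "amount": i * 10} for i in range(8)]
dims = {"product": {i: "name-%d" % i for i in range(8)}}
nodes = allocate(fact, dims, "prod_id")

# Every fact row lands on exactly one node; dimensions appear on all nodes.
assert sum(len(n["fact"]) for n in nodes) == len(fact)
assert all(n["dims"]["product"] == dims["product"] for n in nodes)
```

Because each node holds a full copy of every dimension, it can join its fact fragment with the dimensions locally, which is why most operations proceed in parallel without data movement.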
Works such as (Furtado 2007, Rao et al. 2002) deal with more complex schemas that may include bigger dimensions and multiple interconnected big relations that, for best performance, need to be partitioned. In those cases, simultaneous partitioning of multiple relations may pose problems for the parallel join operator. As a simple rule, if at most two of the relations participating in a join are partitioned, they can be partitioned by their common join key (equi-partitioned), and parallel join processing can proceed independently in all nodes; if, on the other hand, more than two relations need to be partitioned, then substantial exchanges of data may be necessary to process broadcast and redirected joins, as described in (Rao et al. 2002). When redirection is necessary, workload-based partitioning solutions such as the ones referenced in (Furtado 2007, Rao et al. 2002) can be adopted, which try to co-locate the relations that yield the highest query processing gains.
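The equi-partitioning rule can be sketched with a toy example (the relations, node count, and partitioning function below are illustrative assumptions): two relations partitioned by the same join key can be joined node-locally, because matching rows are guaranteed to reside on the same node.

```python
NUM_NODES = 3  # illustrative cluster size

def partition(rows, key):
    """Hash-partition rows on the given key into one fragment per node."""
    frags = [[] for _ in range(NUM_NODES)]
    for r in rows:
        frags[hash(r[key]) % NUM_NODES].append(r)
    return frags

# Two relations equi-partitioned on their common join key, cust_id.
sales = [{"cust_id": i, "amount": i} for i in range(6)]
customers = [{"cust_id": i, "region": "EU"} for i in range(6)]
sales_frags = partition(sales, "cust_id")
cust_frags = partition(customers, "cust_id")

# Each node joins its own fragments independently; no data exchange.
result = []
for n in range(NUM_NODES):
    idx = {c["cust_id"]: c for c in cust_frags[n]}
    for s in sales_frags[n]:
        if s["cust_id"] in idx:
            result.append({**s, **idx[s["cust_id"]]})

assert len(result) == 6  # every sale found its customer on the same node
```

Had a third relation been partitioned on a different key, some fragments would no longer match locally, and rows would have to be exchanged between nodes before joining.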
The IBM DB2 server (IBM 2008) allows the definition of any number of partitions (up to some limit) for dividing tables across a set of nodes. In that system, parallel joins are processed in different manners, depending on the co-location status of the joining relations: in a co-located join, the data is already partitioned according to the join hash values, so the join can occur in parallel with no data exchange between nodes; in a redirected join, the data is not co-located, but it is enough to re-locate the rows of one of the source data sets in order to proceed with a co-located join; in a repartitioned join, both source data sets need to be re-located in order to become co-located; in a broadcast join, one of the source data sets is broadcast to all nodes to enable parallel joining with the other, partitioned data set.
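A simplified sketch of how a planner might choose among these four strategies, given each relation's partitioning key. This is not DB2's actual optimizer; the function, the size threshold, and the relation descriptions are illustrative assumptions.

```python
SMALL = 1000  # assumed row count below which broadcasting is cheaper

def join_strategy(r1, r2, join_key):
    """Pick a parallel join strategy.

    r1, r2: dicts with 'part_key' (the partitioning attribute) and 'rows'.
    """
    if r1["part_key"] == join_key and r2["part_key"] == join_key:
        return "co-located"     # no data exchange needed
    if r1["part_key"] == join_key or r2["part_key"] == join_key:
        return "redirected"     # re-locate rows of one relation only
    if min(r1["rows"], r2["rows"]) < SMALL:
        return "broadcast"      # ship the small relation to every node
    return "repartitioned"      # both relations must be re-located

fact = {"part_key": "cust_id", "rows": 10_000_000}
dim = {"part_key": "date_id", "rows": 200}

assert join_strategy(fact, fact, "cust_id") == "co-located"
assert join_strategy(fact, dim, "cust_id") == "redirected"
assert join_strategy(fact, dim, "prod_id") == "broadcast"
```

The ordering of the checks mirrors the relative costs: co-located joins move no data, redirected joins move one relation, and broadcast or repartitioned joins move the most.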
In this work we assume schemas and workloads that do not require redirection, that is, simple schemas such as the one in Figure 1, where there is a large fact table and small dimensions, or schemas with at most two equi-partitioned relations (partitioned by the same key attribute), assuming that the remaining relations are replicated. Future work includes extending the current rationale of the ChunkSim simulator to deal with any generic schema.
Basic Query Processing
Given a data allocation in a shared-nothing environment, the next important issue is how queries are to be processed. Query plans are determined by a cost-based parallel query optimizer, which evaluates alternative query processing paths and chooses the most efficient one according to a cost model. The result of the optimization must then be rewritten as a set of queries and data transfer operations. In this section we illustrate how queries can be decomposed into a number of smaller sub-queries that act on fragments independently in a SN architecture, with significant speedup. This subject has been discussed in several works (Akinde et al. 2003, Furtado 2005, Stohr 2000); our illustration is based on the parallel query processing approach followed by the Data Warehouse Parallel Architecture (DWPA) (Furtado 2005, Furtado 2007).
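The decomposition can be sketched for a simple global aggregate (the fragments and helper names below are illustrative assumptions, not the actual DWPA rewriter): each node runs the same sub-query over its fact fragment, and a final step merges the partial results.

```python
# Per-node fact fragments, as produced by the partitioning step.
fragments = [
    [{"amount": 10}, {"amount": 20}],   # node 0's fragment
    [{"amount": 5}],                    # node 1's fragment
    [{"amount": 30}, {"amount": 35}],   # node 2's fragment
]

def run_subquery(fragment):
    """Per-node sub-query: SELECT SUM(amount) FROM fact_fragment."""
    return sum(r["amount"] for r in fragment)

# In a SN system the sub-queries run in parallel, one per node.
partials = [run_subquery(f) for f in fragments]

# Final merge: SUM of the partial sums equals the global SUM.
total = sum(partials)
assert total == 100
```

SUM decomposes into a sum of partial sums; other aggregates need rewritten merge steps (e.g. a global AVG is computed from per-node SUM and COUNT pairs, not from per-node averages).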
Figure 2 illustrates the basic architecture of the shared-nothing, node-partitioned data warehouse. It includes three major entities: the Query Submitter, the Executor, and the Controller. Submitters are simple services that may reside in any computer, do not require an underlying database server, and submit queries to the system. Once submitted, a query is parsed and transformed into high-level actions by a query planner. These actions are then transformed into Command Lists for each Executor. Executors