ChunkSim - Data Warehousing Design and Advanced Engineering Applications


many replicas there will be for dataset D. Given a replication degree r, there will be $\lceil Cc + r \times Cc \rceil$ chunks in the system.

for processing, local query runtimes, and results transfer and merge times (according to the WP information). The output of the simulation is the average expected runtime value (secs).

Experiment: an experiment is a set of runs in which one parameter varies, in order to analyze the effect of that parameter on the performance or availability properties of the system. Additionally, for statistical relevance, ChunkSim always runs a predefined number of times (100 by default) for each parameter value.
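The run and experiment notions described above can be illustrated with a minimal sketch. It is not ChunkSim's code: the function names, the per-node cost list, and the optional jitter are assumptions made for illustration. The sketch covers only the on-demand local processing part, ignores transfer and merge times, and assumes any node may process any chunk.

```python
import heapq
from statistics import mean


def simulate_run(chunk_secs_per_node, num_chunks, rng=None, jitter=0.0):
    """One event-based run: each idle node pulls the next chunk on demand.

    chunk_secs_per_node[i] is the (assumed) time node i needs per chunk.
    Returns the time in seconds at which the last chunk finishes.
    """
    # Event queue of (time at which the node becomes idle, node id).
    idle_events = [(0.0, node) for node in range(len(chunk_secs_per_node))]
    heapq.heapify(idle_events)
    finish = 0.0
    for _ in range(num_chunks):
        now, node = heapq.heappop(idle_events)  # earliest idle node
        cost = chunk_secs_per_node[node]
        if rng is not None and jitter > 0:
            # Optional random variation, so repeated runs differ.
            cost *= 1 + rng.uniform(-jitter, jitter)
        done = now + cost
        finish = max(finish, done)
        heapq.heappush(idle_events, (done, node))
    return finish


def run_experiment(param_values, run_once, repeats=100):
    """Average `repeats` runs for each value of the varied parameter."""
    return [(v, mean(run_once(v) for _ in range(repeats)))
            for v in param_values]
```

Here `run_once` is a hypothetical callback that maps one parameter value to the runtime of a single run, and the default of 100 repetitions mirrors the default stated in the text.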

A replication degree of 0 means that the data set will have no replicas: no chunk will be replicated into any other node. A replication degree of 0.5 means that half of the data set chunks will be replicated into additional nodes. A replication degree of 2 means that all chunks of a data set will have two replicas located in two other nodes. A replication degree of 15 in a 16-node system corresponds to full mirroring. In terms of size, a 100 GB data set will occupy a total of 1.6 TB with full replication on 16 nodes (r=15), 300 GB with r=2 and 150 GB with r=0.5. The smaller the replication degree, the lower the loading and storage requirements.
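The storage figures above follow from simple arithmetic, sketched below. The function names are hypothetical, and the sketch assumes that a replication degree r adds r times the original data, with the chunk count rounded up to whole chunks.

```python
import math


def total_chunks(num_chunks: int, r: float) -> int:
    """Original chunks plus replicas, rounded up to whole chunks (assumed)."""
    return math.ceil(num_chunks + r * num_chunks)


def total_storage_gb(dataset_gb: float, r: float) -> float:
    """Space occupied by the data set and its replicas, in GB."""
    return dataset_gb * (1 + r)


def full_mirroring_degree(num_nodes: int) -> int:
    """Replication degree at which every node holds every chunk."""
    return num_nodes - 1
```

For the 100 GB data set of the text, `total_storage_gb(100, 2)` gives 300 GB, and `total_storage_gb(100, full_mirroring_degree(16))` gives 1600 GB, that is, 1.6 TB.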

Replication alternatives follow the same logic as placement alternatives, meaning that Rl-C, Rl-H and Rl-W correspond to Pl-C, Pl-H, Pl-W and Pl-Wf. These are called replication policies.

This set of placement and replication alternatives is the basic set already implemented in ChunkSim and in the actual DWPA parallel data warehouse architecture prototype (Furtado 2007). Other semi-automated approaches can be added that may, for instance, take into account groups of nodes for availability or performance reasons.

ChunkSim offers the following experiments:

Performance Analysis of Replication Degrees (PARD) - this experiment answers the question of how different replication degrees influence system performance;

Additional inputs: replication degrees array;

Outputs: a set of tuples (Replication Degree, Time PL-H (LP), Time FM (LP), Time (LP), Time Query). The “Time (LP)” and “Time Query” fields are the average expected runtimes of the Local Processing (LP) part of a query and of the whole query, respectively (the LP part of a query is the fraction that is processed locally at each node, before transfer and merge times). The “Time PL-H (LP)” and “Time FM (LP)” fields are for comparison purposes, since they represent the “slow” Homogeneous Placement with no replicas (PL-H) and the “fast” Full Mirroring (FM) runtimes, respectively.
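One way to picture a PARD output tuple is as a small record whose two measured fields are averages over the repeated runs. The sketch below is an illustration only: the class name, field names, and the helper `pard_row` are assumptions, not ChunkSim's actual data structures.

```python
from statistics import mean
from typing import NamedTuple, Sequence


class PardRow(NamedTuple):
    """One output tuple of a PARD-style report (all times in seconds)."""
    replication_degree: float
    time_plh_lp: float   # baseline: homogeneous placement, no replicas
    time_fm_lp: float    # baseline: full mirroring
    time_lp: float       # average local processing time at this degree
    time_query: float    # average whole-query time at this degree


def pard_row(degree: float,
             lp_samples: Sequence[float],
             query_samples: Sequence[float],
             plh_lp: float,
             fm_lp: float) -> PardRow:
    """Build one row by averaging the per-run samples for this degree."""
    return PardRow(degree, plh_lp, fm_lp,
                   mean(lp_samples), mean(query_samples))
```

Keeping the two baselines in every row makes each tuple directly comparable against the slowest and fastest configurations without a separate lookup.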

ChunkSim Estimation of Performance and Availability

The ChunkSim simulator implements the data allocation alternatives (placement and replication) and collects the system configuration information (WP, CL and RL) that it needs to model the system. We further define a run and an experiment as the actions the simulator uses to output an analysis report:

Run: a run is a simple event-based simulation of the on-demand, chunk-wise processing algorithm of Figure 4, simulating on-demand assignment of chunks to nodes

For illustration purposes, Table 1 shows an example of the output report of PARD (corresponding to the experimental setup that will be shown later on in Section 5). In that table, the Replication Degree (RD) quantifies how much replication there is for the chunks in the SN system. For instance, RD=10% means that only 10% of the fact chunks have one copy, while 100% means
