et al. 1996) built a prototype of ds-RAID and the
work in (Hwang et al. 2002) further evolved the
concept with an orthogonal striping and mirroring
approach. These are low-level, data-block approaches similar to RAID, but applied across multiple networked nodes. Their basic concepts are useful for parallel data warehousing, but they must be applied at a higher level, that of partitions and processing units: had we used low-level approaches for parallel inter-node processing in our setup, disk block-wise solutions would not spare the very expensive data transfers from the disks to the processing nodes required to retrieve whole partitions.
At the OLAP query-processing level, the authors of (Akal et al. 2002) propose efficient data processing in a database cluster by means of full mirroring, creating multiple node-bound sub-queries from each query by adding predicates. This way,
each node receives a sub-query and is forced to
execute over a subset of the data, called a virtual
partition. Virtual partitions are more flexible than
physical ones, since only the ranges need to be
changed, but do not prevent full table scans when
the range is not small. The authors of (Lima et al. 2004, Lima et al. 2004b) further propose issuing multiple small sub-queries per node instead of just one, together with an adaptive partition-size tuning approach.
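The virtual-partition idea above can be sketched as follows. This is a minimal illustration, assuming an integer partitioning key with a known range and a base query that already contains a WHERE clause; the table and column names are hypothetical, not taken from the cited works:

```python
# Minimal sketch of virtual partitioning: each node receives the same
# query restricted by an added range predicate over a partitioning key.
# Assumes an integer key with a known [key_min, key_max) range and a
# base query that already contains a WHERE clause.

def virtual_partition_queries(base_query, key, key_min, key_max, num_nodes):
    """Split [key_min, key_max) into num_nodes contiguous ranges and
    append one range predicate per node to the base query."""
    step = (key_max - key_min) // num_nodes
    queries = []
    for i in range(num_nodes):
        lo = key_min + i * step
        # The last node absorbs any remainder so the whole range is covered.
        hi = key_max if i == num_nodes - 1 else lo + step
        queries.append(f"{base_query} AND {key} >= {lo} AND {key} < {hi}")
    return queries

# Example: four node-bound sub-queries over a hypothetical order-key range
subqueries = virtual_partition_queries(
    "SELECT SUM(l_extendedprice) FROM lineitem WHERE l_discount > 0.05",
    "l_orderkey", 0, 6000000, 4)
```

Only the range predicates differ between nodes, which is what makes virtual partitions easy to re-balance; a large range, however, still implies scanning a large portion of the table.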
The rationale in all these works still suffers from data-access performance limitations: full table scanning by every or most nodes, as in (Akal et al. 2002), is more expensive than having each node access only a part of the data; the alternative in (Lima et al. 2004, Lima et al. 2004b) of reducing the query predicate ranges to avoid full table scans has the limitation that it assumes some form of index-based access instead of full scans. Unless the data is clustered index-key-wise (which should not be assumed), index-based access can be slow for queries that need to go through large fractions of a data set (frequent in OLAP). These
limitations are absent in physically partitioned data sets, as long as the query processor can restrict which partitions to access. For this reason, we consider chunks: fixed-size partitions that, besides partitioning, can also be used for replication and efficient load-balancing.
Basic partitioning into chunks can be imple-
mented in any relational database management
system (RDBMS), by either creating separate
tables for each chunk or using the partitioning
interface offered by the RDBMS.
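As a concrete illustration of the first option, chunk tables can be created by generating one DDL statement per chunk. The following is only a sketch; the naming convention and the column list are illustrative assumptions, not prescribed by the text:

```python
# Sketch: chunk-wise partitioning as one separate table per chunk.
# The naming convention (<table>_chunk_<i>) and the columns are
# illustrative; a native RDBMS partitioning interface could be used instead.

def chunk_table_ddl(base_table, columns_sql, num_chunks):
    """Emit one CREATE TABLE statement per chunk of base_table."""
    return [
        f"CREATE TABLE {base_table}_chunk_{i} ({columns_sql});"
        for i in range(num_chunks)
    ]

ddl = chunk_table_ddl(
    "lineitem", "l_orderkey INT, l_extendedprice DECIMAL(15,2)", 8)
```

Rows are then routed to the chunk tables by whatever partitioning function is in use, and a query against the logical table becomes a union over only the chunk tables it touches.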
The first issue concerning chunks is to define
and enforce a convenient size, since chunks are
the individual units that can be placed and pro-
cessed in a highly flexible fashion, regardless of
the partitioning approach. Since the focus of this
work is on replication degree, load and avail-
ability balancing, and the solutions proposed are
to be used generically with different partitioning
approaches, we assume chunk target size and
target size enforcement to be guaranteed by the
partitioning approach itself and refer to these is-
sues only briefly.
We assume a target chunk size value (e.g. in our experiments with TPC-H, our target was approximately 250 MB for the base Lineitem table), noting that there is a fairly wide range of acceptable chunk size values which yield good performance, given a simple guiding principle: if there are too many very small chunks, there will be a large extra overhead related to switching disk scans among chunks, collecting partial results from each, and merging all the results; if, on the other hand, chunks are too large, we will not take full advantage of load-balancing opportunities.
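Under this guiding principle, the number of chunks follows directly from the table size and the target chunk size. A minimal sketch, where the example table size is an assumption for illustration rather than a figure from the text:

```python
import math

def num_chunks(table_size_mb, target_chunk_mb=250):
    """Round up so that no chunk exceeds the target size
    (250 MB is the target used for the base Lineitem table above)."""
    return max(1, math.ceil(table_size_mb / target_chunk_mb))

# e.g. a hypothetical 7500 MB table split at the 250 MB target
print(num_chunks(7500))  # -> 30
```

Rounding up keeps every chunk at or below the target, biasing toward slightly more, slightly smaller chunks, which favors load-balancing over scan-switching overhead.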
Chunk target size enforcement is dependent
on the partitioning approach. A chunk target size
is a simple and always available way to influ-
ence the number and size of pieces that are an
input to the allocation approach, which in turn
is useful for the dynamic chunk-wise load and
availability balancing. In contrast with more sophisticated alternatives for optimizing the chunk configuration, such as benchmark- or workload-based analysis, the simple specification of a chunk target size does not require further knowledge and adds only a simple target-related