et al. 1996) built a prototype of ds-RAID and the
work in (Hwang et al. 2002) further evolved the
concept with an orthogonal striping and mirroring
approach. These are low-level, data-block approaches similar to RAID, but applied across multiple networked nodes. Their basic concepts are useful for parallel data warehousing, but they must be applied at a higher level, that of partitions and processing units: had we used low-level approaches for parallel inter-node processing in our setup, disk block-wise solutions would not spare the very expensive data transfers from the disks to the processing nodes required to retrieve whole partitions.
At the OLAP query-processing level, the authors of (Akal et al. 2002) propose efficient data processing in a database cluster by means of full mirroring, creating multiple node-bound sub-queries from each query by adding predicates. This way,
each node receives a sub-query and is forced to
execute over a subset of the data, called a virtual
partition. Virtual partitions are more flexible than
physical ones, since only the ranges need to be
changed, but do not prevent full table scans when
the range is not small. The authors of (Lima et al. 2004, Lima et al. 2004b) further propose issuing multiple small sub-queries per node instead of just one, together with an adaptive partition-size tuning approach.
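The virtual-partition idea above can be sketched as follows. This is a minimal illustration, assuming an integer partitioning key with a known range and a base query that already contains a WHERE clause; the table and column names are hypothetical, not taken from the cited works:

```python
# Minimal sketch of virtual partitioning: each node receives the same
# query restricted by an added range predicate over a partitioning key.
# Assumes an integer key with a known [key_min, key_max) range and a
# base query that already contains a WHERE clause.

def virtual_partition_queries(base_query, key, key_min, key_max, num_nodes):
    """Split [key_min, key_max) into num_nodes contiguous ranges and
    append one range predicate per node to the base query."""
    step = (key_max - key_min) // num_nodes
    queries = []
    for i in range(num_nodes):
        lo = key_min + i * step
        # The last node absorbs any remainder so the whole range is covered.
        hi = key_max if i == num_nodes - 1 else lo + step
        queries.append(f"{base_query} AND {key} >= {lo} AND {key} < {hi}")
    return queries

# Example: four node-bound sub-queries over a hypothetical order-key range
subqueries = virtual_partition_queries(
    "SELECT SUM(l_extendedprice) FROM lineitem WHERE l_discount > 0.05",
    "l_orderkey", 0, 6000000, 4)
```

Only the range predicates differ between nodes, which is what makes virtual partitions easy to re-balance; a large range, however, still implies scanning a large portion of the table.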
The rationale in all these works still suffers from data-access performance limitations: full table scanning by every or most nodes, as in (Akal et al. 2002), is more expensive than having each node access only a part of the data; the alternative in (Lima et al. 2004, Lima et al. 2004b) of reducing the query predicate ranges to avoid full table scans has the limitation that it assumes some form of index-based access instead of full scans. Unless the data is clustered index-key-wise (which should not be assumed), index-based access can be slow for queries that need to go through large fractions of a data set (frequent in OLAP). These
limitations are absent in physically partitioned data sets, as long as the query processor can restrict which partitions to access. For this reason, we consider chunks: fixed-size partitions that, besides partitioning, can also be used for replication and efficient load-balancing.
Basic partitioning into chunks can be imple-
mented in any relational database management
system (RDBMS), by either creating separate
tables for each chunk or using the partitioning
interface offered by the RDBMS.
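As a concrete illustration of the first option, chunk tables can be created by generating one DDL statement per chunk. The following is only a sketch; the naming convention and the column list are illustrative assumptions, not prescribed by the text:

```python
# Sketch: chunk-wise partitioning as one separate table per chunk.
# The naming convention (<table>_chunk_<i>) and the columns are
# illustrative; a native RDBMS partitioning interface could be used instead.

def chunk_table_ddl(base_table, columns_sql, num_chunks):
    """Emit one CREATE TABLE statement per chunk of base_table."""
    return [
        f"CREATE TABLE {base_table}_chunk_{i} ({columns_sql});"
        for i in range(num_chunks)
    ]

ddl = chunk_table_ddl(
    "lineitem", "l_orderkey INT, l_extendedprice DECIMAL(15,2)", 8)
```

Rows are then routed to the chunk tables by whatever partitioning function is in use, and a query against the logical table becomes a union over only the chunk tables it touches.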
The first issue concerning chunks is to define
and enforce a convenient size, since chunks are
the individual units that can be placed and pro-
cessed in a highly flexible fashion, regardless of
the partitioning approach. Since the focus of this
work is on replication degree, load and avail-
ability balancing, and the solutions proposed are
to be used generically with different partitioning
approaches, we assume chunk target size and
target size enforcement to be guaranteed by the
partitioning approach itself and refer to these is-
sues only briefly.
We assume a target chunk size value (e.g. in our experiments with TPC-H, our target was approximately 250 MB for the base Lineitem table), noting that there is a fairly wide range of acceptable chunk size values which yield good performance, given a simple guiding principle: if there are too many very small chunks, there will be a large extra overhead related to switching disk scans among chunks, collecting partial results from each, and merging all the results; if, on the other hand, chunks are too large, we will not take full advantage of load-balancing opportunities.
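Under this guiding principle, the number of chunks follows directly from the table size and the target chunk size. A minimal sketch, where the example table size is an assumption for illustration rather than a figure from the text:

```python
import math

def num_chunks(table_size_mb, target_chunk_mb=250):
    """Round up so that no chunk exceeds the target size
    (250 MB is the target used for the base Lineitem table above)."""
    return max(1, math.ceil(table_size_mb / target_chunk_mb))

# e.g. a hypothetical 7500 MB table split at the 250 MB target
print(num_chunks(7500))  # -> 30
```

Rounding up keeps every chunk at or below the target, biasing toward slightly more, slightly smaller chunks, which favors load-balancing over scan-switching overhead.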
Chunk target size enforcement is dependent
on the partitioning approach. A chunk target size
is a simple and always available way to influ-
ence the number and size of pieces that are an
input to the allocation approach, which in turn
is useful for the dynamic chunk-wise load and
availability balancing. In contrast with more sophisticated alternatives for optimizing the chunk configuration, such as benchmark- or workload-based analysis, the simple specification of a chunk target size does not require further knowledge and adds only a simple target-related