Database Reference
In-Depth Information
be done after the partitioning scheme has been decided. We now describe the
steps involved in recommending hash-based partitioning in a parallel DBMS.
Candidate selection. We identify all candidate columns in the workload
that are part of equality predicates or group-by clauses. As explained in
Section 4.1, candidate selection can be done by either parsing the input
queries or the output execution plans or instrumenting the optimizer. In
addition to all candidate partitioning schemes identified in this manner,
we also consider the option of replicating some tables across all servers,
which can reduce communication overhead. Since replication incurs stor-
age overhead, we consider replicating only tables whose sizes are below a
threshold. Some optimizers extend the notion of interesting orders (see
Section 2.3.1) to also generate interesting partitioning columns. Subop-
timal plans that can exploit some interesting partitioning column are
kept during optimization because they can be part of the optimal plan
overall. Then, the optimizer is instrumented to simulate the availability
of all partitioning and replication schemes simultaneously, so that after
optimization we can obtain the optimal partitioning strategy for a given
query.
Merging. Consider a table T , and suppose that the best candidate par-
titioning for query q 1 is over T
(
a
,
b
)
and that the best candidate
partitioning for query q 2 is over T
(
a
,
c
)
. In this case, a hash-based par-
could be the best partitioning overall, as it benefits
both queries. In general, merging generates new partitioning candidates
with columns that are a common subset of two or more candidates
identified in the previous step. To avoid skewed distributions, we do not
attempt to merge partitioning candidates if the resulting number of key
values is under a given threshold.
Enumeration. Before enumeration starts, we compute a benefit value for
each candidate partitioning. The benefit of a candidate partitioning P
is obtained as the difference in cost, for all queries that picked P as a
candidate partitioning, between the configuration that can leverage all
possible partitioning candidates (including P ) and the one for which
no partitioning is available. Merged partitioning candidates inherit the
sum of benefits of their respective parents. Note that this benefit value
is just an approximation of the true reduction in cost for a given par-
titioning candidate, since it assigns the benefit of the whole query to
all useful partitioning candidates equally. Using benefit values, we de-
fine the initial configuration as the one that uses, for each table, the
titioning over T
(
a
)
We assume that the same table is not repeated in the query, and therefore the optimal plan
contains a single partitioning scheme for each table. In the general case the technique needs to
be extended, as discussed in Section 9.6.
Search WWH ::




Custom Search