• MapReduce can utilize three kinds of indices (range-indices, block-level
indices, and database indexed tables) in a straightforward way. The experiments
of the study show that the range-index improves the performance of MapReduce
by a factor of 2 in the selection task and by a factor of 10 in the join task
when the selectivity is high.
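The benefit is easiest to picture as split pruning: before the map tasks are
launched, the job consults the range index and schedules only the blocks whose
key range can overlap the selection predicate. The following Java sketch is
purely illustrative; the RangeIndex and Block types are hypothetical and not
part of Hadoop's API.

import java.util.ArrayList;
import java.util.List;

/** Hypothetical range index: one entry per data block, recording its key range. */
class RangeIndex {
    static class Block {
        final long offset;          // byte offset of the block in the input file
        final long minKey, maxKey;  // smallest and largest key stored in the block
        Block(long offset, long minKey, long maxKey) {
            this.offset = offset; this.minKey = minKey; this.maxKey = maxKey;
        }
    }

    private final List<Block> blocks = new ArrayList<>();

    void add(long offset, long minKey, long maxKey) {
        blocks.add(new Block(offset, minKey, maxKey));
    }

    /**
     * Return only the blocks whose key range overlaps [lo, hi];
     * map tasks are then created for these blocks only.
     */
    List<Block> prune(long lo, long hi) {
        List<Block> selected = new ArrayList<>();
        for (Block b : blocks) {
            if (b.maxKey >= lo && b.minKey <= hi) {
                selected.add(b);
            }
        }
        return selected;
    }
}

With a highly selective predicate, most blocks fall outside [lo, hi] and are
never read, which is where the reported gains for selection and join come from.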
• There are two kinds of decoders for parsing the input records: mutable
decoders and immutable decoders. The study claims that only immutable decoders
introduce a performance bottleneck, so to handle database-like workloads,
MapReduce users should use mutable decoders exclusively. A mutable decoder is
faster than an immutable decoder by a factor of 10 and improves the performance
of selection by a factor of 2. With a mutable decoder, even parsing text
records is efficient.
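The contrast between the two decoder styles can be sketched as follows: an
immutable decoder allocates a fresh record object (and fresh strings) for every
input line, whereas a mutable decoder reuses a single record and only
overwrites its fields, the same idea behind Hadoop's reuse of Writable objects.
The Record and decoder classes below are hypothetical and kept deliberately
minimal.

/** A simple two-field record parsed from a comma-separated line. */
class Record {
    long key;
    String value;
}

/** Immutable style: a new Record is allocated on every call. */
class ImmutableDecoder {
    Record decode(String line) {
        int comma = line.indexOf(',');
        Record r = new Record();                          // fresh allocation per input line
        r.key = Long.parseLong(line.substring(0, comma));
        r.value = line.substring(comma + 1);
        return r;
    }
}

/** Mutable style: the caller supplies one Record that is reused for every line,
 *  so the per-record allocation and garbage-collection cost disappears. */
class MutableDecoder {
    void decode(String line, Record reuse) {
        int comma = line.indexOf(',');
        reuse.key = Long.parseLong(line.substring(0, comma));
        reuse.value = line.substring(comma + 1);
    }
}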
• Map-side sorting exerts a negative performance effect on large aggregation
tasks that require nontrivial key comparisons and produce millions of groups.
A fingerprinting-based sort, which compares cheap fixed-size fingerprints of
the keys and falls back to the full keys only when the fingerprints are equal,
can therefore significantly improve the performance of MapReduce on such
aggregation tasks. The experiments show that fingerprinting-based sort
outperforms direct sort by a factor of 4 to 5 and improves the overall
performance of the job by 20-25 %.
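A minimal Java sketch of the idea, assuming string keys (the class and field
names are illustrative only): each key is reduced once to an integer
fingerprint, and the comparator orders by fingerprint first, touching the full
key only on a collision.

import java.util.Arrays;
import java.util.Comparator;

/** A key paired with a precomputed integer fingerprint. */
class FingerprintedKey {
    final String key;
    final int fingerprint;

    FingerprintedKey(String key) {
        this.key = key;
        this.fingerprint = key.hashCode();   // computed once, reused in every comparison
    }
}

public class FingerprintSortDemo {
    /** Compare the cheap fingerprints first; compare the full keys only on a collision. */
    static final Comparator<FingerprintedKey> BY_FINGERPRINT = (a, b) -> {
        int c = Integer.compare(a.fingerprint, b.fingerprint);
        return (c != 0) ? c : a.key.compareTo(b.key);
    };

    public static void main(String[] args) {
        FingerprintedKey[] keys = {
            new FingerprintedKey("group-42"),
            new FingerprintedKey("group-7"),
            new FingerprintedKey("group-1000000")
        };
        Arrays.sort(keys, BY_FINGERPRINT);
        for (FingerprintedKey k : keys) {
            System.out.println(k.key);
        }
    }
}

Note that the resulting order follows the fingerprints rather than the natural
key order; for aggregation this is sufficient, because all records with the
same key still end up adjacent to each other.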
• The scheduling strategy also affects the performance of MapReduce, as it is
sensitive to the processing speed of the slave nodes and can slow down the
execution of the entire job by 25-35 %.
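As a rough illustration of why node speed matters, a speed-aware assignment
policy would hand pending tasks to idle nodes according to their observed
processing rates instead of treating all slave nodes as equal. The heuristic
below is a simplified, hypothetical sketch and not Hadoop's actual scheduler.

import java.util.Map;
import java.util.Set;

/** Hypothetical heuristic: give the next pending task to the idle node
 *  with the highest observed processing rate (records per second). */
public class SpeedAwareAssignment {
    static String pickNode(Map<String, Double> observedRate, Set<String> idleNodes) {
        String best = null;
        double bestRate = -1.0;
        for (String node : idleNodes) {
            double rate = observedRate.getOrDefault(node, 0.0);
            if (rate > bestRate) {
                bestRate = rate;
                best = node;
            }
        }
        return best;   // slower nodes receive fewer tasks, reducing straggler effects
    }
}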
The experiments of the study show that, with proper engineering of these factors,
the performance of MapReduce can be improved by a factor of 2.5 to 3.5 and
approaches the performance of parallel databases. Consequently, several low-level
system optimization techniques have been introduced to improve the performance
of the MapReduce framework.
In general, running a single program in a MapReduce framework may require
tuning a number of parameters by users or system administrators. The settings of
these parameters control various aspects of job behavior during execution such
as memory allocation and usage, concurrency, I/O optimization, and network
bandwidth usage. The submitter of a Hadoop job has the option to set these param-
eters either using a program-level interface or through XML configuration files.
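For example, the same parameters can be set programmatically through the job's
Configuration object or declaratively as property entries in the XML
configuration files; the Java snippet below uses standard Hadoop parameter
names, with the values chosen purely for illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class TunedJobSubmission {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Program-level interface: override selected defaults for this job only.
        conf.setInt("mapreduce.task.io.sort.mb", 256);           // map-side sort buffer, in MB
        conf.setInt("mapreduce.job.reduces", 16);                // number of reduce tasks
        conf.setBoolean("mapreduce.map.output.compress", true);  // compress intermediate map output

        Job job = Job.getInstance(conf, "tuned-job");
        // ... set the mapper, reducer, input and output paths as usual ...
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The same settings could instead appear as property entries in mapred-site.xml,
in which case they act as site-wide defaults rather than per-job overrides.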
For any parameter whose value is not specified explicitly during job submission,
default values, either shipped along with the system or specified by the system
administrator, are used [69]. Users can run into performance problems because
they do not know how to set these parameters correctly, or because they do not
even know that these parameters exist. Herodotou and Babu [148] have focused
on the optimization opportunities presented by the large space of configuration
parameters for these programs. They introduced a Profiler component to collect
detailed statistical information from unmodified MapReduce programs and a What-
if Engine for fine-grained cost estimation. In particular, the Profiler component is
responsible for the following two main aspects:
1. Capturing information at the fine granularity of phases within the map and
reduce tasks of a MapReduce job execution. This information is crucial to the
accuracy of decisions made by the What-if Engine and the Cost-based Optimizer
components.