The experiments of the study show that the range index improves the performance of MapReduce by a factor of 2 in the selection task and by a factor of 10 in the join task when selectivity is high.
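To make the effect concrete, the following sketch (in Java, with hypothetical names; it is not the study's implementation) shows the idea behind a range index: each input split carries the minimum and maximum key of its records, so a selection over a key range scans only the splits whose ranges overlap the predicate. When selectivity is high, most splits are pruned before a single record is parsed.

import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of range-index-based split pruning: each split
// stores the min/max key of its records, so a range selection scans
// only the splits that can contain matching records.
public class RangeIndexPruning {

    static final class SplitMeta {
        final String path;
        final long minKey, maxKey; // range-index entry for this split
        SplitMeta(String path, long minKey, long maxKey) {
            this.path = path; this.minKey = minKey; this.maxKey = maxKey;
        }
    }

    // Keep only the splits whose key range overlaps [lo, hi].
    static List<SplitMeta> prune(List<SplitMeta> splits, long lo, long hi) {
        List<SplitMeta> selected = new ArrayList<>();
        for (SplitMeta s : splits) {
            if (s.maxKey >= lo && s.minKey <= hi) {
                selected.add(s);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<SplitMeta> splits = List.of(
            new SplitMeta("part-00000", 0, 999),
            new SplitMeta("part-00001", 1000, 1999),
            new SplitMeta("part-00002", 2000, 2999));
        // A highly selective predicate touches only one of the three splits.
        System.out.println(prune(splits, 1200, 1300).size()); // prints 1
    }
}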
There are two kinds of decoders for parsing input records: mutable decoders and immutable decoders. The study claims that only immutable decoders introduce a performance bottleneck, so to handle database-like workloads, MapReduce users should strictly use mutable decoders. A mutable decoder is faster than an immutable decoder by a factor of 10 and improves the performance of selection by a factor of 2; with a mutable decoder, even parsing text records is efficient.
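The following sketch (in Java, hypothetical, not the study's code) contrasts the two styles: an immutable decoder allocates a fresh object for every input record, whereas a mutable decoder parses each record into a single reusable object, avoiding per-record allocation and garbage-collection overhead. This is the same object-reuse pattern that Hadoop's Writable types encourage.

// Hypothetical sketch contrasting the two decoder styles. The mutable
// decoder reuses one record object across calls instead of allocating a
// fresh object per input record, which avoids object-creation and
// garbage-collection overhead on database-like workloads.
public class DecoderSketch {

    static final class MutableRecord {
        long key;
        String value;
        // Overwrite fields in place rather than constructing a new record.
        void set(long key, String value) { this.key = key; this.value = value; }
    }

    // Immutable style: a new object per record (slow at high record rates).
    static String[] decodeImmutable(String line) {
        return line.split("\\|");
    }

    // Mutable style: parse into a caller-supplied, reusable record.
    static void decodeMutable(String line, MutableRecord out) {
        int sep = line.indexOf('|');
        out.set(Long.parseLong(line.substring(0, sep)), line.substring(sep + 1));
    }

    public static void main(String[] args) {
        MutableRecord record = new MutableRecord(); // allocated once
        for (String line : new String[] {"1|a", "2|b", "3|c"}) {
            decodeMutable(line, record); // no per-record allocation of the record
            System.out.println(record.key + " -> " + record.value);
        }
    }
}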
Map-side sorting exerts a negative effect on the performance of large aggregation tasks that require nontrivial key comparisons and produce millions of groups. A fingerprinting-based sort, which compares cheap fixed-size fingerprints of the keys and resorts to full key comparisons only when fingerprints collide, can therefore significantly improve the performance of MapReduce on such aggregation tasks. The experiments show that fingerprinting-based sort outperforms direct sort by a factor of 4 to 5 and improves the overall performance of the job by 20%-25%.
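The idea can be sketched as follows (illustrative, not the study's implementation): each key is paired with a cheap fixed-size fingerprint, the sort compares fingerprints first, and the expensive full-key comparison runs only on the rare fingerprint ties. The resulting order groups equal keys together, which is all an aggregation task needs.

import java.util.Arrays;
import java.util.Comparator;

// Hypothetical sketch of fingerprinting-based sort: each key carries a
// fixed-size fingerprint (here, its hash code, computed once). The sort
// compares fingerprints first and falls back to the expensive full-key
// comparison only on fingerprint ties. The output is grouped by key in
// fingerprint order, not lexicographic order, which suffices for
// aggregation.
public class FingerprintSort {

    static final class FpKey {
        final String key;
        final int fingerprint;
        FpKey(String key) {
            this.key = key;
            this.fingerprint = key.hashCode(); // computed once per key
        }
    }

    static final Comparator<FpKey> BY_FINGERPRINT_THEN_KEY =
        Comparator.comparingInt((FpKey k) -> k.fingerprint)
                  .thenComparing(k -> k.key); // rare tie-break on the real key

    public static void main(String[] args) {
        FpKey[] keys = {
            new FpKey("group-c"), new FpKey("group-a"),
            new FpKey("group-b"), new FpKey("group-a")};
        Arrays.sort(keys, BY_FINGERPRINT_THEN_KEY);
        for (FpKey k : keys) System.out.println(k.key);
        // Equal keys end up adjacent, so the reducer can aggregate groups.
    }
}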
The scheduling strategy also affects the performance of MapReduce: because scheduling is sensitive to the processing speed of slave nodes, a poor strategy can slow down the execution of the entire job by 25%-35%.
The experiments of the study show that, with proper engineering of these factors, the performance of MapReduce can be improved by a factor of 2.5 to 3.5, approaching the performance of parallel databases. Therefore, several low-level system optimization techniques have been introduced to improve the performance of the MapReduce framework.
In general, running a single program in a MapReduce framework may require tuning a number of parameters by users or system administrators. The settings of these parameters control various aspects of job behavior during execution such as memory allocation and usage, concurrency, I/O optimization, and network bandwidth usage. The submitter of a Hadoop job has the option to set these parameters either using a program-level interface or through XML configuration files. For any parameter whose value is not specified explicitly during job submission, default values, either shipped along with the system or specified by the system administrator, are used [12].
Users can run into performance problems because they do not know how to set these parameters correctly, or because they do not even know that these parameters exist. Herodotou and Babu [66] have focused on the optimization opportunities presented by the large space of configuration parameters for these programs.
They introduced a Profiler component to collect detailed statistical information from unmodified MapReduce programs and a what-if engine for fine-grained cost estimation. In particular, the Profiler component is responsible for the following two main aspects:
1. Capturing information at the fine granularity of phases within the map and reduce tasks of a MapReduce job execution. This information is crucial to the accuracy of decisions made by the what-if engine and the cost-based optimizer components.
2. Using dynamic instrumentation to collect run-time monitoring information from unmodified MapReduce programs.
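As a rough illustration of the first aspect, the hypothetical sketch below times each sub-phase of a task separately; the actual Profiler of [66] obtains such statistics through dynamic instrumentation of unmodified programs rather than explicit timing calls in user code.

import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of phase-level profiling inside a task: each
// sub-phase of the map task (read, map, collect, spill, merge) is timed
// separately, giving the fine-grained statistics a what-if engine needs.
public class PhaseProfiler {
    private final Map<String, Long> phaseNanos = new LinkedHashMap<>();

    // Run one phase and record how long it took.
    void time(String phase, Runnable work) {
        long start = System.nanoTime();
        work.run();
        phaseNanos.merge(phase, System.nanoTime() - start, Long::sum);
    }

    void report() {
        phaseNanos.forEach((phase, nanos) ->
            System.out.printf("%-8s %8.3f ms%n", phase, nanos / 1e6));
    }

    public static void main(String[] args) {
        PhaseProfiler profiler = new PhaseProfiler();
        profiler.time("read",    () -> {});
        profiler.time("map",     () -> {});
        profiler.time("collect", () -> {});
        profiler.report();
    }
}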