• MapReduce can utilize three kinds of indices (range-indices, block-level
indices, and database indexed tables) in a straightforward way. The experiments
of the study show that the range-index improves the performance of MapReduce
by a factor of 2 in the selection task and by a factor of 10 in the join task
when the selectivity is high.
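The benefit is easiest to picture as split pruning: before the map tasks are
launched, the job consults the range index and schedules only the blocks whose
key range can overlap the selection predicate. The following Java sketch is
purely illustrative; the RangeIndex and Block types are hypothetical and not
part of Hadoop's API.

import java.util.ArrayList;
import java.util.List;

/** Hypothetical range index: one entry per data block, recording its key range. */
class RangeIndex {
    static class Block {
        final long offset;          // byte offset of the block in the input file
        final long minKey, maxKey;  // smallest and largest key stored in the block
        Block(long offset, long minKey, long maxKey) {
            this.offset = offset; this.minKey = minKey; this.maxKey = maxKey;
        }
    }

    private final List<Block> blocks = new ArrayList<>();

    void add(long offset, long minKey, long maxKey) {
        blocks.add(new Block(offset, minKey, maxKey));
    }

    /**
     * Return only the blocks whose key range overlaps [lo, hi];
     * map tasks are then created for these blocks only.
     */
    List<Block> prune(long lo, long hi) {
        List<Block> selected = new ArrayList<>();
        for (Block b : blocks) {
            if (b.maxKey >= lo && b.minKey <= hi) {
                selected.add(b);
            }
        }
        return selected;
    }
}

With a highly selective predicate, most blocks fall outside [lo, hi] and are
never read, which is where the reported gains for selection and join come from.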
• There are two kinds of decoders for parsing the input records: mutable
decoders and immutable decoders. The study claims that only immutable decoders
introduce a performance bottleneck, so to handle database-like workloads,
MapReduce users should use mutable decoders exclusively. A mutable decoder is
faster than an immutable decoder by a factor of 10 and improves the performance
of selection by a factor of 2. With a mutable decoder, even parsing text
records is efficient.
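The contrast between the two decoder styles can be sketched as follows: an
immutable decoder allocates a fresh record object (and fresh strings) for every
input line, whereas a mutable decoder reuses a single record and only
overwrites its fields, the same idea behind Hadoop's reuse of Writable objects.
The Record and decoder classes below are hypothetical and kept deliberately
minimal.

/** A simple two-field record parsed from a comma-separated line. */
class Record {
    long key;
    String value;
}

/** Immutable style: a new Record is allocated on every call. */
class ImmutableDecoder {
    Record decode(String line) {
        int comma = line.indexOf(',');
        Record r = new Record();                          // fresh allocation per input line
        r.key = Long.parseLong(line.substring(0, comma));
        r.value = line.substring(comma + 1);
        return r;
    }
}

/** Mutable style: the caller supplies one Record that is reused for every line,
 *  so the per-record allocation and garbage-collection cost disappears. */
class MutableDecoder {
    void decode(String line, Record reuse) {
        int comma = line.indexOf(',');
        reuse.key = Long.parseLong(line.substring(0, comma));
        reuse.value = line.substring(comma + 1);
    }
}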
• Map-side sorting exerts a negative performance effect on large aggregation
tasks that require nontrivial key comparisons and produce millions of groups.
A fingerprinting-based sort, which compares cheap fixed-size fingerprints of
the keys and falls back to the full keys only when the fingerprints are equal,
can therefore significantly improve the performance of MapReduce on such
aggregation tasks. The experiments show that fingerprinting-based sort
outperforms direct sort by a factor of 4 to 5 and improves the overall
performance of the job by 20-25 %.
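A minimal Java sketch of the idea, assuming string keys (the class and field
names are illustrative only): each key is reduced once to an integer
fingerprint, and the comparator orders by fingerprint first, touching the full
key only on a collision.

import java.util.Arrays;
import java.util.Comparator;

/** A key paired with a precomputed integer fingerprint. */
class FingerprintedKey {
    final String key;
    final int fingerprint;

    FingerprintedKey(String key) {
        this.key = key;
        this.fingerprint = key.hashCode();   // computed once, reused in every comparison
    }
}

public class FingerprintSortDemo {
    /** Compare the cheap fingerprints first; compare the full keys only on a collision. */
    static final Comparator<FingerprintedKey> BY_FINGERPRINT = (a, b) -> {
        int c = Integer.compare(a.fingerprint, b.fingerprint);
        return (c != 0) ? c : a.key.compareTo(b.key);
    };

    public static void main(String[] args) {
        FingerprintedKey[] keys = {
            new FingerprintedKey("group-42"),
            new FingerprintedKey("group-7"),
            new FingerprintedKey("group-1000000")
        };
        Arrays.sort(keys, BY_FINGERPRINT);
        for (FingerprintedKey k : keys) {
            System.out.println(k.key);
        }
    }
}

Note that the resulting order follows the fingerprints rather than the natural
key order; for aggregation this is sufficient, because all records with the
same key still end up adjacent to each other.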
• The scheduling strategy also affects the performance of MapReduce, as it is
sensitive to the processing speed of the slave nodes and can slow down the
execution of the entire job by 25-35 %.
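As a rough illustration of why node speed matters, a speed-aware assignment
policy would hand pending tasks to idle nodes according to their observed
processing rates instead of treating all slave nodes as equal. The heuristic
below is a simplified, hypothetical sketch and not Hadoop's actual scheduler.

import java.util.Map;
import java.util.Set;

/** Hypothetical heuristic: give the next pending task to the idle node
 *  with the highest observed processing rate (records per second). */
public class SpeedAwareAssignment {
    static String pickNode(Map<String, Double> observedRate, Set<String> idleNodes) {
        String best = null;
        double bestRate = -1.0;
        for (String node : idleNodes) {
            double rate = observedRate.getOrDefault(node, 0.0);
            if (rate > bestRate) {
                bestRate = rate;
                best = node;
            }
        }
        return best;   // slower nodes receive fewer tasks, reducing straggler effects
    }
}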
The experiments of the study show that, with proper engineering of these factors,
the performance of MapReduce can be improved by a factor of 2.5 to 3.5 and
approaches the performance of parallel databases. Consequently, several low-level
system optimization techniques have been introduced to improve the performance
of the MapReduce framework.
In general, running a single program in a MapReduce framework may require
tuning a number of parameters by users or system administrators. The settings of
these parameters control various aspects of job behavior during execution such
as memory allocation and usage, concurrency, I/O optimization, and network
bandwidth usage. The submitter of a Hadoop job has the option to set these param-
eters either using a program-level interface or through XML configuration files.
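For example, the same parameters can be set programmatically through the job's
Configuration object or declaratively as property entries in the XML
configuration files; the Java snippet below uses standard Hadoop parameter
names, with the values chosen purely for illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class TunedJobSubmission {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Program-level interface: override selected defaults for this job only.
        conf.setInt("mapreduce.task.io.sort.mb", 256);           // map-side sort buffer, in MB
        conf.setInt("mapreduce.job.reduces", 16);                // number of reduce tasks
        conf.setBoolean("mapreduce.map.output.compress", true);  // compress intermediate map output

        Job job = Job.getInstance(conf, "tuned-job");
        // ... set the mapper, reducer, input and output paths as usual ...
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The same settings could instead appear as property entries in mapred-site.xml,
in which case they act as site-wide defaults rather than per-job overrides.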
For any parameter whose value is not specified explicitly during job submission,
default values, either shipped along with the system or specified by the system
administrator, are used [69]. Users can run into performance problems because
they do not know how to set these parameters correctly, or because they do not
even know that these parameters exist. Herodotou and Babu [148] have focused
on the optimization opportunities presented by the large space of configuration
parameters for these programs. They introduced a Profiler component to collect
detailed statistical information from unmodified MapReduce programs and a What-
if Engine for fine-grained cost estimation. In particular, the Profiler component is
responsible for the following two main aspects:
1. Capturing information at the fine granularity of phases within the map and
reduce tasks of a MapReduce job execution. This information is crucial to the
accuracy of decisions made by the What-if Engine and the Cost-based Optimizer
components.