The experiments of the study show that the range index improves the performance of MapReduce by a factor of 2 in the selection task and by a factor of 10 in the join task when selectivity is high.
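To make the effect concrete, the following sketch (in Java, with hypothetical names; it is not the study's implementation) shows the idea behind a range index: each input split carries the minimum and maximum key of its records, so a selection over a key range scans only the splits whose ranges overlap the predicate. When selectivity is high, most splits are pruned before a single record is parsed.

import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of range-index-based split pruning: each split
// stores the min/max key of its records, so a range selection scans
// only the splits that can contain matching records.
public class RangeIndexPruning {

    static final class SplitMeta {
        final String path;
        final long minKey, maxKey; // range-index entry for this split
        SplitMeta(String path, long minKey, long maxKey) {
            this.path = path; this.minKey = minKey; this.maxKey = maxKey;
        }
    }

    // Keep only the splits whose key range overlaps [lo, hi].
    static List<SplitMeta> prune(List<SplitMeta> splits, long lo, long hi) {
        List<SplitMeta> selected = new ArrayList<>();
        for (SplitMeta s : splits) {
            if (s.maxKey >= lo && s.minKey <= hi) {
                selected.add(s);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<SplitMeta> splits = List.of(
            new SplitMeta("part-00000", 0, 999),
            new SplitMeta("part-00001", 1000, 1999),
            new SplitMeta("part-00002", 2000, 2999));
        // A highly selective predicate touches only one of the three splits.
        System.out.println(prune(splits, 1200, 1300).size()); // prints 1
    }
}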
There are two kinds of decoders for parsing input records: mutable decoders and immutable decoders. The study claims that only immutable decoders introduce a performance bottleneck, so to handle database-like workloads, MapReduce users should strictly use mutable decoders. A mutable decoder is faster than an immutable decoder by a factor of 10 and improves the performance of selection by a factor of 2; with a mutable decoder, even parsing text records is efficient.
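The following sketch (in Java, hypothetical, not the study's code) contrasts the two styles: an immutable decoder allocates a fresh object for every input record, whereas a mutable decoder parses each record into a single reusable object, avoiding per-record allocation and garbage-collection overhead. This is the same object-reuse pattern that Hadoop's Writable types encourage.

// Hypothetical sketch contrasting the two decoder styles. The mutable
// decoder reuses one record object across calls instead of allocating a
// fresh object per input record, which avoids object-creation and
// garbage-collection overhead on database-like workloads.
public class DecoderSketch {

    static final class MutableRecord {
        long key;
        String value;
        // Overwrite fields in place rather than constructing a new record.
        void set(long key, String value) { this.key = key; this.value = value; }
    }

    // Immutable style: a new object per record (slow at high record rates).
    static String[] decodeImmutable(String line) {
        return line.split("\\|");
    }

    // Mutable style: parse into a caller-supplied, reusable record.
    static void decodeMutable(String line, MutableRecord out) {
        int sep = line.indexOf('|');
        out.set(Long.parseLong(line.substring(0, sep)), line.substring(sep + 1));
    }

    public static void main(String[] args) {
        MutableRecord record = new MutableRecord(); // allocated once
        for (String line : new String[] {"1|a", "2|b", "3|c"}) {
            decodeMutable(line, record); // no per-record allocation of the record
            System.out.println(record.key + " -> " + record.value);
        }
    }
}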
Map-side sorting exerts a negative effect on the performance of large aggregation tasks that require nontrivial key comparisons and produce millions of groups. A fingerprinting-based sort, which compares cheap fixed-size fingerprints of the keys and resorts to full key comparisons only when fingerprints collide, can therefore significantly improve the performance of MapReduce on such aggregation tasks. The experiments show that fingerprinting-based sort outperforms direct sort by a factor of 4 to 5 and improves the overall performance of the job by 20%-25%.
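The idea can be sketched as follows (illustrative, not the study's implementation): each key is paired with a cheap fixed-size fingerprint, the sort compares fingerprints first, and the expensive full-key comparison runs only on the rare fingerprint ties. The resulting order groups equal keys together, which is all an aggregation task needs.

import java.util.Arrays;
import java.util.Comparator;

// Hypothetical sketch of fingerprinting-based sort: each key carries a
// fixed-size fingerprint (here, its hash code, computed once). The sort
// compares fingerprints first and falls back to the expensive full-key
// comparison only on fingerprint ties. The output is grouped by key in
// fingerprint order, not lexicographic order, which suffices for
// aggregation.
public class FingerprintSort {

    static final class FpKey {
        final String key;
        final int fingerprint;
        FpKey(String key) {
            this.key = key;
            this.fingerprint = key.hashCode(); // computed once per key
        }
    }

    static final Comparator<FpKey> BY_FINGERPRINT_THEN_KEY =
        Comparator.comparingInt((FpKey k) -> k.fingerprint)
                  .thenComparing(k -> k.key); // rare tie-break on the real key

    public static void main(String[] args) {
        FpKey[] keys = {
            new FpKey("group-c"), new FpKey("group-a"),
            new FpKey("group-b"), new FpKey("group-a")};
        Arrays.sort(keys, BY_FINGERPRINT_THEN_KEY);
        for (FpKey k : keys) System.out.println(k.key);
        // Equal keys end up adjacent, so the reducer can aggregate groups.
    }
}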
The scheduling strategy also affects the performance of MapReduce: because scheduling is sensitive to the processing speed of slave nodes, a poor strategy can slow down the execution of the entire job by 25%-35%.
The experiments of the study show that, with proper engineering of these factors, the performance of MapReduce can be improved by a factor of 2.5 to 3.5, approaching the performance of parallel databases. Therefore, several low-level system optimization techniques have been introduced to improve the performance of the MapReduce framework.
In general, running a single program in a MapReduce framework may require tuning a number of parameters by users or system administrators. The settings of these parameters control various aspects of job behavior during execution such as memory allocation and usage, concurrency, I/O optimization, and network bandwidth usage. The submitter of a Hadoop job has the option to set these parameters either using a program-level interface or through XML configuration files. For any parameter whose value is not specified explicitly during job submission, default values, either shipped along with the system or specified by the system administrator, are used [12].
Users can run into performance problems because they do not know how to set these parameters correctly, or because they do not even know that these parameters exist. Herodotou and Babu [66] have focused on the optimization opportunities presented by the large space of configuration parameters for these programs.
They introduced a Profiler component to collect detailed statistical information from unmodified MapReduce programs and a what-if engine for fine-grained cost estimation. In particular, the Profiler component is responsible for the following two main aspects:
1. Capturing information at the fine granularity of phases within the map and reduce tasks of a MapReduce job execution. This information is crucial to the accuracy of decisions made by the what-if engine and the cost-based optimizer components.
2. Using dynamic instrumentation to collect run-time monitoring information from unmodified MapReduce programs.
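As a rough illustration of the first aspect, the hypothetical sketch below times each sub-phase of a task separately; the actual Profiler of [66] obtains such statistics through dynamic instrumentation of unmodified programs rather than explicit timing calls in user code.

import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of phase-level profiling inside a task: each
// sub-phase of the map task (read, map, collect, spill, merge) is timed
// separately, giving the fine-grained statistics a what-if engine needs.
public class PhaseProfiler {
    private final Map<String, Long> phaseNanos = new LinkedHashMap<>();

    // Run one phase and record how long it took.
    void time(String phase, Runnable work) {
        long start = System.nanoTime();
        work.run();
        phaseNanos.merge(phase, System.nanoTime() - start, Long::sum);
    }

    void report() {
        phaseNanos.forEach((phase, nanos) ->
            System.out.printf("%-8s %8.3f ms%n", phase, nanos / 1e6));
    }

    public static void main(String[] args) {
        PhaseProfiler profiler = new PhaseProfiler();
        profiler.time("read",    () -> {});
        profiler.time("map",     () -> {});
        profiler.time("collect", () -> {});
        profiler.report();
    }
}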