Incremental MapReduce Computations - Large Scale and Big Data: Processing and Management


Contraction phase. To evaluate the contraction phase, we run two versions of Incoop: one that only memoizes the output of an entire reduce task, and the full design that includes the contraction phase. We refer to these two versions as Task and Contraction. Figure 4.6 compares the work and time speedup of the two versions using one application of each class (CoMatrix is data-intensive and KNN is CPU-intensive). The contraction phase does not change the performance of KNN but significantly improves the performance of CoMatrix. This is because the reduce phase in KNN performs a simple computation and thus has little to gain from the contraction phase. Given this fact, it is noteworthy that the contraction phase did not add significant overhead.
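
The difference between the two variants can be illustrated with a minimal sketch. It is not Incoop's implementation: the class name MemoizedReducer, the fixed-size chunking, the SHA-256 cache key, and the values_processed counter (a rough proxy for work) are all assumptions made for illustration, and the combine function is assumed to be associative so that partial results can be merged.

```python
import hashlib
import pickle


def digest(obj):
    """Content hash used as the memoization key (a hypothetical choice)."""
    return hashlib.sha256(pickle.dumps(obj)).hexdigest()


class MemoizedReducer:
    """Illustrative reducer with task-level or chunk-level memoization.

    chunk_size=None  -> memoize only the output of the whole reduce task
    chunk_size=k     -> memoize partial results over chunks of k values,
                        then merge the partial results
    """

    def __init__(self, combine, chunk_size=None):
        self.combine = combine        # list of values -> one value; assumed associative
        self.chunk_size = chunk_size
        self.cache = {}
        self.values_processed = 0     # counts every value fed to combine

    def _memo(self, values):
        key = digest(values)
        if key not in self.cache:     # cache miss: do the work and remember it
            self.values_processed += len(values)
            self.cache[key] = self.combine(values)
        return self.cache[key]

    def reduce(self, values):
        if self.chunk_size is None:
            return self._memo(values)
        chunks = [values[i:i + self.chunk_size]
                  for i in range(0, len(values), self.chunk_size)]
        partials = [self._memo(chunk) for chunk in chunks]
        self.values_processed += len(partials)   # the final merge always runs
        return self.combine(partials)


if __name__ == "__main__":
    values = list(range(1000))
    task = MemoizedReducer(sum)
    contraction = MemoizedReducer(sum, chunk_size=100)
    task.reduce(values)
    contraction.reduce(values)

    # Change a single input value and rerun, counting only the new work.
    changed = list(values)
    changed[5] = -1
    task.values_processed = 0
    contraction.values_processed = 0
    task.reduce(changed)
    contraction.reduce(changed)
    print("task-level work:", task.values_processed)
    print("chunk-level work:", contraction.values_processed)
```

In this toy setting, a single changed value forces the task-level variant to reprocess all 1000 values, while the chunked variant reprocesses one chunk of 100 values plus the merge of 10 partial results. When the reduce function is cheap, as the text notes for KNN, there is little work to save either way.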

Scheduler modification. We now evaluate the effectiveness of the memoization-aware scheduler. In Figure 4.7, we compare the time to run the various applications in Incoop using the new scheduler and the original Hadoop scheduler. The y-axis shows the total running time normalized to the time using the Hadoop scheduler. The memoization-aware scheduler cuts the running time by 30% for data-intensive applications and by almost 15% for CPU-intensive applications. This highlights the importance of this design aspect.
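
The normalization used on the y-axis can be made concrete with a short sketch. The function names and the timings below are hypothetical and chosen only to show the arithmetic; they are not measurements from the evaluation.

```python
def normalized_time(new_seconds, baseline_seconds):
    """Running time expressed as a fraction of the baseline running time."""
    return new_seconds / baseline_seconds


def reduction_percent(new_seconds, baseline_seconds):
    """Percentage by which the running time drops relative to the baseline."""
    return 100.0 * (baseline_seconds - new_seconds) / baseline_seconds


if __name__ == "__main__":
    # Hypothetical timings: 200 s with the baseline scheduler, 140 s with the new one.
    baseline, new = 200.0, 140.0
    print(normalized_time(new, baseline))    # fraction of the baseline time
    print(reduction_percent(new, baseline))  # percentage cut in running time
```

Under this definition, a bar at 0.70 corresponds to a 30% cut in running time, and a bar at 1.0 means no change relative to the Hadoop scheduler.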

4.6.7 Overheads

Next we evaluate the price paid for the gains shown in the previous section, namely the overheads introduced by Incoop during the initial run and the space requirements for storing memoized results. The results are shown in Figure 4.8.
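
Whether an initial-run overhead pays off depends on how many incremental runs follow. The sketch below works through that trade-off under simplifying assumptions that are not part of the evaluation: every run without memoization takes the same time T, the initial run costs (1 + overhead) * T, and every later incremental run takes T / speedup. The function name runs_to_break_even is hypothetical.

```python
import math


def runs_to_break_even(overhead, speedup):
    """Smallest number of incremental runs n after which the total time with
    memoization is lower than the total time without it.

    Without memoization: (1 + n) * T
    With memoization:    (1 + overhead) * T + n * T / speedup
    Solving for n gives  n > overhead / (1 - 1 / speedup).

    overhead is a fraction (0.22 means 22%); speedup is the time speedup of
    each incremental run. Returns None if the overhead is never recovered.
    """
    if speedup <= 1.0:
        return None
    threshold = overhead / (1.0 - 1.0 / speedup)
    return math.floor(threshold) + 1   # smallest integer strictly above the threshold


if __name__ == "__main__":
    # Hypothetical speedups, each paired with a 22% initial overhead.
    for speedup in (1.1, 2.0):
        print(speedup, runs_to_break_even(0.22, speedup))
```

With a 22% initial overhead, a hypothetical 2x speedup on later runs recovers the cost after a single incremental run, while a modest 1.1x speedup needs three.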

Performance overhead. Figure 4.8a depicts the performance overhead of the first run for the Task and Contraction variants described above. We stress that these overheads are a one-time cost that can lead to substantial gains in subsequent runs. The overhead varies from 5% to 22% and is lower for CPU-intensive applications (K-Means and KNN), since the time to compute over the data dominates the time to transfer this data to be stored. For data-intensive applications

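FIGURE 4.7 Effectiveness of scheduler optimizations. [Bar chart: running time normalized to the Hadoop scheduler (y-axis, 0 to 1.4) for the applications K-Means, WordCount, KNN, CoMatrix, and BiCount (x-axis), comparing the Hadoop scheduler and the Incoop scheduler.]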
