of RAM. The task trackers were parameterized to use two map and two reduce slots
per worker machine.
Work and time. We separately measure work and time to compare the perfor-
mance across runs. Work is the total computation time summed over all tasks, which eliminates the effect of some machines idling while waiting to synchronize with others. (Parallel) time refers to the total running time for
the job. The two metrics are related through the work-time principle, which states
that a computation with W work can be executed on P machines (or processors)
in W/P time if there are no scheduling overheads. Note that the work measurements
include the additional computational work performed by tasks that are speculatively
executed by the Hadoop framework (e.g., Hadoop can run the same task on two dif-
ferent machines to improve performance if there is spare capacity on the cluster).
Therefore, a difference in the number of speculative tasks that are launched will be
reflected in the comparison of work.
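To make the two metrics concrete, the following sketch computes work as the sum of per-task durations, speculative attempts included, and derives the ideal parallel time given by the work-time principle. All task names and durations are made up for illustration and are not taken from Incoop or Hadoop.

# Hypothetical per-task durations (seconds), as might be reported for
# Hadoop task attempts; names and numbers are illustrative only.
task_durations = {
    "map_000_attempt_0": 42.0,
    "map_001_attempt_0": 40.5,
    "map_001_attempt_1": 39.8,  # speculative re-execution also counts toward work
    "reduce_000_attempt_0": 58.3,
}

# Work: total computation time across all tasks, speculative ones included.
work = sum(task_durations.values())

# Work-time principle: W work on P machines takes W / P time in the
# absence of scheduling overheads; measured parallel time can only be higher.
P = 2
ideal_parallel_time = work / P
print(f"work = {work:.1f} s, ideal time on {P} machines = {ideal_parallel_time:.1f} s")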
Initial and incremental runs. When evaluating Incoop, we need to consider two types of runs. The initial run operates on data that has never been seen before, and can therefore start with an empty memoization server, which it then populates. The incremental run is a subsequent run in which the input is modified by a certain fraction, and the system tries to reuse subcomputations to the extent possible.
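The distinction between the two types of runs can be illustrated with a minimal sketch, assuming the memoization server behaves like a key-value store indexed by a fingerprint of each input chunk. The function names and the toy word-count task are hypothetical; the actual Incoop design is considerably more elaborate.

import hashlib

memo = {}  # fingerprint of input chunk -> cached task result

def process_chunk(chunk: bytes) -> int:
    # Stand-in for a map task; here, a toy word count.
    return len(chunk.split())

def run_job(chunks):
    results, reused = [], 0
    for chunk in chunks:
        key = hashlib.sha1(chunk).hexdigest()
        if key in memo:          # reuse a memoized subcomputation
            reused += 1
        else:                    # compute and populate the memoization store
            memo[key] = process_chunk(chunk)
        results.append(memo[key])
    print(f"reused {reused} of {len(chunks)} chunks")
    return results

run_job([b"a b c", b"d e f", b"g h"])   # initial run: nothing reused
run_job([b"a b c", b"X Y", b"g h"])     # incremental run: 2 of 3 chunks reused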
Speedup. We present the results comparing the performance of Incoop and
Hadoop by plotting the speedup, that is, the ratio of the work or parallel time required
by Hadoop to the work or time required by Incoop. In most cases we plot how this
speedup varies as we change the fraction of the input that differs from the initial
to the incremental run. To run an experiment where x% of the input data differs between the two runs, we randomly choose x% of the chunks in the input and replace them with new, equally sized chunks containing new content.
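The following sketch shows this measurement methodology under simplifying assumptions: mutate_input and speedup are hypothetical helpers, and the chunk sizes and cost numbers are illustrative rather than measured values.

import random

def mutate_input(chunks, x_percent, seed=42):
    # Replace x% of the chunks with new, equally sized random content.
    rng = random.Random(seed)
    n_changed = round(len(chunks) * x_percent / 100)
    mutated = list(chunks)
    for i in rng.sample(range(len(chunks)), n_changed):
        mutated[i] = bytes(rng.randrange(256) for _ in range(len(chunks[i])))
    return mutated

def speedup(hadoop_cost, incoop_cost):
    # Applies to either metric: work or parallel time.
    return hadoop_cost / incoop_cost

chunks = [bytes([65 + i % 26]) * 64 for i in range(100)]
changed = mutate_input(chunks, x_percent=10)
print(sum(a != b for a, b in zip(chunks, changed)), "chunks replaced")
print("speedup =", speedup(hadoop_cost=1200.0, incoop_cost=300.0))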
4.6.4 Incremental HDFS
We compare the throughput during an upload of a 3 GB data set to HDFS and to Inc-HDFS, varying the number of skipped bytes in Inc-HDFS. The client writing to the file system runs on the same machine as the Hadoop name node. The results of this experiment are summarized in Table 4.2. Overall, Inc-HDFS adds only a small throughput overhead compared with HDFS, which can be attributed to the fingerprint computation.
TABLE 4.2
Throughput of HDFS and Inc-HDFS

Version             Skip Offset (MB)   Throughput (MB/s)
HDFS                -                  34.41
Incremental HDFS    20                 32.67
Incremental HDFS    40                 34.19
Incremental HDFS    60                 32.04
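The fingerprint computation behind this overhead arises from content-defined chunk boundary detection, where a configurable number of bytes at the start of each chunk is skipped before scanning for a boundary. The sketch below illustrates the idea under stated assumptions: a real implementation would use a rolling (Rabin-style) fingerprint rather than recomputing a hash at every position, and all constants are illustrative, not Inc-HDFS's actual parameters.

import hashlib
import os

WINDOW = 48                    # bytes fingerprinted at each scan position
BOUNDARY_MASK = (1 << 13) - 1  # boundary expected roughly every 8 KiB scanned

def find_chunks(data: bytes, skip_bytes: int):
    chunks, start = [], 0
    while start < len(data):
        # Skip offset: no fingerprints are computed over the first
        # skip_bytes of each chunk, reducing the CPU cost per chunk.
        pos = start + skip_bytes
        while pos + WINDOW <= len(data):
            fp = int.from_bytes(
                hashlib.sha1(data[pos:pos + WINDOW]).digest()[:8], "big")
            if fp & BOUNDARY_MASK == 0:  # content-defined boundary found
                break
            pos += 1
        end = min(pos + WINDOW, len(data))
        chunks.append(data[start:end])
        start = end
    return chunks

print(len(find_chunks(os.urandom(1 << 20), skip_bytes=64 * 1024)), "chunks")

A larger skip offset means fewer positions are fingerprinted per chunk, which is consistent with the small overhead relative to plain HDFS reported in Table 4.2.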
 