of RAM. The task trackers were parameterized to use two map and two reduce slots
per worker machine.
Work and time. We separately measure work and time to compare the perfor-
mance across runs. Work is the total computation time summed over all tasks, which eliminates the effect of some machines idling while waiting to synchronize with others. (Parallel) time refers to the total running time for
the job. The two metrics are related through the work-time principle, which states
that a computation with W work can be executed on P machines (or processors)
in W/P time if there are no scheduling overheads. Note that the work measurements
include the additional computational work performed by tasks that are speculatively
executed by the Hadoop framework (e.g., Hadoop can run the same task on two dif-
ferent machines to improve performance if there is spare capacity on the cluster).
Therefore, a difference in the number of speculative tasks that are launched will be
reflected in the comparison of work.
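To make the two metrics concrete, the following sketch computes work as the sum of per-task durations, speculative attempts included, and derives the ideal parallel time given by the work-time principle. All task names and durations are made up for illustration and are not taken from Incoop or Hadoop.

# Hypothetical per-task durations (seconds), as might be reported for
# Hadoop task attempts; names and numbers are illustrative only.
task_durations = {
    "map_000_attempt_0": 42.0,
    "map_001_attempt_0": 40.5,
    "map_001_attempt_1": 39.8,  # speculative re-execution also counts toward work
    "reduce_000_attempt_0": 58.3,
}

# Work: total computation time across all tasks, speculative ones included.
work = sum(task_durations.values())

# Work-time principle: W work on P machines takes W / P time in the
# absence of scheduling overheads; measured parallel time can only be higher.
P = 2
ideal_parallel_time = work / P
print(f"work = {work:.1f} s, ideal time on {P} machines = {ideal_parallel_time:.1f} s")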
Initial and incremental runs. When evaluating Incoop, we need to consider two types of runs. The initial run operates on data that has never been seen before, and can therefore start with an empty memoization server, which it then populates. The incremental run is a subsequent run in which the input is modified by a certain fraction, and the system tries to reuse subcomputations to the extent possible.
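The distinction between the two types of runs can be illustrated with a minimal sketch, assuming the memoization server behaves like a key-value store indexed by a fingerprint of each input chunk. The function names and the toy word-count task are hypothetical; the actual Incoop design is considerably more elaborate.

import hashlib

memo = {}  # fingerprint of input chunk -> cached task result

def process_chunk(chunk: bytes) -> int:
    # Stand-in for a map task; here, a toy word count.
    return len(chunk.split())

def run_job(chunks):
    results, reused = [], 0
    for chunk in chunks:
        key = hashlib.sha1(chunk).hexdigest()
        if key in memo:          # reuse a memoized subcomputation
            reused += 1
        else:                    # compute and populate the memoization store
            memo[key] = process_chunk(chunk)
        results.append(memo[key])
    print(f"reused {reused} of {len(chunks)} chunks")
    return results

run_job([b"a b c", b"d e f", b"g h"])   # initial run: nothing reused
run_job([b"a b c", b"X Y", b"g h"])     # incremental run: 2 of 3 chunks reused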
Speedup. We present the results comparing the performance of Incoop and
Hadoop by plotting the speedup, that is, the ratio of the work or parallel time required
by Hadoop to the work or time required by Incoop. In most cases we plot how this
speedup varies as we change the fraction of the input that differs from the initial
to the incremental run. To run an experiment where x% of the input data differs between the two runs, we randomly choose x% of the chunks in the input and replace them with new, equally sized chunks containing new content.
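The following sketch shows this measurement methodology under simplifying assumptions: mutate_input and speedup are hypothetical helpers, and the chunk sizes and cost numbers are illustrative rather than measured values.

import random

def mutate_input(chunks, x_percent, seed=42):
    # Replace x% of the chunks with new, equally sized random content.
    rng = random.Random(seed)
    n_changed = round(len(chunks) * x_percent / 100)
    mutated = list(chunks)
    for i in rng.sample(range(len(chunks)), n_changed):
        mutated[i] = bytes(rng.randrange(256) for _ in range(len(chunks[i])))
    return mutated

def speedup(hadoop_cost, incoop_cost):
    # Applies to either metric: work or parallel time.
    return hadoop_cost / incoop_cost

chunks = [bytes([65 + i % 26]) * 64 for i in range(100)]
changed = mutate_input(chunks, x_percent=10)
print(sum(a != b for a, b in zip(chunks, changed)), "chunks replaced")
print("speedup =", speedup(hadoop_cost=1200.0, incoop_cost=300.0))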
4.6.4 Incremental HDFS
We compare the throughput during an upload of a 3 GB data set to HDFS and to Inc-HDFS, varying the number of skipped bytes in Inc-HDFS. The client writing to the file system runs on the same machine as the Hadoop name node. The results of this experiment are summarized in Table 4.2. Overall, Inc-HDFS adds only a small throughput overhead compared with HDFS, which can be attributed to the fingerprint computation.
TABLE 4.2
Throughput of HDFS and Inc-HDFS

Version             Skip Offset (MB)   Throughput (MB/s)
HDFS                -                  34.41
Incremental HDFS    20                 32.67
Incremental HDFS    40                 34.19
Incremental HDFS    60                 32.04
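The fingerprint computation behind this overhead arises from content-defined chunk boundary detection, where a configurable number of bytes at the start of each chunk is skipped before scanning for a boundary. The sketch below illustrates the idea under stated assumptions: a real implementation would use a rolling (Rabin-style) fingerprint rather than recomputing a hash at every position, and all constants are illustrative, not Inc-HDFS's actual parameters.

import hashlib
import os

WINDOW = 48                    # bytes fingerprinted at each scan position
BOUNDARY_MASK = (1 << 13) - 1  # boundary expected roughly every 8 KiB scanned

def find_chunks(data: bytes, skip_bytes: int):
    chunks, start = [], 0
    while start < len(data):
        # Skip offset: no fingerprints are computed over the first
        # skip_bytes of each chunk, reducing the CPU cost per chunk.
        pos = start + skip_bytes
        while pos + WINDOW <= len(data):
            fp = int.from_bytes(
                hashlib.sha1(data[pos:pos + WINDOW]).digest()[:8], "big")
            if fp & BOUNDARY_MASK == 0:  # content-defined boundary found
                break
            pos += 1
        end = min(pos + WINDOW, len(data))
        chunks.append(data[start:end])
        start = end
    return chunks

print(len(find_chunks(os.urandom(1 << 20), skip_bytes=64 * 1024)), "chunks")

A larger skip offset means fewer positions are fingerprinted per chunk, which is consistent with the small overhead relative to plain HDFS reported in Table 4.2.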
 