Incremental MapReduce Computations - Large Scale and Big Data: Processing and Management


TABLE 4.1
Applications Used in the Performance Evaluation

Application    Description

K-Means        K-Means clustering is a method of cluster analysis for partitioning n data points into k clusters, in which each observation belongs to the cluster with the nearest mean.

WordCount      Word count determines the frequency of words in a document.

KNN            K-nearest neighbors classifies objects based on the closest training examples in a feature space.

CoMatrix       Co-occurrence matrix generates an N × N matrix, where N is the number of unique words in the corpus. A cell m_ij contains the number of times word w_i co-occurs with word w_j.

BiCount        Bigram count measures the prevalence of each subsequence of two items within a given sequence.
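To make the three text-processing entries in Table 4.1 concrete, the following is a minimal single-machine sketch in Python, not the MapReduce implementations used in the evaluation. The function names are hypothetical, tokens are assumed to be whitespace-separated, and "co-occurs" is assumed to mean "appears on the same input line"; the table itself does not fix either choice.

```python
from collections import Counter
from itertools import combinations


def word_count(lines):
    """WordCount: frequency of each whitespace-separated token."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def bigram_count(lines):
    """BiCount: frequency of each pair of adjacent tokens within a line."""
    counts = Counter()
    for line in lines:
        tokens = line.split()
        counts.update(zip(tokens, tokens[1:]))
    return counts


def co_occurrence(lines):
    """CoMatrix: sparse N x N matrix keyed by (w_i, w_j).

    Assumption: two distinct words co-occur once for every line
    that contains both of them.
    """
    cells = Counter()
    for line in lines:
        unique = sorted(set(line.split()))
        for w_i, w_j in combinations(unique, 2):
            cells[(w_i, w_j)] += 1
            cells[(w_j, w_i)] += 1  # the matrix is symmetric
    return cells
```

Storing only the nonzero cells in a `Counter` avoids materializing the full N × N matrix, which matters once N is the vocabulary of a large corpus.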

from a public data set.* The two CPU-intensive applications use a set of points in a

d-dimensional space as input. In this case we used a set of randomly generated points

in a 50-dimensional unit cube. To obtain reasonable running times, we chose all the

input sizes in a way that the running time of each job would be approximately 1 hour.
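As an illustration of the input to the CPU-intensive applications, the sketch below generates uniformly random points in a 50-dimensional unit cube and assigns each point to its nearest mean, the core step of K-Means as described in Table 4.1. It is a simplified stand-in under stated assumptions: the function names, the seed, the point count, the squared Euclidean distance, and the choice of the first four points as initial means are all hypothetical and not taken from the evaluation.

```python
import random


def random_points(n, d=50, seed=0):
    """Return n points drawn uniformly from the d-dimensional unit cube."""
    rng = random.Random(seed)  # fixed seed (assumption) for repeatable input
    return [[rng.random() for _ in range(d)] for _ in range(n)]


def nearest_mean(point, means):
    """Index of the mean closest to `point` (squared Euclidean distance)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(means)), key=lambda i: sq_dist(point, means[i]))


points = random_points(1000)
means = points[:4]  # hypothetical initialization: first k = 4 points
assignment = [nearest_mean(p, means) for p in points]
```

A full K-Means run would recompute each mean from its assigned points and repeat until the assignment stops changing; only the assignment step is shown here.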

4.6.3 Overview of the Experiments

Our evaluation tries to answer the following questions:

• What are the overheads introduced by Inc-HDFS compared with HDFS? (Section 4.6.4)

• What are the performance gains of using Incoop when compared with recomputing from scratch? (Section 4.6.5)

• How important is each of the design features we introduce? (Section 4.6.6)

• What are the overheads introduced by Incoop when a job is executed for the first time? (Section 4.6.7)

To answer these questions, we ran experiments using the following setting and

measured the following data.

Experimental setup. We ran experiments on a cluster of 20 machines, running

the Linux kernel 2.6.32 in 64-bit mode, connected by Gigabit Ethernet. The name

node and the job tracker of Hadoop ran on a master machine, which had a 12-core

Intel Xeon processor and 12 GB of RAM. When we ran an Inc-HDFS client on this

machine, the parallel chunking code in Inc-HDFS was parameterized to spawn 12

threads, that is, one thread per core. The data nodes and task trackers of Hadoop ran

on the remaining 19 machines, which had AMD Opteron-252 processors and 4 GB

* Wikipedia data set: http://wiki.dbpedia.org/.
