TABLE 4.1
Applications Used in the Performance Evaluation

Application  Description
K-Means      K-Means clustering is a method of cluster analysis for partitioning
             n data points into k clusters, in which each observation belongs to
             the cluster with the nearest mean.
WordCount    Word count determines the frequency of words in a document.
KNN          K-nearest neighbors classifies objects based on the closest training
             examples in a feature space.
CoMatrix     Co-occurrence matrix generates an N × N matrix, where N is the
             number of unique words in the corpus. A cell m_ij contains the
             number of times word w_i co-occurs with word w_j.
BiCount      Bigram count measures the prevalence of each subsequence of two
             items within a given sequence.
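
To make the three text-processing entries of Table 4.1 concrete, the following is a minimal single-machine sketch in Python. It is not the MapReduce code used in the evaluation. The function names are hypothetical, and the co-occurrence rule (two distinct words co-occur once for every sentence that contains both) is an assumption, since the table does not define the co-occurrence window.

```python
from collections import Counter


def word_count(tokens):
    """WordCount: frequency of each word in a token sequence."""
    return Counter(tokens)


def bigram_count(tokens):
    """BiCount: frequency of each pair of adjacent tokens."""
    return Counter(zip(tokens, tokens[1:]))


def co_occurrence(sentences):
    """CoMatrix: sparse N x N matrix stored as {(w_i, w_j): m_ij}.

    Assumption: two distinct words co-occur once for every sentence
    that contains both of them.
    """
    cells = Counter()
    for sentence in sentences:
        vocab = set(sentence)
        for w_i in vocab:
            for w_j in vocab:
                if w_i != w_j:
                    cells[(w_i, w_j)] += 1
    return cells


sentences = [["a", "b", "a"], ["b", "c"]]
all_tokens = [word for sentence in sentences for word in sentence]

wc = word_count(all_tokens)
bc = bigram_count(sentences[0])
cm = co_occurrence(sentences)
```

Under this definition the matrix is symmetric, so a real implementation could store only one triangle of it.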
from a public data set.* The two CPU-intensive applications use a set of points in a
d-dimensional space as input. In this case we used a set of randomly generated points
in a 50-dimensional unit cube. To obtain reasonable running times, we chose the
input sizes so that each job would run for approximately 1 hour.
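
As an illustration of this kind of input, the sketch below draws points uniformly at random from a 50-dimensional unit cube and assigns each one to the nearest of a few means, which is the assignment rule Table 4.1 gives for K-Means. The point count, the seed, the choice of the first three points as initial means, and the function names are all assumptions made for the example. They are not the values or code used in the experiments.

```python
import random


def random_points(n, d=50, seed=0):
    """Return n points drawn uniformly from the d-dimensional unit cube."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return [[rng.random() for _ in range(d)] for _ in range(n)]


def nearest_mean(point, means):
    """Index of the mean closest to point (squared Euclidean distance)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(means)), key=lambda i: sq_dist(point, means[i]))


points = random_points(1000)   # hypothetical input size
means = points[:3]             # hypothetical initial means
assignments = [nearest_mean(p, means) for p in points]
```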
4.6.3 Overview of the Experiments
Our evaluation tries to answer the following questions:
• What are the overheads introduced by Inc-HDFS compared with HDFS?
  (Section 4.6.4)
• What are the performance gains of using Incoop when compared with
  recomputing from scratch? (Section 4.6.5)
• How important is each of the design features we introduce? (Section 4.6.6)
• What are the overheads introduced by Incoop when a job is executed for the
  first time? (Section 4.6.7)
To answer these questions, we ran experiments using the following setting and
measured the following data.
Experimental setup. We ran experiments on a cluster of 20 machines running
Linux kernel 2.6.32 in 64-bit mode and connected by Gigabit Ethernet. The name
node and the job tracker of Hadoop ran on a master machine, which had a 12-core
Intel Xeon processor and 12 GB of RAM. When we ran an Inc-HDFS client on this
machine, the parallel chunking code in Inc-HDFS was configured to spawn 12
threads, that is, one thread per core. The data nodes and task trackers of Hadoop ran
on the remaining 19 machines, which had AMD Opteron-252 processors and 4 GB
* Wikipedia data set: http://wiki.dbpedia.org/.
 