information about your data, the speed and convenience of processing a smaller
data set generally outweigh any loss of precision. Finding data clusters is one
example of such descriptive information. Optimized implementations of a variety of
clustering algorithms are readily available in R, MATLAB, and other packages. It
often makes more sense to sample down the data and apply one of those standard
packages than to try to process the full data set with a distributed clustering
algorithm in Hadoop.
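As a minimal sketch of the second half of this workflow, assuming the sampled records have already been pulled down to a local comma-separated file named sample.txt and that scikit-learn is available (both the file name and the library choice are assumptions, not part of the text), the clustering step might look like this:

# cluster_sample.py -- illustrative sketch; the file name, numeric column layout,
# and the choice of scikit-learn's KMeans are assumptions.
import numpy as np
from sklearn.cluster import KMeans

# Load the locally sampled data (assumed to be comma-separated numeric columns).
data = np.loadtxt("sample.txt", delimiter=",")

# Run a standard single-machine clustering algorithm on the much smaller sample.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(data)

print("cluster centers:")
print(kmeans.cluster_centers_)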
WARNING The loss of precision from computing on a sampled data set may
or may not be important. It depends on what you're trying to compute and
on the distribution of your data. For example, it's usually fine to compute an
average from a sampled data set, but if the data is highly skewed and the
average is dominated by a few values, sampling can be problematic. Similarly,
clustering a sampled data set is fine if it's used only to get a general
understanding of the data, but if you're looking for small, anomalous clusters,
sampling may remove them entirely. Functions such as maximum and minimum
should not be applied to sampled data at all, because the extreme values are
likely to be missing from the sample.
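To make the warning concrete, here is a small illustrative sketch (not from the text) comparing the mean and the maximum of a highly skewed data set with those of a 10 percent sample:

import random

random.seed(42)
# A skewed data set: mostly small values plus a handful of very large ones.
data = [random.expovariate(1.0) for _ in range(100000)] + [10000, 20000, 50000]

# Take a 10 percent sample, the same rate we use with RandomSample.py below.
sample = [x for x in data if random.random() < 0.10]

# With a highly skewed data set the sampled mean can already be noticeably off,
# and the sampled maximum almost certainly misses the few extreme values.
print("full mean:   ", sum(data) / len(data))
print("sample mean: ", sum(sample) / len(sample))
print("full max:    ", max(data))
print("sample max:  ", max(sample))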
Running RandomSample.py using Streaming is like running Unix commands using
Streaming, the difference being that Unix commands are already available on all nodes
in the cluster, whereas RandomSample.py is not. Hadoop Streaming supports a -file
option to package your executable file as part of the job submission.7 Our command
to execute RandomSample.py is:
bin/hadoop jar contrib/streaming/hadoop-0.19.1-streaming.jar \
    -input input/cite75_99.txt \
    -output output \
    -mapper 'RandomSample.py 10' \
    -file RandomSample.py \
    -D mapred.reduce.tasks=1
In specifying the mapper to be 'RandomSample.py 10' we're sampling at 10 percent.
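The listing for RandomSample.py isn't reproduced here; a minimal sketch of such a streaming mapper, assuming it takes the sampling percentage as its first command-line argument and emits each input line with that probability, could be:

#!/usr/bin/env python
# RandomSample.py -- illustrative sketch of a percentage-based sampling mapper.
import random
import sys

def main(argv):
    # The sampling rate is given as a percentage, e.g. 'RandomSample.py 10'.
    percentage = int(argv[1])
    for line in sys.stdin:
        # Keep each input line with probability percentage/100.
        if random.randint(1, 100) <= percentage:
            sys.stdout.write(line)

if __name__ == "__main__":
    main(sys.argv)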
Note that we've set the number of reducers (mapred.reduce.tasks) to 1. As we
haven't specified any particular reducer, it will use the default IdentityReducer. As
its name implies, IdentityReducer passes its input straight through to its output. In
this case we can set the number of reducers to any nonzero value to get an exact
number of output files. Alternatively, we can set the number of reducers to 0 and let
the number of output files be the number of mappers. This is probably not ideal for
the sampling task, as each mapper's output is only a small fraction of the input and
we may end up with many small files. We can easily correct that later using the HDFS
shell command getmerge (shown below) or other file manipulations to arrive at the
right number of output files. Which approach to use is more or less a personal
preference.
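For example (the directory and local file names here are assumptions), the small part files in the job's output directory could be merged into a single local file with:

bin/hadoop fs -getmerge output sampled_output.txt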
7 It's also implicitly assumed that you have installed the Python language on all the nodes in your cluster.
 