information about your data, the speed and convenience of processing a smaller
data set generally outweigh any loss of precision. Finding data clusters is one
example of such descriptive information. Optimized implementations of a variety of
clustering algorithms are readily available in R, MATLAB, and other packages. It
often makes more sense to sample down the data and apply one of those standard
packages than to try to process the full data set with a distributed clustering
algorithm in Hadoop.
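As a minimal sketch of the second half of this workflow, assuming the sampled records have already been pulled down to a local comma-separated file named sample.txt and that scikit-learn is available (both the file name and the library choice are assumptions, not part of the text), the clustering step might look like this:

# cluster_sample.py -- illustrative sketch; the file name, numeric column layout,
# and the choice of scikit-learn's KMeans are assumptions.
import numpy as np
from sklearn.cluster import KMeans

# Load the locally sampled data (assumed to be comma-separated numeric columns).
data = np.loadtxt("sample.txt", delimiter=",")

# Run a standard single-machine clustering algorithm on the much smaller sample.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(data)

print("cluster centers:")
print(kmeans.cluster_centers_)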
WARNING The loss of precision from computing on a sampled data set may
or may not be important. It depends on what you're trying to compute and
on the distribution of your data. For example, it's usually fine to compute an
average from a sampled data set, but if the data is highly skewed and the
average is dominated by a few values, sampling can be problematic. Similarly,
clustering a sampled data set is fine if it's used only to get a general
understanding of the data, but if you're looking for small, anomalous clusters,
sampling may remove them entirely. Functions such as maximum and minimum
should not be applied to sampled data at all, because the extreme values are
likely to be missing from the sample.
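To make the warning concrete, here is a small illustrative sketch (not from the text) comparing the mean and the maximum of a highly skewed data set with those of a 10 percent sample:

import random

random.seed(42)
# A skewed data set: mostly small values plus a handful of very large ones.
data = [random.expovariate(1.0) for _ in range(100000)] + [10000, 20000, 50000]

# Take a 10 percent sample, the same rate we use with RandomSample.py below.
sample = [x for x in data if random.random() < 0.10]

# With a highly skewed data set the sampled mean can already be noticeably off,
# and the sampled maximum almost certainly misses the few extreme values.
print("full mean:   ", sum(data) / len(data))
print("sample mean: ", sum(sample) / len(sample))
print("full max:    ", max(data))
print("sample max:  ", max(sample))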
Running RandomSample.py using Streaming is like running Unix commands using
Streaming, the difference being that Unix commands are already available on all nodes
in the cluster, whereas RandomSample.py is not. Hadoop Streaming supports a -file
option to package your executable file as part of the job submission.7 Our command
to execute RandomSample.py is:
bin/hadoop jar contrib/streaming/hadoop-0.19.1-streaming.jar \
    -input input/cite75_99.txt \
    -output output \
    -mapper 'RandomSample.py 10' \
    -file RandomSample.py \
    -D mapred.reduce.tasks=1
In specifying the mapper to be 'RandomSample.py 10' we're sampling at 10 percent.
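The listing for RandomSample.py isn't reproduced here; a minimal sketch of such a streaming mapper, assuming it takes the sampling percentage as its first command-line argument and emits each input line with that probability, could be:

#!/usr/bin/env python
# RandomSample.py -- illustrative sketch of a percentage-based sampling mapper.
import random
import sys

def main(argv):
    # The sampling rate is given as a percentage, e.g. 'RandomSample.py 10'.
    percentage = int(argv[1])
    for line in sys.stdin:
        # Keep each input line with probability percentage/100.
        if random.randint(1, 100) <= percentage:
            sys.stdout.write(line)

if __name__ == "__main__":
    main(sys.argv)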
Note that we've set the number of reducers (mapred.reduce.tasks) to 1. As we
haven't specified any particular reducer, it will use the default IdentityReducer. As
its name implies, IdentityReducer passes its input straight through to its output. In
this case we can set the number of reducers to any nonzero value to get an exact
number of output files. Alternatively, we can set the number of reducers to 0 and let
the number of output files be the number of mappers. This is probably not ideal for
the sampling task, as each mapper's output is only a small fraction of the input and
we may end up with many small files. We can easily correct that later using the HDFS
shell command getmerge (shown below) or other file manipulations to arrive at the
right number of output files. Which approach to use is more or less a personal
preference.
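For example (the directory and local file names here are assumptions), the small part files in the job's output directory could be merged into a single local file with:

bin/hadoop fs -getmerge output sampled_output.txt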
7 It's also implicitly assumed that you have installed the Python language on all the nodes in your cluster.
 