Hadoop 1.0.3, R 2.14.1). One node from the Master Instance Group has the
extended Pig 0.11 with the RPig feature deployed to generate MapReduce
programs. The rest of the nodes are from the Core Instance Group, providing
both data storage and MapReduce task execution services. Because R requires
data to be loaded in memory, each node is configured with a maximum
capacity of one map task and one reduce task, so that an R session can use
the maximum memory available on a single node. We also assign a larger
heap-size limit to the child JVMs of map tasks, as these are where the R
statistical functions are executed. The reduce task is allocated a lower value.
9.5.1 Summary Statistics with Quantiles
Before moving on to complex examples that use different R packages, we first
show a simple quantile computation as a “hello world” example. Quantiles
summarize a set of observations by giving the boundary values that divide the
ordered observations into groups. For example, a large number of values for a
network parameter observed over time can be summarized in a few numbers, or
quantiles, for reporting or for comparison with thresholds.
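As a small illustration of this idea, the sketch below summarizes a handful of readings by their quartiles and maximum and compares the summary with a threshold. The readings, the threshold, and the variable names are hypothetical and are not taken from the experiments; the quartiles come from the "inclusive" method of Python's standard library, which is only one of several common quantile estimators.

```python
import statistics

# Hypothetical latency readings (ms) for one network parameter over time
latency_ms = [12, 15, 11, 40, 13, 14, 90, 16, 12]

# Quartile boundaries (25th, 50th and 75th percentiles), inclusive method
q1, median, q3 = statistics.quantiles(latency_ms, n=4, method="inclusive")
peak = max(latency_ms)

THRESHOLD_MS = 50  # hypothetical alerting threshold

# The whole series is reduced to four numbers that can be reported or checked
summary = {"q1": q1, "median": median, "q3": q3, "max": peak}
breaches = [name for name, value in summary.items() if value > THRESHOLD_MS]

print(summary)
print("above threshold:", breaches)
```

Here only the maximum exceeds the threshold while the quartiles stay well below it, which suggests an isolated spike rather than a general shift in the parameter.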
9.5.1.1 Design and Implementation
DataFu [7] is a collection of useful Pig add-ons (UDFs) developed by LinkedIn
for data mining and statistics, and it serves as the baseline for comparison
in this use case. DataFu is used in many off-line workflows for data-derived
products such as “People You May Know” and “Skills” at LinkedIn. The following
shows the main lines of the implementation using DataFu.*
DEFINE Quantile datafu.pig.stats.Quantile('0.5','0.75','1.0');
-- Compute the quantiles for each network node
Quantiles = FOREACH B {
    sorted = ORDER values BY val;
    GENERATE id, Quantile(sorted);
};
DataFu uses the DEFINE statement to specify a Quantile UDF, passing string
parameters to the function constructor. The parameters ('0.5','0.75','1.0')
yield the 50th and 75th percentiles and the maximum. The function takes a
sorted bag as its input.
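To make the computation concrete, the following is a minimal Python sketch of a quantile function over an already sorted input, applied per node id. It assumes the type 2 definition (the inverse of the empirical distribution function, averaging at discontinuities), which is the type the RPig version selects through q_type. The function name quantiles_r2 and the sample observations are hypothetical, and the sketch is not DataFu's actual implementation.

```python
import math

def quantiles_r2(sorted_values, probs):
    """Type 2 quantiles of an already sorted, non-empty sequence."""
    n = len(sorted_values)
    if n == 0:
        raise ValueError("need at least one observation")
    result = []
    for p in probs:
        h = n * p
        j = math.floor(h)
        if h == j:
            # n*p falls exactly on a step of the empirical CDF:
            # average the two neighbouring order statistics (clamped to the ends)
            lower = sorted_values[max(j, 1) - 1]
            upper = sorted_values[min(j + 1, n) - 1]
            result.append((lower + upper) / 2)
        else:
            # Otherwise take the next order statistic (1-based j+1, 0-based j)
            result.append(float(sorted_values[j]))
    return result

# Hypothetical per-node observations, standing in for the grouped relation
observations = {
    "node-a": [8, 3, 5, 1, 7, 2, 6, 4],
    "node-b": [50, 10, 40, 20, 30],
}

# Sort each bag first, as the UDF expects sorted input, then take the quantiles
per_node = {
    node_id: quantiles_r2(sorted(values), [0.5, 0.75, 1.0])
    for node_id, values in observations.items()
}
print(per_node)
```

The exact comparison h == j is adequate for the probabilities used here; a production implementation would typically allow a small tolerance for floating-point rounding.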
The following shows the RPig version for the same computational task.
REGISTER 'RFuncs.r' using rsession as RFuncs; -- or using 'renjin' for JVM R
%declare q_probs '0.5, 0.75, 1';
%declare q_type '2';
Quantiles = FOREACH A GENERATE id, RFuncs.Quantile(values, '$q_probs', '$q_type');
* A detailed explanation of the DataFu quantile example is available at
http://engineering.linkedin.com/open-source/introducing-datafu-open-source-collection-useful-apache-pig-udfs.