In addition to these Pig statements, the following R UDF is defined in the
RFuncs.r script.
Quantile.outputSchema <- "q:double";
Quantile <- function (x, probs, type) {
  # parse the parameter value
  probs <- as.numeric(unlist(strsplit(probs, split = ",")))
  # call the R quantile() function
  q <- quantile(unlist(x), probs = probs, names = TRUE, type = type);
  return (as.list(q)); }
The Quantile UDF is a simple wrapper for the quantile() function of the
stats package. x is a numeric vector whose sample quantiles are desired.
Its value is converted from the Pig input tuple by the framework. The values
of the function parameters (probs, type) can be supplied in different ways,
for example, through a Declare statement, as in this example, or through a
parameter file.
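To make the parameter handling concrete, the wrapper can be called directly in a plain R session. This is only a sketch: the input list `sample_input` is a hypothetical stand-in for the value that the framework would convert from a Pig tuple, and the probs string and type value are chosen for illustration.

```r
# The wrapper from the text, reproduced so that the sketch is self-contained.
Quantile <- function(x, probs, type) {
  # probs arrives as one comma-separated string, e.g. "0.25,0.5,0.75"
  probs <- as.numeric(unlist(strsplit(probs, split = ",")))
  q <- quantile(unlist(x), probs = probs, names = TRUE, type = type)
  as.list(q)
}

# Hypothetical input standing in for the converted Pig tuple (an assumption).
sample_input <- list(1, 2, 3, 4, 5)

# Request the quartiles with the default R algorithm (type 7).
result <- Quantile(sample_input, "0.25,0.5,0.75", 7)
print(result)
```

The result is a named list with one element per requested probability, which is the shape that as.list() gives to the named vector returned by quantile().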
To summarize this use case of RPig, any original R function can easily be
wrapped and exposed as a Pig R UDF. The necessary input parameters of
the original R function can be exposed by the UDF to make the function
more generic and reusable. However, all the input data for a single function
call is processed in one R engine, so some partitioning (e.g., group by
“week”) might be necessary if the data are too large.
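The effect of such partitioning can be illustrated in plain R. In RPig the grouping would be expressed in Pig, so the sketch below only mimics it with split(); the data frame `measurements` and its columns are hypothetical.

```r
# Hypothetical data: one measurement per row, tagged with a week number.
measurements <- data.frame(
  week  = c(1, 1, 1, 2, 2, 2),
  value = c(10, 20, 30, 40, 50, 60)
)

# split() stands in for a Pig "group by week": each week becomes its own vector.
by_week <- split(measurements$value, measurements$week)

# Each quantile() call now sees one week of data rather than the full data set.
medians <- lapply(by_week, function(v) {
  quantile(v, probs = 0.5, names = FALSE, type = 7)
})
print(medians)
```

Each group is handled by a separate function call, so no single call has to hold the whole data set in memory.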
9.5.1.2 Result and Discussion
In this use case of computing quantiles, both DataFu and RPig only require
a few lines of Pig (and R) code as the user does not need to write the quan-
tile algorithm. However, the RPig implementation of the function is much
more flexible regarding the data input and output formats than DataFu. The
DataFu quantile function only takes a sorted input bag, and each numeric
value is a tuple inside the bag. We have to preformat the raw data before
calling the function in this case. In contrast, the RPig version can handle any
format of bag or tuple input. Numeric values can be in a single tuple or bag
or spread across separate tuples or bags, since the data will always be
flattened into a numeric vector in the R function before the quantiles are
computed.
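The flattening step can be sketched in plain R. The two nested lists below are assumptions about how a single tuple and a bag of one-field tuples might look after conversion; the exact structure produced by the framework is not shown in the text.

```r
# Assumed layout 1: all values in a single tuple.
one_tuple <- list(1.5, 2.5, 3.5, 4.5)

# Assumed layout 2: each value in its own one-field tuple inside a bag.
separate_tuples <- list(list(1.5), list(2.5), list(3.5), list(4.5))

# unlist() recursively flattens both layouts into the same numeric vector.
flat_a <- unlist(one_tuple)
flat_b <- unlist(separate_tuples)
same_vector <- identical(flat_a, flat_b)

# The quantile computation is therefore independent of the input layout.
median_a <- quantile(flat_a, probs = 0.5, names = FALSE, type = 7)
median_b <- quantile(flat_b, probs = 0.5, names = FALSE, type = 7)
print(c(same_vector, median_a == median_b))
```

Because unlist() is applied before quantile(), no preformatting of the input is needed on the Pig side in this sketch.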
Figure 9.2 shows the performance comparison on a fixed 20-node Hadoop
cluster. Each row of input data contains more than 10K double values for one
network node, which amounts to around 1 GB of raw data for every 10K rows.
The RPig implementation with the JVM R engine (Renjin 0.7.0) is the
slowest. It becomes very slow as the input data size grows, and it consumes
almost all of the memory available to the map task. This may be related to
Renjin's internal memory management, since Renjin is still at a very early
stage of development. RPig with stand-alone R performs best.
DataFu (v 0.0.10) comes second since it needs to preformat and sort the data