RPig: Concise Programming Framework by Integrating R with Pig for Big Data Analytics - Cloud Computing with e-Science Applications

In addition to these Pig statements, the following R UDF is defined in the RFuncs.r script.

Quantile.outputSchema <- "q:double"

Quantile <- function(x, probs, type) {
  # parse the parameter value
  probs <- as.numeric(unlist(strsplit(probs, split = ",")))
  # call the R quantile() function
  q <- quantile(unlist(x), probs = probs, names = TRUE, type = type)
  return(as.list(q))
}

The Quantile UDF is a simple wrapper for the quantile() function of the R stats package. x is a numeric vector whose sample quantiles are desired; its value is converted from the Pig input tuple by the framework. The values of the function parameters (probs, type) can be supplied in different ways, for example, through a Declare statement, as in this example, or through a parameter file.
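To make the wrapper's behavior concrete, the following minimal sketch calls the UDF directly in a plain R session, outside Pig. The list passed as x is only an assumed stand-in for the value the framework would build from the Pig input tuple, and the probs string and type value are illustrative choices, not values taken from the example script.

```r
# Same wrapper as in RFuncs.r, repeated so the sketch is self-contained.
Quantile <- function(x, probs, type) {
  probs <- as.numeric(unlist(strsplit(probs, split = ",")))
  q <- quantile(unlist(x), probs = probs, names = TRUE, type = type)
  return(as.list(q))
}

# Assumption: a plain R list stands in for the converted Pig input tuple.
x <- list(1, 2, 3, 4, 5)

# probs arrives as a comma-separated string and is parsed inside the UDF.
res <- Quantile(x, "0.25,0.5,0.75", 7)

# One list element per requested probability, named by quantile().
print(names(res))
print(unlist(res, use.names = FALSE))
```

Passing probs as a single string keeps the parameter easy to supply from the Pig side, while the parsing step turns it back into the numeric vector that quantile() expects.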

To summarize this use case of RPig, any existing R function can easily be wrapped and exposed as a Pig R UDF. The necessary input parameters of the original R function can be exposed by the UDF to make it more generic and reusable. However, all the input data for a single function call are processed in one R engine, so some partitioning (e.g., group by "week") might be necessary if the data are too large.
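The effect of such partitioning can be imitated locally in R. The sketch below is only an illustration under stated assumptions: the data frame, its week and value columns, and the use of split() are hypothetical stand-ins for a Pig group-by, not part of RPig. The point is that each call to the UDF then receives only the values of one group.

```r
# Same wrapper as in RFuncs.r, repeated so the sketch is self-contained.
Quantile <- function(x, probs, type) {
  probs <- as.numeric(unlist(strsplit(probs, split = ",")))
  q <- quantile(unlist(x), probs = probs, names = TRUE, type = type)
  return(as.list(q))
}

# Hypothetical sample data: one measurement per row, tagged with a week.
df <- data.frame(week  = c(1, 1, 1, 2, 2, 2),
                 value = c(1, 2, 3, 10, 20, 30))

# Local stand-in for a Pig "group by week": one numeric vector per week.
groups <- split(df$value, df$week)

# One UDF call per group, so each call handles only that group's data.
per_week <- lapply(groups, function(v) Quantile(as.list(v), "0.5", 7))

print(per_week[["1"]][["50%"]])
print(per_week[["2"]][["50%"]])
```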

9.5.1.2 Result and Discussion

In this use case of computing quantiles, both DataFu and RPig require only a few lines of Pig (and R) code, as the user does not need to write the quantile algorithm. However, the RPig implementation of the function is much more flexible than DataFu regarding the input and output data formats. The DataFu quantile function only takes a sorted input bag in which each numeric value is a tuple, so the raw data have to be preformatted before the function is called. In contrast, the RPig version can handle any format of bag or tuple input. Numeric values can be either in one tuple/bag or in separate tuples/bags, since the data are always flattened into a numeric vector in the R function before the quantiles are computed.
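The flattening step is what makes the input layout irrelevant. As a minimal sketch, assuming that nested R lists are a reasonable stand-in for differently shaped bags and tuples (the actual converted structures are not shown in this section), unlist() reduces both layouts to the same numeric vector, so quantile() returns the same result for each.

```r
# Assumed stand-ins for two possible input layouts.
one_bag   <- list(1, 2, 3, 4)              # all values in one container
many_bags <- list(list(1, 2), list(3, 4))  # values spread over nested containers

# unlist() flattens both layouts into a plain numeric vector.
flat_a <- unlist(one_bag)
flat_b <- unlist(many_bags)
same_vector <- identical(flat_a, flat_b)

# The quantile computation therefore does not depend on the layout.
q_a <- quantile(flat_a, probs = 0.5, type = 7)
q_b <- quantile(flat_b, probs = 0.5, type = 7)

print(same_vector)
print(identical(q_a, q_b))
```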

Figure 9.2 shows the performance comparison on a fixed 20-node Hadoop cluster. Each row of input data contains more than 10K double values for one network node, which amounts to around 1 GB of raw data for every 10K rows. The RPig implementation with the JVM R engine (Renjin 0.7.0) has the slowest performance. It becomes very slow as the input data size grows, and it consumes almost all the memory available to the map task. This might be related to internal memory management problems in Renjin, which is still at a very early stage. RPig with stand-alone R has the best performance. DataFu (v 0.0.10) comes second since it needs to preformat and sort the data
