RPig: Concise Programming Framework by Integrating R with Pig for Big Data Analytics - Cloud Computing with e-Science Applications


aggregated data Traffics (from Section 9.3.3) is already small enough in this case, we can group all the data together and send them to one R engine using RPig. The following shows the Pig statements:

Results = FOREACH (GROUP Traffics ALL) GENERATE RFuncs.ema_all($1, n);

ema_all() is a user-defined R function that processes the grouped data, as follows:

ema_all.outputSchema <- toString(lapply(seq(1, 11), function(x)
    {paste("map[tuple(double)]", sep = "")}))
ema_all <- function(x, n) {
    xDf <- as.data.frame(do.call(rbind, x[[1]]))  # convert to a data frame
    ...  # sort the data and initialize variables
    library('TTR')
    for (i in 1:length(clients)) {
        t <- xDf[xDf[, c(3)] == clients[i], c(4)]
        results <- append(results, list(list(as.character(clients[i]),
                                             EMA(t, n))))
    }
    return(results)
}
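To see what the output schema line in the listing evaluates to, it can be run on its own in plain R. This is only a minimal sketch: the variable names schema and entries are introduced here for illustration, and the meaning of the count 11 is not stated in the text, so it is simply reproduced.

```r
# Build the schema string the same way as the listing (11 repeated entries).
schema <- toString(lapply(seq(1, 11), function(x) {
  paste("map[tuple(double)]", sep = "")
}))

# toString() joins the entries with ", "; split them apart again to inspect.
entries <- strsplit(schema, ", ", fixed = TRUE)[[1]]

print(length(entries))   # number of schema entries
print(unique(entries))   # the distinct entry text
```

The result is a single comma-separated string in which the same map[tuple(double)] declaration is repeated once per element of the sequence.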

In this case, the data passed to R is a nested list (x), which contains the aggregated traffic data for all service clients in different time windows, that is, a list of tuples of the form ((…), (…), …). The first line of the R script converts the nested list to a data frame called xDf, so the input data can be easily sorted and selected as a data table. For each service client, a sorted numeric list containing the traffic data of previous time windows is selected and used as input for the EMA() function of the TTR package. The results for all service clients, collected in the nested list results, are subsequently converted to a Pig map data structure as specified by the output schema. The name of the service client is the key of the map, and the forecasted result is the value. Here, Pig statements serve as the query language for accessing the data in HDFS, and the converted data are then sent to R for the analytic tasks. Afterward, the analytic result is printed on screen or stored in HDFS through Pig statements. Hence, RPig also gives R programmers a way to read and write data and files in HDFS.
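The conversion and per-client selection described above can be tried outside RPig with a small hand-made nested list. The following is a sketch under stated assumptions, not the authors' implementation: the tuple layout (client name in column 3, traffic in column 4, rows already in time order) is inferred from the indices used in the listing, the sample values are invented, and ema_simple is a hypothetical base-R stand-in for TTR's EMA() so that the example runs without extra packages. Its seeding convention (simple mean of the first n values) may differ from the one TTR uses.

```r
# Hypothetical input: one bag (x[[1]]) of tuples, as a nested list.
# Assumed layout: (index, window, client, traffic).
x <- list(list(
  list(1, "w1", "clientA", 10),
  list(2, "w2", "clientA", 20),
  list(3, "w3", "clientA", 30),
  list(1, "w1", "clientB", 5),
  list(2, "w2", "clientB", 7),
  list(3, "w3", "clientB", 9)
))

# Same conversion as in the listing: bind the tuples row-wise.
xDf <- as.data.frame(do.call(rbind, x[[1]]))

# The columns come back as lists, so flatten the two that are needed.
clients <- unlist(xDf[[3]])
traffic <- as.numeric(unlist(xDf[[4]]))

# Hypothetical stand-in for an exponential moving average with period n.
ema_simple <- function(v, n) {
  out <- rep(NA_real_, length(v))
  if (length(v) < n) return(out)
  alpha <- 2 / (n + 1)            # conventional smoothing ratio
  out[n] <- mean(v[1:n])          # seed with the mean of the first n values
  if (length(v) > n) {
    for (i in (n + 1):length(v)) {
      out[i] <- alpha * v[i] + (1 - alpha) * out[i - 1]
    }
  }
  out
}

# One entry per client, keyed by client name (analogous to the Pig map).
results <- list()
for (cl in unique(clients)) {
  results[[cl]] <- ema_simple(traffic[clients == cl], n = 2)
}
print(results)
```

A named list is used here because it mirrors the key/value shape of the Pig map described in the text, with the client name as the key and the smoothed series as the value.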

To summarize this use case of RPig, the Pig operations serve as preprocessing steps that extract and summarize only the information needed for R processing. When the summarized data are small enough to be handled by R on a single node, any statistical algorithm implemented in R can be applied directly to the summarized data, just as in the traditional single-machine approach of R.
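As an illustration of this last point, once the summarized data fit in memory they can be treated like any ordinary R data frame. The sketch below uses invented values and hypothetical column names (window, client, traffic), and applies two standard base-R tools, tapply() and lm(), purely as examples of single-machine analysis.

```r
# Hypothetical summarized data, small enough for one R process.
traffics <- data.frame(
  window  = rep(1:4, times = 2),
  client  = rep(c("clientA", "clientB"), each = 4),
  traffic = c(10, 20, 30, 40, 8, 6, 4, 2)
)

# Average traffic per client.
mean_by_client <- tapply(traffics$traffic, traffics$client, mean)

# Linear trend of traffic over time windows for one client.
fit <- lm(traffic ~ window, data = subset(traffics, client == "clientA"))
slope <- unname(coef(fit)["window"])

print(mean_by_client)
print(slope)
```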

