Here, sv holds the samples extracted as support vectors from the input data frame
xDf. However, an R data frame is a column-oriented structure: all of the
values of one column are grouped together, followed by the values of the next
column, and so on. Data tables stored in Hadoop and Pig, by contrast, follow
the commonly used CSV (comma-separated values) format, which
is primarily row oriented. If we want to use the Pig SPLIT operator to split
the collected support vectors and repeat the sv extraction process, we
need to convert the collected data to a row-oriented representation. Hence,
we use the apply() function to convert each row into a list. Finally, all lists are
put as tuples into a bag and sent back to Pig. We flatten the bag in Pig and convert
the data back to the "table" format to continue processing.
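The column-to-row conversion described above can be sketched as follows; the data frame contents here are purely illustrative, not the original VoIP data:

```r
# Hypothetical example data frame (column-oriented, as R stores it).
xDf <- data.frame(x1 = c(0.2, 0.5), x2 = c(1.3, 0.7))

# apply() with MARGIN = 1 visits one row at a time, so wrapping each
# row as a list yields the row-oriented elements that become tuples
# in the bag returned to Pig.
rows <- apply(xDf, 1, as.list)
```

Note that apply() coerces the data frame to a matrix first, so mixed column types are converted to a common type; with purely numeric features, as here, nothing is lost.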
In the last step of the training phase, we group the finalized data sets or
support vectors SV and send them to one R engine to obtain a final SVM
model and then store the model for the prediction phase. The MapReduce
model is still applied, but only one map and one reduce function will be
created at this stage.
Model = FOREACH (GROUP SV ALL) GENERATE R.svm_m(*);
svm_m.outputSchema <- "model:bytearray";
svm_m <- function(x) {
    ...
    # get the svmModel
    return (serialize(svmModel, NULL)) }
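The elided body might, for example, fit the final model with the svm() function from the e1071 package; the package choice, the "label" column name, and the formula below are assumptions for illustration, not the original code:

```r
# Hypothetical sketch of the elided UDF body, assuming the e1071
# package and an input whose last column is the class label.
library(e1071)
svm_m <- function(x) {
  df <- as.data.frame(x)
  names(df)[ncol(df)] <- "label"
  df$label <- as.factor(df$label)
  # train the final SVM on the grouped support vectors
  svmModel <- svm(label ~ ., data = df)
  # serialize to a raw vector so Pig can store it as a bytearray
  return(serialize(svmModel, NULL))
}
```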
The R UDF svm_m is almost the same as the previous svm_sv, but this time it
returns an SVM model, svmModel. The serialized model is
saved as a bytearray (the raw bytes of the original R object) in HDFS, so we can
use the model directly in R for prediction later. In the SVM prediction phase, the
SVM model can be loaded into R from Pig and executed in parallel over a huge
number of VoIP call sessions.
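The prediction side is then a matter of reversing the serialization; a minimal sketch, in which the names modelBytes and newCalls are illustrative placeholders for the bytearray read from HDFS and the new session data:

```r
# Hypothetical sketch: restore the stored model and classify new data.
svmModel <- unserialize(modelBytes)       # raw vector loaded via Pig/HDFS
pred     <- predict(svmModel, newCalls)   # classify new VoIP call sessions
```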
RHadoop [3] is a popular open-source project from Revolution Analytics
that allows users to manage and analyze data with Hadoop in R.
rmr2 is an R package from RHadoop; it lets users write
MapReduce functions in R. We implemented the parallel SVM design with
rmr2 for the comparative study in this case. The following shows the
MapReduce implementation that obtains the support vectors SV. We use this
function as an example to show how the implementation differs from
the RPig version; the resulting SV has exactly the same value as in the RPig
implementation described above.
svDfs <- mapreduce(input = inputPath,
    map = function(dummy, input) {
        ...
        # extracting the support vectors sv
        keyval(1, list(sv)) },
    reduce = function(k, sv) {
        val <- do.call("rbind", sv); keyval(1, val) }
)
SV <- from.dfs(svDfs)$val
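For experimenting with such rmr2 jobs without a Hadoop cluster, the package's local backend can run the same map and reduce functions in-process; assuming the standard rmr2 option interface:

```r
library(rmr2)
# run map/reduce in a single R process for development and testing;
# switch back to backend = "hadoop" for cluster execution
rmr.options(backend = "local")
```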