Here, sv holds the samples extracted as support vectors from the input data frame
xDf. However, an R data frame is a column-oriented structure: all of the
values of one column are grouped together, followed by the values of the next
column, and so on. Data tables stored in Hadoop and Pig, by contrast, follow
the commonly used CSV (comma-separated values) format, which
is primarily row oriented. If we want to use the Pig SPLIT operator to split
the collected support vectors and repeat the sv extraction process, we
need to convert the collected data to a row-oriented representation. Hence,
we use the apply() function to convert each row into a list. Finally, all lists are
put as tuples into a bag and sent back to Pig. We flatten the bag in Pig and convert
the data back to the "table" format to continue processing.
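The column-to-row conversion described above can be sketched as follows; the data frame contents here are purely illustrative, not the original VoIP data:

```r
# Hypothetical example data frame (column-oriented, as R stores it).
xDf <- data.frame(x1 = c(0.2, 0.5), x2 = c(1.3, 0.7))

# apply() with MARGIN = 1 visits one row at a time, so wrapping each
# row as a list yields the row-oriented elements that become tuples
# in the bag returned to Pig.
rows <- apply(xDf, 1, as.list)
```

Note that apply() coerces the data frame to a matrix first, so mixed column types are converted to a common type; with purely numeric features, as here, nothing is lost.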
In the last step of the training phase, we group the finalized data sets or
support vectors SV and send them to one R engine to obtain a final SVM
model and then store the model for the prediction phase. The MapReduce
model is still applied, but only one map and one reduce function will be
created at this stage.
Model = FOREACH (GROUP SV ALL) GENERATE R.svm_m(*);
svm_m.outputSchema <- "model:bytearray";
svm_m <- function(x) {
    ...
    # get the svmModel
    return (serialize(svmModel, NULL)) }
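The elided body might, for example, fit the final model with the svm() function from the e1071 package; the package choice, the "label" column name, and the formula below are assumptions for illustration, not the original code:

```r
# Hypothetical sketch of the elided UDF body, assuming the e1071
# package and an input whose last column is the class label.
library(e1071)
svm_m <- function(x) {
  df <- as.data.frame(x)
  names(df)[ncol(df)] <- "label"
  df$label <- as.factor(df$label)
  # train the final SVM on the grouped support vectors
  svmModel <- svm(label ~ ., data = df)
  # serialize to a raw vector so Pig can store it as a bytearray
  return(serialize(svmModel, NULL))
}
```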
The R UDF svm_m is almost the same as the previous svm_sv, but this time it
returns an SVM model, svmModel. The serialized model is
saved as a bytearray (the raw bytes of the original R object) in HDFS, so we can
use the model directly in R for prediction later. In the SVM prediction phase, the
SVM model can be loaded into R from Pig and executed in parallel over a huge
number of VoIP call sessions.
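The prediction side is then a matter of reversing the serialization; a minimal sketch, in which the names modelBytes and newCalls are illustrative placeholders for the bytearray read from HDFS and the new session data:

```r
# Hypothetical sketch: restore the stored model and classify new data.
svmModel <- unserialize(modelBytes)       # raw vector loaded via Pig/HDFS
pred     <- predict(svmModel, newCalls)   # classify new VoIP call sessions
```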
RHadoop [3] is a popular open-source project from Revolution Analytics
that allows users to manage and analyze data with Hadoop in R.
rmr2 is an R package from RHadoop; it lets users write
MapReduce functions in R. We implemented the parallel SVM design with
rmr2 for the comparative study in this case. The following shows the
MapReduce implementation that obtains the support vectors SV. We use this
function as an example to show how the implementation differs from
the RPig version; the resulting SV has exactly the same value as in the RPig
implementation described above.
svDfs <- mapreduce(input = inputPath,
    map = function(dummy, input) {
        ...
        # extracting the support vectors sv
        keyval(1, list(sv)) },
    reduce = function(k, sv) {
        val <- do.call("rbind", sv); keyval(1, val) }
)
SV <- from.dfs(svDfs)$val
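For experimenting with such rmr2 jobs without a Hadoop cluster, the package's local backend can run the same map and reduce functions in-process; assuming the standard rmr2 option interface:

```r
library(rmr2)
# run map/reduce in a single R process for development and testing;
# switch back to backend = "hadoop" for cluster execution
rmr.options(backend = "local")
```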