The map function extracts the support vectors of each data subset and outputs
a key-value pair whose value part is the support vectors sv. All outputs
of the map functions share the same key, the integer 1, so the extracted sv
of the different data subsets are collected and aggregated together by the
reduce function. The final result SV can then be retrieved from the value part
of the key-value pair output by the reduce function.
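To make this key-value design concrete, the following is a minimal sketch of the two steps as plain R functions. It assumes the e1071 package's svm() and a training frame with a label column; the function names are illustrative, not the chapter's actual RPig code.

library(e1071)

# Map: train an SVM on one data subset and emit the pair (1, sv).
# The constant key 1 routes every subset's output to the same reduce call.
map_sv <- function(subset) {
  model <- svm(label ~ ., data = subset, kernel = "radial")
  sv <- subset[model$index, ]   # rows of the subset that became support vectors
  list(key = 1L, value = sv)
}

# Reduce: all pairs share key 1, so the per-subset sv values are
# concatenated into the aggregated support vector set SV.
reduce_sv <- function(key, values) {
  list(key = key, value = do.call(rbind, values))
}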
To summarize this use case with RPig, parallel or iterative statistical algo-
rithms over distributed data sources are expressed as parallel R executions in a
Pig data flow. The input data are treated as a number of distributed data sources,
with no centralized information exchanged during the parallel R execution on
each source; the aggregated results of these distributed R executions then serve
as stepping stones toward the final result, which is produced by a final, central-
ized R execution. Pig operations are used to distribute the data and tasks for
parallel processing, with multiple R engines acting as Map tasks. This approach
allows the parallel R executions to reduce the overall processing time. However,
the iterative and incremental statistical algorithms may introduce statistical
errors as a trade-off; these are acceptable in most cases [13, 14].
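The whole pattern can be sketched locally in R as follows; train_data, the 16-way split, and parallel::mclapply are illustrative stand-ins for the data and for the distribution that Pig performs in RPig, so this is an analogue of the flow, not the framework's code.

library(e1071)
library(parallel)

# Stand-in for Pig's data distribution: split the training set into 16 pieces.
pieces <- split(train_data, rep_len(1:16, nrow(train_data)))

# Parallel R executions, one per data subset (the Map tasks of the Pig flow).
sv_list <- mclapply(pieces, function(subset) {
  m <- svm(label ~ ., data = subset, kernel = "radial")
  subset[m$index, ]             # keep only this subset's support vectors
}, mc.cores = 4)

# Aggregation step: union of all per-subset support vectors.
SV <- do.call(rbind, sv_list)

# Final centralized R execution: train the final SVM model on SV alone.
final_model <- svm(label ~ ., data = SV, kernel = "radial")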
9.5.3.2 Results and Discussion
Both RPig and RHadoop allow a parallel SVM implementation in the
MapReduce model. RPig simply uses the FOREACH statement to parallelize
the tasks as the Map functions. RHadoop allows the user to code the entire
analytic job in R, but the user has to design the key-value pairs underlying
the Map and Reduce functions. This adds complexity in code design and
development compared to RPig, especially when multiple MapReduce
functions are necessary for complex analytic jobs. In the example described
for obtaining SV, we wrote 16 lines of Pig and R code using RPig, but needed
21 lines of R code for RHadoop because of the key-value pair functions
(Figure 9.4a). This again demonstrates the conciseness of programming
with RPig.
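For comparison, the explicit key-value wiring that rmr2 requires looks roughly like the sketch below. mapreduce(), keyval(), to.dfs(), and values() are rmr2's actual API, but the data handling is simplified and reuses the hypothetical pieces list from the earlier sketch.

library(rmr2)
library(e1071)

# The user must design the key-value pairs of both phases explicitly.
out <- mapreduce(
  input = to.dfs(keyval(seq_along(pieces), pieces)),
  map = function(k, subsets) {
    # rmr2 may hand the map several subsets at once, so loop over them
    svs <- lapply(subsets, function(s) {
      m <- svm(label ~ ., data = s, kernel = "radial")
      s[m$index, ]
    })
    keyval(rep(1L, length(svs)), svs)     # constant key groups all outputs
  },
  reduce = function(k, sv_parts) {
    keyval(k, list(do.call(rbind, sv_parts)))
  }
)
SV <- values(from.dfs(out))[[1]]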
A relatively small data set is used in this case because the training is very
CPU-intensive, as described previously. We split the data, containing 12K
training samples, into 16 pieces to obtain the support vectors, and then
obtained the final SVM model at the end of the training. The performance of
the SVM training phase with respect to the cluster size is shown in
Figure 9.4b. RPig has almost identical performance to the RHadoop
(rmr2 v2.2.2) implementation. Processing time falls as the cluster size
increases, but the decrease is not exactly linear, since communication costs
are higher in a larger cluster.
In summary, RPig is able to scale out machine learning functionalities
for deep analytics. We demonstrated this through an SVM use case. RPig is
less complex to use and requires less development effort for writing parallel
machine learning algorithms compared to RHadoop or others (e.g., RHIPE
[15]), which require designing and writing key-value-paired Map and Reduce
functions manually.