RPig: Concise Programming Framework by Integrating R with Pig for Big Data Analytics - Cloud Computing with e-Science Applications


query language. In other words, we have to reimplement the SVM()

function of R in Jaql for our second use case to parallelize the process.

3. Scaling CPU power: Approaches for scaling out CPU power for R

can be divided into MapReduce-based and non-MapReduce-based

implementations. MapReduce-based approaches generally run

on top of Hadoop. For example, both RHIPE [15] and RHadoop [3]

extend R to allow writing key-value pair map and reduce functions

within an R script. The MapReduce jobs of R can be submitted to a

Hadoop cluster for parallel executions. However, these frameworks

require users to manually design complex key-value pair-based map

and reduce functions, making them difficult to use and inefficient

for analysis job development. In our case with RPig, the key-value

pair-based map and reduce functions are automatically generated

by leveraging the Pig framework. The user only needs to define R

functions for a single task node; the execution of the R functions is

parallelized automatically based on Pig data flows. RHive [21] follows

the same concept as our work. It is an R extension that facilitates

distributed computing via HiveQL/SQL queries. However, it is restricted

to the Hive data warehouse. Moreover, given the inherent differences

between Pig and SQL, RHive is an alternative to

RPig, but it cannot replace it.
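To make the contrast concrete, the following is a minimal sketch in plain Python of the key-value plumbing a user writes by hand in a MapReduce-style framework, next to the single-node function that remains once that plumbing is generated by the framework. It is an illustration only: it does not use the RHIPE, RHadoop, or RPig APIs, and the names `map_fn`, `reduce_fn`, `run_mapreduce`, and `node_mean` are hypothetical.

```python
from collections import defaultdict


def map_fn(record):
    """Hand-written mapper: emit a (key, value) pair per input record."""
    group, value = record
    yield group, value


def reduce_fn(key, values):
    """Hand-written reducer: collapse all values sharing a key into a mean."""
    values = list(values)
    return key, sum(values) / len(values)


def run_mapreduce(records, mapper, reducer):
    """Single-process stand-in for the map, shuffle, and reduce phases."""
    shuffled = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            shuffled[key].append(value)  # shuffle: group values by key
    return dict(reducer(key, values) for key, values in sorted(shuffled.items()))


def node_mean(values):
    """The only function a user would define if the plumbing were generated."""
    return sum(values) / len(values)


# Hypothetical input: (group, measurement) pairs.
records = [("a", 1.0), ("b", 4.0), ("a", 3.0), ("b", 6.0)]
result = run_mapreduce(records, map_fn, reduce_fn)
```

In the hand-written variant the user owns `map_fn`, `reduce_fn`, and the choice of keys; in the generated variant only `node_mean` is user code, and the grouping would come from the data flow description.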

Many approaches use non-MapReduce-based parallel frameworks,

such as Open MPI [22]; packages such as Rmpi [23] and snow [24]

provide bridge interfaces between R and MPI. CloudRmpi [25] supports

management of an EC2 cluster and access to an R session on the master MPI node.

Elastic-R [26] allows users to send data to any R engine in an R engine pool.

However, these solutions do not support parallel data reads and writes as Hadoop does;

hence, they are not suitable for I/O-intensive scenarios. Furthermore, these

solutions are difficult to use because the user must code send/receive message

functions for master and slave nodes through the complex MPI API.
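The explicit send/receive burden described above can be sketched as follows. This is a rough analogy in plain Python, assuming threads and queues as stand-ins for MPI processes and messages; it is not the Rmpi, snow, or MPI API, and the names `worker` and `master` are hypothetical.

```python
import queue
import threading


def worker(inbox, outbox):
    """Worker loop: receive a chunk, send back its partial sum, stop on None."""
    while True:
        message = inbox.get()               # explicit receive
        if message is None:                 # sentinel from the master: shut down
            break
        chunk_id, chunk = message
        outbox.put((chunk_id, sum(chunk)))  # explicit send


def master(chunks, n_workers=2):
    """Master: send every chunk out, then receive one result per chunk."""
    inbox, outbox = queue.Queue(), queue.Queue()
    threads = [threading.Thread(target=worker, args=(inbox, outbox))
               for _ in range(n_workers)]
    for t in threads:
        t.start()
    for chunk_id, chunk in enumerate(chunks):
        inbox.put((chunk_id, chunk))
    # Results may arrive in any order, so they are keyed by chunk id.
    partial = dict(outbox.get() for _ in chunks)
    for _ in threads:
        inbox.put(None)                     # one shutdown message per worker
    for t in threads:
        t.join()
    return sum(partial[i] for i in range(len(chunks)))


total = master([[1, 2, 3], [4, 5], [6]])
```

Even for a simple sum, the user has to write the dispatch loop, the result collection, and the shutdown protocol, which is the bookkeeping a data-flow framework would otherwise generate.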

9.6.2 Other Related Solutions

Some approaches try to build new systems without using traditional statistical

frameworks, such as R. For example, Mahout [2] is a framework built on top

of Hadoop with MapReduce-based machine learning algorithms. However,

Mahout is still at an early stage; many commonly used algorithms, such as

SVM, are not available yet. In addition, it does not provide a high-level language,

such as R or Pig; instead, it exposes complex Java APIs. As a result,

developing analytic jobs in Mahout is complex and difficult. SystemML [27]

proposes a new declarative machine learning (DML) language for machine

learning on MapReduce. However, DML is not as flexible as the R language;

unlike R, it does not support object-oriented features or advanced data types (such as

lists and arrays), among other things. More important, SystemML
