query language. In other words, we have to reimplement the SVM()
function of R in Jaql for our second use case to parallelize the process.
3. Scaling CPU power: Approaches for scaling out CPU power for R
can be divided into MapReduce- and non-MapReduce-based
implementations. MapReduce-based approaches generally run
on top of Hadoop. For example, both RHIPE [15] and RHadoop [3]
extend R to allow writing key-value pair map and reduce functions
within an R script. The resulting MapReduce jobs can then be submitted
to a Hadoop cluster for parallel execution. However, these frameworks
require users to manually design complex key-value pair-based map
and reduce functions, making them difficult to use and inefficient
for analysis job development. In our case with RPig, the key-value
pair-based map and reduce functions are automatically generated
by leveraging the Pig framework. The user only needs to define R
functions for a single task node; the execution of the R functions is
parallelized automatically based on Pig data flows. RHive [21] follows
the same concept as our work: it is an R extension that facilitates
distributed computing via HiveQL/SQL queries. However, it is restricted
to the Hive data warehouse. Given the inherent differences between the
Pig and SQL languages, RHive is an alternative to RPig but cannot
replace it.
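The contrast drawn above, between hand-written key-value map and reduce functions and a single per-group function that a data-flow framework parallelizes, can be sketched in a few lines. This is only a single-process illustration in Python, not the RHIPE, RHadoop, or RPig API; the names map_fn, reduce_fn, run_mapreduce, and group_and_apply are hypothetical and chosen for this example.

```python
from collections import defaultdict

# Style 1: the user designs the key-value map and reduce functions by hand.
def map_fn(record):
    """Emit (key, value) pairs for one input record."""
    group, value = record
    yield group, value

def reduce_fn(key, values):
    """Combine all values that share a key; here, their mean."""
    values = list(values)
    return key, sum(values) / len(values)

def run_mapreduce(records, map_fn, reduce_fn):
    """Toy driver (hypothetical): map, shuffle by key, then reduce."""
    shuffled = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            shuffled[key].append(value)
    return dict(reduce_fn(k, vs) for k, vs in sorted(shuffled.items()))

# Style 2: the user writes only the per-group function; the grouping
# (standing in for the data flow) is handled by the framework.
def mean(values):
    return sum(values) / len(values)

def group_and_apply(records, func):
    """Toy stand-in (hypothetical) for a data-flow GROUP followed by a function call."""
    groups = defaultdict(list)
    for key, value in records:
        groups[key].append(value)
    return {k: func(vs) for k, vs in sorted(groups.items())}

records = [("a", 1.0), ("b", 4.0), ("a", 3.0), ("b", 6.0)]
result_mr = run_mapreduce(records, map_fn, reduce_fn)
result_flow = group_and_apply(records, mean)
```

Both calls compute the same per-group means; the difference is how much of the key-value plumbing the user has to write.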
Many approaches utilize non-MapReduce-based parallel frameworks, such as
Open MPI [22]. Packages such as Rmpi [23] and snow [24] provide bridge
interfaces between R and MPI. CloudRmpi [25] supports management of an EC2
cluster and access to an R session on the master MPI node. Elastic-R [26]
allows users to send data to any R engine in an R engine pool. However, these
solutions do not support parallel data reads and writes as Hadoop does; hence,
they are not suitable for I/O-intensive scenarios. Furthermore, they are
difficult to use because the user must code send/receive message functions
for the master and slave nodes through the complex MPI API.
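To give a sense of the explicit send/receive coding burden mentioned above, the following is a minimal sketch in Python. It does not use MPI or the Rmpi API; standard-library queues and threads stand in for message channels and nodes, and the names worker and master are hypothetical.

```python
import queue
import threading

def worker(inbox, outbox):
    """Worker loop: receive a chunk, send back its partial sum, stop on None."""
    while True:
        chunk = inbox.get()        # explicit receive
        if chunk is None:          # sentinel acts as a shutdown message
            break
        outbox.put(sum(chunk))     # explicit send

def master(data, n_workers=2):
    """Split the data, send one chunk to each worker, and collect the partial sums."""
    inboxes = [queue.Queue() for _ in range(n_workers)]
    outbox = queue.Queue()
    threads = [threading.Thread(target=worker, args=(ib, outbox)) for ib in inboxes]
    for t in threads:
        t.start()
    chunks = [data[i::n_workers] for i in range(n_workers)]
    for ib, chunk in zip(inboxes, chunks):
        ib.put(chunk)              # send the work item
        ib.put(None)               # then tell the worker to stop
    partials = [outbox.get() for _ in range(n_workers)]
    for t in threads:
        t.join()
    return sum(partials)

total = master(list(range(1, 101)))
```

Even in this toy form, the user has to manage the partitioning, the messaging, and the shutdown protocol, which is the kind of work the MapReduce-based approaches hide.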
9.6.2 Other Related Solutions
Some approaches try to build new systems without using traditional statistical
frameworks, such as R. For example, Mahout [2] is a framework built on top
of Hadoop with MapReduce-based machine learning algorithms. However,
Mahout is only at an early stage; many commonly used algorithms, such as
SVM, are not available yet. Moreover, it does not provide a high-level language
such as R or Pig; instead, complex Java APIs are provided. As a result,
developing analytic jobs in Mahout is complex and difficult. SystemML [27]
proposes a new declarative machine learning (DML) language for machine
learning on MapReduce. However, DML is not as flexible as the R language;
unlike R, it does not support object-oriented features, advanced data types
(such as lists and arrays), and so on. More important, SystemML