RPig: Concise Programming Framework by Integrating R with Pig for Big Data Analytics - Cloud Computing with e-Science Applications

algorithms and therefore can only be used for simple data analysis. For
advanced or deep analysis, Mahout [2] contains a limited number of machine
learning algorithms implemented in the MapReduce model. Because machine
learning and statistical algorithms are numerous and complex, redesigning
and redeveloping them in the MapReduce model is difficult. Various
algorithms are still missing in Mahout in comparison with mature statistical
and machine learning frameworks. For example, the support vector machine
(SVM), a commonly used algorithm, is still under development in Mahout. On
the other hand, traditional statistical software, such as R, has a rich and
extensive set of machine learning and statistical processing functions for
advanced analysis, but it is not distributed and not scalable on its own. In
general, it runs only within a single computer and requires all data to be
loaded into memory for processing. Some solutions have been proposed to scale
out this traditional statistical software, such as RHadoop [3], but
limitations still exist. For example, some require writing map and reduce
functions over key-value pairs, which makes them harder to use and lengthens
development time. More details of related work are described in Section 9.6.
Our approach addresses the problem by integrating traditional, mature
statistical software (R) with a scalable DMS (Pig) to scale out deep
analytics.

In this chapter, we present RPig, a framework that integrates R and Pig to
provide scalable machine learning and advanced statistical functionality. It
makes it feasible to develop analytic jobs easily and concisely in high-level
languages. RPig takes advantage of both the deep statistical analysis
capability of R and the parallel data-processing capability of Pig. Both data
storage and processing for deep data analysis are distributed and scalable.
The framework has the following main advantages:

• The statistical and machine learning functions of R can be easily
wrapped and used directly in Pig statements. This allows advanced
parallel analytic jobs to be developed with two high-level languages,
R and Pig (Latin), without the need to learn new languages or
application programming interfaces (APIs) or to rewrite complex
statistical algorithms. The development effort for the user can be
significantly reduced.

• The framework parallelizes both R and Pig executions automatically
at the execution stage. The necessary low-level operations, such as
data conversion and fault handling, are handled by the framework
itself. The framework thus offers automatic parallel execution for
advanced data analysis.
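The general pattern behind these two advantages can be sketched in a few lines. The sketch below is not RPig's API and uses neither R nor Pig: Python stands in for both, and every name in it (`group_by_key`, `summarize`, `parallel_apply`) is hypothetical. It only illustrates the idea of taking an existing single-machine statistical routine, leaving it unchanged, and letting a surrounding layer group the data and apply the routine to each group concurrently.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from statistics import mean, stdev


def group_by_key(records):
    """Group (key, value) records by key, similar in spirit to a GROUP BY step."""
    groups = defaultdict(list)
    for key, value in records:
        groups[key].append(value)
    return dict(groups)


def summarize(values):
    """Hypothetical stand-in for an existing single-machine statistical routine."""
    return {"n": len(values), "mean": mean(values), "stdev": stdev(values)}


def parallel_apply(records, func, max_workers=4):
    """Apply func to each group concurrently and collect the results by key.

    A thread pool keeps the sketch simple; a real framework would
    distribute the groups across the machines of a cluster instead.
    """
    groups = group_by_key(records)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {key: pool.submit(func, values) for key, values in groups.items()}
        return {key: future.result() for key, future in futures.items()}


# Made-up sample data: (group key, measurement) pairs.
records = [("a", 1.0), ("a", 3.0), ("b", 10.0), ("b", 14.0), ("b", 12.0)]
results = parallel_apply(records, summarize)
```

Note that `summarize` contains no parallel logic at all; grouping, scheduling, and result collection live entirely in `parallel_apply`. That separation is the property the list above attributes to the framework.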

In the rest of the chapter, Section 9.2 describes two scenarios we encounter
that neither R nor Pig can handle on its own. Section 9.3
