algorithms and therefore can only be used for simple data analysis. For
advanced or deep analysis, Mahout [2] contains a limited number of machine
learning algorithms implemented in the MapReduce model. Because of the
large number and complexity of machine learning and statistical algorithms,
the redesign and redevelopment of these algorithms in the MapReduce
model are difficult tasks. Many algorithms are still missing from Mahout
in comparison with mature statistical and machine learning frameworks.
For example, support vector machines (SVMs), a commonly used class of
algorithms, are still under development in Mahout. On the other hand, traditional
statistical software, such as R, has a rich and extensive set of machine learn-
ing and statistical processing functionalities for advanced analysis, but it is
not distributed and not scalable on its own. In general, it only runs within a
single computer and requires all data to be loaded into memory for process-
ing. Some solutions have been proposed to scale out this traditional statisti-
cal software, such as RHadoop [3], but limitations still exist. For example,
some require the user to write map and reduce functions over key-value
pairs, which makes them harder to use and lengthens development time.
Related work is described in more detail in Section 9.6. Our approach
addresses the problem by integrating traditional, mature statistical software
(R) with a scalable DMS (Pig) to scale out deep analytics.
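To make the usability point concrete, the following is a minimal sketch in plain Python, not RHadoop or Mahout code. It computes a per-key mean twice: once with hand-written map and reduce functions over key-value pairs, and once with a single library call per group, as one would in a statistical environment. The record layout and the names `map_fn`, `reduce_fn`, and `run_mapreduce` are assumptions made for illustration, and the shuffle phase is simulated in a single process.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical input: (key, value) records, e.g. (sensor id, reading).
records = [("a", 1.0), ("b", 4.0), ("a", 3.0), ("b", 6.0)]


def map_fn(record):
    """Emit (key, (sum, count)) so that partial results can be combined."""
    key, value = record
    yield key, (value, 1)


def reduce_fn(key, values):
    """Combine the partial (sum, count) pairs for one key into a mean."""
    total = sum(v for v, _ in values)
    count = sum(c for _, c in values)
    return key, total / count


def run_mapreduce(records):
    """Single-process stand-in for the shuffle phase of a real framework."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return dict(reduce_fn(k, vs) for k, vs in groups.items())


mapreduce_means = run_mapreduce(records)

# The same statistic using an existing library function on each group.
by_key = defaultdict(list)
for key, value in records:
    by_key[key].append(value)
direct_means = {k: mean(vs) for k, vs in by_key.items()}
```

Even for a statistic as simple as the mean, the map and reduce version has to be restructured around combinable partial results. For algorithms such as SVMs, that restructuring is much harder.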
In this chapter, we present RPig, a framework that integrates R and Pig
to provide scalable machine learning and advanced statistical functionality.
It lets developers write analytic jobs concisely in high-level languages.
RPig takes advantage of both the deep statistical
analysis capability of R and the parallel data-processing capability of Pig. Both
data storage and processing for deep data analysis are distributed and
scalable. The framework has the following main advantages:
• The statistical and machine learning functions of R can be easily
wrapped and used directly within Pig statements. This allows
advanced parallel analytic jobs to be developed in two high-level
languages, R and Pig Latin, without learning new languages or
application programming interfaces (APIs) and without rewriting
complex statistical algorithms. This can significantly reduce the
development effort for the user.
• The framework parallelizes both R and Pig execution automatically
at the execution stage. The necessary low-level operations, such as
data conversion and fault handling, are handled by the framework
itself, so advanced data analysis runs in parallel automatically.
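The general pattern behind these two advantages can be sketched conceptually in Python. This is not RPig's actual API or implementation. It only shows an existing statistical function being applied to each group of records by a small driver that handles grouping, data conversion, and parallel dispatch on the caller's behalf. The names `summarize`, `to_numeric`, and `apply_per_group` are hypothetical, and a thread pool stands in for a cluster purely to show the structure.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from statistics import mean, stdev


def to_numeric(rows):
    """Data conversion step: text fields read from storage become floats."""
    return [float(r) for r in rows]


def summarize(values):
    """Stand-in for a wrapped statistical function; it knows nothing
    about grouping or parallelism."""
    return {"mean": mean(values), "stdev": stdev(values)}


def apply_per_group(records, func, max_workers=4):
    """Group records by key, then apply func to each group concurrently."""
    groups = defaultdict(list)
    for key, raw in records:
        groups[key].append(raw)

    def task(item):
        key, rows = item
        return key, func(to_numeric(rows))

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(task, groups.items()))


# Hypothetical records: (group key, raw text value).
records = [("s1", "1.0"), ("s1", "3.0"), ("s2", "10"), ("s2", "14")]
result = apply_per_group(records, summarize)
```

The analyst writes only `summarize`; everything else belongs to the driver. That separation is what the two advantages above describe.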
In the rest of the chapter, Section 9.2 describes two scenarios we
encountered that neither R nor Pig can handle independently. Section 9.3