Summary
In many domains, such as telecommunications, various scenarios necessitate the processing of large amounts of data with statistical and machine learning algorithms for deep analytics. A noticeable effort has been made to move data management systems into MapReduce parallel processing environments such as Hadoop and Pig. Nevertheless, these systems lack the necessary statistical and machine learning algorithms and can therefore be used only for simple data analysis. Frameworks such as Mahout, built on top of Hadoop, support machine learning, but their implementations are at an early stage. For example, Mahout does not provide support vector machine (SVM) algorithms and is difficult to use. On the other hand, traditional statistical software tools such as R, which contain comprehensive algorithms for advanced statistical analysis, are widely used; however, such software runs only on a single computer and therefore does not scale to big data. In this chapter, we present RPig, an integrated framework combining R and Pig for scalable machine learning and advanced statistical functionality, which makes it feasible to develop analytic jobs easily and concisely in high-level languages. Using application scenarios from the telecommunications domain, we show how RPig is used and, with comparative evaluation results, demonstrate its advantages, such as reduced development effort compared with related work.
9.1 Introduction
With the explosive growth in the use of information and communication technology (ICT), applications that involve deep analytics need to shift to scalable solutions for big data. Our work is motivated by the big data analytic needs of network management systems, such as network traffic analysis, in the telecommunications (telecom) domain. More specifically, the work extends the Apache Pig/Hadoop frameworks, which are commonly used to build cost-effective big data systems in industry. The design, the software implementation, and the solution we describe here are general and applicable to other domains.
To build a scalable system, one approach is to use distributed parallel computing models, such as MapReduce [1], that allow adding more (computer) nodes to the system to scale horizontally. MapReduce has recently been applied to many data management systems (DMSs), such as Hadoop and Pig. These systems target the storage and querying of data for top-layer applications. However, they lack the statistical and machine learning algorithms needed for deep analytics.
 