RPig: Concise Programming Framework by Integrating R with Pig for Big Data Analytics - Cloud Computing with e-Science Applications


An increasing number of phone calls are made through VoIP clients such as Viber and Skype. One approach to monitoring VoIP service quality is to use network-level key performance indicators (N-KPIs) at the Internet protocol (IP) layer, such as packet loss or jitter, to predict the mean opinion score (MOS), a standard measure of speech quality [4]. An SVM-based regression algorithm is used in this case, but it is a complex algorithm that usually involves long computation times in the training phase, even on a relatively small amount of data. RPig enables us to define and execute the SVM algorithms in the MapReduce model for both the training and prediction phases without writing any key-value pair MapReduce functions. As a result, performance scales with cluster size, and development effort is reduced.
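To make the map/reduce shape of this use case concrete, the following is a minimal sketch in plain Python. It is not RPig's API and not the authors' algorithm. It assumes a simple linear epsilon-insensitive SVR trained by subgradient descent in place of a full kernel SVM, and it assumes that per-partition models are combined by averaging, which is only one possible strategy. All function names, the toy MOS relation, and its coefficients are hypothetical and invented for illustration.

```python
import random


def toy_mos(loss, jitter):
    # Hypothetical relation between normalised N-KPIs and MOS.
    # The coefficients are invented for illustration, not measured values.
    return 4.4 - 2.0 * loss - 1.0 * jitter


def make_toy_rows(n, seed=0):
    # Synthetic (features, target) rows with features in [0, 1).
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        loss, jitter = rng.random(), rng.random()
        rows.append(((loss, jitter), toy_mos(loss, jitter)))
    return rows


def predict(model, x):
    w, b = model
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def train_linear_svr(rows, epsilon=0.1, lam=1e-4, lr=0.02, epochs=300):
    # "Map" step: fit one linear SVR on a single partition by subgradient
    # descent on the epsilon-insensitive loss plus a small L2 penalty.
    w, b = [0.0] * len(rows[0][0]), 0.0
    for _ in range(epochs):
        for x, y in rows:
            err = predict((w, b), x) - y
            # Subgradient of max(0, |err| - epsilon) w.r.t. the prediction.
            g = 0.0 if abs(err) <= epsilon else (1.0 if err > 0 else -1.0)
            w = [wi - lr * (lam * wi + g * xi) for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b


def split(rows, n_parts):
    # Round-robin partitioning, standing in for distributed input splits.
    return [rows[i::n_parts] for i in range(n_parts)]


def average_models(models):
    # "Reduce" step (an assumption): average weights and biases.
    n = len(models)
    dim = len(models[0][0])
    w = [sum(m[0][j] for m in models) / n for j in range(dim)]
    b = sum(m[1] for m in models) / n
    return w, b


def run_job(rows, n_parts=4):
    # The map runs sequentially here; on a cluster each partition
    # would be trained by a separate task.
    models = list(map(train_linear_svr, split(rows, n_parts)))
    return average_models(models)


if __name__ == "__main__":
    model = run_job(make_toy_rows(200))
    print(predict(model, (0.5, 0.5)))
```

Because each partition is trained independently, the CPU-heavy training step is the part that would benefit from more cluster nodes; the reduce step is cheap by comparison.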

This use case deals with a complex machine learning algorithm that is CPU intensive rather than I/O intensive: R's in-memory computation accounts for most of the overall computation time, even though an analysis job handles only a small amount of data. RPig supports parallelism for the various requirements of different scenarios.

9.3 Background

Big data [5] are data in volumes so large and complex that they become difficult to process using on-hand database management tools or traditional data-processing applications. Since Google published its MapReduce technology in 2004 and Apache started the Hadoop project in 2005, MapReduce and Hadoop have become a generic and foundational approach for developing scalable, cost-effective, flexible, fault-tolerant big data systems [6]. Many frameworks, such as Pig and Hive, have been built on Hadoop to add features to it. As Hadoop systems are more widely adopted in industry, the requirements of real-world problems are driving the Hadoop ecosystem to become even richer. For example, Oozie and Azkaban provide workflow and scheduling management, while Impala and Shark aim at low-latency, real-time queries. Our work, RPig, is one of many frameworks, such as Mahout and DataFu [7], that target deep analytics. In the following sections, we briefly describe the frameworks on which RPig is based.

9.3.1 R and R Packages

R is a programming language and software environment widely used for statistical computing and deep data analysis, such as classification and regression. R is extensible through R packages; thousands of packages implement a large number of specialized machine learning and statistical algorithms.
