FIGURE 9.4
(a) Lines of code comparison: Pig code and R code for RPig vs. RHadoop. (b) Scalability on SVM training: RPig vs. RHadoop (rmr2), by size of cluster.
9.6 Related Work
9.6.1 Related to R
With the emergence of big data analytics, many researchers are addressing the scalability issues of R. Existing approaches can be classified into three categories:
1. Scaling memory size: All data used in R calculations, such as lists and data frames, must be loaded into memory; however, a single computer has only a limited amount of memory, which prevents large data sets from being loaded into R. RevoScaleR [16] and bigmemory [17] are R packages that allow R to use the hard disk as external memory for calculations, as sketched below. This approach lets R handle data sets much larger than its memory size, since the capacity of a hard disk is generally far greater than the memory size of a computer.
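A minimal sketch of this disk-backed approach, assuming the CRAN bigmemory package; the input file name, backing file names, and data layout below are hypothetical:

# Sketch only: back a large matrix with a memory-mapped file on disk,
# so the data need not fit in R's in-memory heap.
library(bigmemory)

# Import a large CSV into a file-backed big.matrix; the data live in
# "big_data.bin" on disk rather than in RAM.
x <- read.big.matrix("big_data.csv", header = TRUE, type = "double",
                     backingfile = "big_data.bin",
                     descriptorfile = "big_data.desc")

dim(x)        # dimensions are available without loading the data into memory
mean(x[, 1])  # individual columns can still be pulled into RAM for computation

# A later R session can reattach the same backing file via its descriptor,
# without re-reading the CSV.
y <- attach.big.matrix("big_data.desc")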
2. Scaling storage size: Terabyte-scale big data are generally stored in distributed file systems, such as Hadoop clusters. To enable R to read and write data directly in these large-scale data warehouses, interfaces between these warehouses and R have been developed, such as Ricardo [18], which offers a bridge between R and Hadoop HDFS. Compared with R bridging work on traditional RDBMSs, such as RJDBC [19] and RMySQL [20], Ricardo replaces SQL with a query language (Jaql) that can be executed in the MapReduce model. These approaches allow R to access data directly from databases or file systems, but R script execution still takes place on a single computer. For parallel data analysis, this requires reimplementing most of a statistical algorithm in the