FIGURE 9.4
(a) Lines of code comparison: Pig code and R code for RPig vs. RHadoop. (b) Scalability on SVM training: RPig vs. RHadoop (rmr2), by size of cluster.
9.6 Related Work
9.6.1 Related to R
With the emergence of big data analytics, many researchers are addressing the scalability issues of R. Existing approaches can be classified into three categories:
1. Scaling memory size: All data used in R calculations, such as lists and data frames, must be loaded into memory; however, a single computer has only a limited amount of memory, which prevents large data sets from being loaded into R. RevoScaleR [16] and bigmemory [17] are R packages that allow R to use the hard disk as external memory for calculations, as sketched below. This approach lets R handle data sets much larger than its memory size, since the capacity of a hard disk is generally far greater than the memory size of a computer.
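A minimal sketch of this disk-backed approach, assuming the CRAN bigmemory package; the input file name, backing file names, and data layout below are hypothetical:

# Sketch only: back a large matrix with a memory-mapped file on disk,
# so the data need not fit in R's in-memory heap.
library(bigmemory)

# Import a large CSV into a file-backed big.matrix; the data live in
# "big_data.bin" on disk rather than in RAM.
x <- read.big.matrix("big_data.csv", header = TRUE, type = "double",
                     backingfile = "big_data.bin",
                     descriptorfile = "big_data.desc")

dim(x)        # dimensions are available without loading the data into memory
mean(x[, 1])  # individual columns can still be pulled into RAM for computation

# A later R session can reattach the same backing file via its descriptor,
# without re-reading the CSV.
y <- attach.big.matrix("big_data.desc")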
2. Scaling storage size: Terabyte-scale big data are generally stored in distributed file systems, such as Hadoop clusters. To enable R to read and write data directly in these large-scale data warehouses, interfaces between these warehouses and R have been developed, such as Ricardo [18], which offers a bridge between R and Hadoop HDFS. Compared with R bridging work on traditional RDBMSs, such as RJDBC [19] and RMySQL [20], Ricardo replaces SQL with a query language (Jaql) that can be executed in the MapReduce model. These approaches allow R to access data directly from databases or file systems, but R script execution still takes place on a single computer. For parallel data analysis, this requires reimplementing most of a statistical algorithm in the