RPig: Concise Programming Framework by Integrating R with Pig for Big Data Analytics - Cloud Computing with e-Science Applications

9.5.2.2 Result and Discussion

When an R function is involved in a Pig data flow, the necessary data must first be converted and loaded into R; we consider this a performance overhead. Minimizing this overhead was one of our main tasks after developing the initial version of RPig, so the initial version was used as the baseline for a comparison study in this use case. The code for this use case based on the initial version can be found in Reference 9; it is similar to the implementation using the current version.

Figure 9.3a shows the results with the number of nodes fixed at 20. The data size is the size of the initial raw data loaded in Pig. With both versions of the implementation (both using stand-alone R engines), performance decreased as the data size increased, as expected. In this scenario, performance depends mainly on Pig/Hadoop, which has to handle a large amount of raw data, while R plays only a small part in the overall process. The current version has better overall performance, and the improvement becomes larger when more data are involved.

Figure 9.3b shows the improvement in detail for sending data from Pig to R on a single node. In this case, summarized data with more than half a million tuples and four data fields in each tuple take 20 seconds to send in the initial version but only 10 seconds in the current version, a 50% reduction in overhead. This is achieved by sending data directly to R through a socket connection and by many code optimizations in the current version. The initial version of RPig, by contrast, streams the data to disk as an R source file and then makes R load that file. Still, the overhead grows as more data need to be exchanged between R and Pig. This overhead can be considered a trade-off between user development effort and processing efficiency. The R functions in this use case contain only 10 lines of R code, but we or the user had to write around 100 lines of Java code for the EMA Pig UDF without using

[Figure 9.3: two panels, each comparing RPig (initial) with RPig (current). Panel (a) x-axis: Data Size (GB). Panel (b) x-axis: Tuples (K).]

FIGURE 9.3

(a) Overall performance comparison. (b) Overhead comparison on data exchange (Pig to R).
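
The two data-exchange strategies compared in panel (b) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not RPig's actual code: Python stands in for both the Pig side and the R side, tab-separated text stands in for the R source file written by the initial version, and the function names `exchange_via_file` and `exchange_via_socket` are hypothetical.

```python
import os
import socket
import tempfile
import threading


def exchange_via_file(tuples):
    """Initial-version style (assumed shape): write to disk, then load the file."""
    fd, path = tempfile.mkstemp(suffix=".tsv")
    try:
        # Producer side: stream every tuple to a file on disk.
        with os.fdopen(fd, "w") as out:
            for row in tuples:
                out.write("\t".join(str(v) for v in row) + "\n")
        # Consumer side (R in the source, plain Python here): load the file.
        with open(path) as src:
            return [tuple(float(v) for v in line.split("\t")) for line in src]
    finally:
        os.remove(path)


def exchange_via_socket(tuples):
    """Current-version style (assumed shape): send tuples directly over a socket."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
    server.listen(1)
    received = []

    def consumer():
        # Consumer side: parse tuples as they arrive, with no intermediate file.
        conn, _ = server.accept()
        with conn, conn.makefile("r") as stream:
            for line in stream:
                received.append(tuple(float(v) for v in line.split("\t")))

    worker = threading.Thread(target=consumer)
    worker.start()
    # Producer side: write each tuple to the open connection.
    with socket.create_connection(server.getsockname()) as client:
        for row in tuples:
            client.sendall(("\t".join(str(v) for v in row) + "\n").encode())
    worker.join()
    server.close()
    return received
```

Both functions return the same list of tuples for the same input; they differ only in whether the data pass through the file system. Timing them on an identical input (for example with `time.perf_counter`) gives a rough feel for the difference, although the absolute numbers will not correspond to those reported for RPig.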
