9.5.2.2 Result and Discussion
When an R function is involved in a Pig data flow, the necessary data must
first be converted and loaded into R, and we regard this step as performance
overhead. Minimizing this overhead was one of our main tasks after the
initial version of RPig was developed. For that reason, the initial version of
RPig serves as the baseline for the comparison in this use case. The code
for this use case based on the initial version can be found in Reference 9;
it is similar to the implementation using the current version.
Figure 9.3a shows the results with the number of nodes fixed at 20. The data
size is the initial raw data size loaded in Pig. With both versions of the
implementation (both with stand-alone R engines), performance decreased with
increasing data size, as expected. In this scenario, performance depends
mainly on Pig/Hadoop, which has to handle a large amount of raw data,
whereas R plays only a small part in the overall process. The current
version has better overall performance, and the improvement grows as more
data are involved. Figure 9.3b shows the improvement in detail for sending
data from Pig to R on a single node. In this case, summarized data with more
than half a million tuples and four fields per tuple takes 20 seconds in the
initial version but only 10 seconds in the current version, a 50% reduction
in overhead. This is achieved by sending data directly to R through a socket
connection, together with many code optimizations in the current version.
The initial version of RPig streams the data to disk as an R source file and
then makes R load that file. Still, the overhead grows as more data need to
be exchanged between R and Pig. This overhead can be considered a trade-off
between user development effort and processing efficiency. The R functions
in this use case contain only 10 lines of R code, but we or the user would
have had to write around 100 lines of Java code for the EMA Pig UDF without using
[Figure 9.3: two panels, each plotting RPig (initial) and RPig (current). Panel (a) x-axis: Data Size (GB). Panel (b) x-axis: Tuples (K).]
FIGURE 9.3
(a) Overall performance comparison. (b) Overhead comparison on data exchange (Pig to R).
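The two data-exchange paths discussed above (staging data on disk for the consumer to load, versus sending it directly over a socket) can be illustrated with a minimal sketch. This is not RPig code: it is written in Python rather than Java and R, the function names (`make_tuples`, `exchange_via_file`, `exchange_via_socket`) and the comma-separated line encoding are hypothetical, and the consumer is simulated in the same process. It only shows the shape of each path. Timings from it say nothing about the figures reported for RPig.

```python
import os
import socket
import tempfile
import threading
import time


def make_tuples(n):
    # Hypothetical sample data: n tuples with four fields each.
    return [(i, i * 0.5, i % 7, float(i) * 2.0) for i in range(n)]


def encode(tuples):
    # Assumed wire format: one tuple per line, comma-separated fields.
    text = "\n".join(",".join(repr(f) for f in t) for t in tuples)
    return text.encode("utf-8")


def decode(payload):
    rows = []
    for line in payload.decode("utf-8").splitlines():
        a, b, c, d = line.split(",")
        rows.append((int(a), float(b), int(c), float(d)))
    return rows


def exchange_via_file(tuples):
    # Disk-staged path: write everything to a file, then load the file.
    fd, path = tempfile.mkstemp(suffix=".txt")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(encode(tuples))
        with open(path, "rb") as f:
            return decode(f.read())
    finally:
        os.remove(path)


def exchange_via_socket(tuples):
    # Direct path: send the bytes over a connected socket pair.
    sender, receiver = socket.socketpair()
    payload = encode(tuples)

    def send():
        with sender:  # closing the sender signals end of data
            sender.sendall(payload)

    worker = threading.Thread(target=send)
    worker.start()
    chunks = []
    with receiver:
        while True:
            chunk = receiver.recv(65536)
            if not chunk:
                break
            chunks.append(chunk)
    worker.join()
    return decode(b"".join(chunks))


if __name__ == "__main__":
    data = make_tuples(50_000)
    for name, fn in [("file", exchange_via_file), ("socket", exchange_via_socket)]:
        start = time.perf_counter()
        result = fn(data)
        elapsed = time.perf_counter() - start
        print(f"{name}: {len(result)} tuples in {elapsed:.3f} s")
```

Both functions return the same tuples, so the only difference is the transport. In either path the encoding and decoding cost still scales with the number of tuples, which is consistent with the observation that the overhead grows as more data are exchanged.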