[Figure: series DataFu, RPig (JVM R), RPig (standalone R); x-axis: Data Size (rows), 22k to 90k; y-axis: 0 to 2000]
FIGURE 9.2
Performance comparison of DataFu and RPig.
through Pig operations, and these take more time before calling the quantile
function.
DataFu provides some convenience bag functions (e.g., enumerating bags) and
utility functions, but its selection of statistical functions is extremely
limited. It includes only common statistics tasks (e.g., quantile, variance),
PageRank, and similar algorithms relevant to the LinkedIn use cases. Even for
the quantile function, DataFu implements only the type R-2 estimation, which
is one of several algorithms for estimating quantiles. RPig allows the use of
all nine quantile algorithms implemented in R, selected by the type parameter
in the example. With RPig, it is easy to wrap and expose any statistical
function of R as a Pig UDF. The statistical functions available in RPig are
as numerous and as comprehensive as those in R itself.
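To make the difference between quantile estimation types concrete, the following is a minimal Python sketch of two of the nine definitions: type 2 (the one the text says DataFu implements) and type 7 (R's default). The function names quantile_type2 and quantile_type7 are hypothetical and are not part of DataFu, RPig, or R. The sketch follows the standard definitions but omits the floating-point fuzz that R applies, so it should be read as an illustration rather than a reference implementation.

```python
import math


def _check(values, p):
    """Validate inputs and return the sorted sample."""
    if not values:
        raise ValueError("values must be non-empty")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    return sorted(values)


def quantile_type2(values, p):
    """Type 2: inverse empirical CDF, averaging at discontinuities."""
    xs = _check(values, p)
    n = len(xs)
    h = n * p
    j = math.floor(h)
    g = h - j

    def order_stat(k):
        # 1-based order statistic, clamped to the sample range.
        return xs[min(max(k, 1), n) - 1]

    if g == 0:
        # Exactly on a step of the empirical CDF: average the two neighbours.
        return 0.5 * (order_stat(j) + order_stat(j + 1))
    return order_stat(j + 1)


def quantile_type7(values, p):
    """Type 7: linear interpolation between order statistics (R's default)."""
    xs = _check(values, p)
    n = len(xs)
    h = (n - 1) * p          # 0-based fractional position
    j = math.floor(h)
    g = h - j
    upper = xs[min(j + 1, n - 1)]
    return xs[j] + g * (upper - xs[j])


sample = [1, 2, 3, 4]
q2 = quantile_type2(sample, 0.25)
q7 = quantile_type7(sample, 0.25)
print(q2, q7)
```

On the sample 1, 2, 3, 4 the two definitions give different first quartiles, which is why being able to select the type can matter when results must match an existing R analysis.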
In summary, RPig provides extensive statistical and machine learning
algorithms by wrapping any original R function in a Pig UDF; the UDF is
flexible with input and output data formats and gives the best performance
(with stand-alone R) in this case. In contrast, DataFu is ready to use
without installing an additional script engine, since it runs on the JVM,
but its number of functions is extremely limited.
9.5.2 Forecasting with EMA
9.5.2.1 Design and Implementation
EMA is used for forecasting data traffic on selected VoIP service clients for
a use case described in Section 9.2. Since EMA is a light algorithm, and the