is the same as other newly developed frameworks, such as MLbase [28] and
Cloudera ML [29], and also lacks implementations of commonly used statistical
and machine learning algorithms.
9.7 Conclusion
R provides comprehensive machine learning and statistical algorithms.
However, the R execution environment is not distributed and is not
considered scalable. In contrast, Pig supports parallel data processing using
a high-level language, but it does not provide implementations of common
statistical algorithms and so lacks the features needed for advanced statistical
analysis. In this chapter, we presented an integrated RPig framework that
takes advantage of both R and Pig, allowing scalable deep analysis while
minimizing development effort through concise programming.
We have described the design and implementation of the RPig framework and
demonstrated its use through use case scenarios. We have presented
experimental results, with examples, on scalability and on the reduction in
coding effort. Each use case experiment also included a comparison with
related work to show the difference or improvement. In future work, we plan
to create an R package that allows Pig to be called from R.
References
1. Dean, J., and S. Ghemawat. MapReduce: simplified data processing on large
clusters. Communications of the ACM, 2008, 51(1): 107-113.
2. Apache Mahout. Home page. http://mahout.apache.org/.
3. RHadoop. https://github.com/RevolutionAnalytics/RHadoop/wiki/.
4. Handurukande, S., et al. Magneto approach to QoS monitoring. In IFIP/IEEE
International Symposium on Integrated Network Management. 2011.
5. White, T. Hadoop: The Definitive Guide. 2nd ed. Sebastopol, CA: O'Reilly Media,
2011.
6. Eaton, C., et al. Understanding Big Data: Analytics for Enterprise Class Hadoop and
Streaming Data. New York: McGraw-Hill, 2012.
7. DataFu. http://data.linkedin.com/opensource/datafu.
8. Olston, C., et al. Pig Latin: a not-so-foreign language for data processing. In
ACM International Conference on Management of Data. 2008.
9. Wang, M., S. B. Handurukande, and M. Nassar. RPig: a scalable framework for
machine learning and advanced statistical functionalities. In IEEE International
Conference on Cloud Computing Technology and Science. New York: IEEE, 2012.
10. Renjin. Home page. http://www.renjin.org/.
11. rsession. http://code.google.com/p/rsession/.
 