Summary
In many domains, such as telecommunications, various scenarios necessitate the processing of large amounts of data with statistical and machine learning algorithms for deep analytics. A noticeable effort has been made to move data management systems into MapReduce parallel processing environments such as Hadoop and Pig. Nevertheless, these systems lack the necessary statistical and machine learning algorithms and can therefore be used only for simple data analysis. Frameworks such as Mahout, built on top of Hadoop, support machine learning, but their implementations are at an early stage. For example, Mahout does not provide support vector machine (SVM) algorithms and is difficult to use. On the other hand, traditional statistical software tools such as R, which contain comprehensive algorithms for advanced statistical analysis, are widely used; however, such software runs only on a single computer and therefore does not scale to big data. In this chapter, we present RPig, an integrated framework combining R and Pig for scalable machine learning and advanced statistical functionality, which makes it feasible to develop analytic jobs easily and concisely in high-level languages. Using application scenarios from the telecommunications domain, we show how RPig is used and, with comparative evaluation results, demonstrate its advantages, such as reduced development effort compared with related work.
9.1 Introduction
With the explosive growth in the use of information and communication technology (ICT), applications that involve deep analytics need to shift to scalable solutions for big data. Our work is motivated by the big data analytic needs of network management systems, such as network traffic analysis, in the telecommunications (telecom) domain. More specifically, the work extends the Apache Pig/Hadoop frameworks, which are commonly used to build cost-effective big data systems in industry. The design, the software implementation, and the solution we describe here are general and applicable to other domains.
To build a scalable system, one approach is to use distributed parallel computing models, such as MapReduce [1], that allow adding more (computer) nodes to the system to scale horizontally. MapReduce has recently been applied to many data management systems (DMSs), such as Hadoop and Pig. These systems target the storage and querying of data for top-layer applications. However, they lack the statistical and machine learning algorithms needed for deep analytics.
 