R for the same calculation. In another example, DataFu needs around 200 lines
of Java code for the Quantile function in the first use case. This illustrates the
significant reduction in coding and code-maintenance effort with RPig.
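For comparison, a quantile calculation in R is essentially a one-line call to the built-in quantile function (the input values and probabilities below are illustrative, not those of the use case):

  values <- c(12, 7, 3, 25, 18, 9, 14)          # example input vector
  quantile(values, probs = c(0.25, 0.5, 0.75))  # quartiles in a single call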
In summary, RPig offers concise programming for data analytics by utilizing
existing implementations of algorithms in R. However, the required data must
be converted and loaded into R, which causes performance overhead. As a
result, data should be reduced as much as possible before it is exchanged
between Pig and R. The current RPig implementation performs significantly
better than the initial one; it has reduced the data-exchange overhead by 50%.
9.5.3 Prediction with SVM
9.5.3.1 Design and Implementation
In SVM algorithms, the model is constructed from the determined support
vectors, so an SVM training data set can be represented by the data samples
that are support vectors. The remaining data in the set do not contribute
directly to the final SVM model and can be viewed as redundant, even
though minor inaccuracies may occur in some cases [13, 14]. Therefore,
suppose we have two map functions: one (map_sv) extracts the samples
marked as support vectors from a data set, and the other (map_svm_m)
produces an SVM model from a data set; a generic reduce function (reduce)
only aggregates a list of results from the map functions. Then the SVM
training phase to obtain a model for a data set D can be defined in the
MapReduce model as follows.
Training Phase:
  repeat a number of times if required:
    split D into {D_1, D_2, ..., D_n}
    D ← in parallel execution: reduce(map_sv(D_1), map_sv(D_2), ..., map_sv(D_n))
  Model ← reduce(map_svm_m(D))
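A minimal R sketch of these functions, assuming the e1071 package and an illustrative formula mos ~ . over the training columns (the actual RPig code is not shown in this excerpt):

  library(e1071)

  # map_sv: train an SVM on one partition and keep only the rows that
  # become support vectors; the remaining redundant rows are discarded.
  map_sv <- function(partition) {
    fit <- svm(mos ~ ., data = partition)
    partition[fit$index, ]        # fit$index lists the support-vector rows
  }

  # map_svm_m: train the final SVM model on the collected support vectors.
  map_svm_m <- function(dataset) {
    svm(mos ~ ., data = dataset)
  }

  # reduce: aggregate a list of partial results (here, data-frame partitions).
  reduce <- function(parts) {
    do.call(rbind, parts)
  }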
Since the support vectors are usually only a small subset of the original
input data set, |map_sv(D)| < |D|, and map and reduce functions are executed
in parallel in Hadoop, building an SVM model from the reduced subset is much
faster than building it from the original data set. Hence, the overall SVM
training is expected to scale with the size of the cluster. The parallel
algorithm can also be structured as multiple rounds of map and reduce: the
collected support-vector samples can be treated as a new data set, so map_sv
can be applied repeatedly to reduce the data size further if required, as
shown in the sketch below.
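The following sketch shows how those pieces could drive the training phase; the splitter, the number of rounds, and the sequential lapply are simplifications, since in Hadoop each map_sv call would run as a separate map task:

  train_svm <- function(D, n_splits, rounds = 1) {
    for (r in seq_len(rounds)) {
      # split D into n_splits partitions (round-robin; illustrative)
      partitions <- split(D, rep(seq_len(n_splits), length.out = nrow(D)))
      # each map_sv call would execute as a parallel map task in Hadoop
      D <- reduce(lapply(partitions, map_sv))
    }
    map_svm_m(D)    # final model built from the collected support vectors
  }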
The prediction phase takes the trained model and network KPIs at the IP
layer, such as packet loss, as input, and instantly returns a predicted MOS
value. In this case, we want to predict MOS values in parallel for a large
number of VoIP call sessions S, so a map function map_predict can be defined
to take a subset of the call sessions, increasing scalability.
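A corresponding map_predict could be sketched in R as follows, again assuming e1071 and hypothetical column names for the per-session IP-layer KPIs:

  # map_predict: score one subset of call sessions with the trained model.
  map_predict <- function(model, sessions) {
    data.frame(session_id    = sessions$session_id,
               predicted_mos = predict(model, newdata = sessions))
  }

Each map task receives one subset of the session set S and emits its predicted MOS values, so prediction throughput grows with the number of map tasks.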