R for the same calculation. In another example, DataFu needs around 200 lines
of Java code for the Quantile function in the first use case. This illustrates the
significant reduction in coding and code-maintenance effort with RPig.
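For comparison, a quantile calculation in R is essentially a one-line call to the built-in quantile function (the input values and probabilities below are illustrative, not those of the use case):

  values <- c(12, 7, 3, 25, 18, 9, 14)          # example input vector
  quantile(values, probs = c(0.25, 0.5, 0.75))  # quartiles in a single call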
In summary, RPig offers concise programming for data analytics by utilizing
existing implementations of algorithms in R. However, the required data must
be converted and loaded into R, which causes performance overhead. As a
result, data should be reduced as much as possible before it is exchanged
between Pig and R. The current RPig implementation performs significantly
better than the initial one; it has reduced the data-exchange overhead by 50%.
9.5.3 Prediction with SVM
9.5.3.1 Design and Implementation
In SVM algorithms, the model is constructed from the determined support
vectors, so an SVM training data set can be represented by the data samples
that are support vectors. The remaining data in the set do not contribute
directly to the final SVM model and can be viewed as redundant, even
though minor inaccuracies may occur in some cases [13, 14]. Therefore,
suppose we have two map functions: one (map_sv) extracts the samples
marked as support vectors from a data set, and the other (map_svm_m)
produces an SVM model from a data set; a generic reduce function (reduce)
only aggregates a list of results from the map functions. Then the SVM
training phase to obtain a model for a data set D can be defined in the
MapReduce model as follows.
Training Phase:
  repeat a number of times if required:
    split D into {D_1, D_2, ..., D_n}
    D ← in parallel execution: reduce(map_sv(D_1), map_sv(D_2), ..., map_sv(D_n))
  Model ← reduce(map_svm_m(D))
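A minimal R sketch of these functions, assuming the e1071 package and an illustrative formula mos ~ . over the training columns (the actual RPig code is not shown in this excerpt):

  library(e1071)

  # map_sv: train an SVM on one partition and keep only the rows that
  # become support vectors; the remaining redundant rows are discarded.
  map_sv <- function(partition) {
    fit <- svm(mos ~ ., data = partition)
    partition[fit$index, ]        # fit$index lists the support-vector rows
  }

  # map_svm_m: train the final SVM model on the collected support vectors.
  map_svm_m <- function(dataset) {
    svm(mos ~ ., data = dataset)
  }

  # reduce: aggregate a list of partial results (here, data-frame partitions).
  reduce <- function(parts) {
    do.call(rbind, parts)
  }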
Since the support vectors are usually only a small subset of the original
input data set, |map_sv(D)| < |D|, and map and reduce functions are executed
in parallel in Hadoop, building an SVM model from the reduced subset is much
faster than building it from the original data set. Hence, the overall SVM
training is expected to scale with the size of the cluster. The parallel
algorithm can also be structured as multiple rounds of map and reduce: the
collected support-vector samples can be treated as a new data set, so map_sv
can be applied repeatedly to reduce the data size further if required, as
shown in the sketch below.
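The following sketch shows how those pieces could drive the training phase; the splitter, the number of rounds, and the sequential lapply are simplifications, since in Hadoop each map_sv call would run as a separate map task:

  train_svm <- function(D, n_splits, rounds = 1) {
    for (r in seq_len(rounds)) {
      # split D into n_splits partitions (round-robin; illustrative)
      partitions <- split(D, rep(seq_len(n_splits), length.out = nrow(D)))
      # each map_sv call would execute as a parallel map task in Hadoop
      D <- reduce(lapply(partitions, map_sv))
    }
    map_svm_m(D)    # final model built from the collected support vectors
  }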
The prediction phase takes the trained model and network KPIs at the IP
layer, such as packet loss, as input, and instantly returns a predicted MOS
value. In this case, we want to predict MOS values in parallel for a large
number of VoIP call sessions S, so a map function map_predict can be defined
to take a subset of the call sessions, increasing scalability.
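A corresponding map_predict could be sketched in R as follows, again assuming e1071 and hypothetical column names for the per-session IP-layer KPIs:

  # map_predict: score one subset of call sessions with the trained model.
  map_predict <- function(model, sessions) {
    data.frame(session_id    = sessions$session_id,
               predicted_mos = predict(model, newdata = sessions))
  }

Each map task receives one subset of the session set S and emits its predicted MOS values, so prediction throughput grows with the number of map tasks.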