Because the aggregated data Traffics (from Section 9.3.3) is already small
enough in this case, we can group all the data together and send it to one
R engine using RPig. The following shows the Pig statement:
Results = FOREACH (GROUP Traffics ALL) GENERATE RFuncs.ema_all($1, n);
ema_all() is a user-defined R function that processes the grouped data, as
follows:
ema_all.outputSchema <- toString(lapply(seq(1, 11), function(x) {
  paste("map[tuple(double)]", sep = "")
}))
ema_all <- function(x, n) {
  # convert the nested list to a data frame
  xDf <- as.data.frame(do.call(rbind, x[[1]]))
  # sort the data and initialize variables (elided in the source)
  ...
  library('TTR')
  for (i in 1:length(clients)) {
    # traffic values (column 4) of the i-th service client (column 3)
    t <- xDf[xDf[, c(3)] == clients[i], c(4)]
    results <- append(results,
                      list(list(as.character(clients[i]), EMA(t, n))))
  }
  return(results)
}
In this case, the data passed to R is a nested list (x), which contains
aggregated traffic data for all service clients in different time windows,
of the form $((x_{1,1}, x_{1,2}, x_{1,3}), (x_{2,1}, x_{2,2}, x_{2,3}), \ldots)$.
The first line of the R script converts the nested
list to a data frame called xDf, so the input data can be easily sorted and
selected as a data table. For each service client, a sorted numeric list
containing the traffic data of previous time windows is selected and used as
input for the EMA() function of the R TTR package. The results for all
service clients, collected in the nested list results, are subsequently
converted to a Pig map data structure specified by the output schema. The
name of the service client is the key of the map, and the forecasted result
is the value. In this case, Pig serves as the query language for accessing
the data in HDFS, and the converted data are then sent to R for analytic
tasks. Afterward, the analytic result is printed on screen or stored in HDFS
through Pig statements. Hence, RPig also gives R programmers a way to read
and write data and files in HDFS.
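
The conversion and per-client selection described above can be tried outside of Pig with a small stand-in for the nested list. The following is only a sketch under stated assumptions: the toy records, their field layout (window, window end, client, traffic), and the helper name ema_sketch are hypothetical and not taken from the text. ema_sketch is a base-R stand-in for TTR's EMA(), assuming its default behavior of seeding with the simple mean of the first n values and smoothing with ratio 2/(n+1).

```r
# Toy stand-in for the nested list that RPig would pass to R.
# ASSUMPTION: each record is (window, window_end, client, traffic); the text
# only implies that column 3 is the client and column 4 is the traffic value.
x <- list(list(
  list(1, 10, "clientA", 100), list(2, 20, "clientA", 110),
  list(3, 30, "clientA", 120), list(4, 40, "clientA", 130),
  list(1, 10, "clientB", 50),  list(2, 20, "clientB", 70),
  list(3, 30, "clientB", 60),  list(4, 40, "clientB", 80)
))

# Same conversion as in ema_all(): bind the records row-wise, then wrap the
# resulting list-matrix in a data frame (its columns are list columns).
xDf <- as.data.frame(do.call(rbind, x[[1]]))

# Flatten the two columns of interest into plain vectors.
client  <- unlist(xDf[[3]])
traffic <- as.numeric(unlist(xDf[[4]]))

# Hypothetical base-R stand-in for TTR::EMA (assumed default seeding).
ema_sketch <- function(series, n) {
  out <- rep(NA_real_, length(series))
  if (length(series) < n) return(out)
  alpha <- 2 / (n + 1)
  out[n] <- mean(series[1:n])              # seed with the simple mean
  if (length(series) > n) {
    for (i in (n + 1):length(series)) {
      out[i] <- alpha * series[i] + (1 - alpha) * out[i - 1]
    }
  }
  out
}

# Build the nested result list: one (client name, EMA series) pair per client.
# The toy records are already ordered by time window within each client.
clients <- unique(client)
results <- list()
for (i in seq_along(clients)) {
  series  <- traffic[client == clients[i]]
  results <- append(results, list(list(clients[i], ema_sketch(series, 2))))
}
```

With n = 2, the sketch yields NA, 105, 115, 125 for clientA. In RPig itself, the nested list would then be mapped to the Pig map declared by the output schema; that step is not reproduced here.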
To summarize this use case of RPig, the Pig operations are used as
preprocessing steps to extract and summarize only the information needed
for R processing. When the summarized data are small enough to be handled
by R on a single node, we can apply any statistical algorithm implemented
in R directly to the summarized data, as in the traditional single-machine
approach of R.
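
As a hedged illustration of this point, once the summarized data fit in a single R process, ordinary R functions apply without any Hadoop-specific code. The data frame below and its column names (client, window, traffic) are hypothetical; the per-client mean and linear trend are example analyses chosen for illustration, not ones used in the text.

```r
# Hypothetical summarized data, small enough for a single R process.
traffic_df <- data.frame(
  client  = rep(c("clientA", "clientB"), each = 4),
  window  = rep(1:4, times = 2),
  traffic = c(100, 110, 120, 130, 50, 70, 60, 80)
)

# Per-client mean traffic using base R.
mean_traffic <- tapply(traffic_df$traffic, traffic_df$client, mean)

# Per-client linear trend: slope of traffic over the window index via lm().
trend <- sapply(split(traffic_df, traffic_df$client), function(d) {
  unname(coef(lm(traffic ~ window, data = d))["window"])
})
```

Here mean_traffic and trend are named vectors keyed by client, which is the same shape of result (one value per service client) that the Pig map in the use case carries back to HDFS.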