expected from the R output and remove ambiguity during the conversion.
For example, a logical value in R can be TRUE, FALSE, or NA (Not Available),
whereas the Pig boolean type can only be true or false. By declaring a boolean
output schema, the logical value is converted to a Boolean value; alternatively,
the user may specify an int or chararray schema so that no semantic information
is lost. The following rules are used for type conversion from R to Pig (an
illustrative sketch follows the list):
• Simple data types
  • (schema: int) numeric/integer/logical/factor: int (T: 1; F: 0; NA: 128)
  • (schema: float/double) numeric/double: double
  • (schema: chararray) character/logical/factor: chararray
  • (schema: bytearray) raw: bytearray
  • (schema: boolean) logical: boolean (T: T; F/NA: F)
  • NULL: null
  • (schema: datetime) POSIXlt/POSIXt: datetime
• Complex data types
  • (schema: tuple) numeric array/character array/logical array/factors/list:
    tuple, e.g. structure(c(1L, 2L, 1L), .Label = c("a", "b"), class = "factor")
    becomes (a, b, a)
  • (schema: bag) nested list: bag
  • (schema: map) list: map
• Anything else raises an exception.
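To make these rules concrete, the R snippet below pairs example values with the Pig values they would map to under the rules above (shown in the comments). This is an illustrative sketch only: the output schema named in each comment is assumed to be declared through RPig's schema mechanism (whose exact syntax is not shown in this excerpt), the textual rendering of the Pig values uses standard Pig notation, and the assumption that map keys come from the R list names is the author of this sketch's, not the source's.

# Example R values and the Pig values they would map to, assuming the
# output schema named in each comment (illustrative only).
r_int  <- NA                               # schema int:      128  (TRUE -> 1, FALSE -> 0)
r_dbl  <- 3.14                             # schema double:   3.14
r_chr  <- "abc"                            # schema chararray: abc
r_raw  <- charToRaw("abc")                 # schema bytearray: raw bytes
r_bool <- NA                               # schema boolean:  false (the NA distinction is lost)
r_dt   <- as.POSIXlt("2014-01-01")         # schema datetime: a Pig datetime value
# Complex types
r_fac  <- structure(c(1L, 2L, 1L),
                    .Label = c("a", "b"),
                    class  = "factor")     # schema tuple: (a, b, a)
r_bag  <- list(list(1, 2), list(3, 4))     # schema bag:   {(1,2),(3,4)}
r_map  <- list(k1 = 1, k2 = "v")           # schema map:   [k1#1,k2#v] (keys assumed from names)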
9.4.3 Execution and Monitoring
At the parallel execution stage, the defined R functions or UDFs are trans-
formed into map functions that are automatically generated by taking
advantage of Pig. They are executed in parallel in different Hadoop task
nodes. Each R or map task will take a piece of split data and execute
independently on an R engine on one task node. If the Hadoop cluster is
configured with a capacity of more than one map task per node, each map or
R function will have its own isolated R session. When a task is completed and
a result is returned, the data stored in the R session will be cleared, and
the process will be killed by the RPig framework. As a consequence, no
R session will be kept alive after the R execution is complete, and all data
that need to be saved or persisted from R must be saved in HDFS through
Pig operations. This design was chosen because an R session exists only on
a single task node, which can be replaced by any other task node in the Hadoop
cluster at any time; the R session cannot be retrieved by other nodes at
a later time. Pig stores data, including temporary data generated between
MapReduce jobs during processing, in HDFS to guarantee that data can be
retrieved later from every node of the cluster. The results of all R functions
are collected through reduce tasks for further processing. Users do not need
to develop key-value pair map and reduce functions within RPig; they only need
to set the number of map and reduce tasks for parallel execution through the
Hadoop and Pig configuration, as in the sketch below.
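As an illustration of the last point, a plain Pig Latin script might control parallelism and persist R results roughly as follows. The RPig-specific R function definitions are omitted, since their syntax is not shown in this excerpt; the property values, paths, and aliases are hypothetical, and only standard Pig directives are used.

-- Standard Pig/Hadoop configuration (hypothetical values).
SET default_parallel 8;                -- number of reduce tasks for this script
SET mapreduce.job.maps '16';           -- only a hint; map count is mainly driven by input splits

raw = LOAD '/data/input' USING PigStorage('\t') AS (id:int, value:double);
-- ... RPig-defined R functions would be applied to 'raw' here ...
grouped = GROUP raw ALL;               -- placeholder for the actual Pig/R processing steps
STORE grouped INTO '/data/output';     -- results persist in HDFS; the R sessions themselves are discarded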