expected from the R output and remove ambiguity during the conversion.
For example, a logical value in R can be TRUE, FALSE, or NA (Not Available),
whereas the Pig boolean type can only be true or false. By declaring a boolean
output schema, the logical value is converted to a Boolean value; alternatively,
the user may specify an int or chararray schema so that no semantic information
is lost. The following rules are used for type conversion from R to Pig (an
illustrative sketch follows the list):
• Simple data types
  • (schema: int) numeric/integer/logical/factor: int (T: 1; F: 0; NA: 128)
  • (schema: float/double) numeric/double: double
  • (schema: chararray) character/logical/factor: chararray
  • (schema: bytearray) raw: bytearray
  • (schema: boolean) logical: boolean (T: T; F/NA: F)
  • NULL: null
  • (schema: datetime) POSIXlt/POSIXt: datetime
• Complex data types
  • (schema: tuple) numeric array/character array/logical array/factors/list:
    tuple, e.g. structure(c(1L, 2L, 1L), .Label = c("a", "b"), class = "factor")
    becomes (a, b, a)
  • (schema: bag) nested list: bag
  • (schema: map) list: map
• Anything else raises an exception.
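To make these rules concrete, the R snippet below pairs example values with the Pig values they would map to under the rules above (shown in the comments). This is an illustrative sketch only: the output schema named in each comment is assumed to be declared through RPig's schema mechanism (whose exact syntax is not shown in this excerpt), the textual rendering of the Pig values uses standard Pig notation, and the assumption that map keys come from the R list names is the author of this sketch's, not the source's.

# Example R values and the Pig values they would map to, assuming the
# output schema named in each comment (illustrative only).
r_int  <- NA                               # schema int:      128  (TRUE -> 1, FALSE -> 0)
r_dbl  <- 3.14                             # schema double:   3.14
r_chr  <- "abc"                            # schema chararray: abc
r_raw  <- charToRaw("abc")                 # schema bytearray: raw bytes
r_bool <- NA                               # schema boolean:  false (the NA distinction is lost)
r_dt   <- as.POSIXlt("2014-01-01")         # schema datetime: a Pig datetime value
# Complex types
r_fac  <- structure(c(1L, 2L, 1L),
                    .Label = c("a", "b"),
                    class  = "factor")     # schema tuple: (a, b, a)
r_bag  <- list(list(1, 2), list(3, 4))     # schema bag:   {(1,2),(3,4)}
r_map  <- list(k1 = 1, k2 = "v")           # schema map:   [k1#1,k2#v] (keys assumed from names)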
9.4.3 Execution and Monitoring
At the parallel execution stage, the defined R functions or UDFs are trans-
formed into map functions that are automatically generated by taking
advantage of Pig. They are executed in parallel in different Hadoop task
nodes. Each R or map task will take a piece of split data and execute
independently on an R engine on one task node. If the Hadoop cluster is
configured with a capacity of more than one map task per node, each map or
R function will have its own isolated R session. When a task is completed and
a result is returned, the data stored in the R session will be cleared, and
the process will be killed by the RPig framework. As a consequence, no
R session will be kept alive after the R execution is complete, and all data
that need to be saved or persisted from R must be saved in HDFS through
Pig operations. This design was chosen because an R session exists only on
a single task node, which can be replaced by any other task node in the Hadoop
cluster at any time; the R session cannot be retrieved by other nodes at
a later time. Pig stores data, including temporary data generated between
MapReduce jobs during processing, in HDFS to guarantee that data can be
retrieved later from every node of the cluster. The results of all R functions
are collected through reduce tasks for further processing. Users do not need
to develop key-value pair map and reduce functions within RPig; they only need
to set the number of map and reduce tasks for parallel execution through the
Hadoop and Pig configuration, as in the sketch below.
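As an illustration of the last point, a plain Pig Latin script might control parallelism and persist R results roughly as follows. The RPig-specific R function definitions are omitted, since their syntax is not shown in this excerpt; the property values, paths, and aliases are hypothetical, and only standard Pig directives are used.

-- Standard Pig/Hadoop configuration (hypothetical values).
SET default_parallel 8;                -- number of reduce tasks for this script
SET mapreduce.job.maps '16';           -- only a hint; map count is mainly driven by input splits

raw = LOAD '/data/input' USING PigStorage('\t') AS (id:int, value:double);
-- ... RPig-defined R functions would be applied to 'raw' here ...
grouped = GROUP raw ALL;               -- placeholder for the actual Pig/R processing steps
STORE grouped INTO '/data/output';     -- results persist in HDFS; the R sessions themselves are discarded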