RPig: Concise Programming Framework by Integrating R with Pig for Big Data Analytics - Cloud Computing with e-Science Applications

R's data model contains simple data structure types, such as scalars, vectors, and lists, and special compound data structure types: factors describe items that can take a finite number of values; data frames are matrix-like tables whose columns may contain different data types (numeric, factor, etc.). All data structures in R are R objects, which also include statistics-specific models, functions, and so on.

The following code snippet shows a simple example of an EMA calculation using R. TTR is an R package that implements various moving average calculations. Here temp is the input series for the EMA calculation, and 20 is the number of periods to average over.

library(TTR); results <- EMA(temp, 20)
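To make the calculation behind the call concrete, the following is a minimal Python sketch of one common EMA definition. It is not the TTR implementation: the smoothing ratio 2 / (n + 1), the seeding with the simple mean of the first n values, and the use of None for the first n - 1 positions are assumptions made for illustration, and the function name ema and the sample series are hypothetical.

```python
from typing import List, Optional


def ema(series: List[float], n: int) -> List[Optional[float]]:
    """Exponential moving average over n periods.

    Assumptions (not taken from the text): smoothing ratio 2 / (n + 1),
    seeded with the simple mean of the first n values; earlier positions
    are None because no average is defined there yet.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if len(series) < n:
        return [None] * len(series)

    ratio = 2.0 / (n + 1)
    out: List[Optional[float]] = [None] * (n - 1)
    prev = sum(series[:n]) / n  # seed: simple average of the first n values
    out.append(prev)
    for x in series[n:]:
        # Blend the new observation with the previous average.
        prev = ratio * x + (1.0 - ratio) * prev
        out.append(prev)
    return out


temp = [float(v) for v in range(1, 31)]  # hypothetical input series
results = ema(temp, 20)
```

Each new value is a weighted blend of the latest observation and the previous average, so older observations fade out geometrically rather than dropping out of a fixed window.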

9.3.2 Hadoop and MapReduce

Hadoop offers the Hadoop Distributed File System (HDFS) to manage data storage and a distributed parallel programming framework based on MapReduce [5] for data processing. Computations are defined in map and reduce functions, which take key-value pairs as input. A map function takes one pair at a time and emits a list of intermediate pairs, so separate pairs can be processed in parallel: Map(k1, v1) → list(k2, v2). A reduce function aggregates the related results of the map functions, that is, all values that share an intermediate key: Reduce(k2, list(v2)) → list(v3). To enable parallel computing, programs need to be written as map and reduce functions against the Hadoop MapReduce Java APIs.
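The two signatures can be illustrated with a small single-process word count. This is a Python sketch of the programming model only, not the Hadoop Java API: the names map_fn, reduce_fn, and run_mapreduce and the input lines are hypothetical, and the phases run sequentially here rather than across a cluster.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_fn(k1: int, v1: str) -> List[Tuple[str, int]]:
    """Map(k1, v1) -> list(k2, v2): emit (word, 1) for each word in a line."""
    return [(word.lower(), 1) for word in v1.split()]


def reduce_fn(k2: str, values: List[int]) -> List[int]:
    """Reduce(k2, list(v2)) -> list(v3): sum the counts for one word."""
    return [sum(values)]


def run_mapreduce(records: Iterable[Tuple[int, str]]) -> Dict[str, List[int]]:
    # Map phase: each input pair is independent, so a real cluster can
    # process the pairs in parallel; here they run one after another.
    intermediate: List[Tuple[str, int]] = []
    for k1, v1 in records:
        intermediate.extend(map_fn(k1, v1))

    # Shuffle phase: group the intermediate values by their key k2.
    groups: Dict[str, List[int]] = defaultdict(list)
    for k2, v2 in intermediate:
        groups[k2].append(v2)

    # Reduce phase: one call per distinct intermediate key.
    return {k2: reduce_fn(k2, v2s) for k2, v2s in groups.items()}


lines = [(0, "pig runs on hadoop"), (1, "hadoop runs mapreduce")]
counts = run_mapreduce(lines)
```

The grouping step between the two phases is what turns the map output list(k2, v2) into the (k2, list(v2)) input that each reduce call receives.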

9.3.3 Pig and Pig Latin

Pig is built on top of Hadoop and provides a high-level data flow language called Pig Latin [8] for expressing data queries and processing. It is similar to the SQL of a relational database management system (RDBMS), but it is procedural in style and gives more control over the flow of the data and its optimization. Pig compiles scripts into sequences of MapReduce jobs, which are executed in the Hadoop MapReduce environment.

The Pig data model contains scalar types that hold a single atomic value (integer, long, etc.) and three complex types that can contain other types: a tuple is a data record consisting of a sequence of “fields,” each of which can be any data type; a bag is a collection of tuples, similar to a “table”; a map associates string keys with values, which can be any data type.
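As a rough analogy only, the nesting of these types can be mimicked with built-in Python containers. The sketch below is not how Pig represents data internally, and all variable names and values are hypothetical.

```python
from typing import Any, Dict, List, Tuple

# Scalar: a single atomic value.
age: int = 30

# Tuple: an ordered sequence of fields; each field can be any type.
record: Tuple[Any, ...] = ("alice", age, 3.5)

# Bag: a collection of tuples, comparable to a table of rows.
bag: List[Tuple[Any, ...]] = [record, ("bob", 25, 4.0)]

# Map: string keys to values of any type, including other complex types.
info: Dict[str, Any] = {"owner": "alice", "rows": bag}

# Complex types nest: a tuple field may itself hold a bag.
grouped: Tuple[str, List[Tuple[Any, ...]]] = ("students", bag)

# Reading the first field of every tuple in the nested bag.
names = [row[0] for row in grouped[1]]
```

A tuple that carries a bag as one of its fields is the kind of nested structure that a flat relational table cannot express directly.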

Pig provides a set of operators for data processing. For example, LOAD and STORE read data from and write data to HDFS, and the FOREACH operator processes every tuple of a data set. Many operators, such as JOIN, GROUP BY, and UNION, are similar to their SQL counterparts for standard data operations. As with many SQL implementations, Pig supports user-defined functions (UDFs), which extend Pig with tasks written in a lower-level language (Java or Python). The following Pig script shows how to
