R's data model contains simple data structure types, such as scalars, vectors,
and lists, and special compound data structure types: factors describe items
that can take one of a finite number of values; data frames are matrix-like
tables whose columns may hold different data types (numeric, factor, etc.).
All data structures in R are R objects, as are other statistics-specific
entities such as models and functions.
The following code snippet shows a simple example of an exponential moving
average (EMA) calculation using R. TTR is an R package implementing various
moving average calculations. Here temp is the series for which the EMA is
calculated, averaging over 20 periods.
library(TTR); results <- EMA(temp, 20)
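To make explicit what such an EMA call computes, the recurrence can be sketched in plain Python. This is a minimal sketch, not the TTR implementation: it assumes the common convention of a smoothing ratio of 2/(n + 1), seeded with the simple average of the first n values, and the function name ema and the sample series are hypothetical.

```python
def ema(series, n):
    """Exponential moving average over n periods.

    Assumed convention: smoothing ratio 2 / (n + 1), seeded with the
    simple average of the first n values; earlier positions are None.
    """
    if n <= 0 or len(series) < n:
        raise ValueError("need n > 0 and at least n values")
    ratio = 2.0 / (n + 1)
    result = [None] * (n - 1)          # not enough history yet
    current = sum(series[:n]) / n      # seed with the simple average
    result.append(current)
    for value in series[n:]:
        # Each new value is blended with the previous average.
        current = ratio * value + (1 - ratio) * current
        result.append(current)
    return result


temp = [float(x) for x in range(1, 31)]   # hypothetical 30-point series
results = ema(temp, 20)
```

The result has the same length as the input, so it can be aligned with the original series; the first n - 1 positions carry no value because fewer than n observations are available there.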
9.3.2 Hadoop and MapReduce
Hadoop offers the Hadoop Distributed File System (HDFS) to manage data storage
and a distributed parallel programming framework based on MapReduce
[5] for data processing. Computations are defined in map and reduce functions,
which take key-value pairs as input. A map function takes one pair of data at
a time, so many pairs can be processed in parallel: Map(k1, v1) → list(k2, v2).
A reduce function aggregates the related results of the map functions:
Reduce(k2, list(v2)) → list(v3). Programs need to be written as map and reduce
functions, using the Hadoop MapReduce Java APIs, to enable parallel computing.
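The two signatures above can be illustrated with a word count, sketched here in plain Python rather than with the Hadoop Java APIs. The function names and the single-process run_mapreduce driver are hypothetical stand-ins for the framework; in Hadoop the map calls and the grouping of values by key would be distributed across the cluster.

```python
from collections import defaultdict


def map_fn(key, value):
    """Map(k1, v1) -> list(k2, v2): emit (word, 1) for each word in a line."""
    return [(word.lower(), 1) for word in value.split()]


def reduce_fn(key, values):
    """Reduce(k2, list(v2)) -> list(v3): sum the counts for one word."""
    return [sum(values)]


def run_mapreduce(records, map_fn, reduce_fn):
    """Single-process stand-in for the framework: map, group, reduce."""
    groups = defaultdict(list)
    for k1, v1 in records:                 # map phase (parallel in Hadoop)
        for k2, v2 in map_fn(k1, v1):
            groups[k2].append(v2)          # group intermediate values by key
    return {k2: reduce_fn(k2, v2s) for k2, v2s in sorted(groups.items())}


# Hypothetical input: (line number, line text) pairs.
lines = [(0, "big data needs big tools"), (1, "data tools")]
counts = run_mapreduce(lines, map_fn, reduce_fn)
```

Because each map call sees only one input pair and each reduce call sees only the values for one key, neither function depends on shared state, which is what allows the framework to run them in parallel.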
9.3.3 Pig and Pig Latin
Pig is built on top of Hadoop and provides a high-level data flow language
called Pig Latin [8] for expressing data queries and processing. It is similar
to the SQL of a relational database management system (RDBMS), but it is
procedural in style and gives more control over the flow of the data and its
optimization. Pig scripts are compiled by Pig into sequences of MapReduce
jobs, which are executed in the Hadoop MapReduce environment.
The Pig data model contains scalar types that have a single atomic value
(int, long, etc.), and three complex types that can contain other types:
a tuple is a data record consisting of a sequence of “fields,” which can be
any data type; a bag is a collection of tuples, similar to a “table”; a map
is a mapping from a string key to a value, which can be any data type.
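The three complex types can be pictured with rough Python analogs. This is only an illustration of how the types nest, not Pig syntax: the sensor data and variable names are hypothetical, and a Python list keeps its order whereas a Pig bag is unordered.

```python
# A tuple: an ordered record of fields, each of any type.
reading = ("sensor-1", 20.5, 3)

# A bag: a collection of tuples, similar to a table (duplicates allowed).
readings = [
    ("sensor-1", 20.5, 3),
    ("sensor-2", 19.0, 1),
    ("sensor-1", 20.5, 3),
]

# A map: string keys to values of any type, including other complex types.
station = {"name": "north", "readings": readings, "latest": reading}

# Complex types nest: a field of a tuple can itself hold a bag.
grouped = ("sensor-1", [t for t in readings if t[0] == "sensor-1"])
```

The last value has the shape of a grouping result: one tuple per key, whose second field is the bag of tuples that share that key.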
Pig provides a set of operators for data processing. For example, LOAD and
STORE read data from and write data to HDFS. The FOREACH operator processes
every tuple of a data set. Many operators are similar to SQL, such as JOIN,
GROUP BY, and UNION for standard data operations. As with many SQL
implementations, Pig supports user-defined functions (UDFs), which extend Pig
with tasks written in a lower-level language (Java or Python). The following
Pig script shows how to