R's data model contains simple data structure types, such as scalars,
vectors, and lists, and special compound data structure types: factors are
used to describe items that can have a finite number of values; data frames
are matrix-like structures whose columns may contain different data types
(numeric, factor, etc.). All data structures of R are R objects, which also
include statistical models, functions, and so on.
The following code snippet shows a simple example of EMA calculation using
R. TTR is an R package implementing various moving average calculations.
The series temp is the input for the EMA calculation, with 20 periods to
average over.
library(TTR); results <- EMA(temp, 20)
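To make the arithmetic behind an n-period EMA concrete, here is a minimal sketch in plain Python rather than R. It assumes one common convention: a smoothing factor of 2/(n+1), with the first value seeded by the simple average of the first n observations. The function name ema and this seeding choice are assumptions made for illustration and are not taken from the TTR source.

```python
from typing import List, Optional


def ema(values: List[float], n: int) -> List[Optional[float]]:
    """Exponential moving average over n periods (illustrative sketch).

    Assumed convention: alpha = 2 / (n + 1); the first n - 1 outputs are
    None, and output n - 1 is the simple mean of the first n values.
    """
    if n <= 0:
        raise ValueError("n must be a positive integer")
    if len(values) < n:
        # Not enough observations to seed the average.
        return [None] * len(values)

    alpha = 2.0 / (n + 1)
    out: List[Optional[float]] = [None] * (n - 1)

    # Seed with the simple average of the first n observations.
    prev = sum(values[:n]) / n
    out.append(prev)

    # Each later value blends the new observation with the previous EMA.
    for x in values[n:]:
        prev = alpha * x + (1.0 - alpha) * prev
        out.append(prev)
    return out


print(ema([1, 2, 3, 4, 5], 3))  # [None, None, 2.0, 3.0, 4.0]
```

With n = 3 the smoothing factor is 0.5, so each new value is the midpoint of the latest observation and the previous average.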
9.3.2 Hadoop and MapReduce
Hadoop offers the Hadoop Distributed File System (HDFS) to manage data
storage and a distributed parallel programming framework based on MapReduce
[5] for data processing. Computations are defined in Map and Reduce
functions, which take key-value pairs as input. A map function takes one
key-value pair and emits a list of intermediate pairs,
Map(k1,v1)→list(k2,v2); because each input pair is handled independently,
map calls can be processed in parallel. A reduce function aggregates the
map results that share a key, Reduce(k2, list(v2))→list(v3). Programs need
to be written as map and reduce functions against the Hadoop MapReduce Java
APIs to enable parallel computing.
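The two signatures above can be illustrated with a small single-process sketch in Python. It is not the Hadoop API: the names map_fn, reduce_fn, and run_mapreduce are hypothetical, and word counting is used only as a familiar example. The grouping step in the middle stands in for the shuffle that a real framework performs between the map and reduce phases.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple


def map_fn(k1: int, v1: str) -> List[Tuple[str, int]]:
    """Map(k1, v1) -> list(k2, v2): emit (word, 1) for each word in a line."""
    return [(word.lower(), 1) for word in v1.split()]


def reduce_fn(k2: str, values: List[int]) -> List[int]:
    """Reduce(k2, list(v2)) -> list(v3): sum the counts for one word."""
    return [sum(values)]


def run_mapreduce(
    records: Iterable[Tuple[int, str]],
    mapper: Callable[[int, str], List[Tuple[str, int]]],
    reducer: Callable[[str, List[int]], List[int]],
) -> Dict[str, List[int]]:
    """Run map, group by key, then reduce, all in one process."""
    groups: Dict[str, List[int]] = defaultdict(list)
    # Map phase: every input pair is independent, so a real framework
    # could run these calls in parallel on different nodes.
    for k1, v1 in records:
        for k2, v2 in mapper(k1, v1):
            groups[k2].append(v2)  # group intermediate values by key
    # Reduce phase: one call per distinct intermediate key.
    return {k2: reducer(k2, v2s) for k2, v2s in sorted(groups.items())}


lines = [(0, "to be or not"), (1, "to be")]
print(run_mapreduce(lines, map_fn, reduce_fn))
# {'be': [2], 'not': [1], 'or': [1], 'to': [2]}
```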
9.3.3 Pig and Pig Latin
Pig is built on top of Hadoop and provides a high-level data flow language
called Pig Latin [8] for expressing data queries and processing. It is
similar to the SQL of a relational database management system (RDBMS), but
it is procedural in style and gives more control over the flow of the data
and its optimization. Pig scripts are compiled by Pig into sequences of
MapReduce jobs, which are executed in the Hadoop MapReduce environment.
The Pig data model contains scalar types that have a single atomic value
(integer, long, etc.), and three complex types that can contain other
types: Tuple is a data record consisting of a sequence of “fields,” which
can be any data type; Bag is a collection of tuples, similar to a “table”;
Map is a map of a string key to a value, which can be any data type.
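As a rough analogy only, the three complex types can be mirrored with built-in Python containers. This is not Pig code; the variable names and the helper field are hypothetical, and the mapping (Python tuple, list of tuples, and dict) is an assumption chosen for illustration.

```python
# Tuple: an ordered sequence of fields; each field may be of any type.
student = ("alice", 21, 3.7)

# Bag: a collection of tuples, much like the rows of a table.
# A list is used here so that duplicate rows are preserved.
enrollments = [("alice", "math"), ("bob", "physics"), ("alice", "math")]

# Map: string keys mapped to values of any type.
profile = {"name": "alice", "age": 21}

# Complex types can nest: this tuple holds a scalar, a bag, and a map.
record = ("alice", enrollments, profile)


def field(t, i):
    """Return field i of a tuple by position (Pig Latin writes this as $i)."""
    return t[i]


print(field(student, 0))            # alice
print(len(field(record, 1)))        # 3
print(field(record, 2)["age"])      # 21
```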
Pig provides a set of operators for data processing. For example: LOAD and
STORE can be used for reading data from and writing data to HDFS. Every
tuple of a data set can be processed with the FOREACH operator. Many
operators are similar to SQL, such as JOIN, GROUP BY, and UNION for
standard data operations. As with many SQL implementations, Pig supports
user-defined functions (UDFs), which allow tasks written in a lower-level
language (Java or Python) to extend Pig. The following Pig script shows how to