1. A programming model and development tools
2. Facilities for program loading and execution, and for process and thread
scheduling
3. System configuration and management tools
The context for all of these framework components is tightly coupled with the
key characteristics of a big data application—algorithms that take advantage
of running lots of tasks in parallel on many computing nodes to analyze lots of
data distributed among many storage nodes. Typically, a big data platform
consists of a collection (or pool) of processing nodes; optimal performance
is achieved when all the processing nodes are kept busy, which means
maintaining a steady allocation of tasks to idle nodes within the pool. Any
big data application must map to this context, and that is
where the programming model comes in. The programming model essentially
describes two aspects of application execution within a parallel environment:
1. How an application is coded
2. How that code maps to the parallel environment
The MapReduce programming model combines the familiar procedural/imperative
approach used by Java or C++ programmers with what is effectively a
functional programming model, such as those found in languages like Lisp
and APL. The similarity rests on MapReduce's dependence on two basic
operations that are applied to sets or lists of key-value pairs:
1. Map, which describes the computation or analysis applied to a set of
input key-value pairs to produce a set of intermediate key-value pairs
2. Reduce, in which the set of values associated with each intermediate
key output by the map operation is combined to produce the results
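The two operations above can be sketched with the classic word-count example. This is a minimal, single-process illustration of the model, not a framework implementation; the function and variable names are invented for this sketch:

```python
from collections import defaultdict

def map_phase(doc_id, text):
    """Map: emit an intermediate (word, 1) pair for every word in one input."""
    return [(word, 1) for word in text.split()]

def reduce_phase(word, counts):
    """Reduce: combine all values associated with one intermediate key."""
    return (word, sum(counts))

def run_mapreduce(documents):
    # Shuffle step: group the intermediate pairs by key.
    groups = defaultdict(list)
    for doc_id, text in documents.items():
        for key, value in map_phase(doc_id, text):
            groups[key].append(value)
    # Reduce step: one call per distinct intermediate key.
    return dict(reduce_phase(k, v) for k, v in groups.items())

docs = {"d1": "big data big analysis", "d2": "big nodes"}
print(run_mapreduce(docs))  # {'big': 3, 'data': 1, 'analysis': 1, 'nodes': 1}
```

Note that each map call sees only its own input pair and each reduce call sees only one key's values; that independence is what a real framework exploits to run the calls on different nodes.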
A MapReduce application is envisioned as a series of basic operations
applied in sequence to a very large collection (millions, billions, or even
more) of data items. These data items are logically organized in a way that
enables the MapReduce execution model to allocate tasks that can be executed
in parallel.
Combining data and computational independence means that both the
data and the computations can be distributed across multiple
storage and processing units and automatically parallelized. This
parallelizability lets the programmer exploit scalable, massively
parallel processing resources for greater processing speed and
performance.
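Because the map calls are independent, they can be dispatched to a pool of workers with no change to the per-item logic. The sketch below uses a thread pool on one machine purely to illustrate the idea; a real framework would distribute the same independent tasks across many nodes. All names here are illustrative:

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def map_word_count(text):
    # One independent map task: emit (word, 1) for each word in its split.
    return [(w, 1) for w in text.split()]

def parallel_word_count(texts, workers=4):
    groups = defaultdict(list)
    # Each map task runs independently on its own input split;
    # the pool keeps idle workers busy, mirroring task allocation in a cluster.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for pairs in pool.map(map_word_count, texts):
            for key, value in pairs:
                groups[key].append(value)
    # Reduce: combine the values gathered for each intermediate key.
    return {key: sum(values) for key, values in groups.items()}

texts = ["big data", "big pool of nodes", "data nodes"]
print(parallel_word_count(texts))
```

The design point is that only the scheduling changes between the serial and parallel versions; the map and reduce logic stays the same, which is what makes the model automatically parallelizable.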
 