1. A programming model and development tools
2. Facilities for program loading and execution, and for process and thread
scheduling
3. System configuration and management tools
The context for all of these framework components is tightly coupled with the
key characteristics of a big data application—algorithms that take advantage
of running lots of tasks in parallel on many computing nodes to analyze lots of
data distributed among many storage nodes. Typically, a big data platform
consists of a collection (or pool) of processing nodes; optimal performance
is achieved when all the processing nodes are kept busy, which means
maintaining a steady allocation of tasks to idle nodes within the pool. Any
big data application must map to this context, and that is
where the programming model comes in. The programming model essentially
describes two aspects of application execution within a parallel environment:
1. How an application is coded
2. How that code maps to the parallel environment
The MapReduce programming model combines the familiar procedural/imperative
approach used by Java or C++ programmers with what is effectively a
functional programming model, such as those found in languages like Lisp
and APL. The similarity rests on MapReduce's dependence on two basic
operations that are applied to sets or lists of key-value pairs:
1. Map, which describes the computation or analysis applied to a set of
input key-value pairs to produce a set of intermediate key-value pairs
2. Reduce, in which the set of values associated with each intermediate
key output by the map operation is combined to produce the results
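The two operations above can be sketched with the classic word-count example. This is a minimal, single-process illustration of the model, not a framework implementation; the function and variable names are invented for this sketch:

```python
from collections import defaultdict

def map_phase(doc_id, text):
    """Map: emit an intermediate (word, 1) pair for every word in one input."""
    return [(word, 1) for word in text.split()]

def reduce_phase(word, counts):
    """Reduce: combine all values associated with one intermediate key."""
    return (word, sum(counts))

def run_mapreduce(documents):
    # Shuffle step: group the intermediate pairs by key.
    groups = defaultdict(list)
    for doc_id, text in documents.items():
        for key, value in map_phase(doc_id, text):
            groups[key].append(value)
    # Reduce step: one call per distinct intermediate key.
    return dict(reduce_phase(k, v) for k, v in groups.items())

docs = {"d1": "big data big analysis", "d2": "big nodes"}
print(run_mapreduce(docs))  # {'big': 3, 'data': 1, 'analysis': 1, 'nodes': 1}
```

Note that each map call sees only its own input pair and each reduce call sees only one key's values; that independence is what a real framework exploits to run the calls on different nodes.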
A MapReduce application is envisioned as a series of basic operations
applied in sequence to a very large collection (millions, billions, or even
more) of data items. These data items are logically organized in a way that
enables the MapReduce execution model to allocate tasks that can be executed
in parallel.
Combining data and computational independence means that both the
data and the computations can be distributed across multiple
storage and processing units and automatically parallelized. This
parallelizability lets the programmer exploit scalable, massively
parallel processing resources for greater processing speed and
performance.
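Because the map calls are independent, they can be dispatched to a pool of workers with no change to the per-item logic. The sketch below uses a thread pool on one machine purely to illustrate the idea; a real framework would distribute the same independent tasks across many nodes. All names here are illustrative:

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def map_word_count(text):
    # One independent map task: emit (word, 1) for each word in its split.
    return [(w, 1) for w in text.split()]

def parallel_word_count(texts, workers=4):
    groups = defaultdict(list)
    # Each map task runs independently on its own input split;
    # the pool keeps idle workers busy, mirroring task allocation in a cluster.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for pairs in pool.map(map_word_count, texts):
            for key, value in pairs:
                groups[key].append(value)
    # Reduce: combine the values gathered for each intermediate key.
    return {key: sum(values) for key, values in groups.items()}

texts = ["big data", "big pool of nodes", "data nodes"]
print(parallel_word_count(texts))
```

The design point is that only the scheduling changes between the serial and parallel versions; the map and reduce logic stays the same, which is what makes the model automatically parallelizable.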
 