map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
        EmitIntermediate(w, "1");

reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result));
FIGURE 2.1 An example MapReduce program. (From J. Dean and S. Ghemawat,
MapReduce: Simplified data processing on large clusters, in: OSDI, pp. 137-150, 2004.)
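The pseudocode in Figure 2.1 can be expressed as a runnable single-process sketch. This is illustrative only: the `map_fn`, `reduce_fn`, and `run_job` names and the in-memory grouping step are assumptions, since the real framework distributes the shuffle across a cluster.

```python
# Single-process sketch of the word-count program in Figure 2.1.
# The real framework partitions documents across nodes and shuffles
# intermediate pairs over the network; here everything is in memory.
from collections import defaultdict

def map_fn(key, value):
    # key: document name, value: document contents
    for w in value.split():
        yield (w, "1")

def reduce_fn(key, values):
    # key: a word, values: a list of counts
    return str(sum(int(v) for v in values))

def run_job(documents):
    intermediate = defaultdict(list)
    for name, contents in documents.items():
        for w, count in map_fn(name, contents):
            intermediate[w].append(count)  # stand-in for the shuffle/group-by-key phase
    return {w: reduce_fn(w, counts) for w, counts in intermediate.items()}

print(run_job({"doc1": "to be or not to be"}))
# → {'to': '2', 'be': '2', 'or': '1', 'not': '1'}
```

Note that, as in the figure, counts travel as strings ("1") and are parsed in the reducer; a production job would use typed values.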
The map function emits each word plus an associated count of occurrences, while the reduce function
sums together all counts emitted for a particular word. In principle, the design of the
MapReduce framework is based on the following main principles [36]:
Low-cost unreliable commodity hardware: Instead of using expensive,
high-performance, reliable symmetric multiprocessing (SMP) or massively
parallel processing (MPP) machines equipped with high-end network and
storage subsystems, the MapReduce framework is designed to run on large
clusters of commodity hardware. This hardware is managed and powered
by open-source operating systems and utilities, which keeps costs low.
Extremely Scalable RAIN Cluster: Instead of using centralized RAID-based
SAN or NAS storage systems, every MapReduce node has its own local
off-the-shelf hard drives. These nodes are loosely coupled: they are placed
in racks that can be connected with standard networking hardware, and they
can be taken out of service with almost no impact on still-running MapReduce
jobs. Such clusters are called redundant arrays of independent (and
inexpensive) nodes (RAIN).
Fault-Tolerant yet Easy to Administer: MapReduce jobs can run on clusters
with thousands of nodes or even more. These nodes are not very reliable:
at any point in time, a certain percentage of the commodity nodes or hard
drives will be out of order. Hence, the MapReduce framework applies
straightforward mechanisms to replicate data and launch backup tasks so
as to keep still-running processes going. To handle crashed nodes, system
administrators simply take the crashed hardware offline. New nodes can be
plugged in at any time without much administrative hassle. There are no
complicated backup, restore, and recovery configurations like those found
in many DBMSs.
Highly Parallel yet Abstracted: The most important contribution of the
MapReduce framework is its ability to automatically parallelize task
execution. It thus allows developers to focus mainly on the problem at
hand rather than on low-level implementation details such as memory
management, file allocation, and parallel, multithreaded, or network
programming. Moreover, MapReduce's shared-nothing architecture [120]
makes it highly scalable and naturally amenable to parallelization.
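The last two principles can be sketched together: the developer supplies only the per-partition work, while an executor handles scheduling, and each task touches only its own shared-nothing partition. In this sketch a thread pool stands in for a cluster of nodes, and all function names are illustrative assumptions rather than part of any MapReduce API.

```python
# Sketch of "highly parallel yet abstracted": the developer writes
# only map_partition; the executor decides how tasks run in parallel.
from concurrent.futures import ThreadPoolExecutor
from collections import Counter

def map_partition(lines):
    # Shared-nothing: each task reads only its own partition,
    # so partitions can be processed independently (or retried
    # on another node if one fails, as with backup tasks).
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

def word_count(partitions):
    # The pool (a stand-in for the cluster scheduler) parallelizes
    # the map phase; the loop below merges the partial results.
    with ThreadPoolExecutor() as pool:
        partials = pool.map(map_partition, partitions)
    total = Counter()
    for c in partials:  # reduce: merge per-partition counts
        total += c
    return dict(total)

print(word_count([["a b a"], ["b c"]]))
# → {'a': 2, 'b': 2, 'c': 1}
```

Because no state is shared between partitions, adding nodes (here, threads) requires no change to the developer-written logic, which is exactly what the shared-nothing design buys.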