Database Reference
In-Depth Information
2
MapReduce and the New Software Stack
Modern data-mining applications, often called “big-data” analysis, require us to manage im-
mense amounts of data quickly. In many of these applications, the data is extremely regular,
and there is ample opportunity to exploit parallelism. Important examples are:
(1) The ranking of Web pages by importance, which involves an iterated matrix-vector
multiplication where the dimension is many billions.
(2) Searches in “friends” networks at social-networking sites, which involve graphs with
hundreds of millions of nodes and many billions of edges.
To deal with applications such as these, a new software stack has evolved. These program-
ming systems are designed to get their parallelism not from a “super-computer,” but from
“computing clusters” - large collections of commodity hardware, including conventional
processors (“compute nodes”) connected by Ethernet cables or inexpensive switches. The
software stack begins with a new form of file system, called a “distributed file system,”
which features much larger units than the disk blocks in a conventional operating system.
Distributed file systems also provide replication of data or redundancy to protect against the
frequent media failures that occur when data is distributed over thousands of low-cost com-
pute nodes.
On top of these file systems, many different higher-level programming systems have been
developed. Central to the new software stack is a programming system called MapReduce .
Implementations of MapReduce enable many of the most common calculations on large-
scale data to be performed on computing clusters efficiently and in a way that is tolerant of
hardware failures during the computation.
MapReduce systems are evolving and extending rapidly. Today, it is common for MapRe-
duce programs to be created from still higher-level programming systems, often an imple-
mentation of SQL. Further, MapReduce turns out to be a useful, but simple, case of more
general and powerful ideas. We include in this chapter a discussion of generalizations of
MapReduce, first to systems that support acyclic workflows and then to systems that imple-
ment recursive algorithms.
Search WWH ::




Custom Search