MapReduce and the New Software Stack - Mining of Massive Datasets

Database Reference

In-Depth Information

2

MapReduce and the New Software Stack

Modern data-mining applications, often called “big-data” analysis, require us to manage im-

mense amounts of data quickly. In many of these applications, the data is extremely regular,

and there is ample opportunity to exploit parallelism. Important examples are:

(1) The ranking of Web pages by importance, which involves an iterated matrix-vector

multiplication where the dimension is many billions.

(2) Searches in “friends” networks at social-networking sites, which involve graphs with

hundreds of millions of nodes and many billions of edges.

To deal with applications such as these, a new software stack has evolved. These program-

ming systems are designed to get their parallelism not from a “super-computer,” but from

“computing clusters” - large collections of commodity hardware, including conventional

processors (“compute nodes”) connected by Ethernet cables or inexpensive switches. The

software stack begins with a new form of file system, called a “distributed file system,”

which features much larger units than the disk blocks in a conventional operating system.

Distributed file systems also provide replication of data or redundancy to protect against the

frequent media failures that occur when data is distributed over thousands of low-cost com-

pute nodes.

On top of these file systems, many different higher-level programming systems have been

developed. Central to the new software stack is a programming system called MapReduce .

Implementations of MapReduce enable many of the most common calculations on large-

scale data to be performed on computing clusters efficiently and in a way that is tolerant of

hardware failures during the computation.

MapReduce systems are evolving and extending rapidly. Today, it is common for MapRe-

duce programs to be created from still higher-level programming systems, often an imple-

mentation of SQL. Further, MapReduce turns out to be a useful, but simple, case of more

general and powerful ideas. We include in this chapter a discussion of generalizations of

MapReduce, first to systems that support acyclic workflows and then to systems that imple-

ment recursive algorithms.

Search WWH ::

Custom Search

Home