airline reservation system would not be suitable for a DFS, even if the data were
very large, because the data is changed so frequently.
DFS Implementations
There are several distributed file systems of the type we have described that are used in practice. Among these:

(1) The Google File System (GFS), the original of the class.

(2) Hadoop Distributed File System (HDFS), an open-source DFS used with Hadoop, an implementation of MapReduce (see Section 2.2) and distributed by the Apache Software Foundation.

(3) CloudStore, an open-source DFS originally developed by Kosmix.
Files are divided into chunks, which are typically 64 megabytes in size. Chunks are replicated, perhaps three times, at three different compute nodes. Moreover, the nodes holding copies of one chunk should be located on different racks, so we don't lose all copies due to a rack failure. Normally, both the chunk size and the degree of replication can be decided by the user.
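The rack-aware placement rule above can be illustrated with a small sketch. The function below is hypothetical (real systems such as HDFS use more elaborate placement policies); it simply picks one replica per rack until the requested degree of replication is reached, guaranteeing that no single rack failure destroys every copy of a chunk.

```python
def place_replicas(nodes, replication=3):
    """Choose `replication` nodes for one chunk so that no two chosen
    nodes share a rack. `nodes` is a list of (node_name, rack_id) pairs.
    A simplified sketch, not an actual DFS placement policy."""
    placement = []
    used_racks = set()
    for name, rack in nodes:
        if rack not in used_racks:      # at most one replica per rack
            placement.append(name)
            used_racks.add(rack)
        if len(placement) == replication:
            return placement
    raise ValueError("not enough distinct racks for the requested replication")
```

For example, given five nodes spread over three racks, the sketch selects one node from each rack, so losing any one rack still leaves two copies of the chunk.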
To find the chunks of a file, there is another small file called the master node or name node for that file. The master node is itself replicated, and a directory for the file system as a whole knows where to find its copies. The directory itself can be replicated, and all participants using the DFS know where the directory copies are.
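To make the lookup chain concrete, here is a minimal in-memory sketch of the metadata a name node keeps. The file path, chunk names, and node names are all invented for illustration; a real master node also tracks chunk versions, leases, and liveness.

```python
# Hypothetical name-node metadata: each file maps to an ordered list of
# its chunks, and each chunk maps to the nodes holding its replicas.
file_chunks = {
    "/logs/2024.txt": ["chunk-0", "chunk-1"],
}
chunk_locations = {
    "chunk-0": ["node-a", "node-d", "node-g"],
    "chunk-1": ["node-b", "node-e", "node-h"],
}

def locate(path):
    """For each chunk of `path`, return the replica nodes it can be
    read from, in chunk order."""
    return [(chunk, chunk_locations[chunk]) for chunk in file_chunks[path]]
```

A client first asks the directory for the name node, then asks the name node (via something like `locate`) for the replica locations, and finally reads chunk data directly from one of the listed compute nodes.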
2.2 MapReduce
MapReduce is a style of computing that has been implemented in several systems, including Google's internal implementation (simply called MapReduce) and the popular open-source implementation Hadoop, which can be obtained, along with the HDFS file system, from the Apache Software Foundation. You can use an implementation of MapReduce to manage many large-scale computations in a way that is tolerant of hardware faults. All you need to write are two functions, called Map and Reduce, while the system manages the parallel execution, coordinates the tasks that execute Map or Reduce, and deals with the possibility that one of these tasks will fail to execute. In brief, a MapReduce computation executes as follows:
(1) Some number of Map tasks are each given one or more chunks from a distributed file system. These Map tasks turn the chunk into a sequence of key-value pairs. The way key-value pairs are produced from the input data is determined by the code written by the user for the Map function.
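The user-written Map function just described can be sketched for the classic word-count computation, where the key-value pairs are (word, 1) for each word occurrence in the chunk. This is one possible Map function, not the only one; the chunk is treated here as a plain string of text for simplicity.

```python
def map_fn(chunk):
    """A user-written Map function for word count: turn a chunk of
    text into a sequence of (word, 1) key-value pairs, one pair per
    word occurrence."""
    for line in chunk.splitlines():
        for word in line.split():
            yield (word, 1)
```

A chunk containing "the cat\nthe hat" would yield the pairs ("the", 1), ("cat", 1), ("the", 1), ("hat", 1); note that the same key may appear many times, which is what the later grouping and Reduce steps exploit.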