An airline reservation system would not be suitable for a DFS, even if the data were
very large, because the data changes so frequently.
DFS Implementations
Several distributed file systems of the type we have described are used in practice. Among them:
(1) The Google File System (GFS), the original of the class.
(2) Hadoop Distributed File System (HDFS), an open-source DFS used with Hadoop, an implementation of MapReduce (see Section 2.2), distributed by the Apache Software Foundation.
(3) CloudStore, an open-source DFS originally developed by Kosmix.
Files are divided into chunks, which are typically 64 megabytes in size. Chunks are replicated, perhaps three times, at three different compute nodes. Moreover, the nodes holding copies of one chunk should be located on different racks, so that a single rack failure does not destroy all copies of a chunk. Normally, both the chunk size and the degree of replication can be chosen by the user.
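The rack-aware placement rule described above can be sketched as follows. This is a minimal illustration, not the actual placement algorithm of GFS or HDFS; the node and rack names are invented for the example.

```python
def place_replicas(nodes, replication=3):
    """Pick `replication` nodes for one chunk, each on a different rack.

    `nodes` is a list of (node_id, rack_id) pairs; both identifiers are
    illustrative, not part of any real DFS API.
    """
    chosen, racks_used = [], set()
    for node_id, rack_id in nodes:
        if rack_id not in racks_used:       # skip nodes on an already-used rack
            chosen.append(node_id)
            racks_used.add(rack_id)
        if len(chosen) == replication:
            return chosen
    raise ValueError("not enough distinct racks for the requested replication")

nodes = [("n1", "rackA"), ("n2", "rackA"), ("n3", "rackB"),
         ("n4", "rackC"), ("n5", "rackB")]
print(place_replicas(nodes))  # ['n1', 'n3', 'n4'] -- three different racks
```

Because each replica lands on a distinct rack, losing any one rack still leaves two copies of the chunk available.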
To find the chunks of a file, there is another small file called the master node or name
node for that file. The master node is itself replicated, and a directory for the file system
as a whole knows where to find its copies. The directory itself can be replicated, and all
participants using the DFS know where the directory copies are.
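The two-level lookup just described, from the directory to a master (name) node and then to chunk locations, can be sketched with plain dictionaries. The data structures here are illustrative only; real GFS/HDFS metadata is far richer.

```python
# Hypothetical metadata, for illustration only.
directory = {"logs.txt": "master-7"}      # file -> master (name) node for that file
master_nodes = {
    "master-7": {                         # chunk index -> compute nodes holding replicas
        0: ["n1", "n3", "n4"],
        1: ["n2", "n3", "n5"],
    },
}

def locate_chunks(filename):
    """Follow the lookup path in the text: directory first, then master node."""
    master = directory[filename]
    return master_nodes[master]

print(locate_chunks("logs.txt")[0])  # ['n1', 'n3', 'n4']
```

A client caches the directory's location, so finding any chunk of any file costs only these two lookups.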
2.2 MapReduce
MapReduce is a style of computing that has been implemented in several systems, including Google's internal implementation (simply called MapReduce) and the popular open-source implementation Hadoop, which can be obtained, along with the HDFS file system, from the Apache Foundation. You can use an implementation of MapReduce to manage many large-scale computations in a way that is tolerant of hardware faults. All you need to write are two functions, called Map and Reduce, while the system manages the parallel execution, coordinates the tasks that execute Map or Reduce, and deals with the possibility that one of these tasks will fail to execute. In brief, a MapReduce computation executes as follows:
(1) Some number of Map tasks are each given one or more chunks from a distributed file system. These Map tasks turn the chunks into sequences of key-value pairs. The way key-value pairs are produced from the input data is determined by the code written by the user for the Map function.
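As an example of step (1), a user-written Map function for counting words might emit one (word, 1) pair per word in its chunk. This is only a sketch of the classic word-count illustration; a Map function can produce any key-value pairs the user's application needs.

```python
def map_word_count(chunk):
    """A user-written Map function: turn a chunk of text into
    (word, 1) key-value pairs, one pair per occurrence."""
    pairs = []
    for word in chunk.split():
        pairs.append((word.lower(), 1))
    return pairs

print(map_word_count("the quick the"))
# [('the', 1), ('quick', 1), ('the', 1)]
```

Note that duplicate keys are expected: the system later groups pairs by key so that a Reduce function can combine the values for each word.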