airline reservation system would not be suitable for a DFS, even if the data were
very large, because the data is changed so frequently.
DFS Implementations
There are several distributed file systems of the type we have described that are used in practice. Among these:

(1) The Google File System (GFS), the original of the class.

(2) Hadoop Distributed File System (HDFS), an open-source DFS used with Hadoop, an implementation of MapReduce (see Section 2.2) and distributed by the Apache Software Foundation.

(3) CloudStore, an open-source DFS originally developed by Kosmix.
Files are divided into chunks, which are typically 64 megabytes in size. Chunks are replicated, perhaps three times, at three different compute nodes. Moreover, the nodes holding copies of one chunk should be located on different racks, so we don't lose all copies due to a rack failure. Normally, both the chunk size and the degree of replication can be decided by the user.
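The rack-aware placement rule above can be illustrated with a small sketch. The function below is hypothetical (real systems such as HDFS use more elaborate placement policies); it simply picks one replica per rack until the requested degree of replication is reached, guaranteeing that no single rack failure destroys every copy of a chunk.

```python
def place_replicas(nodes, replication=3):
    """Choose `replication` nodes for one chunk so that no two chosen
    nodes share a rack. `nodes` is a list of (node_name, rack_id) pairs.
    A simplified sketch, not an actual DFS placement policy."""
    placement = []
    used_racks = set()
    for name, rack in nodes:
        if rack not in used_racks:      # at most one replica per rack
            placement.append(name)
            used_racks.add(rack)
        if len(placement) == replication:
            return placement
    raise ValueError("not enough distinct racks for the requested replication")
```

For example, given five nodes spread over three racks, the sketch selects one node from each rack, so losing any one rack still leaves two copies of the chunk.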
To find the chunks of a file, there is another small file called the master node or name node for that file. The master node is itself replicated, and a directory for the file system as a whole knows where to find its copies. The directory itself can be replicated, and all participants using the DFS know where the directory copies are.
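To make the lookup chain concrete, here is a minimal in-memory sketch of the metadata a name node keeps. The file path, chunk names, and node names are all invented for illustration; a real master node also tracks chunk versions, leases, and liveness.

```python
# Hypothetical name-node metadata: each file maps to an ordered list of
# its chunks, and each chunk maps to the nodes holding its replicas.
file_chunks = {
    "/logs/2024.txt": ["chunk-0", "chunk-1"],
}
chunk_locations = {
    "chunk-0": ["node-a", "node-d", "node-g"],
    "chunk-1": ["node-b", "node-e", "node-h"],
}

def locate(path):
    """For each chunk of `path`, return the replica nodes it can be
    read from, in chunk order."""
    return [(chunk, chunk_locations[chunk]) for chunk in file_chunks[path]]
```

A client first asks the directory for the name node, then asks the name node (via something like `locate`) for the replica locations, and finally reads chunk data directly from one of the listed compute nodes.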
2.2 MapReduce
MapReduce is a style of computing that has been implemented in several systems, including Google's internal implementation (simply called MapReduce) and the popular open-source implementation Hadoop, which can be obtained, along with the HDFS file system, from the Apache Software Foundation. You can use an implementation of MapReduce to manage many large-scale computations in a way that is tolerant of hardware faults. All you need to write are two functions, called Map and Reduce, while the system manages the parallel execution, coordinates the tasks that execute Map or Reduce, and deals with the possibility that one of these tasks will fail to execute. In brief, a MapReduce computation executes as follows:
(1) Some number of Map tasks are each given one or more chunks from a distributed file system. These Map tasks turn the chunk into a sequence of key-value pairs. The way key-value pairs are produced from the input data is determined by the code written by the user for the Map function.
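The user-written Map function just described can be sketched for the classic word-count computation, where the key-value pairs are (word, 1) for each word occurrence in the chunk. This is one possible Map function, not the only one; the chunk is treated here as a plain string of text for simplicity.

```python
def map_fn(chunk):
    """A user-written Map function for word count: turn a chunk of
    text into a sequence of (word, 1) key-value pairs, one pair per
    word occurrence."""
    for line in chunk.splitlines():
        for word in line.split():
            yield (word, 1)
```

A chunk containing "the cat\nthe hat" would yield the pairs ("the", 1), ("cat", 1), ("the", 1), ("hat", 1); note that the same key may appear many times, which is what the later grouping and Reduce steps exploit.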