• Jobs need to be monitored and managed so that any errors that are
encountered are handled properly and the job can continue to execute even
if part of the system fails.
• Input data needs to be spread across the cluster.
• Map step processing of the input needs to be conducted across the
distributed system, preferably on the same machines where the data
resides.
• Intermediate outputs from the numerous map steps need to be collected
and provided to the proper machines for the reduce step execution.
• Final output needs to be made available for use by another user, another
application, or perhaps another MapReduce job.
Fortunately, Apache Hadoop handles these activities and more. Furthermore,
many of these activities are transparent to the developer/user. The following
material examines the implementation of MapReduce in Hadoop, an open source
project managed and licensed by the Apache Software Foundation [11].
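As a concrete illustration of these steps, the following is a minimal sketch of a Hadoop MapReduce job written against Hadoop's Java MapReduce API: the classic word count. The class name and the command-line input and output paths are illustrative assumptions, not details from the text.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map step: runs close to where each input block resides and emits
    // a (word, 1) pair for every word in its portion of the input.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce step: receives all intermediate counts collected for a given
    // word from the mappers and sums them.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        // The driver only describes the job; Hadoop handles monitoring,
        // data distribution, and the shuffle of intermediate output.
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}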
Hadoop originated in Nutch, an open source search engine developed by Doug
Cutting and Mike Cafarella. Based on two Google papers [9] [12], versions of
MapReduce and the Google File System were added to Nutch in 2004. In 2006,
Yahoo! hired Cutting, who helped to develop Hadoop based on the code in Nutch
[13]. The name "Hadoop" came from Cutting's child's stuffed toy elephant, which
also inspired the well-recognized elephant symbol for the Hadoop project.
Next, an overview of how data is stored in a Hadoop environment is presented.
Hadoop Distributed File System (HDFS)
Based on the Google File System [12], the Hadoop Distributed File System (HDFS)
is a file system that provides the capability to distribute data across a cluster to take
advantage of the parallel processing of MapReduce. HDFS is not an alternative
to common file systems, such as ext3, ext4, and XFS. In fact, HDFS depends on
each disk drive's file system to manage the data being stored to the drive media.
The Hadoop Wiki [14] provides more details on disk configuration options and
considerations.
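As a point of reference, the following is a minimal sketch of interacting with HDFS through Hadoop's Java FileSystem API, copying a local file into the cluster and listing the result. The NameNode address and the paths shown are illustrative assumptions, not values from the text.

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsPutAndList {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Illustrative NameNode address; in practice this is read from core-site.xml.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        // Copy a local file into HDFS. HDFS distributes the stored data across
        // the cluster, while each node's local file system (ext4, XFS, ...)
        // manages the bytes actually written to its disks.
        fs.copyFromLocalFile(new Path("/tmp/input.txt"),
                             new Path("/user/analyst/input.txt"));

        // List what is now stored under the target directory.
        for (FileStatus status : fs.listStatus(new Path("/user/analyst"))) {
            System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
        }

        fs.close();
    }
}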
For a given file, HDFS breaks the file into blocks of a fixed size, say 64 MB,
and stores the blocks across the cluster. So, if a file size is 300 MB, the file
is stored in five blocks: four 64 MB blocks and one 44 MB block. If a file is
smaller than 64 MB, its single block is only as large as the file itself.
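The block arithmetic can be sketched directly; the 64 MB block size below is the example value used above (the actual block size is configurable per cluster and per file).

public class BlockLayout {
    public static void main(String[] args) {
        final long MB = 1024L * 1024L;
        long blockSize = 64 * MB;   // assumed HDFS block size
        long fileSize = 300 * MB;   // file from the example above

        // Full 64 MB blocks plus one partial block for any remainder.
        long fullBlocks = fileSize / blockSize;
        long remainder = fileSize % blockSize;
        long totalBlocks = fullBlocks + (remainder > 0 ? 1 : 0);

        System.out.println("Total blocks: " + totalBlocks);      // 5
        System.out.println("Full 64 MB blocks: " + fullBlocks);  // 4
        System.out.println("Last block size (MB): "
                + (remainder > 0 ? remainder / MB : blockSize / MB)); // 44
    }
}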