• Jobs need to be monitored and managed so that any errors that are
encountered are handled properly and the job can continue to execute even
if part of the system fails.
• Input data needs to be spread across the cluster.
• Map step processing of the input needs to be conducted across the
distributed system, preferably on the same machines where the data
resides.
• Intermediate outputs from the numerous map steps need to be collected
and provided to the proper machines for the reduce step execution.
• Final output needs to be made available for use by another user, another
application, or perhaps another MapReduce job.
Fortunately, Apache Hadoop handles these activities and more. Furthermore,
many of these activities are transparent to the developer/user. The following
material examines the implementation of MapReduce in Hadoop, an open source
project managed and licensed by the Apache Software Foundation [11].
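As a concrete illustration of these steps, the following is a minimal sketch of a Hadoop MapReduce job written against Hadoop's Java MapReduce API: the classic word count. The class name and the command-line input and output paths are illustrative assumptions, not details from the text.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map step: runs close to where each input block resides and emits
    // a (word, 1) pair for every word in its portion of the input.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce step: receives all intermediate counts collected for a given
    // word from the mappers and sums them.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        // The driver only describes the job; Hadoop handles monitoring,
        // data distribution, and the shuffle of intermediate output.
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}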
Hadoop originated in Nutch, an open source search engine developed by Doug
Cutting and Mike Cafarella. Based on two Google papers [9] [12], versions of
MapReduce and the Google File System were added to Nutch in 2004. In 2006,
Yahoo! hired Cutting, who helped to develop Hadoop based on the code in Nutch
[13]. The name "Hadoop" came from Cutting's child's stuffed toy elephant, which
also inspired the well-recognized elephant symbol for the Hadoop project.
Next, an overview of how data is stored in a Hadoop environment is presented.
Hadoop Distributed File System (HDFS)
Based on the Google File System [12], the Hadoop Distributed File System (HDFS)
is a file system that provides the capability to distribute data across a cluster to take
advantage of the parallel processing of MapReduce. HDFS is not an alternative
to common file systems, such as ext3, ext4, and XFS. In fact, HDFS depends on
each disk drive's file system to manage the data being stored to the drive media.
The Hadoop Wiki [14] provides more details on disk configuration options and
considerations.
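As a point of reference, the following is a minimal sketch of interacting with HDFS through Hadoop's Java FileSystem API, copying a local file into the cluster and listing the result. The NameNode address and the paths shown are illustrative assumptions, not values from the text.

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsPutAndList {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Illustrative NameNode address; in practice this is read from core-site.xml.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        // Copy a local file into HDFS. HDFS distributes the stored data across
        // the cluster, while each node's local file system (ext4, XFS, ...)
        // manages the bytes actually written to its disks.
        fs.copyFromLocalFile(new Path("/tmp/input.txt"),
                             new Path("/user/analyst/input.txt"));

        // List what is now stored under the target directory.
        for (FileStatus status : fs.listStatus(new Path("/user/analyst"))) {
            System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
        }

        fs.close();
    }
}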
For a given file, HDFS breaks the file into blocks of a fixed size, say 64 MB,
and stores the blocks across the cluster. So, if a file size is 300 MB, the file
is stored in five blocks: four 64 MB blocks and one 44 MB block. If a file is
smaller than 64 MB, its single block is only as large as the file itself.
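The block arithmetic can be sketched directly; the 64 MB block size below is the example value used above (the actual block size is configurable per cluster and per file).

public class BlockLayout {
    public static void main(String[] args) {
        final long MB = 1024L * 1024L;
        long blockSize = 64 * MB;   // assumed HDFS block size
        long fileSize = 300 * MB;   // file from the example above

        // Full 64 MB blocks plus one partial block for any remainder.
        long fullBlocks = fileSize / blockSize;
        long remainder = fileSize % blockSize;
        long totalBlocks = fullBlocks + (remainder > 0 ? 1 : 0);

        System.out.println("Total blocks: " + totalBlocks);      // 5
        System.out.println("Full 64 MB blocks: " + fullBlocks);  // 4
        System.out.println("Last block size (MB): "
                + (remainder > 0 ? remainder / MB : blockSize / MB)); // 44
    }
}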