With a MapReduce job, the data set is divided into chunks so that each chunk can be processed
independently by a map task running close to the data; the framework then sorts the map output
and supplies it to the reduce tasks. The Hadoop framework automatically manages task scheduling
and data distribution.
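As a concrete illustration (not taken from the text), the sketch below shows a typical Hadoop 2.x job driver. The mapper and reducer class names and the input/output paths are placeholders; a mapper/reducer sketch appears later in the MapReduce section. Once the job is submitted, the framework splits the input, schedules the map and reduce tasks, and performs the shuffle and sort on its own.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        // Placeholder mapper/reducer classes, sketched in the MapReduce section below.
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // The framework splits the input automatically; one map task runs per split.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```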
Obviously, such a huge amount of data cannot be managed on a single node and has to be
distributed across multiple nodes. Two issues are clearly visible here:
Data distribution
Co-located data processing
HDFS
Data distribution in Hadoop is managed by the Hadoop Distributed File System
(HDFS), one of the submodules of the Apache Hadoop project. It is specifically
designed to build a scalable storage system on top of less expensive, commodity hardware.
HDFS is based on a master/slave architecture, where a single process known as the NameNode
runs on the master node and manages all the metadata about data files and their replication
across the cluster of nodes.
For more information about HDFS architecture, please refer to
http://hadoop.apache.org/docs/r1.2.1/hdfs_design.html.
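To show how an application sees this distributed storage, here is a minimal client-side sketch (not from the text) that reads a file through the standard Hadoop FileSystem API. The NameNode address and file path are placeholders; the client asks the NameNode for block locations and then streams the data from the DataNodes holding the replicas.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        // fs.defaultFS points at the NameNode; host and port here are placeholders.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode-host:8020");

        // Open a file stored in HDFS and print it line by line.
        try (FileSystem fs = FileSystem.get(conf);
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(fs.open(new Path("/data/sample.txt")),
                                           StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```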
MapReduce
The MapReduce implementation is the answer to the second issue, co-located data processing:
it is a framework for parallel processing of large amounts of data distributed across multiple nodes.
Three primary tasks performed during the MapReduce process are:
Split and map
Shuffle and sort
Reduce and output
Upon submitting a job, the input data is split into chunks, and each chunk is assigned to a
mapper as a parallel map task. Each mapper generates key-value pairs as output, which are
shuffled, sorted by key, and supplied as input to the reducers. Mappers are scheduled, wherever
possible, on the data nodes that hold their input split, so they operate on local data (data locality).
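To make this key-value flow concrete, the following word-count sketch (not from the text) implements a mapper and reducer against the Hadoop 2.x mapreduce API; these are the placeholder classes referenced by the driver sketch earlier. The mapper emits (word, 1) pairs from its local split, the framework shuffles and sorts them by key, and the reducer sums the counts for each word.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: each mapper processes its local split and emits (word, 1) pairs.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);   // key-value pair handed to shuffle/sort
        }
    }
}

// Reduce phase: after the shuffle, each reducer receives all values for a key
// and emits the aggregated count. (Kept package-private here only so both
// classes fit in one file for brevity.)
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        context.write(word, new IntWritable(sum));
    }
}
```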