With a MapReduce job, the data set is divided into chunks so that each chunk can be processed
independently by a map task running close to the data; the framework then sorts the map output
and supplies it to the reduce tasks. The Hadoop framework automatically manages task scheduling
and data distribution.
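As a concrete illustration (not taken from the text), the sketch below shows a typical Hadoop 2.x job driver. The mapper and reducer class names and the input/output paths are placeholders; a mapper/reducer sketch appears later in the MapReduce section. Once the job is submitted, the framework splits the input, schedules the map and reduce tasks, and performs the shuffle and sort on its own.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        // Placeholder mapper/reducer classes, sketched in the MapReduce section below.
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // The framework splits the input automatically; one map task runs per split.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```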
Obviously, such a huge amount of data cannot be managed on a single node and has to be
distributed across multiple nodes. Two issues are clearly visible here:
Data distribution
Co-located data processing
HDFS
Data distribution in Hadoop is managed by the Hadoop Distributed File System
(HDFS), one of the submodules of the Apache Hadoop project. It is specifically
designed to build a scalable storage system on top of less expensive, commodity hardware.
HDFS is based on a master/slave architecture, where a single process known as the NameNode
runs on the master node and manages all the metadata about data files and their replication
across the cluster of nodes.
For more information about HDFS architecture, please refer to
http://hadoop.apache.org/docs/r1.2.1/hdfs_design.html.
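To show how an application sees this distributed storage, here is a minimal client-side sketch (not from the text) that reads a file through the standard Hadoop FileSystem API. The NameNode address and file path are placeholders; the client asks the NameNode for block locations and then streams the data from the DataNodes holding the replicas.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        // fs.defaultFS points at the NameNode; host and port here are placeholders.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode-host:8020");

        // Open a file stored in HDFS and print it line by line.
        try (FileSystem fs = FileSystem.get(conf);
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(fs.open(new Path("/data/sample.txt")),
                                           StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```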
MapReduce
The MapReduce implementation is the answer to the second issue, co-located data processing:
it is a framework for parallel processing of large amounts of data distributed across multiple nodes.
Three primary tasks performed during the MapReduce process are:
Split and map
Shuffle and sort
Reduce and output
Upon submitting a job, the input data is split into chunks, and each chunk is assigned to a
mapper as a parallel map task. Each mapper generates key-value pairs as output, which are
shuffled, sorted by key, and supplied as input to the reducers. Mappers are scheduled, wherever
possible, on the data nodes that hold their input split, so they operate on local data (data locality).
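To make this key-value flow concrete, the following word-count sketch (not from the text) implements a mapper and reducer against the Hadoop 2.x mapreduce API; these are the placeholder classes referenced by the driver sketch earlier. The mapper emits (word, 1) pairs from its local split, the framework shuffles and sorts them by key, and the reducer sums the counts for each word.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: each mapper processes its local split and emits (word, 1) pairs.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);   // key-value pair handed to shuffle/sort
        }
    }
}

// Reduce phase: after the shuffle, each reducer receives all values for a key
// and emits the aggregated count. (Kept package-private here only so both
// classes fit in one file for brevity.)
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        context.write(word, new IntWritable(sum));
    }
}
```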