Eventually an answer is arrived at. The map stage is a filter/workload-partition stage: it simply distributes selection criteria across every node. Each node selects data from the HDFS files stored at that node, based on key values. HDFS stores data as a key with other data attached that is undefined (in the sense of not conforming to a schema). It is therefore a primitive key-value store, with each record consisting of a head (the key) and a tail (all other data).
The map phase reads data serially from the file and retains only the records whose keys satisfy the map's selection criteria. Java hooks are provided for any further processing at this stage. The map phase then sends its results to other nodes for reduction, so that records that fit the same criteria end up on the same node. In effect, results are mapped to, and sent to, the appropriate node for reduction.
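As a rough illustration of these map-side hooks, the sketch below shows a mapper written against the standard Hadoop MapReduce Java API. The tab-separated record layout, the class name SelectMapper, and the key-prefix filter are assumptions made purely for illustration; a real job would substitute its own record format and selection criteria.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Reads each line of an HDFS file, splits it into a head (the key) and a
// tail (all other data), and keeps only records whose key matches the
// selection criterion distributed to this node.
public class SelectMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text outKey = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Assumed record layout: key<TAB>rest-of-record.
        String[] parts = line.toString().split("\t", 2);
        String key = parts[0];

        // Hypothetical selection criterion: keep keys beginning with "2014".
        if (key.startsWith("2014")) {
            outKey.set(key);
            // Records with the same key are routed to the same reduce node.
            context.write(outKey, ONE);
        }
    }
}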
The reduce phase processes this data, usually by aggregating, averaging, counting, or some combination of such operations. Java hooks are provided for adding sophistication to this processing. A result of some kind is then produced on each reduce node. Further reduction passes may be carried out to arrive at a final result; this may involve further data passing in the form of mapping and reducing, making up the full Hadoop job. In essence, this is simply parallelization by workload partitioning, with the added nuance of being fault tolerant.
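Continuing the sketch above, and again assuming a reasonably recent Hadoop MapReduce Java API, the following reducer counts the records that arrive for each key, and a small driver wires the mapper and reducer into a complete Hadoop job. The class name CountReducer, the job name, and the command-line input and output paths are illustrative only.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Aggregates the values that arrive for each key (here, a simple count).
public class CountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int total = 0;
        for (IntWritable value : values) {
            total += value.get();
        }
        context.write(key, new IntWritable(total));
    }

    // Driver: wires the mapper and reducer into one Hadoop job.
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "select-and-count");
        job.setJarByClass(CountReducer.class);
        job.setMapperClass(SelectMapper.class);
        job.setReducerClass(CountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}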
Once the map-reduce phase is complete, the processed data is ready for further
analysis by data scientists and others with advanced data analytics skills. Data scientists
can manipulate and analyze the data using any of a number of tools for any number of
uses, including to search for hidden insights and patterns or to use as the foundation to
build user-facing analytic applications. The data can also be modeled and transferred
from Hadoop clusters into existing relational databases, data warehouses, and other
traditional IT systems for further analysis and/or to support transactional processing.
Hadoop Technical Components
A Hadoop “stack” is made up of a number of components. They include:
Hadoop Distributed File System (HDFS): The default storage
layer in any given Hadoop cluster.
Name Node: The node in a Hadoop cluster that provides the
client information on where in the cluster particular data is stored
and whether any nodes have failed.
Secondary Node: A backup to the Name Node; it periodically
replicates and stores the Name Node's data so that the cluster can
recover should the Name Node fail.
Job Tracker: The node in a Hadoop cluster that initiates and
coordinates map-reduce jobs, or the processing of the data (a
client-side configuration sketch follows this list).
Slave Nodes: The grunts of any Hadoop cluster, slave nodes store
data and take direction to process it from the Job Tracker.
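As a minimal sketch of how a client is pointed at these components, and assuming the classic Hadoop 1.x (MRv1) property names, the following Java fragment names the Name Node and Job Tracker addresses; the host names and ports are placeholders.

import org.apache.hadoop.conf.Configuration;

// Points a client at the cluster's Name Node and Job Tracker, assuming the
// classic Hadoop 1.x (MRv1) property names; hosts and ports are placeholders.
public class ClusterConfig {

    public static Configuration create() {
        Configuration conf = new Configuration();
        // The Name Node tells clients where in HDFS particular data is stored.
        conf.set("fs.default.name", "hdfs://namenode.example.com:8020");
        // The Job Tracker initiates and coordinates map-reduce jobs.
        conf.set("mapred.job.tracker", "jobtracker.example.com:8021");
        return conf;
    }
}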
In addition to the above, the Hadoop ecosystem is made up of a number of
complementary sub-components. NoSQL data stores like Cassandra and HBase are
also used to store the results of map-reduce jobs in Hadoop. In addition to Java, some
map-reduce jobs and other Hadoop functions are written in Pig, an open-source
language designed specifically for Hadoop.
 