Eventually an answer is arrived at. The map stage is a filter/workload-partition stage: it simply distributes selection criteria across every node. Each node selects data from the HDFS files stored at that node, based on key values. HDFS stores data as a key with other data attached that is undefined (in the sense of not conforming to a schema). It is therefore a primitive key-value store, with each record consisting of a head (the key) and a tail (all other data).
The map phase reads data serially from the file and retains only the records whose keys satisfy the map's selection criteria. Java hooks are provided for any further processing at this stage. The map phase then sends its results to other nodes for reduction, so that records that fit the same criteria end up on the same node. In effect, results are mapped to, and sent to, the appropriate node for reduction.
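As a rough illustration of these map-side hooks, the sketch below shows a mapper written against the standard Hadoop MapReduce Java API. The tab-separated record layout, the class name SelectMapper, and the key-prefix filter are assumptions made purely for illustration; a real job would substitute its own record format and selection criteria.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Reads each line of an HDFS file, splits it into a head (the key) and a
// tail (all other data), and keeps only records whose key matches the
// selection criterion distributed to this node.
public class SelectMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text outKey = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Assumed record layout: key<TAB>rest-of-record.
        String[] parts = line.toString().split("\t", 2);
        String key = parts[0];

        // Hypothetical selection criterion: keep keys beginning with "2014".
        if (key.startsWith("2014")) {
            outKey.set(key);
            // Records with the same key are routed to the same reduce node.
            context.write(outKey, ONE);
        }
    }
}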
The reduce phase processes this data, usually by aggregating, averaging, counting, or some combination of such operations. Java hooks are provided for adding sophistication to this processing. A result of some kind is then produced on each reduce node. Further reduction passes may be carried out to arrive at a final result; this may involve further data passing in the form of mapping and reducing, making up the full Hadoop job. In essence, this is simply parallelization by workload partitioning, with the added nuance of being fault tolerant.
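Continuing the sketch above, and again assuming a reasonably recent Hadoop MapReduce Java API, the following reducer counts the records that arrive for each key, and a small driver wires the mapper and reducer into a complete Hadoop job. The class name CountReducer, the job name, and the command-line input and output paths are illustrative only.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Aggregates the values that arrive for each key (here, a simple count).
public class CountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int total = 0;
        for (IntWritable value : values) {
            total += value.get();
        }
        context.write(key, new IntWritable(total));
    }

    // Driver: wires the mapper and reducer into one Hadoop job.
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "select-and-count");
        job.setJarByClass(CountReducer.class);
        job.setMapperClass(SelectMapper.class);
        job.setReducerClass(CountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}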
Once the map-reduce phase is complete, the processed data is ready for further
analysis by data scientists and others with advanced data analytics skills. Data scientists
can manipulate and analyze the data using any of a number of tools for any number of
uses, including to search for hidden insights and patterns or to use as the foundation to
build user-facing analytic applications. The data can also be modeled and transferred
from Hadoop clusters into existing relational databases, data warehouses, and other
traditional IT systems for further analysis and/or to support transactional processing.
Hadoop Technical Components
A Hadoop “stack” is made up of a number of components. They include:
Hadoop Distributed File System (HDFS): The default storage
layer in any given Hadoop cluster.
Name Node: The node in a Hadoop cluster that provides the
client information on where in the cluster particular data is stored
and whether any nodes have failed.
Secondary Node: A backup to the Name Node; it periodically
replicates and stores the Name Node's data so that the cluster can
recover should the Name Node fail.
Job Tracker: The node in a Hadoop cluster that initiates and
coordinates map-reduce jobs, or the processing of the data (a
client-side configuration sketch follows this list).
Slave Nodes: The grunts of any Hadoop cluster, slave nodes store
data and take direction to process it from the Job Tracker.
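As a minimal sketch of how a client is pointed at these components, and assuming the classic Hadoop 1.x (MRv1) property names, the following Java fragment names the Name Node and Job Tracker addresses; the host names and ports are placeholders.

import org.apache.hadoop.conf.Configuration;

// Points a client at the cluster's Name Node and Job Tracker, assuming the
// classic Hadoop 1.x (MRv1) property names; hosts and ports are placeholders.
public class ClusterConfig {

    public static Configuration create() {
        Configuration conf = new Configuration();
        // The Name Node tells clients where in HDFS particular data is stored.
        conf.set("fs.default.name", "hdfs://namenode.example.com:8020");
        // The Job Tracker initiates and coordinates map-reduce jobs.
        conf.set("mapred.job.tracker", "jobtracker.example.com:8021");
        return conf;
    }
}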
In addition to the above, the Hadoop ecosystem is made up of a number of
complementary sub-components. NoSQL data stores like Cassandra and HBase are
also used to store the results of map-reduce jobs in Hadoop. In addition to Java, some
map-reduce jobs and other Hadoop functions are written in Pig, an open-source
language designed specifically for Hadoop.
 