How Hadoop Works
A client program accesses unstructured and semi-structured data from sources
including log files, social media feeds, and internal data stores. It breaks the data up
into parts, which are then loaded into a file system made up of multiple nodes running
on commodity hardware. The default file store in Hadoop is the Hadoop Distributed
File System, or HDFS. File systems such as HDFS are adept at storing large volumes of
unstructured and semi-structured data, as they do not require data to be organized into
relational rows and columns.
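To make this concrete, here is a minimal sketch of loading a file into HDFS through Hadoop's Java FileSystem API. The class name and both paths are illustrative, and the cluster address is assumed to come from the standard core-site.xml configuration on the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LoadToHdfs {
    public static void main(String[] args) throws Exception {
        // Picks up the cluster address from core-site.xml on the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical paths: a local log file is copied into HDFS, where it
        // is split into blocks and distributed across the cluster's nodes.
        fs.copyFromLocalFile(new Path("/var/log/app/server.log"),
                             new Path("/data/logs/server.log"));
        fs.close();
    }
}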
Each part is replicated multiple times as it is loaded into the file system, so that if a node fails, another node holds a copy of the data that was stored on the failed node. A Name Node acts as a facilitator, communicating back to the client information such as which nodes are available, where in the cluster certain data resides, and which nodes have failed.
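The Name Node's role as the keeper of this metadata can be seen through the same Java API. The sketch below, with a hypothetical file path, asks where each block of a file is replicated; the answer comes from the Name Node's metadata, not from the data nodes themselves.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockReport {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical path; substitute a file that exists in your cluster.
        Path file = new Path("/data/logs/server.log");
        FileStatus status = fs.getFileStatus(file);

        // The Name Node answers this query from its metadata: for each
        // block of the file, which nodes hold a replica.
        BlockLocation[] blocks =
            fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("Block at offset " + block.getOffset()
                + " replicated on: " + String.join(", ", block.getHosts()));
        }
    }
}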
Once the data is loaded into the cluster, it is ready to be analyzed via the map-reduce framework. The client program submits a map job, usually a query written in Java, to a node in the cluster known as the Job Tracker. The Job Tracker consults the Name Node to determine which data the job needs to access and where in the cluster that data is located. Once that is determined, the Job Tracker submits the query to the relevant nodes.
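As a sketch of what such a Java map job can look like, the mapper below emits a count of one for every word it finds in a line of input. The class name is illustrative, and word counting stands in for whatever query the client actually submits.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Runs on the nodes that hold the data blocks: each call processes
// one line of input and emits (word, 1) pairs.
public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}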
The design philosophy is that rather than bringing all the data back to a central location for processing, the processing is sent to the data: it occurs at each node simultaneously, in parallel. This data locality is an essential characteristic of Hadoop.
When each node has finished its processing task, it stores the results locally. The client program then initiates a reduce job through the Job Tracker, in which the results of the map phase stored on the individual nodes are aggregated to produce the answer to the original query and then loaded onto another node in the cluster. The client accesses these results, which can then be loaded into one of a number of analytic environments for analysis. The map-reduce job is now complete (Figure 5-6).
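To round out the sketch, a matching reducer and driver are shown below, again with illustrative names; the driver is the piece the client program submits to the cluster, and it wires in the mapper sketched earlier. The reducer aggregates the per-node map output into the final counts.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Aggregates the per-node map output: sums the counts emitted
    // for each word across the whole cluster.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts,
                Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : counts) {
                sum += count.get();
            }
            context.write(word, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        // The driver the client submits; input and output paths are
        // passed on the command line.
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}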