The Influence of Map-Reduce and Hadoop
Until recently there was no widely used framework for parallel programming. Parallel
programming was thus a fairly specialized skill, acquired by programmers who developed
custom applications. Scale-out hardware models offering parallelism came onto the
scene with the rise of search engine technology. The Web spans billions of pages, and the
number grows daily, yet a search for a word or phrase returns an answer in a fraction of a
second. This is achieved using parallel computing.
A much publicized use case is Google: each search query a user submits is spread
out across thousands of CPUs, each of which has a large amount of memory. A highly
compressed index of the entire web's content is held in memory, and the search query
runs against that index. The software framework Google developed to address this
application is called Map-Reduce. Hadoop is an open-source framework built around
the same parallel computing model. The Hadoop ecosystem includes the Hadoop
Distributed File System (HDFS), which allows very large data files to be distributed
across all the nodes of a very large grid of servers in a way that supports recovery from
the failure of any node.
Below is an introduction to map-reduce technology and the associated Hadoop
ecosystem components; Chapter 5 covers both in more detail.
The map-reduce mode of operation is to partition the workload across all the servers
in a grid and to apply first a mapping step (Map) and then a reduction step (Reduce), as
the sketch after the two steps below illustrates.
Map: The map step partitions the workload across all the nodes
for execution. This step may cycle, as each node can spawn a
further division of work and share it with other nodes. In any
event, an answer set is arrived at on each node.
Reduce: The reduce step combines the answers from all the
nodes. This activity may also be distributed across the grid, if
needed, with data being passed as well.
In essence, Hadoop implements parallelism that works well on large volumes of data
distributed across many servers. Processing is kept local to each node (in the Map step),
and data is sent across the network only to arrive at a combined answer (in the Reduce
step). It is easy to see how an SQL-like query could be implemented on top of this: the
Map step would perform the SELECT operations, retrieving the appropriate data from each
node, and the Reduce step would compile the answer, possibly implementing a JOIN or
carrying out a SORT.
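As an illustration, the following Python sketch runs a hypothetical query of the form
SELECT customer, SUM(amount) ... WHERE amount > 4 GROUP BY customer ORDER BY customer
over two simulated partitions; the table contents, column names, and threshold are
invented for the example.

    from collections import defaultdict

    # Each inner list stands in for the partition of an "orders"
    # table held by one node.
    partitions = [
        [{"customer": "a", "amount": 10}, {"customer": "b", "amount": 5}],
        [{"customer": "a", "amount": 7},  {"customer": "c", "amount": 3}],
    ]

    def map_select(rows):
        # Map: the WHERE filter and column projection run locally.
        for row in rows:
            if row["amount"] > 4:
                yield (row["customer"], row["amount"])

    groups = defaultdict(list)
    for partition in partitions:
        for customer, amount in map_select(partition):
            groups[customer].append(amount)

    # Reduce: GROUP BY with SUM, then an ORDER BY over the combined answer.
    answer = sorted((c, sum(amounts)) for c, amounts in groups.items())
    print(answer)  # [('a', 17), ('b', 5)]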
HDFS keeps three copies of all data by default, and this replication, together with the
snapshots Hadoop takes on each node, enables recovery from the failure of any node.
Hadoop is thus “fault tolerant,” and the fault tolerance is hardware independent, so it
can be deployed on inexpensive commodity hardware. Fault tolerance is important for a
system that may be using hundreds of nodes at once, because the probability that some
node fails grows with the number of nodes.
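A back-of-the-envelope Python calculation shows why. Assuming, purely for illustration,
that each node independently fails on a given day with probability 0.1 percent, the
chance that at least one node in the grid fails that day grows quickly with the size of
the grid:

    # Assumed per-node daily failure probability (illustrative only).
    p_node = 0.001

    for nodes in (1, 100, 1000):
        # P(at least one failure) = 1 - P(no node fails)
        p_any = 1 - (1 - p_node) ** nodes
        print(f"{nodes:5d} nodes: {p_any:.1%} chance of a failure per day")

With 1,000 nodes the probability is roughly 63 percent, so node failures must be treated
as routine events rather than exceptions.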
In its native form, Hadoop is not a database. HBase, another component of the Hadoop
ecosystem, provides a column-oriented data store built on Hadoop and HDFS, and it also
provides indexing for HDFS. With HBase it is possible to have multiple large tables, or
even just one very large table, distributed beneath Hadoop. Hive provides a formal query
capability, turning Hadoop into a data warehouse-like system that allows data
summarization, ad hoc queries, and the analysis of data stored in HBase or natively
in HDFS.
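To illustrate HBase's column-oriented model, here is a sketch using the happybase
Python client. The host name, table name, and column families are hypothetical, and the
code assumes a running HBase Thrift gateway; it is a minimal sketch rather than a
production pattern.

    import happybase

    # Connect to a (hypothetical) HBase Thrift gateway.
    connection = happybase.Connection('hbase.example.com')

    # Column families are fixed when the table is created; the columns
    # (qualifiers) inside each family are open-ended, row by row.
    connection.create_table('pages', {'content': dict(), 'meta': dict()})

    table = connection.table('pages')
    table.put(b'com.example/index', {b'content:html': b'<html>...</html>',
                                     b'meta:crawled': b'2015-06-01'})

    # Read one row back as a dict mapping column to value.
    row = table.row(b'com.example/index')
    print(row[b'meta:crawled'])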