The Influence of Map-Reduce and Hadoop
Until recently there was no widely used framework for parallel programming. Parallel
programming was thus a fairly specialized skill, acquired by programmers who developed
custom applications. Scale-out hardware models offering parallelism came onto the
scene with the rise of search engine technology. The Web spans billions of pages, and the
number grows daily, yet a search for a word or phrase returns an answer in a fraction of a
second. This is achieved using parallel computing.
A much publicized use case is Google: each search query a user submits is spread
out across thousands of CPUs, each of which has a large amount of memory. A highly
compressed index of the entire web's content is held in memory, and the search query
runs against that index. The software framework Google developed to address this
application is called Map-Reduce. Hadoop is an open-source framework built around
the same parallel computing model. The Hadoop ecosystem includes the Hadoop
Distributed File System (HDFS), which allows very large data files to be distributed
across all the nodes of a very large grid of servers in a way that supports recovery from
the failure of any node.
Below is an introduction to map-reduce technology and the associated Hadoop
ecosystem components; Chapter 5 covers both in more detail.
The map-reduce mode of operation is to partition the workload across all the servers
in a grid and to apply first a mapping step (Map) and then a reduction step (Reduce), as
the sketch after the two steps below illustrates.
Map: The map step partitions the workload across all the nodes
for execution. This step may cycle, as each node can spawn a
further division of work and share it with other nodes. In any
event, an answer set is arrived at on each node.
Reduce: The reduce step combines the answers from all the
nodes. This activity may also be distributed across the grid, if
needed, with data being passed as well.
In essence, Hadoop implements parallelism that works well on large volumes of data
distributed across many servers. Processing is kept local to each node (in the Map step),
and data is sent across the network only to arrive at a combined answer (in the Reduce
step). It is easy to see how an SQL-like query could be implemented on top of this: the
Map step would perform the SELECT operations, retrieving the appropriate data from each
node, and the Reduce step would compile the answer, possibly implementing a JOIN or
carrying out a SORT.
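As an illustration, the following Python sketch runs a hypothetical query of the form
SELECT customer, SUM(amount) ... WHERE amount > 4 GROUP BY customer ORDER BY customer
over two simulated partitions; the table contents, column names, and threshold are
invented for the example.

    from collections import defaultdict

    # Each inner list stands in for the partition of an "orders"
    # table held by one node.
    partitions = [
        [{"customer": "a", "amount": 10}, {"customer": "b", "amount": 5}],
        [{"customer": "a", "amount": 7},  {"customer": "c", "amount": 3}],
    ]

    def map_select(rows):
        # Map: the WHERE filter and column projection run locally.
        for row in rows:
            if row["amount"] > 4:
                yield (row["customer"], row["amount"])

    groups = defaultdict(list)
    for partition in partitions:
        for customer, amount in map_select(partition):
            groups[customer].append(amount)

    # Reduce: GROUP BY with SUM, then an ORDER BY over the combined answer.
    answer = sorted((c, sum(amounts)) for c, amounts in groups.items())
    print(answer)  # [('a', 17), ('b', 5)]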
HDFS keeps three copies of all data by default, and this replication, together with the
snapshots Hadoop takes on each node, enables recovery from the failure of any node.
Hadoop is thus “fault tolerant,” and the fault tolerance is hardware independent, so it
can be deployed on inexpensive commodity hardware. Fault tolerance is important for a
system that may be using hundreds of nodes at once, because the probability that some
node fails grows with the number of nodes.
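A back-of-the-envelope Python calculation shows why. Assuming, purely for illustration,
that each node independently fails on a given day with probability 0.1 percent, the
chance that at least one node in the grid fails that day grows quickly with the size of
the grid:

    # Assumed per-node daily failure probability (illustrative only).
    p_node = 0.001

    for nodes in (1, 100, 1000):
        # P(at least one failure) = 1 - P(no node fails)
        p_any = 1 - (1 - p_node) ** nodes
        print(f"{nodes:5d} nodes: {p_any:.1%} chance of a failure per day")

With 1,000 nodes the probability is roughly 63 percent, so node failures must be treated
as routine events rather than exceptions.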
In its native form, Hadoop is not a database. HBase, another component of the Hadoop
ecosystem, provides a column-oriented data store built on Hadoop and HDFS, and it also
provides indexing for HDFS. With HBase it is possible to have multiple large tables, or
even just one very large table, distributed beneath Hadoop. Hive provides a formal query
capability, turning Hadoop into a data warehouse-like system that allows data
summarization, ad hoc queries, and the analysis of data stored in HBase or natively
in HDFS.
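To illustrate HBase's column-oriented model, here is a sketch using the happybase
Python client. The host name, table name, and column families are hypothetical, and the
code assumes a running HBase Thrift gateway; it is a minimal sketch rather than a
production pattern.

    import happybase

    # Connect to a (hypothetical) HBase Thrift gateway.
    connection = happybase.Connection('hbase.example.com')

    # Column families are fixed when the table is created; the columns
    # (qualifiers) inside each family are open-ended, row by row.
    connection.create_table('pages', {'content': dict(), 'meta': dict()})

    table = connection.table('pages')
    table.put(b'com.example/index', {b'content:html': b'<html>...</html>',
                                     b'meta:crawled': b'2015-06-01'})

    # Read one row back as a dict mapping column to value.
    row = table.row(b'com.example/index')
    print(row[b'meta:crawled'])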