Meet Hadoop - Hadoop: The Definitive Guide

Beyond Batch

For all its strengths, MapReduce is fundamentally a batch processing system and is not suitable for interactive analysis. You can't run a query and get results back in a few seconds or less. Queries typically take minutes or more, so it's best for offline use, where there isn't a human sitting in the processing loop waiting for results.

However, since its original incarnation, Hadoop has evolved beyond batch processing. Indeed, the term “Hadoop” is sometimes used to refer to a larger ecosystem of projects, not just HDFS and MapReduce, that fall under the umbrella of infrastructure for distributed computing and large-scale data processing. Many of these are hosted by the Apache Software Foundation, which provides support for a community of open source software projects, including the original HTTP Server from which it gets its name.

The first component to provide online access was HBase, a key-value store that uses HDFS for its underlying storage. HBase provides both online read/write access to individual rows and batch operations for reading and writing data in bulk, making it a good foundation for building applications.
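
The two access styles described above can be illustrated with a toy in-memory model. This is only a sketch: the class ToyRowStore and its method names are hypothetical and are not the HBase client API. It assumes rows are kept sorted by row key, so that a bulk read can be expressed as a scan over a key range.

```python
import bisect


class ToyRowStore:
    """Toy stand-in for a row-keyed store (hypothetical, NOT the HBase API)."""

    def __init__(self):
        self._keys = []  # row keys, kept in sorted order
        self._rows = {}  # row key -> dict of column name -> value

    def put(self, row_key, columns):
        """Online write: create or update a single row."""
        if row_key not in self._rows:
            bisect.insort(self._keys, row_key)
            self._rows[row_key] = {}
        self._rows[row_key].update(columns)

    def get(self, row_key):
        """Online read: fetch a single row by key, or None if it is absent."""
        row = self._rows.get(row_key)
        return dict(row) if row is not None else None

    def put_batch(self, items):
        """Bulk write: apply many (row_key, columns) pairs in one call."""
        for row_key, columns in items:
            self.put(row_key, columns)

    def scan(self, start_key, stop_key):
        """Bulk read: yield rows with start_key <= key < stop_key, in key order."""
        lo = bisect.bisect_left(self._keys, start_key)
        hi = bisect.bisect_left(self._keys, stop_key)
        for key in self._keys[lo:hi]:
            yield key, dict(self._rows[key])
```

An application would call get and put for individual lookups and updates, and use put_batch and scan when loading or reading many rows at once.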

The real enabler for new processing models in Hadoop was the introduction of YARN (which stands for Yet Another Resource Negotiator) in Hadoop 2. YARN is a cluster resource management system, which allows any distributed program (not just MapReduce) to run on data in a Hadoop cluster.

In the last few years, there has been a flowering of different processing patterns that work with Hadoop. Here is a sample:

Interactive SQL

By dispensing with MapReduce and using a distributed query engine that uses dedicated “always on” daemons (like Impala) or container reuse (like Hive on Tez), it's possible to achieve low-latency responses for SQL queries on Hadoop while still scaling up to large dataset sizes.

Iterative processing

Many algorithms, such as those in machine learning, are iterative in nature, so it's much more efficient to hold each intermediate working set in memory than to load it from disk on each iteration. The architecture of MapReduce does not allow this, but it's straightforward with Spark, for example, and it enables a highly exploratory style of working with datasets.
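
The cost difference can be seen in a small, single-machine sketch. It uses neither Spark nor MapReduce. It simply runs the same iterative computation twice (a gradient-descent estimate of the mean of a few numbers), once reloading the input from disk on every iteration and once holding it in memory. All function names, the sample data, and the step size are assumptions made for illustration.

```python
import os
import tempfile


def load_points(path, counter):
    """Read one float per line from disk and count how often the disk is hit."""
    counter["loads"] += 1
    with open(path) as f:
        return [float(line) for line in f]


def step(points, estimate, rate=0.5):
    """One gradient-descent step that moves the estimate toward the mean."""
    gradient = sum(estimate - p for p in points) / len(points)
    return estimate - rate * gradient


def iterate_reloading(path, iterations, counter):
    """Batch style: the working set is read from disk on every iteration."""
    estimate = 0.0
    for _ in range(iterations):
        points = load_points(path, counter)
        estimate = step(points, estimate)
    return estimate


def iterate_cached(path, iterations, counter):
    """In-memory style: the working set is loaded once and then reused."""
    points = load_points(path, counter)
    estimate = 0.0
    for _ in range(iterations):
        estimate = step(points, estimate)
    return estimate


def demo(iterations=20):
    """Run both variants on the same file and report results and load counts."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "points.txt")
        with open(path, "w") as f:
            f.write("\n".join(str(x) for x in [1.0, 2.0, 3.0, 6.0]))
        reload_counter = {"loads": 0}
        cached_counter = {"loads": 0}
        reloaded = iterate_reloading(path, iterations, reload_counter)
        cached = iterate_cached(path, iterations, cached_counter)
    return reloaded, cached, reload_counter["loads"], cached_counter["loads"]
```

Both variants produce the same estimate, but the reloading variant reads the file once per iteration while the cached variant reads it once in total. On a real cluster the per-iteration read is far more expensive than in this toy, which is why keeping the working set in memory matters.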

Stream processing
