it was explicitly designed to handle data flow in a large multi-datacenter
installation.
Processing
Map-reduce processing (Hadoop being the most popular open source
version) became popular, but not because it was a game-changing paradigm.
After all, the Hollerith tabulating machines used for the 1890 census relied
on a mechanism very similar to the map-reduce procedure implemented by
Hadoop and others.
NOTE
Hollerith later founded the company that would eventually become IBM.
Map-reduce processing also did not introduce the idea of distributed
computation, nor the insight that keeping computation local to the data is
important for performance. The high-performance and large-scale computing
communities have long known that it is often more efficient to move
software close to the data, and there is no shortage of software available
from that community for distributed computing.
What the map-reduce paper did do was to use this paradigm to efficiently
arrange the computation in a way that did not require work on the part of
the application developer, provided the developer could state the problem
in this peculiar way. Large amounts of data had long been tabulated, but it
was up to the specific application to decide how to divide that data among
processing units and move it around. The first-generation log collection
systems made it possible to collect large amounts of data in a generic way.
Map-reduce, and Hadoop in particular, made it easy to distribute that
processing so that it did not require application-specific code or expensive
hardware and software.
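To make this peculiar way of stating a problem concrete, the following is a
minimal, hypothetical Python sketch of a word count expressed as a map step
and a reduce step. The names map_fn, reduce_fn, and run_map_reduce are
illustrative rather than taken from any real framework, and the
single-process driver merely stands in for the shuffling and distribution
that a system such as Hadoop would perform across many machines.

    from collections import defaultdict

    def map_fn(line):
        # Map step: emit a (key, value) pair for each word in an input record.
        for word in line.split():
            yield word.lower(), 1

    def reduce_fn(word, counts):
        # Reduce step: combine all values emitted under a single key.
        return word, sum(counts)

    def run_map_reduce(lines):
        # Toy single-process driver: a real framework would group the mapped
        # pairs by key and run the map and reduce steps in parallel, close to
        # where the data lives.
        grouped = defaultdict(list)
        for line in lines:
            for key, value in map_fn(line):
                grouped[key].append(value)
        return dict(reduce_fn(key, values) for key, values in grouped.items())

    print(run_map_reduce(["the quick brown fox", "the lazy dog jumps"]))
    # {'the': 2, 'quick': 1, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 1, 'jumps': 1}

The point of the paradigm is that the developer writes only the map and
reduce functions; deciding how to divide the data among processing units
and move it around is left to the framework.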
The same thing happened in the real-time space with the introduction of
the second generation of data-flow frameworks. Data that had formerly
been locked in log files became available in a low-latency setting. Like the
original batch systems, the first systems tended to have task-specific