it was explicitly designed to handle data flow in a large multi-datacenter
installation.
Processing
Map-reduce processing (Hadoop being the most popular open source
version) became popular, but not because it was a game-changing paradigm.
After all, the Hollerith tabulating machines used for the 1890 census relied
on a mechanism very similar to the map-reduce procedure implemented by
Hadoop and others.
NOTE
Hollerith later founded the company that would eventually become IBM.
Map-reduce processing also did not introduce the idea of distributed
computation, nor the insight that keeping computation local to the data is
important for performance. The high-performance and large-scale computing
communities have long known that it is often more efficient to move
software close to the data, and there is no shortage of software available
from that community for distributed computing.
What the map-reduce paper did do was to use this paradigm to efficiently
arrange the computation in a way that did not require work on the part of
the application developer, provided the developer could state the problem
in this peculiar way. Large amounts of data had long been tabulated, but it
was up to the specific application to decide how to divide that data among
processing units and move it around. The first-generation log collection
systems made it possible to collect large amounts of data in a generic way.
Map-reduce, and Hadoop in particular, made it easy to distribute that
processing so that it did not require application-specific code or expensive
hardware and software.
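To make this peculiar way of stating a problem concrete, the following is a
minimal, hypothetical Python sketch of a word count expressed as a map step
and a reduce step. The names map_fn, reduce_fn, and run_map_reduce are
illustrative rather than taken from any real framework, and the
single-process driver merely stands in for the shuffling and distribution
that a system such as Hadoop would perform across many machines.

    from collections import defaultdict

    def map_fn(line):
        # Map step: emit a (key, value) pair for each word in an input record.
        for word in line.split():
            yield word.lower(), 1

    def reduce_fn(word, counts):
        # Reduce step: combine all values emitted under a single key.
        return word, sum(counts)

    def run_map_reduce(lines):
        # Toy single-process driver: a real framework would group the mapped
        # pairs by key and run the map and reduce steps in parallel, close to
        # where the data lives.
        grouped = defaultdict(list)
        for line in lines:
            for key, value in map_fn(line):
                grouped[key].append(value)
        return dict(reduce_fn(key, values) for key, values in grouped.items())

    print(run_map_reduce(["the quick brown fox", "the lazy dog jumps"]))
    # {'the': 2, 'quick': 1, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 1, 'jumps': 1}

The point of the paradigm is that the developer writes only the map and
reduce functions; deciding how to divide the data among processing units
and move it around is left to the framework.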
The same thing happened in the real-time space with the introduction of
the second generation of data-flow frameworks. Data that had formerly
been locked in log files became available in a low-latency setting. Like the
original batch systems, the first systems tended to have task-specific