Database Reference
In-Depth Information
Figure 5-1
.
ETL based analytics
In 2008, Google's research program published a paper “MapReduce: Simplified
Data Processing on Large Clusters” that introduced the MapReduce paradigm, though
it had been in use since 2003. (You can download this white paper at
ht-
MapReduce algorithm is a distributed parallel programming model, comprised of the
map and reduce function, and is one solution to business problems that require analyt-
ics over heterogeneous forms of massive amounts of data. Since the Google paper was
published, various MapReduce implementations have been built. One popular and
open-source implementation is
Apache Hadoop
.
Apache Hadoop
Apache Hadoop is a popular open-source library for distributed computation for large-
scale data sets. It's a large-scale batch processing infrastructure that is primarily meant
to deal with batch analytics over data distributed across hundreds of nodes.