6.7 Using MapReduce to transform your data over distributed systems
Now let's take an in-depth look at how MapReduce systems can be used to process large datasets on multiple processors. You'll see how MapReduce clusters work in conjunction with distributed filesystems such as the Apache Hadoop Distributed File System (HDFS), and review how NoSQL systems such as Hadoop use both the map and reduce functions to transform data that's stored in NoSQL databases.
If you've been moving data between SQL systems, you're familiar with the extract, transform, and load (ETL) process. The ETL process is typically used when extracting data from an operational RDBMS to transfer it into the staging area of a data warehouse. We reviewed the ETL process in chapter 3 when we covered OLAP systems.
ETL systems are typically written in SQL. They use the SELECT statement on a source system and INSERT, UPDATE, or DELETE statements on the destination system. SQL-based ETL systems usually don't have the inherent ability to use a large number of processors to do their work. This single-processor bottleneck is common in data warehouse systems as well as in big data problems.
To solve this type of problem, organizations have moved to a distributed transformation model built around the map and reduce functions. To effectively distribute work evenly over a cluster of processors, the output of a map phase must be a set of key-value pairs, where part of the key structure is used to correlate results into the reduce phase. These functions are designed to be inherently linearly scalable using a large number of shared-nothing processors.
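To make the key-value idea concrete, here's a minimal in-memory sketch (plain Java, not Hadoop code; the input lines and the word-count task are illustrative assumptions). The map step emits an implicit (word, 1) pair for each word, and grouping on the word key plays the role of the shuffle that correlates results into the reduce step:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class KeyValueSketch {
    public static void main(String[] args) {
        List<String> lines = List.of("a b a", "b c"); // illustrative input

        Map<String, Long> counts = lines.stream()
            // Map phase: split each line into words; each word acts as
            // the key of an implicit (word, 1) pair.
            .flatMap(line -> Arrays.stream(line.split("\\s+")))
            // Shuffle + reduce: group pairs by key, sum counts per key.
            .collect(Collectors.groupingBy(w -> w, Collectors.counting()));

        System.out.println(counts); // e.g. {a=2, b=2, c=1} (order may vary)
    }
}
```

Because every intermediate record carries a key, the grouping step can be partitioned by key and run on many machines at once; that partitioning is what a real MapReduce framework automates.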
Yet, at its core, the fundamental process of MapReduce is the parallel transformation of data from one form to another. MapReduce processes don't require a database in the middle of the transformation. To work effectively on big data problems, however, MapReduce operations do require a large amount of input and output. In an ideal situation, a MapReduce server reads all of its input from the local disk of a shared-nothing cluster node and writes its results back to the same local disk; moving large datasets in and out of a MapReduce cluster can be inefficient.
The MapReduce way of processing data is to specify a series of stepwise functions on uniform input data. This process is similar to the functional programming constructs that became popular in the 1950s with the LISP systems at MIT. Functional programming is about taking a function and a list, and returning a list where the function has been applied to each member of the list. What's different about modern MapReduce is all the infrastructure that goes with the reliable and efficient execution of transforms on lists of billions of items. The most popular implementation of the MapReduce algorithm is the Apache Hadoop system.
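As a small illustration of this functional style (a sketch using Java streams, with made-up data), applying a function to each member of a list yields a new list without mutating the input:

```java
import java.util.List;
import java.util.stream.Collectors;

public class FunctionalMap {
    public static void main(String[] args) {
        List<String> words = List.of("map", "reduce", "shuffle");

        // Apply one function (String::length) to every member of the
        // list, producing a new list -- the functional "map" operation.
        List<Integer> lengths = words.stream()
            .map(String::length)
            .collect(Collectors.toList());

        System.out.println(lengths); // [3, 6, 7]
    }
}
```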
The Hadoop system doesn't fundamentally change the concepts of the map and reduce functions. What it does is provide an entire ecosystem of tools that allow map and reduce functions to achieve linear scalability. It does this by requiring that the output of every map function be returned as key-value pairs, which is how work can be distributed evenly over the nodes of a Hadoop cluster. Hadoop addresses all of the hard infrastructure problems that come with running these functions reliably across a large cluster.
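To show what this key-value contract looks like in practice, here's a sketch of the classic word-count job against Hadoop's org.apache.hadoop.mapreduce API (the class names and the word-count task are illustrative; in a real project the mapper and reducer would typically live in separate files):

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: Hadoop requires every map call to emit key-value pairs.
// Here the word is the key, so the shuffle can route all counts for
// one word to the same reducer node.
public class WordCountMapper
        extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE); // emit (word, 1)
        }
    }
}

// Reducer: receives every value that shares one key and folds them
// into a single result -- here, the total count for the word.
class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        result.set(sum);
        context.write(key, result); // emit (word, total)
    }
}
```

Note that neither class says anything about which node runs it, how input is split, or what happens when a machine fails; that's the infrastructure Hadoop supplies around the two functions.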