6.7 Using MapReduce to transform your data over distributed systems
Now let's take an in-depth look at how MapReduce systems can be used to process large datasets on multiple processors. You'll see how MapReduce clusters work in conjunction with distributed filesystems such as the Apache Hadoop Distributed File System (HDFS), and review how NoSQL systems such as Hadoop use both the map and reduce functions to transform data that's stored in NoSQL databases.
If you've been moving data between SQL systems, you're familiar with the extract, transform, and load (ETL) process. The ETL process is typically used when extracting data from an operational RDBMS to transfer it into the staging area of a data warehouse. We reviewed the ETL process in chapter 3 when we covered OLAP systems.
ETL systems are typically written in SQL. They use the SELECT statement on a source system and INSERT, UPDATE, or DELETE statements on the destination system. SQL-based ETL systems usually don't have the inherent ability to use a large number of processors to do their work. This single-processor bottleneck is common in data warehouse systems as well as in big data problems.
To solve this type of problem, organizations have moved to a distributed transformation model built around the map and reduce functions. To effectively distribute work evenly over a cluster of processors, the output of a map phase must be a set of key-value pairs, where part of the key structure is used to correlate results into the reduce phase. These functions are designed to be inherently linearly scalable using a large number of shared-nothing processors.
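To make the key-value idea concrete, here's a minimal in-memory sketch (plain Java, not Hadoop code; the input lines and the word-count task are illustrative assumptions). The map step emits an implicit (word, 1) pair for each word, and grouping on the word key plays the role of the shuffle that correlates results into the reduce step:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class KeyValueSketch {
    public static void main(String[] args) {
        List<String> lines = List.of("a b a", "b c"); // illustrative input

        Map<String, Long> counts = lines.stream()
            // Map phase: split each line into words; each word acts as
            // the key of an implicit (word, 1) pair.
            .flatMap(line -> Arrays.stream(line.split("\\s+")))
            // Shuffle + reduce: group pairs by key, sum counts per key.
            .collect(Collectors.groupingBy(w -> w, Collectors.counting()));

        System.out.println(counts); // e.g. {a=2, b=2, c=1} (order may vary)
    }
}
```

Because every intermediate record carries a key, the grouping step can be partitioned by key and run on many machines at once; that partitioning is what a real MapReduce framework automates.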
Yet, at its core, the fundamental process of MapReduce is the parallel transformation of data from one form to another. MapReduce processes don't require a database in the middle of the transformation. To work effectively on big data problems, however, MapReduce operations do require a large amount of input and output. In an ideal situation, a MapReduce server reads all of its input from the local disk of a shared-nothing cluster node and writes its results back to the same local disk; moving large datasets in and out of a MapReduce cluster can be inefficient.
The MapReduce way of processing data is to specify a series of stepwise functions on uniform input data. This process is similar to the functional programming constructs that became popular in the 1950s with the LISP systems at MIT. Functional programming is about taking a function and a list, and returning a list where the function has been applied to each member of the list. What's different about modern MapReduce is all the infrastructure that goes with the reliable and efficient execution of transforms on lists of billions of items. The most popular implementation of the MapReduce algorithm is the Apache Hadoop system.
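As a small illustration of this functional style (a sketch using Java streams, with made-up data), applying a function to each member of a list yields a new list without mutating the input:

```java
import java.util.List;
import java.util.stream.Collectors;

public class FunctionalMap {
    public static void main(String[] args) {
        List<String> words = List.of("map", "reduce", "shuffle");

        // Apply one function (String::length) to every member of the
        // list, producing a new list -- the functional "map" operation.
        List<Integer> lengths = words.stream()
            .map(String::length)
            .collect(Collectors.toList());

        System.out.println(lengths); // [3, 6, 7]
    }
}
```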
The Hadoop system doesn't fundamentally change the concepts of the map and reduce functions. What it does is provide an entire ecosystem of tools that allow map and reduce functions to achieve linear scalability. It does this by requiring that the output of every map function be returned as key-value pairs, which is how work can be distributed evenly over the nodes of a Hadoop cluster. Hadoop addresses all of the hard infrastructure problems that come with running these functions reliably across a large cluster.
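To show what this key-value contract looks like in practice, here's a sketch of the classic word-count job against Hadoop's org.apache.hadoop.mapreduce API (the class names and the word-count task are illustrative; in a real project the mapper and reducer would typically live in separate files):

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: Hadoop requires every map call to emit key-value pairs.
// Here the word is the key, so the shuffle can route all counts for
// one word to the same reducer node.
public class WordCountMapper
        extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE); // emit (word, 1)
        }
    }
}

// Reducer: receives every value that shares one key and folds them
// into a single result -- here, the total count for the word.
class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        result.set(sum);
        context.write(key, result); // emit (word, total)
    }
}
```

Note that neither class says anything about which node runs it, how input is split, or what happens when a machine fails; that's the infrastructure Hadoop supplies around the two functions.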