The first part of a MapReduce job is the map operation. Map operations retrieve data from your source database and convert it into a series of independent transform operations that can be executed on different processors. The output of all map operations is a set of key-value pairs, where the keys are uniform across all input documents. The second phase is the reduce operation. The reduce operation takes the key-value pairs created in the map phase as input, performs the requested operation on each group of values that shares a key, and returns the values you need.
When creating a MapReduce program, you must ensure that the map function depends only on its inputs and has no side effects: it doesn't change the state of any data; it only returns key-value pairs. In MapReduce operations, no other intermediate information can be passed between map functions.
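The two phases can be illustrated with a minimal single-process sketch using word count as the example problem. The function names, the shuffle implementation, and the sample documents are illustrative only; a real framework like Hadoop distributes these steps across many nodes. Note that the map function depends only on its input document and communicates nothing except the key-value pairs it emits:

```python
from collections import defaultdict

def map_fn(document):
    """Map phase: a pure function of its input that emits (key, value) pairs."""
    for word in document.split():
        yield (word.lower(), 1)

def reduce_fn(key, values):
    """Reduce phase: combine all values that share a key into one result."""
    return (key, sum(values))

def map_reduce(documents):
    # Shuffle step: group every emitted value under its key, so each
    # distinct key reaches exactly one reduce call.
    groups = defaultdict(list)
    for doc in documents:
        for key, value in map_fn(doc):
            groups[key].append(value)
    # Reduce step: one call per distinct key.
    return dict(reduce_fn(k, vs) for k, vs in groups.items())

counts = map_reduce(["the quick fox", "the lazy dog"])
# counts["the"] == 2
```

Because each map call is independent, a framework can run the map calls on any node holding the data and rerun a failed call without affecting the others.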
At first glance, it may seem like creating a MapReduce framework would be simple.
Realistically, it's not. First, what if your source data is replicated on three or more
nodes? Do you move the data between nodes? Not if you want your job to be efficient.
Then you must consider which node the map function should run on. How do you
assign the right key to the right reduce processor? What happens if one of the map or
reduce jobs fails in mid-operation? Do you need to restart the entire batch or can you
reassign the work to another node? As you can see, there are many factors to consider, and in the end it's not as simple as it appears.
The good news is that if you stick to these rules, a MapReduce framework like Hadoop can do most of the hard work: finding the right processor to run each map, making sure the right reduce node gets the input based on the keys, and making sure the job finishes even if there's a hardware failure during the job.
Now that we've covered the types of big data problems and some of the architecture patterns, let's look into the strategies that NoSQL systems use to attack these problems.
6.8 Four ways that NoSQL systems handle big data problems
As you've seen, understanding your big data is important in determining the best solution. Now let's take a look at four of the most popular ways NoSQL systems handle big data challenges.
Understanding these techniques is important when you're evaluating any NoSQL system. Knowing that a product will give you linear scaling with these techniques will help you not only select the right NoSQL system, but also set it up and configure it correctly.
6.8.1 Moving queries to the data, not data to the queries
With the exception of large graph databases, most NoSQL systems use commodity
processors that each hold a subset of the data on their local shared-nothing drives.
When a client wants to send a general query to all nodes that hold data, it's more