The first part of a MapReduce job is the map operation. Map operations retrieve data from your source database and convert it into a series of independent transform operations that can be executed on different processors. The output of all map operations is a set of key-value pairs, where the keys are uniform across all input documents. The second phase is the reduce operation. The reduce operation takes the key-value pairs created in the map phase as input, performs the requested operation on each group of values that shares a key, and returns the values you need.
When creating a MapReduce program, you must ensure that the map function depends only on its inputs and has no side effects: it doesn't change the state of any data; it only returns key-value pairs. In MapReduce operations, no other intermediate information can be passed between map functions.
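The two phases can be illustrated with a minimal single-process sketch using word count as the example problem. The function names, the shuffle implementation, and the sample documents are illustrative only; a real framework like Hadoop distributes these steps across many nodes. Note that the map function depends only on its input document and communicates nothing except the key-value pairs it emits:

```python
from collections import defaultdict

def map_fn(document):
    """Map phase: a pure function of its input that emits (key, value) pairs."""
    for word in document.split():
        yield (word.lower(), 1)

def reduce_fn(key, values):
    """Reduce phase: combine all values that share a key into one result."""
    return (key, sum(values))

def map_reduce(documents):
    # Shuffle step: group every emitted value under its key, so each
    # distinct key reaches exactly one reduce call.
    groups = defaultdict(list)
    for doc in documents:
        for key, value in map_fn(doc):
            groups[key].append(value)
    # Reduce step: one call per distinct key.
    return dict(reduce_fn(k, vs) for k, vs in groups.items())

counts = map_reduce(["the quick fox", "the lazy dog"])
# counts["the"] == 2
```

Because each map call is independent, a framework can run the map calls on any node holding the data and rerun a failed call without affecting the others.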
At first glance, it may seem like creating a MapReduce framework would be simple.
Realistically, it's not. First, what if your source data is replicated on three or more
nodes? Do you move the data between nodes? Not if you want your job to be efficient.
Then you must consider which node the map function should run on. How do you
assign the right key to the right reduce processor? What happens if one of the map or
reduce jobs fails in mid-operation? Do you need to restart the entire batch or can you
reassign the work to another node? As you can see, there are many factors to consider, and in the end it's not as simple as it appears.
The good news is that if you stick to these rules, a MapReduce framework like Hadoop can do most of the hard work: finding the right processor to run each map, making sure the right reduce node gets the input based on the keys, and making sure the job finishes even if there's a hardware failure during the job.
Now that we've covered the types of big data problems and some of the architecture patterns, let's look into the strategies that NoSQL systems use to attack these problems.
6.8 Four ways that NoSQL systems handle big data problems
As you've seen, understanding your big data is important in determining the best solution. Now let's take a look at four of the most popular ways NoSQL systems handle big data challenges.
Understanding these techniques is important when you're evaluating any NoSQL system. Knowing that a product will give you linear scaling with these techniques will help you not only select the right NoSQL system, but also set it up and configure it correctly.
6.8.1 Moving queries to the data, not data to the queries
With the exception of large graph databases, most NoSQL systems use commodity
processors that each hold a subset of the data on their local shared-nothing drives.
When a client wants to send a general query to all nodes that hold data, it's more