be able to detect the additional machines and evenly distribute some of our existing data to these machines accordingly. Because the data is distributed across nodes, incoming database requests can be automatically routed to the correct node on the network.
This ability to simply add individual machines to a pool without worrying too much about configuring new application logic is known as linear scalability, and in practice, it can be difficult to achieve. Anytime a single piece of data needs to be accessed by more than one machine, there is potential for bottlenecks to appear. For example, if one machine is writing a piece of data, and another wants to do the same thing at the same time, the result is a resource conflict. These problems are challenging, but luckily there are a variety of strategies available for distributing, or sharding, data across many machines.
One way to shard data across multiple Redis instances is to decide on a key range beforehand. This is the simplest approach, but it has a drawback: it is not robust as the data grows. For example, imagine that your application collects the latest scores from thousands of players of an online game. If you have several instances of Redis, your application might be instructed to send scores from usernames starting with A through C to one instance, scores from players with names starting with D through F to another instance, and so on.
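A minimal sketch of this key-range approach follows. The instance names and letter ranges here are hypothetical (two instances splitting the alphabet in half); the point is only that the routing rule is fixed in application code, which is why repartitioning becomes painful as data grows.

```python
# Hypothetical key-range sharding: route each username to a Redis
# instance based on its first letter. Adding a third instance would
# force us to rewrite these ranges and migrate existing data.
SHARDS = [
    ("redis-a", ("A", "M")),  # usernames starting A through M
    ("redis-b", ("N", "Z")),  # usernames starting N through Z
]

def shard_for(username):
    """Return the name of the shard responsible for this username."""
    first = username[0].upper()
    for shard_name, (low, high) in SHARDS:
        if low <= first <= high:
            return shard_name
    raise ValueError("no shard covers username %r" % username)
```

The application would open one connection per shard and call `shard_for` before every read or write to pick the right connection.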
Automatic Partitioning with Twemproxy
Underneath it all, Redis is really designed to be a performant single-server database. Although the fact that Redis is a key-value data store makes it a bit easier to distribute an entire dataset across various instances in a cluster pool, we still need to choose and implement some kind of sharding strategy, as described earlier, to make it work. As of this writing, the developers of Redis have been working on a native, fault-tolerant version of the standalone server that allows for automatic cluster management.
In this example, we will demonstrate an open-source technology developed at Twitter, called Twemproxy (originally called nutcracker), to help partition our data among a pool of Redis instances, running on a single machine or across multiple machines. Twemproxy accepts requests from clients and uses a configured hashing function to decide which instance in the pool is responsible for handling each request. Twemproxy can speak not only to Redis instances but also to Memcached, another popular in-memory key-value store that is often used as a data cache for high-traffic applications. According to the Redis development team, Twemproxy is the recommended way to shard data among multiple Redis instances.
Twemproxy also handles failure scenarios. If a particular instance in a pool of Redis machines is down, Twemproxy can be instructed to wait a short time before retrying the request. It can also be instructed to eject nodes from the pool if they are down due to failure.
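As a sketch of how this might be configured, a Twemproxy pool definition is written in YAML. The pool name, addresses, and ports below are hypothetical; the option names follow the nutcracker configuration format, with timeouts given in milliseconds.

```yaml
scores:
  listen: 127.0.0.1:22121        # address Twemproxy exposes to clients
  hash: fnv1a_64                 # hashing function applied to each key
  distribution: ketama           # consistent hashing across the pool
  redis: true                    # speak the Redis protocol, not Memcached
  auto_eject_hosts: true         # drop unreachable servers from the pool
  server_retry_timeout: 30000    # wait 30s before retrying an ejected server
  server_failure_limit: 3        # eject after 3 consecutive failures
  servers:                       # the Redis instances behind this pool
    - 127.0.0.1:6379:1
    - 127.0.0.1:6380:1
```

Clients then connect to Twemproxy on port 22121 as if it were a single Redis server, and the proxy routes each command to the appropriate backend.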
When Twemproxy receives a request to get or set the value for a particular key, how does it know which machine to contact? Twemproxy supports a variety of hashing