Building high-availability solutions with NoSQL - Making Sense of NoSQL


[Figure 8.3: diagram of four servers (Server 1 through Server 4) spread across Rack 1 and Rack 2. Block 1 is stored on two servers on rack 1 and one server on rack 2. Block 2 is stored on two servers on rack 2 and one server on rack 1.]

Figure 8.3 HDFS is designed to have rack awareness so that you can instruct data blocks to be spread onto multiple racks, which could be located in multiple data centers. In this example, each block is stored on three separate servers (replication factor of 3) and HDFS spreads the blocks over two racks. If either rack becomes unavailable, there will still be a replica of both block 1 and block 2.

physically stored in the same rack usually have higher bandwidth connectivity between the nodes, and using this network keeps data off other shared networks. This is shown in figure 8.3.

One of the advantages of rack awareness is that you can increase your availability by carefully distributing HDFS blocks on different racks.
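
The placement in figure 8.3 can be sketched in a few lines of Python. This is only an illustration of the idea, not HDFS's actual placement policy: the server-to-rack mapping, the names `RACK_OF`, `place_block`, and `survives_rack_loss`, and the "two replicas on the home rack, one elsewhere" rule are all assumptions chosen to mirror the figure.

```python
from collections import defaultdict

# Hypothetical topology (assumed layout): which rack each server sits in.
RACK_OF = {
    "server1": "rack1", "server2": "rack1",
    "server3": "rack2", "server4": "rack2",
}

def place_block(home_rack, rack_of=RACK_OF, replication=3):
    """Put all but one replica on home_rack and the last one on another rack."""
    by_rack = defaultdict(list)
    for server, rack in sorted(rack_of.items()):
        by_rack[rack].append(server)
    local = by_rack[home_rack][:replication - 1]
    other_rack = next(r for r in sorted(by_rack) if r != home_rack)
    remote = by_rack[other_rack][:1]
    return local + remote

def survives_rack_loss(placement, lost_rack, rack_of=RACK_OF):
    """True if at least one replica lives outside the rack that failed."""
    return any(rack_of[server] != lost_rack for server in placement)

placements = {
    "block1": place_block("rack1"),
    "block2": place_block("rack2"),
}
```

Because every placement spans both racks, `survives_rack_loss` returns `True` for either block no matter which single rack is lost, which is the availability property the figure describes.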

We've also discussed how NoSQL systems move the queries to the data, not the data to the query. Since the data is stored on three nodes, which node should run the query? The answer is usually the least-busy node. How does the query distribution system know which nodes hold the data? This is where the filesystem and the query distribution system must be tightly coupled: the information about which node holds the data must be communicated to the client program.
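
As a rough sketch of that coupling, assume the filesystem exposes a block-to-nodes map and the scheduler tracks a load figure per node. The names `pick_node`, `block_locations`, and `node_load`, and the sample values, are hypothetical and not part of any Hadoop API.

```python
def pick_node(block_id, block_locations, node_load):
    """Return the least-busy node among those that hold a replica of block_id."""
    replicas = block_locations[block_id]
    return min(replicas, key=lambda node: node_load[node])

# Hypothetical metadata the filesystem would report to the query layer.
block_locations = {
    "block1": ["server1", "server2", "server3"],
    "block2": ["server3", "server4", "server1"],
}

# Hypothetical load figures, for example the number of running tasks per node.
node_load = {"server1": 7, "server2": 4, "server3": 2, "server4": 0}

chosen = pick_node("block1", block_locations, node_load)
```

Note that `server4` is the idlest node overall but is never chosen for `block1`, because it holds no replica of that block: the query goes only where the data already is.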

The primary disadvantage of using an external filesystem is that your database may not be as portable to operating systems that don't support these filesystems. For example, HDFS is usually run on UNIX or Linux operating systems. If you want to deploy HBase, which is designed to run on HDFS, you may have more hoops to jump through to get HDFS to run on a Windows system. Using a virtual machine is one way to do this, but virtual machines can carry a performance penalty.

We should note that you can get the same replication factor by using an off-the-shelf storage area network (SAN). What you won't get with this configuration is an easy way to keep the query and the data on the same server. Using a SAN for high-availability databases can work for small datasets, but larger datasets will result in excessive network traffic. In the long run, the Hadoop architecture of shared-nothing processors all working on their own copy of a large dataset is the most scalable solution.
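
A back-of-the-envelope model makes the traffic argument concrete. It is deliberately simplified, and the function `network_bytes` and the sizes used are hypothetical: it assumes a query on shared storage must pull every byte it scans across the network, while a query running next to its data ships only the result.

```python
def network_bytes(scan_bytes, result_bytes, data_is_local):
    """Bytes that cross the network for one query in this simplified model."""
    if data_is_local:
        return result_bytes               # only the answer travels
    return scan_bytes + result_bytes      # scanned data is fetched first

GIB = 1024 ** 3
scan = 500 * GIB     # hypothetical amount of data the query must read
result = 1 * GIB     # hypothetical size of the query result

san_traffic = network_bytes(scan, result, data_is_local=False)
local_traffic = network_bytes(scan, result, data_is_local=True)
ratio = san_traffic / local_traffic
```

In this model the gap grows with the amount of data scanned, which is why the shared-storage approach is tolerable for small datasets and costly for large ones.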

Setting up a Hadoop cluster can be a great way to make sure your NoSQL database has both high availability and scale-out performance. Early versions of Hadoop (usually referred to as version 1.x) had a single point of failure on the NameNode service. For high availability, Hadoop used a secondary failover node that was automatically
