Databases Reference
In-Depth Information
Block 1 is stored
on two servers on
rack 1 and one
server on rack 2.
Block 2 is stored
on two servers on
rack 2 and one
server on rack 1.
1
2
1
2
Server 1
Server 3
1
2
Server 2
Server 4
Rack 1
Rack 2
Figure 8.3 HDFS is designed to have rack awareness so that you can instruct data
blocks to be spread onto multiple racks, which could be located in multiple data
centers. In this example, all blocks are stored on three separate servers (replication
factor of 3) and HDFS spreads the blocks over two racks. If either rack becomes
unavailable, there will always be a replica of both block 1 and block 2.
physically stored in the same rack usually have higher bandwidth connectivity between
the nodes and using this network keeps data off of other shared networks. This is
shown in figure 8.3.
One of the advantages of rack awareness is that you can increase your availability
by carefully distributing HDFS blocks on different racks.
We've also discussed how NoSQL systems move the queries to the data, not the
data to the query. Since the data is stored on three nodes, which node should run the
query? The answer is usually the least-busy node. How does the query distribution sys-
tem know what nodes hold the data? This is where there must be a tight coupling of
the filesystem and the query distribution system. The information about what node
the data is on must be communicated to the client program.
The primary disadvantage of using an external filesystem is that your database may
not be as portable on operating systems that don't support these filesystems. For
example, HDFS is usually run on UNIX or Linux operating systems. If you want to
deploy HBase—which is designed to run on HDFS —you may have more hoops to
jump through to get HDFS to run on a Windows system. Using a virtual machine is one
way to do this, but there can be a performance penalty with using virtual machines.
We should note that you can get the same replication factor by using an off-the-
shelf storage area network ( SAN ). What you won't get with this configuration is an easy
way to keep the query and the data on the same server. Using a SAN for high-availability
databases can work for small datasets, but larger datasets will result in excessive net-
work traffic. In the long run, the Hadoop architecture of shared-nothing processors all
working on their own copy of a large dataset is the most scalable solution.
Setting up a Hadoop cluster can be a great way to make sure your NoSQL database
has both high availability and scale-out performance. Early versions of Hadoop (usu-
ally referred to as version 1.x) had a single point of failure on the NameNode service.
For high availability, Hadoop used a secondary failover node that was automatically
Search WWH ::




Custom Search