[Figure 6.7 image: the left panel shows a master-slave cluster in which all
requests go to a single master node, with a standby master used only if the
primary master fails; the right panel shows a peer-to-peer cluster of
interconnected nodes, any of which can accept requests.]
Figure 6.7 Master-slave versus peer-to-peer—the panel on the left
illustrates a master-slave configuration where all incoming database
requests (reads or writes) are sent to a single master node and
redistributed from there. The master node is called the NameNode in
Hadoop. This node keeps a database of all the other nodes in the
cluster and the rules for distributing requests to each node. The panel
on the right shows how the peer-to-peer model stores all the information
about the cluster on each node in the cluster. If any node crashes, the
other nodes can take over and processing can continue.
Let's look at the trade-offs. With a master-slave distribution model, the job
of managing the cluster falls to a single master node. This node can run on
specialized hardware, such as RAID drives, to lower the probability that it
crashes. The cluster can also be configured with a standby master that's
continually updated from the master node. The challenge with this option is
that it's difficult to test the standby master without jeopardizing the health
of the cluster. Failure of the standby master to take over from the master
node is a real concern for high-availability operations.
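To make the single point of control concrete, here's a minimal sketch of
master-slave routing. The Node and Master classes are hypothetical names
invented for this example, not the API of Hadoop or any other product; the
point is that all cluster state and routing logic live in one place.

    class Node:
        """A worker node that stores and serves part of the data."""
        def __init__(self, name):
            self.name = name

        def handle(self, request):
            return f"{self.name} handled {request!r}"

    class Master:
        """Holds the only authoritative node list and routes every request."""
        def __init__(self, nodes):
            self.nodes = nodes  # cluster state lives solely on the master

        def route(self, request):
            # All routing decisions happen here, which is why the master
            # is a single point of failure.
            node = self.nodes[hash(request) % len(self.nodes)]
            return node.handle(request)

    nodes = [Node(f"node-{i}") for i in range(3)]
    master = Master(nodes)
    standby = Master(nodes)  # continually updated in a real deployment

    print(master.route("get user:42"))

    # If the primary master fails, the standby must be promoted; this is
    # the failover path that's hard to test without risking the cluster.
    master = standby
    print(master.route("get user:42"))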
Peer-to-peer systems distribute the master's responsibilities across every
node in the cluster. In this situation, testing is much easier: you can remove
any node from the cluster and the remaining nodes will continue to function.
The disadvantage of peer-to-peer networks is the added complexity and
communication overhead required to keep every node up to date with the
cluster status.
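By contrast, here's a minimal sketch of the peer-to-peer model, again using a
hypothetical Peer class invented for illustration rather than any real
library. Because every peer carries a full copy of the membership list, a
request can be handed to any live node, and losing one node doesn't stop
routing.

    class Peer:
        """A node that keeps its own complete copy of the cluster state."""
        def __init__(self, name):
            self.name = name
            self.alive = True
            self.membership = []  # every peer stores the full member list

        def route(self, request):
            # Any live peer can route a request from its own cluster view.
            live = [p for p in self.membership if p.alive]
            target = live[hash(request) % len(live)]
            return f"{target.name} handled {request!r}"

    peers = [Peer(f"peer-{i}") for i in range(3)]
    for p in peers:
        p.membership = peers  # kept in sync by gossip in a real system

    peers[0].alive = False                # crash any one node...
    print(peers[1].route("get user:42"))  # ...and the others keep working

The price of this resilience is exactly the overhead mentioned above: the
constant chatter needed to keep every peer's membership list consistent as
nodes join, leave, and fail.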
The initial versions of Hadoop (frequently referred to as the 1.x versions) were
designed to use a master-slave architecture with the NameNode of a cluster being
responsible for managing the status of the cluster. NameNodes usually don't deal with
any MapReduce data themselves. Their job is to manage and distribute queries to the
correct nodes on the cluster. Hadoop 2.x versions are designed to remove single
points of failure from a Hadoop cluster.
Choosing the right distribution model depends on your business requirements:
if high availability is a concern, a peer-to-peer network might be the best
solution. If you can manage your big data with batch jobs that run in off
hours, the simpler master-slave model might be best. As we move to the next
section, you'll see how MapReduce systems can be used in multiprocessor
configurations to process your big data.