Integration with Hadoop - Mastering Apache Cassandra

Hadoop in a Cassandra cluster

A production deployment of the Hadoop and Cassandra combination needs to keep Hadoop away from the nodes that serve end users. The first obvious issue is that you probably wouldn't want Hadoop jobs to keep polling those Cassandra nodes and hampering the performance that end users see. The general pattern to avoid this is to split the ring into two data centers. Cassandra replicates every write to both data centers automatically, so the two stay in sync, subject to the usual replication lag. What's more, you can designate one data center as transactional, with a higher replication factor, and the other as analytical, with a replication factor of 1. Hadoop then works only against the analytical data center and leaves the transactional data center unaffected.

Now, you do not really need two physically separate data centers to make this configuration work. Remember NetworkTopologyStrategy? (Refer to Chapter 3, Effective CQL.) You can trick Cassandra into thinking there are two data centers simply by assigning the nodes that you want to use for analytics to a different data center. You may need to use PropertyFileSnitch and specify the data center details in a cassandra-topology.properties file. Your keyspace creation then looks something like this:

create keyspace myKeyspace
  with placement_strategy = 'NetworkTopologyStrategy'
  and strategy_options = {TX_DC : 2, HDP_DC : 1};

The previous statement defines two data centers: TX_DC for transactional purposes and HDP_DC for analytics in Hadoop. The snitch is then configured through a cassandra-topology.properties file, kept identical on every node, that looks like this:

# Transaction Data Center
192.168.1.1=TX_DC:RAC1
192.168.1.2=TX_DC:RAC1
192.168.2.1=TX_DC:RAC2

# Analytics Data Center
192.168.1.3=HDP_DC:RAC1
192.168.2.2=HDP_DC:RAC2
192.168.2.3=HDP_DC:RAC2
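Before applying a layout like this, it can help to check that each data center named in the keyspace has at least as many nodes as replicas requested. The following is a minimal Python sketch of such a check, not part of Cassandra or of the original text. The helper names (nodes_by_data_center, undersized_data_centers, cql3_create_keyspace) are hypothetical, and the topology text is simply the example above pasted into a string. The last helper prints what the CQL 3 form of the same keyspace definition would look like, on the assumption that you are using cqlsh rather than the older CLI.

```python
from collections import defaultdict

# Topology text copied from the example above. In practice this would be
# read from the cassandra-topology.properties file on disk.
TOPOLOGY = """\
# Transaction Data Center
192.168.1.1=TX_DC:RAC1
192.168.1.2=TX_DC:RAC1
192.168.2.1=TX_DC:RAC2
# Analytics Data Center
192.168.1.3=HDP_DC:RAC1
192.168.2.2=HDP_DC:RAC2
192.168.2.3=HDP_DC:RAC2
"""


def nodes_by_data_center(text):
    """Group node addresses by data center name (hypothetical helper)."""
    groups = defaultdict(list)
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        address, _, location = line.partition("=")
        # Skip the optional fallback entry for unlisted nodes.
        if address.strip() == "default":
            continue
        data_center, _, _rack = location.partition(":")
        groups[data_center.strip()].append(address.strip())
    return dict(groups)


def undersized_data_centers(groups, replication):
    """Return data centers that ask for more replicas than they have nodes."""
    return [dc for dc, rf in replication.items()
            if rf > len(groups.get(dc, []))]


def cql3_create_keyspace(name, replication):
    """Build the CQL 3 form of the keyspace definition (assumes cqlsh)."""
    options = ["'class': 'NetworkTopologyStrategy'"]
    options += [f"'{dc}': {rf}" for dc, rf in replication.items()]
    return f"CREATE KEYSPACE {name} WITH replication = {{{', '.join(options)}}};"


if __name__ == "__main__":
    replication = {"TX_DC": 2, "HDP_DC": 1}
    groups = nodes_by_data_center(TOPOLOGY)
    for dc, nodes in groups.items():
        print(dc, len(nodes), "nodes")
    print("undersized:", undersized_data_centers(groups, replication))
    print(cql3_create_keyspace("myKeyspace", replication))
```

With the example topology, both data centers have three nodes, so neither replication factor exceeds the node count. The check is deliberately simple: it ignores racks and does not contact the cluster.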
