Integration with Hadoop - Mastering Apache Cassandra

Summary

We can now store large volumes of data in Cassandra and run MapReduce jobs over it for analysis. We can also set up Hadoop so that it does not degrade Cassandra's transactional workload. We know how to set up Pig for those who want to assemble an analysis quickly instead of writing lengthy Java code. We can also back Solr searches with Cassandra, making Solr more scalable than it already is.

With a plethora of analytical tooling on the market, Cassandra may or may not be the right choice. You may instead need stream analysis, which does not require data to be stored first and analyzed later. For example, if you want to apply multiple operations to live streaming tweets and show the results immediately, a tool such as Twitter Storm (now Apache Storm) is a better fit. Although there is no specific project to guide you through this, it is fairly simple to configure a Twitter stream as a Storm spout. The spout emits the tweet stream to the first bolt, which processes each tweet and forwards it to the next bolt; the final bolt can use the Cassandra Java driver to store the result. You may want to put a queue between the bolt and Cassandra as a buffer if tweets arrive faster than Cassandra can absorb them, but normally you won't need one.
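The buffering idea described above can be sketched with nothing more than the Java standard library. This is a minimal illustration under stated assumptions, not a Storm topology: the class `BufferedSinkSketch`, the method `pump`, and the `sink` callback are hypothetical names, the upper-casing step merely stands in for whatever processing a bolt would do, and in a real deployment `sink` would wrap a Cassandra Java driver call that executes an INSERT.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Consumer;

// Sketch: a bounded queue buffering results between a producer (standing in
// for the last bolt) and a single writer thread (standing in for Cassandra).
class BufferedSinkSketch {
    // Sentinel object, compared by identity, that tells the writer to stop.
    private static final String END = new String("__END__");

    // Pushes every processed tweet through a bounded queue to `sink`.
    // `sink` is hypothetical: a real one would issue the Cassandra write.
    static void pump(List<String> tweets, int capacity, Consumer<String> sink) {
        BlockingQueue<String> buffer = new ArrayBlockingQueue<>(capacity);

        // Writer: drains the buffer in arrival order until the sentinel shows up.
        Thread writer = new Thread(() -> {
            try {
                for (String item = buffer.take(); item != END; item = buffer.take()) {
                    sink.accept(item);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        writer.start();

        try {
            for (String tweet : tweets) {
                // put() blocks while the buffer is full, so a slow writer
                // applies back-pressure instead of losing results.
                buffer.put(tweet.toUpperCase(Locale.ROOT));
            }
            buffer.put(END);
            writer.join(); // wait until everything buffered has been written
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while buffering", e);
        }
    }

    public static void main(String[] args) {
        List<String> stored = new ArrayList<>();
        pump(List.of("storm is fast", "cassandra scales"), 1, stored::add);
        System.out.println(stored);
    }
}
```

The bounded capacity is the design choice that matters here: an unbounded queue would hide a slow writer until memory ran out, whereas a bounded one slows the producer down. A production buffer would more likely be an external message queue, and the sketch does not handle a sink that throws.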

Cassandra is a rapidly developing project. Changes and new features in this open source project arrive roughly every six months, a pace that many big-name proprietary applications do not match. You get a faster, stronger, and better Cassandra for free (though, of course, not without technical debt) every half year. While this is a great thing, it comes with a pain point: new learning. To be able to upgrade, you will need to learn new ways of doing things. Some changes may require you to modify your code to keep up with Cassandra. Most of the time, though, you can simply upgrade Cassandra and things will work as expected.