Oseias Moraes | 334
Adeesh Fulay | 334
Sat Nov 05 21:29:20 IST 2011 | 334
Sun Jul 24 21:44:17 IST 2011 | 334
Manthita. | 334
ebooksdealofdaybot | 334
Wed Apr 23 19:19:49 IST 2014 | 334
The News Selector | 6680
Louise Corrigan | 334
Mon Mar 03 01:19:17 IST 2014 | 6680
22 Rows Returned.
The complete source code is available with the downloads for this chapter, and the
classes discussed are
com.apress.chapter5.mapreduce.twittercount.hdfs.TwitterHDFSCQLJob
com.apress.chapter5.mapreduce.twittercount.hdfs.TweetAggregator
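The downloadable TweetAggregator class is the authoritative reducer for this job. Purely as
an illustration of the contract that CqlOutputFormat expects from a reducer, a hypothetical
sketch follows; the partition key column name (user), the single bound variable feeding a
tweet_count column, and the int counter are assumptions, not the actual schema used in the
chapter's code.

// Hypothetical reducer sketch: sums per-user tweet counts and emits them in the
// Map<String, ByteBuffer> / List<ByteBuffer> form that CqlOutputFormat expects.
import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.cassandra.utils.ByteBufferUtil;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class TweetCountReducer
        extends Reducer<Text, IntWritable, Map<String, ByteBuffer>, List<ByteBuffer>> {

    @Override
    protected void reduce(Text user, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int total = 0;
        for (IntWritable c : counts) {
            total += c.get();
        }

        // Partition key column(s) of the output table, keyed by column name.
        Map<String, ByteBuffer> keys = new HashMap<>();
        keys.put("user", ByteBufferUtil.bytes(user.toString()));

        // Bound variables for the CQL statement configured on the job, e.g.
        // "UPDATE twitter_keyspace.tweetcount SET tweet_count = ?"; the record
        // writer appends the WHERE clause from the keys map.
        List<ByteBuffer> variables =
                Collections.singletonList(ByteBufferUtil.bytes(total));

        context.write(keys, variables);
    }
}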
In the next section we will discuss using Cassandra as both the input and output format
for MapReduce.
Cassandra In and Cassandra Out
Let's discuss running a MapReduce job where the input is fetched from Cassandra and
the output is also stored back into Cassandra.
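Before weighing the trade-offs, here is a minimal configuration sketch for such a job,
assuming a Cassandra 2.1-style CqlInputFormat (older releases ship CqlPagingInputFormat
instead). The contact point, keyspace, table, and column names are placeholders, and the
mapper and reducer classes are omitted; the chapter's downloadable source is the
definitive version.

// Hypothetical driver sketch: both the input and output tables live in Cassandra.
// Keyspace, table, and column names below are illustrative only.
import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.cassandra.hadoop.cql3.CqlConfigHelper;
import org.apache.cassandra.hadoop.cql3.CqlInputFormat;
import org.apache.cassandra.hadoop.cql3.CqlOutputFormat;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class CassandraInOutJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "tweet count: Cassandra in and out");
        job.setJarByClass(CassandraInOutJob.class);
        Configuration conf = job.getConfiguration();

        // Input side: read rows from a Cassandra table with CqlInputFormat.
        job.setInputFormatClass(CqlInputFormat.class);
        ConfigHelper.setInputInitialAddress(conf, "localhost");
        ConfigHelper.setInputPartitioner(conf, "Murmur3Partitioner");
        ConfigHelper.setInputColumnFamily(conf, "twitter_keyspace", "tweets");
        // Page size controls how many CQL rows are fetched per request.
        CqlConfigHelper.setInputCQLPageRowSize(conf, "100");

        // Mapper and reducer classes (job.setMapperClass/setReducerClass) are omitted;
        // the record types the mapper receives depend on the Cassandra version in use.

        // Output side: write aggregated results back to Cassandra via a bound CQL update.
        job.setOutputFormatClass(CqlOutputFormat.class);
        ConfigHelper.setOutputInitialAddress(conf, "localhost");
        ConfigHelper.setOutputPartitioner(conf, "Murmur3Partitioner");
        ConfigHelper.setOutputColumnFamily(conf, "twitter_keyspace", "tweetcount");
        CqlConfigHelper.setOutputCql(conf,
                "UPDATE twitter_keyspace.tweetcount SET tweet_count = ?");

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}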
So far we have seen that MapReduce jobs can run over the default HDFS as well as
over an external data store such as Cassandra. You may be wondering which one to
adopt, and why. It depends on the use case. For example, if an application has already
been built using various Cassandra features, it is better to implement its
MapReduce-based batch analytics over Cassandra as well. There are also use cases
where HDFS is already used for storing raw data and the user does not want to migrate
it, but still wants to run a few MapReduce jobs and store the output in Cassandra.
Similarly, the user may want to migrate away from HDFS and its ecosystem (Hive, Pig,
and so forth) to a single database solution (i.e., Cassandra). One big difference we must
remember is that HDFS is a distributed file system, whereas Cassandra is a distributed
database. Cassandra is fault-tolerant and doesn't have a single point of failure.