single point of failure whereas HDFS is not. Another key difference is that Hadoop follows a master-slave architecture, whereas Cassandra is peer-to-peer. Since solutions built over Cassandra are scalable and rely on Cassandra-specific features (such as secondary indexes, composite columns, and so on), we may still need to perform batch analytics with MapReduce directly over Cassandra. In this recipe, we discuss the same tweet count example, using Cassandra as both the input and output format.
The program takes a user name as an input argument (the default value is mevivs), for which the number of tweets is calculated.
1. We need to prepare the data definition first. Let's create a keyspace tweet_keyspace and the column families tweetstore and tweetcount via cqlsh. Here, tweetstore will store raw tweets, whereas the count for a specific user will be stored in the tweetcount column family.
// create keyspace.
create keyspace tweet_keyspace with
  replication={'class': 'SimpleStrategy', 'replication_factor': 3};
use tweet_keyspace;
// create input column family.
create table tweetstore(tweet_id timeuuid PRIMARY KEY,
  user text, tweeted_at timestamp, body text);
// update the column family from cassandra-cli (the Thrift way)
// to enable a secondary index over user.
update column family tweetstore with
  column_metadata=[{column_name: 'user', validation_class: 'UTF8Type', index_type: KEYS},
  {column_name: 'body', validation_class: 'UTF8Type'},
  {column_name: 'tweeted_at', validation_class: 'DateType'}];
// create output column family via cqlsh.
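The CREATE TABLE statement for the output column family is cut off in this excerpt. A plausible definition, assuming the column names user and tweet_count (these names are not confirmed by the original text), would be:

// assumed schema: one row per user, holding that user's tweet count.
create table tweetcount(user text PRIMARY KEY, tweet_count int);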
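To make the idea of using Cassandra as both the Hadoop input and output format concrete, here is a minimal, self-contained sketch of the tweet count job built on the CQL3 Hadoop classes that ship with Cassandra (CqlPagingInputFormat, CqlOutputFormat, ConfigHelper, CqlConfigHelper). It is not the recipe's actual code: the class names TweetCountJob, TweetMapper, and TweetReducer, the localhost contact point, the Murmur3 partitioner, and the tweetcount columns user and tweet_count are illustrative assumptions.

import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.cassandra.hadoop.cql3.CqlConfigHelper;
import org.apache.cassandra.hadoop.cql3.CqlOutputFormat;
import org.apache.cassandra.hadoop.cql3.CqlPagingInputFormat;
import org.apache.cassandra.utils.ByteBufferUtil;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class TweetCountJob {

    // Emits (user, 1) for every row read from the tweetstore column family.
    public static class TweetMapper
            extends Mapper<Map<String, ByteBuffer>, Map<String, ByteBuffer>, Text, IntWritable> {
        @Override
        protected void map(Map<String, ByteBuffer> keys, Map<String, ByteBuffer> columns,
                           Context context) throws IOException, InterruptedException {
            ByteBuffer user = columns.get("user");
            if (user != null) {
                context.write(new Text(ByteBufferUtil.string(user)), new IntWritable(1));
            }
        }
    }

    // Sums the 1s per user and writes the total into the tweetcount column family.
    public static class TweetReducer
            extends Reducer<Text, IntWritable, Map<String, ByteBuffer>, List<ByteBuffer>> {
        @Override
        protected void reduce(Text user, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) {
                sum += c.get();
            }
            Map<String, ByteBuffer> keys = new LinkedHashMap<String, ByteBuffer>();
            keys.put("user", ByteBufferUtil.bytes(user.toString()));
            List<ByteBuffer> variables = new ArrayList<ByteBuffer>();
            variables.add(ByteBufferUtil.bytes(sum)); // assumes tweet_count is an int column
            context.write(keys, variables);
        }
    }

    public static void main(String[] args) throws Exception {
        String user = (args.length > 0) ? args[0] : "mevivs"; // default user, as in the recipe

        Job job = Job.getInstance(new Configuration(), "tweet count for " + user);
        job.setJarByClass(TweetCountJob.class);
        job.setMapperClass(TweetMapper.class);
        job.setReducerClass(TweetReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Map.class);
        job.setOutputValueClass(List.class);

        // Cassandra as the input format: read raw tweets from tweet_keyspace.tweetstore.
        job.setInputFormatClass(CqlPagingInputFormat.class);
        ConfigHelper.setInputColumnFamily(job.getConfiguration(), "tweet_keyspace", "tweetstore");
        ConfigHelper.setInputInitialAddress(job.getConfiguration(), "localhost");
        ConfigHelper.setInputPartitioner(job.getConfiguration(),
                "org.apache.cassandra.dht.Murmur3Partitioner");
        // Push the user filter down to Cassandra; this relies on the secondary
        // index over 'user' created in step 1.
        CqlConfigHelper.setInputWhereClauses(job.getConfiguration(), "user='" + user + "'");

        // Cassandra as the output format: write the total into tweet_keyspace.tweetcount.
        job.setOutputFormatClass(CqlOutputFormat.class);
        ConfigHelper.setOutputColumnFamily(job.getConfiguration(), "tweet_keyspace", "tweetcount");
        ConfigHelper.setOutputInitialAddress(job.getConfiguration(), "localhost");
        ConfigHelper.setOutputPartitioner(job.getConfiguration(),
                "org.apache.cassandra.dht.Murmur3Partitioner");
        CqlConfigHelper.setOutputCql(job.getConfiguration(),
                "UPDATE tweet_keyspace.tweetcount SET tweet_count = ?");

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The where-clause push-down in this sketch is also the reason step 1 builds a secondary index over user: with the index in place, Cassandra can hand the job only the requested user's rows instead of the full tweetstore column family.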