single point of failure whereas HDFS is not. Another key difference is that Hadoop follows a master-slave architecture, whereas Cassandra is peer-to-peer. Since solutions built over Cassandra are scalable and rely on Cassandra-specific features (such as secondary indexes, composite columns, and so on), we may still need to perform batch analytics with MapReduce directly over Cassandra. In this recipe, we discuss the same tweet count example, using Cassandra as both the input and output format.
The program takes a user name as an input argument (the default value is mevivs), for which the number of tweets is calculated.
1. We need to prepare the data definition first. Let's create a keyspace tweet_keyspace and the column families tweetstore and tweetcount via cqlsh. Here, tweetstore will store raw tweets, whereas the count for a specific user will be stored in the tweetcount column family.
// create keyspace.
create keyspace tweet_keyspace with
  replication={'class': 'SimpleStrategy', 'replication_factor': 3};
use tweet_keyspace;
// create input column family.
create table tweetstore(tweet_id timeuuid PRIMARY KEY,
  user text, tweeted_at timestamp, body text);
// update the column family from cassandra-cli (the Thrift way)
// to enable a secondary index over user.
update column family tweetstore with
  column_metadata=[{column_name: 'user', validation_class: 'UTF8Type', index_type: KEYS},
  {column_name: 'body', validation_class: 'UTF8Type'},
  {column_name: 'tweeted_at', validation_class: 'DateType'}];
// create output column family via cqlsh.
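The CREATE TABLE statement for the output column family is cut off in this excerpt. A plausible definition, assuming the column names user and tweet_count (these names are not confirmed by the original text), would be:

// assumed schema: one row per user, holding that user's tweet count.
create table tweetcount(user text PRIMARY KEY, tweet_count int);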
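To make the idea of using Cassandra as both the Hadoop input and output format concrete, here is a minimal, self-contained sketch of the tweet count job built on the CQL3 Hadoop classes that ship with Cassandra (CqlPagingInputFormat, CqlOutputFormat, ConfigHelper, CqlConfigHelper). It is not the recipe's actual code: the class names TweetCountJob, TweetMapper, and TweetReducer, the localhost contact point, the Murmur3 partitioner, and the tweetcount columns user and tweet_count are illustrative assumptions.

import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.cassandra.hadoop.cql3.CqlConfigHelper;
import org.apache.cassandra.hadoop.cql3.CqlOutputFormat;
import org.apache.cassandra.hadoop.cql3.CqlPagingInputFormat;
import org.apache.cassandra.utils.ByteBufferUtil;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class TweetCountJob {

    // Emits (user, 1) for every row read from the tweetstore column family.
    public static class TweetMapper
            extends Mapper<Map<String, ByteBuffer>, Map<String, ByteBuffer>, Text, IntWritable> {
        @Override
        protected void map(Map<String, ByteBuffer> keys, Map<String, ByteBuffer> columns,
                           Context context) throws IOException, InterruptedException {
            ByteBuffer user = columns.get("user");
            if (user != null) {
                context.write(new Text(ByteBufferUtil.string(user)), new IntWritable(1));
            }
        }
    }

    // Sums the 1s per user and writes the total into the tweetcount column family.
    public static class TweetReducer
            extends Reducer<Text, IntWritable, Map<String, ByteBuffer>, List<ByteBuffer>> {
        @Override
        protected void reduce(Text user, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) {
                sum += c.get();
            }
            Map<String, ByteBuffer> keys = new LinkedHashMap<String, ByteBuffer>();
            keys.put("user", ByteBufferUtil.bytes(user.toString()));
            List<ByteBuffer> variables = new ArrayList<ByteBuffer>();
            variables.add(ByteBufferUtil.bytes(sum)); // assumes tweet_count is an int column
            context.write(keys, variables);
        }
    }

    public static void main(String[] args) throws Exception {
        String user = (args.length > 0) ? args[0] : "mevivs"; // default user, as in the recipe

        Job job = Job.getInstance(new Configuration(), "tweet count for " + user);
        job.setJarByClass(TweetCountJob.class);
        job.setMapperClass(TweetMapper.class);
        job.setReducerClass(TweetReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Map.class);
        job.setOutputValueClass(List.class);

        // Cassandra as the input format: read raw tweets from tweet_keyspace.tweetstore.
        job.setInputFormatClass(CqlPagingInputFormat.class);
        ConfigHelper.setInputColumnFamily(job.getConfiguration(), "tweet_keyspace", "tweetstore");
        ConfigHelper.setInputInitialAddress(job.getConfiguration(), "localhost");
        ConfigHelper.setInputPartitioner(job.getConfiguration(),
                "org.apache.cassandra.dht.Murmur3Partitioner");
        // Push the user filter down to Cassandra; this relies on the secondary
        // index over 'user' created in step 1.
        CqlConfigHelper.setInputWhereClauses(job.getConfiguration(), "user='" + user + "'");

        // Cassandra as the output format: write the total into tweet_keyspace.tweetcount.
        job.setOutputFormatClass(CqlOutputFormat.class);
        ConfigHelper.setOutputColumnFamily(job.getConfiguration(), "tweet_keyspace", "tweetcount");
        ConfigHelper.setOutputInitialAddress(job.getConfiguration(), "localhost");
        ConfigHelper.setOutputPartitioner(job.getConfiguration(),
                "org.apache.cassandra.dht.Murmur3Partitioner");
        CqlConfigHelper.setOutputCql(job.getConfiguration(),
                "UPDATE tweet_keyspace.tweetcount SET tweet_count = ?");

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The where-clause push-down in this sketch is also the reason step 1 builds a secondary index over user: with the index in place, Cassandra can hand the job only the requested user's rows instead of the full tweetstore column family.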