Cassandra MapReduce Integration
In this section, we will read tweets (see the preceding section) from HDFS and discuss a
MapReduce program that computes tweet counts. Finally, the reduced output
will be stored in the Cassandra tweetcount column family. We will discuss MapReduce
over Cassandra with two recipes:
Reading tweets from HDFS and storing tweet counts into Cassandra.
Reading tweets from Cassandra and storing tweet counts into Cassandra.
For this example, we will be using the Thrift protocol to create the Cassandra
schema. MapReduce integration is available with Thrift, Cassandra's open-source RPC API,
and the sample exercises will also demonstrate MapReduce integration with Cassandra
in a CQL3 way.
Reading Tweets from HDFS and Storing Count Results into Cassandra
In this section we will read the tweets file previously stored in the HDFS directory
/apress/tweetdata (see the preceding section) and store the tweet count
per user and per date in Cassandra. Cassandra provides MapReduce support for both
Thrift and CQL3. We will explore both protocols, starting with Thrift.
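Before wiring up either protocol, it helps to see the counting step itself. The following is a minimal sketch of a Hadoop mapper for this job; the class name TweetCountMapper and the assumption that each line of /apress/tweetdata holds a user id, a tweet date, and the tweet text separated by tabs are illustrative only and depend on how the file was written in the preceding section.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper: each input line is assumed to be
// "<user>\t<tweet_date>\t<tweet text>". It emits ("<user>:<tweet_date>", 1)
// so the reducer can sum the tweet count per user and per date.
public class TweetCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text userAndDate = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split("\t");
        if (fields.length < 2) {
            return; // skip malformed lines
        }
        // Group by user and tweet date, e.g. "some_user:2014-06-11".
        userAndDate.set(fields[0] + ":" + fields[1]);
        context.write(userAndDate, ONE);
    }
}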
The Thrift Way
Let's explore it with the same Twitter example, where users and their tweets should be
stored and sorted by tweet_date; a sketch of the complete job follows the schema definition below.
1. First, we need to prepare the data definition. Let's create a column
family tweetcount using cassandra-cli:
// create keyspace.
create keyspace tweet_keyspace;
use tweet_keyspace;
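To round out the Thrift way, here is a minimal sketch of the reducer and job driver, reusing the TweetCountMapper sketched earlier. It assumes a single local Cassandra node on the default Thrift port 9160, the Murmur3Partitioner, and a tweetcount column family keyed by the user:date string with one count column per row; the keyspace name, column family layout, addresses, and class names are assumptions to adapt to your own cluster and schema.

import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.Collections;
import java.util.List;

import org.apache.cassandra.hadoop.ColumnFamilyOutputFormat;
import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.cassandra.thrift.Column;
import org.apache.cassandra.thrift.ColumnOrSuperColumn;
import org.apache.cassandra.thrift.Mutation;
import org.apache.cassandra.utils.ByteBufferUtil;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class TweetCountJob {

    // Reducer for ColumnFamilyOutputFormat: the output key is the Cassandra
    // row key and the output value is a list of Thrift Mutations for that row.
    public static class TweetCountReducer
            extends Reducer<Text, IntWritable, ByteBuffer, List<Mutation>> {

        @Override
        protected void reduce(Text userAndDate, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : counts) {
                sum += count.get();
            }

            // Store the total in a "count" column; the row key is the
            // assumed "user:date" string emitted by the mapper.
            Column column = new Column();
            column.setName(ByteBufferUtil.bytes("count"));
            column.setValue(ByteBufferUtil.bytes(sum));
            column.setTimestamp(System.currentTimeMillis());

            Mutation mutation = new Mutation();
            mutation.setColumn_or_supercolumn(new ColumnOrSuperColumn());
            mutation.getColumn_or_supercolumn().setColumn(column);

            context.write(ByteBufferUtil.bytes(userAndDate.toString()),
                          Collections.singletonList(mutation));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "tweetcount");
        job.setJarByClass(TweetCountJob.class);

        // Read the tweet file previously stored on HDFS.
        job.setInputFormatClass(TextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path("/apress/tweetdata"));

        job.setMapperClass(TweetCountMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        // Write reduced counts into the tweetcount column family over Thrift.
        job.setReducerClass(TweetCountReducer.class);
        job.setOutputKeyClass(ByteBuffer.class);
        job.setOutputValueClass(List.class);
        job.setOutputFormatClass(ColumnFamilyOutputFormat.class);

        ConfigHelper.setOutputColumnFamily(job.getConfiguration(), "tweet_keyspace", "tweetcount");
        ConfigHelper.setOutputInitialAddress(job.getConfiguration(), "localhost");
        ConfigHelper.setOutputRpcPort(job.getConfiguration(), "9160");
        ConfigHelper.setOutputPartitioner(job.getConfiguration(),
                "org.apache.cassandra.dht.Murmur3Partitioner");

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}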