Cassandra MapReduce Integration
In this section, we will read tweets (see the preceding section) from HDFS and discuss a
MapReduce program that computes tweet counts. Finally, the reduced output
will be stored in the Cassandra tweetcount column family. We will discuss MapReduce
over Cassandra with two recipes:
Reading tweets from HDFS and storing tweet counts into Cassandra.
Reading tweets from Cassandra and storing tweet counts into Cassandra.
For this example, we will be using the Thrift protocol to create the Cassandra
schema. MapReduce integration is available with Thrift, Cassandra's open-source RPC API,
and the sample exercises will also demonstrate MapReduce integration with Cassandra
in a CQL3 way.
Reading Tweets from HDFS and Storing Count Results into Cassandra
In this section we will read the tweets file previously stored in the HDFS directory
/apress/tweetdata (see the preceding section) and store the tweet count
per user and per date in Cassandra. Cassandra provides MapReduce support for both
Thrift and CQL3. We will explore both protocols, starting with Thrift.
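Before wiring up either protocol, it helps to see the counting step itself. The following is a minimal sketch of a Hadoop mapper for this job; the class name TweetCountMapper and the assumption that each line of /apress/tweetdata holds a user id, a tweet date, and the tweet text separated by tabs are illustrative only and depend on how the file was written in the preceding section.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper: each input line is assumed to be
// "<user>\t<tweet_date>\t<tweet text>". It emits ("<user>:<tweet_date>", 1)
// so the reducer can sum the tweet count per user and per date.
public class TweetCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text userAndDate = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split("\t");
        if (fields.length < 2) {
            return; // skip malformed lines
        }
        // Group by user and tweet date, e.g. "some_user:2014-06-11".
        userAndDate.set(fields[0] + ":" + fields[1]);
        context.write(userAndDate, ONE);
    }
}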
The Thrift Way
Let's explore it with the same Twitter example, where users and their tweets should be
stored and sorted by tweet_date; a sketch of the complete job follows the schema definition below.
1. First, we need to prepare the data definition. Let's create a column
family tweetcount using cassandra-cli:
// create keyspace.
create keyspace tweet_keyspace;
use tweet_keyspace;
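To round out the Thrift way, here is a minimal sketch of the reducer and job driver, reusing the TweetCountMapper sketched earlier. It assumes a single local Cassandra node on the default Thrift port 9160, the Murmur3Partitioner, and a tweetcount column family keyed by the user:date string with one count column per row; the keyspace name, column family layout, addresses, and class names are assumptions to adapt to your own cluster and schema.

import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.Collections;
import java.util.List;

import org.apache.cassandra.hadoop.ColumnFamilyOutputFormat;
import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.cassandra.thrift.Column;
import org.apache.cassandra.thrift.ColumnOrSuperColumn;
import org.apache.cassandra.thrift.Mutation;
import org.apache.cassandra.utils.ByteBufferUtil;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class TweetCountJob {

    // Reducer for ColumnFamilyOutputFormat: the output key is the Cassandra
    // row key and the output value is a list of Thrift Mutations for that row.
    public static class TweetCountReducer
            extends Reducer<Text, IntWritable, ByteBuffer, List<Mutation>> {

        @Override
        protected void reduce(Text userAndDate, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : counts) {
                sum += count.get();
            }

            // Store the total in a "count" column; the row key is the
            // assumed "user:date" string emitted by the mapper.
            Column column = new Column();
            column.setName(ByteBufferUtil.bytes("count"));
            column.setValue(ByteBufferUtil.bytes(sum));
            column.setTimestamp(System.currentTimeMillis());

            Mutation mutation = new Mutation();
            mutation.setColumn_or_supercolumn(new ColumnOrSuperColumn());
            mutation.getColumn_or_supercolumn().setColumn(column);

            context.write(ByteBufferUtil.bytes(userAndDate.toString()),
                          Collections.singletonList(mutation));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "tweetcount");
        job.setJarByClass(TweetCountJob.class);

        // Read the tweet file previously stored on HDFS.
        job.setInputFormatClass(TextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path("/apress/tweetdata"));

        job.setMapperClass(TweetCountMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        // Write reduced counts into the tweetcount column family over Thrift.
        job.setReducerClass(TweetCountReducer.class);
        job.setOutputKeyClass(ByteBuffer.class);
        job.setOutputValueClass(List.class);
        job.setOutputFormatClass(ColumnFamilyOutputFormat.class);

        ConfigHelper.setOutputColumnFamily(job.getConfiguration(), "tweet_keyspace", "tweetcount");
        ConfigHelper.setOutputInitialAddress(job.getConfiguration(), "localhost");
        ConfigHelper.setOutputRpcPort(job.getConfiguration(), "9160");
        ConfigHelper.setOutputPartitioner(job.getConfiguration(),
                "org.apache.cassandra.dht.Murmur3Partitioner");

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}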