A folder named tweetcount will be created in the PIG_HOME directory, containing a file with a name like part-r-00000 that holds the total tweet count.
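To make the output layout concrete, the following shell sketch mimics the folder and part-file naming described above and reads the count back. The count value of 1000 is purely illustrative, not real job output:

```shell
# Mimic the output layout Pig produces for the tweet-count job
# (folder and file names are from the text; the count is illustrative).
mkdir -p tweetcount
echo "1000" > tweetcount/part-r-00000

# Read the total tweet count back from the part file.
cat tweetcount/part-r-00000
```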
Until now we have explored Pig for running MapReduce jobs over the local file
system. Now let's try to run Pig MapReduce scripts over the Cassandra file system.
It is important to note that complex Pig scripts may end up running multiple MapReduce jobs. One drawback of such scripts is that those jobs run in sequence, losing the benefit of parallel execution. You may also have noticed the intermediate outputs, such as the loaded tweets, generated while the Pig scripts run.
Pig with Cassandra
Cassandra and Pig integration is fairly straightforward. As mentioned above, transforming Pig Latin scripts into MapReduce jobs over Cassandra requires Cassandra-specific storage functions and connection settings. By default, Apache Cassandra ships with built-in function support for Pig integration under the package org.apache.cassandra.hadoop.pig.
In this section we will use the CQL-based storage function CqlStorage for the exercises. Readers may also run the same exercises using CassandraStorage, which is primarily for column families created in a non-CQL way. For more details on CQL versus Thrift, please refer to Chapter 1.
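As a rough sketch of what a CqlStorage-based load looks like in the Pig grunt shell (the keyspace twitterdata and table tweets are hypothetical names used only for illustration), a session might begin with:

```
grunt> tweets = LOAD 'cql://twitterdata/tweets' USING CqlStorage();
grunt> DESCRIBE tweets;
```

CqlStorage addresses a table with a cql:// URI of the form cql://keyspace/table, in contrast to the cassandra:// URIs used by CassandraStorage.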
The first step is to configure Pig for Cassandra-specific settings:
# cassandra daemon host
export PIG_INITIAL_ADDRESS=localhost
# thrift rpc port
export PIG_RPC_PORT=9160
# configured partitioner
export PIG_PARTITIONER=org.apache.cassandra.dht.Murmur3Partitioner
# Add thrift library to Pig's classpath.
export PIG_CLASSPATH=/home/vivek/software/apache-cassandra-2.0.4/lib/libthrift-0.9.1.jar
Data Import