A folder named tweetcount will be created in the PIG_HOME directory, containing a file with a name like part-r-00000 that holds the total tweet count.
So far we have used Pig to run MapReduce jobs over the local file
system. Now let's run Pig MapReduce scripts over the Cassandra file system.
It is important to note that we can create complex Pig scripts that may end up
running multiple MapReduce jobs. One drawback of such scripts is that those
jobs run in sequence, losing the benefit of parallelism. You may also have noticed the
intermediate outputs, such as the loaded tweets, generated while the Pig scripts run.
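For reference, a tweet-count script of the shape described above might look like the following. This is only a sketch: the input file name, delimiter, and field layout are assumptions, not the book's exact script.

```pig
-- Hypothetical sketch: load tweets from a local file (path and schema assumed),
-- count them, and store the result in the tweetcount folder.
tweets = LOAD 'tweets.txt' USING PigStorage(',')
         AS (id:chararray, body:chararray);
grouped = GROUP tweets ALL;                      -- one group holding every tweet
counted = FOREACH grouped GENERATE COUNT(tweets);
STORE counted INTO 'tweetcount';                 -- written as part-r-00000
```

Run in local mode, this produces the tweetcount output folder mentioned above.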
Pig with Cassandra
Cassandra and Pig integration is fairly easy. As mentioned above, transforming Pig Latin scripts into MapReduce jobs over Cassandra requires Cassandra-specific storage functions and connection settings. Apache Cassandra ships with built-in Pig integration support under the package
org.apache.cassandra.hadoop.pig .
In this section we will use the CQL-based storage function CqlStorage for the exercises. Readers may also run the same exercises using CassandraStorage , which is
primarily for column families created in a non-CQL way. For more details on CQL
versus Thrift, please refer to Chapter 1 .
The first step is to configure Pig for Cassandra-specific settings:
# Cassandra daemon host
export PIG_INITIAL_ADDRESS=localhost
# Thrift RPC port
export PIG_RPC_PORT=9160
# Configured partitioner
export PIG_PARTITIONER=org.apache.cassandra.dht.Murmur3Partitioner
# Add the Thrift library to Pig's classpath.
export PIG_CLASSPATH=/home/vivek/software/apache-cassandra-2.0.4/lib/libthrift-0.9.1.jar
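With these settings in place, a Pig Latin script can read a Cassandra table by pointing LOAD at a cql:// URI with the CqlStorage function. The sketch below assumes a keyspace named twitter and a table named tweets; substitute your own names.

```pig
-- Hypothetical example: load rows from keyspace 'twitter', table 'tweets'
-- (both names assumed) through the CQL-based storage function.
rows = LOAD 'cql://twitter/tweets'
       USING org.apache.cassandra.hadoop.pig.CqlStorage();
DUMP rows;   -- each tuple carries the row's column values
```

Because the storage function handles the Cassandra connection, the rest of the script looks like any other Pig Latin program.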
Data Import