A folder named tweetcount will be created in the PIG_HOME directory, containing a file with a name like part-r-00000 that holds the total tweet count.
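To make the output layout concrete, the following shell sketch mimics the folder and part-file naming described above and reads the count back. The count value of 1000 is purely illustrative, not real job output:

```shell
# Mimic the output layout Pig produces for the tweet-count job
# (folder and file names are from the text; the count is illustrative).
mkdir -p tweetcount
echo "1000" > tweetcount/part-r-00000

# Read the total tweet count back from the part file.
cat tweetcount/part-r-00000
```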
Until now we have explored Pig for running MapReduce jobs over the local file
system. Now let's try to run Pig MapReduce scripts over the Cassandra file system.
It is important to note that complex Pig scripts may end up running multiple MapReduce jobs. One drawback of such scripts is that those jobs run in sequence, losing the benefit of parallel execution. You may also have noticed the intermediate outputs, such as the loaded tweets, generated while the Pig scripts run.
Pig with Cassandra
Cassandra and Pig integration is fairly straightforward. As mentioned above, transforming Pig Latin scripts into MapReduce jobs over Cassandra requires Cassandra-specific storage functions and connection settings. By default, Apache Cassandra ships with built-in function support for Pig integration under the package org.apache.cassandra.hadoop.pig.
In this section we will use the CQL-based storage function CqlStorage for the exercises. Readers may also run the same exercises using CassandraStorage, which is primarily for column families created in a non-CQL way. For more details on CQL versus Thrift, please refer to Chapter 1.
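As a rough sketch of what a CqlStorage-based load looks like in the Pig grunt shell (the keyspace twitterdata and table tweets are hypothetical names used only for illustration), a session might begin with:

```
grunt> tweets = LOAD 'cql://twitterdata/tweets' USING CqlStorage();
grunt> DESCRIBE tweets;
```

CqlStorage addresses a table with a cql:// URI of the form cql://keyspace/table, in contrast to the cassandra:// URIs used by CassandraStorage.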
The first step is to configure Pig for Cassandra-specific settings:
# cassandra daemon host
export PIG_INITIAL_ADDRESS=localhost
# thrift rpc port
export PIG_RPC_PORT=9160
# configured partitioner
export PIG_PARTITIONER=org.apache.cassandra.dht.Murmur3Partitioner
# Add thrift library to Pig's classpath.
export PIG_CLASSPATH=/home/vivek/software/apache-cassandra-2.0.4/lib/libthrift-0.9.1.jar
Data Import