Integration with Hadoop - Mastering Apache Cassandra


Installing Pig

Installing Pig is simple; the hard part is getting it to work nicely with Hadoop and Cassandra. To install Pig, download the latest version and untar it as follows:

$ wget http://www.eng.lsu.edu/mirrors/apache/pig/pig-0.11.1/pig-0.11.1.tar.gz

$ tar xvzf pig-0.11.1.tar.gz

$ ln -s pig-0.11.1 pig

Let's call this directory $PIG_HOME. Ideally, executing $PIG_HOME/bin/pig should start the Pig console, provided that Cassandra and Hadoop are up and running. Unfortunately, it does not. At the time of writing, the documentation is not adequate for configuring Pig. To get Pig started, you need to do the following:

1. Set the HADOOP_PREFIX variable to Hadoop's installation directory.

2. Add all the JAR files in Cassandra's lib directory to PIG_CLASSPATH.

3. Add udf.import.list to the PIG_OPTS Pig options variable, as follows:

export PIG_OPTS="$PIG_OPTS -Dudf.import.list=org.apache.cassandra.hadoop.pig";

4. Set PIG_INITIAL_ADDRESS, PIG_RPC_PORT, and PIG_PARTITIONER to the address of one of the Cassandra nodes, the Cassandra RPC port, and the Cassandra partitioner, respectively.

You may write a simple shell script that does this for you. Here is a shell script that accommodates the four steps (assuming $CASSANDRA_HOME points to the Cassandra installation directory).
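A minimal sketch of what such a script might look like follows. It is not a definitive setup: the fallback paths (/usr/local/hadoop, /opt/cassandra, $HOME/pig), the helper name append_jars, and the default values localhost, 9160, and Murmur3Partitioner are all assumptions that you should replace with the values for your own cluster.

```shell
#!/usr/bin/env bash
# Sketch: environment setup for running Pig against Cassandra.
# Every fallback value below is an assumption; adjust it to your installation.

# Step 1: Hadoop's installation directory (fallback path is hypothetical).
export HADOOP_PREFIX="${HADOOP_PREFIX:-/usr/local/hadoop}"

# Assumed install locations for Cassandra and Pig.
export CASSANDRA_HOME="${CASSANDRA_HOME:-/opt/cassandra}"
export PIG_HOME="${PIG_HOME:-$HOME/pig}"

# Step 2: append every JAR found in a directory to PIG_CLASSPATH.
# append_jars is a hypothetical helper name used only in this sketch.
append_jars() {
  local jar
  for jar in "$1"/*.jar; do
    [ -e "$jar" ] || continue   # skip the literal pattern when nothing matches
    PIG_CLASSPATH="${PIG_CLASSPATH:+$PIG_CLASSPATH:}$jar"
  done
}
append_jars "$CASSANDRA_HOME/lib"
export PIG_CLASSPATH

# Step 3: let Pig resolve Cassandra's UDFs without a package prefix.
export PIG_OPTS="${PIG_OPTS:+$PIG_OPTS }-Dudf.import.list=org.apache.cassandra.hadoop.pig"

# Step 4: Cassandra connection details (defaults are assumptions).
export PIG_INITIAL_ADDRESS="${PIG_INITIAL_ADDRESS:-localhost}"
export PIG_RPC_PORT="${PIG_RPC_PORT:-9160}"
export PIG_PARTITIONER="${PIG_PARTITIONER:-org.apache.cassandra.dht.Murmur3Partitioner}"

# With the environment in place, the console can be started with:
#   "$PIG_HOME/bin/pig"
```

If you save this under a name of your choice (for example, a hypothetical pig-env.sh), you can source it in your shell and then run $PIG_HOME/bin/pig. Because each variable falls back to a default only when it is unset, values already exported in your shell take precedence over the ones in the script.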

Note

Pig 0.14, Cassandra 2.1.2, and Hadoop 2.6.0 have some classpath conflicts with one another. Some JARs had to be added and removed to make the integration work. In particular, you may want to replace all Guava libraries with Guava version 16.0: Cassandra does not like older versions, and Hadoop fails with newer ones (17 onwards, ht-
