Integration with Hadoop - Mastering Apache Cassandra

Integrating Pig and Cassandra

With Hadoop working with Cassandra, we are almost ready to use the Pig console to read data from Cassandra and store results back into it. The one thing you still need to know is which storage function moves data to and from Cassandra. It is CassandraStorage(), and you use it in your Pig Latin in the same way as you would use PigStorage().

In Pig, data read from or written to Cassandra is represented as a tuple made up of a row key and a bag of tuples, where each inner tuple is a column-name and column-value pair, such as this:

(ROW_KEY, { (COL1, VAL1), (COL2, VAL2), (COL3, VAL3), ...})
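To make the shape above concrete, here is a minimal Python model of it. This is only an illustration of the nesting, not Pig or Cassandra code; the type aliases and the helper names make_row and column_values are hypothetical and do not come from the book.

```python
from typing import List, Tuple

# Illustrative aliases (assumptions, not part of Pig or Cassandra):
Column = Tuple[str, str]          # (column name, column value)
Row = Tuple[str, List[Column]]    # (row key, bag of columns)


def make_row(key: str, columns: List[Column]) -> Row:
    """Build a (row key, bag of (name, value) tuples) record."""
    return (key, list(columns))


def column_values(row: Row) -> List[str]:
    """Return only the column values, dropping the row key and column names."""
    _key, columns = row
    return [value for _name, value in columns]


row = make_row("ROW_KEY", [("COL1", "VAL1"), ("COL2", "VAL2"), ("COL3", "VAL3")])
print(row)
print(column_values(row))
```

The point of the model is that a row is always a pair: the key comes first, and everything else lives inside the bag as name and value pairs.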

Here is a word count example that reads from a Cassandra table. It uses the same data (from Alice in Wonderland) as the earlier MapReduce example with Cassandra. The text is split into lines, and each row holds 500 lines in 500 columns. There are six rows in total:

-- Pull data from the dataCF column family in the testks keyspace
grunt> rows = LOAD 'cassandra://testks/dataCF' USING CassandraStorage();
grunt> cols = FOREACH rows GENERATE flatten(columns);
grunt> vals = FOREACH cols GENERATE flatten(TOKENIZE((chararray)$1)) AS word;
grunt> grps = GROUP vals BY word;
grunt> cnt = FOREACH grps GENERATE group, COUNT(vals), 'count' AS ccnt;
grunt> grp_by_word = GROUP cnt BY $0;
grunt> cagg = FOREACH grp_by_word GENERATE group, cnt.(ccnt, $1);
-- Put data into the result1CF column family in the testks keyspace
grunt> STORE cagg INTO 'cassandra://testks/result1CF' USING CassandraStorage();
2013-07-22 14:12:45,144 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: GROUP_BY
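The data flow of the Pig session can be followed step by step in a small Python sketch. This is an emulation under stated assumptions, not what Pig executes: the sample rows are made up, the function name word_count is hypothetical, and str.split() is a simplification of Pig's TOKENIZE, whose delimiter handling differs.

```python
from collections import Counter

# Made-up stand-in for rows loaded from dataCF:
# each record is (row key, bag of (column name, line of text)).
rows = [
    ("row1", [("line1", "down the rabbit hole"), ("line2", "the rabbit ran")]),
    ("row2", [("line1", "alice followed the rabbit")]),
]


def word_count(rows):
    # cols: flatten every row's bag into individual (name, value) tuples
    cols = [col for _key, columns in rows for col in columns]
    # vals: split each column value into words (simplified TOKENIZE)
    vals = [word for _name, value in cols for word in value.split()]
    # grps / cnt: group identical words and count each group
    counts = Counter(vals)
    # cagg: reshape into (row key, bag of (column name, column value)),
    # so each word becomes a row key with a single 'count' column
    return [(word, [("count", n)]) for word, n in sorted(counts.items())]


result = word_count(rows)
for record in result:
    print(record)
```

The final reshaping step mirrors why the Pig script groups a second time: the relation handed to STORE has to be in the row key plus bag of column tuples form before CassandraStorage() can write it.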
