Integrating Pig and Cassandra
With Hadoop working with Cassandra, we are almost ready to use the Pig console to
read data from Cassandra and store results back into it. The one thing you still
need to know is which storage function reads and writes Cassandra data:
CassandraStorage() is what you will use in your Pig Latin to transfer data to
and from Cassandra. You use it exactly as you would use PigStorage().
In Pig, the data structure used to read from and write to Cassandra is a tuple that holds
a row key and a bag of tuples, where each inner tuple is a column-name and column-value
pair, such as this:
(ROW_KEY, { (COL1, VAL1), (COL2, VAL2), (COL3, VAL3), ...})
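To make the shape above concrete, here is a minimal Python sketch that models one such record with plain tuples and lists. It is only an illustration of the layout, not part of Pig or Cassandra; the names make_row and flatten_columns and the sample row are hypothetical.

```python
from typing import Dict, List, Tuple

# Assumed model: a column is a (name, value) pair, and a row is
# (row_key, bag_of_columns), mirroring (ROW_KEY, {(COL1, VAL1), ...}).
Column = Tuple[str, str]
Row = Tuple[str, List[Column]]


def make_row(row_key: str, columns: Dict[str, str]) -> Row:
    """Build a (row_key, [(col, val), ...]) record from a dict of columns."""
    return (row_key, list(columns.items()))


def flatten_columns(rows: List[Row]) -> List[Column]:
    """Unnest the bag of every row into a flat list of (col, val) pairs."""
    return [col for _key, cols in rows for col in cols]


# Hypothetical sample data: one row holding two lines of text.
rows = [make_row("row1", {"line1": "Alice was beginning",
                          "line2": "to get very tired"})]
print(rows[0])
print(flatten_columns(rows))
```

The flatten_columns helper plays the same role here that flattening the bag of columns plays in a Pig script: it turns one record per row into one record per column.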
Here is a word-count example that reads from a Cassandra table. It uses the same data
(from Alice in Wonderland) that we used in the MapReduce example with Cassandra. The
text is split into lines, and each row contains 500 lines in 500 columns. There are
six rows in total:
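-- Pull data from the dataCF column family in the testks keyspace
grunt> rows = LOAD 'cassandra://testks/dataCF' USING
CassandraStorage();
-- Unnest the bag so that each record is one (column name, column value) pair
grunt> cols = FOREACH rows GENERATE flatten(columns);
-- Split each column value ($1) into words, one word per record
grunt> vals = FOREACH cols GENERATE
flatten(TOKENIZE((chararray)$1)) as word;
-- Count the occurrences of each word; 'count' will serve as the column name
grunt> grps = group vals by word;
grunt> cnt = foreach grps generate group, COUNT(vals), 'count'
as ccnt;
-- Reshape into (word, {('count', N)}) to match (ROW_KEY, {(COL, VAL)})
grunt> grp_by_word = group cnt by $0;
grunt> cagg = foreach grp_by_word generate group, cnt.(ccnt,
$1);
-- Put data into the result1CF column family in the testks keyspace
grunt> STORE cagg into 'cassandra://testks/result1CF' USING
CassandraStorage();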
2013-07-22 14:12:45,144 [main] INFO
org.apache.pig.tools.pigstats.ScriptState - Pig features
used in the script: GROUP_BY
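The data flow of the Pig session above can be traced step by step in plain Python. This is a rough sketch for following the shape of each relation, not a replacement for the Pig script: the function word_count and the sample rows are hypothetical, and str.split() on whitespace is a simplification of Pig's TOKENIZE, which also splits on a few punctuation characters.

```python
from collections import Counter
from typing import List, Tuple

Column = Tuple[str, str]
Row = Tuple[str, List[Column]]


def word_count(rows: List[Row]) -> List[Tuple[str, List[Tuple[str, int]]]]:
    """Mimic the rows -> cols -> vals -> cnt -> cagg steps of the session."""
    # cols: flatten the bag of columns into (name, value) pairs.
    cols = [col for _key, columns in rows for col in columns]
    # vals: tokenize each column value and flatten into single words.
    # Assumption: whitespace splitting stands in for TOKENIZE.
    vals = [word for _name, value in cols for word in value.split()]
    # grps + cnt: group by word and count each group.
    counts = Counter(vals)
    # cagg: one record per word, shaped as (ROW_KEY, [(COL, VAL)]),
    # with the literal column name 'count'.
    return [(word, [("count", n)]) for word, n in sorted(counts.items())]


# Hypothetical input in the (row_key, [(col, val), ...]) layout.
sample = [("row1", [("line1", "the cat"), ("line2", "the hat")])]
for record in word_count(sample):
    print(record)
```

Each output record uses the word as the row key and carries a single column named count, which is the layout that a store into Cassandra expects.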