However, work is being done to support it in the near future. For details about the current status,
see the wiki.
Tools Above MapReduce
MapReduce is a great abstraction for developers so that they can worry less about the details of
distributed computing and more about the problems they are trying to solve. Over time, an even
more abstracted toolset has emerged. Pig and Hive operate at a level above MapReduce and allow
developers to perform more complex analytics more easily. Both of these frameworks can
operate against data in Cassandra.
Pig
Pig ( http://hadoop.apache.org/pig ) is a platform for data analytics developed at Yahoo!. Included
in the platform is a high-level language called Pig Latin and a compiler that translates programs
written in Pig Latin into sequences of MapReduce jobs.
Along with the direct Hadoop integration for running MapReduce jobs over data in Cassandra,
there has also been work done to provide integration for Pig. With its Grunt shell prompt and the
Pig Latin scripting language, Pig provides a way to simplify writing analytics code. Here is our
word count example written in Pig Latin:
rows = LOAD 'cassandra://Keyspace1/Standard1' USING CassandraStorage()
    as (key:chararray, cols:bag{col:tuple(name:bytearray, value:bytearray)});
cols = FOREACH rows GENERATE flatten(cols) as (name, value);
words = FOREACH cols GENERATE flatten(TOKENIZE((chararray) value)) as word;
grouped = GROUP words BY word;
counts = FOREACH grouped GENERATE group, COUNT(words) as count;
ordered = ORDER counts BY count DESC;
topten = LIMIT ordered 10;
dump topten;
This alternative word count is only eight statements long. The first statement loads all the data
in the Standard1 column family, describing that data with aliases and data types. The second
extracts the name/value pairs from each of the rows. In the third, we have to cast the value to a
character array in order to tokenize it with the built-in TOKENIZE function. We next group by word
and count the instances of each word. Finally, we order our data by count, highest first, and
output the top 10 words found.
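To make the data flow concrete, the following is a rough plain-Python sketch of what each Pig Latin statement computes. It is not how Pig executes the script (Pig compiles it into MapReduce jobs), and it does not talk to Cassandra: the `rows` list is a made-up in-memory stand-in for the column family, and `top_words` is a hypothetical helper name. Splitting on whitespace is also a simplification of what TOKENIZE does.

```python
from collections import Counter

# Hypothetical stand-in for rows loaded from a column family:
# each row is (key, [(column_name, column_value), ...]) with byte values.
rows = [
    ("row1", [(b"text", b"the quick brown fox")]),
    ("row2", [(b"text", b"the lazy dog"), (b"note", b"the end")]),
]


def top_words(rows, limit=10):
    # FOREACH rows GENERATE flatten(cols): one (name, value) pair per column.
    cols = [(name, value) for _key, columns in rows for name, value in columns]
    # Cast each value to text, then split it into words (TOKENIZE plus flatten).
    words = [
        word
        for _name, value in cols
        for word in value.decode("utf-8").split()
    ]
    # GROUP words BY word, then COUNT: tally the occurrences of each word.
    counts = Counter(words)
    # ORDER counts BY count DESC, then LIMIT: keep the most frequent words.
    return counts.most_common(limit)


top = top_words(rows)
print(top)
```

Each intermediate variable mirrors the alias of the same name in the Pig Latin script, which makes it easier to see that the script is a chain of transformations over a bag of tuples.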
NOTE
It is trivial to operate over super columns with Pig. It is simply another nested level of data that we can
flatten in order to get its values.
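As an illustration of the extra nesting level, here is a small plain-Python sketch, not Pig Latin and not the Cassandra API. The row layout and the `flatten_super_columns` helper are assumptions made up for this example: each row holds super columns, and each super column holds its own list of name/value subcolumns. Flattening is simply one more loop.

```python
# Hypothetical super column family data:
# (key, [(super_column_name, [(subcolumn_name, value), ...]), ...])
rows = [
    ("row1", [
        (b"address", [(b"city", b"Austin"), (b"state", b"TX")]),
        (b"name", [(b"first", b"Ada")]),
    ]),
]


def flatten_super_columns(rows):
    """Yield one (key, super_name, sub_name, value) tuple per subcolumn."""
    for key, super_columns in rows:
        # Outer level: the super columns of the row.
        for super_name, subcolumns in super_columns:
            # Inner level: the subcolumns nested inside each super column.
            for sub_name, value in subcolumns:
                yield key, super_name, sub_name, value


flat = list(flatten_super_columns(rows))
for record in flat:
    print(record)
```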