ing with solutions for big data problems. Today, Hadoop is widely used by Yahoo!, Facebook, LinkedIn, Twitter, IBM, Rackspace, and many other companies. There is a vibrant community and a growing ecosystem.
Cassandra has built-in support for the Hadoop implementation of MapReduce (http://hadoop.apache.org/mapreduce).
Working with MapReduce
This section covers how to write a simple MapReduce job over data stored in Cassandra using the Java language. We also briefly cover how to output data into Cassandra and discuss ongoing progress with using Cassandra with Hadoop Streaming for languages beyond Java.
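As a sketch of what the map phase of such a job looks like, the mapper below is loosely modeled on the word count example discussed in this section. The exact input types handed to the mapper vary between Cassandra versions; the ByteBuffer row key, the SortedMap of columns, and the WordCountMapper class name are assumptions for illustration rather than the definitive API.

import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.SortedMap;

import org.apache.cassandra.db.IColumn;
import org.apache.cassandra.utils.ByteBufferUtil;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// ColumnFamilyInputFormat feeds the mapper one Cassandra row at a time:
// the row key plus a map of that row's columns (types assumed here).
public class WordCountMapper
        extends Mapper<ByteBuffer, SortedMap<ByteBuffer, IColumn>, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(ByteBuffer key, SortedMap<ByteBuffer, IColumn> columns, Context context)
            throws IOException, InterruptedException {
        // Tokenize each column value and emit (word, 1) pairs,
        // exactly as in the classic Hadoop word count.
        for (IColumn column : columns.values()) {
            String value = ByteBufferUtil.string(column.value());
            for (String token : value.split("\\s+")) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}

From Hadoop's perspective this is an ordinary mapper; only the input key and value types reflect that the records come from a Cassandra column family rather than from files in HDFS.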
NOTE
The word count example given in this section is also found in the Cassandra source download in its contrib module. It can be compiled and run using instructions found there. It is best to run with that code, as the current version might have minor modifications. However, the principles remain the same.
For convenience, the word count MapReduce example can be run locally against a single Cassandra node. However, for more information on how to configure Cassandra and Hadoop to run MapReduce in a more distributed fashion, see the section Cluster Configuration.
Cassandra Hadoop Source Package
Cassandra has a Java source package for Hadoop integration code, called
org.apache.cassandra.hadoop. There we find:
ColumnFamilyInputFormat
The main class we'll use to interact with data stored in Cassandra from Hadoop. It's an extension of Hadoop's InputFormat abstract class.
ConfigHelper
A helper class to configure Cassandra-specific information such as the server node to point
to, the port, and information specific to your MapReduce job.
ColumnFamilySplit
The extension of Hadoop's InputSplit abstract class that creates splits over our Cassandra
data. It also provides Hadoop with the location of the data, so that it may prefer running tasks
on nodes where the data is stored.
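As a rough sketch of how these classes fit together, the job setup below reads from a single local node, assuming a hypothetical Keyspace1/Standard1 column family with a text column and the WordCountMapper sketched earlier. The ConfigHelper method names shown follow later Cassandra releases and may differ in the version you are running, so treat this as illustrative rather than definitive.

import java.util.Arrays;

import org.apache.cassandra.hadoop.ColumnFamilyInputFormat;
import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.cassandra.thrift.SlicePredicate;
import org.apache.cassandra.utils.ByteBufferUtil;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

public class WordCountJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "cassandra-word-count");
        job.setJarByClass(WordCountJob.class);

        // Read input from Cassandra rather than from HDFS.
        job.setInputFormatClass(ColumnFamilyInputFormat.class);

        // ConfigHelper points the job at the node, port, partitioner, and
        // column family (keyspace and column family names are assumptions).
        ConfigHelper.setInputInitialAddress(job.getConfiguration(), "localhost");
        ConfigHelper.setInputRpcPort(job.getConfiguration(), "9160");
        ConfigHelper.setInputPartitioner(job.getConfiguration(),
                "org.apache.cassandra.dht.RandomPartitioner");
        ConfigHelper.setInputColumnFamily(job.getConfiguration(), "Keyspace1", "Standard1");

        // Restrict each row to the single column whose value we want to count.
        SlicePredicate predicate = new SlicePredicate()
                .setColumn_names(Arrays.asList(ByteBufferUtil.bytes("text")));
        ConfigHelper.setInputSlicePredicate(job.getConfiguration(), predicate);

        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(IntSumReducer.class);   // Hadoop's stock summing reducer
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileOutputFormat.setOutputPath(job, new Path("/tmp/word_count_output"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Behind the scenes, ColumnFamilyInputFormat asks the cluster for its token ranges and turns each range into a ColumnFamilySplit, which is what lets Hadoop schedule map tasks near the nodes that hold the data.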