To some, coding MapReduce programs is tedious and filled with boilerplate code. Pig provides
an abstraction that makes our code significantly more concise. Pig also allows programmers to
express operations such as joins much more simply than by using MapReduce alone.
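To make the boilerplate concrete, here is a minimal sketch of a reduce-side inner join written in the MapReduce style, simulated in plain Python rather than run on Hadoop. The data sets, field names, and function names (users, visits, map_phase, shuffle, reduce_phase) are hypothetical and chosen only for illustration; a real job would also need input formats, job configuration, and driver code.

```python
from collections import defaultdict

# Hypothetical sample data: (user_id, name) and (user_id, page) records.
users = [(1, "alice"), (2, "bob")]
visits = [(1, "/home"), (1, "/about"), (2, "/home"), (3, "/orphan")]


def map_phase(users, visits):
    """Tag each record with its source so the reducer can tell them apart."""
    for user_id, name in users:
        yield user_id, ("user", name)
    for user_id, page in visits:
        yield user_id, ("visit", page)


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Emit one joined row for every (name, page) pair that shares a key."""
    for user_id in sorted(groups):
        names = [v for tag, v in groups[user_id] if tag == "user"]
        pages = [v for tag, v in groups[user_id] if tag == "visit"]
        for name in names:
            for page in pages:
                yield user_id, name, page


joined = list(reduce_phase(shuffle(map_phase(users, visits))))
print(joined)
```

In Pig Latin, the equivalent inner join is typically a single statement of the form JOIN users BY user_id, visits BY user_id, with the tagging, grouping, and pairing steps generated for you.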
The Pig integration code (a LoadFunc implementation) is found in the contrib section of
Cassandra's source download. Instructions there explain how to compile and run it and how to
set the Cassandra-specific configuration options. In a moment, we'll see how to configure a
Cassandra cluster to run Pig jobs (compiled down to MapReduce) in a distributed way.
Hive
Like Pig, Hive (http://hadoop.apache.org/hive) is a platform for data analytics. Instead of a
scripting language, queries are written in Hive-QL, a query language similar to the familiar
SQL. Hive was developed by Facebook to allow large data sets to be abstracted into a common
structure.
As of this writing, work on a Hive storage handler for Cassandra is being finalized. For updates
and documentation on its usage with Cassandra, see the wiki.
Cluster Configuration
MapReduce and other tools can run in a nondistributed way for trying things out or
troubleshooting a problem. However, to run in a production environment, you'll want to install
Hadoop in your Cassandra cluster as well. Although a comprehensive discussion of Hadoop
installation and configuration is outside the scope of this chapter, we do go over how to
configure Cassandra alongside Hadoop for best performance. Readers can find more about Hadoop
configuration at http://hadoop.apache.org or in Tom White's excellent reference, Hadoop: The
Definitive Guide (O'Reilly).
Because Hadoop has some unfamiliar terminology, here are some useful definitions:
HDFS
Hadoop distributed filesystem.
Namenode
The master node for HDFS. It tracks the locations of the data blocks stored across the
datanodes and, in smaller clusters, often runs on the same server as the jobtracker.
Datanode
Nodes for storing data blocks for HDFS. Datanodes run on the same servers as tasktrackers.