Integrating Hadoop - Cassandra: The Definitive Guide


To some, coding MapReduce programs is tedious and filled with boilerplate code. Pig provides an abstraction that makes our code significantly more concise. Pig also allows programmers to express operations such as joins much more simply than by using MapReduce alone.
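To make the point about joins concrete, here is a minimal sketch in plain Python of what a join looks like when spelled out in MapReduce style. It does not use Hadoop or Cassandra at all; the data sets, field names, and the `map_phase`, `shuffle`, and `reduce_phase` helpers are hypothetical and exist only to illustrate the pattern (often called a reduce-side join).

```python
from collections import defaultdict

# Hypothetical sample data: (user_id, name) and (user_id, page) records.
users = [(1, "alice"), (2, "bob")]
visits = [(1, "/home"), (2, "/cart"), (1, "/checkout")]


def map_phase(users, visits):
    """Tag each record with its source so the reducer can tell them apart."""
    for user_id, name in users:
        yield user_id, ("user", name)
    for user_id, page in visits:
        yield user_id, ("visit", page)


def shuffle(pairs):
    """Group mapped values by key, standing in for the framework's shuffle step."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """For each key, pair every user record with every visit record."""
    for user_id in sorted(groups):
        values = groups[user_id]
        names = [v for tag, v in values if tag == "user"]
        pages = [v for tag, v in values if tag == "visit"]
        for name in names:
            for page in pages:
                yield user_id, name, page


joined = list(reduce_phase(shuffle(map_phase(users, visits))))
for row in joined:
    print(row)
```

In Pig Latin, the same operation would be roughly a single statement along the lines of `joined = JOIN users BY user_id, visits BY user_id;`, with the tagging, grouping, and pairing handled by Pig when it compiles the script down to MapReduce.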

The Pig integration code (a LoadFunc implementation) is found in the contrib section of Cassandra's source download. It can be compiled and run using the instructions found there, which also explain how to set Cassandra-specific configuration options. In a moment, we'll see how to configure a Cassandra cluster to run Pig jobs (compiled down to MapReduce) in a distributed way.

Hive

Like Pig, Hive (http://hadoop.apache.org/hive) is a platform for data analytics. Instead of a scripting language, queries are written in Hive-QL, a query language similar to the familiar SQL. Hive was developed by Facebook to allow large data sets to be abstracted into a common structure.

As of this writing, work on a Hive storage handler for Cassandra is being finalized. For updates and documentation on its usage with Cassandra, see the wiki.

Cluster Configuration

MapReduce and other tools can run in a nondistributed way for trying things out or troubleshooting a problem. However, in order to run in a production environment, you'll want to install Hadoop in your Cassandra cluster as well. Although a comprehensive discussion of Hadoop installation and configuration is outside the scope of this chapter, we do go over how to configure Cassandra alongside Hadoop for best performance. Readers can find more about Hadoop configuration at http://hadoop.apache.org or in Tom White's excellent reference, Hadoop: The Definitive Guide (O'Reilly).

Because Hadoop has some unfamiliar terminology, here are some useful definitions:

HDFS

Hadoop distributed filesystem.

Namenode

The master node for HDFS. It holds the locations of data blocks stored across several datanodes and often runs on the same server as the jobtracker in smaller clusters.

Datanode

Nodes for storing data blocks for HDFS. Datanodes run on the same servers as tasktrackers.
