Chapter 12. Integrating Hadoop
Jeremy Hanna
As companies and organizations adopt technologies like Cassandra, they look for tools they can
use to perform analytics and queries against their data. Cassandra's built-in query capabilities,
along with custom layers built atop them, can do much. However, there are also distributed tools
in the community that can be fitted to work with Cassandra.
Hadoop seems to be the elephant in the room when it comes to open source big data frameworks.
There we find tools such as an open source MapReduce implementation and higher-level analyt-
ics engines built on top of that, such as Pig and Hive. Thanks to members of both the Cassandra
and Hadoop communities, Cassandra has gained some significant integration points with Hadoop
and its analytics tools.
In this chapter, we explore how Cassandra and Hadoop fit together. First, we give a brief history
of the Apache Hadoop project and go into how one can write MapReduce programs against data
in Cassandra. From there, we cover integration with higher-level tools built on top of Hadoop:
Pig and Hive. Once we have an understanding of these tools, we cover how a Cassandra cluster
can be configured to run these analytics in a distributed way. Finally, we share a couple of use
cases where Cassandra is being used alongside Hadoop to solve real-world problems.
What Is Hadoop?
If you're already familiar with Hadoop, you can safely skip this section. If you haven't had the
pleasure, Hadoop (http://hadoop.apache.org) is a set of open source projects that deal with large
amounts of data in a distributed way. Its Hadoop Distributed File System (HDFS) and MapReduce
subprojects are open source implementations of Google's GFS and MapReduce, respectively.
Google found that several internal groups had been implementing similar functionality in order
to solve problems in a distributed way. They saw that it was common to have two phases of op-
erations over distributed data: a map phase and a reduce phase. A map function operates over
raw data and produces intermediate values. A reduce function distills those intermediate values
in some way, producing the final output for that MapReduce computation. By standardizing on
a common framework, engineers could spend their time solving new problems rather than
repeatedly reinventing the same distributed machinery.
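The two phases described above can be illustrated with the canonical word-count example. This is a minimal single-process sketch of the MapReduce model, not the actual Hadoop API; the function names and the in-memory "shuffle" step are illustrative stand-ins for what the framework does across a cluster:

```python
from collections import defaultdict

def map_phase(document):
    """Map: operate over raw input, emitting intermediate (word, 1) pairs."""
    for word in document.split():
        yield (word.lower(), 1)

def reduce_phase(key, values):
    """Reduce: distill the intermediate values for one key into a final count."""
    return (key, sum(values))

def run_mapreduce(documents):
    # The framework's shuffle step: group intermediate values by key
    # between the map and reduce phases.
    groups = defaultdict(list)
    for doc in documents:
        for key, value in map_phase(doc):
            groups[key].append(value)
    return dict(reduce_phase(k, vals) for k, vals in groups.items())

counts = run_mapreduce(["the quick brown fox", "the lazy dog"])
# counts["the"] == 2; every other word appears once
```

In a real Hadoop job, the map and reduce functions run in parallel on many machines, and the framework handles partitioning, shuffling, and fault tolerance; the programmer supplies only the two functions.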
Doug Cutting decided to write open source implementations of the Google File System
(http://labs.google.com/papers/gfs.html) and MapReduce
(http://labs.google.com/papers/mapreduce.html), and thus, Hadoop was born. Since then, it has blossomed into myriad tools, all deal-