Chapter 12. Integrating Hadoop
Jeremy Hanna
As companies and organizations adopt technologies like Cassandra, they look for tools they can
use to perform analytics and queries against their data. Cassandra's built-in query capabilities,
along with custom layers built atop them, can do much. However, there are also distributed tools
in the community that can be fitted to work with Cassandra.
Hadoop seems to be the elephant in the room when it comes to open source big data frameworks.
There we find tools such as an open source MapReduce implementation and higher-level analyt-
ics engines built on top of that, such as Pig and Hive. Thanks to members of both the Cassandra
and Hadoop communities, Cassandra has gained some significant integration points with Hadoop
and its analytics tools.
In this chapter, we explore how Cassandra and Hadoop fit together. First, we give a brief history
of the Apache Hadoop project and go into how one can write MapReduce programs against data
in Cassandra. From there, we cover integration with higher-level tools built on top of Hadoop:
Pig and Hive. Once we have an understanding of these tools, we cover how a Cassandra cluster
can be configured to run these analytics in a distributed way. Finally, we share a couple of use
cases where Cassandra is being used alongside Hadoop to solve real-world problems.
What Is Hadoop?
If you're already familiar with Hadoop, you can safely skip this section. If you haven't had the
pleasure, Hadoop (http://hadoop.apache.org) is a set of open source projects that deal with large
amounts of data in a distributed way. Its Hadoop Distributed File System (HDFS) and MapReduce
subprojects are open source implementations of Google's GFS and MapReduce, respectively.
Google found that several internal groups had been implementing similar functionality in order
to solve problems in a distributed way. They saw that it was common to have two phases of op-
erations over distributed data: a map phase and a reduce phase. A map function operates over
raw data and produces intermediate values. A reduce function distills those intermediate values
in some way, producing the final output for that MapReduce computation. By standardizing on
a common framework, engineers could spend their time solving new problems rather than
repeatedly reinventing the same distributed machinery.
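The two phases described above can be illustrated with the canonical word-count example. This is a minimal single-process sketch of the MapReduce model, not the actual Hadoop API; the function names and the in-memory "shuffle" step are illustrative stand-ins for what the framework does across a cluster:

```python
from collections import defaultdict

def map_phase(document):
    """Map: operate over raw input, emitting intermediate (word, 1) pairs."""
    for word in document.split():
        yield (word.lower(), 1)

def reduce_phase(key, values):
    """Reduce: distill the intermediate values for one key into a final count."""
    return (key, sum(values))

def run_mapreduce(documents):
    # The framework's shuffle step: group intermediate values by key
    # between the map and reduce phases.
    groups = defaultdict(list)
    for doc in documents:
        for key, value in map_phase(doc):
            groups[key].append(value)
    return dict(reduce_phase(k, vals) for k, vals in groups.items())

counts = run_mapreduce(["the quick brown fox", "the lazy dog"])
# counts["the"] == 2; every other word appears once
```

In a real Hadoop job, the map and reduce functions run in parallel on many machines, and the framework handles partitioning, shuffling, and fault tolerance; the programmer supplies only the two functions.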
Doug Cutting decided to write open source implementations of the Google File System
(http://labs.google.com/papers/gfs.html) and MapReduce
(http://labs.google.com/papers/mapreduce.html), and thus, Hadoop was born. Since then, it has blossomed into myriad tools, all deal-