Integrating Hadoop - Cassandra: The Definitive Guide

take their storage and analytics backend to the next level. Their legacy storage solution and analytics were homegrown, and they were outgrowing them. Running queries across the entire dataset was tedious and could take hours.

Keith saw Cassandra as a promising storage solution for the following reasons:

▪ Scaling that is built in rather than scaffolded on

▪ Single view of read/write access (no masters or slaves)

▪ A hands-off style of operations that in normal situations (node failures, adding new nodes, etc.) “just works” and requires very little micromanagement

Keith also watched as the Cassandra/Hadoop integration evolved and saw Pig as an analytics solution he could use. Initially he wanted to look for ways to write MapReduce jobs in PHP or Python. However, after becoming familiar with Pig, he didn't see a need. He noted that the turnaround time from idea to execution with Pig was very quick. The query runtime was also a nice surprise: he could traverse all of the data in 10-15 minutes rather than hours. As a result, Raptr is able to explore new possibilities in analyzing their data.

As for configuration, Keith runs a separate namenode/jobtracker and has installed the datanode/tasktracker on each of his Cassandra nodes. He notes that a nice side effect of this is that the analytics engine scales with the data.

Imagini: Dave Gardner

Imagini provides publishers with tools to profile all their site visitors through “visual quizzes” and an inference engine. Behind the scenes, this involves processing large amounts of behavioral data and then making the results available for real-time access.

After looking at several alternatives, Imagini went with Cassandra because of its fault tolerance, decentralized architecture (no single point of failure), and large write capacity.

Dave Gardner, a senior Imagini developer, writes, “We use Cassandra to store our real-time data, including information on roughly 100 million users, which is expected to grow substantially over the coming year. This is nearly all accessed via simple key lookup.”

Currently Imagini aggregates data from a variety of sources into Hadoop's distributed filesystem, HDFS. Using Hadoop Streaming, they run MapReduce jobs written in PHP over their data and output directly to Cassandra via Thrift in their reducers. The results reside in Cassandra to provide real-time access to the data.
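
As a rough illustration of the reducer side of such a Hadoop Streaming job, here is a minimal sketch. It is written in Python rather than the PHP that Imagini uses, and it is not their code. It assumes the mapper emits tab-separated `key<TAB>value` lines with integer values that should be summed per key; that aggregation is invented for the example. The names `reduce_stream`, `write_row`, and `print_row` are hypothetical, and `print_row` only prints each total where a real reducer would write the row to Cassandra through its client library.

```python
import sys
from itertools import groupby


def parse(lines):
    """Yield (key, value) pairs from tab-separated streaming input."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        key, _, value = line.partition("\t")
        yield key, value


def reduce_stream(lines, write_row):
    """Sum integer values per key and hand each total to write_row.

    Hadoop Streaming delivers reducer input sorted by key, so grouping
    consecutive lines is enough to collect every value for a key.
    """
    for key, group in groupby(parse(lines), key=lambda kv: kv[0]):
        total = sum(int(value) for _, value in group)
        write_row(key, total)


def print_row(key, total):
    # Hypothetical stand-in for a Cassandra write: a real reducer would
    # insert the row through whatever client library is in use.
    print(f"{key}\t{total}")


if __name__ == "__main__":
    reduce_stream(sys.stdin, print_row)
```

Passing the writer in as a function keeps the grouping logic independent of the storage client, so it can be exercised locally by piping sorted sample input through the script before it is submitted as a streaming job.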

Looking forward, Imagini hopes to simplify their workflow once Hadoop Streaming becomes available with Cassandra. They plan to store even raw data in Cassandra, run MapReduce over that data, and then output the results back into Cassandra.
