take their storage and analytics backend to the next level. Their legacy storage solution and
analytics were homegrown, and they were outgrowing them. Running queries across the entire
dataset was tedious and could take hours.
Keith saw Cassandra as a promising storage solution for the following reasons:
▪ Built-in scaling rather than scaling bolted on afterward
▪ Single view of read/write access (no masters or slaves)
▪ A hands-off style of operations that, in routine situations (node failures, adding new nodes,
etc.), “just works” and requires very little micromanagement
Keith also watched the Cassandra/Hadoop integration evolve and saw Pig as an analytics
solution he could use. Initially he looked for ways to write MapReduce jobs in PHP or Python.
However, after becoming familiar with Pig, he didn't see a need. He noted that the turnaround
time from idea to execution with Pig was very quick. The query runtime was also a nice
surprise: he could traverse all of the data in 10-15 minutes rather than hours. As a result, Raptr
is able to explore new possibilities in analyzing its data.
As for configuration, Keith runs a separate namenode/jobtracker and has installed a datanode/
tasktracker on each of his Cassandra nodes. He notes that a nice side effect of this is that the
analytics engine scales with the data.
Imagini: Dave Gardner
Imagini provides publishers with tools to profile all their site visitors through “visual quizzes”
and an inference engine. Behind the scenes, this involves processing large amounts of behavioral
data and then making the results available for real-time access.
After looking at several alternatives, Imagini went with Cassandra because of its fault tolerance,
decentralized architecture (no single point of failure), and large write capacity.
Dave Gardner, a senior Imagini developer, writes, “We use Cassandra to store our real-time data,
including information on roughly 100 million users, which is expected to grow substantially over
the coming year. This is nearly all accessed via simple key lookup.”
Currently Imagini aggregates data from a variety of sources into Hadoop's distributed filesystem,
HDFS. Using Hadoop Streaming, they run MapReduce jobs written in PHP over their data and write
the output directly to Cassandra via Thrift in their reducers. The results reside in Cassandra to
provide real-time access to the data.
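The Hadoop Streaming pattern described above can be sketched roughly as follows. This is an illustration only, not Imagini's code: their jobs are written in PHP, while this sketch uses Python. The tab-separated `user_id`/`trait` record format, the function names `mapper` and `reducer`, and the `sink` callback are all assumptions made for the example. In particular, `sink` is a hypothetical stand-in for whatever Thrift call writes a row to Cassandra, and here it just fills an in-memory dictionary.

```python
from collections import Counter
from itertools import groupby
from typing import Callable, Dict, Iterable, Iterator


def mapper(lines: Iterable[str]) -> Iterator[str]:
    """Emit one 'user_id<TAB>trait' pair per well-formed input record."""
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 2 or not all(parts):
            continue  # skip malformed records rather than failing the task
        user_id, trait = parts
        yield f"{user_id}\t{trait}"


def reducer(sorted_lines: Iterable[str],
            sink: Callable[[str, Dict[str, int]], None]) -> None:
    """Group key-sorted mapper output by user and hand one row per user to sink."""
    # Hadoop Streaming delivers reducer input sorted by key, so consecutive
    # lines with the same user_id can be grouped without buffering everything.
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for user_id, group in groupby(pairs, key=lambda pair: pair[0]):
        counts = Counter(trait for _, trait in group)
        # Hypothetical write step: in the workflow described above, this is
        # where the reducer would store the row in Cassandra.
        sink(user_id, dict(counts))


# Local dry run: sorted() imitates the shuffle/sort phase between map and reduce.
raw = ["u1\tsporty\n", "u2\tarty\n", "u1\tsporty\n", "bad line\n", "u1\tarty\n"]
store: Dict[str, Dict[str, int]] = {}
reducer(sorted(mapper(raw)), store.__setitem__)
```

In an actual streaming job the two functions would live in separate scripts that read `sys.stdin` and write to standard output, and the reducer's `sink` would wrap a real Cassandra client rather than a dictionary. Writing from the reducer keeps each user's aggregated row as a single key-addressed write, which matches the simple key lookups used on the read side.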
Looking forward, Imagini hopes to simplify their workflow once Hadoop Streaming becomes
available with Cassandra. They plan to store even raw data in Cassandra, run MapReduce
over that data, and then write the results back into Cassandra.