ing of these insights quick, scalable, and easy? Traditionally, real-time queries for Cassandra have involved minimizing the number of rows you have to read from. Hadoop is very scalable but very slow, and the programming API lacks features. Also, the built-in Cassandra InputFormat is designed only for reading an entire ColumnFamily of data. Real-time streaming frameworks such as Storm are a good fit for fixed processing from firehoses but not so good for flexible queries from a data store. Thus, we have turned to Spark, a very fast, in-memory distributed computing framework, to help us with these insights.
Spark is used to run distributed jobs that read raw player events from Cassandra
and generate materialized views, which are cached in memory. These materialized
views can then be queried using subsequent jobs. Spark is fast enough that these
jobs running on materialized views can be used for interactive queries!
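A view-building job of this kind can be sketched in Shark's HiveQL. The source table and its columns here (raw_events and so on) are illustrative assumptions rather than the actual schema; the output names match those used in Listing 12.2:

```sql
-- Sketch only: raw_events and its columns are assumed names.
-- In Shark, a table whose name ends in _cached is held in cluster
-- memory, so later queries against it avoid rereading Cassandra.
CREATE TABLE view1_cached AS
SELECT country, region, device_type, count(*) AS plays
FROM raw_events
GROUP BY country, region, device_type;
```

Because the grouped result is far smaller than the raw event stream, queries against the cached table can return at interactive speed.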
An example materialized view would have country, region, device type, and other metrics as columns. Thus, finding out the top device types in the United States would involve a query like the one shown in Listing 12.2. This query would be entered into Shark, which is Hive on Spark.
Listing 12.2 Example Query to Find Top Device Types in the United States
SELECT device_type, sum(plays) AS p
FROM view1_cached
WHERE country = "US"
GROUP BY device_type
ORDER BY p DESC
LIMIT 20;
To minimize the maintenance involved in having such a large cluster, we have decided to have all ColumnFamilies use LeveledCompaction. For use cases like
ours, SizeTieredCompaction can provide better performance, but it requires lots of
free disk space, which has caused issues for us in the past. We're happy to take the
small performance hit to get an easier-to-manage system.
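In CQL, switching a table to leveled compaction looks like the following; the table name here is a hypothetical example:

```sql
-- Hypothetical table name; LeveledCompactionStrategy is the
-- standard leveled compaction class that ships with Cassandra.
ALTER TABLE player_events
  WITH compaction = {'class': 'LeveledCompactionStrategy'};
```

Leveled compaction needs only a small amount of headroom per compaction, which is what makes it easier to operate on nearly full disks than size-tiered compaction.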
When building our newest hardware cluster, we had some tough decisions to
make about what disks to buy and how to present them to Cassandra. We've used
Linux MDRAID extensively in RAID5 and RAID10 configurations. Both work
fine with ext4 or XFS file systems. What we really wanted was the efficiency of
RAID5, but with more flexibility so we can run Mesos and Spark on the same
hardware. When it comes to data integrity and ease of management, ZFS is a great
choice, especially when dealing with lumbering 3TB and 4TB hard drives. We
tested ZFS-on-Linux on a couple of nodes in one of our production clusters and
 