ing of these insights quick, scalable, and easy? Traditionally, real-time queries for Cassandra have involved minimizing the number of rows you have to read from. Hadoop is very scalable but very slow, and the programming API lacks features. Also, the built-in Cassandra InputFormat is designed only for reading an entire ColumnFamily of data. Real-time streaming frameworks such as Storm are a good fit for fixed processing from firehoses but not so good for flexible queries from a data store. Thus, we have turned to Spark, a very fast, in-memory distributed computing framework, to help us with these insights.
Spark is used to run distributed jobs that read raw player events from Cassandra
and generate materialized views, which are cached in memory. These materialized
views can then be queried using subsequent jobs. Spark is fast enough that these
jobs running on materialized views can be used for interactive queries!
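A view-building job of this kind can be sketched in Shark's HiveQL. The source table and its columns here (raw_events and so on) are illustrative assumptions rather than the actual schema; the output names match those used in Listing 12.2:

```sql
-- Sketch only: raw_events and its columns are assumed names.
-- In Shark, a table whose name ends in _cached is held in cluster
-- memory, so later queries against it avoid rereading Cassandra.
CREATE TABLE view1_cached AS
SELECT country, region, device_type, count(*) AS plays
FROM raw_events
GROUP BY country, region, device_type;
```

Because the grouped result is far smaller than the raw event stream, queries against the cached table can return at interactive speed.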
An example materialized view would have country, region, device type, and other metrics as columns. Thus, finding out the top device types in the United States would involve a query like the one shown in Listing 12.2. This query would be entered into Shark, which is Hive on Spark.
Listing 12.2 Example Query to Find Top Device Types in the United States
SELECT device_type, sum(plays) AS p
FROM view1_cached
WHERE country = "US"
GROUP BY device_type
ORDER BY p DESC
LIMIT 20;
To minimize the maintenance involved in having such a large cluster, we have decided to have all ColumnFamilies use LeveledCompaction. For use cases like
ours, SizeTieredCompaction can provide better performance, but it requires lots of
free disk space, which has caused issues for us in the past. We're happy to take the
small performance hit to get an easier-to-manage system.
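In CQL, switching a table to leveled compaction looks like the following; the table name here is a hypothetical example:

```sql
-- Hypothetical table name; LeveledCompactionStrategy is the
-- standard leveled compaction class that ships with Cassandra.
ALTER TABLE player_events
  WITH compaction = {'class': 'LeveledCompactionStrategy'};
```

Leveled compaction needs only a small amount of headroom per compaction, which is what makes it easier to operate on nearly full disks than size-tiered compaction.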
When building our newest hardware cluster, we had some tough decisions to
make about what disks to buy and how to present them to Cassandra. We've used
Linux MDRAID extensively in RAID5 and RAID10 configurations. Both work
fine with ext4 or XFS file systems. What we really wanted was the efficiency of
RAID5, but with more flexibility so we can run Mesos and Spark on the same
hardware. When it comes to data integrity and ease of management, ZFS is a great
choice, especially when dealing with lumbering 3TB and 4TB hard drives. We
tested ZFS-on-Linux on a couple of nodes in one of our production clusters and
 