Hadoop and Cassandra
In the age of big data analytics, there are hardly any data-rich companies that do not want their data to be extracted, evaluated, and analyzed to provide more business insight. In the past, analyzing large datasets (structured or unstructured) spanning terabytes or petabytes was expensive and technically challenging: distributed computing was hard to manage, and the hardware to support that kind of infrastructure was not financially feasible for everyone.
Note
This chapter does not cover Cassandra integration with Hive and Oozie. To learn about Cassandra integration with Oozie, visit http://wiki.apache.org/cassandra/HadoopSupport#Oozie.
There are ongoing efforts to make Hive integration a native part of Cassandra. If you are planning to use Cassandra with Hive, visit https://issues.apache.org/jira/browse/CASSANDRA-4131.
DataStax Enterprise editions have built-in Cassandra-enabled Hive MapReduce clients. Check them out at http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/ana/anaHiv.html.
A couple of things changed this landscape completely in favor of medium and small companies. Hardware prices dropped, while the memory and processing power of computing units increased dramatically. On-demand hardware came into the picture: for about 20 dollars, you can rent about 100 virtual machines from AWS for an hour, each with quad-core (virtual) processors, 7.5 GB RAM, and 840 GB of ephemeral storage (you can also plug in gigantic, permanent network-attached storage). Multiple vendors provide this sort of cloud infrastructure. However, the biggest leap in making big data analysis commonplace is the availability of extremely high-quality, free, and open source software that abstracts developers from managing distributed systems. This software makes it possible to plug in various algorithms and use the system as a black box that takes care of fetching data, applying routines, and returning results. Hadoop is the most prominent name in this field; currently, it is the de facto standard for big data processing.
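The "black box" that Hadoop provides is the MapReduce programming model: you supply a map routine and a reduce routine, and the framework handles distributing the data, grouping intermediate results, and collecting the output. As a rough sketch of that flow (plain single-machine Python, not the Hadoop API, with a classic word-count job as the example):

```python
from collections import defaultdict

def map_phase(record):
    # User-supplied map routine: emit (word, 1) for each word in a line.
    for word in record.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # The framework's job: group all emitted values by key
    # (Hadoop does this between the map and reduce stages).
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    # User-supplied reduce routine: sum the counts for one word.
    return key, sum(values)

def run_job(records):
    # Map every input record, shuffle, then reduce each group.
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

counts = run_job(["big data analysis", "big data on Hadoop"])
```

On a real cluster, the map and reduce calls run in parallel on many nodes and the shuffle moves data over the network; the point of the sketch is only that the developer writes the two small routines and treats everything else as the framework's responsibility.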