Hadoop and Cassandra
In the age of big data analytics, there are hardly any data-rich companies that do not want their data to be extracted, evaluated, and analyzed to provide more business insight. In the past, analyzing large datasets (structured or unstructured) spanning terabytes or petabytes was expensive and technically challenging: distributed computing was hard to manage, and the hardware to support that kind of infrastructure was not financially feasible for everyone.
Note
This chapter does not cover Cassandra integration with Hive and Oozie. To learn about Cassandra integration with Oozie, visit http://wiki.apache.org/cassandra/HadoopSupport#Oozie.
There are ongoing efforts to make Hive integration a native part of Cassandra. If you are planning to use Cassandra with Hive, visit https://issues.apache.org/jira/browse/CASSANDRA-4131.
DataStax Enterprise editions have built-in Cassandra-enabled Hive MapReduce clients. Check them out at http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/ana/anaHiv.html.
A couple of things changed this landscape completely in favor of medium and small companies. Hardware prices dropped, while the memory and processing power of computing units increased dramatically. On-demand hardware came into the picture: for about 20 dollars, you can rent about 100 virtual machines from AWS for an hour, each with quad-core (virtual) processors, 7.5 GB RAM, and 840 GB of ephemeral storage (you can also plug in gigantic, permanent network-attached storage). Multiple vendors provide this sort of cloud infrastructure. However, the biggest leap in making big data analysis commonplace is the availability of extremely high-quality, free, and open source software that abstracts developers from managing distributed systems. This software makes it possible to plug in various algorithms and use the system as a black box that takes care of fetching data, applying routines, and returning results. Hadoop is the most prominent name in this field; currently, it is the de facto standard for big data processing.
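The "black box" that Hadoop provides is the MapReduce programming model: you supply a map routine and a reduce routine, and the framework handles distributing the data, grouping intermediate results, and collecting the output. As a rough sketch of that flow (plain single-machine Python, not the Hadoop API, with a classic word-count job as the example):

```python
from collections import defaultdict

def map_phase(record):
    # User-supplied map routine: emit (word, 1) for each word in a line.
    for word in record.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # The framework's job: group all emitted values by key
    # (Hadoop does this between the map and reduce stages).
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    # User-supplied reduce routine: sum the counts for one word.
    return key, sum(values)

def run_job(records):
    # Map every input record, shuffle, then reduce each group.
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

counts = run_job(["big data analysis", "big data on Hadoop"])
```

On a real cluster, the map and reduce calls run in parallel on many nodes and the shuffle moves data over the network; the point of the sketch is only that the developer writes the two small routines and treats everything else as the framework's responsibility.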