5. Generate tuples in the format CqlStorage expects: a key tuple pairing the id column with a timeuuid value produced by the GenerateBinTimeUUID function, followed by a tuple of the tweet values (date and body) that will be bound to the output query in the next step:
data_to = FOREACH tweets GENERATE
    TOTUPLE(TOTUPLE('id', GenerateBinTimeUUID())),
    TOTUPLE(date, body);
6. Finally, store this data into Cassandra (%3D and %3F are the URL-encoded forms of = and ? in the output query):
STORE data_to INTO 'cql://twitter/twitterdata?output_query=update twitterdata set tweetdate %3D%3F,body %3D%3F'
    USING CqlStorage();
7. Now you can query the twitterdata column family to verify the inserted data (a sketch of the assumed table definition follows these steps):
SELECT * FROM twitterdata;
SELECT count(*) FROM twitterdata;
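The preceding STORE assumes a twitterdata table keyed by a timeuuid id, with tweetdate and body columns bound by the output query. The actual definition is created earlier in the chapter; the following is only a minimal sketch of what such a table could look like, and the column types are assumptions:
-- Hypothetical sketch of the twitterdata column family used above;
-- the tweetdate and body types are assumed, not taken from the chapter.
CREATE TABLE twitter.twitterdata (
    id timeuuid PRIMARY KEY,   -- filled by GenerateBinTimeUUID()
    tweetdate text,            -- bound to the first ? of the output query
    body text                  -- bound to the second ?
);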
Up until this point, we have explored various ways to load data and run MapReduce programs over Cassandra using Apache Pig. Apache Pig comes in very handy for developers, who can quickly write short Pig Latin scripts to execute MapReduce jobs instead of writing lengthy native MapReduce programs, as the sketch below illustrates.
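As a point of comparison, a simple aggregation such as counting tweets per day takes only a few lines of Pig Latin, whereas the equivalent native MapReduce job would need separate mapper, reducer, and driver classes. The relation and field names below (tweets, date) are carried over from the earlier steps and are only illustrative:
-- group the tweets relation by date and count the tweets per day
tweets_by_date = GROUP tweets BY date;
daily_counts = FOREACH tweets_by_date GENERATE group AS tweetdate, COUNT(tweets) AS total;
DUMP daily_counts;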
In the next section, we will explore running MapReduce analytics over Cassandra in a SQL-like manner, which is more commonly used.
Apache Hive
Apache Hive is a platform that provides data analytics over very large volumes of data stored in HDFS. Hive ships with various features such as built-in UDTFs (user-defined table functions), UDAFs (user-defined aggregate functions), analytics over compressed data, and, most importantly, the Hive Query Language (HiveQL). We will discuss these features in the upcoming sections.
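To give a flavor of HiveQL before the detailed discussion, the following query shows how the built-in count() aggregate (a UDAF) could summarize tweets per day. The twitterdata table and its columns are assumptions carried over from the Pig example; mapping Cassandra data into Hive is covered in the sections that follow:
-- count tweets per day using the built-in count() aggregate (a UDAF)
SELECT tweetdate, count(*) AS total_tweets
FROM twitterdata
GROUP BY tweetdate;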
Hive was initially developed as part of Facebook's research initiatives and later went on to become an Apache Top Level Project (TLP).