5. Generate tuples in the format CqlStorage expects: a key tuple pairing the id column with a timeuuid value produced by the GenerateBinTimeUUID function, followed by a tuple of the tweet values (date and body) that will be bound to the output query in the next step:
data_to = FOREACH tweets GENERATE
    TOTUPLE(TOTUPLE('id', GenerateBinTimeUUID())),
    TOTUPLE(date, body);
6. Finally, store this data into Cassandra (%3D and %3F are the URL-encoded forms of = and ? in the output query):
STORE data_to INTO 'cql://twitter/twitterdata?output_query=update twitterdata set tweetdate %3D%3F,body %3D%3F'
    USING CqlStorage();
7. Now you can query the twitterdata column family to verify the inserted data (a sketch of the assumed table definition follows these steps):
SELECT * FROM twitterdata;
SELECT count(*) FROM twitterdata;
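The preceding STORE assumes a twitterdata table keyed by a timeuuid id, with tweetdate and body columns bound by the output query. The actual definition is created earlier in the chapter; the following is only a minimal sketch of what such a table could look like, and the column types are assumptions:
-- Hypothetical sketch of the twitterdata column family used above;
-- the tweetdate and body types are assumed, not taken from the chapter.
CREATE TABLE twitter.twitterdata (
    id timeuuid PRIMARY KEY,   -- filled by GenerateBinTimeUUID()
    tweetdate text,            -- bound to the first ? of the output query
    body text                  -- bound to the second ?
);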
Up until this point, we have explored various ways to load data and run MapReduce programs over Cassandra using Apache Pig. Apache Pig comes in very handy for developers, who can quickly write short Pig Latin scripts to execute MapReduce jobs instead of writing lengthy native MapReduce programs, as the sketch below illustrates.
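As a point of comparison, a simple aggregation such as counting tweets per day takes only a few lines of Pig Latin, whereas the equivalent native MapReduce job would need separate mapper, reducer, and driver classes. The relation and field names below (tweets, date) are carried over from the earlier steps and are only illustrative:
-- group the tweets relation by date and count the tweets per day
tweets_by_date = GROUP tweets BY date;
daily_counts = FOREACH tweets_by_date GENERATE group AS tweetdate, COUNT(tweets) AS total;
DUMP daily_counts;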
In the next section, we will explore running MapReduce analytics over Cassandra in a SQL-like manner, which is more commonly used.
Apache Hive
Apache Hive is a platform that provides data analytics over very large volumes of data stored in HDFS. Hive ships with various features such as built-in UDTFs (user-defined table functions), UDAFs (user-defined aggregate functions), analytics over compressed data, and, most importantly, the Hive Query Language (HiveQL). We will discuss these features in the upcoming sections.
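To give a flavor of HiveQL before the detailed discussion, the following query shows how the built-in count() aggregate (a UDAF) could summarize tweets per day. The twitterdata table and its columns are assumptions carried over from the Pig example; mapping Cassandra data into Hive is covered in the sections that follow:
-- count tweets per day using the built-in count() aggregate (a UDAF)
SELECT tweetdate, count(*) AS total_tweets
FROM twitterdata
GROUP BY tweetdate;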
Hive was initially developed as part of Facebook's research initiatives and later went on to become an Apache Top Level Project (TLP).