Databases Reference
In-Depth Information
SerDe stands for “serializer” and “deserializer”, which are interfaces that
tell Hive how it should translate the data into something that Hive can
process. In particular, the deserializer interface is used when we read
data from disk, and converts the data into objects that Hive knows how
to manipulate. We can write a custom SerDe that reads the JSON data in
and translates the objects for Hive.
The SerDe will take a tweet in JSON form, and translate the JSON entities into columns:
SELECT created_at, entities, text, user
FROM tweets
WHERE user.screen_name='HS'
AND re-tweeted_status.user.screen_name='MJ';
See results in Table 7-1 .
Table 7-1. Results of the SerDe Function on the JSON Entity
Created At
Entities
Text
User
Mon Apr 29
13:58:23 +0000
2013
{ “urls”: [], “user_
mentions”: [ {“screen
name”: “HS”, “name”:
“Harsha Srivatsa”}],
“hashtags”: [{“text”:
“BigDataAnalytics”}]}
RT@HS:
#BigDataAnalytics - It
is not bigness of big
data that is interesting,
it is the value that you
can derive from all
these data that can
make huge business
impacts ……
{“screen name”:
“MJ”, “name”:
“Madhu Jagadeesh”,
“friends_count”: 176,
“followers_count”:
231, “statuses_count”:
2458, “verified”: “false”,
“utc_offset”: null,
“time_zone”: null}
We've now managed to put together an end-to-end system, which gathers data from
the twitter-streaming API, sends the tweets to files on HDFS through Flume, and uses
Oozie to periodically load the files into Hive, where we can query the raw JSON data,
through the use of a Hive SerDe.
The tweeter data has some structure, but certain fields may or may not exist. The
re-tweeted_status field, for example, will only be present if the tweet was a re-tweet.
Additionally, some of the fields may be arbitrarily complex. The hashtags field is an array
of all the hashtags present in the tweets, but most RDBMSs do not support arrays as a
column type. This semi-structured quality of the data makes the data very difficult to
query in a traditional RDBMS. Hive can handle this data much more gracefully.
The query below will find usernames, and the number of re-tweets they have
generated across all the tweets that we have data for:
SELECT
t.re-tweeted_screen_name,
sum(re-tweets) AS total_re-tweets,
count(*) AS tweet_count
 
 
Search WWH ::




Custom Search