Gathering Data with Apache Flume
The Twitter Streaming API provides a constant stream of tweets. To gather the feed, one
option would be to use a simple utility like curl to access the API and then periodically
load the resulting files into HDFS. However, this would require us to write code to control
where the data goes in HDFS. A second option is to use a specialized component from the
Hadoop ecosystem, such as Flume, to move the files from the API into HDFS automatically,
without manual intervention.
Flume is a data ingestion utility that is configured by defining the endpoints of a data
flow, called sources and sinks. In Flume, each individual piece of data (a tweet, in our
case) is called an event; sources produce events and send them through a channel,
which connects the source to the sink. The sink then writes the events out to a predefined
location. For our use case, we'll need to design a custom source that accesses the Twitter
Streaming API and sends the tweets through a channel to a sink that writes them to files in
HDFS. Additionally, we can use the custom source to filter the tweets on a set of keywords
to help identify relevant tweets.
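As an illustration, such a flow is wired together in a Flume agent configuration file. The sketch below assumes an agent named TwitterAgent and a hypothetical custom source class (com.example.flume.source.TwitterSource) with a source-specific keywords property; the HDFS sink and memory channel settings are standard Flume options:

    # Name the source, channel, and sink of this agent (names are arbitrary)
    TwitterAgent.sources  = Twitter
    TwitterAgent.channels = MemChannel
    TwitterAgent.sinks    = HDFS

    # Custom source reading the Twitter Streaming API (class name is a placeholder)
    TwitterAgent.sources.Twitter.type     = com.example.flume.source.TwitterSource
    TwitterAgent.sources.Twitter.channels = MemChannel
    # Source-specific property for keyword filtering (hypothetical)
    TwitterAgent.sources.Twitter.keywords = hadoop, flume, hive

    # Standard HDFS sink: write raw events under an hourly-bucketed path
    TwitterAgent.sinks.HDFS.type                   = hdfs
    TwitterAgent.sinks.HDFS.channel                = MemChannel
    TwitterAgent.sinks.HDFS.hdfs.path              = /user/flume/tweets/%Y/%m/%d/%H/
    TwitterAgent.sinks.HDFS.hdfs.fileType          = DataStream
    TwitterAgent.sinks.HDFS.hdfs.useLocalTimeStamp = true

    # Memory channel buffering events between source and sink
    TwitterAgent.channels.MemChannel.type     = memory
    TwitterAgent.channels.MemChannel.capacity = 10000

The hourly directories the sink creates line up naturally with the hourly partitions discussed in the next section.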
Partition Management with Apache Oozie
Once we have the Twitter data loaded into HDFS, we can stage it for querying by creating
an external table in Hive. Using an external table allows us to query the data without
moving it from the location where it lands in HDFS. To ensure scalability as we add
more and more data, we'll also need to partition the table. A partitioned table allows
us to prune the files we read when querying, which results in better performance
when dealing with large data sets. However, the Twitter API will continue to stream
tweets and Flume will perpetually create new files, so we'll want to automate the periodic
process of adding partitions to our table as the new data comes in.
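For example, assuming a table named tweets partitioned by date and hour (the table and column names here are placeholders, matching the sketches later in this section), a query restricted to a single hour only scans the files under that partition:

    -- Partition pruning: only the files belonging to the (dt, hr) partition
    -- named in the WHERE clause are read; all other hours are skipped.
    SELECT COUNT(*)
    FROM tweets
    WHERE dt = '2023-01-15' AND hr = '13';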
Apache Oozie is a workflow coordination system that can automate this partition
management for us. Oozie is an extremely flexible system for designing job workflows,
which can be scheduled to run based on a set of criteria. We can configure a workflow
that runs an ALTER TABLE command to add a partition containing the last hour's worth of
data to the Hive table, and we can schedule that workflow to run every hour. This ensures
that we're always looking at up-to-date data.
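Concretely, the statement the hourly workflow runs might look like the following (a sketch: the table name, partition columns, and paths are placeholders, and in practice Oozie would supply the date and hour for the period that just ended):

    -- Add the partition for the hour that just finished; IF NOT EXISTS makes
    -- the statement safe to re-run if the workflow is retried.
    ALTER TABLE tweets
      ADD IF NOT EXISTS PARTITION (dt = '2023-01-15', hr = '13')
      LOCATION '/user/flume/tweets/2023/01/15/13/';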
Querying Complex Data with Hive
Before we can query the data, we need to ensure that the Hive table can properly interpret
the JSON data. By default, Hive expects input files to use a delimited row format, but
our Twitter data is in JSON, which will not work with the default settings. Handling data
like this is actually one of Hive's biggest strengths: Hive allows us to flexibly define, and
redefine, how the data is represented on disk. The schema is only enforced when we read
the data, and we can use the Hive SerDe (serializer/deserializer) interface to specify how to
interpret what we've loaded.
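A simplified sketch of such a table definition is shown below. The column list covers only a small, illustrative subset of the tweet JSON, and org.apache.hive.hcatalog.data.JsonSerDe is just one readily available JSON SerDe (it may require adding the hive-hcatalog-core jar); the original setup may well use a custom SerDe instead:

    -- External, partitioned table over the raw JSON files written by Flume.
    -- Columns are a small, illustrative subset of the tweet structure.
    CREATE EXTERNAL TABLE tweets (
      id BIGINT,
      created_at STRING,
      text STRING,
      `user` STRUCT<screen_name:STRING, followers_count:INT>
    )
    PARTITIONED BY (dt STRING, hr STRING)
    ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
    LOCATION '/user/flume/tweets';

Because the table is external, dropping it later would leave the underlying files in place, which is the behavior we want for data owned by the Flume pipeline.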
 