Gathering Data with Apache Flume
The Twitter Streaming API provides a constant stream of tweets. To gather the feed, one
option would be to use a simple utility like curl to access the API and then periodically
load the resulting files into HDFS. However, this would require us to write code to control
where the data goes in HDFS. A second option is to use a specialized component from the
Hadoop ecosystem, such as Flume, to move the files from the API into HDFS automatically,
without manual intervention.
Flume is a data ingestion utility that is configured by defining the endpoints of a data
flow, called sources and sinks. In Flume, each individual piece of data (a tweet, in our
case) is called an event; sources produce events and send them through a channel,
which connects the source to the sink. The sink then writes the events out to a predefined
location. For our use case, we'll need to design a custom source that accesses the Twitter
Streaming API and sends the tweets through a channel to a sink that writes them to files in
HDFS. Additionally, we can use the custom source to filter the tweets on a set of keywords
to help identify relevant tweets.
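As an illustration, such a flow is wired together in a Flume agent configuration file. The sketch below assumes an agent named TwitterAgent and a hypothetical custom source class (com.example.flume.source.TwitterSource) with a source-specific keywords property; the HDFS sink and memory channel settings are standard Flume options:

    # Name the source, channel, and sink of this agent (names are arbitrary)
    TwitterAgent.sources  = Twitter
    TwitterAgent.channels = MemChannel
    TwitterAgent.sinks    = HDFS

    # Custom source reading the Twitter Streaming API (class name is a placeholder)
    TwitterAgent.sources.Twitter.type     = com.example.flume.source.TwitterSource
    TwitterAgent.sources.Twitter.channels = MemChannel
    # Source-specific property for keyword filtering (hypothetical)
    TwitterAgent.sources.Twitter.keywords = hadoop, flume, hive

    # Standard HDFS sink: write raw events under an hourly-bucketed path
    TwitterAgent.sinks.HDFS.type                   = hdfs
    TwitterAgent.sinks.HDFS.channel                = MemChannel
    TwitterAgent.sinks.HDFS.hdfs.path              = /user/flume/tweets/%Y/%m/%d/%H/
    TwitterAgent.sinks.HDFS.hdfs.fileType          = DataStream
    TwitterAgent.sinks.HDFS.hdfs.useLocalTimeStamp = true

    # Memory channel buffering events between source and sink
    TwitterAgent.channels.MemChannel.type     = memory
    TwitterAgent.channels.MemChannel.capacity = 10000

The hourly directories the sink creates line up naturally with the hourly partitions discussed in the next section.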
Partition Management with Apache Oozie
Once we have the Twitter data loaded into HDFS, we can stage it for querying by creating
an external table in Hive. Using an external table allows us to query the data without
moving it from the location where it lands in HDFS. To ensure scalability as we add
more and more data, we'll also need to partition the table. A partitioned table allows
us to prune the files we read when querying, which results in better performance
when dealing with large data sets. However, the Twitter API will continue to stream
tweets and Flume will perpetually create new files, so we'll want to automate the periodic
process of adding partitions to our table as the new data comes in.
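For example, assuming a table named tweets partitioned by date and hour (the table and column names here are placeholders, matching the sketches later in this section), a query restricted to a single hour only scans the files under that partition:

    -- Partition pruning: only the files belonging to the (dt, hr) partition
    -- named in the WHERE clause are read; all other hours are skipped.
    SELECT COUNT(*)
    FROM tweets
    WHERE dt = '2023-01-15' AND hr = '13';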
Apache Oozie is a workflow coordination system that can automate this partition
management for us. Oozie is an extremely flexible system for designing job workflows,
which can be scheduled to run based on a set of criteria. We can configure a workflow
that runs an ALTER TABLE command to add a partition containing the last hour's worth of
data to the Hive table, and we can schedule that workflow to run every hour. This ensures
that we're always looking at up-to-date data.
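Concretely, the statement the hourly workflow runs might look like the following (a sketch: the table name, partition columns, and paths are placeholders, and in practice Oozie would supply the date and hour for the period that just ended):

    -- Add the partition for the hour that just finished; IF NOT EXISTS makes
    -- the statement safe to re-run if the workflow is retried.
    ALTER TABLE tweets
      ADD IF NOT EXISTS PARTITION (dt = '2023-01-15', hr = '13')
      LOCATION '/user/flume/tweets/2023/01/15/13/';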
Querying Complex Data with Hive
Before we can query the data, we need to ensure that the Hive table can properly interpret
the JSON data. By default, Hive expects input files to use a delimited row format, but
our Twitter data is in JSON, which will not work with the default settings. Handling data
like this is actually one of Hive's biggest strengths: Hive allows us to flexibly define, and
redefine, how the data is represented on disk. The schema is only enforced when we read
the data, and we can use the Hive SerDe (serializer/deserializer) interface to specify how to
interpret what we've loaded.
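A simplified sketch of such a table definition is shown below. The column list covers only a small, illustrative subset of the tweet JSON, and org.apache.hive.hcatalog.data.JsonSerDe is just one readily available JSON SerDe (it may require adding the hive-hcatalog-core jar); the original setup may well use a custom SerDe instead:

    -- External, partitioned table over the raw JSON files written by Flume.
    -- Columns are a small, illustrative subset of the tweet structure.
    CREATE EXTERNAL TABLE tweets (
      id BIGINT,
      created_at STRING,
      text STRING,
      `user` STRUCT<screen_name:STRING, followers_count:INT>
    )
    PARTITIONED BY (dt STRING, hr STRING)
    ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
    LOCATION '/user/flume/tweets';

Because the table is external, dropping it later would leave the underlying files in place, which is the behavior we want for data owned by the Flume pipeline.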
 