I can see this behavior by running the Flume show script again:
[hadoop@hc1nn flume]$ ./flume_show_hdfs.sh
Found 11 items
-rw-r--r-- 2 hadoop hadoop 1281 2014-07-26 17:50 /flume/messages/FlumeData.1406353810397
-rw-r--r-- 2 hadoop hadoop 1057 2014-07-26 17:50 /flume/messages/FlumeData.1406353810398
-rw-r--r-- 2 hadoop hadoop 926 2014-07-26 17:50 /flume/messages/FlumeData.1406353810399
-rw-r--r-- 2 hadoop hadoop 1528 2014-07-26 17:50 /flume/messages/FlumeData.1406353810400
-rw-r--r-- 2 hadoop hadoop 1281 2014-07-26 17:50 /flume/messages/FlumeData.1406353810401
-rw-r--r-- 2 hadoop hadoop 1214 2014-07-26 17:50 /flume/messages/FlumeData.1406353810402
-rw-r--r-- 2 hadoop hadoop 1190 2014-07-26 17:50 /flume/messages/FlumeData.1406353810403
-rw-r--r-- 2 hadoop hadoop 1276 2014-07-26 17:50 /flume/messages/FlumeData.1406353810404
-rw-r--r-- 2 hadoop hadoop 1387 2014-07-26 17:50 /flume/messages/FlumeData.1406353810405
-rw-r--r-- 2 hadoop hadoop 1107 2014-07-26 17:50 /flume/messages/FlumeData.1406353810406
-rw-r--r-- 2 hadoop hadoop 1281 2014-07-26 17:51 /flume/messages/FlumeData.1406353810407
As you can see in this example, those 100 messages have been written to HDFS, and the data is now available for further processing by one of the other processing languages. For instance, you could use native Apache Pig to strip information from these files and employ an Oozie workflow to organize that processing into an ETL chain.
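Whichever tool picks the data up next, it sees the Flume output simply as files in HDFS. As a minimal illustration of that access (a sketch of reading the files back, not the Pig/Oozie processing itself), the following Java fragment uses the Hadoop FileSystem API to list /flume/messages and print the first line of each file. The class name FlumeOutputCheck is illustrative, and the code assumes a Hadoop client configuration (core-site.xml) on the classpath so that fs.defaultFS points at the cluster.

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal sketch: list the Flume output directory in HDFS and peek at each file.
public class FlumeOutputCheck {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml from the classpath
    FileSystem fs = FileSystem.get(conf);

    for (FileStatus status : fs.listStatus(new Path("/flume/messages"))) {
      System.out.println(status.getLen() + "\t" + status.getPath());

      // Print the first line of each FlumeData file as a quick sanity check.
      try (BufferedReader reader =
               new BufferedReader(new InputStreamReader(fs.open(status.getPath())))) {
        System.out.println("  " + reader.readLine());
      }
    }
    fs.close();
  }
}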
This example uses a single agent with one source and one sink. You could also organize agents to act as sources or sinks for later agents in the chain, so that feeds fan in and out, and you could build complex agent-processing topologies from many different component types, depending upon your needs. Check the Apache Flume website at flume.apache.org for further configuration examples.
You've now seen how to process relational database data with Sqoop and log-based data with Flume, but what
about streamed data? How is it possible to process an endless stream of data from a system like Twitter? The data
would not stop; it would just keep coming. The answer is that systems like Storm allow the processing of data streams. For instance, by using this tool, you can carry out continuous trend analysis on the current data in the stream. The next
section examines some uses of Storm.
Moving Data with Storm
Apache Storm (storm.incubator.apache.org) is an Apache Software Foundation incubator project for processing unbounded data streams in real time. (The term “incubator” means that this is a new Apache project that is not yet mature; it must follow the Apache process before it can “graduate,” which might mean that its release process or documentation is not yet complete.) The best way to understand the significance of Storm is with a comparison: on Hadoop, a MapReduce job will start, process its data set, and exit; a Storm topology (the architecture of a Storm job), however, will run forever because its data feed is unlimited.
Consider a feed of events from the website Twitter; they just keep coming. When Storm processes a feed from
such a source, it processes the data it receives in real time. So, at any point, what Storm presents is a window on a
stream of data at the current time. Because of this, it also presents current trends in the data. In terms of Twitter, that
might indicate what many people are talking about right now. And because the data set is a stream that never ends, a Storm topology must be stopped manually.
A topology is a Storm job architecture; it is described in terms of spouts, streams, and bolts. A stream is a flow of data made up of a sequence of data records called tuples.
Figure 6-4 shows a simple Storm data record, or tuple; a sequence or pipe of these data records forms a stream,
which is shown in Figure 6-5 .
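The same ideas map directly onto Storm's Java API. The following is a minimal, hypothetical topology rather than one of this chapter's examples: it wires Storm's built-in TestWordSpout (a demonstration spout that emits random words as single-field tuples) to a trivial bolt that prints each tuple it receives. It assumes a Storm 0.9.x dependency (the backtype.storm packages of that era; later releases renamed them to org.apache.storm), and the class names MinimalTopology and PrintBolt are illustrative.

import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.testing.TestWordSpout;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Tuple;

public class MinimalTopology {

  // A bolt that consumes the spout's stream and prints each tuple it receives.
  public static class PrintBolt extends BaseBasicBolt {
    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
      System.out.println("tuple: " + tuple.getString(0));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
      // This bolt emits no further streams, so it declares no output fields.
    }
  }

  public static void main(String[] args) throws Exception {
    TopologyBuilder builder = new TopologyBuilder();

    // Spout: an unbounded source of tuples (here, random words from the test spout).
    builder.setSpout("words", new TestWordSpout(), 1);

    // Bolt: subscribes to the spout's stream; shuffleGrouping spreads tuples randomly.
    builder.setBolt("printer", new PrintBolt(), 1).shuffleGrouping("words");

    // Run in-process; like any topology, it keeps running until it is killed.
    LocalCluster cluster = new LocalCluster();
    cluster.submitTopology("minimal-topology", new Config(), builder.createTopology());

    Thread.sleep(10000);                       // let it run briefly
    cluster.killTopology("minimal-topology");  // a topology never finishes on its own
    cluster.shutdown();
  }
}

On a real cluster you would submit with StormSubmitter rather than LocalCluster, and the topology would then run until it is explicitly killed, which is exactly the never-ending behavior described above.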
 