I can see this behavior by running the Flume show script again:
[hadoop@hc1nn flume]$ ./flume_show_hdfs.sh
Found 11 items
-rw-r--r-- 2 hadoop hadoop 1281 2014-07-26 17:50 /flume/messages/FlumeData.1406353810397
-rw-r--r-- 2 hadoop hadoop 1057 2014-07-26 17:50 /flume/messages/FlumeData.1406353810398
-rw-r--r-- 2 hadoop hadoop 926 2014-07-26 17:50 /flume/messages/FlumeData.1406353810399
-rw-r--r-- 2 hadoop hadoop 1528 2014-07-26 17:50 /flume/messages/FlumeData.1406353810400
-rw-r--r-- 2 hadoop hadoop 1281 2014-07-26 17:50 /flume/messages/FlumeData.1406353810401
-rw-r--r-- 2 hadoop hadoop 1214 2014-07-26 17:50 /flume/messages/FlumeData.1406353810402
-rw-r--r-- 2 hadoop hadoop 1190 2014-07-26 17:50 /flume/messages/FlumeData.1406353810403
-rw-r--r-- 2 hadoop hadoop 1276 2014-07-26 17:50 /flume/messages/FlumeData.1406353810404
-rw-r--r-- 2 hadoop hadoop 1387 2014-07-26 17:50 /flume/messages/FlumeData.1406353810405
-rw-r--r-- 2 hadoop hadoop 1107 2014-07-26 17:50 /flume/messages/FlumeData.1406353810406
-rw-r--r-- 2 hadoop hadoop 1281 2014-07-26 17:51 /flume/messages/FlumeData.1406353810407
As you can see in this example, those 100 messages have been written to HDFS, and the data is now available for further processing by one of the other processing languages. For instance, you could use native Apache Pig to strip information from these files and employ an Oozie workflow to organize that processing into an ETL chain.
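Whichever tool picks the data up next, it sees the Flume output simply as files in HDFS. As a minimal illustration of that access (a sketch of reading the files back, not the Pig/Oozie processing itself), the following Java fragment uses the Hadoop FileSystem API to list /flume/messages and print the first line of each file. The class name FlumeOutputCheck is illustrative, and the code assumes a Hadoop client configuration (core-site.xml) on the classpath so that fs.defaultFS points at the cluster.

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal sketch: list the Flume output directory in HDFS and peek at each file.
public class FlumeOutputCheck {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml from the classpath
    FileSystem fs = FileSystem.get(conf);

    for (FileStatus status : fs.listStatus(new Path("/flume/messages"))) {
      System.out.println(status.getLen() + "\t" + status.getPath());

      // Print the first line of each FlumeData file as a quick sanity check.
      try (BufferedReader reader =
               new BufferedReader(new InputStreamReader(fs.open(status.getPath())))) {
        System.out.println("  " + reader.readLine());
      }
    }
    fs.close();
  }
}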
This example uses a single agent with one source and one sink. You could also organize agents to act as sources or sinks for later agents in the chain, so that feeds fan in and out, and you could build complex agent-processing topologies from many different component types, depending upon your needs. Check the Apache Flume website at flume.apache.org for further configuration examples.
You've now seen how to process relational database data with Sqoop and log-based data with Flume, but what
about streamed data? How is it possible to process an endless stream of data from a system like Twitter? The data
would not stop; it would just keep coming. The answer is that systems like Storm allow the processing of data streams. For instance, by using this tool, you can carry out continuous trend analysis on the current data in the stream. The next
section examines some uses of Storm.
Moving Data with Storm
Apache Storm (storm.incubator.apache.org) is an Apache Software Foundation incubator project for processing unbounded data streams in real time. (The term “incubator” means that this is a new Apache project that is not yet mature; it must follow the Apache process before it can “graduate,” which might mean that its release process or documentation is not yet complete.) The best way to understand the significance of Storm is with a comparison: on Hadoop, a MapReduce job will start, process its data set, and exit; a Storm topology (the architecture of a Storm job), however, will run forever because its data feed is unlimited.
Consider a feed of events from the website Twitter; they just keep coming. When Storm processes a feed from
such a source, it processes the data it receives in real time. So, at any point, what Storm presents is a window on a
stream of data at the current time. Because of this, it also presents current trends in the data. In terms of Twitter, that
might indicate what many people are talking about right now. And because the data set is a stream that never ends, a Storm topology must be stopped manually.
A topology is a Storm job architecture; it is described in terms of spouts, streams, and bolts. A stream is a flow of data made up of a sequence of data records called tuples.
Figure 6-4 shows a simple Storm data record, or tuple; a sequence or pipe of these data records forms a stream,
which is shown in Figure 6-5 .
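The same ideas map directly onto Storm's Java API. The following is a minimal, hypothetical topology rather than one of this chapter's examples: it wires Storm's built-in TestWordSpout (a demonstration spout that emits random words as single-field tuples) to a trivial bolt that prints each tuple it receives. It assumes a Storm 0.9.x dependency (the backtype.storm packages of that era; later releases renamed them to org.apache.storm), and the class names MinimalTopology and PrintBolt are illustrative.

import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.testing.TestWordSpout;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Tuple;

public class MinimalTopology {

  // A bolt that consumes the spout's stream and prints each tuple it receives.
  public static class PrintBolt extends BaseBasicBolt {
    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
      System.out.println("tuple: " + tuple.getString(0));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
      // This bolt emits no further streams, so it declares no output fields.
    }
  }

  public static void main(String[] args) throws Exception {
    TopologyBuilder builder = new TopologyBuilder();

    // Spout: an unbounded source of tuples (here, random words from the test spout).
    builder.setSpout("words", new TestWordSpout(), 1);

    // Bolt: subscribes to the spout's stream; shuffleGrouping spreads tuples randomly.
    builder.setBolt("printer", new PrintBolt(), 1).shuffleGrouping("words");

    // Run in-process; like any topology, it keeps running until it is killed.
    LocalCluster cluster = new LocalCluster();
    cluster.submitTopology("minimal-topology", new Config(), builder.createTopology());

    Thread.sleep(10000);                       // let it run briefly
    cluster.killTopology("minimal-topology");  // a topology never finishes on its own
    cluster.shutdown();
  }
}

On a real cluster you would submit with StormSubmitter rather than LocalCluster, and the topology would then run until it is explicitly killed, which is exactly the never-ending behavior described above.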
 