Figure 24-10. The ShareThis log processing flow
Amazon SQS could be integrated using Cascading's event listeners. When a Flow finishes, a message is sent to notify other systems that there is data ready to be picked up from Amazon S3; on failure, a different message is sent to alert other processes.
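The notification pattern can be sketched as follows. This is a self-contained illustration, not ShareThis's actual code: the `Notifier` interface and queue names here stand in for an Amazon SQS client (real code would call the AWS SDK's `sendMessage`), and `FlowListener` mirrors the callback shape of Cascading's `cascading.flow.FlowListener`, which is registered on a flow with `flow.addListener(...)`.

```java
import java.util.ArrayList;
import java.util.List;

public class SqsNotificationSketch {

    // Stand-in for an SQS client; real code would use AmazonSQS.sendMessage.
    interface Notifier {
        void send(String queueUrl, String message);
    }

    // Mirrors the relevant callbacks of cascading.flow.FlowListener.
    interface FlowListener {
        void onCompleted(String flowName);
        boolean onThrowable(String flowName, Throwable t);
    }

    // Publishes success/failure messages so downstream processes know
    // when data is ready on S3, or that a run failed.
    static class SqsFlowListener implements FlowListener {
        private final Notifier notifier;
        private final String successQueue;
        private final String failureQueue;

        SqsFlowListener(Notifier notifier, String successQueue, String failureQueue) {
            this.notifier = notifier;
            this.successQueue = successQueue;
            this.failureQueue = failureQueue;
        }

        @Override
        public void onCompleted(String flowName) {
            notifier.send(successQueue, "READY " + flowName);
        }

        @Override
        public boolean onThrowable(String flowName, Throwable t) {
            notifier.send(failureQueue, "FAILED " + flowName + ": " + t.getMessage());
            return true; // exception handled; do not rethrow
        }
    }

    public static void main(String[] args) {
        List<String> sent = new ArrayList<>();
        Notifier fake = (queue, msg) -> sent.add(queue + " <- " + msg);

        FlowListener listener = new SqsFlowListener(fake, "log-ready", "log-failed");
        listener.onCompleted("daily-log-flow");
        listener.onThrowable("daily-log-flow", new RuntimeException("S3 timeout"));

        sent.forEach(System.out::println);
    }
}
```

Because the callbacks fire from the cluster that ran the Flow, downstream consumers on entirely separate clusters only need to poll the queue; they never have to know when or where the pipeline ran.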
The remaining downstream processes pick up where the log processing pipeline leaves off, on separate, independent clusters. Today the log processing pipeline runs once a day; there is no need to keep a 100-node cluster sitting idle for the 23 hours it has nothing to do, so it is decommissioned and recommissioned 24 hours later.
In the future, it would be trivial to shorten this interval to every 6 hours, or even 1 hour, on smaller clusters as the business demands. Independently, other clusters boot and shut down at different intervals based on the needs of the business units responsible for those components. For example, the web crawler component (using Bixo, a Cascading-based web-crawler toolkit developed by EMI and ShareThis) may run continuously on a small cluster alongside a companion Hypertable cluster. This on-demand model works very well with Hadoop, where each cluster can be tuned for the kind of workload it is expected to handle.