Figure 24-10. The ShareThis log processing flow
Amazon SQS could be integrated using Cascading's event listeners. When a Flow finishes, a message is sent to notify other systems that there is data ready to be picked up from Amazon S3; on failure, a different message is sent to alert other processes.
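The notification pattern can be sketched as follows. This is a self-contained illustration, not ShareThis's actual code: the `Notifier` interface and queue names here stand in for an Amazon SQS client (real code would call the AWS SDK's `sendMessage`), and `FlowListener` mirrors the callback shape of Cascading's `cascading.flow.FlowListener`, which is registered on a flow with `flow.addListener(...)`.

```java
import java.util.ArrayList;
import java.util.List;

public class SqsNotificationSketch {

    // Stand-in for an SQS client; real code would use AmazonSQS.sendMessage.
    interface Notifier {
        void send(String queueUrl, String message);
    }

    // Mirrors the relevant callbacks of cascading.flow.FlowListener.
    interface FlowListener {
        void onCompleted(String flowName);
        boolean onThrowable(String flowName, Throwable t);
    }

    // Publishes success/failure messages so downstream processes know
    // when data is ready on S3, or that a run failed.
    static class SqsFlowListener implements FlowListener {
        private final Notifier notifier;
        private final String successQueue;
        private final String failureQueue;

        SqsFlowListener(Notifier notifier, String successQueue, String failureQueue) {
            this.notifier = notifier;
            this.successQueue = successQueue;
            this.failureQueue = failureQueue;
        }

        @Override
        public void onCompleted(String flowName) {
            notifier.send(successQueue, "READY " + flowName);
        }

        @Override
        public boolean onThrowable(String flowName, Throwable t) {
            notifier.send(failureQueue, "FAILED " + flowName + ": " + t.getMessage());
            return true; // exception handled; do not rethrow
        }
    }

    public static void main(String[] args) {
        List<String> sent = new ArrayList<>();
        Notifier fake = (queue, msg) -> sent.add(queue + " <- " + msg);

        FlowListener listener = new SqsFlowListener(fake, "log-ready", "log-failed");
        listener.onCompleted("daily-log-flow");
        listener.onThrowable("daily-log-flow", new RuntimeException("S3 timeout"));

        sent.forEach(System.out::println);
    }
}
```

Because the callbacks fire from the cluster that ran the Flow, downstream consumers on entirely separate clusters only need to poll the queue; they never have to know when or where the pipeline ran.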
The remaining downstream processes pick up where the log processing pipeline leaves off, on separate, independent clusters. Today the log processing pipeline runs once a day; there is no need to keep a 100-node cluster sitting idle for the 23 hours it has nothing to do, so it is decommissioned and recommissioned 24 hours later.
In the future, it would be trivial to shorten this interval to every 6 hours, or even 1 hour, on smaller clusters as the business demands. Independently, other clusters boot and shut down at different intervals based on the needs of the business units responsible for those components. For example, the web crawler component (using Bixo, a Cascading-based web-crawler toolkit developed by EMI and ShareThis) may run continuously on a small cluster alongside a companion Hypertable cluster. This on-demand model works very well with Hadoop, where each cluster can be tuned for the kind of workload it is expected to handle.