Databases Reference
In-Depth Information
Predictability at Scale
The code in “Example 1: Simplest Possible App in Cascading” showed how to move data
from point A to point B. That was simply a distributed file copy—loading data via
distributed tasks, or the “L” in ETL.
A copy example may seem trivial, and it may seem like Cascading is overkill for that.
However, moving important data from point A to point B reliably can be a crucial job
to perform. This helps illustrate one of the key reasons to use Cascading.
Consider an analogy of building a small Ferris wheel. With a little bit of imagination
and some background in welding, a person could cobble one together using old bicycle
parts. In fact, those DIY Ferris wheels show up at events such as Maker Faire . Starting
out, a person might construct a little Ferris wheel, just for demo. It might not hold
anything larger than hamsters, but it's not a hard problem. With a bit more skill, a person
could probably build a somewhat larger instance, one that's big enough for small chil‐
dren to ride.
Ask yourself this: how robust would a DIY Ferris wheel need to be before you let your
kids ride on it? That's precisely part of the challenge at an event like Maker Faire. Makers
must be able to build a device such as a Ferris wheel out of spare bicycle parts that is
robust enough that strangers will let their kids ride. Let's hope those welds were made
using best practices and good materials, to avoid catastrophes.
That's a key reason why Cascading was created. When you need to move a few gigabytes
from point A to point B, it's probably simple enough to write a Bash script, or just use
a single command-line copy. When your work requires some reshaping of the data, then
a few lines of Python will probably work fine. Run that Python code from your Bash
script and you're done.
That's a great approach, when it fits the use case requirements. However, suppose you're
not moving just gigabytes. Suppose you're moving terabytes, or petabytes. Bash scripts
won't get you very far. Also think about this: suppose an app not only needs to move
data from point A to point B, but it must follow the required best practices of an En‐
terprise IT shop. Millions of dollars and potentially even some jobs ride on the fact that
the app performs correctly. Day in and day out. That's not unlike trusting a Ferris wheel
made by strangers; the users want to make sure it wasn't just built out of spare bicycle
parts by some amateur welder. Robustness is key.
Or, taking this analogy a few steps in another interesting direction, perhaps you're not
only moving data and reshaping it a little, but you're applying some interesting machine
learning algorithms, some natural language processing, gene sequencing…who knows?
Those imply lots of resource use, lots of potential expense in case of failures. Or lots of
customer exposure. You'll want to use an application framework that is significantly
more robust than a bunch of scripts cobbled together.
Search WWH ::




Custom Search