Getting Started - Enterprise Data Workflows with Cascading

Predictability at Scale

The code in “Example 1: Simplest Possible App in Cascading” showed how to move data from point A to point B. That was simply a distributed file copy: loading data via distributed tasks, or the “L” in ETL.

A copy example may seem trivial, and it may seem like Cascading is overkill for that. However, moving important data from point A to point B reliably can be a crucial job to perform. This helps illustrate one of the key reasons to use Cascading.

Consider an analogy of building a small Ferris wheel. With a little bit of imagination and some background in welding, a person could cobble one together using old bicycle parts. In fact, those DIY Ferris wheels show up at events such as Maker Faire. Starting out, a person might construct a little Ferris wheel, just for demo. It might not hold anything larger than hamsters, but it's not a hard problem. With a bit more skill, a person could probably build a somewhat larger instance, one that's big enough for small children to ride.

Ask yourself this: how robust would a DIY Ferris wheel need to be before you let your kids ride on it? That's precisely part of the challenge at an event like Maker Faire. Makers must be able to build a device such as a Ferris wheel out of spare bicycle parts that is robust enough that strangers will let their kids ride. Let's hope those welds were made using best practices and good materials, to avoid catastrophes.

That's a key reason why Cascading was created. When you need to move a few gigabytes from point A to point B, it's probably simple enough to write a Bash script, or just use a single command-line copy. When your work requires some reshaping of the data, then a few lines of Python will probably work fine. Run that Python code from your Bash script and you're done.

That's a great approach, when it fits the use case requirements. However, suppose you're not moving just gigabytes. Suppose you're moving terabytes, or petabytes. Bash scripts won't get you very far. Also think about this: suppose an app not only needs to move data from point A to point B, but it must follow the required best practices of an Enterprise IT shop. Millions of dollars and potentially even some jobs ride on the fact that the app performs correctly. Day in and day out. That's not unlike trusting a Ferris wheel made by strangers; the users want to make sure it wasn't just built out of spare bicycle parts by some amateur welder. Robustness is key.

Or, taking this analogy a few steps in another interesting direction, perhaps you're not only moving data and reshaping it a little, but you're applying some interesting machine learning algorithms, some natural language processing, gene sequencing…who knows? Those imply lots of resource use, lots of potential expense in case of failures. Or lots of customer exposure. You'll want to use an application framework that is significantly more robust than a bunch of scripts cobbled together.
