Getting Started - Enterprise Data Workflows with Cascading


doc_id text
doc01 A rain shadow is a dry area on the lee back side of a mountainous area.

Input tuples get copied, TSV row by TSV row, to the sink tap. The second argument specifies that the sink tap writes to output/rain, which is organized as a partition file. You can verify that those lines got copied by viewing the text output, for example:

$ head -2 output/rain/part-00000
doc_id text
doc01 A rain shadow is a dry area on the lee back side of a mountainous area.
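To make the row-by-row copy concrete, here is a minimal plain-Java sketch of the same effect. It is not Cascading code and involves no taps, pipes, or Hadoop; the class name CopySketch and the method copyRows are hypothetical, chosen only for illustration. It assumes the input is a UTF-8 text file and that the output directory should hold a single file named part-00000, as in the listing above.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical illustration only; this is NOT the Cascading API.
final class CopySketch {

    // Copies every line (header row included) from the source file into
    // <sinkDir>/part-00000 and returns the number of lines copied.
    static long copyRows(Path source, Path sinkDir) throws IOException {
        Files.createDirectories(sinkDir);
        Path part = sinkDir.resolve("part-00000");
        long rows = 0;
        try (BufferedReader in = Files.newBufferedReader(source, StandardCharsets.UTF_8);
             BufferedWriter out = Files.newBufferedWriter(part, StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                out.write(line);   // each row passes through unchanged
                out.newLine();
                rows++;
            }
        }
        return rows;
    }

    public static void main(String[] args) throws IOException {
        // args[0] = input TSV file, args[1] = output directory (for example output/rain)
        long rows = copyRows(Paths.get(args[0]), Paths.get(args[1]));
        System.out.println("copied " + rows + " lines");
    }
}
```

The sketch handles one file on one machine. The Cascading version expresses the same copy as a flow so that the identical JAR can run on a cluster.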

For quick reference, the source code, input data, and a log for this example are listed in a GitHub gist. If the log of your run looks terribly different, something is probably not set up correctly. There are multiple ways to interact with the Cascading developer community. You can post a note on the cascading-user email forum. Plenty of experienced Cascading users are discussing taps and pipes and flows there, and they are eager to help. Or you can visit one of the Cascading user group meetings.

Cascading Taxonomy

Conceptually, a “flow diagram” for this first example is shown in Figure 1-1. Our simplest possible app copies lines of text from file “A” to file “B.” The “M” and “R” labels represent the map and reduce phases, respectively. As the flow diagram shows, it uses one job step in Apache Hadoop: only one map, and no reduce is needed. The implementation is a brief Java program, 10 lines long.

Wait, 10 lines of code to copy a file? That seems excessive; certainly this same work could be performed in much quicker ways, such as using the cp command on Linux. However, keep in mind that Cascading is about the “plumbing” required to make Enterprise apps robust. There is some overhead in the setup, but those lines of code won't change much as an app's complexity grows. That overhead helps provide for the principle of “Same JAR, any scale.”

Let's take a look at the components of a Cascading app. Figure 1-2 shows a taxonomy that starts with apps at the top level. An app has a unique signature and is versioned, and it includes one or more flows. Optionally, those flows may be organized into cascades, which are collections of flows without dependencies on one another, so that they may be run in parallel.
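The top of this taxonomy can be sketched as a small plain-Java model. This is only an illustration of the relationships described above, under stated assumptions: AppSketch, CascadeSketch, and FlowSketch are hypothetical names, not Cascading classes, and each "flow" here just returns a status string instead of running a job. The cascade follows the description in the text, treating its flows as independent and running them concurrently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical model types for illustration; these are NOT the Cascading classes.
final class FlowSketch implements Callable<String> {
    private final String name;

    FlowSketch(String name) { this.name = name; }

    // Stand-in for running a flow: just reports completion.
    @Override
    public String call() { return name + ": complete"; }
}

final class CascadeSketch {
    private final List<FlowSketch> flows;

    CascadeSketch(List<FlowSketch> flows) { this.flows = flows; }

    // Runs all flows concurrently, since they are assumed to have no
    // dependencies on one another. Results are returned in the order
    // the flows were listed, not the order they finished.
    List<String> run() throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, flows.size()));
        try {
            List<String> results = new ArrayList<>();
            for (Future<String> future : pool.invokeAll(flows)) {
                results.add(future.get());
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}

final class AppSketch {
    final String signature;      // unique identifier for the app
    final String version;        // apps are versioned
    final CascadeSketch cascade; // optional grouping of the app's flows

    AppSketch(String signature, String version, CascadeSketch cascade) {
        this.signature = signature;
        this.version = version;
        this.cascade = cascade;
    }
}
```

For example, an app built as new AppSketch("rain-app", "1.0", new CascadeSketch(flows)) with two flows named "copy" and "wc" would return one status line per flow from cascade.run(). The names and version string here are made up for the example.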

Each flow represents a physical plan, based on the planner for a specific topology such as Apache Hadoop. The physical plan provides a deterministic strategy for a query. Developers talk about a principle of “Fail the same way twice.” In other words, when we need to debug an issue, it's quite important that Cascading flows behave deterministically. Otherwise, the process of troubleshooting edge cases on a large cluster and with
