doc_id text
doc01 A rain shadow is a dry area on the lee back side of a mountainous area.
Input tuples get copied, TSV row by TSV row, to the sink tap. The second argument
specifies that the sink tap be written to output/rain, which is organized as
a partition file. You can verify that those lines got copied by viewing the text output, for
example:
$ head -2 output/rain/part-00000
doc_id text
doc01 A rain shadow is a dry area on the lee back side of a mountainous area.
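To make the copy step concrete, here is a minimal sketch in plain Java of what the flow does logically: read the source one TSV row at a time and write each row, header included, to a partition file under the output directory. It does not use the Cascading API. The class name CopySketch and the method copyTsv are hypothetical, and writing a single part-00000 file is a simplifying assumption that mirrors the output shown above.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class CopySketch {
    // Hypothetical helper (not part of Cascading): copies every line of a
    // TSV file, header included, into <outDir>/part-00000 and returns the
    // number of lines written.
    static int copyTsv(Path source, Path outDir) throws IOException {
        Files.createDirectories(outDir);
        Path part = outDir.resolve("part-00000");
        int lines = 0;
        try (BufferedReader in = Files.newBufferedReader(source, StandardCharsets.UTF_8);
             BufferedWriter out = Files.newBufferedWriter(part, StandardCharsets.UTF_8)) {
            String row;
            // Copy row by row, the way tuples move from source to sink.
            while ((row = in.readLine()) != null) {
                out.write(row);
                out.newLine();
                lines++;
            }
        }
        return lines;
    }

    // First argument: input file. Second argument: output directory.
    public static void main(String[] args) throws IOException {
        int n = copyTsv(Paths.get(args[0]), Paths.get(args[1]));
        System.out.println("copied " + n + " lines");
    }
}
```

Run with an input file and output/rain as the two arguments, this sketch leaves a part-00000 file that can be inspected with head, as in the example above. What it leaves out is exactly what Cascading adds: a planner, taps that abstract the storage, and execution on a cluster.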
For quick reference, the source code, input data, and a log for this example are listed in
a GitHub gist. If the log of your run looks terribly different, something is probably not
set up correctly. There are multiple ways to interact with the Cascading developer
community. You can post a note on the cascading-user email forum. Plenty of experienced
Cascading users are discussing taps and pipes and flows there, and they are eager to
help. Or you can visit one of the Cascading user group meetings.
Cascading Taxonomy
Conceptually, a “flow diagram” for this first example is shown in Figure 1-1. Our simplest
possible app copies lines of text from file “A” to file “B.” The “M” and “R” labels represent
the map and reduce phases, respectively. As the flow diagram shows, the app uses one job step
in Apache Hadoop: only one map is needed, and no reduce. The implementation is a brief
Java program, 10 lines long.
Wait—10 lines of code to copy a file? That seems excessive; certainly this same work
could be performed in much quicker ways, such as using the cp command on Linux.
However, keep in mind that Cascading is about the “plumbing” required to make
Enterprise apps robust. There is some overhead in the setup, but those lines of code won't
change much as an app's complexity grows. That overhead helps provide for the
principle of “Same JAR, any scale.”
Let's take a look at the components of a Cascading app. Figure 1-2 shows a taxonomy
that starts with apps at the top level. An app has a unique signature and is versioned,
and it includes one or more flows. Optionally, those flows may be organized into
cascades, which are collections of flows without dependencies on one another, so that
they may be run in parallel.
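The top of this taxonomy can be sketched as a small data model. The following is only an illustration in plain Java, not the Cascading API: the names TaxonomySketch, App, Flow, and Cascade are hypothetical stand-ins, and a flow is reduced to a named Runnable. It shows an app carrying a signature and a version, and a cascade running its independent flows in parallel on a thread pool.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class TaxonomySketch {
    /** Hypothetical stand-in for a flow: a named unit of work. */
    static final class Flow {
        final String name;
        final Runnable work;

        Flow(String name, Runnable work) {
            this.name = name;
            this.work = work;
        }
    }

    /** Hypothetical stand-in for a cascade: flows with no dependencies on one another. */
    static final class Cascade {
        final List<Flow> flows;

        Cascade(List<Flow> flows) {
            this.flows = flows;
        }

        /** Runs every flow in parallel; returns the flow names once all have finished. */
        List<String> runAll() throws Exception {
            ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, flows.size()));
            try {
                List<Future<String>> pending = new ArrayList<>();
                for (Flow f : flows) {
                    pending.add(pool.submit(() -> {
                        f.work.run();
                        return f.name;
                    }));
                }
                List<String> done = new ArrayList<>();
                for (Future<String> p : pending) {
                    done.add(p.get()); // blocks until that flow completes
                }
                return done;
            } finally {
                pool.shutdown();
            }
        }
    }

    /** Hypothetical stand-in for an app: a signature, a version, and its flows. */
    static final class App {
        final String signature;
        final String version;
        final Cascade cascade;

        App(String signature, String version, Cascade cascade) {
            this.signature = signature;
            this.version = version;
            this.cascade = cascade;
        }
    }
}
```

Because the flows in a cascade share no dependencies here, the order in which they finish does not matter; the sketch simply waits for all of them before returning.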
Each flow represents a physical plan, based on the planner for a specific topology such
as Apache Hadoop. The physical plan provides a deterministic strategy for a query.
Developers talk about a principle of “Fail the same way twice.” In other words, when we
need to debug an issue, it's quite important that Cascading flows behave deterministically.
Otherwise, the process of troubleshooting edge cases on a large cluster and with