Instead of “Source” or “File A,” now we've defined the source tap as a collection of
text documents. Instead of “Sink” or “File B,” now we've defined the sink tap to
produce word count tuples, the desired end result. Those changes to the taps begin
to reference fields in the tuple stream. The source tap in both examples
was based on TextDelimited with parameters so that it reads a TSV file and uses the
header line to assign field names. “Example 1: Simplest Possible App in Cascading”
ignored the fields, simply copying data tuple by tuple. “Example 2: The Ubiquitous
Word Count” begins to reference fields by name, which introduces the notion of a
scheme: imposing some expectation of structure on otherwise unstructured data.
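As a rough sketch of such a source tap in the Cascading 2.x Java API (the file path here is a made-up placeholder, not the book's exact listing):

    import cascading.scheme.hadoop.TextDelimited;
    import cascading.tap.Tap;
    import cascading.tap.hadoop.Hfs;

    // Source tap over a TSV file; the "true" flag tells TextDelimited to
    // read the header line and use it to assign field names.
    String docPath = "data/docs.tsv";  // hypothetical input path
    Tap docTap = new Hfs(new TextDelimited(true, "\t"), docPath);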
The change in taps also added semantics to the workflow, specifying requirements for
added operations needed to reach the desired results. Let's consider the new Cascading
operations that were added to the pipe assembly in “Example 2: The Ubiquitous Word
Count”: Tokenize, GroupBy, and Count. The first one, Tokenize, transforms the input
data tuples, splitting lines of text into a stream of tokens. That transform represents
the “T” in ETL. The second operation, GroupBy, performs an aggregation. In terms of
Hadoop, this causes a reduce with token as a key. The third operation, Count, gets
applied to each aggregation, counting the values for each token key, i.e., the number
of instances of each token in the stream.
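A sketch of that pipe assembly in the Cascading 2.x Java API; the pipe names, field names, and the tokenizing regex are illustrative assumptions (the book abstracts the tokenizer simply as Tokenize):

    import cascading.operation.aggregator.Count;
    import cascading.operation.regex.RegexSplitGenerator;
    import cascading.pipe.Each;
    import cascading.pipe.Every;
    import cascading.pipe.GroupBy;
    import cascading.pipe.Pipe;
    import cascading.tuple.Fields;

    // Tokenize: split each line of text into a stream of "token" tuples;
    // this transform is the "T" in ETL.
    Pipe docPipe = new Each("token", new Fields("text"),
        new RegexSplitGenerator(new Fields("token"), "[ \\[\\]\\(\\),.]"));

    // GroupBy: group the stream on token; on Hadoop this causes a reduce
    // with token as the key.
    Pipe wcPipe = new GroupBy(docPipe, new Fields("token"));

    // Count: applied to each aggregation, counting the number of
    // instances of each token in the stream.
    wcPipe = new Every(wcPipe, Fields.ALL, new Count(), Fields.ALL);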
The deltas between “Example 1: Simplest Possible App in Cascading” and “Example 2:
The Ubiquitous Word Count” illustrate important aspects of Cascading. Consider how
data tuples flow through a pipe assembly, getting routed through familiar data operators
such as GroupBy, Count, etc. Each flow must be connected to a source of data as its
input and a sink as its output. The sink tap for one flow may in turn become a source
tap for another flow. Each flow defines a DAG of the operations applied to the tuple
streams between its sources and sinks.
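A minimal sketch of how a flow is connected to its taps, assuming the Cascading 2.x FlowDef API and the hypothetical docTap, docPipe, and wcPipe from the sketches above:

    import cascading.flow.Flow;
    import cascading.flow.FlowDef;
    import cascading.flow.hadoop.HadoopFlowConnector;
    import java.util.Properties;

    // Sink tap for the word count tuples; its output could in turn be
    // read as the source tap of a subsequent flow.
    Tap wcTap = new Hfs(new TextDelimited(true, "\t"), "output/wc");

    // Bind the pipe assembly to a source and a sink, then let the flow
    // planner translate the resulting DAG into Hadoop job steps.
    FlowDef flowDef = FlowDef.flowDef()
        .setName("wc")
        .addSource(docPipe, docTap)
        .addTailSink(wcPipe, wcTap);

    Flow wcFlow = new HadoopFlowConnector(new Properties()).connect(flowDef);
    wcFlow.complete();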
Enterprise data workflows are complex applications, and managing that complexity is
the purpose of Cascading. Enterprise apps based on Apache Hadoop typically involve
more than just one Hadoop job step. Some apps are known to include hundreds of job
steps, with complex dependencies between them. Cascading leverages this concept of
a DAG to represent the business process of an app. The DAG, in turn, declares the
requirements for the job steps that are needed to complete the app's data flow.
Consequently, a flow planner has sufficient information about the workflow to
leverage the DAG in several ways:
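For instance, the planned DAG can be dumped for inspection; a minimal sketch, assuming the wcFlow object from the sketch above:

    // Write the flow planner's DAG in Graphviz DOT format so the job
    // steps and their dependencies can be visualized.
    wcFlow.writeDOT("dot/wc.dot");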