Instead of “Source” or “File A,” now we've defined the source tap as a collection of
text documents. Instead of “Sink” or “File B,” now we've defined the sink tap to
produce word count tuples, the desired end result. Those changes to the taps begin
to reference fields in the tuple stream. The source tap in both examples
was based on TextDelimited with parameters so that it reads a TSV file and uses the
header line to assign field names. “Example 1: Simplest Possible App in Cascading”
ignored the fields, simply copying data tuple by tuple. “Example 2: The Ubiquitous
Word Count” begins to reference fields by name, which introduces the notion of a
scheme: imposing some expectation of structure on otherwise unstructured data.
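As a rough sketch of such a source tap in the Cascading 2.x Java API (the file path here is a made-up placeholder, not the book's exact listing):

    import cascading.scheme.hadoop.TextDelimited;
    import cascading.tap.Tap;
    import cascading.tap.hadoop.Hfs;

    // Source tap over a TSV file; the "true" flag tells TextDelimited to
    // read the header line and use it to assign field names.
    String docPath = "data/docs.tsv";  // hypothetical input path
    Tap docTap = new Hfs(new TextDelimited(true, "\t"), docPath);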
The change in taps also added semantics to the workflow, specifying requirements for
added operations needed to reach the desired results. Let's consider the new Cascading
operations that were added to the pipe assembly in “Example 2: The Ubiquitous Word
Count”: Tokenize, GroupBy, and Count. The first one, Tokenize, transforms the input
data tuples, splitting lines of text into a stream of tokens. That transform represents
the “T” in ETL. The second operation, GroupBy, performs an aggregation. In terms of
Hadoop, this causes a reduce with token as a key. The third operation, Count, gets
applied to each aggregation, counting the values for each token key, i.e., the number
of instances of each token in the stream.
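A sketch of that pipe assembly in the Cascading 2.x Java API; the pipe names, field names, and the tokenizing regex are illustrative assumptions (the book abstracts the tokenizer simply as Tokenize):

    import cascading.operation.aggregator.Count;
    import cascading.operation.regex.RegexSplitGenerator;
    import cascading.pipe.Each;
    import cascading.pipe.Every;
    import cascading.pipe.GroupBy;
    import cascading.pipe.Pipe;
    import cascading.tuple.Fields;

    // Tokenize: split each line of text into a stream of "token" tuples;
    // this transform is the "T" in ETL.
    Pipe docPipe = new Each("token", new Fields("text"),
        new RegexSplitGenerator(new Fields("token"), "[ \\[\\]\\(\\),.]"));

    // GroupBy: group the stream on token; on Hadoop this causes a reduce
    // with token as the key.
    Pipe wcPipe = new GroupBy(docPipe, new Fields("token"));

    // Count: applied to each aggregation, counting the number of
    // instances of each token in the stream.
    wcPipe = new Every(wcPipe, Fields.ALL, new Count(), Fields.ALL);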
The deltas between “Example 1: Simplest Possible App in Cascading” and “Example 2:
The Ubiquitous Word Count” illustrate important aspects of Cascading. Consider how
data tuples flow through a pipe assembly, getting routed through familiar data operators
such as GroupBy, Count, etc. Each flow must be connected to a source of data as its
input and a sink as its output. The sink tap for one flow may in turn become a source
tap for another flow. Each flow defines a DAG of the operations applied to the tuple
streams between its sources and sinks.
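A minimal sketch of how a flow is connected to its taps, assuming the Cascading 2.x FlowDef API and the hypothetical docTap, docPipe, and wcPipe from the sketches above:

    import cascading.flow.Flow;
    import cascading.flow.FlowDef;
    import cascading.flow.hadoop.HadoopFlowConnector;
    import java.util.Properties;

    // Sink tap for the word count tuples; its output could in turn be
    // read as the source tap of a subsequent flow.
    Tap wcTap = new Hfs(new TextDelimited(true, "\t"), "output/wc");

    // Bind the pipe assembly to a source and a sink, then let the flow
    // planner translate the resulting DAG into Hadoop job steps.
    FlowDef flowDef = FlowDef.flowDef()
        .setName("wc")
        .addSource(docPipe, docTap)
        .addTailSink(wcPipe, wcTap);

    Flow wcFlow = new HadoopFlowConnector(new Properties()).connect(flowDef);
    wcFlow.complete();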
Enterprise data workflows are complex applications, and managing that complexity is
the purpose of Cascading. Enterprise apps based on Apache Hadoop typically involve
more than just one Hadoop job step. Some apps are known to include hundreds of job
steps, with complex dependencies between them. Cascading leverages this concept of
a DAG to represent the business process of an app. The DAG, in turn, declares the
requirements for the job steps that are needed to complete the app's data flow.
Consequently, a flow planner has sufficient information about the workflow to
leverage the DAG in several ways:
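For instance, the planned DAG can be dumped for inspection; a minimal sketch, assuming the wcFlow object from the sketch above:

    // Write the flow planner's DAG in Graphviz DOT format so the job
    // steps and their dependencies can be visualized.
    wcFlow.writeDOT("dot/wc.dot");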