Getting Started - Enterprise Data Workflows with Cascading


• Ensure that necessary fields are available to operations that require them, based on tuple scheme.

• Apply transformations to help optimize the app, e.g., moving code from reduce into map.

• Track data provenance across different sources and sinks, to understand the producer/consumer relationship of data products.

• Annotate the DAG with metrics from each step, across the history of an app's instances, for capacity planning, notifications for data drops, etc.

• Identify or predict bottlenecks, e.g., key/value skew as the shape of the input data changes, to troubleshoot apps.
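
The first capability in the list above, checking that each operation's required fields are available, can be sketched in plain Java. This is only an illustration of the idea, not Cascading's planner: the `FieldCheck` class, the `Step` type, and the field names in the usage note are all hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical planner-style check; not part of the Cascading API. */
final class FieldCheck {

    /** One pipeline step: the fields it reads and the fields it adds to the tuple. */
    static final class Step {
        final String name;
        final Set<String> requires;
        final Set<String> adds;

        Step(String name, Set<String> requires, Set<String> adds) {
            this.name = name;
            this.requires = requires;
            this.adds = adds;
        }
    }

    /**
     * Walks the steps in order, tracking which fields are available so far,
     * and returns one message per missing field. An empty list means every
     * step can find the fields it requires.
     */
    static List<String> validate(Set<String> sourceFields, List<Step> steps) {
        Set<String> available = new LinkedHashSet<>(sourceFields);
        List<String> errors = new ArrayList<>();
        for (Step step : steps) {
            for (String field : step.requires) {
                if (!available.contains(field)) {
                    errors.add(step.name + " requires missing field '" + field + "'");
                }
            }
            // Fields produced by this step become visible to later steps.
            available.addAll(step.adds);
        }
        return errors;
    }
}
```

For instance, a source with assumed fields `doc_id` and `text`, followed by a step that reads `text` and adds `token`, and then a step that reads `token`, would validate cleanly. Dropping the first step would report the missing `token` field before any data is processed.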

Those capabilities address important concerns in Enterprise IT and stand as key points by which Cascading differentiates itself from other Hadoop abstraction layers.

Another subtle point concerns the use of taps. On one hand, data taps are available for integrating Cascading with several other popular data frameworks, including Memcached, HBase, and Cassandra. Several popular data serialization systems are supported, such as Apache Thrift, Avro, and Kryo. Looking at the conceptual flow diagram, our workflow could be using any of a variety of different data frameworks and serialization systems. That could apply equally well to SQL query result sets via JDBC or to data coming from Cassandra via Thrift. It wouldn't be difficult to modify the code in “Example 2: The Ubiquitous Word Count” to set those details based on configuration parameters. To wit, the taps generalize many physical aspects of the data so that we can leverage patterns.
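
As a rough plain-Java analogy for how a tap hides the physical source, the sketch below runs the same word count against an in-memory list or a local file. The `TapSketch` class, the `LineSource` interface, and the tokenizing rule are hypothetical stand-ins chosen for illustration; they are not Cascading classes, and the book's example may tokenize differently.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Plain-Java analogy for a source tap; these types are hypothetical. */
final class TapSketch {

    /** Hides where the lines physically come from. */
    interface LineSource {
        Stream<String> lines();
    }

    /** Source backed by an in-memory list, e.g. rows that were already fetched. */
    static LineSource fromMemory(List<String> rows) {
        return rows::stream;
    }

    /** Source backed by a local text file. */
    static LineSource fromFile(Path path) {
        return () -> {
            try {
                return Files.lines(path);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        };
    }

    /** Word count that depends only on the LineSource abstraction. */
    static Map<String, Long> wordCount(LineSource source) {
        // try-with-resources closes the underlying file handle, if there is one.
        try (Stream<String> lines = source.lines()) {
            return lines
                .flatMap(line -> Arrays.stream(line.toLowerCase().split("\\W+")))
                .filter(token -> !token.isEmpty())
                .collect(Collectors.groupingBy(t -> t, TreeMap::new, Collectors.counting()));
        }
    }
}
```

The counting logic never learns whether its input came from memory or from disk, so which source to build could be decided by a configuration parameter at startup.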

On the other hand, taps also help manage complexity at scale. Our code in “Example 2: The Ubiquitous Word Count” could be run on a laptop in Hadoop's “standalone” mode to process a small file such as rain.txt, which is a mere 510 bytes. The same code could be run on a 1,000-node Hadoop cluster to process several petabytes of the Internet Archive's Wayback Machine.

Taps are agnostic about scale, because the underlying topology (Hadoop) uses parallelism to handle very large data. Generally speaking, Cascading apps handle scale-out into larger and larger data sets by changing the parameters used to define taps. Taps themselves are formal parameters that specify placeholders for input and output data. When a Cascading app runs, its actual parameters specify the actual data to be used, whether those are HDFS partition files, HBase data objects, Memcached key/values, etc. We call these tap identifiers. They are effectively uniform resource identifiers (URIs) for connecting through protocols such as HDFS, JDBC, etc. A dependency graph of tap identifiers and the history of app instances that produced or consumed them is analogous to a catalog in relational databases.
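
The catalog analogy can be made concrete with a small sketch that treats tap identifiers as URIs and records which identifiers each app run consumed and produced. Everything here is an assumption for illustration: `TapCatalog` and its methods are hypothetical names, and Cascading does not expose this class.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical lineage catalog keyed by tap identifier; not a Cascading API. */
final class TapCatalog {

    // For each sink identifier, the source identifiers that some app run read to produce it.
    private final Map<URI, Set<URI>> inputsOf = new HashMap<>();

    /** Records one app run: every sink is treated as derived from every source of that run. */
    void recordRun(List<String> sources, List<String> sinks) {
        for (String sink : sinks) {
            Set<URI> inputs = inputsOf.computeIfAbsent(URI.create(sink), k -> new LinkedHashSet<>());
            for (String source : sources) {
                inputs.add(URI.create(source));
            }
        }
    }

    /** Returns every identifier the given data product depends on, directly or transitively. */
    Set<URI> upstreamOf(String identifier) {
        Set<URI> seen = new LinkedHashSet<>();
        Deque<URI> queue = new ArrayDeque<>();
        queue.add(URI.create(identifier));
        while (!queue.isEmpty()) {
            for (URI input : inputsOf.getOrDefault(queue.poll(), Set.of())) {
                // Set.add returns false for identifiers already visited, which also guards against cycles.
                if (seen.add(input)) {
                    queue.add(input);
                }
            }
        }
        return seen;
    }

    /** The protocol part of an identifier, e.g. "hdfs" or "jdbc". */
    static String protocolOf(String identifier) {
        return URI.create(identifier).getScheme();
    }
}
```

If one run reads a made-up identifier such as `hdfs://namenode/data/rain.txt` and writes `hdfs://namenode/out/wc`, and a second run reads that output and writes to `jdbc:mysql://dbhost/reports`, then asking for the upstream of the JDBC identifier returns both HDFS identifiers. That is the producer/consumer history a catalog would keep.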
