• Ensure that necessary fields are available to operations that require them—based
on tuple scheme.
• Apply transformations to help optimize the app—e.g., moving code from reduce
into map.
• Track data provenance across different sources and sinks, to understand the
producer/consumer relationship of data products.
• Annotate the DAG with metrics from each step, across the history of an app's
instances, for capacity planning, notifications for data drops, etc.
• Identify or predict bottlenecks, e.g., key/value skew as the shape of the input data
changes—troubleshoot apps.
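The first capability in the list, checking that each operation has the fields it needs, can be sketched in plain Java. This is a minimal illustration and not Cascading's planner: the `Step` type, the `validate` method, and the field names are hypothetical and exist only to show the idea of walking a pipeline while tracking which fields are available.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: not the Cascading planner, only the idea behind the check.
class FieldCheckSketch {

    // One step in a pipeline: the fields it reads and the fields it emits.
    static final class Step {
        final String name;
        final Set<String> requires;
        final Set<String> adds;

        Step(String name, Set<String> requires, Set<String> adds) {
            this.name = name;
            this.requires = requires;
            this.adds = adds;
        }
    }

    // Walk the steps in order, tracking which fields exist at each point.
    // Returns one message per missing field; an empty list means every
    // step finds the fields it requires.
    static List<String> validate(Set<String> sourceFields, List<Step> steps) {
        Set<String> available = new LinkedHashSet<>(sourceFields);
        List<String> errors = new ArrayList<>();
        for (Step step : steps) {
            for (String field : step.requires) {
                if (!available.contains(field)) {
                    errors.add(step.name + " requires missing field '" + field + "'");
                }
            }
            available.addAll(step.adds);
        }
        return errors;
    }

    public static void main(String[] args) {
        // Assumed word-count-style pipeline: tokenize the text, then count tokens.
        List<Step> steps = List.of(
            new Step("tokenize", Set.of("text"), Set.of("token")),
            new Step("count", Set.of("token"), Set.of("count")));
        System.out.println(validate(Set.of("doc_id", "text"), steps));
    }
}
```

Because the check only needs the declared fields of each step, it can run before any data is read, which is what makes it useful as a planning-time safeguard.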
Those capabilities address important concerns in Enterprise IT and stand as key points
by which Cascading differentiates itself from other Hadoop abstraction layers.
Another subtle point concerns the use of taps. On one hand, data taps are available for
integrating Cascading with several other popular data frameworks, including
Memcached, HBase, Cassandra, etc. Several popular data serialization systems are
supported, such as Apache Thrift, Avro, Kryo, etc. Looking at the conceptual flow diagram,
our workflow could be using any of a variety of different data frameworks and
serialization systems. That could apply equally well to SQL query result sets via JDBC or to
data coming from Cassandra via Thrift. It wouldn't be difficult to modify the code in
“Example 2: The Ubiquitous Word Count” to set those details based on configuration
parameters. In other words, the taps generalize many physical aspects of the data so that
we can leverage patterns.
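To make the idea of setting those details from configuration concrete, here is a minimal sketch in plain Java. It does not use the Cascading API; the `Backend` enum, the method names, the configuration key `"source"`, and the host names are all hypothetical. It only shows how one formal parameter can be bound to different identifiers at launch, with the URI scheme deciding which kind of backend is addressed.

```java
import java.net.URI;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: models the choice of backend, not actual Cascading taps.
class TapIdentifierSketch {

    enum Backend { LOCAL_FILE, HDFS, JDBC }

    // Look up the identifier bound to a formal parameter name such as "source".
    static String actualParameter(Map<String, String> config, String formalName) {
        String id = config.get(formalName);
        if (id == null) {
            throw new IllegalArgumentException("no identifier configured for " + formalName);
        }
        return id;
    }

    // Map an identifier (a URI string) to the kind of backend it addresses.
    // A missing scheme is treated as a local file path (an assumption of this sketch).
    static Backend backendFor(String identifier) {
        String scheme = URI.create(identifier).getScheme();
        if (scheme == null || scheme.equals("file")) {
            return Backend.LOCAL_FILE;
        }
        switch (scheme) {
            case "hdfs": return Backend.HDFS;
            case "jdbc": return Backend.JDBC;
            default: throw new IllegalArgumentException("unsupported scheme: " + scheme);
        }
    }

    public static void main(String[] args) {
        // Same app, two assumed configurations: a laptop run and a cluster run.
        Map<String, String> laptop = Map.of("source", "data/rain.txt");
        Map<String, String> cluster = Map.of("source", "hdfs://namenode:8020/archive/wayback");
        for (Map<String, String> config : List.of(laptop, cluster)) {
            String id = actualParameter(config, "source");
            System.out.println(id + " -> " + backendFor(id));
        }
    }
}
```

In a real app the result of this lookup would select and construct the appropriate tap; the workflow logic itself would not change.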
On the other hand, taps also help manage complexity at scale. Our code in “Example 2:
The Ubiquitous Word Count” could be run on a laptop in Hadoop's “standalone” mode
to process a small file such as rain.txt, which is a mere 510 bytes. The same code could
be run on a 1,000-node Hadoop cluster to process several petabytes of the Internet
Archive's Wayback Machine.
Taps are agnostic about scale, because the underlying topology (Hadoop) uses
parallelism to handle very large data. Generally speaking, Cascading apps handle scale-out
into larger and larger data sets by changing the parameters used to define taps. Taps
themselves are formal parameters that specify placeholders for input and output data.
When a Cascading app runs, its actual parameters specify the actual data to be used,
whether those are HDFS partition files, HBase data objects, Memcached key/values, etc.
We call these tap identifiers. They are effectively uniform resource identifiers (URIs) for
connecting through protocols such as HDFS, JDBC, etc. A dependency graph of tap
identifiers and the history of app instances that produced or consumed them is
analogous to a catalog in relational databases.
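The catalog analogy can be illustrated with a small in-memory sketch in plain Java. It is not part of Cascading: the `TapCatalogSketch` class, its methods, the app instance names, and the identifiers are hypothetical. Each recorded run lists the tap identifiers an app instance consumed and produced, and the dependency graph is derived from those records.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of a catalog built from the history of app instances.
class TapCatalogSketch {

    // One app instance with the tap identifiers it read and wrote.
    static final class Run {
        final String appInstance;
        final Set<String> sources;
        final Set<String> sinks;

        Run(String appInstance, Set<String> sources, Set<String> sinks) {
            this.appInstance = appInstance;
            this.sources = sources;
            this.sinks = sinks;
        }
    }

    private final List<Run> runs = new ArrayList<>();

    void record(String appInstance, Set<String> sources, Set<String> sinks) {
        runs.add(new Run(appInstance, sources, sinks));
    }

    // App instances that wrote the given tap identifier.
    Set<String> producersOf(String tapId) {
        Set<String> result = new LinkedHashSet<>();
        for (Run run : runs) {
            if (run.sinks.contains(tapId)) result.add(run.appInstance);
        }
        return result;
    }

    // App instances that read the given tap identifier.
    Set<String> consumersOf(String tapId) {
        Set<String> result = new LinkedHashSet<>();
        for (Run run : runs) {
            if (run.sources.contains(tapId)) result.add(run.appInstance);
        }
        return result;
    }

    // All tap identifiers the given one depends on, directly or transitively.
    // The seen set also guards against cycles in the recorded history.
    Set<String> upstreamOf(String tapId) {
        Set<String> seen = new LinkedHashSet<>();
        Deque<String> pending = new ArrayDeque<>();
        pending.push(tapId);
        while (!pending.isEmpty()) {
            String current = pending.pop();
            for (Run run : runs) {
                if (!run.sinks.contains(current)) continue;
                for (String source : run.sources) {
                    if (seen.add(source)) pending.push(source);
                }
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        TapCatalogSketch catalog = new TapCatalogSketch();
        // Assumed history: a word count run feeds a reporting run.
        catalog.record("wc-001", Set.of("hdfs://nn/raw/rain.txt"), Set.of("hdfs://nn/out/wc"));
        catalog.record("report-001", Set.of("hdfs://nn/out/wc"), Set.of("jdbc:mysql://db/reports"));
        System.out.println(catalog.upstreamOf("jdbc:mysql://db/reports"));
    }
}
```

Queries such as `producersOf` and `upstreamOf` answer the provenance questions a relational catalog would answer for tables and views, only here the nodes are tap identifiers.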