Chapter 24. Cascading
Chris K. Wensel
Cascading is an open source Java library and API that provides an abstraction layer for
MapReduce. It allows developers to build complex, mission-critical data processing
applications that run on Hadoop clusters.
The Cascading project began in the summer of 2007. Its first public release, version 0.1,
launched in January 2008. Version 1.0 was released in January 2009. Binaries, source code,
and add-on modules can be downloaded from the project website.
Map and reduce operations offer powerful primitives. However, they tend to be at the
wrong level of granularity for creating sophisticated, highly composable code that can be
shared among different developers. Moreover, many developers find it difficult to “think”
in terms of MapReduce when faced with real-world problems.
To address the first issue, Cascading substitutes the keys and values used in MapReduce
with simple field names and a data tuple model, where a tuple is simply a list of values. For
the second issue, Cascading departs from map and reduce operations directly by introducing
higher-level abstractions as alternatives: Functions, Filters, Aggregators, and Buffers.
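To make the tuple model concrete, the following is a minimal sketch in plain Java. It is not the Cascading API: the names TupleModelSketch, NamedTuple, each, filter, and countBy are hypothetical and exist only for illustration. It assumes a tuple is a list of values addressed by field name, and it mimics a Filter-like step, a Function-like step, and an Aggregator-like step over a small in-memory list. Buffers are left out for brevity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.function.Predicate;

// Illustrative only: plain Java, NOT the Cascading API. All names are hypothetical.
final class TupleModelSketch {

    /** A tuple is an ordered list of values; field names give each position a meaning. */
    static final class NamedTuple {
        final List<String> fields;
        final List<Object> values;

        NamedTuple(List<String> fields, List<Object> values) {
            if (fields.size() != values.size()) {
                throw new IllegalArgumentException("fields and values must have the same length");
            }
            this.fields = fields;
            this.values = values;
        }

        /** Look up a value by field name instead of by key/value position. */
        Object get(String field) {
            int i = fields.indexOf(field);
            if (i < 0) {
                throw new IllegalArgumentException("unknown field: " + field);
            }
            return values.get(i);
        }
    }

    /** Function-like step: in this sketch, emits exactly one output tuple per input tuple. */
    static List<NamedTuple> each(List<NamedTuple> in, Function<NamedTuple, NamedTuple> fn) {
        List<NamedTuple> out = new ArrayList<>();
        for (NamedTuple t : in) {
            out.add(fn.apply(t));
        }
        return out;
    }

    /** Filter-like step: in this sketch, keeps only the tuples that match the predicate. */
    static List<NamedTuple> filter(List<NamedTuple> in, Predicate<NamedTuple> keep) {
        List<NamedTuple> out = new ArrayList<>();
        for (NamedTuple t : in) {
            if (keep.test(t)) {
                out.add(t);
            }
        }
        return out;
    }

    /** Aggregator-like step: counts tuples per distinct value of the grouping field. */
    static Map<Object, Long> countBy(List<NamedTuple> in, String field) {
        Map<Object, Long> counts = new LinkedHashMap<>();
        for (NamedTuple t : in) {
            counts.merge(t.get(field), 1L, Long::sum);
        }
        return counts;
    }

    /** Chains the three steps over made-up sample data. */
    static Map<Object, Long> run() {
        List<String> fields = Arrays.asList("user", "url");
        List<NamedTuple> input = Arrays.asList(
            new NamedTuple(fields, Arrays.<Object>asList("Alice", "/home")),
            new NamedTuple(fields, Arrays.<Object>asList("bob", "")),
            new NamedTuple(fields, Arrays.<Object>asList("alice", "/about")),
            new NamedTuple(fields, Arrays.<Object>asList("Bob", "/home")));

        // Drop tuples whose "url" field is empty.
        List<NamedTuple> nonEmpty = filter(input, t -> !((String) t.get("url")).isEmpty());

        // Normalize the "user" field to lower case, leaving "url" unchanged.
        List<NamedTuple> normalized = each(nonEmpty, t -> new NamedTuple(
            fields, Arrays.<Object>asList(((String) t.get("user")).toLowerCase(), t.get("url"))));

        // Count the remaining tuples per user.
        return countBy(normalized, "user");
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

The point of the sketch is that each step refers to data by field name ("user", "url") and can be composed with the others, rather than being written as a single map or reduce method over opaque keys and values.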
Other alternatives began to emerge at about the same time as the project's initial public
release, but Cascading was designed to complement them. Consider that most of these
alternative frameworks impose pre- and post-conditions, or other expectations.
For example, in several other MapReduce tools, you must preformat, filter, or import your
data into HDFS prior to running the application. That step of preparing the data must be
performed outside of the programming abstraction. In contrast, Cascading provides the
means to prepare and manage your data as integral parts of the programming abstraction.
This case study begins with an introduction to the main concepts of Cascading, then
finishes with an overview of how ShareThis uses Cascading in its infrastructure.
See the Cascading User Guide on the project website for a more in-depth presentation of
the Cascading processing model.