The Workflow Abstraction - Enterprise Data Workflows with Cascading

CHAPTER 7

The Workflow Abstraction

Key Insights

Thus far, we have looked at several examples of how to use Cascading. Now let's step back a bit and take a look at some of the theory at its foundation.

The author of Cascading, Chris Wensel, was working at a large firm well known for its many data products. Wensel was evaluating the Nutch project, which included Lucene and subsequently Hadoop, to see how these open source technologies could be leveraged for Big Data within an Enterprise environment. His takeaway was that it would be difficult to find enough Java developers who could write complex Enterprise apps directly in MapReduce.

An obvious response would have been to build some kind of abstraction layer atop Hadoop. Many different variations of this have been developed over the years, and that approach dates back to the many “fourth-generation languages” (4GL) starting in the 1970s. However, another takeaway Wensel had from the early days of Apache Hadoop use was that abstraction layers built by and for the early adopters typically would not pass the “bench test” for Enterprise. The operational complexity of large-scale apps, along with the need to leverage many existing software engineering practices, would be difficult if not impossible to manage through a 4GL-styled abstraction layer.

A key insight into this problem was that MapReduce is based on the functional programming paradigm. In the original MapReduce paper by Jeffrey Dean and Sanjay Ghemawat at Google, the authors made clear that a functional programming model allowed for the following:
