That tuple does not match the regex doc\\d+\\s.* that was specified by the stream
assertion. Great, we caught it before it blew up something downstream.
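To make the check concrete, here is a minimal, library-free sketch of what that stream assertion does: each incoming tuple (here, a line of text) is matched against the pattern, and a non-matching tuple is flagged before it flows downstream. The sample lines are invented for illustration; in an actual Cascading flow the assertion is declared on the pipe assembly rather than hand-rolled like this.

```java
import java.util.List;
import java.util.regex.Pattern;

// Sketch: validate each incoming "tuple" against the expected shape
// before it flows downstream, as the stream assertion in the text does.
public class StreamAssertionSketch {
    // The pattern from the text: a doc id like "doc02" followed by
    // whitespace and the document text.
    static final Pattern DOC_LINE = Pattern.compile("doc\\d+\\s.*");

    static boolean matches(String line) {
        return DOC_LINE.matcher(line).matches();
    }

    public static void main(String[] args) {
        // Hypothetical sample tuples; the second one lacks a doc id.
        List<String> tuples = List.of(
            "doc02 A rain shadow is a dry area",
            "zoink  this line has no doc id"
        );
        for (String t : tuples) {
            if (!matches(t)) {
                // A strict assertion would fail the flow at this point.
                System.out.println("assertion failed: " + t);
            }
        }
    }
}
```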
A gist on GitHub shows building and running “Example 6: TF-IDF with Testing”. If your run looks terribly different, something is probably not set up correctly. Ask the developer community for troubleshooting advice.
A Word or Two About Testing
At first glance, the notion of TDD might seem antithetical to Big Data. After all, TDD is supposed to be about short development cycles, writing automated test cases that are intended to fail, and lots of refactoring. Those descriptions don't seem to fit batch jobs that run over terabytes of data on huge Hadoop clusters for days before they complete.
Stated in a somewhat different way, according to Kent Beck, TDD “encourages simple designs and inspires confidence.” That statement fits quite well with the philosophy of Cascading. The Cascading API is intended to provide a pattern language for working with large-scale data (GroupBy, Join, Count, Regex, Filter) so that the need for writing custom functions becomes relatively rare. That speaks directly to “encouraging simple designs.” The Cascading practice of modeling business processes and orchestrating Apache Hadoop workflows speaks to “inspiring confidence” in a big way.
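To make the pattern-language point concrete, here is a hedged sketch of the GroupBy-plus-Count shape expressed with plain Java streams, so it runs without Cascading or a Hadoop cluster; the token list is invented for illustration, and a real Cascading pipe assembly would use the GroupBy and Count pipes named above instead.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Sketch: the GroupBy + Count pattern from the text, expressed with
// plain Java streams rather than Cascading pipes.
public class GroupByCountSketch {
    static Map<String, Long> countTokens(List<String> tokens) {
        // Group by the token itself, then count occurrences per group.
        return tokens.stream()
                .collect(Collectors.groupingBy(Function.identity(),
                                               Collectors.counting()));
    }

    public static void main(String[] args) {
        // Hypothetical token stream.
        List<String> tokens = List.of("rain", "shadow", "rain", "dry");
        System.out.println(countTokens(tokens));
    }
}
```

The point of the pattern language is that this grouping-and-counting shape recurs so often that it deserves a named, reusable operation rather than custom code each time.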
So now we'll let the cat out of the bag for a little secret… working with unstructured data at scale has been shown to be quite valuable by the likes of Google, Amazon, eBay, Facebook, LinkedIn, Twitter, etc. However, most of the “heavy lifting” that we perform in MapReduce workflows is essentially cleaning up data. DJ Patil, formerly Chief Scientist at LinkedIn, explains this point quite eloquently in the mini-book Data Jujitsu:
It's impossible to overstress this: 80% of the work in any data project is in cleaning the
data… Work done up front in getting clean data will be amply repaid over the course of
the project.
— DJ Patil
Data Jujitsu (2012)
Cleaning up unstructured data allows for subsequent use of sampling techniques, dimensionality reduction, and other practices that help alleviate some of the bottlenecks that might otherwise be encountered in Enterprise data workflows. Put another way, we need API features that demonstrate how “dirty” data has been cleaned up. Cascading provides those features, which turn out to be quite valuable in practice.
Common practices for test-driven development include writing unit tests or mocks.
How does one write a quick unit test for a Godzilla-sized data set?
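One common answer is that you don't test against the full data set at all: you factor the per-tuple cleanup logic into a pure function and unit test it on a tiny, representative sample. The sketch below assumes a hypothetical scrubToken cleanup rule (lowercase, strip non-alphanumeric debris); the specific rules are invented for illustration, not taken from any particular example.

```java
// Sketch: instead of testing a Godzilla-sized data set, extract the
// per-tuple cleanup logic into a pure function and test it on a small,
// representative sample. The scrubToken rules here are hypothetical.
public class ScrubSketch {
    static String scrubToken(String token) {
        // Normalize case and strip non-alphanumeric debris.
        String t = token.toLowerCase().replaceAll("[^a-z0-9]", "");
        return t.isEmpty() ? null : t;  // null means "drop this token"
    }

    public static void main(String[] args) {
        // A handful of dirty inputs stands in for terabytes of them.
        System.out.println(scrubToken("Rain,"));  // prints rain
        System.out.println(scrubToken("--"));     // prints null
    }
}
```

Because the function has no Hadoop dependencies, the same code that runs inside the cluster-scale workflow can be exercised in a millisecond-long unit test.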