That tuple does not match the regex doc\\d+\\s.* that was specified by the stream
assertion. Great, we caught it before it blew up something downstream.
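To make the check concrete, here is a minimal, library-free sketch of what that stream assertion does: each incoming tuple (here, a line of text) is matched against the pattern, and a non-matching tuple is flagged before it flows downstream. The sample lines are invented for illustration; in an actual Cascading flow the assertion is declared on the pipe assembly rather than hand-rolled like this.

```java
import java.util.List;
import java.util.regex.Pattern;

// Sketch: validate each incoming "tuple" against the expected shape
// before it flows downstream, as the stream assertion in the text does.
public class StreamAssertionSketch {
    // The pattern from the text: a doc id like "doc02" followed by
    // whitespace and the document text.
    static final Pattern DOC_LINE = Pattern.compile("doc\\d+\\s.*");

    static boolean matches(String line) {
        return DOC_LINE.matcher(line).matches();
    }

    public static void main(String[] args) {
        // Hypothetical sample tuples; the second one lacks a doc id.
        List<String> tuples = List.of(
            "doc02 A rain shadow is a dry area",
            "zoink  this line has no doc id"
        );
        for (String t : tuples) {
            if (!matches(t)) {
                // A strict assertion would fail the flow at this point.
                System.out.println("assertion failed: " + t);
            }
        }
    }
}
```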
A gist on GitHub shows building and running “Example 6: TF-IDF with Testing”. If your run looks terribly different, something is probably not set up correctly. Ask the developer community for troubleshooting advice.
A Word or Two About Testing
At first glance, the notion of TDD might seem antithetical to Big Data. After all, TDD is supposed to be about short development cycles, writing automated test cases that are intended to fail, and lots of refactoring. Those descriptions don't seem to fit batch jobs that run over terabytes of data on huge Hadoop clusters for days before they complete.
Stated in a somewhat different way, according to Kent Beck, TDD “encourages simple designs and inspires confidence.” That statement fits quite well with the philosophy of Cascading. The Cascading API is intended to provide a pattern language for working with large-scale data (GroupBy, Join, Count, Regex, Filter) so that the need for writing custom functions becomes relatively rare. That speaks directly to “encouraging simple designs.” The Cascading practice of modeling business processes and orchestrating Apache Hadoop workflows speaks to “inspiring confidence” in a big way.
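To make the pattern-language point concrete, here is a hedged sketch of the GroupBy-plus-Count shape expressed with plain Java streams, so it runs without Cascading or a Hadoop cluster; the token list is invented for illustration, and a real Cascading pipe assembly would use the GroupBy and Count pipes named above instead.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Sketch: the GroupBy + Count pattern from the text, expressed with
// plain Java streams rather than Cascading pipes.
public class GroupByCountSketch {
    static Map<String, Long> countTokens(List<String> tokens) {
        // Group by the token itself, then count occurrences per group.
        return tokens.stream()
                .collect(Collectors.groupingBy(Function.identity(),
                                               Collectors.counting()));
    }

    public static void main(String[] args) {
        // Hypothetical token stream.
        List<String> tokens = List.of("rain", "shadow", "rain", "dry");
        System.out.println(countTokens(tokens));
    }
}
```

The point of the pattern language is that this grouping-and-counting shape recurs so often that it deserves a named, reusable operation rather than custom code each time.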
So now we'll let the cat out of the bag for a little secret… working with unstructured data at scale has been shown to be quite valuable by the likes of Google, Amazon, eBay, Facebook, LinkedIn, Twitter, etc. However, most of the “heavy lifting” that we perform in MapReduce workflows is essentially cleaning up data. DJ Patil, formerly Chief Scientist at LinkedIn, explains this point quite eloquently in the mini-book Data Jujitsu:
It's impossible to overstress this: 80% of the work in any data project is in cleaning the
data… Work done up front in getting clean data will be amply repaid over the course of
the project.
— DJ Patil
Data Jujitsu (2012)
Cleaning up unstructured data allows for subsequent use of sampling techniques, dimensionality reduction, and other practices that help alleviate some of the bottlenecks that might otherwise be encountered in Enterprise data workflows. Put another way, we need API features that demonstrate how “dirty” data has been cleaned up. Cascading provides those features, which turn out to be quite valuable in practice.
Common practices for test-driven development include writing unit tests or mocks.
How does one write a quick unit test for a Godzilla-sized data set?
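One common answer is that you don't test against the full data set at all: you factor the per-tuple cleanup logic into a pure function and unit test it on a tiny, representative sample. The sketch below assumes a hypothetical scrubToken cleanup rule (lowercase, strip non-alphanumeric debris); the specific rules are invented for illustration, not taken from any particular example.

```java
// Sketch: instead of testing a Godzilla-sized data set, extract the
// per-tuple cleanup logic into a pure function and test it on a small,
// representative sample. The scrubToken rules here are hypothetical.
public class ScrubSketch {
    static String scrubToken(String token) {
        // Normalize case and strip non-alphanumeric debris.
        String t = token.toLowerCase().replaceAll("[^a-z0-9]", "");
        return t.isEmpty() ? null : t;  // null means "drop this token"
    }

    public static void main(String[] args) {
        // A handful of dirty inputs stands in for terabytes of them.
        System.out.println(scrubToken("Rain,"));  // prints rain
        System.out.println(scrubToken("--"));     // prints null
    }
}
```

Because the function has no Hadoop dependencies, the same code that runs inside the cluster-scale workflow can be exercised in a millisecond-long unit test.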