Test-Driven Development - Enterprise Data Workflows with Cascading

The short answer is: you don't. However, you can greatly reduce the need for writing

unit test coverage by limiting the amount of custom code required. Hopefully we've

shown that aspect of Cascading by now. Beyond that, you can use sampling techniques

to quantify confidence that an app has run correctly. You can also define system tests at

scale in relatively simple ways. Furthermore, you can define contingencies for what to

do when assumptions fail…as they inevitably do, at scale.

Let's discuss sampling. Generally speaking, large MapReduce workflows tend to be relatively

opaque processes that are difficult to observe. Cascading, however, provides two

techniques for observing portions of a workflow. One very simple approach is to insert

a Debug into a pipe to see the tuple values passing through a particular part of a workflow.

Debug output goes to the log instead of a file, but it can be turned off, e.g., with a

command-line option. If the data is large, one can use a Sample filter to sample the tuple

values that get written to the log.
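
The idea of sampling ahead of debug output can be sketched in plain Java. This is a library-free illustration only, not Cascading's API: the class SampledDebugSketch, its methods, and the --debug flag are hypothetical names chosen for this example, and the record list stands in for a tuple stream.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Library-free illustration (hypothetical names, not Cascading's API):
// keep roughly `fraction` of the records, then log only the sampled ones.
class SampledDebugSketch {

    // Keep each record independently with probability `fraction`.
    // A fixed seed makes the sample repeatable between runs.
    static <T> List<T> sample(List<T> records, double fraction, long seed) {
        if (fraction < 0.0 || fraction > 1.0) {
            throw new IllegalArgumentException("fraction must be in [0, 1]");
        }
        Random random = new Random(seed);
        List<T> kept = new ArrayList<>();
        for (T record : records) {
            if (random.nextDouble() < fraction) {
                kept.add(record);
            }
        }
        return kept;
    }

    // Write records to the log (stderr here) only when debugging is enabled.
    // Returns the number of records logged.
    static <T> int debug(List<T> records, boolean enabled) {
        if (!enabled) {
            return 0;
        }
        for (T record : records) {
            System.err.println("debug: " + record);
        }
        return records.size();
    }

    public static void main(String[] args) {
        List<String> tuples = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            tuples.add("doc" + i + "\ttoken" + i);
        }
        // Debug output is off unless requested on the command line.
        boolean debugEnabled = args.length > 0 && args[0].equals("--debug");
        debug(sample(tuples, 0.01, 42L), debugEnabled);
    }
}
```

Sampling first keeps the log volume proportional to the chosen fraction rather than to the size of the data, which is the point of putting the filter in front of the debug step.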

Another approach is to use a Checkpoint, which forces intermediate data to be written

out to HDFS. This may also become important for performance reasons, i.e., forcing

results to disk to avoid recomputing—e.g., when there are multiple uses for the output

of a pipe downstream, such as with the right-hand side (RHS) of a HashJoin. Sampling may be performed

either before (with Debug) or after (with Checkpoint) the data gets persisted to HDFS.

Checkpoints can also be used to restart partially failed workflows, to recover some costs.
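
The reuse-on-restart behavior described above can be modeled in a few lines of plain Java. This is a sketch under stated assumptions, not Cascading's Checkpoint: a local file stands in for HDFS, and the class CheckpointSketch, the checkpoint method, and the file names are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Arrays;
import java.util.List;
import java.util.function.Supplier;

// Library-free illustration (hypothetical names, not Cascading's API):
// persist an intermediate result so that later uses, or a restarted run,
// read it back instead of recomputing the upstream work.
class CheckpointSketch {

    // Return the persisted lines if the checkpoint exists; otherwise compute
    // them, persist them, and return them. Records are stored one per line,
    // so this sketch assumes records contain no line breaks.
    static List<String> checkpoint(Path file, Supplier<List<String>> upstream) throws IOException {
        if (Files.exists(file)) {
            return Files.readAllLines(file, StandardCharsets.UTF_8);
        }
        List<String> result = upstream.get();
        // Write to a sibling file first, then rename, so an interrupted write
        // is not mistaken for a finished checkpoint.
        Path partial = file.resolveSibling(file.getFileName() + ".partial");
        Files.write(partial, result, StandardCharsets.UTF_8);
        Files.move(partial, file, StandardCopyOption.REPLACE_EXISTING);
        return result;
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempDirectory("checkpoint-sketch").resolve("intermediate.txt");
        Supplier<List<String>> expensive = () -> {
            System.out.println("computing upstream result");
            return Arrays.asList("alpha\t1", "beta\t2");
        };
        List<String> firstRun = checkpoint(file, expensive); // computes and persists
        List<String> restart = checkpoint(file, expensive);  // reads the persisted copy
        System.out.println(firstRun.equals(restart));
    }
}
```

The trade-off is the one the text names: the write costs I/O up front, but every downstream consumer and every restart after that point skips the upstream computation.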

Next, let's talk about system tests. Cascading includes support for stream assertions.

These provide mechanisms for asserting that the values in a tuple stream meet certain

criteria—similar to the assert keyword in Java, or an assertNotNull call in a JUnit test.

We can assert patterns strictly as unit tests during development and then run testing

against regression data. For performance reasons, we might use command-line options

to turn off assertions in production—or keep them (fail-fast mode) if a use case requires

that level of guarantee.
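
To make the on/off behavior concrete, here is a plain-Java sketch of a stream assertion that fails fast when enabled and passes data through untouched when disabled. It does not use Cascading's assertion classes; StreamAssertionSketch, assertEach, the --no-assertions flag, and the sample lines are all hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;
import java.util.regex.Pattern;

// Library-free illustration (hypothetical names, not Cascading's API):
// check every record in a stream against a criterion, with a switch
// to turn the checks off.
class StreamAssertionSketch {

    // When enabled, throw on the first record that fails the criterion
    // (fail-fast). When disabled, return the records without checking them.
    static <T> List<T> assertEach(List<T> records, Predicate<T> criterion,
                                  String message, boolean enabled) {
        if (enabled) {
            for (T record : records) {
                if (!criterion.test(record)) {
                    throw new IllegalStateException(message + ": " + record);
                }
            }
        }
        return records;
    }

    public static void main(String[] args) {
        // Assertions are on unless switched off from the command line.
        boolean enabled = !(args.length > 0 && args[0].equals("--no-assertions"));
        // Assumed input shape for this example: a document id, a tab, then text.
        Pattern expected = Pattern.compile("doc\\d+\\t.*");
        List<String> lines = Arrays.asList("doc01\tfirst document", "doc02\tsecond document");
        assertEach(lines, line -> line != null && expected.matcher(line).matches(),
                   "malformed line", enabled);
        System.out.println("checked " + lines.size() + " lines");
    }
}
```

Keeping the switch outside the criterion means the same checks can run during development and regression testing, then be skipped or kept in production without touching the workflow logic.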

Books about Test Driven Development

For more information about TDD in general, check out these books:

• Test Driven Development: By Example by Kent Beck (Addison-Wesley)

• Test-Driven Development: A Practical Guide by Dave Astels (Prentice Hall)

Lastly, what should you do when assumptions fail? One lesson of working with data at

scale is that the best assumptions will inevitably fail. Unexpected things happen, and

80% of the work will be cleaning up problems.

Cascading defines failure traps, which capture data that would otherwise cause an operation

to fail, e.g., by throwing an exception. For example, perhaps 99% of the cases in
