The short answer is: you don't. However, you can greatly reduce the need for unit test
coverage by limiting the amount of custom code required. Hopefully we've
shown that aspect of Cascading by now. Beyond that, you can use sampling techniques
to quantify confidence that an app has run correctly. You can also define system tests at
scale in relatively simple ways. Furthermore, you can define contingencies for what to
do when assumptions fail…as they inevitably do, at scale.
Let's discuss sampling. Generally speaking, large MapReduce workflows tend to be
relatively opaque processes that are difficult to observe. Cascading, however, provides two
techniques for observing portions of a workflow. One very simple approach is to insert
a Debug into a pipe to see the tuple values passing through a particular part of a
workflow. Debug output goes to the log instead of a file, but it can be turned off, e.g., with a
command-line option. If the data is large, one can use a Sample filter to sample the tuple
values that get written to the log.
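The idea of sampled, switchable debug output can be sketched in plain Java. This is not the Cascading API: the class SampledDebug and its methods are hypothetical names invented for illustration, and a List stands in for a tuple stream. The sketch keeps each record with a given probability, writes only the sampled records to a log, and passes the full stream through unchanged.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical illustration only; this is not part of Cascading.
class SampledDebug {

    // Keeps each record with probability `fraction`.
    // A fixed seed makes the sample repeatable between runs.
    static <T> List<T> sample(List<T> records, double fraction, long seed) {
        Random random = new Random(seed);
        List<T> kept = new ArrayList<>();
        for (T record : records) {
            if (random.nextDouble() < fraction) {
                kept.add(record);
            }
        }
        return kept;
    }

    // Logs a sample of the records when `enabled` is true.
    // The full stream is always returned unchanged, so downstream
    // results do not depend on whether debugging is switched on.
    static <T> List<T> debug(List<T> records, double fraction, long seed,
                             boolean enabled, List<String> log) {
        if (enabled) {
            for (T record : sample(records, fraction, seed)) {
                log.add("debug: " + record);
            }
        }
        return records;
    }
}
```

In this sketch the `enabled` flag plays the role of the command-line switch mentioned above, and `fraction` controls how much of a large stream reaches the log.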
Another approach is to use a Checkpoint, which forces intermediate data to be written
out to HDFS. This may also become important for performance reasons: forcing
results to disk avoids recomputing them when the output of a pipe has multiple uses
downstream, such as with the RHS of a HashJoin. Sampling may be performed
either before (with Debug) or after (with Checkpoint) the data gets persisted to HDFS.
Checkpoints can also be used to restart partially failed workflows, which recovers some
of the cost of the work already done.
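The compute-once, reuse-on-restart behavior of a checkpoint can be illustrated with a minimal sketch in plain Java. It does not use Cascading or HDFS: CheckpointSketch is a hypothetical class, a local file stands in for the persisted intermediate data, and records are assumed to be single-line strings.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical illustration only; this is not part of Cascading.
class CheckpointSketch {

    // If the checkpoint file already exists, read it back instead of
    // recomputing. Otherwise run the upstream step once and persist
    // its output. Assumes each record is a string without line breaks.
    static List<String> checkpoint(Path file, Supplier<List<String>> upstream) {
        try {
            if (Files.exists(file)) {
                return Files.readAllLines(file);
            }
            List<String> lines = upstream.get();
            Files.write(file, lines);
            return lines;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Calling `checkpoint` a second time with the same path, for example after a downstream step has failed and the job is rerun, returns the saved records without invoking the upstream step again.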
Next, let's talk about system tests. Cascading includes support for stream assertions.
These provide mechanisms for asserting that the values in a tuple stream meet certain
criteria, similar to the assert keyword in Java or an assertNotNull in a JUnit test.
We can assert patterns strictly as unit tests during development and then run testing
against regression data. For performance reasons, we might use command-line options
to turn off assertions in production, or keep them (fail-fast mode) if a use case requires
that level of guarantee.
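A minimal sketch of the same idea in plain Java follows. It is not the Cascading API: the class StreamAssertionSketch, its Level enum, and the assertEach method are hypothetical names chosen for illustration. Each assertion declares the level at which it applies, and a single run-wide setting decides whether it is checked (fail fast) or skipped.

```java
import java.util.List;
import java.util.function.Predicate;

// Hypothetical illustration only; this is not part of Cascading.
class StreamAssertionSketch {

    // Ordered from least to most checking.
    enum Level { NONE, VALID, STRICT }

    // Checks every record against `check` when the active level is at
    // least the level the assertion was declared with; otherwise the
    // assertion is skipped and the stream passes through untouched.
    static <T> List<T> assertEach(List<T> stream, Predicate<? super T> check,
                                  String message, Level declared, Level active) {
        if (active.ordinal() < declared.ordinal()) {
            return stream; // assertion switched off for this run
        }
        for (T record : stream) {
            if (!check.test(record)) {
                // Fail fast on the first record that violates the assertion.
                throw new IllegalStateException(message + ": " + record);
            }
        }
        return stream;
    }
}
```

With `active` set to STRICT during development, a null value in the stream stops the run immediately. With `active` set to NONE in production, the same call does no checking and adds no per-record cost.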
Books about Test Driven Development
For more information about TDD in general, check out these books:
Test Driven Development: By Example by Kent Beck (Addison-Wesley)
Test-Driven Development: A Practical Guide by Dave Astels (Prentice Hall)
Lastly, what should you do when assumptions fail? One lesson of working with data at
scale is that even the best assumptions will inevitably fail. Unexpected things happen, and
80% of the work will be cleaning up problems.
Cascading defines failure traps, which capture data that would otherwise cause an
operation to fail, e.g., by throwing an exception. For example, perhaps 99% of the cases in
 