Crunch - Hadoop: The Definitive Guide


Further Reading

This chapter has given a short introduction to Crunch. To find out more, consult the Crunch User Guide.

[ 118 ] Some operations do not require a PType, since they can infer it from the PCollection they are applied to. For example, filter() returns a PCollection with the same PType as the original.

[ 119 ] Despite the name, APPEND does not append to existing output files.

[ 120 ] HBaseTarget does not check for existing outputs, so it behaves as if APPEND mode is used.

[ 121 ] See the documentation.

[ 122 ] This is an example of where a pipeline gets executed without an explicit call to run() or done(), but it is still good practice to call done() when the pipeline is finished with so that intermediate files are disposed of.

[ 123 ] There is also an asMap() method on PTable<K, V> that returns an object of type PObject<Map<K, V>>.

[ 124 ] You can increment your own custom counters from Crunch using DoFn's increment() method.

[ 125 ] This optimization is called parallelDo fusion; it is explained in more detail in the FlumeJava paper referenced at the beginning of the chapter, along with some of the other optimizations used by Crunch. Note that parallelDo fusion is what allows you to decompose pipeline operations into small, logically separate functions without any loss of efficiency, since Crunch fuses them into as few MapReduce stages as possible.
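
The idea behind fusion can be sketched in plain Java, without Crunch on the classpath. The class FusionSketch and its methods unfused() and fused() are hypothetical names invented for this illustration; they are not part of the Crunch API, and Crunch's planner works on MapReduce stages rather than in-memory lists. The sketch only shows why composing two small functions into one pass gives the same result as running them as separate steps, while avoiding the intermediate collection.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical stand-alone illustration; none of these names come from Crunch.
final class FusionSketch {

    // Unfused: each function runs as its own step, and the first step
    // materializes an intermediate list before the second step starts.
    static <A, B, C> List<C> unfused(List<A> input,
                                     Function<A, B> first,
                                     Function<B, C> second) {
        List<B> intermediate = new ArrayList<>();
        for (A a : input) {
            intermediate.add(first.apply(a));
        }
        List<C> output = new ArrayList<>();
        for (B b : intermediate) {
            output.add(second.apply(b));
        }
        return output;
    }

    // Fused: the two functions are composed and applied in a single pass,
    // so no intermediate list is built.
    static <A, B, C> List<C> fused(List<A> input,
                                   Function<A, B> first,
                                   Function<B, C> second) {
        Function<A, C> composed = first.andThen(second);
        List<C> output = new ArrayList<>();
        for (A a : input) {
            output.add(composed.apply(a));
        }
        return output;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(" a ", "bb", " ccc");
        Function<String, String> trim = String::trim;
        Function<String, Integer> length = String::length;
        System.out.println(unfused(lines, trim, length));
        System.out.println(fused(lines, trim, length));
    }
}
```

Both calls in main() produce the same list of lengths; the fused version simply gets there in one loop. This mirrors, in a very simplified way, why splitting logic across several small functions need not cost extra stages.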

[ 126 ] For details, see Wikipedia.

[ 127 ] You can find the full source code in the Crunch integration tests in a class called PageRankIT.
