Further Reading
This chapter has given a short introduction to Crunch. To find out more, consult the Crunch
[118] Some operations do not require a PType, since they can infer it from the PCollection they are applied to. For example, filter() returns a PCollection with the same PType as the original.
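The idea in this footnote can be illustrated with a toy model in plain Java. This is only a sketch under stated assumptions: TypedCollection, its type() accessor, and the use of a Class object as a stand-in for a PType are all hypothetical and are not part of the Crunch API. It shows why a filter-style operation can reuse the type descriptor of the collection it is applied to, while a map-style operation needs one supplied by the caller.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

// Toy model only: TypedCollection is hypothetical and is not Crunch's PCollection.
final class TypedCollection<T> {
    private final List<T> items;
    private final Class<T> type; // plays the role of a PType in this sketch

    TypedCollection(List<T> items, Class<T> type) {
        this.items = new ArrayList<>(items);
        this.type = type;
    }

    Class<T> type() {
        return type;
    }

    List<T> items() {
        return items;
    }

    // The element type is unchanged, so the descriptor is reused from this collection.
    TypedCollection<T> filter(Predicate<T> keep) {
        List<T> kept = new ArrayList<>();
        for (T item : items) {
            if (keep.test(item)) {
                kept.add(item);
            }
        }
        return new TypedCollection<>(kept, type);
    }

    // The element type may change, so the caller has to supply the new descriptor.
    <U> TypedCollection<U> map(Function<T, U> fn, Class<U> newType) {
        List<U> mapped = new ArrayList<>();
        for (T item : items) {
            mapped.add(fn.apply(item));
        }
        return new TypedCollection<>(mapped, newType);
    }

    public static void main(String[] args) {
        TypedCollection<String> words =
            new TypedCollection<>(Arrays.asList("a", "", "bcd"), String.class);
        // No type argument needed: filter() keeps the String descriptor.
        TypedCollection<String> nonEmpty = words.filter(s -> !s.isEmpty());
        // A type argument is needed: the elements become Integers.
        TypedCollection<Integer> lengths = nonEmpty.map(String::length, Integer.class);
        System.out.println(nonEmpty.type().getSimpleName() + " " + nonEmpty.items());
        System.out.println(lengths.type().getSimpleName() + " " + lengths.items());
    }
}
```

In this model the asymmetry between the two method signatures mirrors the one the footnote describes: only the operation that can change the element type asks for a descriptor.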
[121] See the documentation.
[122] This is an example of a pipeline being executed without an explicit call to run() or done(), but it is still good practice to call done() once the pipeline is finished with, so that intermediate files are disposed of.
ject<Map<K, V>>.
[125] This optimization is called parallelDo fusion; it is explained in more detail in the FlumeJava paper referenced at the beginning of the chapter, along with some of the other optimizations used by Crunch. Note that parallelDo fusion is what allows you to decompose pipeline operations into small, logically separate functions without any loss of efficiency, since Crunch fuses them into as few MapReduce stages as possible.
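The principle behind fusion can be sketched in plain Java, without Crunch. This is a conceptual illustration only, not Crunch's planner or API: FusionSketch, unfused(), and fused() are hypothetical names. Two small per-element functions are either run as separate passes with an intermediate collection, or composed into one function and applied in a single pass.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Conceptual sketch only: plain Java, not the Crunch API or its planner.
final class FusionSketch {

    // Unfused: each step makes its own pass and materializes an intermediate list.
    static <A, B, C> List<C> unfused(List<A> input, Function<A, B> first, Function<B, C> second) {
        List<B> intermediate = new ArrayList<>();
        for (A a : input) {
            intermediate.add(first.apply(a));
        }
        List<C> output = new ArrayList<>();
        for (B b : intermediate) {
            output.add(second.apply(b));
        }
        return output;
    }

    // Fused: the two steps are composed into one function and applied in a single pass.
    static <A, B, C> List<C> fused(List<A> input, Function<A, B> first, Function<B, C> second) {
        Function<A, C> composed = first.andThen(second);
        List<C> output = new ArrayList<>();
        for (A a : input) {
            output.add(composed.apply(a));
        }
        return output;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(" apple ", "Banana", " cherry");
        Function<String, String> trim = String::trim;       // small, logically separate step
        Function<String, Integer> length = String::length;  // another small step
        System.out.println(unfused(lines, trim, length));
        System.out.println(fused(lines, trim, length));
    }
}
```

Both methods produce the same result; the fused version simply avoids the intermediate collection and the second pass, which is the in-memory analogue of avoiding an extra MapReduce stage.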
[127] You can find the full source code in the Crunch integration tests in a class called PageRankIT.