Pipeline Execution
During pipeline construction, Crunch builds an internal execution plan, which is either run explicitly by the user or implicitly by Crunch (as discussed in Materialization). An execution plan is a directed acyclic graph of operations on PCollections, where each PCollection in the plan holds a reference to the operation that produces it, along with the PCollections that are arguments to the operation. In addition, each PCollection has an internal state that records whether it has been materialized or not.
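The structure of such a plan can be sketched in plain Java. This is a toy model under stated assumptions, not Crunch's internal representation: the PlanNode class, its fields, and the materialize() helper are hypothetical names introduced only to illustrate a node that remembers its producing operation, its argument collections, and whether it has been materialized.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: PlanNode is a hypothetical class, not part of Crunch.
final class PlanNode {
    final String name;
    final String operation;        // the operation that produces this collection
    final List<PlanNode> parents;  // the collections that are arguments to the operation
    boolean materialized = false;  // internal state: has this node been computed yet?

    PlanNode(String name, String operation, List<PlanNode> parents) {
        this.name = name;
        this.operation = operation;
        this.parents = parents;
    }

    // Materialize this node, materializing any unmaterialized parents first.
    // Returns the names of the nodes computed by this call, in execution order.
    List<String> materialize() {
        List<String> executed = new ArrayList<>();
        materializeInto(executed);
        return executed;
    }

    private void materializeInto(List<String> executed) {
        if (materialized) {
            return; // already computed, so nothing is re-run
        }
        for (PlanNode parent : parents) {
            parent.materializeInto(executed);
        }
        materialized = true;
        executed.add(name);
    }
}
```

In this sketch, materializing a node a second time, or materializing a node whose parents are already materialized, does no repeated work, which is the purpose of the per-collection state described above.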
Running a Pipeline
A pipeline's operations can be explicitly executed by calling Pipeline's run() method, which performs the following steps.
First, it optimizes the execution plan as a number of stages. The details of the optimization
depend on the execution engine — a plan optimized for MapReduce will be different from
the same plan optimized for Spark.
Second, it executes each stage in the optimized plan (in parallel, where possible) to materialize the resulting PCollection. PCollections that are to be written to a Target are materialized as the target itself; this might be an output file in HDFS or a table in HBase. Intermediate PCollections are materialized by writing the serialized objects in the collection to a temporary intermediate file in HDFS.
Finally, the run() method returns a PipelineResult object to the caller, with information about whether the pipeline was successful or not (via the succeeded() method).
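To make "in parallel, where possible" concrete, the following sketch groups stages into waves so that every stage in a wave depends only on stages from earlier waves. It is an illustration under stated assumptions and is not Crunch's planner: the StageScheduler class, the waves() method, and the stage names are all hypothetical, and real optimization depends on the execution engine.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy scheduler only: hypothetical names, not Crunch's planner.
final class StageScheduler {
    // deps maps each stage name to the stages whose output it reads.
    // Returns "waves" of stages; stages in the same wave do not depend on
    // each other, so they could be executed in parallel.
    static List<List<String>> waves(Map<String, List<String>> deps) {
        Set<String> done = new LinkedHashSet<>();
        List<List<String>> waves = new ArrayList<>();
        while (done.size() < deps.size()) {
            List<String> ready = new ArrayList<>();
            for (Map.Entry<String, List<String>> e : deps.entrySet()) {
                // A stage is ready once all of its inputs have been produced.
                if (!done.contains(e.getKey()) && done.containsAll(e.getValue())) {
                    ready.add(e.getKey());
                }
            }
            if (ready.isEmpty()) {
                // No progress is possible: the graph is not a valid DAG.
                throw new IllegalArgumentException("cycle or unknown dependency in plan");
            }
            done.addAll(ready);
            waves.add(ready);
        }
        return waves;
    }
}
```

For a plan in which a hypothetical stage C reads the output of two independent stages A and B, this yields two waves: A and B together, then C.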
The clean() method removes all of the temporary intermediate files that were created to materialize PCollections. It should be called once the pipeline is no longer needed, to free up disk space on HDFS. The method takes a Boolean parameter to indicate whether the temporary files should be forcibly deleted. If false, the temporary files will only be deleted if all the targets in the pipeline have been created.
Rather than calling run() followed by clean(false), it is more convenient to call done(), which has the same effect; it signals that the pipeline should be run and then cleaned up since it will not be needed any more.
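The relationship between run(), clean(), and done() described above can be modelled with a small stand-in class. This is a sketch under stated assumptions rather than the Crunch implementation: ToyPipeline, its tempFiles list, and the targetsWillBeCreated flag are hypothetical, and the method names simply mirror those used in the text.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: ToyPipeline is hypothetical and does not touch HDFS.
final class ToyPipeline {
    final List<String> tempFiles = new ArrayList<>();
    private final boolean targetsWillBeCreated; // simulates success or failure
    private boolean allTargetsCreated = false;

    ToyPipeline(boolean targetsWillBeCreated) {
        this.targetsWillBeCreated = targetsWillBeCreated;
    }

    // Pretend to execute the plan: one intermediate file is written, and the
    // targets are created only if the simulated run succeeds.
    boolean run() {
        tempFiles.add("tmp-" + tempFiles.size());
        allTargetsCreated = targetsWillBeCreated;
        return allTargetsCreated;
    }

    // Remove temporary files. Without force, they are removed only if all
    // the targets have been created.
    void clean(boolean force) {
        if (force || allTargetsCreated) {
            tempFiles.clear();
        }
    }

    // Same effect as run() followed by clean(false).
    boolean done() {
        boolean succeeded = run();
        clean(false);
        return succeeded;
    }
}
```

In this model, a failed run leaves its temporary files in place after done(), and they remain until clean(true) is called.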