Database Reference
In-Depth Information
Pipeline Execution
During pipeline construction, Crunch builds an internal execution plan, which is either run
explicitly by the user or implicitly by Crunch (as discussed in Materialization ). An execu-
tion plan is a directed acyclic graph of operations on PCollection s, where each PCol-
lection in the plan holds a reference to the operation that produces it, along with the
PCollection s that are arguments to the operation. In addition, each PCollection
has an internal state that records whether it has been materialized or not.
Running a Pipeline
A pipeline's operations can be explicitly executed by calling Pipeline 's run() method,
which performs the following steps.
First, it optimizes the execution plan as a number of stages. The details of the optimization
depend on the execution engine — a plan optimized for MapReduce will be different from
the same plan optimized for Spark.
Second, it executes each stage in the optimized plan (in parallel, where possible) to materi-
alize the resulting PCollection . PCollection s that are to be written to a Target
are materialized as the target itself — this might be an output file in HDFS or a table in
HBase. Intermediate PCollection s are materialized by writing the serialized objects in
the collection to a temporary intermediate file in HDFS.
Finally, the run() method returns a PipelineResult object to the caller, with inform-
ation about each stage that was run (duration and MapReduce counters [ 124 ] ), as well as
whether the pipeline was successful or not (via the succeeded() method).
The clean() method removes all of the temporary intermediate files that were created to
materialize PCollection s. It should be called after the pipeline is finished with to free
up disk space on HDFS. The method takes a Boolean parameter to indicate whether the
temporary files should be forcibly deleted. If false , the temporary files will only be de-
leted if all the targets in the pipeline have been created.
Rather than calling run() followed by clean(false) , it is more convenient to call
done() , which has the same effect; it signals that the pipeline should be run and then
cleaned up since it will not be needed any more.
Search WWH ::




Custom Search