NOTE
Crunch tries to write the type of collection to the target file in the most natural way. For example, a
PTable is written to an Avro file using a Pair record schema with key and value fields that match the
PTable. Similarly, a PCollection's values are written to a sequence file's values (the keys are
null), and a PTable is written to a text file with tab-separated keys and values.
Existing outputs
If a file-based target already exists, Crunch will throw a CrunchRuntimeException
when the write() method is called. This preserves the behavior of MapReduce, which
is to be conservative and not overwrite existing outputs unless explicitly directed to by the
user (see Java MapReduce).
A flag may be passed to the write() method indicating that outputs should be overwritten,
as follows:
collection.write(To.avroFile("output"), Target.WriteMode.OVERWRITE);
If output already exists, then it will be deleted before the pipeline runs.
There is another write mode, APPEND, which will add new files [119] to the output directory,
leaving any existing ones from previous runs intact. Crunch takes care to use a unique
identifier in filenames to avoid the possibility of a new run overwriting files from a previous
run. [120]
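To make the difference between these behaviors concrete, the following is a minimal standalone sketch in plain Java. It is a model of the semantics described above, not Crunch code: the class WriteModeModel, its Mode enum, the part-file naming scheme, and the use of IllegalStateException in place of CrunchRuntimeException are all assumptions made for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.UUID;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Standalone model of the write modes described in the text; NOT Crunch code.
// Every name here (WriteModeModel, Mode, write, ...) is hypothetical.
class WriteModeModel {
    enum Mode { DEFAULT, OVERWRITE, APPEND }

    // Writes one "part" file into outputDir, honoring the given mode.
    static Path write(Path outputDir, String contents, Mode mode) {
        try {
            if (Files.exists(outputDir)) {
                if (mode == Mode.DEFAULT) {
                    // Conservative default: refuse to touch existing output.
                    throw new IllegalStateException("Output already exists: " + outputDir);
                }
                if (mode == Mode.OVERWRITE) {
                    // Existing output is deleted before anything new is written.
                    deleteRecursively(outputDir);
                }
                // APPEND leaves existing files in place.
            }
            Files.createDirectories(outputDir);
            // A unique identifier per run keeps new files from clobbering old ones.
            String runId = UUID.randomUUID().toString();
            Path part = outputDir.resolve("part-" + runId + "-00000");
            return Files.write(part, contents.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Deletes a directory tree, children before parents.
    static void deleteRecursively(Path dir) throws IOException {
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(dir)) {
            paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path p : paths) {
            Files.delete(p);
        }
    }

    // Counts the entries directly inside dir.
    static long countFiles(Path dir) {
        try (Stream<Path> files = Files.list(dir)) {
            return files.count();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Returns a not-yet-existing "output" path inside a fresh temp directory.
    static Path tempOutputDir() {
        try {
            return Files.createTempDirectory("write-mode-model").resolve("output");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In this model, a second DEFAULT write to the same directory fails, an APPEND write adds a second file alongside the first, and an OVERWRITE write leaves only the newly written file.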
The final write mode is CHECKPOINT, which is for saving work to a file so that a new
pipeline can start from that point rather than from the beginning of the pipeline. This
mode is covered in Checkpointing a Pipeline.
Combined sources and targets
Sometimes you want to write to a target and then read from it as a source (i.e., in another
pipeline in the same program). For this case, Crunch provides the SourceTarget<T>
interface, which is both a Source<T> and a Target. The At class provides static factory
methods for creating SourceTarget instances for text files, sequence files, and Avro
files.
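The shape of this relationship can be sketched in plain Java. The types below (SimpleSource, SimpleTarget, SimpleSourceTarget, and the LocalAt factory) are simplified, hypothetical stand-ins that work on a local text file; they are not the Crunch interfaces and do not reflect Crunch's actual method signatures. The sketch only shows how one object can be handed to code that expects a target and later to code that expects a source.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for illustration only; NOT the Crunch interfaces.
interface SimpleSource<T> {
    List<T> read();
}

interface SimpleTarget {
    void write(Iterable<?> values);
}

// A combined type: usable wherever a source or a target is expected.
interface SimpleSourceTarget<T> extends SimpleSource<T>, SimpleTarget {
}

// Static factory in the spirit of the At class described in the text.
final class LocalAt {
    private LocalAt() {
    }

    static SimpleSourceTarget<String> textFile(Path file) {
        return new SimpleSourceTarget<String>() {
            @Override
            public void write(Iterable<?> values) {
                // One value per line, using each value's string form.
                List<String> lines = new ArrayList<>();
                for (Object value : values) {
                    lines.add(String.valueOf(value));
                }
                try {
                    Files.write(file, lines, StandardCharsets.UTF_8);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            }

            @Override
            public List<String> read() {
                try {
                    return Files.readAllLines(file, StandardCharsets.UTF_8);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            }
        };
    }

    // Creates an empty temporary file to use as a location.
    static Path tempFile() {
        try {
            return Files.createTempFile("source-target", ".txt");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A first stage would call write() on the object returned by LocalAt.textFile(), and a later stage would call read() on the same object, which mirrors the write-then-read pattern described above.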