Crunch - Hadoop: The Definitive Guide


NOTE

Crunch tries to write the type of collection to the target file in the most natural way. For example, a PTable is written to an Avro file using a Pair record schema with key and value fields that match the PTable. Similarly, a PCollection's values are written to a sequence file's values (the keys are null), and a PTable is written to a text file with tab-separated keys and values.

Existing outputs

If a file-based target already exists, Crunch will throw a CrunchRuntimeException when the write() method is called. This preserves the behavior of MapReduce, which is to be conservative and not overwrite existing outputs unless explicitly directed to by the user (see Java MapReduce).

A flag may be passed to the write() method indicating that outputs should be overwritten, as follows:

collection.write(To.avroFile("output"), Target.WriteMode.OVERWRITE);

If output already exists, then it will be deleted before the pipeline runs.

There is another write mode, APPEND, which will add new files [119] to the output directory, leaving any existing ones from previous runs intact. Crunch takes care to use a unique identifier in filenames to avoid the possibility of a new run overwriting files from a previous run. [120]
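The behaviors described so far (fail if the output exists, overwrite it, or append to it) can be modeled in a few lines of plain Java. This is only a toy sketch of the semantics, not Crunch code: the class WriteModeSketch, its Mode enum, and the part-file naming are all hypothetical, and an IllegalStateException stands in for Crunch's CrunchRuntimeException.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.UUID;
import java.util.stream.Stream;

/** Toy model of the write modes described in the text; this is NOT Crunch code. */
final class WriteModeSketch {

    /** Hypothetical stand-in for the modes named in the text. */
    enum Mode { DEFAULT, OVERWRITE, APPEND }

    /** Writes one part file into outputDir according to the mode and returns its path. */
    static Path write(Path outputDir, String contents, Mode mode) {
        try {
            if (Files.exists(outputDir)) {
                switch (mode) {
                    case DEFAULT:
                        // Conservative default: refuse to touch existing output.
                        throw new IllegalStateException("Output already exists: " + outputDir);
                    case OVERWRITE:
                        // Remove the old output before writing anything new.
                        deleteRecursively(outputDir);
                        break;
                    case APPEND:
                        // Keep files from earlier runs untouched.
                        break;
                }
            }
            Files.createDirectories(outputDir);
            // A unique identifier in the filename keeps a new run from
            // overwriting files written by a previous run.
            Path part = outputDir.resolve("part-" + UUID.randomUUID() + ".txt");
            Files.writeString(part, contents);
            return part;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Counts the entries directly inside dir. */
    static long countFiles(Path dir) {
        try (Stream<Path> files = Files.list(dir)) {
            return files.count();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private static void deleteRecursively(Path dir) throws IOException {
        try (Stream<Path> paths = Files.walk(dir)) {
            // Reverse order visits children before their parent directory.
            paths.sorted(Comparator.reverseOrder()).forEach(p -> {
                try {
                    Files.delete(p);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        }
    }

    public static void main(String[] args) throws IOException {
        Path out = Files.createTempDirectory("sketch").resolve("output");
        write(out, "run 1", Mode.DEFAULT);
        write(out, "run 2", Mode.APPEND);
        System.out.println(countFiles(out)); // prints 2: both runs kept
        write(out, "run 3", Mode.OVERWRITE);
        System.out.println(countFiles(out)); // prints 1: old output deleted
    }
}
```

In this model, a second DEFAULT write to the same directory fails, which mirrors the conservative behavior the text attributes to MapReduce and Crunch.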

The final write mode is CHECKPOINT, which is for saving work to a file so that a new pipeline can start from that point rather than from the beginning of the pipeline. This mode is covered in Checkpointing a Pipeline.

Combined sources and targets

Sometimes you want to write to a target and then read from it as a source (i.e., in another pipeline in the same program). For this case, Crunch provides the SourceTarget<T> interface, which is both a Source<T> and a Target. The At class provides static factory methods for creating SourceTarget instances for text files, sequence files, and Avro files.
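The idea of one object serving as both a source and a target can be illustrated with a small plain-Java model. The interfaces below (SketchSource, SketchTarget, SketchSourceTarget) and the TextFileSourceTarget class are hypothetical simplifications made up for this sketch; Crunch's real Source<T>, Target, and SourceTarget<T> interfaces declare different methods and are used through a Pipeline.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical, simplified stand-ins; NOT the Crunch interfaces. */
interface SketchSource<T> {
    List<T> read();
}

interface SketchTarget {
    void write(List<?> records);
}

/** Combines both roles, as the text describes for SourceTarget<T>. */
interface SketchSourceTarget<T> extends SketchSource<T>, SketchTarget {
}

/** A text file that can be written as a target and read back as a source. */
final class TextFileSourceTarget implements SketchSourceTarget<String> {
    private final Path path;

    TextFileSourceTarget(Path path) {
        this.path = path;
    }

    @Override
    public void write(List<?> records) {
        // One record per line, using each record's string form.
        List<String> lines = records.stream()
                .map(r -> String.valueOf(r))
                .collect(Collectors.toList());
        try {
            Files.write(path, lines);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public List<String> read() {
        try {
            return Files.readAllLines(path);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In this model, a first stage would treat the object as a SketchTarget and write to it, and a later stage would treat the same object as a SketchSource<String> and read the records back, which is the round trip the text describes between two pipelines.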
