NOTE
Crunch tries to write the type of collection to the target file in the most natural way. For example, a PTable is written to an Avro file using a Pair record schema with key and value fields that match the PTable. Similarly, a PCollection's values are written to a sequence file's values (the keys are null), and a PTable is written to a text file with tab-separated keys and values.
Existing outputs
If a file-based target already exists, Crunch will throw a CrunchRuntimeException when the write() method is called. This preserves the behavior of MapReduce, which is to be conservative and not overwrite existing outputs unless explicitly directed to by the user (see Java MapReduce).
A flag may be passed to the write() method indicating that outputs should be overwritten, as follows:
collection.write(To.avroFile("output"), Target.WriteMode.OVERWRITE);
If output already exists, then it will be deleted before the pipeline runs.
There is another write mode, APPEND, which adds new files to the output directory, leaving any existing ones from previous runs intact. Crunch takes care to use a unique identifier in filenames to avoid the possibility of a new run overwriting files from a previous run. [120]
The final write mode is CHECKPOINT, which is for saving work to a file so that a new pipeline can start from that point rather than from the beginning of the pipeline. This mode is covered in Checkpointing a Pipeline.
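To make the difference between the write behaviors above concrete, here is a minimal sketch that models them with plain java.nio.file calls. It is not Crunch code and does not use the Crunch API: the class WriteModeSketch, its Mode enum, and the "part-<UUID>.txt" naming scheme are all hypothetical stand-ins chosen for illustration, and checkpointing is deliberately left out.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.UUID;
import java.util.stream.Stream;

// Illustrative model only: NOT Crunch code. All names here are hypothetical.
final class WriteModeSketch {

    // Stand-in for the write behaviors discussed in the text.
    enum Mode { DEFAULT, OVERWRITE, APPEND }

    // Writes one uniquely named file into outputDir, honoring the given mode.
    static Path write(Path outputDir, List<String> lines, Mode mode) throws IOException {
        if (Files.exists(outputDir)) {
            switch (mode) {
                case DEFAULT:
                    // Conservative behavior: refuse to touch an existing output.
                    throw new IllegalStateException("Output already exists: " + outputDir);
                case OVERWRITE:
                    // Delete the old output before writing the new one.
                    deleteRecursively(outputDir);
                    break;
                case APPEND:
                    // Keep existing files; the new file is added alongside them.
                    break;
            }
        }
        Files.createDirectories(outputDir);
        // A unique identifier in the filename keeps runs from clobbering each other.
        Path part = outputDir.resolve("part-" + UUID.randomUUID() + ".txt");
        return Files.write(part, lines);
    }

    // Deletes a directory tree, children before parents.
    static void deleteRecursively(Path dir) throws IOException {
        try (Stream<Path> paths = Files.walk(dir)) {
            paths.sorted(Comparator.reverseOrder()).forEach(p -> {
                try {
                    Files.delete(p);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        }
    }

    // Counts the entries directly inside dir.
    static long countFiles(Path dir) throws IOException {
        try (Stream<Path> paths = Files.list(dir)) {
            return paths.count();
        }
    }

    public static void main(String[] args) throws IOException {
        Path out = Files.createTempDirectory("sketch").resolve("output");
        write(out, List.of("first run"), Mode.DEFAULT);
        write(out, List.of("second run"), Mode.APPEND);
        System.out.println(countFiles(out)); // prints 2
        write(out, List.of("third run"), Mode.OVERWRITE);
        System.out.println(countFiles(out)); // prints 1
    }
}
```

In this model a second DEFAULT write to the same directory fails with an exception, which mirrors the conservative behavior described for write() without a flag.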
Combined sources and targets
Sometimes you want to write to a target and then read from it as a source (i.e., in another pipeline in the same program). For this case, Crunch provides the SourceTarget<T> interface, which is both a Source<T> and a Target. The At class provides static factory methods for creating SourceTarget instances for text files, sequence files, and Avro files.
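The relationship between the three interfaces can be pictured with a toy model. The sketch below is not the Crunch API: the interfaces are simplified, hypothetical look-alikes with invented read() and write() methods, and InMemorySourceTarget is an assumed in-memory stand-in for a file. It only shows how one object can serve as the target of one step and the source of the next.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: these are simplified, hypothetical interfaces, NOT Crunch's.
final class SourceTargetSketch {

    // Something records can be read from.
    interface Source<T> {
        List<T> read();
    }

    // Something records can be written to (untyped, like the Target in the text).
    interface Target {
        void write(Iterable<?> records);
    }

    // Both at once: written by one step, then read back by another.
    interface SourceTarget<T> extends Source<T>, Target {
    }

    // Assumed in-memory stand-in for a file-backed implementation.
    static final class InMemorySourceTarget<T> implements SourceTarget<T> {
        private final Class<T> type;
        private final List<T> stored = new ArrayList<>();

        InMemorySourceTarget(Class<T> type) {
            this.type = type;
        }

        @Override
        public void write(Iterable<?> records) {
            for (Object record : records) {
                // cast() throws ClassCastException for records of the wrong type.
                stored.add(type.cast(record));
            }
        }

        @Override
        public List<T> read() {
            // Return a copy so callers cannot modify the stored records.
            return new ArrayList<>(stored);
        }
    }

    public static void main(String[] args) {
        SourceTarget<String> shared = new InMemorySourceTarget<>(String.class);

        // First step treats the object as a Target.
        Target asTarget = shared;
        asTarget.write(List.of("alpha", "beta"));

        // Second step treats the same object as a Source.
        Source<String> asSource = shared;
        System.out.println(asSource.read()); // prints [alpha, beta]
    }
}
```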