Database Reference
In-Depth Information
into a
parallelDo()
operation, and in this context it can only run on the reduce side,
so there is no possibility for using a combiner to improve its performance.
Finally, the other operation on
PGroupedTable
is
ungroup()
, which turns a
PGroupedTable<K, V>
back into a
PTable<K, V>
— the reverse of
groupByKey()
. (It's not a primitive operation though, since it is implemented with a
parallelDo()
.) Calling
groupByKey()
then
ungroup()
on a
PTable
has the
effect of performing a partial sort on the table by its keys, although it's normally more
convenient to use the
Sort
library, which implements a total sort (which is usually what
you want) and also offers options for ordering.
Types
Every
PCollection<S>
has an associated class,
PType<S>
, that encapsulates type
information about the elements in the
PCollection
. The
PType<S>
determines the
Java class,
S
, of the elements in the
PCollection
, as well as the serialization format
used to read data from persistent storage into the
PCollection
and, conversely, write
data from the
PCollection
to persistent storage.
There are two
PType
families in Crunch: Hadoop Writables and Avro. The choice of
which to use broadly corresponds to the file format that you are using in your pipeline;
Writables for sequence files, and Avro for Avro data files. Either family can be used with
text files. Pipelines can use a mixture of
PType
s from different families (since the
PType
is associated with the
PCollection
, not the pipeline), but this is usually unne-
cessary unless you are doing something that spans families, like file format conversion.
In general, Crunch strives to hide the differences between different serialization formats,
so that the types used in code are familiar to Java programmers. (Another benefit is that
it's easier to write libraries and utilities to work with Crunch collections, regardless of the
serialization family they belong to.) Lines read from a text file, for instance, are presented
as regular Java
String
objects, rather than the Writable
Text
variant or Avro
Utf8
ob-
jects.
The
PType
used by a
PCollection
is specified when the
PCollection
is created,
although sometimes it is implicit. For example, reading a text file will use Writables by
default, as this test shows:
PCollection
<
String
>
lines
=
pipeline
.
read
(
From
.
textFile
(
inputPath
));
assertEquals
(
WritableTypeFamily
.
getInstance
(),
lines
.
getPType
().
getFamily
());