Database Reference
In-Depth Information
ferred in the shuffle (see Combiner Functions ). The mapValues() method is translated
into a parallelDo() operation, and in this context it can only run on the reduce side,
so there is no possibility for using a combiner to improve its performance.
Finally, the other operation on PGroupedTable is ungroup() , which turns a
PGroupedTable<K, V> back into a PTable<K, V> — the reverse of
groupByKey() . (It's not a primitive operation though, since it is implemented with a
parallelDo() .) Calling groupByKey() then ungroup() on a PTable has the
effect of performing a partial sort on the table by its keys, although it's normally more
convenient to use the Sort library, which implements a total sort (which is usually what
you want) and also offers options for ordering.
Types
Every PCollection<S> has an associated class, PType<S> , that encapsulates type
information about the elements in the PCollection . The PType<S> determines the
Java class, S , of the elements in the PCollection , as well as the serialization format
used to read data from persistent storage into the PCollection and, conversely, write
data from the PCollection to persistent storage.
There are two PType families in Crunch: Hadoop Writables and Avro. The choice of
which to use broadly corresponds to the file format that you are using in your pipeline;
Writables for sequence files, and Avro for Avro data files. Either family can be used with
text files. Pipelines can use a mixture of PType s from different families (since the
PType is associated with the PCollection , not the pipeline), but this is usually unne-
cessary unless you are doing something that spans families, like file format conversion.
In general, Crunch strives to hide the differences between different serialization formats,
so that the types used in code are familiar to Java programmers. (Another benefit is that
it's easier to write libraries and utilities to work with Crunch collections, regardless of the
serialization family they belong to.) Lines read from a text file, for instance, are presented
as regular Java String objects, rather than the Writable Text variant or Avro Utf8 ob-
jects.
The PType used by a PCollection is specified when the PCollection is created,
although sometimes it is implicit. For example, reading a text file will use Writables by
default, as this test shows:
PCollection < String > lines = pipeline . read ( From . textFile ( inputPath ));
assertEquals ( WritableTypeFamily . getInstance (),
lines . getPType (). getFamily ());
Search WWH ::




Custom Search