However, it is possible to explicitly use Avro serialization by passing the appropriate
PType to the textFile() method. Here we use the static factory method on Avros to
create an Avro representation of PType<String>:
PCollection<String> lines = pipeline.read(From.textFile(inputPath,
    Avros.strings()));
Similarly, operations that create new PCollections require that the PType is specified
and matches the type parameters of the PCollection. [118] For instance, in our earlier
example the parallelDo() operation to extract an integer key from a
PCollection<String>, turning it into a PTable<Integer, String>, specified a
matching PType of:
tableOf(ints(), strings())
where all three methods are statically imported from Writables.
Records and tuples
When it comes to working with complex objects with multiple fields, you can choose
between records or tuples in Crunch. A record is a class where fields are accessed by
name, such as Avro's GenericRecord, a plain old Java object (corresponding to Avro
Specific or Reflect), or a custom Writable. For a tuple, on the other hand, field access
is by position, and Crunch provides a Tuple interface as well as a few convenience
classes for tuples with a small number of elements: Pair<K, V>, Tuple3<V1, V2, V3>,
Tuple4<V1, V2, V3, V4>, and TupleN for tuples with an arbitrary but fixed
number of values.
Where possible, you should prefer records over tuples, since the resulting Crunch
programs are more readable and understandable. If a weather record is represented by a
WeatherRecord class with year, temperature, and station ID fields, then it is easier to
work with this type:
Emitter<Pair<Integer, WeatherRecord>>
than this:
Emitter<Pair<Integer, Tuple3<Integer, Integer, String>>>
The latter does not convey any semantic information through its type names, unlike
WeatherRecord, which clearly describes what it is.
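The readability difference can be sketched in plain Java without any Crunch dependency. In the sketch below, Triple is a hypothetical stand-in for a positional tuple (it is not Crunch's Tuple3 class), WeatherRecord is an assumed minimal record class with the three fields named above, and the field values are made up for illustration:

```java
public class RecordVersusTuple {

    // Hypothetical positional tuple, standing in for a Tuple3-style type.
    // Fields can only be reached by position, so callers must remember the order.
    static final class Triple<A, B, C> {
        private final A first;
        private final B second;
        private final C third;

        Triple(A first, B second, C third) {
            this.first = first;
            this.second = second;
            this.third = third;
        }

        A first() { return first; }
        B second() { return second; }
        C third() { return third; }
    }

    // Assumed record class: the same three values, but accessed by name.
    static final class WeatherRecord {
        private final int year;
        private final int temperature;
        private final String stationId;

        WeatherRecord(int year, int temperature, String stationId) {
            this.year = year;
            this.temperature = temperature;
            this.stationId = stationId;
        }

        int getYear() { return year; }
        int getTemperature() { return temperature; }
        String getStationId() { return stationId; }
    }

    // Positional access: nothing in the code says which Integer is the year.
    static String describeTuple(Triple<Integer, Integer, String> t) {
        return t.third() + " " + t.first() + " " + t.second();
    }

    // Named access: the accessor names document the meaning of each field.
    static String describeRecord(WeatherRecord r) {
        return r.getStationId() + " " + r.getYear() + " " + r.getTemperature();
    }

    public static void main(String[] args) {
        // Made-up sample values, used only to exercise both styles.
        System.out.println(describeTuple(new Triple<>(1950, 22, "station-A")));
        System.out.println(describeRecord(new WeatherRecord(1950, 22, "station-A")));
    }
}
```

Both methods produce the same string, but swapping first() and second() in describeTuple would still compile, whereas confusing getYear() with getTemperature() in describeRecord is much easier to spot in review.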
As this example hints, it is not possible to entirely avoid using Crunch Pair objects,
since they are a fundamental part of the way Crunch represents table collections (recall