However, it is possible to explicitly use Avro serialization by passing the appropriate PType to the textFile() method. Here we use the static factory method on Avros to create an Avro representation of PType<String>:
PCollection<String> lines = pipeline.read(From.textFile(inputPath, Avros.strings()));
Similarly, operations that create new PCollections require that the PType is specified. For example, a parallelDo() operation that extracts an integer key from a PCollection<String>, turning it into a PTable<Integer, String>, specifies a matching PType of:
tableOf(ints(), strings())
where all three methods are statically imported from Writables.
Records and tuples
When it comes to working with complex objects with multiple fields, you can choose between records or tuples in Crunch. A record is a class where fields are accessed by name, such as Avro's GenericRecord, a plain old Java object (corresponding to Avro Specific or Reflect), or a custom Writable. For a tuple, on the other hand, field access is by position, and Crunch provides a Tuple interface as well as a few convenience classes for tuples with a small number of elements: Pair<K, V>, Tuple3<V1, V2, V3>, Tuple4<V1, V2, V3, V4>, and TupleN for tuples with an arbitrary but fixed number of values.
Where possible, you should prefer records over tuples, since the resulting Crunch programs are more readable and understandable. If a weather record is represented by a WeatherRecord class with year, temperature, and station ID fields, then it is easier to work with this type:
Emitter<Pair<Integer, WeatherRecord>>
than this:
Emitter<Pair<Integer, Tuple3<Integer, Integer, String>>>
The latter does not convey any semantic information through its type names, unlike WeatherRecord, which clearly describes what it is.
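The difference between access by name and access by position can be sketched in plain Java, without Crunch on the classpath. This is a minimal illustration only: the WeatherRecord class below is an assumed shape based on the three fields named above, Triple is a hypothetical stand-in for a three-element tuple (it is not Crunch's Tuple3), and the sample station ID and readings are made up for the example.

```java
public class RecordVsTuple {

    // Record style: each field has a name, so call sites are self-describing.
    static final class WeatherRecord {
        private final int year;
        private final int temperature;
        private final String stationId;

        WeatherRecord(int year, int temperature, String stationId) {
            this.year = year;
            this.temperature = temperature;
            this.stationId = stationId;
        }

        int getYear() { return year; }
        int getTemperature() { return temperature; }
        String getStationId() { return stationId; }
    }

    // Tuple style (hypothetical stand-in): fields are known only by position.
    static final class Triple<A, B, C> {
        private final A first;
        private final B second;
        private final C third;

        Triple(A first, B second, C third) {
            this.first = first;
            this.second = second;
            this.third = third;
        }

        A first() { return first; }
        B second() { return second; }
        C third() { return third; }
    }

    // Reads naturally: the accessors say what each value means.
    static String describe(WeatherRecord r) {
        return r.getStationId() + " " + r.getYear() + " " + r.getTemperature();
    }

    // The reader has to remember that first = year, second = temperature,
    // and third = station ID; the types alone do not say so.
    static String describe(Triple<Integer, Integer, String> t) {
        return t.third() + " " + t.first() + " " + t.second();
    }

    public static void main(String[] args) {
        WeatherRecord record = new WeatherRecord(1950, 22, "011990-99999");
        Triple<Integer, Integer, String> tuple =
            new Triple<>(1950, 22, "011990-99999");

        System.out.println(describe(record)); // prints "011990-99999 1950 22"
        System.out.println(describe(tuple));  // prints "011990-99999 1950 22"
    }
}
```

Both methods produce the same string, but swapping first() and second() in the tuple version would still compile, because both are Integer. The record version makes that kind of mistake much easier to spot.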
As this example hints, it is not possible to entirely avoid using Crunch Pair objects, since they are a fundamental part of the way Crunch represents table collections (recall