The records() factory method returns a Crunch PType for the Avro Reflect data model, as
we have used it here; but it also supports Avro Specific and Generic data models. If
you wanted to use Avro Specific instead, then you would define your custom type using
an Avro schema file, generate the Java class for it, and call records() with the generated
class. For Avro Generic, you would declare the class to be a GenericRecord.
Writables also provides a records() factory method for using custom Writable types;
however, they are more cumbersome to define since you have to write serialization
logic yourself (see Implementing a Custom Writable).
With a collection of records in hand, we can use Crunch libraries or our own processing
functions to perform computations on it. For example, this will perform a total sort of the
weather records by the fields in the order they are declared (by year, then by temperature,
then by station ID):
    PCollection<WeatherRecord> sortedRecords = Sort.sort(records);
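To make the resulting order concrete, here is a minimal plain-Java sketch of the same field-by-field comparison (year, then temperature, then station ID). It does not use Crunch: the WeatherRecord class below is a hypothetical stand-in with the three fields named in the text, and the sort runs in memory on a list rather than as a distributed total sort.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class FieldOrderSortSketch {

    // Hypothetical stand-in for the WeatherRecord type. Fields are declared
    // in the order the text gives: year, temperature, station ID.
    static final class WeatherRecord {
        final int year;
        final int temperature;
        final String stationId;

        WeatherRecord(int year, int temperature, String stationId) {
            this.year = year;
            this.temperature = temperature;
            this.stationId = stationId;
        }

        @Override
        public String toString() {
            return year + "\t" + temperature + "\t" + stationId;
        }
    }

    // Compare field by field, in declaration order.
    static final Comparator<WeatherRecord> DECLARATION_ORDER =
        Comparator.comparingInt((WeatherRecord r) -> r.year)
            .thenComparingInt(r -> r.temperature)
            .thenComparing(r -> r.stationId);

    // Returns a sorted copy; the input list is left untouched.
    static List<WeatherRecord> sortRecords(List<WeatherRecord> records) {
        List<WeatherRecord> sorted = new ArrayList<>(records);
        sorted.sort(DECLARATION_ORDER);
        return sorted;
    }

    public static void main(String[] args) {
        List<WeatherRecord> records = Arrays.asList(
            new WeatherRecord(1950, 22, "011990-99999"),
            new WeatherRecord(1949, 111, "012650-99999"),
            new WeatherRecord(1950, 0, "011990-99999"),
            new WeatherRecord(1950, 0, "010010-99999"));
        for (WeatherRecord r : sortRecords(records)) {
            System.out.println(r);
        }
    }
}
```

With this sample data, the 1949 record comes first, and the two 1950 records that share a temperature of 0 are ordered by station ID before the 1950 record with temperature 22.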
Sources and Targets
This section covers the different types of sources and targets in Crunch, and how to use
them.
Reading from a source
Crunch pipelines start with one or more Source<T> instances specifying the storage
location and PType<T> of the input data. For the simple case of reading text files, the
readTextFile() method on Pipeline works well; for other types of source, use the read()
method that takes a Source<T> object. In fact, this:
    PCollection<String> lines = pipeline.readTextFile(inputPath);
is shorthand for:
    PCollection<String> lines = pipeline.read(From.textFile(inputPath));
The From class (in the org.apache.crunch.io package) acts as a collection of static
factory methods for file sources, of which text files are just one example.
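The relationship between a convenience method and a general read() that takes a source can be sketched in plain Java. Everything below is a hypothetical stand-in rather than Crunch's API: MiniSource, MiniFrom, and MiniPipeline are invented names for illustration, and the sketch loads the file eagerly into a list for simplicity.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

public class ReadShorthandSketch {

    // Hypothetical stand-in for a source: something that can produce elements of type T.
    interface MiniSource<T> {
        List<T> load();
    }

    // Hypothetical stand-in for a class that only holds static factory methods for sources.
    static final class MiniFrom {
        private MiniFrom() {}

        static MiniSource<String> textFile(String path) {
            return () -> {
                try {
                    return Files.readAllLines(Paths.get(path), StandardCharsets.UTF_8);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            };
        }
    }

    // Hypothetical stand-in for a pipeline with a general read() and a text-file shorthand.
    static final class MiniPipeline {
        <T> List<T> read(MiniSource<T> source) {
            return source.load();
        }

        // The shorthand simply delegates to the general method.
        List<String> readTextFile(String path) {
            return read(MiniFrom.textFile(path));
        }
    }

    // Writes the lines to a temporary file, reads them back both ways,
    // and reports whether the two calls agree with each other and with the input.
    static boolean shorthandMatchesGeneral(List<String> lines) {
        try {
            Path input = Files.createTempFile("lines", ".txt");
            try {
                Files.write(input, lines, StandardCharsets.UTF_8);
                MiniPipeline pipeline = new MiniPipeline();
                List<String> viaShorthand = pipeline.readTextFile(input.toString());
                List<String> viaGeneral = pipeline.read(MiniFrom.textFile(input.toString()));
                return viaShorthand.equals(lines) && viaShorthand.equals(viaGeneral);
            } finally {
                Files.deleteIfExists(input);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(shorthandMatchesGeneral(Arrays.asList("first line", "second line")));
    }
}
```

Keeping the factory methods in one class means that supporting another file format only requires a new factory method; the general read() stays unchanged.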
Another common case is reading sequence files of Writable key-value pairs. In this
case, the source is a TableSource<K, V>, to accommodate key-value pairs, and it returns
a PTable<K, V>. For example, a sequence file containing IntWritable keys and Text
values yields a PTable<Integer, String>:
    TableSource<Integer, String> source = From.sequenceFile(inputPath, Writables.ints(),