The records() factory method returns a Crunch PType for the Avro Reflect data model,
as we have used it here, but it also supports the Avro Specific and Generic data models. If
you wanted to use Avro Specific instead, you would define your custom type using
an Avro schema file, generate the Java class for it, and call records() with the generated
class. For Avro Generic, you would declare the class to be a GenericRecord.
Writables also provides a records() factory method for using custom Writable
types; however, these are more cumbersome to define, since you have to write the
serialization logic yourself (see Implementing a Custom Writable).
With a collection of records in hand, we can use Crunch libraries or our own processing
functions to perform computations on it. For example, this will perform a total sort of the
weather records by the fields in the order they are declared (by year, then by temperature,
then by station ID):
PCollection<WeatherRecord> sortedRecords = Sort.sort(records);
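To make the ordering concrete, here is a minimal plain-Java sketch of the same comparison rule (year first, then temperature, then station ID) applied to an in-memory list. It does not use Crunch: the WeatherRecord class below is a hypothetical stand-in for the type used in the text, and the names WeatherSortSketch, DECLARED_FIELD_ORDER, and sortByDeclaredFields() are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class WeatherSortSketch {

    // Hypothetical stand-in for the WeatherRecord type in the text.
    // Fields are declared in the order: year, temperature, station ID.
    static final class WeatherRecord {
        final int year;
        final int temperature;
        final String stationId;

        WeatherRecord(int year, int temperature, String stationId) {
            this.year = year;
            this.temperature = temperature;
            this.stationId = stationId;
        }

        @Override
        public String toString() {
            return year + "\t" + temperature + "\t" + stationId;
        }
    }

    // Compare field by field, in declaration order.
    static final Comparator<WeatherRecord> DECLARED_FIELD_ORDER =
        Comparator.comparingInt((WeatherRecord r) -> r.year)
                  .thenComparingInt(r -> r.temperature)
                  .thenComparing(r -> r.stationId);

    // Returns a new sorted list; the input list is left untouched.
    static List<WeatherRecord> sortByDeclaredFields(List<WeatherRecord> records) {
        List<WeatherRecord> sorted = new ArrayList<>(records);
        sorted.sort(DECLARED_FIELD_ORDER);
        return sorted;
    }

    public static void main(String[] args) {
        // Made-up sample values, for illustration only.
        List<WeatherRecord> records = Arrays.asList(
            new WeatherRecord(1950, 22, "011990-99999"),
            new WeatherRecord(1949, 111, "012650-99999"),
            new WeatherRecord(1950, 0, "011990-99999"));
        for (WeatherRecord r : sortByDeclaredFields(records)) {
            System.out.println(r);
        }
    }
}
```

Records with the same year are ordered by temperature, and station ID only breaks ties when both year and temperature match. In a real pipeline the sort runs as a distributed job over the whole collection; this sketch only mirrors the comparison rule.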
Sources and Targets
This section covers the different types of sources and targets in Crunch, and how to use
them.
Reading from a source
Crunch pipelines start with one or more Source<T> instances specifying the storage
location and PType<T> of the input data. For the simple case of reading text files, the
readTextFile() method on Pipeline works well; for other types of source, use the
read() method that takes a Source<T> object. In fact, this:
PCollection<String> lines = pipeline.readTextFile(inputPath);
is shorthand for:
PCollection<String> lines = pipeline.read(From.textFile(inputPath));
The From class (in the org.apache.crunch.io package) acts as a collection of static
factory methods for file sources, of which text files are just one example.
Another common case is reading sequence files of Writable key-value pairs. In this
case, the source is a TableSource<K, V>, to accommodate key-value pairs, and it returns
a PTable<K, V>. For example, a sequence file containing IntWritable keys
and Text values yields a PTable<Integer, String>:
TableSource<Integer, String> source =
    From.sequenceFile(inputPath, Writables.ints(),