Writables.strings());
PTable<Integer, String> table = pipeline.read(source);
You can also read Avro datafiles into a PCollection as follows:
Source<WeatherRecord> source =
    From.avroFile(inputPath, Avros.records(WeatherRecord.class));
PCollection<WeatherRecord> records = pipeline.read(source);
Any MapReduce FileInputFormat (in the new MapReduce API) can be used as a
TableSource by means of the formattedFile() method on From, giving Crunch
access to the large number of different Hadoop-supported file formats. There are
also more source implementations in Crunch than the ones exposed in the From class,
including:
AvroParquetFileSource, for reading Parquet files as Avro PTypes.
FromHBase, which has a table() method for reading rows from HBase tables
into PTable<ImmutableBytesWritable, Result> collections.
ImmutableBytesWritable is an HBase class for representing a row key as
bytes, and Result contains the cells from the row scan, which can be configured
to return only cells in particular columns or column families.
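As an illustration of formattedFile(), the following sketch reads line-oriented text through the new-API TextInputFormat, which yields file-offset keys (LongWritable) and line values (Text). It assumes the overload that takes a path, the InputFormat class, and the Writable key and value classes (check the javadoc of your Crunch version), with pipeline and "input" standing in for a Pipeline instance and input path as in the earlier examples:

```java
// Hypothetical use of From.formattedFile() with the new-API TextInputFormat;
// any FileInputFormat with Writable keys and values could be used instead
TableSource<LongWritable, Text> source =
    From.formattedFile("input", TextInputFormat.class,
                       LongWritable.class, Text.class);
PTable<LongWritable, Text> table = pipeline.read(source);
```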
Writing to a target
Writing a PCollection to a Target is as simple as calling PCollection's
write() method with the desired Target. Most commonly, the target is a file, and the
file type can be selected with the static factory methods on the To class. For example, the
following line writes Avro files to a directory called output in the default filesystem:
collection.write(To.avroFile("output"));
This is just a slightly more convenient way of saying:
pipeline.write(collection, To.avroFile("output"));
Since the PCollection is being written to an Avro file, it must have a PType
belonging to the Avro family, or the pipeline will fail.
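The PType family is fixed when the collection is created, so a collection destined for Avro files should be read (or derived) with Avro PTypes. A sketch, assuming the From.textFile() overload that accepts a PType for the line type; inputPath is a stand-in for a real input path:

```java
// Read text lines with an Avro string PType so the collection can later
// be written to Avro files without a family mismatch
PCollection<String> lines =
    pipeline.read(From.textFile(inputPath, Avros.strings()));
lines.write(To.avroFile("output"));
```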
The To factory also has methods for creating text files, sequence files, and any
MapReduce FileOutputFormat. Crunch also has built-in Target implementations for the
Parquet file format (AvroParquetFileTarget) and HBase (ToHBase).
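For example, writing the same collection as text files and as sequence files is just a matter of choosing a different factory method on To (a sketch; the output directory names are arbitrary, and a sequence file target requires the collection's PType to belong to the Writable family):

```java
// Write the collection to two different file formats via the To factory
collection.write(To.textFile("text-output"));
collection.write(To.sequenceFile("seq-output"));
```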