Writables.strings());
PTable<Integer, String> table = pipeline.read(source);
You can also read Avro datafiles into a PCollection as follows:
Source<WeatherRecord> source =
    From.avroFile(inputPath, Avros.records(WeatherRecord.class));
PCollection<WeatherRecord> records = pipeline.read(source);
Any MapReduce FileInputFormat (in the new MapReduce API) can be used as a TableSource by means of the formattedFile() method on From, providing Crunch access to the large number of different Hadoop-supported file formats. There are also more source implementations in Crunch than the ones exposed in the From class, including:
▪ AvroParquetFileSource for reading Parquet files as Avro PTypes.
▪ FromHBase, which has a table() method for reading rows from HBase tables into PTable<ImmutableBytesWritable, Result> collections. ImmutableBytesWritable is an HBase class for representing a row key as bytes, and Result contains the cells from the row scan, which can be configured to return only cells in particular columns or column families.
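As a sketch of how this fits together (assuming the crunch-hbase module is on the classpath; the table name "observations" and the column family "cf" are illustrative, and the Scan-accepting overload of table() is an assumption based on the Crunch API), reading an HBase table restricted to one column family might look like:

```java
import org.apache.crunch.PTable;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.io.hbase.FromHBase;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseReadSketch {
  public static void main(String[] args) {
    Pipeline pipeline = new MRPipeline(HBaseReadSketch.class);
    // Configure the scan to return only cells in one column family
    // (hypothetical family name "cf").
    Scan scan = new Scan();
    scan.addFamily(Bytes.toBytes("cf"));
    // "observations" is a hypothetical HBase table name.
    PTable<ImmutableBytesWritable, Result> rows =
        pipeline.read(FromHBase.table("observations", scan));
    pipeline.done();
  }
}
```

The Scan object is the same one used in the regular HBase client API, so any filtering it supports (columns, families, row ranges) carries over to the Crunch source.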
Writing to a target
Writing a PCollection to a Target is as simple as calling PCollection's write() method with the desired Target. Most commonly, the target is a file, and the file type can be selected with the static factory methods on the To class. For example, the following line writes Avro files to a directory called output in the default filesystem:
collection.write(To.avroFile("output"));
This is just a slightly more convenient way of saying:
pipeline.write(collection, To.avroFile("output"));
Since the PCollection is being written to an Avro file, it must have a PType belonging to the Avro family, or the pipeline will fail.
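To make the family requirement concrete, here is a minimal sketch (the "input" and "output" paths are illustrative, and the PType-accepting overload of From.textFile() is an assumption): the collection is created with an Avro PType, so writing it to an Avro file succeeds, whereas the same pipeline built with Writables.strings() would fail at the write.

```java
import org.apache.crunch.PCollection;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.io.From;
import org.apache.crunch.io.To;
import org.apache.crunch.types.avro.Avros;

public class AvroWriteSketch {
  public static void main(String[] args) {
    Pipeline pipeline = new MRPipeline(AvroWriteSketch.class);
    // Read text lines with an Avro PType, so the collection belongs
    // to the Avro family and can be written as an Avro datafile.
    PCollection<String> lines =
        pipeline.read(From.textFile("input", Avros.strings()));
    lines.write(To.avroFile("output"));
    pipeline.done();
  }
}
```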
The To factory also has methods for creating text files, sequence files, and any MapReduce FileOutputFormat. Crunch also has built-in Target implementations for the Parquet file format (AvroParquetFileTarget) and HBase (ToHBase).
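A hedged sketch of those other To factory methods (the output paths are illustrative, and the formattedFile() signature shown is an assumption based on the Crunch API):

```java
import org.apache.crunch.PTable;
import org.apache.crunch.io.To;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class TargetSketch {
  // Write the same table in three formats (hypothetical output paths).
  public static void writeCounts(PTable<Text, IntWritable> counts) {
    counts.write(To.textFile("counts-text"));     // plain text files
    counts.write(To.sequenceFile("counts-seq"));  // sequence files
    // Any new-API FileOutputFormat can be used as a target:
    counts.write(To.formattedFile("counts-fmt", SequenceFileOutputFormat.class));
  }
}
```

As with sources, the PType of the collection must be compatible with the chosen target format.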