DatumReader<GenericRecord> reader = new GenericDatumReader<GenericRecord>();
DataFileReader<GenericRecord> dataFileReader =
    new DataFileReader<GenericRecord>(file, reader);
assertThat("Schema is the same", schema, is(dataFileReader.getSchema()));
DataFileReader is a regular Java iterator, so we can iterate through its data objects by calling its hasNext() and next() methods. The following snippet checks that there is only one record and that it has the expected field values:
assertThat(dataFileReader.hasNext(), is(true));
GenericRecord result = dataFileReader.next();
assertThat(result.get("left").toString(), is("L"));
assertThat(result.get("right").toString(), is("R"));
assertThat(dataFileReader.hasNext(), is(false));
Rather than using the usual next() method, however, it is preferable to use the overloaded form that takes an instance of the object to be returned (in this case, GenericRecord), since it will reuse the object and save allocation and garbage collection costs for files containing many objects. The following is idiomatic:
GenericRecord record = null;
while (dataFileReader.hasNext()) {
    record = dataFileReader.next(record);
    // process record
}
If object reuse is not important, you can use this shorter form:
for (GenericRecord record : dataFileReader) {
    // process record
}
For the general case of reading a file on a Hadoop filesystem, use Avro's FsInput to specify the input file using a Hadoop Path object. DataFileReader actually offers random access to Avro datafiles (via its seek() and sync() methods); however, in many cases, sequential streaming access is sufficient, for which DataFileStream should be used. DataFileStream can read from any Java InputStream.
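The two access patterns just described can be sketched as follows. This is a minimal sketch, not a definitive recipe: the path "data.avro", the Configuration object, and the class name are placeholders, and the jump to byte offset 0 is only there to illustrate the random-access API.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileStream;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.mapred.FsInput;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

public class AvroReadSketch {
  public static void main(String[] args) throws IOException {
    // Random access on a Hadoop filesystem: FsInput adapts a Hadoop Path to
    // the seekable input that DataFileReader's seek() and sync() require.
    Configuration conf = new Configuration();
    FsInput input = new FsInput(new Path("data.avro"), conf); // placeholder path
    DataFileReader<GenericRecord> fileReader =
        new DataFileReader<GenericRecord>(input, new GenericDatumReader<GenericRecord>());
    fileReader.sync(0); // position at the first sync point at or after byte 0
    fileReader.close();

    // Sequential streaming access: DataFileStream reads from any InputStream.
    InputStream in = new FileInputStream("data.avro");
    DataFileStream<GenericRecord> streamReader =
        new DataFileStream<GenericRecord>(in, new GenericDatumReader<GenericRecord>());
    for (GenericRecord record : streamReader) {
      // process record
    }
    streamReader.close();
  }
}
```

Note that DataFileStream is the right choice when the input is not seekable (for example, data arriving over a socket), while DataFileReader needs a seekable source precisely so that seek() and sync() can work.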