DatumReader<GenericRecord> reader = new GenericDatumReader<GenericRecord>();
DataFileReader<GenericRecord> dataFileReader =
    new DataFileReader<GenericRecord>(file, reader);
assertThat("Schema is the same", schema, is(dataFileReader.getSchema()));
DataFileReader is a regular Java iterator, so we can iterate through its data objects by
calling its hasNext() and next() methods. The following snippet checks that there is
only one record and that it has the expected field values:
assertThat(dataFileReader.hasNext(), is(true));
GenericRecord result = dataFileReader.next();
assertThat(result.get("left").toString(), is("L"));
assertThat(result.get("right").toString(), is("R"));
assertThat(dataFileReader.hasNext(), is(false));
Rather than using the usual next() method, however, it is preferable to use the overloaded form that takes an instance of the object to be returned (in this case, GenericRecord), since it will reuse the object and save allocation and garbage collection costs for files containing many objects. The following is idiomatic:
GenericRecord record = null;
while (dataFileReader.hasNext()) {
  record = dataFileReader.next(record);
  // process record
}
If object reuse is not important, you can use this shorter form:
for (GenericRecord record : dataFileReader) {
  // process record
}
For the general case of reading a file on a Hadoop filesystem, use Avro's FsInput to
specify the input file using a Hadoop Path object. DataFileReader actually offers
random access to Avro datafiles (via its seek() and sync() methods); however, in
many cases, sequential streaming access is sufficient, for which DataFileStream
should be used. DataFileStream can read from any Java InputStream .
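As a minimal sketch of the Hadoop filesystem case, the following reads a datafile through FsInput, which adapts a Hadoop Path to the SeekableInput interface that DataFileReader expects. The HDFS path here is a hypothetical example, and the default Configuration is assumed to point at your cluster:

import java.io.IOException;

import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DatumReader;
import org.apache.avro.mapred.FsInput;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

public class AvroFsRead {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    Path path = new Path("/user/tom/pairs.avro"); // hypothetical path
    // FsInput adapts the Hadoop Path to Avro's SeekableInput interface
    FsInput input = new FsInput(path, conf);
    DatumReader<GenericRecord> reader = new GenericDatumReader<GenericRecord>();
    DataFileReader<GenericRecord> dataFileReader =
        new DataFileReader<GenericRecord>(input, reader);
    try {
      GenericRecord record = null;
      while (dataFileReader.hasNext()) {
        record = dataFileReader.next(record); // reuse the record object
        // process record
      }
    } finally {
      dataFileReader.close();
    }
  }
}

If random access is not needed, the same loop works with DataFileStream constructed from any InputStream (for example, one returned by FileSystem's open() method) in place of the FsInput-backed DataFileReader.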