DatumReader<GenericRecord> reader = new GenericDatumReader<GenericRecord>();
DataFileReader<GenericRecord> dataFileReader =
    new DataFileReader<GenericRecord>(file, reader);
assertThat("Schema is the same", schema, is(dataFileReader.getSchema()));
DataFileReader is a regular Java iterator, so we can iterate through its data objects by calling its hasNext() and next() methods. The following snippet checks that there is only one record and that it has the expected field values:
assertThat(dataFileReader.hasNext(), is(true));
GenericRecord result = dataFileReader.next();
assertThat(result.get("left").toString(), is("L"));
assertThat(result.get("right").toString(), is("R"));
assertThat(dataFileReader.hasNext(), is(false));
Rather than using the usual next() method, however, it is preferable to use the overloaded form that takes an instance of the object to be returned (in this case, GenericRecord), since it will reuse the object and save allocation and garbage collection costs for files containing many objects. The following is idiomatic:
GenericRecord record = null;
while (dataFileReader.hasNext()) {
    record = dataFileReader.next(record);
    // process record
}
If object reuse is not important, you can use this shorter form:
for (GenericRecord record : dataFileReader) {
    // process record
}
For the general case of reading a file on a Hadoop filesystem, use Avro's FsInput to specify the input file using a Hadoop Path object. DataFileReader actually offers random access to Avro datafiles (via its seek() and sync() methods); however, in many cases, sequential streaming access is sufficient, for which DataFileStream should be used. DataFileStream can read from any Java InputStream.
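The two access patterns just described can be sketched as follows. This is a minimal sketch, not a definitive recipe: the path "data.avro", the Configuration object, and the class name are placeholders, and the jump to byte offset 0 is only there to illustrate the random-access API.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileStream;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.mapred.FsInput;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

public class AvroReadSketch {
  public static void main(String[] args) throws IOException {
    // Random access on a Hadoop filesystem: FsInput adapts a Hadoop Path to
    // the seekable input that DataFileReader's seek() and sync() require.
    Configuration conf = new Configuration();
    FsInput input = new FsInput(new Path("data.avro"), conf); // placeholder path
    DataFileReader<GenericRecord> fileReader =
        new DataFileReader<GenericRecord>(input, new GenericDatumReader<GenericRecord>());
    fileReader.sync(0); // position at the first sync point at or after byte 0
    fileReader.close();

    // Sequential streaming access: DataFileStream reads from any InputStream.
    InputStream in = new FileInputStream("data.avro");
    DataFileStream<GenericRecord> streamReader =
        new DataFileStream<GenericRecord>(in, new GenericDatumReader<GenericRecord>());
    for (GenericRecord record : streamReader) {
      // process record
    }
    streamReader.close();
  }
}
```

Note that DataFileStream is the right choice when the input is not seekable (for example, data arriving over a socket), while DataFileReader needs a seekable source precisely so that seek() and sync() can work.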