Avro Datafiles
Avro's object container file format is for storing sequences of Avro objects. It is very similar in design to Hadoop's sequence file format, described in SequenceFile. The main difference is that Avro datafiles are designed to be portable across languages, so, for example, you can write a file in Python and read it in C (we will do exactly this in the next section).
A datafile has a header containing metadata, including the Avro schema and a sync marker, followed by a series of (optionally compressed) blocks containing the serialized Avro objects. Blocks are separated by a sync marker that is unique to the file (the marker for a particular file is found in the header) and that permits rapid resynchronization with a block boundary after seeking to an arbitrary point in the file, such as an HDFS block boundary. Thus, Avro datafiles are splittable, which makes them amenable to efficient MapReduce processing.
Writing Avro objects to a datafile is similar to writing to a stream. We use a
DatumWriter as before, but instead of using an Encoder , we create a
DataFileWriter instance with the DatumWriter . Then we can create a new datafile
(which, by convention, has a .avro extension) and append objects to it:
File file = new File("data.avro");
DatumWriter<GenericRecord> writer =
    new GenericDatumWriter<GenericRecord>(schema);
DataFileWriter<GenericRecord> dataFileWriter =
    new DataFileWriter<GenericRecord>(writer);
dataFileWriter.create(schema, file);
dataFileWriter.append(datum);
dataFileWriter.close();
The objects that we write to the datafile must conform to the file's schema; otherwise, an
exception will be thrown when we call append() .
This example demonstrates writing to a local file (java.io.File in the previous snippet), but we can write to any java.io.OutputStream by using the overloaded create() method on DataFileWriter. To write a file to HDFS, for example, we get an OutputStream by calling create() on FileSystem (see Writing Data).
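As a rough sketch, putting those two pieces together might look like the following fragment (the HDFS path is a placeholder, and schema and datum are the same variables as in the previous snippet):

```java
// Hedged sketch: "/path/to/data.avro" is a placeholder path;
// schema and datum are carried over from the earlier example.
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
OutputStream out = fs.create(new Path("/path/to/data.avro"));

DatumWriter<GenericRecord> writer =
    new GenericDatumWriter<GenericRecord>(schema);
DataFileWriter<GenericRecord> dataFileWriter =
    new DataFileWriter<GenericRecord>(writer);
dataFileWriter.create(schema, out); // the overload taking an OutputStream
dataFileWriter.append(datum);
dataFileWriter.close();
```

Closing the DataFileWriter flushes any buffered blocks and writes the final sync marker, so it should always be called (or the writer used in a try-with-resources statement).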
Reading back objects from a datafile is similar to the earlier case of reading objects from an
in-memory stream, with one important difference: we don't have to specify a schema, since
it is read from the file metadata. Indeed, we can get the schema from the
DataFileReader instance, using getSchema() , and verify that it is the same as the
one we used to write the original object:
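A minimal sketch of that read path, reusing the file and schema variables from the writing example above:

```java
// Hedged fragment: no schema is passed to the GenericDatumReader,
// since DataFileReader supplies the one stored in the file's metadata.
DatumReader<GenericRecord> reader =
    new GenericDatumReader<GenericRecord>();
DataFileReader<GenericRecord> dataFileReader =
    new DataFileReader<GenericRecord>(file, reader);
assert schema.equals(dataFileReader.getSchema());
for (GenericRecord record : dataFileReader) {
    // process each record read back from the datafile
}
dataFileReader.close();
```

DataFileReader is itself iterable, which is why the for-each loop over records works directly on the reader instance.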