Avro - Hadoop: The Definitive Guide

Chapter 12. Avro

Apache Avro [79] is a language-neutral data serialization system. The project was created by Doug Cutting (the creator of Hadoop) to address the major downside of Hadoop Writables: lack of language portability. Having a data format that can be processed by many languages (currently C, C++, C#, Java, JavaScript, Perl, PHP, Python, and Ruby) makes it easier to share datasets with a wider audience than one tied to a single language. It is also more future-proof, allowing data to potentially outlive the language used to read and write it.

But why a new data serialization system? Avro has a set of features that, taken together, differentiate it from other systems such as Apache Thrift or Google's Protocol Buffers. [80] As in these systems and others, Avro data is described using a language-independent schema. However, unlike in some other systems, code generation is optional in Avro, which means you can read and write data that conforms to a given schema even if your code has not seen that particular schema before. To achieve this, Avro assumes that the schema is always present, at both read and write time, which makes for a very compact encoding, since encoded values do not need to be tagged with a field identifier.

Avro schemas are usually written in JSON, and data is usually encoded using a binary format, but there are other options, too. There is a higher-level language called Avro IDL for writing schemas in a C-like language that is more familiar to developers. There is also a JSON-based data encoder, which, being human readable, is useful for prototyping and debugging Avro data.
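To make the compact encoding concrete, the following is a minimal sketch in Python. It is a hand-rolled illustration, not the official Avro library, and it covers only the string, int, and long types. The User schema, its field names, and the helper function names are hypothetical. Because both the writer and the reader hold the schema, the encoded bytes carry only the field values in schema order, with no field names or tags.

```python
import json

# Hypothetical schema for illustration; record and field names are assumptions.
SCHEMA_JSON = """
{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "age", "type": "int"}
  ]
}
"""


def encode_long(n):
    """Encode an int/long as a zigzag, variable-length byte sequence."""
    z = (n << 1) ^ (n >> 63)  # zigzag: small magnitudes become small unsigned values
    out = bytearray()
    while z & ~0x7F:
        out.append((z & 0x7F) | 0x80)  # low 7 bits, continuation bit set
        z >>= 7
    out.append(z)
    return bytes(out)


def decode_long(buf, pos):
    """Decode a zigzag varint starting at pos; return (value, new position)."""
    shift = 0
    z = 0
    while True:
        b = buf[pos]
        pos += 1
        z |= (b & 0x7F) << shift
        if not b & 0x80:
            break
        shift += 7
    return (z >> 1) ^ -(z & 1), pos


def encode_record(schema, record):
    """Write field values in schema order; no field names or tags are emitted."""
    out = bytearray()
    for field in schema["fields"]:
        value = record[field["name"]]
        if field["type"] == "string":
            data = value.encode("utf-8")
            out += encode_long(len(data)) + data  # length prefix, then UTF-8 bytes
        elif field["type"] in ("int", "long"):
            out += encode_long(value)
        else:
            raise ValueError("type not handled in this sketch: %r" % field["type"])
    return bytes(out)


def decode_record(schema, buf):
    """Read values back; the schema alone says which field each value belongs to."""
    pos = 0
    record = {}
    for field in schema["fields"]:
        if field["type"] == "string":
            length, pos = decode_long(buf, pos)
            record[field["name"]] = buf[pos:pos + length].decode("utf-8")
            pos += length
        elif field["type"] in ("int", "long"):
            record[field["name"]], pos = decode_long(buf, pos)
        else:
            raise ValueError("type not handled in this sketch: %r" % field["type"])
    return record


schema = json.loads(SCHEMA_JSON)
encoded = encode_record(schema, {"name": "Ada", "age": 36})
decoded = decode_record(schema, encoded)
print(len(encoded), decoded)
```

The record above occupies five bytes: one for the string length, three for the text, and one for the age. Without the schema, a reader could not tell where one field ends and the next begins, which is the trade-off behind assuming the schema is always present.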

The Avro specification precisely defines the binary format that all implementations must support. It also specifies many of the other features of Avro that implementations should support. One area that the specification does not rule on, however, is APIs: implementations have complete latitude in the APIs they expose for working with Avro data, since each one is necessarily language specific. The fact that there is only one binary format is significant, because it lowers the barrier to implementing a new language binding and avoids a combinatorial explosion of languages and formats, which would harm interoperability.

Avro has rich schema resolution capabilities. Within certain carefully defined constraints, the schema used to read data need not be identical to the schema that was used to write the data. This is the mechanism by which Avro supports schema evolution. For example, a new, optional field may be added to a record by declaring it in the schema used to read the old data. New and old clients alike will be able to read the old data, while new clients can write new data that uses the new field. Conversely, if an old client sees newly encoded
