Chapter 12. Avro
Apache Avro [79] is a language-neutral data serialization system. The project was created by
Doug Cutting (the creator of Hadoop) to address the major downside of Hadoop Writables:
lack of language portability. Having a data format that can be processed by many languages
(currently C, C++, C#, Java, JavaScript, Perl, PHP, Python, and Ruby) makes it easier to
share datasets with a wider audience than one tied to a single language. It is also more
future-proof, allowing data to potentially outlive the language used to read and write it.
But why a new data serialization system? Avro has a set of features that, taken together,
differentiate it from other systems such as Apache Thrift or Google's Protocol Buffers. [80]
As in these systems and others, Avro data is described using a language-independent
schema. However, unlike in some other systems, code generation is optional in Avro,
which means you can read and write data that conforms to a given schema even if your
code has not seen that particular schema before. To achieve this, Avro assumes that the
schema is always present, at both read and write time, which makes for a very compact
encoding, since encoded values do not need to be tagged with a field identifier.
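To make the effect of tag-free encoding concrete, here is a minimal sketch in Python of how a flat record could be serialized following the Avro binary encoding rules for the string and int types (zigzag variable-length integers, and strings as a length followed by UTF-8 bytes). The Person schema and the helper names (encode_long, decode_long, encode_record, decode_record) are hypothetical, chosen for illustration; a real application would use an Avro library rather than this hand-rolled encoder.

```python
import json

# Hypothetical schema, for illustration only (not from the text).
SCHEMA = json.loads("""
{
  "type": "record",
  "name": "Person",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "age", "type": "int"}
  ]
}
""")


def encode_long(n: int) -> bytes:
    """Zigzag-encode a signed integer, then emit it as a base-128 varint."""
    z = (n << 1) ^ (n >> 63)  # zigzag: small magnitudes map to small codes
    out = bytearray()
    while True:
        low7 = z & 0x7F
        z >>= 7
        if z:
            out.append(low7 | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low7)
            return bytes(out)


def decode_long(buf: bytes, pos: int):
    """Return (value, next_position) for a varint starting at buf[pos]."""
    z = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        z |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            return (z >> 1) ^ -(z & 1), pos  # undo the zigzag mapping


def encode_record(schema: dict, datum: dict) -> bytes:
    """Concatenate field values in schema order; no names or tags are written."""
    out = bytearray()
    for field in schema["fields"]:
        value = datum[field["name"]]
        if field["type"] == "string":
            raw = value.encode("utf-8")
            out += encode_long(len(raw)) + raw
        elif field["type"] == "int":
            out += encode_long(value)
        else:
            raise NotImplementedError(field["type"])
    return bytes(out)


def decode_record(schema: dict, buf: bytes) -> dict:
    """The reader relies on the schema to know what each run of bytes means."""
    datum, pos = {}, 0
    for field in schema["fields"]:
        if field["type"] == "string":
            length, pos = decode_long(buf, pos)
            datum[field["name"]] = buf[pos:pos + length].decode("utf-8")
            pos += length
        elif field["type"] == "int":
            datum[field["name"]], pos = decode_long(buf, pos)
        else:
            raise NotImplementedError(field["type"])
    return datum


person = {"name": "Ada", "age": 36}
encoded = encode_record(SCHEMA, person)
print(encoded.hex(" "))  # 06 41 64 61 48
```

The five bytes carry only a string length, the string itself, and the integer. Nothing in the output names a field, which is why decode_record cannot interpret the bytes without the schema.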
Avro schemas are usually written in JSON, and data is usually encoded using a binary
format, but there are other options, too. There is a higher-level language called Avro IDL
for writing schemas in a C-like language that is more familiar to developers. There is also a
JSON-based data encoder, which, being human readable, is useful for prototyping and
debugging Avro data.
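As an illustration of a schema written in JSON, the sketch below parses a small record schema with Python's standard json module and renders a datum in a human-readable JSON form. The StringPair schema and the to_json_text helper are hypothetical; the sketch assumes a flat record with no union fields, for which a JSON rendering is simply an object keyed by field name. It does not use an Avro library.

```python
import json

# Hypothetical schema, for illustration only.
SCHEMA_JSON = """
{
  "type": "record",
  "name": "StringPair",
  "doc": "A pair of strings.",
  "fields": [
    {"name": "left", "type": "string"},
    {"name": "right", "type": "string"}
  ]
}
"""

schema = json.loads(SCHEMA_JSON)
field_names = [f["name"] for f in schema["fields"]]


def to_json_text(schema: dict, datum: dict) -> str:
    """Render a flat record as JSON text, with fields in schema order."""
    missing = [f["name"] for f in schema["fields"] if f["name"] not in datum]
    if missing:
        raise ValueError(f"datum lacks fields required by schema: {missing}")
    ordered = {f["name"]: datum[f["name"]] for f in schema["fields"]}
    return json.dumps(ordered)


text = to_json_text(schema, {"right": "R", "left": "L"})
print(text)  # {"left": "L", "right": "R"}
```

Unlike a binary encoding, this text can be inspected by eye, at the cost of repeating every field name in every datum.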
The Avro specification precisely defines the binary format that all implementations must
support. It also specifies many of the other features of Avro that implementations should
support. One area that the specification does not rule on, however, is APIs: implementations
have complete latitude in the APIs they expose for working with Avro data, since
each one is necessarily language specific. Having only one binary format is
significant: it lowers the barrier to implementing a new language binding
and avoids a combinatorial explosion of languages and formats, which
would harm interoperability.
Avro has rich schema resolution capabilities. Within certain carefully defined constraints,
the schema used to read data need not be identical to the schema that was used to write the
data. This is the mechanism by which Avro supports schema evolution. For example, a
new, optional field may be added to a record by declaring it in the schema used to read the
old data. New and old clients alike will be able to read the old data, while new clients can
write new data that uses the new field. Conversely, if an old client sees newly encoded