Database Reference
Let's say you have some data and you want to share it with someone else. The first thing you
might do is write out the structure of your data, defining things like how many fields there
are and what kind of data those fields contain. In technical terms, that definition could be
called a schema. You would likely share that schema along with your data, and the folks who
are interested in your data might put together a little code to make sure they can read it.
Avro is a system that automates much of that work. You provide it with a schema, and it
builds the code you need to read and write data. Because Avro was designed from the start to
work with Hadoop and big data, it goes to great lengths to store your data as efficiently as
possible.
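To make the idea of a schema concrete: Avro schemas are written in JSON. The sketch below parses one with Python's standard library and checks a record against it by hand. The User record, its two fields, and the conforms helper are hypothetical and only stand in for the kind of work the Avro library automates; this is not Avro's own API.

```python
import json

# A hypothetical Avro-style record schema (Avro schemas are written in JSON).
USER_SCHEMA = json.loads("""
{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "age",  "type": "int"}
  ]
}
""")

# Illustrative subset: map the primitive type names used above to Python types.
PYTHON_TYPES = {"string": str, "int": int}


def conforms(record, schema):
    """Return True if record has exactly the schema's fields with matching types."""
    fields = schema["fields"]
    if set(record) != {f["name"] for f in fields}:
        return False
    return all(
        isinstance(record[f["name"]], PYTHON_TYPES[f["type"]]) for f in fields
    )
```

Calling conforms({"name": "Ada", "age": 36}, USER_SCHEMA) returns True, while a record with a missing field or a string in place of the integer is rejected.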
There are two unique behaviors that differentiate Avro from many other serialization systems
such as Thrift and Protocol Buffers (protobuf):
Runtime assembled
Avro does not require special serialization code to be generated and shared beforehand.
This simplifies the process of deploying applications that span multiple platforms, but
comes at a cost to performance. In some cases, you can work around this and generate
the code beforehand, but you'll need to regenerate and reshare the code every time you
change the format of your data.
Schema-driven
Each data transfer consists of two parts: a schema describing the format of the data and
the data itself. Because the format of the data is defined in the schema, each item does
not need to be tagged. This allows for a dramatic reduction in the overhead associated
with transferring many complex objects, but can actually increase the overhead involved
with transferring a small number of large but simple objects.
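The schema-driven trade-off can be sketched with the standard library alone. The example below is not Avro's actual binary encoding; it is a simplified stand-in, assuming a hypothetical two-field schema, that sends the schema once and then packs each record's values with no field names. It is compared against a tagged encoding (plain JSON) in which every record repeats its field names.

```python
import json
import struct

# Hypothetical schema: field order and types are stated once, up front.
# The second element is a struct format code: i = 32-bit int, d = 64-bit float.
SCHEMA = [("id", "i"), ("score", "d")]


def encode_with_schema(records):
    """Write the schema once, then each record's values with no field tags."""
    header = json.dumps(SCHEMA).encode("utf-8")  # contains no newline
    fmt = "<" + "".join(code for _, code in SCHEMA)
    body = b"".join(
        struct.pack(fmt, *(record[name] for name, _ in SCHEMA))
        for record in records
    )
    return header + b"\n" + body


def decode_with_schema(payload):
    """Read the schema from the payload, then use it to unpack the records."""
    header, body = payload.split(b"\n", 1)
    schema = json.loads(header)
    fmt = "<" + "".join(code for _, code in schema)
    names = [name for name, _ in schema]
    return [dict(zip(names, values)) for values in struct.iter_unpack(fmt, body)]


def encode_tagged(records):
    """Self-describing alternative: every record repeats its field names."""
    return json.dumps(records).encode("utf-8")
```

With a thousand records the schema-driven payload is smaller than the tagged one, because the field names appear only once. With a single record the schema header outweighs the saving, which mirrors the overhead described above for small transfers.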
Tutorial Links
The official Avro documentation page is a great place to get started and provides “getting
started” guides for both Java and Python. If you're more interested in diving straight into
integrating Avro with MapReduce, you can't go wrong with the avro-mr-sample project on
GitHub.
Example Code
Avro supports two general models: