JSON's flexibility also has a significant downside: validation is deferred
to processing time. Because there is no declared schema, and because
different packages tend to produce structurally different but semantically
identical structures, processing applications must often include large
amounts of validation code to ensure that their inputs are reasonable. The
fact that JSON essentially sends its schema, in the form of repeated field
names, along with every message also leads to data streams that require
relatively large bandwidth. Compressing the data stream helps, often
recovering well more than 80 percent of the space, but that very redundancy
suggests that something could be done to improve things further.
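The bandwidth point is easy to demonstrate. The sketch below builds a hypothetical stream of structurally identical JSON records (the field names and values are made up for illustration) and gzips it; because every record repeats the same field names, compression typically recovers well over 80 percent of the bytes.

```python
import gzip
import json

# A hypothetical stream of structurally identical JSON messages:
# every record repeats the same field names ("schema") on the wire.
records = [
    {"user_id": i, "event": "click", "timestamp": 1690000000 + i}
    for i in range(1000)
]
raw = b"\n".join(json.dumps(r).encode("utf-8") for r in records)
compressed = gzip.compress(raw)

# Fraction of the original bytes recovered by compression.
ratio = 1 - len(compressed) / len(raw)
print(f"raw: {len(raw)} bytes, gzipped: {len(compressed)} bytes, "
      f"saved: {ratio:.0%}")
```

The large savings come almost entirely from the repeated field names, which is exactly the redundancy that a schema-based binary format removes at the source.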
If the data is well structured and the problem fairly well understood, one of
the structured wire formats is a possibility. The two most popular are Thrift
and Protocol Buffers (usually called Protobuf). The two formats, the former
developed at Facebook and the latter at Google, are very similar in their
design (not surprising given that they also share developers). Both use an
Interface Definition Language (IDL) to describe a data structure, which is
then translated into encoding and decoding code in a variety of target
languages. This generated code is used to encode and decode messages coming
over the wire. Both formats also provide a mechanism to extend the original
message so that new information can be added.
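A Protobuf IDL definition might look like the sketch below; the message and field names are hypothetical, chosen only to illustrate the structure.

```protobuf
// Illustrative message definition; names and field numbers are made up.
syntax = "proto3";

message LogEvent {
  int64  user_id   = 1;
  string event     = 2;
  int64  timestamp = 3;
  // Extension works by adding fields with fresh field numbers;
  // older readers simply skip tags they do not recognize.
  string region    = 4;
}
```

On the wire each field is identified by its number rather than its name, which is what makes the encoding compact relative to JSON.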
Another, less popular option is the Apache Avro format. In concept it is
quite similar to Protocol Buffers and Thrift: schemas are defined using an
IDL, which in Avro's case happens to be JSON. Rather than relying on code
generation, Avro tends to use dynamic encoders and decoders, but its binary
format is quite similar to Thrift's. The big difference is that, in
addition to the binary format, Avro can also read and write a JSON
representation of its data. This allows for a transition path between an
existing JSON representation, whose informal schema can often be restated
as an explicit Avro schema, and the more compact and well-defined binary
representation.
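Because Avro schemas are themselves JSON, making an informal JSON schema explicit is mostly a matter of writing it down. A minimal sketch, with hypothetical record and field names, might look like this:

```json
{
  "type": "record",
  "name": "LogEvent",
  "fields": [
    {"name": "user_id",   "type": "long"},
    {"name": "event",     "type": "string"},
    {"name": "timestamp", "type": "long"},
    {"name": "region",    "type": ["null", "string"], "default": null}
  ]
}
```

Records matching an informal JSON convention can be validated against such a schema and then re-encoded in Avro's binary form, which drops the repeated field names entirely.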
For the bulk of applications, the collection process is directly integrated
into the edge servers themselves. For new servers, this integrated collection
mechanism likely communicates directly with the data-flow mechanisms
described in the next section. Older servers may or may not integrate
directly with the data-flow mechanism, with options available for both.
These servers are usually application specific, so this topic does not spend
much time on them.