JSON's flexibility also has a significant downside: validation is deferred
to processing time. Because there is no declared schema, and because
different packages tend to produce structurally different but semantically
identical structures, processing applications must often include large
amounts of validation code to ensure that their inputs are reasonable. The
fact that JSON essentially sends its schema, in the form of repeated field
names, along with every message also leads to data streams that require
relatively large bandwidth. Compressing the data stream helps, often
recovering well more than 80 percent of the space, but that very redundancy
suggests that something could be done to improve things further.
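The bandwidth point is easy to demonstrate. The sketch below builds a hypothetical stream of structurally identical JSON records (the field names and values are made up for illustration) and gzips it; because every record repeats the same field names, compression typically recovers well over 80 percent of the bytes.

```python
import gzip
import json

# A hypothetical stream of structurally identical JSON messages:
# every record repeats the same field names ("schema") on the wire.
records = [
    {"user_id": i, "event": "click", "timestamp": 1690000000 + i}
    for i in range(1000)
]
raw = b"\n".join(json.dumps(r).encode("utf-8") for r in records)
compressed = gzip.compress(raw)

# Fraction of the original bytes recovered by compression.
ratio = 1 - len(compressed) / len(raw)
print(f"raw: {len(raw)} bytes, gzipped: {len(compressed)} bytes, "
      f"saved: {ratio:.0%}")
```

The large savings come almost entirely from the repeated field names, which is exactly the redundancy that a schema-based binary format removes at the source.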
If the data is well structured and the problem fairly well understood, one of
the structured wire formats is a possibility. The two most popular are Thrift
and Protocol Buffers (usually called Protobuf). The two formats, the former
developed at Facebook and the latter at Google, are very similar in their
design (not surprising given that they also share developers). Both use an
Interface Definition Language (IDL) to describe a data structure, which is
then translated into encoding and decoding code in a variety of target
languages. This generated code is used to encode and decode messages coming
over the wire. Both formats also provide a mechanism to extend the original
message so that new information can be added.
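A Protobuf IDL definition might look like the sketch below; the message and field names are hypothetical, chosen only to illustrate the structure.

```protobuf
// Illustrative message definition; names and field numbers are made up.
syntax = "proto3";

message LogEvent {
  int64  user_id   = 1;
  string event     = 2;
  int64  timestamp = 3;
  // Extension works by adding fields with fresh field numbers;
  // older readers simply skip tags they do not recognize.
  string region    = 4;
}
```

On the wire each field is identified by its number rather than its name, which is what makes the encoding compact relative to JSON.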
Another, less popular option is the Apache Avro format. In concept it is
quite similar to Protocol Buffers and Thrift: schemas are defined using an
IDL, which in Avro's case happens to be JSON. Rather than relying on code
generation, Avro tends to use dynamic encoders and decoders, but its binary
format is quite similar to Thrift's. The big difference is that, in
addition to the binary format, Avro can also read and write a JSON
representation of its data. This allows for a transition path between an
existing JSON representation, whose informal schema can often be restated
as an explicit Avro schema, and the more compact and well-defined binary
representation.
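Because Avro schemas are themselves JSON, making an informal JSON schema explicit is mostly a matter of writing it down. A minimal sketch, with hypothetical record and field names, might look like this:

```json
{
  "type": "record",
  "name": "LogEvent",
  "fields": [
    {"name": "user_id",   "type": "long"},
    {"name": "event",     "type": "string"},
    {"name": "timestamp", "type": "long"},
    {"name": "region",    "type": ["null", "string"], "default": null}
  ]
}
```

Records matching an informal JSON convention can be validated against such a schema and then re-encoded in Avro's binary form, which drops the repeated field names entirely.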
For the bulk of applications, the collection process is directly integrated
into the edge servers themselves. For new servers, this integrated collection
mechanism likely communicates directly with the data-flow mechanisms
described in the next section. Older servers may or may not integrate
directly with the data-flow mechanism, with options available for both.
These servers are usually application specific, so this topic does not spend
much time on them.