Data systems often make use of many machines across a network. Some machines
specialize in collecting data from a large number of inputs quickly. Others are tasked
with running batch processes that analyze the data. When building systems that move
data from one place to another, sending the output of one process to the input of
another is a common task. Unfortunately, the network is always going to be the slowest
step in this process. To make data transfer efficient, it's beneficial to represent data
in the most compact way possible.
Earlier in this chapter, we saw that XML is great for converting from one document
format to another but is not always the best choice for data interoperability, especially
when data sizes are very large. Recall that JSON shares this characteristic with XML.
The markup in XML and JSON produces files larger than the data that either format
represents, which means more time is spent physically moving data from one place to
another. Although it's possible to compress files in these formats before sending them
off, the developer still has to handle the compression and decompression steps.
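As a rough illustration (the record and field layout are invented for this example),
the following Python sketch compares the size of a small record encoded as JSON, as
gzip-compressed JSON, and as a hand-packed binary structure. For a record this small,
the gzip header overhead can outweigh the savings; the point is simply that the binary
form carries no markup at all, while compression is an extra step the developer must
manage on both ends.

    import gzip
    import json
    import struct

    # A hypothetical record: a user ID, a timestamp, and a temperature reading.
    record = {"user_id": 42, "timestamp": 1361212800, "temp_c": 21.5}

    # JSON carries the field names and punctuation along with the values.
    as_json = json.dumps(record).encode("utf-8")

    # Compression shrinks larger payloads, but sender and receiver must both
    # remember to compress and decompress.
    as_json_gz = gzip.compress(as_json)

    # A packed binary encoding stores only the values themselves:
    # two unsigned 32-bit integers and a double, big-endian.
    as_binary = struct.pack(">IId", record["user_id"],
                            record["timestamp"], record["temp_c"])

    print(len(as_json), len(as_json_gz), len(as_binary))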
As Internet companies began to deal with the challenges of web-scale data, they
quickly realized that the overhead of moving data between systems could result in
considerable cost and latency. Similarly, they found that systems built with a variety
of technologies (for example, using C++ for some applications and Python for others)
benefited from a common format for passing data back and forth.
A naïve approach to this problem would be to convert data into some type of byte
array: basically, a binary representation of the data. This approach might reduce the
size of the data, but, unfortunately, each system would have to know beforehand
exactly how the data was serialized so that it could later be deserialized. Hard-coding
the encoding and decoding functions into each application would work, but it would
become problematic if anything about the data model changed. If an application pipeline
depended on multiple systems built with different programming languages, the work
involved in changing these functions would be considerable.
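To make the brittleness concrete, here is a minimal Python sketch of the hard-coded
approach (the fields are invented for this example): the sender packs values in a fixed
order, and the receiver must unpack them with exactly the same format string, so any
change to the data model has to be applied to every encoder and decoder by hand.

    import struct

    # Sender: pack a (user_id, timestamp) pair into 8 bytes.
    # The format string ">II" is effectively the schema, hard-coded here.
    def encode(user_id, timestamp):
        return struct.pack(">II", user_id, timestamp)

    # Receiver: must use the identical format string, or the bytes are misread.
    def decode(payload):
        user_id, timestamp = struct.unpack(">II", payload)
        return {"user_id": user_id, "timestamp": timestamp}

    message = encode(42, 1361212800)
    print(decode(message))

    # If the data model changes (say, a temperature field is added), every
    # encoder and decoder in every language must be updated in lockstep:
    #     struct.pack(">IId", user_id, timestamp, temp_c)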
Several solutions to this problem were created independently, but they all work
in similar ways. In general, the first step is to provide a description of the data, or
schema, and define it somewhere that is common to both sender and receiver. The
second step is to generate, from that schema, a standard serialization interface with
programming-language support for both the message sender and the receiver.
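As a toy illustration of that pattern, and not of any particular library, the sketch
below keeps a single schema definition that both sender and receiver would import and
derives the packing logic from it, so a change to the data model is made in one place.
All names are hypothetical.

    import struct

    # Shared schema: a single definition that both sender and receiver import.
    # Field names, order, and types live here, not in the application code.
    SCHEMA = [("user_id", "I"), ("timestamp", "I"), ("temp_c", "d")]
    FORMAT = ">" + "".join(fmt for _, fmt in SCHEMA)

    def serialize(record):
        """Pack a dict into bytes according to the shared schema."""
        return struct.pack(FORMAT, *(record[name] for name, _ in SCHEMA))

    def deserialize(payload):
        """Unpack bytes into a dict according to the shared schema."""
        values = struct.unpack(FORMAT, payload)
        return {name: value for (name, _), value in zip(SCHEMA, values)}

    message = serialize({"user_id": 42, "timestamp": 1361212800, "temp_c": 21.5})
    print(deserialize(message))

Real serialization frameworks go much further than this, adding versioning, optional
fields, and code generation for many languages, but the shared-schema idea is the same.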
Apache Thrift and Protocol Buffers
When data sizes grow very large, the overhead of data transfer between systems can
add up. This might not be an issue at the megabyte scale, but as data grows to
gigabytes and beyond, the cost and latency of moving data back and forth can become
a huge issue. Two different technologies that take similar approaches to solving this
problem are Apache Thrift and Protocol Buffers.
Apache Thrift is an open-source project originally developed at Facebook to provide
a generic solution for data serialization. Thrift allows developers to write an interface
definition file that describes the data to be serialized. A code generator is then run to
produce, in the specified language, the code that handles serialization (and, optionally,
an RPC client and server).
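As a hedged sketch of what such a definition might look like (the struct and field
names here are invented), a Thrift definition file and the command that generates
Python bindings from it could be:

    // user_event.thrift -- a hypothetical Thrift definition file
    struct UserEvent {
      1: required string user_id,
      2: required i64 timestamp,
      3: optional double temp_c
    }

    # Generate Python (or cpp, java, ...) serialization code from the definition:
    #   thrift --gen py user_event.thrift

The generated classes know how to write themselves to and read themselves from
Thrift's binary protocols, so application code on either end never hard-codes the
byte layout.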
 