Data systems often make use of many machines across a network. Some machines
specialize in collecting data from a large number of inputs quickly. Others are tasked
with running batch processes that analyze the data. When building systems that move
data from one place to another, sending the output of one process to the input of
another is a common task. Unfortunately, the network is always going to be the slowest
step in this process. To make data transfer efficient, it's beneficial to represent data
in the most compact way possible.
Earlier in this chapter, we saw that XML is great for converting from one document
format to another but is not always the best choice for data interoperability, especially
when data sizes are very large. Recall that JSON shares this characteristic with XML.
The markup in XML and JSON produces files larger than the data that either format
represents, which means more time is spent physically moving data from one place to
another. Although it's possible to compress files in these formats before sending them
off, the developer still has to handle the compression and decompression steps.
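As a rough illustration (the record and field layout are invented for this example),
the following Python sketch compares the size of a small record encoded as JSON, as
gzip-compressed JSON, and as a hand-packed binary structure. For a record this small,
the gzip header overhead can outweigh the savings; the point is simply that the binary
form carries no markup at all, while compression is an extra step the developer must
manage on both ends.

    import gzip
    import json
    import struct

    # A hypothetical record: a user ID, a timestamp, and a temperature reading.
    record = {"user_id": 42, "timestamp": 1361212800, "temp_c": 21.5}

    # JSON carries the field names and punctuation along with the values.
    as_json = json.dumps(record).encode("utf-8")

    # Compression shrinks larger payloads, but sender and receiver must both
    # remember to compress and decompress.
    as_json_gz = gzip.compress(as_json)

    # A packed binary encoding stores only the values themselves:
    # two unsigned 32-bit integers and a double, big-endian.
    as_binary = struct.pack(">IId", record["user_id"],
                            record["timestamp"], record["temp_c"])

    print(len(as_json), len(as_json_gz), len(as_binary))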
As Internet companies began to deal with the challenges of web-scale data, they
quickly realized that the overhead of moving data between systems could result in
considerable cost and latency. Similarly, they found that systems built with a variety
of technologies (for example, using C++ for some applications and Python for others)
benefited from a common format for passing data back and forth.
A naïve approach to this problem would be to convert data into some type of byte
array: basically, a binary representation of the data. This approach might reduce the
size of the data, but, unfortunately, each system would have to know beforehand
exactly how the data was serialized so that it could later be deserialized. Hard-coding
the encoding and decoding functions into each application would work, but it would
become problematic if anything about the data model changed. If an application pipeline
depended on multiple systems built with different programming languages, the work
involved in changing these functions would be considerable.
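To make the brittleness concrete, here is a minimal Python sketch of the hard-coded
approach (the fields are invented for this example): the sender packs values in a fixed
order, and the receiver must unpack them with exactly the same format string, so any
change to the data model has to be applied to every encoder and decoder by hand.

    import struct

    # Sender: pack a (user_id, timestamp) pair into 8 bytes.
    # The format string ">II" is effectively the schema, hard-coded here.
    def encode(user_id, timestamp):
        return struct.pack(">II", user_id, timestamp)

    # Receiver: must use the identical format string, or the bytes are misread.
    def decode(payload):
        user_id, timestamp = struct.unpack(">II", payload)
        return {"user_id": user_id, "timestamp": timestamp}

    message = encode(42, 1361212800)
    print(decode(message))

    # If the data model changes (say, a temperature field is added), every
    # encoder and decoder in every language must be updated in lockstep:
    #     struct.pack(">IId", user_id, timestamp, temp_c)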
Several solutions to this problem were created independently, but they all work
in similar ways. In general, the first step is to provide a description of the data, or
schema, and define it somewhere that is common to both sender and receiver. The
second step is to generate, from that schema, a standard serialization interface with
programming-language support for both the message sender and the receiver.
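As a toy illustration of that pattern, and not of any particular library, the sketch
below keeps a single schema definition that both sender and receiver would import and
derives the packing logic from it, so a change to the data model is made in one place.
All names are hypothetical.

    import struct

    # Shared schema: a single definition that both sender and receiver import.
    # Field names, order, and types live here, not in the application code.
    SCHEMA = [("user_id", "I"), ("timestamp", "I"), ("temp_c", "d")]
    FORMAT = ">" + "".join(fmt for _, fmt in SCHEMA)

    def serialize(record):
        """Pack a dict into bytes according to the shared schema."""
        return struct.pack(FORMAT, *(record[name] for name, _ in SCHEMA))

    def deserialize(payload):
        """Unpack bytes into a dict according to the shared schema."""
        values = struct.unpack(FORMAT, payload)
        return {name: value for (name, _), value in zip(SCHEMA, values)}

    message = serialize({"user_id": 42, "timestamp": 1361212800, "temp_c": 21.5})
    print(deserialize(message))

Real serialization frameworks go much further than this, adding versioning, optional
fields, and code generation for many languages, but the shared-schema idea is the same.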
Apache Thrift and Protocol Buffers
When data sizes grow very large, the overhead of data transfer between systems can
add up. This might not be an issue at the megabyte scale, but as data grows to
gigabytes and beyond, the cost and latency of moving data back and forth can become
a huge issue. Two different technologies that take similar approaches to solving this
problem are Apache Thrift and Protocol Buffers.
Apache Thrift is an open-source project originally developed at Facebook to provide
a generic solution for data serialization. Thrift allows developers to write an interface
definition file that describes the data to be serialized. A code generator is then run to
produce, in the specified language, the code that handles serialization (and, optionally,
an RPC client and server).
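As a hedged sketch of what such a definition might look like (the struct and field
names here are invented), a Thrift definition file and the command that generates
Python bindings from it could be:

    // user_event.thrift -- a hypothetical Thrift definition file
    struct UserEvent {
      1: required string user_id,
      2: required i64 timestamp,
      3: optional double temp_c
    }

    # Generate Python (or cpp, java, ...) serialization code from the definition:
    #   thrift --gen py user_event.thrift

The generated classes know how to write themselves to and read themselves from
Thrift's binary protocols, so application code on either end never hard-codes the
byte layout.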
 