Hosting and Sharing Terabytes of Raw Data - Data Just Right: Introduction to Large-Scale Data and Analytics

Google's Protocol Buffers are very similar to Thrift, and the Protocol Buffers compiler is also available as open-source software.

Like Thrift, Protocol Buffers are used in a variety of projects. While Protocol Buffers are generally used for moving data around, some organizations are using them for persistent files. For example, OpenStreetMap is currently converting its available map data from XML to a static format known as PBF (protocol buffer binary format).

Apache Avro

Apache Avro is a relatively new data serialization format that combines some of the best features of all the technologies we have described previously. Unlike Apache Thrift or Protocol Buffers, Avro data is self-describing and uses a JSON schema to describe each individual field. However, unlike XML or JSON, this schema is not repeated with every field but is instead provided once, as a JSON object included with the data itself. Avro also natively supports compression, so developers don't have to spend much time building this logic themselves.
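
The idea of shipping a schema once with the data, rather than repeating field names in every record, can be sketched with nothing but Python's standard library. This is only an illustration of the principle, not Avro's actual binary encoding or file layout: the schema below follows Avro's JSON schema syntax, but the record name, field names, and the pack and unpack helpers are all hypothetical.

```python
import json

# Hypothetical schema written in Avro's JSON schema syntax. The record and
# field names are invented for this illustration.
SCHEMA = {
    "type": "record",
    "name": "Station",
    "fields": [
        {"name": "id", "type": "int"},
        {"name": "name", "type": "string"},
        {"name": "elevation_m", "type": "double"},
    ],
}

# Made-up sample data: 1,000 records that all share the same three fields.
records = [
    {"id": i, "name": f"station-{i}", "elevation_m": i * 0.5}
    for i in range(1000)
]


def pack(schema, rows):
    """Store the schema once, followed by bare field values for each record."""
    names = [field["name"] for field in schema["fields"]]
    return json.dumps(
        {"schema": schema, "rows": [[row[name] for name in names] for row in rows]}
    )


def unpack(payload):
    """Rebuild full records using only the schema shipped with the data."""
    doc = json.loads(payload)
    names = [field["name"] for field in doc["schema"]["fields"]]
    return [dict(zip(names, row)) for row in doc["rows"]]


per_record = json.dumps(records)     # field names repeated in every record
schema_once = pack(SCHEMA, records)  # field names stated once, in the schema

print(len(per_record), len(schema_once))
```

A reader needs no outside information to interpret the packed payload, because the schema travels with it, yet the payload is smaller than plain JSON because the field names appear only once. A real Avro library additionally encodes the values in a compact binary form and can compress them.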

Avro is currently used in the Hadoop community, but it deserves to be more widely used in other applications. Some developers are exploring the possibility of reading and writing data in the Avro format as static files. Programming language support for Avro has been growing, and over time it may become more of a de facto standard for data serialization.

Summary

Sharing a large set of documents seems like a simple task, but the reality is that many organizations that should be sharing data in the most accessible ways possible simply aren't.

Are you hosting a massive collection of files? In order to maintain scalability and affordability in the face of massive amounts of storage and bandwidth, distributed storage services provided by infrastructure-as-a-service providers are generally the only economical solution.

Choosing a file format for your hosted files can be a challenge, but not if you know how your audience may use your data. CSV files are trivial to parse and are universally supported by everything from spreadsheets to database software to programming languages. However, CSV cannot easily model complex data structures. XML is a great format for providing an unambiguous source of data that can be converted from one document format to another. It is well supported and is best used for giving document data an unambiguous structure. However, it is not necessarily the best format for software developers who wish to quickly serialize, transfer, and parse data. Although JSON is not as extensible as XML or as simple as CSV, its ease of use for data transfer makes it a popular format for programmers and nonrelational database administrators.
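
To make these trade-offs concrete, here is a minimal sketch that writes one record in all three formats using Python's standard library and reads it back. The record, its field names, and the ";" separator used to squeeze a list into a CSV cell are all made up for illustration.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# Hypothetical record with one nested, repeating field.
record = {"id": 1, "name": "Main Street", "tags": ["residential", "paved"]}

# CSV: flat rows only, so the list is squeezed into a single cell with an
# ad hoc ";" separator that every consumer has to know about.
buf = io.StringIO(newline="")
writer = csv.writer(buf)
writer.writerow(["id", "name", "tags"])
writer.writerow([record["id"], record["name"], ";".join(record["tags"])])
csv_text = buf.getvalue()

# JSON: the nesting maps directly onto objects and lists.
json_text = json.dumps(record)

# XML: the structure is explicit, at the cost of opening and closing tags.
road = ET.Element("road", id=str(record["id"]))
ET.SubElement(road, "name").text = record["name"]
for tag in record["tags"]:
    ET.SubElement(road, "tag").text = tag
xml_text = ET.tostring(road, encoding="unicode")

# Reading the data back shows what each format preserves.
csv_row = next(csv.DictReader(io.StringIO(csv_text, newline="")))
json_back = json.loads(json_text)
xml_back = [element.text for element in ET.fromstring(xml_text).findall("tag")]

print(csv_row)
print(json_back)
print(xml_back)
```

The CSV row comes back as plain strings, with the list still packed into one cell. JSON restores the original types and nesting, and XML restores the repeated elements but spends extra characters on markup to do so.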

Both XML and JSON formats carry a lot of baggage from the markup used to describe data. As data needs scale, more large data source providers are moving from
