Chapter 13. Parquet
Apache Parquet is a columnar storage format that can efficiently store nested data.
Columnar formats are attractive since they enable greater efficiency, in terms of both file
size and query performance. File sizes are usually smaller than row-oriented equivalents
since in a columnar format the values from one column are stored next to each other, which
usually allows a very efficient encoding. A column storing a timestamp, for example, can
be encoded by storing the first value and the differences between subsequent values (which
tend to be small due to temporal locality: records from around the same time are stored
next to each other). Query performance is improved too since a query engine can skip over
columns that are not needed to answer a query. (This idea is illustrated in Figure 5-4.) This
chapter looks at Parquet in more depth, but there are other columnar formats that work with
Hadoop — notably ORCFile (Optimized Row Columnar File), which is a part of the
Hive project.
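To make the timestamp example concrete, here is a minimal Java sketch of delta encoding. It illustrates the idea only: Parquet's real delta encodings add block structure and bit packing on top of this, and the class name and values below are invented for illustration.

public class DeltaEncodingSketch {

  // Encode a column of timestamps as the first value followed by the
  // differences between consecutive values. For time-ordered data the
  // differences are small, so they encode much more compactly than raw values.
  static long[] encode(long[] timestamps) {
    long[] deltas = new long[timestamps.length];
    for (int i = 0; i < timestamps.length; i++) {
      deltas[i] = (i == 0) ? timestamps[0] : timestamps[i] - timestamps[i - 1];
    }
    return deltas;
  }

  // Reverse the encoding by accumulating the deltas back into absolute values.
  static long[] decode(long[] deltas) {
    long[] timestamps = new long[deltas.length];
    for (int i = 0; i < deltas.length; i++) {
      timestamps[i] = (i == 0) ? deltas[0] : timestamps[i - 1] + deltas[i];
    }
    return timestamps;
  }

  public static void main(String[] args) {
    long[] ts = {1400000000L, 1400000003L, 1400000004L, 1400000010L};
    long[] deltas = encode(ts); // {1400000000, 3, 1, 6}
    System.out.println(java.util.Arrays.toString(deltas));
    System.out.println(java.util.Arrays.equals(decode(deltas), ts)); // true
  }
}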
A key strength of Parquet is its ability to store data that has a deeply nested structure in true
columnar fashion. This is important since schemas with several levels of nesting are
common in real-world systems. Parquet uses a novel technique for storing nested structures in a
flat columnar format with little overhead, which was introduced by Google engineers in the
Dremel paper.[86] The result is that even nested fields can be read independently of other
fields, which can yield significant performance improvements.
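To show what storing nested data in a flat columnar format looks like in practice, the following sketch uses the Java implementation's schema API to define a nested schema and list the flat leaf columns derived from it. The package names follow recent parquet-mr releases (older releases used the parquet.* prefix), and the AddressBook schema is made up for illustration.

import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

public class NestedSchemaSketch {
  public static void main(String[] args) {
    // A schema with one level of repeated, nested structure.
    MessageType schema = MessageTypeParser.parseMessageType(
        "message AddressBook {\n" +
        "  required binary owner (UTF8);\n" +
        "  repeated group contacts {\n" +
        "    required binary name (UTF8);\n" +
        "    optional binary phoneNumber (UTF8);\n" +
        "  }\n" +
        "}");
    // Every leaf primitive becomes its own column. The definition and
    // repetition levels described in the Dremel paper record how each
    // value nests and repeats, so columns can be read independently.
    System.out.println(schema.getColumns());
  }
}

Running this prints three leaf columns (owner, contacts.name, and contacts.phoneNumber), each of which Parquet stores separately.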
Another feature of Parquet is the large number of tools that support it as a format. The
engineers at Twitter and Cloudera who created Parquet wanted it to be easy to try new tools to
process existing data, so to facilitate this they divided the project into a specification
(parquet-format), which defines the file format in a language-neutral way, and implementations
of the specification for different languages (Java and C++) that make it easy for tools
to read or write Parquet files. In fact, most of the data processing components covered in
this book understand the Parquet format (MapReduce, Pig, Hive, Cascading, Crunch, and
Spark). This flexibility also extends to the in-memory representation: the Java implementation
is not tied to a single representation, so you can use the in-memory data models of Avro,
Thrift, or Protocol Buffers to read and write Parquet files.
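As a concrete example, here is a hedged sketch of writing and reading a Parquet file through the Avro in-memory model via the parquet-avro module. The simple constructors shown match older releases (newer releases deprecate them in favor of builder methods), and the Pair schema and file name are invented for illustration.

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.avro.AvroParquetWriter;

public class AvroParquetSketch {
  public static void main(String[] args) throws Exception {
    // Define the in-memory data model with an Avro schema.
    Schema schema = new Schema.Parser().parse(
        "{\"type\": \"record\", \"name\": \"Pair\", \"fields\": ["
        + "{\"name\": \"left\", \"type\": \"string\"},"
        + "{\"name\": \"right\", \"type\": \"string\"}]}");
    Path path = new Path("data.parquet");

    // Write an Avro GenericRecord as a row in a Parquet file.
    AvroParquetWriter<GenericRecord> writer =
        new AvroParquetWriter<GenericRecord>(path, schema);
    GenericRecord record = new GenericData.Record(schema);
    record.put("left", "L");
    record.put("right", "R");
    writer.write(record);
    writer.close();

    // Read it back through the same in-memory model.
    AvroParquetReader<GenericRecord> reader =
        new AvroParquetReader<GenericRecord>(path);
    System.out.println(reader.read()); // {"left": "L", "right": "R"}
    reader.close();
  }
}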