Chapter 13. Parquet
Apache Parquet is a columnar storage format that can efficiently store nested data.
Columnar formats are attractive since they enable greater efficiency, in terms of both file
size and query performance. File sizes are usually smaller than row-oriented equivalents
since in a columnar format the values from one column are stored next to each other, which
usually allows a very efficient encoding. A column storing a timestamp, for example, can
be encoded by storing the first value and the differences between subsequent values (which
tend to be small due to temporal locality: records from around the same time are stored
next to each other). Query performance is improved too since a query engine can skip over
columns that are not needed to answer a query. (This idea is illustrated in Figure 5-4.) This
chapter looks at Parquet in more depth, but there are other columnar formats that work with
Hadoop — notably ORCFile (Optimized Row Columnar File), which is a part of the
Hive project.
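To make the timestamp example concrete, here is a minimal Java sketch of delta encoding. It illustrates the idea only: Parquet's real delta encodings add block structure and bit packing on top of this, and the class name and values below are invented for illustration.

public class DeltaEncodingSketch {

  // Encode a column of timestamps as the first value followed by the
  // differences between consecutive values. For time-ordered data the
  // differences are small, so they encode much more compactly than raw values.
  static long[] encode(long[] timestamps) {
    long[] deltas = new long[timestamps.length];
    for (int i = 0; i < timestamps.length; i++) {
      deltas[i] = (i == 0) ? timestamps[0] : timestamps[i] - timestamps[i - 1];
    }
    return deltas;
  }

  // Reverse the encoding by accumulating the deltas back into absolute values.
  static long[] decode(long[] deltas) {
    long[] timestamps = new long[deltas.length];
    for (int i = 0; i < deltas.length; i++) {
      timestamps[i] = (i == 0) ? deltas[0] : timestamps[i - 1] + deltas[i];
    }
    return timestamps;
  }

  public static void main(String[] args) {
    long[] ts = {1400000000L, 1400000003L, 1400000004L, 1400000010L};
    long[] deltas = encode(ts); // {1400000000, 3, 1, 6}
    System.out.println(java.util.Arrays.toString(deltas));
    System.out.println(java.util.Arrays.equals(decode(deltas), ts)); // true
  }
}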
A key strength of Parquet is its ability to store data that has a deeply nested structure in true
columnar fashion. This is important since schemas with several levels of nesting are
common in real-world systems. Parquet uses a novel technique for storing nested structures in a
flat columnar format with little overhead, which was introduced by Google engineers in the
Dremel paper.[86] The result is that even nested fields can be read independently of other
fields, which can yield significant performance improvements.
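To show what storing nested data in a flat columnar format looks like in practice, the following sketch uses the Java implementation's schema API to define a nested schema and list the flat leaf columns derived from it. The package names follow recent parquet-mr releases (older releases used the parquet.* prefix), and the AddressBook schema is made up for illustration.

import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

public class NestedSchemaSketch {
  public static void main(String[] args) {
    // A schema with one level of repeated, nested structure.
    MessageType schema = MessageTypeParser.parseMessageType(
        "message AddressBook {\n" +
        "  required binary owner (UTF8);\n" +
        "  repeated group contacts {\n" +
        "    required binary name (UTF8);\n" +
        "    optional binary phoneNumber (UTF8);\n" +
        "  }\n" +
        "}");
    // Every leaf primitive becomes its own column. The definition and
    // repetition levels described in the Dremel paper record how each
    // value nests and repeats, so columns can be read independently.
    System.out.println(schema.getColumns());
  }
}

Running this prints three leaf columns (owner, contacts.name, and contacts.phoneNumber), each of which Parquet stores separately.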
Another feature of Parquet is the large number of tools that support it as a format. The
engineers at Twitter and Cloudera who created Parquet wanted it to be easy to try new tools to
process existing data, so to facilitate this they divided the project into a specification
(parquet-format), which defines the file format in a language-neutral way, and implementations
of the specification for different languages (Java and C++) that make it easy for tools
to read or write Parquet files. In fact, most of the data processing components covered in
this book understand the Parquet format (MapReduce, Pig, Hive, Cascading, Crunch, and
Spark). This flexibility also extends to the in-memory representation: the Java implementation
is not tied to a single representation, so you can use the in-memory data models of Avro,
Thrift, or Protocol Buffers to read and write Parquet files.
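As a concrete example, here is a hedged sketch of writing and reading a Parquet file through the Avro in-memory model via the parquet-avro module. The simple constructors shown match older releases (newer releases deprecate them in favor of builder methods), and the Pair schema and file name are invented for illustration.

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.avro.AvroParquetWriter;

public class AvroParquetSketch {
  public static void main(String[] args) throws Exception {
    // Define the in-memory data model with an Avro schema.
    Schema schema = new Schema.Parser().parse(
        "{\"type\": \"record\", \"name\": \"Pair\", \"fields\": ["
        + "{\"name\": \"left\", \"type\": \"string\"},"
        + "{\"name\": \"right\", \"type\": \"string\"}]}");
    Path path = new Path("data.parquet");

    // Write an Avro GenericRecord as a row in a Parquet file.
    AvroParquetWriter<GenericRecord> writer =
        new AvroParquetWriter<GenericRecord>(path, schema);
    GenericRecord record = new GenericData.Record(schema);
    record.put("left", "L");
    record.put("right", "R");
    writer.write(record);
    writer.close();

    // Read it back through the same in-memory model.
    AvroParquetReader<GenericRecord> reader =
        new AvroParquetReader<GenericRecord>(path);
    System.out.println(reader.read()); // {"left": "L", "right": "R"}
    reader.close();
  }
}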