Parquet - Hadoop: The Definitive Guide

Database Reference

In-Depth Information

as a single value and the count), and dictionary encoding (a dictionary of values is built

and itself encoded, then values are encoded as integers representing the indexes in the dic-

tionary). In most cases, it also applies techniques such as bit packing to save space by

storing several small values in a single byte.

When writing files, Parquet will choose an appropriate encoding automatically, based on

the column type. For example, Boolean values will be written using a combination of run-

length encoding and bit packing. Most types are encoded using dictionary encoding by de-

fault; however, a plain encoding will be used as a fallback if the dictionary becomes too

large. The threshold size at which this happens is referred to as the dictionary page size

and is the same as the page size by default (so the dictionary has to fit into one page if it is

to be used). Note that the encoding that is actually used is stored in the file metadata to en-

sure that readers use the correct encoding.

In addition to the encoding, a second level of compression can be applied using a standard

compression algorithm on the encoded page bytes. By default, no compression is applied,

but Snappy, gzip, and LZO compressors are all supported.

For nested data, each page will also store the definition and repetition levels for all the

values in the page. Since levels are small integers (the maximum is determined by the

amount of nesting specified in the schema), they can be very efficiently encoded using a

bit-packed run-length encoding.

Search WWH ::

Custom Search

Home