as a single value and the count), and dictionary encoding (a dictionary of values is built
and itself encoded, then values are encoded as integers representing the indexes in the
dictionary). In most cases, it also applies techniques such as bit packing to save space by
storing several small values in a single byte.
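As a rough illustration of the three techniques named above, the following Python sketch implements run-length encoding, dictionary encoding, and bit packing on a small list. It is a toy model under simplifying assumptions, not Parquet's actual on-disk format, and the function names (run_length_encode, dictionary_encode, bit_pack) are hypothetical.

```python
def run_length_encode(values):
    """Collapse each run of equal values into a (value, count) pair."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return [tuple(r) for r in runs]


def dictionary_encode(values):
    """Store each distinct value once; encode the data as indexes into it."""
    dictionary = []
    positions = {}
    indexes = []
    for v in values:
        if v not in positions:
            positions[v] = len(dictionary)
            dictionary.append(v)
        indexes.append(positions[v])
    return dictionary, indexes


def bit_pack(indexes, bit_width):
    """Pack small non-negative integers into bytes, bit_width bits each.

    Assumes every index fits in bit_width bits (illustrative layout only).
    """
    buffer = 0
    for position, index in enumerate(indexes):
        buffer |= index << (position * bit_width)
    total_bits = len(indexes) * bit_width
    return buffer.to_bytes((total_bits + 7) // 8, "little")


colors = ["red", "red", "red", "blue", "blue", "red"]
runs = run_length_encode(colors)
dictionary, indexes = dictionary_encode(colors)
packed = bit_pack(indexes, bit_width=1)
```

With only two distinct values, each index needs one bit, so the six indexes in this example fit into a single byte.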
When writing files, Parquet will choose an appropriate encoding automatically, based on
the column type. For example, Boolean values will be written using a combination of
run-length encoding and bit packing. Most types are encoded using dictionary encoding by
default; however, a plain encoding will be used as a fallback if the dictionary becomes too
large. The threshold size at which this happens is referred to as the dictionary page size
and is the same as the page size by default (so the dictionary has to fit into one page if it is
to be used). Note that the encoding that is actually used is stored in the file metadata to
ensure that readers use the correct encoding.
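The fallback rule can be sketched as follows. This is a simplified model, assuming string values and measuring the dictionary by the UTF-8 size of its entries; a real writer makes this decision incrementally and accounts for sizes differently. The function name choose_encoding and its dictionary_page_size parameter are hypothetical, chosen to mirror the terms used above.

```python
def choose_encoding(values, dictionary_page_size):
    """Return ("dictionary", entries), or ("plain", None) if the
    dictionary would outgrow dictionary_page_size bytes."""
    dictionary = {}
    dictionary_bytes = 0
    for v in values:
        if v not in dictionary:
            dictionary[v] = len(dictionary)
            dictionary_bytes += len(v.encode("utf-8"))
            if dictionary_bytes > dictionary_page_size:
                # Too many distinct values: fall back to plain encoding.
                return "plain", None
    return "dictionary", list(dictionary)


# Few distinct values: the dictionary stays small.
low_cardinality = choose_encoding(["a", "b", "a", "a"], dictionary_page_size=1024)

# Many distinct values and a tiny threshold: falls back to plain.
high_cardinality = choose_encoding(
    [str(i) for i in range(1000)], dictionary_page_size=16
)
```

Whichever branch is taken, the chosen encoding would then be recorded alongside the data so that a reader knows how to decode it.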
In addition to the encoding, a second level of compression can be applied using a standard
compression algorithm on the encoded page bytes. By default, no compression is applied,
but Snappy, gzip, and LZO compressors are all supported.
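To show where this second level sits, here is a minimal sketch that compresses already-encoded page bytes. It uses only gzip, because that codec ships with Python's standard library; Snappy and LZO would need third-party packages and are left out. The compress_page and decompress_page helpers and the codec names are hypothetical.

```python
import gzip


def compress_page(encoded_page, codec="uncompressed"):
    """Apply an optional general-purpose codec to encoded page bytes."""
    if codec == "uncompressed":
        return encoded_page
    if codec == "gzip":
        return gzip.compress(encoded_page)
    raise ValueError(f"unsupported codec in this sketch: {codec}")


def decompress_page(stored_page, codec="uncompressed"):
    """Reverse compress_page for the same codec."""
    if codec == "uncompressed":
        return stored_page
    if codec == "gzip":
        return gzip.decompress(stored_page)
    raise ValueError(f"unsupported codec in this sketch: {codec}")


# Stand-in for a page of encoded values (repetitive, so it compresses well).
page = bytes([0, 0, 0, 1, 1, 0] * 500)
stored_plain = compress_page(page)
stored_gzip = compress_page(page, codec="gzip")
```

The reader has to apply the same codec in reverse before decoding the values, which is why the codec must be known when the page is read back.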
For nested data, each page will also store the definition and repetition levels for all the
values in the page. Since levels are small integers (the maximum is determined by the
amount of nesting specified in the schema), they can be very efficiently encoded using a
bit-packed run-length encoding.
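The sketch below illustrates why levels encode so compactly: the bit width depends only on the maximum level, and long runs of the same level collapse into a few pairs. It is a simplified model, not Parquet's actual hybrid wire format, and the names level_bit_width and encode_levels are hypothetical.

```python
def level_bit_width(max_level):
    """Number of bits needed to store any level in 0..max_level."""
    return max_level.bit_length()


def encode_levels(levels, max_level):
    """Return (bit_width, runs), where runs is a list of (level, count)."""
    runs = []
    for level in levels:
        if not 0 <= level <= max_level:
            raise ValueError(f"level {level} outside 0..{max_level}")
        if runs and runs[-1][0] == level:
            runs[-1][1] += 1
        else:
            runs.append([level, 1])
    return level_bit_width(max_level), [tuple(r) for r in runs]


# Assumed example: a maximum definition level of 2, where most values
# are fully defined (level 2) and one is missing at the top (level 0).
definition_levels = [2, 2, 2, 2, 0, 2, 2]
bit_width, level_runs = encode_levels(definition_levels, max_level=2)
```

Here seven levels reduce to three runs, each needing only a two-bit level plus a count.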