Parquet - Hadoop: The Definitive Guide

Database Reference

In-Depth Information

Parquet File Format

A Parquet file consists of a header followed by one or more blocks, terminated by a footer.

The header contains only a 4-byte magic number, PAR1 , that identifies the file as being in

Parquet format, and all the file metadata is stored in the footer. The footer's metadata in-

cludes the format version, the schema, any extra key-value pairs, and metadata for every

block in the file. The final two fields in the footer are a 4-byte field encoding the length of

the footer metadata, and the magic number again ( PAR1 ).

The consequence of storing the metadata in the footer is that reading a Parquet file requires

an initial seek to the end of the file (minus 8 bytes) to read the footer metadata length, then

a second seek backward by that length to read the footer metadata. Unlike sequence files

and Avro datafiles, where the metadata is stored in the header and sync markers are used to

separate blocks, Parquet files don't need sync markers since the block boundaries are

stored in the footer metadata. (This is possible because the metadata is written after all the

blocks have been written, so the writer can retain the block boundary positions in memory

until the file is closed.) Therefore, Parquet files are splittable, since the blocks can be loc-

ated after reading the footer and can then be processed in parallel (by MapReduce, for ex-

ample).

Each block in a Parquet file stores a row group , which is made up of column chunks con-

taining the column data for those rows. The data for each column chunk is written in pages ;

this is illustrated in Figure 13-1 .

Figure 13-1. The internal structure of a Parquet file

Each page contains values from the same column, making a page a very good candidate for

compression since the values are likely to be similar. The first level of compression is

achieved through how the values are encoded. The simplest encoding is plain encoding,

where values are written in full (e.g., an int32 is written using a 4-byte little-endian rep-

resentation), but this doesn't afford any compression in itself.

Parquet also uses more compact encodings, including delta encoding (the difference

between values is stored), run-length encoding (sequences of identical values are encoded

Search WWH ::

Custom Search

Home