Parquet Configuration
Parquet file properties are set at write time. The properties listed in Table 13-3 are
appropriate if you are creating Parquet files from MapReduce (using the formats discussed in
Parquet MapReduce), Crunch, Pig, or Hive.
Table 13-3. ParquetOutputFormat properties
Property name                 Type     Default value       Description
parquet.block.size            int      134217728 (128 MB)  The size in bytes of a block (row group).
parquet.page.size             int      1048576 (1 MB)      The size in bytes of a page.
parquet.dictionary.page.size  int      1048576 (1 MB)      The maximum allowed size in bytes of a dictionary before falling back to plain encoding for a page.
parquet.enable.dictionary     boolean  true                Whether to use dictionary encoding.
parquet.compression           String   UNCOMPRESSED        The type of compression to use for Parquet files: UNCOMPRESSED, SNAPPY, GZIP, or LZO. Used instead of mapreduce.output.fileoutputformat.compress.
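As a minimal sketch of how the properties in Table 13-3 might be collected before a job is submitted, the following uses only java.util.Properties so that it compiles without Hadoop on the classpath. The class name ParquetWriteProperties and its methods are hypothetical, introduced here for illustration; in a real MapReduce job the same keys and values would be set on the job's Configuration object instead.

```java
import java.util.Properties;

// Hypothetical helper (not part of Parquet or Hadoop): gathers the
// Table 13-3 property names and their documented defaults.
final class ParquetWriteProperties {
    private ParquetWriteProperties() {}

    // Returns the default values listed in Table 13-3.
    static Properties defaults() {
        Properties props = new Properties();
        props.setProperty("parquet.block.size", String.valueOf(128 * 1024 * 1024));     // 128 MB
        props.setProperty("parquet.page.size", String.valueOf(1024 * 1024));            // 1 MB
        props.setProperty("parquet.dictionary.page.size", String.valueOf(1024 * 1024)); // 1 MB
        props.setProperty("parquet.enable.dictionary", "true");
        props.setProperty("parquet.compression", "UNCOMPRESSED");
        return props;
    }

    // Same defaults, but with the compression codec overridden.
    // The codec name is passed through unchecked; Table 13-3 lists
    // UNCOMPRESSED, SNAPPY, GZIP, and LZO as the accepted values.
    static Properties withCompression(String codec) {
        Properties props = defaults();
        props.setProperty("parquet.compression", codec);
        return props;
    }
}
```

Calling ParquetWriteProperties.withCompression("SNAPPY") would yield the defaults with only the compression entry changed, which mirrors the common case of leaving the size settings alone and choosing a codec.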
Setting the block size is a trade-off between scanning efficiency and memory usage. Larger
blocks are more efficient to scan through since they contain more rows, which improves
sequential I/O (as there's less overhead in setting up each column chunk). However, each
block is buffered in memory for both reading and writing, which limits how large blocks
can be. The default block size is 128 MB.
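To make the memory side of this trade-off concrete, the sketch below estimates how much memory is tied up in block buffers. It rests on a simplifying assumption that is not stated in the text: that a task holds exactly one block in memory per file it has open at once. The class and method names are hypothetical, and the result is a rough lower bound rather than a measurement.

```java
// Hypothetical helper: rough lower bound on memory used for block buffers,
// assuming one buffered block (row group) per concurrently open file.
final class BlockMemoryEstimate {
    private BlockMemoryEstimate() {}

    static long bufferedBytes(int openFiles, long blockSizeBytes) {
        if (openFiles < 0 || blockSizeBytes < 0) {
            throw new IllegalArgumentException("arguments must be non-negative");
        }
        // multiplyExact throws ArithmeticException on overflow instead of wrapping.
        return Math.multiplyExact((long) openFiles, blockSizeBytes);
    }
}
```

Under that assumption, a task writing to four files at the default block size would need at least 4 x 128 MB = 512 MB for buffers alone, which is why very large blocks quickly become impractical.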
The Parquet file block size should be no larger than the HDFS block size for the file so that
each Parquet block can be read from a single HDFS block (and therefore from a single
datanode). It is common to set them to be the same, and indeed both defaults are for 128
MB block sizes.
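The rule above can be expressed as a simple check to run before a job is configured. This is only an illustrative sketch: the class name BlockSizeCheck and its method are hypothetical, and how the two sizes are obtained (for example, from the job configuration) is left to the caller.

```java
// Hypothetical helper illustrating the sizing rule: a Parquet block
// (row group) should be no larger than the HDFS block size of the file.
final class BlockSizeCheck {
    // 128 MB, the default quoted in the text for both block sizes.
    static final long DEFAULT_BLOCK_SIZE = 128L * 1024 * 1024;

    private BlockSizeCheck() {}

    // True when one Parquet block can be read from a single HDFS block.
    static boolean fitsInHdfsBlock(long parquetBlockSize, long hdfsBlockSize) {
        return parquetBlockSize > 0 && parquetBlockSize <= hdfsBlockSize;
    }
}
```

With both sizes left at DEFAULT_BLOCK_SIZE the check passes; raising the Parquet block size to 256 MB while the HDFS block size stays at 128 MB would fail it, since a single Parquet block would then span more than one HDFS block.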
A page is the smallest unit of storage in a Parquet file, so retrieving an arbitrary row (with a
single column, for the sake of illustration) requires that the page containing the row be
decompressed and decoded. Thus, for single-row lookups, it is more efficient to have smaller
pages, so there are fewer values to read through before reaching the target value. However,
smaller pages incur a higher storage and processing overhead, due to the extra metadata
(offsets, dictionaries) resulting from more pages. The default page size is 1 MB.
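The page-size trade-off can be illustrated with some back-of-the-envelope arithmetic. The sketch below assumes fixed-width values and ignores encoding and compression, so the figures are rough indications only; the class and method names are hypothetical.

```java
// Hypothetical helper: rough page-size arithmetic, assuming fixed-width
// values and ignoring the effects of encoding and compression.
final class PageSizeTradeoff {
    private PageSizeTradeoff() {}

    // Upper bound on the values that must be decoded to reach one row
    // in a single column: everything stored in the containing page.
    static long valuesPerPage(long pageSizeBytes, int bytesPerValue) {
        return pageSizeBytes / bytesPerValue;
    }

    // Number of pages (each carrying its own metadata) needed to hold
    // a column chunk of the given size, rounded up.
    static long pagesPerChunk(long chunkBytes, long pageSizeBytes) {
        return (chunkBytes + pageSizeBytes - 1) / pageSizeBytes;
    }
}
```

For 8-byte values, a 1 MB page holds up to 131,072 values, whereas a 64 KB page holds 8,192, so a single-row lookup decodes far fewer values with the smaller page. The cost shows up in the page count: a 128 MB column chunk needs 128 pages at 1 MB each but 2,048 pages at 64 KB each, with per-page metadata multiplied accordingly.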