WARNING
Parquet is a columnar format, so it buffers rows in memory. Even though the mapper in this example just
passes values through, it must have sufficient memory for the Parquet writer to buffer each block (row
group), which is by default 128 MB. If you get job failures due to out-of-memory errors, you can adjust
the Parquet file block size for the writer with parquet.block.size (see Table 13-3). You may also
need to change the MapReduce task memory allocation (when reading or writing) using the settings discussed in Memory settings in YARN and MapReduce.
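For instance, both settings can be passed on the command line, assuming the driver is run via ToolRunner so that generic -D options are parsed (the values shown here, 64 MB and 2,048 MB, are illustrative, not recommendations):

```shell
# Halve the Parquet writer's row group size and raise the map task
# memory allocation for a single run (hypothetical values):
hadoop jar parquet-examples.jar TextToParquetWithAvro \
    -D parquet.block.size=67108864 \
    -D mapreduce.map.memory.mb=2048 \
    input/docs/quangle.txt output
```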
The following command runs the program on the four-line text file quangle.txt:
% hadoop jar parquet-examples.jar TextToParquetWithAvro \
input/docs/quangle.txt output
We can use the Parquet command-line tools to dump the output Parquet file for inspection:
% parquet-tools dump output/part-m-00000.parquet
INT64 offset
--------------------------------------------------------------------------------
*** row group 1 of 1, values 1 to 4 ***
value 1: R:0 D:0 V:0
value 2: R:0 D:0 V:33
value 3: R:0 D:0 V:57
value 4: R:0 D:0 V:89
BINARY line
--------------------------------------------------------------------------------
*** row group 1 of 1, values 1 to 4 ***
value 1: R:0 D:0 V:On the top of the Crumpetty Tree
value 2: R:0 D:0 V:The Quangle Wangle sat,
value 3: R:0 D:0 V:But his face you could not see,
value 4: R:0 D:0 V:On account of his Beaver Hat.
Notice how the values within a row group are shown together. V indicates the value, R the
repetition level, and D the definition level. For this schema, the latter two are zero since
there is no nesting.
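The rule behind those zeros can be sketched in a few lines of Python. This is a hypothetical helper, not part of the Parquet API: a value's maximum definition level is the number of optional or repeated fields on its path, so a schema whose fields are all required at the top level, like this one, can only ever show D:0.

```python
# Sketch of Parquet's definition-level rule (illustrative, not the
# Parquet API). A field's path is a list of (name, repetition) pairs,
# where repetition is 'required', 'optional', or 'repeated'.

def max_definition_level(path):
    """Count the optional/repeated fields on the path; required fields
    contribute nothing, since they can never be absent."""
    return sum(1 for _, rep in path if rep != 'required')

# Both fields in the example schema are required at the top level,
# so every value is dumped with D:0 (and R:0, as nothing repeats).
assert max_definition_level([('offset', 'required')]) == 0
assert max_definition_level([('line', 'required')]) == 0

# By contrast, an optional top-level field would have a maximum
# definition level of 1: D:1 when present, D:0 when null.
assert max_definition_level([('line', 'optional')]) == 1
```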