WARNING
Parquet is a columnar format, so it buffers rows in memory. Even though the mapper in this example just
passes values through, it must have sufficient memory for the Parquet writer to buffer each block (row
group), which is by default 128 MB. If you get job failures due to out-of-memory errors, you can adjust
the Parquet file block size for the writer with parquet.block.size (see Table 13-3). You may also
need to change the MapReduce task memory allocation (when reading or writing) using the settings discussed in Memory settings in YARN and MapReduce.
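For instance, both settings can be passed on the command line, assuming the driver is run via ToolRunner so that generic -D options are parsed (the values shown here, 64 MB and 2,048 MB, are illustrative, not recommendations):

```shell
# Halve the Parquet writer's row group size and raise the map task
# memory allocation for a single run (hypothetical values):
hadoop jar parquet-examples.jar TextToParquetWithAvro \
    -D parquet.block.size=67108864 \
    -D mapreduce.map.memory.mb=2048 \
    input/docs/quangle.txt output
```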
The following command runs the program on the four-line text file quangle.txt:
% hadoop jar parquet-examples.jar TextToParquetWithAvro \
input/docs/quangle.txt output
We can use the Parquet command-line tools to dump the output Parquet file for inspection:
% parquet-tools dump output/part-m-00000.parquet
INT64 offset
--------------------------------------------------------------------------------
*** row group 1 of 1, values 1 to 4 ***
value 1: R:0 D:0 V:0
value 2: R:0 D:0 V:33
value 3: R:0 D:0 V:57
value 4: R:0 D:0 V:89
BINARY line
--------------------------------------------------------------------------------
*** row group 1 of 1, values 1 to 4 ***
value 1: R:0 D:0 V:On the top of the Crumpetty Tree
value 2: R:0 D:0 V:The Quangle Wangle sat,
value 3: R:0 D:0 V:But his face you could not see,
value 4: R:0 D:0 V:On account of his Beaver Hat.
Notice how the values within a row group are shown together. V indicates the value, R the
repetition level, and D the definition level. For this schema, the latter two are zero since
there is no nesting.
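The rule behind those zeros can be sketched in a few lines of Python. This is a hypothetical helper, not part of the Parquet API: a value's maximum definition level is the number of optional or repeated fields on its path, so a schema whose fields are all required at the top level, like this one, can only ever show D:0.

```python
# Sketch of Parquet's definition-level rule (illustrative, not the
# Parquet API). A field's path is a list of (name, repetition) pairs,
# where repetition is 'required', 'optional', or 'repeated'.

def max_definition_level(path):
    """Count the optional/repeated fields on the path; required fields
    contribute nothing, since they can never be absent."""
    return sum(1 for _, rep in path if rep != 'required')

# Both fields in the example schema are required at the top level,
# so every value is dumped with D:0 (and R:0, as nothing repeats).
assert max_definition_level([('offset', 'required')]) == 0
assert max_definition_level([('line', 'required')]) == 0

# By contrast, an optional top-level field would have a maximum
# definition level of 1: D:1 when present, D:0 when null.
assert max_definition_level([('line', 'optional')]) == 1
```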