Parquet - Hadoop: The Definitive Guide

to the Parquet file. The file is a regular Parquet file; it is identical to the one written in the previous section using ParquetWriter with GroupWriteSupport, except for an extra piece of metadata to store the Avro schema. We can see this by inspecting the file's metadata using Parquet's command-line tools:

% parquet-tools meta data.parquet
...
extra: avro.schema = {"type":"record","name":"StringPair", ...
...

Similarly, to see the Parquet schema that was generated from the Avro schema, we can use the following:

% parquet-tools schema data.parquet
message StringPair {
  required binary left (UTF8);
  required binary right (UTF8);
}
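The mapping visible in this output (each non-nullable Avro string field becomes a required binary column annotated as UTF8) can be sketched in a few lines of plain Java. This is only an illustration of the one case shown here, not the converter that parquet-avro uses; the class name MessageSchemaSketch and its render method are hypothetical.

```java
import java.util.List;

/** Illustrative only: mimics the string-to-binary mapping seen in the schema output. */
final class MessageSchemaSketch {
    private MessageSchemaSketch() {}

    /**
     * Renders a Parquet-style message schema for a record whose fields are
     * all non-nullable Avro strings (the only case covered in this section).
     */
    static String render(String recordName, List<String> stringFields) {
        StringBuilder sb = new StringBuilder();
        sb.append("message ").append(recordName).append(" {\n");
        for (String field : stringFields) {
            // A non-nullable Avro string appears as a required, UTF8-annotated binary.
            sb.append("  required binary ").append(field).append(" (UTF8);\n");
        }
        sb.append("}");
        return sb.toString();
    }
}
```

Calling MessageSchemaSketch.render("StringPair", Arrays.asList("left", "right")) yields the same four lines of schema text that parquet-tools printed for data.parquet. Other Avro types, nullable fields, and nested records are deliberately left out of the sketch.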

To read the Parquet file back, we use an AvroParquetReader and get back Avro GenericRecord objects:

// Open the file and read the first (and only) record.
AvroParquetReader<GenericRecord> reader =
    new AvroParquetReader<GenericRecord>(path);
GenericRecord result = reader.read();
assertNotNull(result);
assertThat(result.get("left").toString(), is("L"));
assertThat(result.get("right").toString(), is("R"));
// read() returns null once there are no more records.
assertNull(reader.read());

Projection and read schemas

It's often the case that you only need to read a few columns in the file, and indeed this is the raison d'être of a columnar format like Parquet: to save time and I/O. You can use a projection schema to select the columns to read. For example, the following schema will read only the right field of a StringPair:

{
  "type": "record",
  "name": "StringPair",
  "doc": "The right field of a pair of strings.",
  "fields": [
    {"name": "right", "type": "string"}
  ]
}
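To make the I/O saving concrete, here is a toy in-memory model of column projection in plain Java. It is a sketch under stated assumptions rather than anything from Parquet or Avro: the class ColumnProjectionSketch, its methods, and the valuesRead counter (a stand-in for I/O cost) are all hypothetical names introduced for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy column store (hypothetical; not part of Parquet or Avro). */
final class ColumnProjectionSketch {
    // One value list per field, mirroring a column-oriented layout.
    private final Map<String, List<String>> columns = new LinkedHashMap<>();
    private int rowCount = 0;
    // Number of stored values touched by reads; a stand-in for I/O cost.
    private int valuesRead = 0;

    ColumnProjectionSketch(String... fieldNames) {
        for (String name : fieldNames) {
            columns.put(name, new ArrayList<>());
        }
    }

    /** Appends one row, splitting its values across the per-field columns. */
    void addRow(String... values) {
        if (values.length != columns.size()) {
            throw new IllegalArgumentException("expected " + columns.size() + " values");
        }
        int i = 0;
        for (List<String> column : columns.values()) {
            column.add(values[i++]);
        }
        rowCount++;
    }

    /** Rebuilds rows from the projected columns only; other columns are never touched. */
    List<Map<String, String>> read(List<String> projection) {
        for (String name : projection) {
            if (!columns.containsKey(name)) {
                throw new IllegalArgumentException("unknown field: " + name);
            }
        }
        List<Map<String, String>> rows = new ArrayList<>();
        for (int r = 0; r < rowCount; r++) {
            Map<String, String> row = new LinkedHashMap<>();
            for (String name : projection) {
                row.put(name, columns.get(name).get(r));
                valuesRead++;
            }
            rows.add(row);
        }
        return rows;
    }

    int valuesRead() {
        return valuesRead;
    }
}
```

With a single row ("L", "R") stored under the fields left and right, read(Arrays.asList("right")) returns one row containing only right, and valuesRead() reports 1 rather than 2. In parquet-avro itself the projection schema is handed to the reader through the configuration (AvroReadSupport.setRequestedProjection); the toy above models only the effect, not that mechanism.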
