to the Parquet file. The file is a regular Parquet file; it is identical to the one written in
the previous section using ParquetWriter with GroupWriteSupport, except for
an extra piece of metadata to store the Avro schema. We can see this by inspecting the
file's metadata using Parquet's command-line tools: [89]
% parquet-tools meta data.parquet
...
extra: avro.schema = {"type":"record","name":"StringPair", ...
...
Similarly, to see the Parquet schema that was generated from the Avro schema, we can use
the following:
% parquet-tools schema data.parquet
message StringPair {
  required binary left (UTF8);
  required binary right (UTF8);
}
To read the Parquet file back, we use an AvroParquetReader and get back Avro
GenericRecord objects:
AvroParquetReader<GenericRecord> reader =
    new AvroParquetReader<GenericRecord>(path);
GenericRecord result = reader.read();
assertNotNull(result);
assertThat(result.get("left").toString(), is("L"));
assertThat(result.get("right").toString(), is("R"));
assertNull(reader.read());
Projection and read schemas
It's often the case that you only need to read a few columns in the file, and indeed this is
the raison d'être of a columnar format like Parquet: to save time and I/O. You can use a
projection schema to select the columns to read. For example, the following schema will
read only the right field of a StringPair:
{
  "type": "record",
  "name": "StringPair",
  "doc": "The right field of a pair of strings.",
  "fields": [
    {"name": "right", "type": "string"}
  ]
}
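To make the I/O saving concrete, here is a toy sketch in plain Java. It is an illustration only, not Parquet's API or file layout: the class ColumnProjectionSketch and its methods are hypothetical names invented for this example. It assumes each field is stored as its own list and counts how many stored values a read touches, so a read that projects only right never visits the left column.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model of a columnar file holding StringPair records.
// Hypothetical class for illustration; not part of Parquet or Avro.
class ColumnProjectionSketch {
    // Each field is stored contiguously as its own column.
    private final Map<String, List<String>> columns = new LinkedHashMap<>();
    private final int rowCount;
    // Number of stored values touched by reads so far.
    private long touched = 0;

    ColumnProjectionSketch(List<String> left, List<String> right) {
        if (left.size() != right.size()) {
            throw new IllegalArgumentException("columns must have equal length");
        }
        columns.put("left", new ArrayList<>(left));
        columns.put("right", new ArrayList<>(right));
        rowCount = left.size();
    }

    // Reassembles rows from the projected columns only.
    // Columns that are not named in the projection are never accessed.
    List<Map<String, String>> read(List<String> projection) {
        List<Map<String, String>> rows = new ArrayList<>();
        for (int i = 0; i < rowCount; i++) {
            rows.add(new LinkedHashMap<>());
        }
        for (String field : projection) {
            List<String> column = columns.get(field);
            if (column == null) {
                throw new IllegalArgumentException("unknown field: " + field);
            }
            for (int i = 0; i < rowCount; i++) {
                rows.get(i).put(field, column.get(i));
                touched++;
            }
        }
        return rows;
    }

    long valuesRead() {
        return touched;
    }

    public static void main(String[] args) {
        ColumnProjectionSketch file = new ColumnProjectionSketch(
            Arrays.asList("L1", "L2"), Arrays.asList("R1", "R2"));
        // Project only the right field, as the schema above does.
        System.out.println(file.read(Arrays.asList("right")));
        System.out.println("values read: " + file.valuesRead());
    }
}
```

In this toy model, the projected read touches two values instead of four. In Parquet itself the projection is expressed as an Avro schema like the one above rather than as a list of field names.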