to the Parquet file. The file is a regular Parquet file: it is identical to the one written in the previous section using ParquetWriter with GroupWriteSupport, except for an extra piece of metadata to store the Avro schema. We can see this by inspecting the file's metadata with parquet-tools:
% parquet-tools meta data.parquet
...
extra: avro.schema = {"type":"record","name":"StringPair", ...
...
Similarly, to see the Parquet schema that was generated from the Avro schema, we can use
the following:
% parquet-tools schema data.parquet
message StringPair {
  required binary left (UTF8);
  required binary right (UTF8);
}
To read the Parquet file back, we use an AvroParquetReader and get back Avro GenericRecord objects:
AvroParquetReader<GenericRecord> reader =
    new AvroParquetReader<GenericRecord>(path);
GenericRecord result = reader.read();
assertNotNull(result);
assertThat(result.get("left").toString(), is("L"));
assertThat(result.get("right").toString(), is("R"));
assertNull(reader.read());
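The assertions above read one record and then check that the next read() returns null. When a file holds many records, that same end-of-input signal can drive a loop. The following is a minimal sketch of the pattern, using an in-memory stand-in so that it runs without Parquet on the classpath. RecordSource, ListSource, and readAll are hypothetical names chosen for illustration and are not part of the Parquet or Avro APIs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class ReadLoopSketch {

    // Hypothetical stand-in for a reader whose read() returns null at end of
    // input, mirroring the behavior the assertions rely on. Not a Parquet class.
    interface RecordSource<T> {
        T read();
    }

    // In-memory source used only for illustration.
    static final class ListSource<T> implements RecordSource<T> {
        private final Iterator<T> it;

        ListSource(List<T> records) {
            this.it = records.iterator();
        }

        @Override
        public T read() {
            return it.hasNext() ? it.next() : null;
        }
    }

    // Drain a source: keep reading until read() signals the end with null.
    static <T> List<T> readAll(RecordSource<T> source) {
        List<T> out = new ArrayList<T>();
        T record;
        while ((record = source.read()) != null) {
            out.add(record);
        }
        return out;
    }

    public static void main(String[] args) {
        RecordSource<String> source = new ListSource<String>(Arrays.asList("L", "R"));
        System.out.println(readAll(source)); // prints [L, R]
    }
}
```

With a real reader, the body of the loop would process each GenericRecord in turn, and the reader should be closed when the loop finishes.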
Projection and read schemas
It's often the case that you only need to read a few columns in the file, and indeed this is the raison d'être of a columnar format like Parquet: to save time and I/O. You can use a projection schema to select the columns to read. For example, the following schema will read only the right field of a StringPair:
{
  "type": "record",
  "name": "StringPair",
  "doc": "The right field of a pair of strings.",
  "fields": [
    {"name": "right", "type": "string"}
  ]
}
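To make the I/O saving concrete, here is a toy model of column projection in plain Java. It is not how Parquet is implemented, and it uses no Parquet or Avro classes. ColumnStore, project, and the valuesRead counter are hypothetical names introduced for illustration. Each column is stored separately, so asking for only right touches only that column's values.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ProjectionSketch {

    // Toy columnar store: one list of values per column name.
    // Assumes every column holds the same number of values.
    static final class ColumnStore {
        private final Map<String, List<String>> columns =
            new LinkedHashMap<String, List<String>>();
        private int valuesRead = 0; // values touched, a rough stand-in for I/O

        void addColumn(String name, List<String> values) {
            columns.put(name, new ArrayList<String>(values));
        }

        // Build rows that contain only the requested columns.
        List<Map<String, String>> project(List<String> requested) {
            int rowCount = columns.isEmpty()
                ? 0 : columns.values().iterator().next().size();
            List<Map<String, String>> rows = new ArrayList<Map<String, String>>();
            for (int i = 0; i < rowCount; i++) {
                Map<String, String> row = new LinkedHashMap<String, String>();
                for (String name : requested) {
                    List<String> column = columns.get(name);
                    if (column == null) {
                        throw new IllegalArgumentException("No such column: " + name);
                    }
                    row.put(name, column.get(i));
                    valuesRead++;
                }
                rows.add(row);
            }
            return rows;
        }

        int valuesRead() {
            return valuesRead;
        }
    }

    public static void main(String[] args) {
        ColumnStore store = new ColumnStore();
        store.addColumn("left", Arrays.asList("L1", "L2"));
        store.addColumn("right", Arrays.asList("R1", "R2"));

        // Project only the "right" column, as the projection schema does.
        System.out.println(store.project(Arrays.asList("right"))); // prints [{right=R1}, {right=R2}]
        System.out.println(store.valuesRead()); // prints 2
    }
}
```

In this model the left column is never touched when only right is requested, which is the effect a projection schema is meant to achieve on a real columnar file.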