To use a projection schema, set it on the configuration using the setRequestedProjection()
static convenience method on AvroReadSupport:
Schema projectionSchema = parser.parse(
    getClass().getResourceAsStream("ProjectedStringPair.avsc"));
Configuration conf = new Configuration();
AvroReadSupport.setRequestedProjection(conf, projectionSchema);
Then pass the configuration into the constructor for AvroParquetReader:
AvroParquetReader<GenericRecord> reader =
    new AvroParquetReader<GenericRecord>(conf, path);
GenericRecord result = reader.read();
assertNull(result.get("left"));
assertThat(result.get("right").toString(), is("R"));
Both the Protocol Buffers and Thrift implementations support projection in a similar
manner. In addition, the Avro implementation allows you to specify a reader's schema by
calling setAvroReadSchema() on AvroReadSupport. This schema is used to resolve Avro
records according to the rules listed in Table 12-4.
Avro has both a projection schema and a reader's schema because the projection must be
a subset of the schema used to write the Parquet file, so it cannot be used to evolve a
schema by adding new fields.
The two schemas serve different purposes, and you can use both together. The projection
schema is used to filter the columns to read from the Parquet file. Although it is expressed
as an Avro schema, it can be viewed simply as a list of Parquet columns to read back. The
reader's schema, on the other hand, is used only to resolve Avro records. It is never
translated to a Parquet schema, since it has no bearing on which columns are read from the
Parquet file. For example, if we added a description field to our Avro schema (as in
Schema Resolution) and used it as the Avro reader's schema, then the records would
contain the default value of the field, even though the Parquet file has no such field.
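The division of labor between the two schemas can be modeled in a few lines of plain Java. This is only a sketch under stated assumptions: it does not call Parquet or Avro, records are represented as string maps, and the class SchemaRolesSketch and its project() and resolve() methods are hypothetical names invented for illustration. Real Avro schema resolution covers many more cases than the default-filling shown here.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of the two schema roles; it does not call Parquet or Avro.
class SchemaRolesSketch {

    // Projection: keep only the listed columns. Every projected column must
    // exist in the file, mirroring the "subset of the write schema" rule.
    static Map<String, String> project(Map<String, String> fileRow, List<String> columns) {
        Map<String, String> out = new LinkedHashMap<>();
        for (String column : columns) {
            if (!fileRow.containsKey(column)) {
                throw new IllegalArgumentException("not in file schema: " + column);
            }
            out.put(column, fileRow.get(column));
        }
        return out;
    }

    // Resolution: build the record the reader expects. A reader field that was
    // not read from the file takes its default; a null default means "none".
    static Map<String, String> resolve(Map<String, String> row, Map<String, String> readerDefaults) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> field : readerDefaults.entrySet()) {
            String name = field.getKey();
            if (row.containsKey(name)) {
                out.put(name, row.get(name));
            } else if (field.getValue() != null) {
                out.put(name, field.getValue());
            } else {
                throw new IllegalStateException("no value or default for: " + name);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // One row as stored in the (modeled) file.
        Map<String, String> fileRow = new LinkedHashMap<>();
        fileRow.put("left", "L");
        fileRow.put("right", "R");

        // Reader's fields: "right" has no default, "description" defaults to "".
        Map<String, String> readerDefaults = new LinkedHashMap<>();
        readerDefaults.put("right", null);
        readerDefaults.put("description", "");

        Map<String, String> projected = project(fileRow, Arrays.asList("right"));
        Map<String, String> resolved = resolve(projected, readerDefaults);

        System.out.println(projected); // {right=R}
        System.out.println(resolved);  // {right=R, description=}
    }
}
```

In this model, asking project() for the description column fails because the file never contained it, whereas resolve() supplies the field from its default. That is the sense in which only the reader's schema can evolve a record by adding new fields.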