Writing and Reading Parquet Files
Most of the time Parquet files are processed using higher-level tools like Pig, Hive, or Impala, but sometimes low-level sequential access may be required, which we cover in this section.
Parquet has a pluggable in-memory data model to facilitate integration of the Parquet file
format with a wide range of tools and components. ReadSupport and WriteSupport
are the integration points in Java, and implementations of these classes do the conversion
between the objects used by the tool or component and the objects used to represent each
Parquet type in the schema.
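The conversion role these classes play can be sketched with a stdlib-only analogy. The interface and class names below are illustrative, not the real parquet-mr API: an adapter converts a tool's native object into a generic field map that a file-format writer could consume.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class WriteSupportSketch {

    // Hypothetical application-side object.
    static final class Pair {
        final String left;
        final String right;
        Pair(String left, String right) {
            this.left = left;
            this.right = right;
        }
    }

    // The "integration point" in this analogy: converts application
    // objects into a generic representation, field by field.
    interface RecordConverter<T> {
        Map<String, Object> convert(T value);
    }

    public static void main(String[] args) {
        RecordConverter<Pair> support = pair -> {
            Map<String, Object> fields = new LinkedHashMap<>();
            fields.put("left", pair.left);
            fields.put("right", pair.right);
            return fields;
        };
        System.out.println(support.convert(new Pair("L", "R"))); // {left=L, right=R}
    }
}
```

The real `WriteSupport` does the same kind of per-field translation, but emits values to Parquet's writer rather than building a map.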
To demonstrate, we'll use a simple in-memory model that comes bundled with Parquet in
the parquet.example.data and parquet.example.data.simple packages.
Then, in the next section, we'll use an Avro representation to do the same thing.
NOTE
As the names suggest, the example classes that come with Parquet are an object model for demonstrating
how to work with Parquet files; for production, one of the supported frameworks should be used (Avro,
Protocol Buffers, or Thrift).
To write a Parquet file, we need to define a Parquet schema, represented by an instance of
parquet.schema.MessageType:

MessageType schema = MessageTypeParser.parseMessageType(
    "message Pair {\n" +
    "  required binary left (UTF8);\n" +
    "  required binary right (UTF8);\n" +
    "}");
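The schema language also supports other primitive types and repetition levels besides required. For illustration, a hypothetical schema (not part of the example above) mixing them might look like:

```
message Person {
  required binary name (UTF8);
  optional int32 age;
  repeated binary email (UTF8);
}
```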
Next, we need to create an instance of a Parquet message for each record to be written to the file. For the parquet.example.data package, a message is represented by an instance of Group, constructed using a GroupFactory:

GroupFactory groupFactory = new SimpleGroupFactory(schema);
Group group = groupFactory.newGroup()
    .append("left", "L")
    .append("right", "R");
Notice that the values in the message are UTF8 logical types, and Group provides a natural conversion from a Java String for us.
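Under the hood, that conversion is plain UTF-8 encoding: the UTF8 annotation marks a binary field as holding the UTF-8 encoded bytes of a string. A stdlib-only sketch of the round trip (the class name is ours, not Parquet's):

```java
import java.nio.charset.StandardCharsets;

public class Utf8Conversion {
    public static void main(String[] args) {
        // A UTF8-annotated binary field stores the UTF-8 encoding of the
        // string -- the conversion Group.append(String) performs for us.
        byte[] encoded = "L".getBytes(StandardCharsets.UTF_8);
        System.out.println(encoded.length); // 1 (ASCII characters take one byte)
        System.out.println(new String(encoded, StandardCharsets.UTF_8)); // L
    }
}
```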