Writing and Reading Parquet Files
Most of the time Parquet files are processed using higher-level tools like Pig, Hive, or Impala, but sometimes low-level sequential access may be required, which we cover in this section.
Parquet has a pluggable in-memory data model to facilitate integration of the Parquet file
format with a wide range of tools and components. ReadSupport and WriteSupport
are the integration points in Java, and implementations of these classes do the conversion
between the objects used by the tool or component and the objects used to represent each
Parquet type in the schema.
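The conversion role these classes play can be sketched with a stdlib-only analogy. The interface and class names below are illustrative, not the real parquet-mr API: an adapter converts a tool's native object into a generic field map that a file-format writer could consume.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class WriteSupportSketch {

    // Hypothetical application-side object.
    static final class Pair {
        final String left;
        final String right;
        Pair(String left, String right) {
            this.left = left;
            this.right = right;
        }
    }

    // The "integration point" in this analogy: converts application
    // objects into a generic representation, field by field.
    interface RecordConverter<T> {
        Map<String, Object> convert(T value);
    }

    public static void main(String[] args) {
        RecordConverter<Pair> support = pair -> {
            Map<String, Object> fields = new LinkedHashMap<>();
            fields.put("left", pair.left);
            fields.put("right", pair.right);
            return fields;
        };
        System.out.println(support.convert(new Pair("L", "R"))); // {left=L, right=R}
    }
}
```

The real `WriteSupport` does the same kind of per-field translation, but emits values to Parquet's writer rather than building a map.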
To demonstrate, we'll use a simple in-memory model that comes bundled with Parquet in
the parquet.example.data and parquet.example.data.simple packages.
Then, in the next section, we'll use an Avro representation to do the same thing.
NOTE
As the names suggest, the example classes that come with Parquet are an object model for demonstrating
how to work with Parquet files; for production, one of the supported frameworks should be used (Avro,
Protocol Buffers, or Thrift).
To write a Parquet file, we need to define a Parquet schema, represented by an instance of
parquet.schema.MessageType:

MessageType schema = MessageTypeParser.parseMessageType(
    "message Pair {\n" +
    "  required binary left (UTF8);\n" +
    "  required binary right (UTF8);\n" +
    "}");
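The schema language also supports other primitive types and repetition levels besides required. For illustration, a hypothetical schema (not part of the example above) mixing them might look like:

```
message Person {
  required binary name (UTF8);
  optional int32 age;
  repeated binary email (UTF8);
}
```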
Next, we need to create an instance of a Parquet message for each record to be written to the file. For the parquet.example.data package, a message is represented by an instance of Group, constructed using a GroupFactory:

GroupFactory groupFactory = new SimpleGroupFactory(schema);
Group group = groupFactory.newGroup()
    .append("left", "L")
    .append("right", "R");
Notice that the values in the message are UTF8 logical types, and Group provides a natural conversion from a Java String for us.
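Under the hood, that conversion is plain UTF-8 encoding: the UTF8 annotation marks a binary field as holding the UTF-8 encoded bytes of a string. A stdlib-only sketch of the round trip (the class name is ours, not Parquet's):

```java
import java.nio.charset.StandardCharsets;

public class Utf8Conversion {
    public static void main(String[] args) {
        // A UTF8-annotated binary field stores the UTF-8 encoding of the
        // string -- the conversion Group.append(String) performs for us.
        byte[] encoded = "L".getBytes(StandardCharsets.UTF_8);
        System.out.println(encoded.length); // 1 (ASCII characters take one byte)
        System.out.println(new String(encoded, StandardCharsets.UTF_8)); // L
    }
}
```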