Tutorial Links
The GitHub page for the Parquet format project is a great place to start if you're interested in
learning a bit more about how the technology works. If, on the other hand, you'd like to dive
straight into examples, move over to the GitHub page for the parquet-mr project.
Example Code
The Parquet file format is supported by many of the standard Hadoop tools, including Hive
(described here) and Pig (described here). Using the Parquet data format is typically as easy
as adding a couple of lines to your CREATE TABLE command or changing a few words in your
Pig script.
For example, to change our Hive example to use Parquet instead of the delimited textfile
format, we simply refer to Parquet when we create the table:
CREATE EXTERNAL TABLE movie_reviews
(reviewer STRING, title STRING, rating INT)
ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe'
STORED AS
INPUTFORMAT "parquet.hive.DeprecatedParquetInputFormat"
OUTPUTFORMAT "parquet.hive.DeprecatedParquetOutputFormat"
LOCATION '/data/reviews';
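Once the table is defined, queries run against it just as they would against a text-backed
table; Hive handles the Parquet encoding transparently. As a brief sketch using the
movie_reviews table created above (the query itself is illustrative, not part of the original
example):

-- Average rating per title, read directly from the Parquet files
SELECT title, AVG(rating) AS avg_rating
FROM movie_reviews
GROUP BY title;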
We can similarly modify our Pig example to load a review file that is stored in the Parquet
format instead of CSV:
reviews = load 'reviews.pqt' using parquet.pig.ParquetLoader()
as (reviewer:chararray, title:chararray, rating:int);
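Writing data back out in the Parquet format is just as straightforward. As a minimal sketch,
this uses the parquet.pig.ParquetStorer class that ships alongside the loader in the
parquet-mr project; the output path is hypothetical:

-- Store the reviews relation as Parquet files under a sample output directory
store reviews into '/data/reviews_parquet' using parquet.pig.ParquetStorer();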