Parquet MapReduce
Parquet comes with a selection of MapReduce input and output formats for reading and writing Parquet files from MapReduce jobs, including ones for working with Avro, Protocol Buffers, and Thrift schemas and data.
The program in Example 13-1 is a map-only job that reads text files and writes Parquet files where each record is the line's offset in the file (represented by an int64, converted from a long in Avro) and the line itself (a string). It uses the Avro Generic API for its in-memory data model.
Example 13-1. MapReduce program to convert text files to Parquet files using AvroParquetOutputFormat
public class TextToParquetWithAvro extends Configured implements Tool {

  private static final Schema SCHEMA = new Schema.Parser().parse(
      "{\n" +
      "  \"type\": \"record\",\n" +
      "  \"name\": \"Line\",\n" +
      "  \"fields\": [\n" +
      "    {\"name\": \"offset\", \"type\": \"long\"},\n" +
      "    {\"name\": \"line\", \"type\": \"string\"}\n" +
      "  ]\n" +
      "}");

  public static class TextToParquetMapper
      extends Mapper<LongWritable, Text, Void, GenericRecord> {

    private GenericRecord record = new GenericData.Record(SCHEMA);

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      record.put("offset", key.get());
      record.put("line", value.toString());
      context.write(null, record);
    }
  }

  @Override
  public int run(String[] args) throws Exception {
    if (args.length != 2) {
      System.err.printf("Usage: %s [generic options] <input> <output>\n",
          getClass().getSimpleName());