Parquet MapReduce
Parquet comes with a selection of MapReduce input and output formats for reading and writing Parquet files from MapReduce jobs, including ones for working with Avro, Protocol Buffers, and Thrift schemas and data.
The program in Example 13-1 is a map-only job that reads text files and writes Parquet files in which each record holds a line's offset in the file (an int64, the Parquet type that Avro's long maps to) and the line itself (a string). It uses the Avro Generic API for its in-memory data model.
Example 13-1. MapReduce program to convert text files to Parquet files using AvroParquetOutputFormat
public class TextToParquetWithAvro extends Configured implements Tool {

  private static final Schema SCHEMA = new Schema.Parser().parse(
      "{\n" +
      "  \"type\": \"record\",\n" +
      "  \"name\": \"Line\",\n" +
      "  \"fields\": [\n" +
      "    {\"name\": \"offset\", \"type\": \"long\"},\n" +
      "    {\"name\": \"line\", \"type\": \"string\"}\n" +
      "  ]\n" +
      "}");

  public static class TextToParquetMapper
      extends Mapper<LongWritable, Text, Void, GenericRecord> {

    private GenericRecord record = new GenericData.Record(SCHEMA);

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      record.put("offset", key.get());
      record.put("line", value.toString());
      context.write(null, record); // Parquet files have no keys, so the key is null
    }
  }

  @Override
  public int run(String[] args) throws Exception {
    if (args.length != 2) {
      System.err.printf("Usage: %s [generic options] <input> <output>\n",
          getClass().getSimpleName());
      ToolRunner.printGenericCommandUsage(System.err);
      return -1;
    }

    Job job = new Job(getConf(), "Text to Parquet");
    job.setJarByClass(getClass());
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    // Map-only job: no reducers, and the Avro-backed Parquet output format
    // needs the Avro schema in order to derive the Parquet file schema
    job.setMapperClass(TextToParquetMapper.class);
    job.setNumReduceTasks(0);
    job.setOutputFormatClass(AvroParquetOutputFormat.class);
    AvroParquetOutputFormat.setSchema(job, SCHEMA);

    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new TextToParquetWithAvro(), args));
  }
}
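Going in the other direction, Parquet files can be read in a MapReduce job with AvroParquetInputFormat, which presents each Parquet record to the mapper as an Avro GenericRecord (the key is always null, so the mapper's input key type is Void). The mapper below is a minimal sketch, not part of Example 13-1: it assumes the same pre-Apache parquet.avro package used above (later Parquet releases moved the classes to org.apache.parquet.avro), and the hypothetical class name ParquetToTextMapper is ours.

import java.io.IOException;

import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper for a map-only job that reads the Parquet files
// written by Example 13-1 and turns each record back into a line of text.
public class ParquetToTextMapper
    extends Mapper<Void, GenericRecord, NullWritable, Text> {

  private final Text line = new Text();

  @Override
  protected void map(Void key, GenericRecord value, Context context)
      throws IOException, InterruptedException {
    // AvroParquetInputFormat delivers each record as a GenericRecord;
    // pull the two fields back out and emit them as tab-separated text.
    line.set(value.get("offset") + "\t" + value.get("line"));
    context.write(NullWritable.get(), line);
  }
}

In the driver, the Parquet-specific configuration is job.setInputFormatClass(AvroParquetInputFormat.class). A reading schema containing only some of the fields can also be supplied with AvroParquetInputFormat.setRequestedProjection(), so that only the columns that are needed are read, which is where Parquet's columnar layout pays off.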