Parquet MapReduce
Parquet comes with a selection of MapReduce input and output formats for reading and writing Parquet files from MapReduce jobs, including ones for working with Avro, Protocol Buffers, and Thrift schemas and data.
The program in Example 13-1 is a map-only job that reads text files and writes Parquet files where each record is the line's offset in the file (represented by an int64, converted from a long in Avro) and the line itself (a string). It uses the Avro Generic API for its in-memory data model.
Example 13-1. MapReduce program to convert text files to Parquet files using AvroParquetOutputFormat
public class TextToParquetWithAvro extends Configured implements Tool {

  private static final Schema SCHEMA = new Schema.Parser().parse(
      "{\n" +
      "  \"type\": \"record\",\n" +
      "  \"name\": \"Line\",\n" +
      "  \"fields\": [\n" +
      "    {\"name\": \"offset\", \"type\": \"long\"},\n" +
      "    {\"name\": \"line\", \"type\": \"string\"}\n" +
      "  ]\n" +
      "}");

  public static class TextToParquetMapper
      extends Mapper<LongWritable, Text, Void, GenericRecord> {

    private GenericRecord record = new GenericData.Record(SCHEMA);

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      record.put("offset", key.get());
      record.put("line", value.toString());
      context.write(null, record);
    }
  }

  @Override
  public int run(String[] args) throws Exception {
    if (args.length != 2) {
      System.err.printf("Usage: %s [generic options] <input> <output>\n",
          getClass().getSimpleName());