Hadoop I/O - Hadoop: The Definitive Guide

95 One, two, buckle my shoe
94 Three, four, shut the door
93 Five, six, pick up sticks
92 Seven, eight, lay them straight
91 Nine, ten, a big fat hen

Sorting and merging SequenceFiles

The most powerful way of sorting (and merging) one or more sequence files is to use MapReduce. MapReduce is inherently parallel and will let you specify the number of reducers to use, which determines the number of output partitions. For example, by specifying one reducer, you get a single output file. We can use the sort example that comes with Hadoop by specifying that the input and output are sequence files and by setting the key and value types:

% hadoop jar \
  $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
  sort -r 1 \
  -inFormat org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat \
  -outFormat org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat \
  -outKey org.apache.hadoop.io.IntWritable \
  -outValue org.apache.hadoop.io.Text \
  numbers.seq sorted

% hadoop fs -text sorted/part-r-00000 | head
1 Nine, ten, a big fat hen
2 Seven, eight, lay them straight
3 Five, six, pick up sticks
4 Three, four, shut the door
5 One, two, buckle my shoe
6 Nine, ten, a big fat hen
7 Seven, eight, lay them straight
8 Five, six, pick up sticks
9 Three, four, shut the door
10 One, two, buckle my shoe

Sorting is covered in more detail in "Sorting".
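To make the link between the reducer count and the number of output files concrete, here is a minimal Java sketch that uses only the JDK. It assumes the job assigns keys to reducers by hash partitioning (the arithmetic below mirrors Hadoop's default HashPartitioner), and the class and method names (PartitionSketch, partitionFor, assign) are hypothetical, chosen for illustration only. Plain Integer keys stand in for IntWritable keys.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class PartitionSketch {

    // Non-negative hash of the key, modulo the number of reducers.
    static int partitionFor(Object key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Groups keys under the name of the output file their partition would
    // produce, and sorts the keys within each partition.
    static Map<String, List<Integer>> assign(int[] keys, int numReducers) {
        Map<String, List<Integer>> files = new TreeMap<>();
        for (int key : keys) {
            String name = String.format("part-r-%05d", partitionFor(key, numReducers));
            files.computeIfAbsent(name, n -> new ArrayList<>()).add(key);
        }
        for (List<Integer> partition : files.values()) {
            Collections.sort(partition);
        }
        return files;
    }

    public static void main(String[] args) {
        int[] keys = {5, 3, 1, 4, 2, 6};
        // One reducer: every key lands in a single, fully sorted file.
        System.out.println(assign(keys, 1));
        // prints {part-r-00000=[1, 2, 3, 4, 5, 6]}
        // Three reducers: three files, each sorted only within itself.
        System.out.println(assign(keys, 3));
        // prints {part-r-00000=[3, 6], part-r-00001=[1, 4], part-r-00002=[2, 5]}
    }
}
```

Under this assumption, a single reducer is the simplest way to obtain one globally sorted file, at the cost of giving up parallelism in the reduce phase.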

An alternative to using MapReduce for sort/merge is the SequenceFile.Sorter class, which has a number of sort() and merge() methods. These methods predate MapReduce and are lower level (for example, to get parallelism, you need to partition your data manually), so in general MapReduce is the preferred approach for sorting and merging sequence files.
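For comparison, a minimal sketch of the SequenceFile.Sorter route might look like the following. It assumes the Hadoop client libraries are on the classpath and that the input holds IntWritable keys and Text values, as numbers.seq does here. The class name SequenceFileSorterSketch, the helper sortFiles, and the output name sorted.seq are hypothetical, chosen for illustration only.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileSorterSketch {

    // Sorts one or more sequence files by key into a single output file.
    // The sort runs inside this one process; nothing is parallelized.
    static void sortFiles(Configuration conf, Path[] inputs, Path output)
            throws IOException {
        FileSystem fs = output.getFileSystem(conf);
        SequenceFile.Sorter sorter =
            new SequenceFile.Sorter(fs, IntWritable.class, Text.class, conf);
        // The final argument is deleteInput; false keeps the input files.
        sorter.sort(inputs, output, false);
    }

    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        // Assumed paths: numbers.seq must exist, sorted.seq must not.
        sortFiles(conf, new Path[] { new Path("numbers.seq") }, new Path("sorted.seq"));
    }
}
```

Unlike the MapReduce job, this writes a single file rather than a directory of part files. The merge() methods of the same class expect inputs that are already sorted.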
