95 One, two, buckle my shoe
94 Three, four, shut the door
93 Five, six, pick up sticks
92 Seven, eight, lay them straight
91 Nine, ten, a big fat hen
Sorting and merging SequenceFiles
The most powerful way of sorting (and merging) one or more sequence files is to use
MapReduce. MapReduce is inherently parallel and will let you specify the number of
reducers to use, which determines the number of output partitions. For example, by
specifying one reducer, you get a single output file. We can use the sort example that
comes with Hadoop by specifying that the input and output are sequence files and by
setting the key and value types:
% hadoop jar \
  $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
  sort -r 1 \
  -inFormat org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat \
  -outFormat org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat \
  -outKey org.apache.hadoop.io.IntWritable \
  -outValue org.apache.hadoop.io.Text \
  numbers.seq sorted
% hadoop fs -text sorted/part-r-00000 | head
1 Nine, ten, a big fat hen
2 Seven, eight, lay them straight
3 Five, six, pick up sticks
4 Three, four, shut the door
5 One, two, buckle my shoe
6 Nine, ten, a big fat hen
7 Seven, eight, lay them straight
8 Five, six, pick up sticks
9 Three, four, shut the door
10 One, two, buckle my shoe
Sorting is covered in more detail in Sorting.
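To make the link between the number of reducers and the number of output partitions concrete, here is a minimal Python sketch. It is not Hadoop code. It assumes the sample file holds keys 100 down to 1, cycling through the five lines of the rhyme (consistent with the listings above), and it imitates hash partitioning with a simple modulo on the key. The names sort_with_reducers and records are hypothetical.

```python
from collections import defaultdict

LINES = [
    "One, two, buckle my shoe",
    "Three, four, shut the door",
    "Five, six, pick up sticks",
    "Seven, eight, lay them straight",
    "Nine, ten, a big fat hen",
]

# Assumed contents of the sample file: keys 100 down to 1, values cycling.
records = [(100 - i, LINES[i % 5]) for i in range(100)]


def sort_with_reducers(records, num_reducers):
    """Simulate the shuffle: partition by key, then sort each partition by key.

    Returns one sorted list per reducer, standing in for one output file each.
    """
    partitions = defaultdict(list)
    for key, value in records:
        # Illustrative stand-in for a hash partitioner on integer keys.
        partitions[key % num_reducers].append((key, value))
    return [sorted(partitions[r]) for r in range(num_reducers)]


if __name__ == "__main__":
    # One reducer: a single, totally sorted output.
    (single,) = sort_with_reducers(records, 1)
    for key, value in single[:3]:
        print(key, value)

    # Three reducers: three outputs, each sorted on its own.
    for r, part in enumerate(sort_with_reducers(records, 3)):
        print("partition", r, "holds", len(part), "records")
```

With one reducer the sketch yields a single sorted list, mirroring the single output file in the listing. With several reducers each partition is sorted internally, but concatenating the partitions does not give a globally sorted result.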
An alternative to using MapReduce for sort/merge is the SequenceFile.Sorter
class, which has a number of sort() and merge() methods. These methods predate
MapReduce and are lower level (for example, to get parallelism, you need to
partition your data manually), so in general MapReduce is the preferred
approach for sorting and merging sequence files.
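The manual work this lower-level route implies can be sketched in Python as partition, sort each piece, then merge. This is an illustration of the general sort/merge pattern only, not the SequenceFile.Sorter API. The function names and the chunk count are hypothetical, and threads stand in for whatever separate workers you would actually use.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(records, num_chunks):
    """Manually partition the records into roughly equal slices."""
    size = max(1, -(-len(records) // num_chunks))  # ceiling division, at least 1
    return [records[i:i + size] for i in range(0, len(records), size)]


def sort_and_merge(records, num_chunks=4):
    """Sort each slice independently, then k-way merge the sorted runs."""
    chunks = split_into_chunks(records, num_chunks)
    # Each chunk is sorted on its own worker; threads are only a stand-in here.
    with ThreadPoolExecutor(max_workers=num_chunks) as pool:
        sorted_runs = list(pool.map(sorted, chunks))
    # heapq.merge lazily combines already-sorted inputs into one sorted stream.
    return list(heapq.merge(*sorted_runs))


if __name__ == "__main__":
    sample = [(100 - i, "value-%d" % i) for i in range(10)]
    for key, value in sort_and_merge(sample, num_chunks=3):
        print(key, value)
```

The partitioning, the per-chunk sorting, and the final merge are all written out by hand here, which is the bookkeeping that MapReduce takes care of for you.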