if (job == null) {
    return -1;
}
job.setInputFormatClass(WholeFileInputFormat.class);
job.setOutputFormatClass(SequenceFileOutputFormat.class);

job.setOutputKeyClass(Text.class);
job.setOutputValueClass(BytesWritable.class);

job.setMapperClass(SequenceFileMapper.class);

return job.waitForCompletion(true) ? 0 : 1;
}

public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(new SmallFilesToSequenceFileConverter(), args);
    System.exit(exitCode);
}
}
Because the input format is a WholeFileInputFormat, the mapper only has to find the filename for the input file split. It does this by casting the InputSplit from the context to a FileSplit, which has a method to retrieve the file path. The path is stored in a Text object for the key. The reducer is the identity (not explicitly set), and the output format is a SequenceFileOutputFormat.
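Based on that description, the SequenceFileMapper could be sketched roughly as follows. This is an assumption built on the standard Hadoop Mapper API, not the book's exact listing; in particular, resolving the path once in setup() rather than in map() is a design choice that assumes WholeFileInputFormat emits a single record per split:

```java
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Sketch: maps each whole file to a (file path, file bytes) pair.
static class SequenceFileMapper
        extends Mapper<NullWritable, BytesWritable, Text, BytesWritable> {

    private Text filenameKey;

    @Override
    protected void setup(Context context)
            throws IOException, InterruptedException {
        // Cast the generic InputSplit to a FileSplit to get at the file path.
        InputSplit split = context.getInputSplit();
        Path path = ((FileSplit) split).getPath();
        filenameKey = new Text(path.toString());
    }

    @Override
    protected void map(NullWritable key, BytesWritable value, Context context)
            throws IOException, InterruptedException {
        // The value is the entire file's contents; the key is its path.
        context.write(filenameKey, value);
    }
}
```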
Here's a run on a few small files. We've chosen to use two reducers, so we get two output
sequence files:
% hadoop jar hadoop-examples.jar SmallFilesToSequenceFileConverter \
-conf conf/hadoop-localhost.xml -D mapreduce.job.reduces=2 \
input/smallfiles output
Two part files are created, each of which is a sequence file. We can inspect these with the -text option to the filesystem shell:
% hadoop fs -conf conf/hadoop-localhost.xml -text output/part-r-00000
hdfs://localhost/user/tom/input/smallfiles/a 61 61 61 61 61
61 61 61 61 61
hdfs://localhost/user/tom/input/smallfiles/c 63 63 63 63 63
63 63 63 63 63
hdfs://localhost/user/tom/input/smallfiles/e
% hadoop fs -conf conf/hadoop-localhost.xml -text output/part-r-00001
hdfs://localhost/user/tom/input/smallfiles/b 62 62 62 62 62
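One detail worth noting about this listing: -text renders each BytesWritable value as space-separated hex bytes, which is why file a's run of 'a' characters appears as 61 61 61 ... (0x61 is ASCII 'a'), and file e shows no value bytes, consistent with it being empty. A minimal illustration of that rendering (the HexDump class and toHex method are hypothetical names, not part of Hadoop):

```java
public class HexDump {
    // Render a byte array the way the listing above shows values:
    // each byte as a two-digit lowercase hex value, space-separated.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // File "a" held repeated 'a' characters, hence the 61 61 ... output.
        System.out.println(toHex("aaaaa".getBytes()));
    }
}
```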