File-Based Data Structures
For some applications, you need a specialized data structure to hold your data. For
MapReduce-based processing, putting each blob of binary data into its own file doesn't
scale, so Hadoop developed a number of higher-level containers for these situations.
SequenceFile
Imagine a logfile where each log record is a new line of text. If you want to log binary
types, plain text isn't a suitable format. Hadoop's SequenceFile class fits the bill in this
situation, providing a persistent data structure for binary key-value pairs. To use it as a
logfile format, you would choose a key, such as a timestamp represented by a LongWritable,
and the value would be a Writable that represents the quantity being logged.
SequenceFiles also work well as containers for smaller files. HDFS and MapReduce
are optimized for large files, so packing files into a SequenceFile makes storing and
processing the smaller files more efficient (Processing a whole file as a record contains a
Writing a SequenceFile
To create a SequenceFile, use one of its createWriter() static methods, which return a
SequenceFile.Writer instance. There are several overloaded versions, but they all require
you to specify a stream to write to (either an FSDataOutputStream or a FileSystem and Path
pairing), a Configuration object, and the key and value types. Optional arguments include
the compression type and codec, a Progressable callback to be informed of write progress,
and a Metadata instance to be stored in the SequenceFile header.
The keys and values stored in a SequenceFile do not necessarily need to be Writables. Any
types that can be serialized and deserialized by a Serialization may be used.
Once you have a SequenceFile.Writer, you then write key-value pairs using the append()
method. When you've finished, you call the close() method (SequenceFile.Writer implements
java.io.Closeable).
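As a rough illustration of the create, append, and close sequence described above, here
is a minimal sketch. It is not the book's own example. The class name SequenceFileLogSketch,
the method writeLog(), and the "event-N" values are hypothetical, and the code assumes the
Hadoop client libraries are on the classpath. It uses the createWriter() overload that takes
a FileSystem and Path pairing, which is marked deprecated in later Hadoop releases, with a
LongWritable timestamp as the key and a Text value standing in for the logged quantity.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Hypothetical sketch class; names and record contents are illustrative only.
public class SequenceFileLogSketch {

    // Writes `count` (timestamp, message) records to a new SequenceFile at `path`
    // and returns the number of records appended.
    public static int writeLog(Configuration conf, Path path, int count) throws IOException {
        FileSystem fs = path.getFileSystem(conf);
        LongWritable key = new LongWritable();   // timestamp key
        Text value = new Text();                 // stand-in for the logged quantity
        SequenceFile.Writer writer = null;
        int written = 0;
        try {
            // FileSystem and Path pairing, Configuration, then key and value types.
            writer = SequenceFile.createWriter(fs, conf, path, key.getClass(), value.getClass());
            long base = System.currentTimeMillis();
            for (int i = 0; i < count; i++) {
                // The Writable instances are reused; append() serializes their current contents.
                key.set(base + i);
                value.set("event-" + i);
                writer.append(key, value);
                written++;
            }
        } finally {
            // SequenceFile.Writer is Closeable; closeStream() tolerates a null writer.
            IOUtils.closeStream(writer);
        }
        return written;
    }

    public static void main(String[] args) throws IOException {
        Path path = new Path(args[0]);
        int n = writeLog(new Configuration(), path, 100);
        System.out.println("Wrote " + n + " records to " + path);
    }
}
```

Reusing one key object and one value object across iterations avoids allocating a new
Writable per record. Closing the writer in a finally block ensures the file is closed even
if an append() call fails.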
Example 5-10 shows a short program to write some key-value pairs to a SequenceFile using
the API just described.
Example 5-10. Writing a SequenceFile