The format for record compression is almost identical to that for no compression, except
the value bytes are compressed using the codec defined in the header. Note that keys are
not compressed.
Block compression (Figure 5-3) compresses multiple records at once. Because it can take
advantage of similarities between records, it is more compact than record compression and
should generally be preferred. Records are added to a block until it reaches a minimum size
in bytes, defined by the io.seqfile.compress.blocksize property; the default is one million
bytes. A sync marker is written before the start of every block. A block consists of a field
indicating the number of records in the block, followed by four compressed fields: the key
lengths, the keys, the value lengths, and the values.
Figure 5-3. The internal structure of a sequence file with block compression
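The difference between the two schemes can be illustrated with a small model. The following is a minimal Python sketch, not Hadoop's actual on-disk format: it assumes fixed 4-byte length fields, uses zlib as a stand-in codec, and prefixes each compressed field with its size. The function names and SAMPLE_RECORDS are hypothetical and exist only for this illustration.

```python
import struct
import zlib

# Hypothetical sample data: many records whose values resemble one another.
SAMPLE_RECORDS = [
    (f"key-{i:05d}".encode(), f"user={i % 7};status=active;region=eu-west".encode())
    for i in range(1000)
]

def record_compressed_size(records):
    """Model record compression: each value is compressed on its own, keys are not."""
    total = 0
    for key, value in records:
        compressed_value = zlib.compress(value)
        # Assumed layout: record length, key length, raw key, compressed value.
        total += 4 + 4 + len(key) + len(compressed_value)
    return total

def encode_block(records):
    """Model one block: a record count, then four separately compressed fields."""
    key_lengths = b"".join(struct.pack(">I", len(k)) for k, _ in records)
    keys = b"".join(k for k, _ in records)
    value_lengths = b"".join(struct.pack(">I", len(v)) for _, v in records)
    values = b"".join(v for _, v in records)

    out = [struct.pack(">I", len(records))]
    for field in (key_lengths, keys, value_lengths, values):
        compressed = zlib.compress(field)
        # Size prefix (an assumption of this sketch) lets a reader split the fields.
        out.append(struct.pack(">I", len(compressed)))
        out.append(compressed)
    return b"".join(out)

def decode_block(block):
    """Reverse encode_block and return the list of (key, value) pairs."""
    (count,) = struct.unpack_from(">I", block, 0)
    offset = 4
    fields = []
    for _ in range(4):
        (size,) = struct.unpack_from(">I", block, offset)
        offset += 4
        fields.append(zlib.decompress(block[offset:offset + size]))
        offset += size
    key_lengths, keys, value_lengths, values = fields

    records = []
    key_pos = value_pos = 0
    for i in range(count):
        (key_len,) = struct.unpack_from(">I", key_lengths, 4 * i)
        (value_len,) = struct.unpack_from(">I", value_lengths, 4 * i)
        records.append((keys[key_pos:key_pos + key_len],
                        values[value_pos:value_pos + value_len]))
        key_pos += key_len
        value_pos += value_len
    return records
```

Comparing len(encode_block(SAMPLE_RECORDS)) with record_compressed_size(SAMPLE_RECORDS) shows the effect described above: compressing the values together lets the codec exploit repetition across records, whereas short values compressed one at a time gain little.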
MapFile
A MapFile is a sorted SequenceFile with an index to permit lookups by key. The index is
itself a SequenceFile that contains a fraction of the keys in the map (every 128th key, by
default). The idea is that the index can be loaded into memory to provide fast lookups from
the main data file, which is another SequenceFile containing all the map entries in sorted
key order.
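The lookup strategy can be sketched in a few lines. This Python model is an illustration of the sparse-index idea only, not MapFile's implementation: it assumes the data is an in-memory list of (key, value) pairs, and the index records list positions where the real index would record positions in the data file. The names build_index and lookup are hypothetical.

```python
import bisect

INDEX_INTERVAL = 128  # mirrors the default interval mentioned in the text

def build_index(sorted_entries, interval=INDEX_INTERVAL):
    """Keep every interval-th key together with its position in the data."""
    index_keys = []
    index_positions = []
    for position in range(0, len(sorted_entries), interval):
        index_keys.append(sorted_entries[position][0])
        index_positions.append(position)
    return index_keys, index_positions

def lookup(sorted_entries, index, key):
    """Return the value stored under key, or None if the key is absent."""
    index_keys, index_positions = index
    # Binary search the small in-memory index for the last indexed key <= key.
    slot = bisect.bisect_right(index_keys, key) - 1
    if slot < 0:
        return None  # key sorts before the first entry
    # Scan the data forward from the indexed position until we reach or pass key.
    for position in range(index_positions[slot], len(sorted_entries)):
        entry_key, value = sorted_entries[position]
        if entry_key == key:
            return value
        if entry_key > key:
            break
    return None
```

With the default interval, the index holds roughly 1/128 of the keys, and a lookup scans fewer than 128 entries of the data after the binary search. A larger interval shrinks the index at the cost of a longer scan.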
MapFile offers a very similar interface to SequenceFile for reading and writing. The main
thing to be aware of is that when writing with MapFile.Writer, map entries must be added in
key order; otherwise, an IOException is thrown.
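The ordering rule can be modeled with a toy writer. The Python sketch below is not the Hadoop API: SortedWriter is a hypothetical class, Python's IOError stands in for Java's IOException, and accepting a key equal to the previous one is an assumption of this sketch.

```python
class SortedWriter:
    """Toy append-only writer that rejects keys arriving out of order."""

    def __init__(self):
        self.entries = []
        self._last_key = None

    def append(self, key, value):
        # Reject any key that sorts before the most recently appended key.
        if self._last_key is not None and key < self._last_key:
            raise IOError(f"key out of order: {key!r} after {self._last_key!r}")
        self.entries.append((key, value))
        self._last_key = key
```

Checking at append time means the writer never has to sort or buffer entries, and a reader can rely on the file being ordered. Callers with unsorted input need to sort it before writing.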