The SequenceFile format
A sequence file consists of a header followed by one or more records (see Figure 5-2).
The first three bytes of a sequence file are the bytes SEQ, which act as a magic number;
these are followed by a single byte representing the version number. The header contains
other fields, including the names of the key and value classes, compression details,
user-defined metadata, and the sync marker. [48] Recall that the sync marker is used to
allow a reader to synchronize to a record boundary from any position in the file. Each
file has a randomly generated sync marker, whose value is stored in the header. Sync
markers appear between records in the sequence file. They are designed to incur less than
a 1% storage overhead, so they don't necessarily appear between every pair of records
(such is the case for short records).
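As a minimal sketch of the first four header bytes described above, the following plain Java checks the SEQ magic number and returns the version byte. It uses only java.io and does not touch Hadoop's own reader; the class name SeqHeaderProbe, the method readVersion, and the sample version value are hypothetical, chosen only for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper, not part of Hadoop: inspects only the first four bytes.
class SeqHeaderProbe {

    /** Returns the version byte, or throws if the SEQ magic number is missing. */
    static int readVersion(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);
        byte[] magic = new byte[3];
        data.readFully(magic); // throws EOFException if fewer than 3 bytes exist
        if (magic[0] != 'S' || magic[1] != 'E' || magic[2] != 'Q') {
            throw new IOException("Not a sequence file: missing SEQ magic number");
        }
        // The fourth byte is the version number.
        return data.readUnsignedByte();
    }

    public static void main(String[] args) throws IOException {
        // Sample header prefix; the version value here is made up for the demo.
        byte[] sample = {'S', 'E', 'Q', 6};
        System.out.println(readVersion(new ByteArrayInputStream(sample))); // prints 6
    }
}
```

The remaining header fields (class names, compression details, metadata, sync marker) are deliberately not parsed here, since their exact encoding is not spelled out in this section.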
Figure 5-2. The internal structure of a sequence file with no compression and with record compression
The internal format of the records depends on whether compression is enabled, and if it is,
whether it is record compression or block compression.
If no compression is enabled (the default), each record is made up of the record length (in
bytes), the key length, the key, and then the value. The length fields are written as 4-byte
integers adhering to the contract of the writeInt() method of
java.io.DataOutput. Keys and values are serialized using the Serialization
defined for the class being written to the sequence file.
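The uncompressed record layout above can be sketched in plain Java with DataOutputStream, whose writeInt() writes a 4-byte big-endian integer as the java.io.DataOutput contract requires. This is an illustration, not Hadoop's writer: the class and method names are hypothetical, raw byte arrays stand in for keys and values that would really be produced by the configured Serialization, and the record length is assumed to count the key bytes plus the value bytes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of one uncompressed record; not Hadoop's SequenceFile code.
class RecordLayoutSketch {

    /** Encodes: record length, key length, key bytes, value bytes. */
    static byte[] writeRecord(byte[] key, byte[] value) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.writeInt(key.length + value.length); // assumption: key + value bytes
        out.writeInt(key.length);                // key length
        out.write(key);                          // already-serialized key
        out.write(value);                        // already-serialized value
        out.flush();
        return buf.toByteArray();
    }

    /** Decodes a record written by writeRecord into {key, value}. */
    static byte[][] readRecord(byte[] record) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(record));
        int recordLength = in.readInt();
        int keyLength = in.readInt();
        byte[] key = new byte[keyLength];
        in.readFully(key);
        // The value length is implied: whatever remains after the key.
        byte[] value = new byte[recordLength - keyLength];
        in.readFully(value);
        return new byte[][] {key, value};
    }

    public static void main(String[] args) throws IOException {
        byte[] key = "k1".getBytes(StandardCharsets.UTF_8);
        byte[] value = "hello".getBytes(StandardCharsets.UTF_8);
        byte[] record = writeRecord(key, value);
        System.out.println(record.length); // prints 15 (4 + 4 + 2 + 5)
        byte[][] parts = readRecord(record);
        System.out.println(new String(parts[1], StandardCharsets.UTF_8)); // prints hello
    }
}
```

Because only the key length is stored explicitly, the reader recovers the value length by subtraction, which is why the sketch needs both length fields before it can split the payload.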