The SequenceFile format is well suited for storing log messages and processing them
with map-reduce jobs, but direct access to specific log messages is very slow. The API
for reading data from a SequenceFile is iterator based, so it is necessary to step from
entry to entry until the target entry is reached. Because one of the most important use
cases is searching for log messages in real time, slow random access performance is a
showstopper.
In contrast to a SequenceFile, a MapFile uses two files: the index file stores seek
positions for every n-th key in the data file, and the data file stores data as binary
key-value pairs. Using MapFiles comes with a disadvantage, however: every random access
needs to read from two separate files. This sounds slow, but the indexes that store the
seek positions for log entries are small enough to be cached in memory (Figure 8-10).
Once the seek position is identified, only the relevant portion of the data file is
read.
Figure 8-10. Index and data mapping
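The index and data mapping described above can be illustrated with a minimal sketch.
It is not the Hadoop MapFile implementation: the record layout, the in-memory byte
buffer standing in for the data file, and the names write_map_file, lookup, and
INDEX_INTERVAL are all assumptions made for illustration. The idea it shows is the
same, though: keep a small sorted index of every n-th key in memory, binary-search it,
seek to the stored position, and scan at most n records.

```python
import bisect
import io
import struct

INDEX_INTERVAL = 4  # hypothetical value: index every 4th key


def write_map_file(records, interval=INDEX_INTERVAL):
    """Serialize sorted (key, value) byte pairs.

    Returns (data_bytes, index), where index lists (key, offset)
    for every interval-th record in the data.
    """
    data = io.BytesIO()
    index = []
    for i, (key, value) in enumerate(records):
        if i % interval == 0:
            index.append((key, data.tell()))
        # Assumed layout: two 4-byte big-endian lengths, then key and value.
        data.write(struct.pack(">II", len(key), len(value)))
        data.write(key)
        data.write(value)
    return data.getvalue(), index


def read_record(stream):
    """Read one record, or return None at end of data."""
    header = stream.read(8)
    if len(header) < 8:
        return None
    key_len, value_len = struct.unpack(">II", header)
    return stream.read(key_len), stream.read(value_len)


def lookup(data, index, target):
    """Return (value, records_read); value is None if the key is absent."""
    keys = [k for k, _ in index]
    # Last indexed key that is <= target.
    pos = bisect.bisect_right(keys, target) - 1
    if pos < 0:
        return None, 0
    stream = io.BytesIO(data)
    stream.seek(index[pos][1])
    records_read = 0
    while True:
        record = read_record(stream)
        if record is None:
            return None, records_read
        records_read += 1
        key, value = record
        if key == target:
            return value, records_read
        if key > target:  # keys are sorted, so the target cannot follow
            return None, records_read


records = [(f"{i:04d}".encode(), f"log message {i}".encode()) for i in range(100)]
data, index = write_map_file(records)
value, records_read = lookup(data, index, b"0042")
print(value, records_read)
```

With 100 records and an interval of 4, the index holds 25 entries, and any lookup reads
at most 4 records from the data instead of stepping through the whole file from the
start.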
Since MapFiles and SequenceFiles store binary key-value pairs, we need a data format
for storing log messages in these files. To search efficiently for log messages, the
data fields need to be stored as separate entities. Google protocol buffers are well
suited to transferring and storing log messages. Protocol buffers are a way of encoding
structured data.
The most important reasons for choosing the Google protocol buffer format are listed
below:
Speed: Deserialization speed is one of the most important factors
when evaluating file formats. Map-reduce jobs that crunch through
the whole data set stored in HDFS rely especially on fast object
deserialization. Protocol buffers are among the fastest
frameworks. Object deserialization with protocol buffers is sixteen
times faster than with pure Java serialization.
Size: The ability to store billions of serialized objects is another
key factor. Protocol buffers produce serialized objects that are
around four times smaller than those produced by standard Java
serialization.
Migrations: One unique feature of protocol buffers is the ability to
change the message format without losing backward compatibility. It
is possible to add or remove fields from an object without breaking
working implementations. This is a very important feature when
serializing objects for long-term storage.
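The migration property can be illustrated with a simplified sketch of the protocol
buffer wire format, in which each field is written as a tag (field number and wire
type) followed by its value. This is not the protobuf library, which would normally be
used through classes generated by protoc. The sketch handles only non-negative integers
(varint wire type 0) and strings (length-delimited wire type 2), and the log message
fields timestamp, message, and host are hypothetical.

```python
VARINT, LENGTH_DELIMITED = 0, 2  # two of the protobuf wire types


def encode_varint(n):
    """Encode a non-negative int as a base-128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(buf, pos):
    """Decode a varint starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def encode_field(number, value):
    """Encode one field: tag, then a varint or length-prefixed UTF-8 string."""
    if isinstance(value, int):
        return encode_varint((number << 3) | VARINT) + encode_varint(value)
    raw = value.encode("utf-8")
    tag = encode_varint((number << 3) | LENGTH_DELIMITED)
    return tag + encode_varint(len(raw)) + raw


def decode_message(buf, known_fields):
    """Decode fields listed in known_fields; skip any other field numbers."""
    pos = 0
    message = {}
    while pos < len(buf):
        tag, pos = decode_varint(buf, pos)
        number, wire_type = tag >> 3, tag & 0x07
        if wire_type == VARINT:
            value, pos = decode_varint(buf, pos)
        elif wire_type == LENGTH_DELIMITED:
            length, pos = decode_varint(buf, pos)
            value = buf[pos:pos + length].decode("utf-8")
            pos += length
        else:
            raise ValueError(f"wire type {wire_type} not handled in this sketch")
        if number in known_fields:
            message[known_fields[number]] = value
    return message


# Hypothetical schema versions: version 2 adds a "host" field.
SCHEMA_V1 = {1: "timestamp", 2: "message"}
SCHEMA_V2 = {1: "timestamp", 2: "message", 3: "host"}

old_bytes = encode_field(1, 1700000000) + encode_field(2, "disk full")
new_bytes = old_bytes + encode_field(3, "web-01")

print(decode_message(new_bytes, SCHEMA_V1))  # old reader, new data
print(decode_message(old_bytes, SCHEMA_V2))  # new reader, old data
```

Because every field carries its own number and wire type, a reader built against the
older schema can step over the added field, and a reader built against the newer schema
simply finds the field missing in old data.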
 