Extracting Value From Big Data: In-Memory Solutions, Real Time Analytics, And Recommendation Systems - Big Data Imperatives


The SequenceFile format seems well suited for storing log messages and processing them with map-reduce jobs, but direct access to specific log messages is very slow. The API for reading data from a SequenceFile is iterator based, so the reader has to step from entry to entry until it reaches the target entry. Because one of the most important use cases is searching for log messages in real time, slow random access performance is a showstopper.
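To make the cost of iterator-style access concrete, here is a minimal Python sketch. It does not use the Hadoop API or the real SequenceFile on-disk layout; it assumes a simplified length-prefixed record format, and the helper names (`write_records`, `iter_records`, `find_by_scan`) are hypothetical. The point is only that a lookup has to read every record that precedes the target.

```python
import io
import struct

def write_records(stream, records):
    """Append (key, value) byte pairs as length-prefixed records."""
    for key, value in records:
        stream.write(struct.pack(">II", len(key), len(value)))
        stream.write(key)
        stream.write(value)

def iter_records(stream):
    """Yield (key, value) pairs in file order, like an iterator-style reader."""
    stream.seek(0)
    while True:
        header = stream.read(8)
        if len(header) < 8:
            return  # end of file
        key_len, value_len = struct.unpack(">II", header)
        yield stream.read(key_len), stream.read(value_len)

def find_by_scan(stream, target_key):
    """Return (value, records_read); every earlier record is read first."""
    records_read = 0
    for key, value in iter_records(stream):
        records_read += 1
        if key == target_key:
            return value, records_read
    return None, records_read

# Assumed sample data: 1000 log entries keyed by a zero-padded sequence number.
log_file = io.BytesIO()
write_records(log_file, [(b"%06d" % i, b"log message %d" % i) for i in range(1000)])
value, records_read = find_by_scan(log_file, b"000899")
```

In this sketch, finding the entry with key `000899` touches 900 records, and the work grows linearly with the position of the target in the file.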

In contrast to a SequenceFile, a MapFile consists of two files: the index file stores seek positions for every n-th key in the data file, and the data file stores the data as binary key-value pairs. The drawback of MapFiles is that every random access has to read from two separate files. This sounds slow, but the index, which stores the seek positions for log entries, is small enough to be cached in memory (Figure 8-10). Once the seek position is identified, only the relevant portion of the data file is read.

Figure 8-10. Index and data mapping
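The index-plus-data layout can be sketched in a few lines of Python. This is an illustration of the idea, not the Hadoop MapFile implementation: the record layout, the function names, and the interval of 128 are assumptions chosen for the example. Keys must be written in sorted order, the index keeps one entry per `INDEX_INTERVAL` records in memory, and a lookup uses binary search on the index before reading a short run of the data file.

```python
import bisect
import io
import struct

INDEX_INTERVAL = 128  # illustrative value of n: one index entry per 128 records

def build_map_file(sorted_records):
    """Write sorted (key, value) pairs; return (data, index_keys, index_offsets)."""
    data = io.BytesIO()
    index_keys, index_offsets = [], []
    for i, (key, value) in enumerate(sorted_records):
        if i % INDEX_INTERVAL == 0:
            index_keys.append(key)            # key of every n-th record
            index_offsets.append(data.tell()) # its seek position in the data file
        data.write(struct.pack(">II", len(key), len(value)))
        data.write(key)
        data.write(value)
    return data, index_keys, index_offsets

def lookup(data, index_keys, index_offsets, target_key):
    """Return (value, records_read), seeking via the in-memory index."""
    slot = bisect.bisect_right(index_keys, target_key) - 1
    if slot < 0:
        return None, 0  # target sorts before the first key
    data.seek(index_offsets[slot])
    records_read = 0
    while True:
        header = data.read(8)
        if len(header) < 8:
            return None, records_read  # reached end of data file
        key_len, value_len = struct.unpack(">II", header)
        key = data.read(key_len)
        value = data.read(value_len)
        records_read += 1
        if key == target_key:
            return value, records_read
        if key > target_key:
            return None, records_read  # passed the position where it would be

records = [(b"%06d" % i, b"log message %d" % i) for i in range(1000)]
data, index_keys, index_offsets = build_map_file(records)
value, records_read = lookup(data, index_keys, index_offsets, b"000899")
```

With these assumed numbers the index holds 8 entries for 1000 records, and the lookup for key `000899` reads 4 records from the data file, because the nearest indexed key at or before the target is `000896`.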

Since MapFiles and SequenceFiles store binary key-value pairs, we need a data format for storing log messages in these files. To search for log messages efficiently, we need to store the data fields as separate entities. Google protocol buffers are well suited to transferring and storing log messages. Protocol buffers are a format for encoding structured data.
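As a rough illustration of what "separate entities" means at the byte level, the sketch below hand-encodes a log message following the documented protocol buffer wire format: each field is written as a tag, `(field_number << 3) | wire_type`, followed by either a varint (wire type 0) or a length-prefixed payload (wire type 2). The schema (field 1 = `timestamp_ms`, field 2 = `level`, field 3 = `text`) is hypothetical, and in practice you would use classes generated by the protocol buffer compiler rather than encoding by hand.

```python
def encode_varint(value):
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        low_bits = value & 0x7F
        value >>= 7
        if value:
            out.append(low_bits | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(low_bits)
            return bytes(out)

def encode_uint_field(field_number, value):
    """Wire type 0: tag followed by a varint value."""
    return encode_varint((field_number << 3) | 0) + encode_varint(value)

def encode_string_field(field_number, text):
    """Wire type 2: tag, payload length, then the UTF-8 payload."""
    payload = text.encode("utf-8")
    return (encode_varint((field_number << 3) | 2)
            + encode_varint(len(payload))
            + payload)

def encode_log_message(timestamp_ms, level, text):
    # Hypothetical schema: 1 = timestamp_ms, 2 = level, 3 = text
    return (encode_uint_field(1, timestamp_ms)
            + encode_string_field(2, level)
            + encode_string_field(3, text))

encoded = encode_log_message(150, "WARN", "disk full")
```

Because every field carries its own number, a reader can pick out the timestamp or the level without parsing the message text, which is what makes field-level search practical.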

The most important reasons for choosing the Google protocol buffer format are listed below:

• Speed: Deserialization speed is one of the most important factors when evaluating file formats. Map-reduce jobs that crunch through the whole data set stored in HDFS rely especially on fast object deserialization. Protocol buffers are among the fastest serialization frameworks. Object deserialization with protocol buffers is sixteen times faster than with pure Java serialization.

• Size: The ability to store billions of serialized objects is another key factor. Protocol buffers produce serialized objects that are around four times smaller than those produced by standard Java serialization.

• Migrations: One unique feature of protocol buffers is the ability to change the file format without losing backward compatibility. It is possible to add or remove fields from an object without breaking working implementations. This is a very important feature when serializing objects for long-term storage.
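The migration point can be illustrated with a small Python sketch of a reader that skips field numbers it does not know. It follows the documented wire format for varint (wire type 0) and length-delimited (wire type 2) fields only; the schema, the field names, and the `decode_log_message_v1` function are hypothetical, and a real application would rely on generated protocol buffer classes, which handle unknown fields for you.

```python
def encode_varint(value):
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        low_bits = value & 0x7F
        value >>= 7
        if value:
            out.append(low_bits | 0x80)
        else:
            out.append(low_bits)
            return bytes(out)

def read_varint(buf, pos):
    """Return (value, next_pos) for the varint starting at buf[pos]."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7

def parse_fields(buf):
    """Yield (field_number, value) for varint and length-delimited fields."""
    pos = 0
    while pos < len(buf):
        tag, pos = read_varint(buf, pos)
        field_number, wire_type = tag >> 3, tag & 0x07
        if wire_type == 0:
            value, pos = read_varint(buf, pos)
        elif wire_type == 2:
            length, pos = read_varint(buf, pos)
            value = buf[pos:pos + length]
            pos += length
        else:
            raise ValueError("wire type %d not handled in this sketch" % wire_type)
        yield field_number, value

def decode_log_message_v1(buf):
    """Old reader: knows fields 1 and 2 only; other field numbers are skipped."""
    message = {"timestamp_ms": 0, "text": ""}  # defaults for absent fields
    for field_number, value in parse_fields(buf):
        if field_number == 1:
            message["timestamp_ms"] = value
        elif field_number == 2:
            message["text"] = value.decode("utf-8")
    return message

# A newer writer adds a hypothetical field 3 (host name) to the message.
v2_bytes = (encode_varint((1 << 3) | 0) + encode_varint(150)
            + encode_varint((2 << 3) | 2) + encode_varint(9) + b"disk full"
            + encode_varint((3 << 3) | 2) + encode_varint(6) + b"node-7")
decoded = decode_log_message_v1(v2_bytes)
```

The old reader still recovers the timestamp and text from the newer message, and a message written without field 2 simply decodes with the default empty text, which is the behavior that makes long-term storage of serialized objects manageable.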
