Log data is generated continuously, so it needs to be captured the moment it is
generated; modifications to log data can happen but are very rare. In an
environment where hundreds or thousands of log messages are stored every second,
this task can be difficult to accomplish. It quickly becomes necessary to partition
the database schema to distribute the load across multiple hard disks or even servers.
A time-based partitioning scheme does not work well for log data, because most of
the data is inserted into the partition that holds the current date. To distribute the
load, another scheme is needed, for example partitioning by application, as sketched
below. But then we have to deal with the problem that different applications
produce different amounts of log data. Finding a balanced partitioning scheme is a
challenge, and partitioning schemes may change over time.
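To make the contrast concrete, here is a minimal sketch of the two routing strategies discussed above; the partition count and application name are illustrative, not from the original text. Time-based routing funnels every message written "now" into the current day's partition, while hashing the application name spreads concurrent inserts, but only as evenly as the applications' message volumes allow.

import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneOffset;

public class PartitionRouting {
    static final int NUM_PARTITIONS = 8; // illustrative partition count

    // Time-based routing: every message written "now" maps to the
    // partition for the current day, so one partition takes all the load.
    static int timePartition(long timestampMillis) {
        LocalDate day = Instant.ofEpochMilli(timestampMillis)
                .atZone(ZoneOffset.UTC).toLocalDate();
        return (int) (day.toEpochDay() % NUM_PARTITIONS);
    }

    // Application-based routing: hashing the application name spreads
    // inserts across partitions, but a very chatty application still
    // concentrates its entire load on a single partition.
    static int appPartition(String appName) {
        return (appName.hashCode() & Integer.MAX_VALUE) % NUM_PARTITIONS;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        System.out.println("time-based: " + timePartition(now));
        System.out.println("app-based : " + appPartition("billing-service"));
    }
}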
To develop a real-time log management system, our first task is to find a file format
that allows fast, direct access to a single log message while storing hundreds of
thousands of log messages in a single file. Hadoop/HDFS could be the solution we are
looking for (Figure 8-9).
Figure 8-9. Log processing, Hadoop, and search conceptual architecture
Hadoop provides two file formats for grouping multiple entries in a single file:
SequenceFile: A flat file that stores binary key-value pairs. The
output of MapReduce jobs is usually written as a SequenceFile.
MapFile: Consists of two SequenceFiles. The data file is identical
to a SequenceFile and contains the data stored as binary
key-value pairs. The second file is an index file containing a
key-value map of seek positions inside the data file, used to
access the data quickly (see the sketch after this list).
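As a concrete illustration, the following sketch writes a few log messages into a MapFile and then uses its index to look one up directly, which is exactly the fast single-message access described above. It uses the classic org.apache.hadoop.io.MapFile API; the HDFS path, key scheme, and log messages are made up for the example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class LogMapFileExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        String dir = "/logs/demo.map"; // hypothetical HDFS directory

        // Write: MapFile requires keys in ascending order, so a
        // monotonically increasing message id suits append-only log data.
        MapFile.Writer writer = new MapFile.Writer(conf, fs, dir,
                LongWritable.class, Text.class);
        try {
            writer.append(new LongWritable(1L), new Text("app1 INFO service started"));
            writer.append(new LongWritable(2L), new Text("app2 WARN disk 90% full"));
        } finally {
            IOUtils.closeStream(writer);
        }

        // Read: the index file is consulted to seek straight to the
        // entry instead of scanning the whole data file.
        MapFile.Reader reader = new MapFile.Reader(fs, dir, conf);
        try {
            Text value = new Text();
            reader.get(new LongWritable(2L), value);
            System.out.println(value); // prints "app2 WARN disk 90% full"
        } finally {
            IOUtils.closeStream(reader);
        }
    }
}

Because the index maps keys to seek positions in the data file, a lookup touches only a small portion of a file holding hundreds of thousands of messages, rather than reading it sequentially.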