Log data is generated continuously, so it needs to be captured the moment it is
generated; modifications to log data can happen but are very rare. In an
environment where hundreds or thousands of log messages are stored every second,
this task can be difficult to accomplish. It quickly becomes necessary to partition
the database schema to distribute the load across multiple hard disks or even servers.
A time-based partitioning scheme does not work well for log data, because most of
the data is inserted into the partition that holds the current date. To distribute the
load, another scheme is needed, for example partitioning by application, as sketched
below. But then we have to deal with the problem that different applications
produce different amounts of log data. Finding a balanced partitioning scheme is a
challenge, and partitioning schemes may change over time.
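To make the contrast concrete, here is a minimal sketch of the two routing strategies discussed above; the partition count and application name are illustrative, not from the original text. Time-based routing funnels every message written "now" into the current day's partition, while hashing the application name spreads concurrent inserts, but only as evenly as the applications' message volumes allow.

import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneOffset;

public class PartitionRouting {
    static final int NUM_PARTITIONS = 8; // illustrative partition count

    // Time-based routing: every message written "now" maps to the
    // partition for the current day, so one partition takes all the load.
    static int timePartition(long timestampMillis) {
        LocalDate day = Instant.ofEpochMilli(timestampMillis)
                .atZone(ZoneOffset.UTC).toLocalDate();
        return (int) (day.toEpochDay() % NUM_PARTITIONS);
    }

    // Application-based routing: hashing the application name spreads
    // inserts across partitions, but a very chatty application still
    // concentrates its entire load on a single partition.
    static int appPartition(String appName) {
        return (appName.hashCode() & Integer.MAX_VALUE) % NUM_PARTITIONS;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        System.out.println("time-based: " + timePartition(now));
        System.out.println("app-based : " + appPartition("billing-service"));
    }
}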
To develop a real-time log management system, our first task is to find a file format
that allows fast, direct access to a single log message while storing hundreds of
thousands of log messages in a single file. Hadoop/HDFS could be the solution we are
looking for (Figure 8-9).
Figure 8-9. Log processing, Hadoop, and search conceptual architecture
Hadoop provides two file formats for grouping multiple entries in a single file:
SequenceFile: A flat file that stores binary key-value pairs. The
output of MapReduce jobs is usually written as a SequenceFile.
MapFile: Consists of two SequenceFiles. The data file is identical
to a SequenceFile and contains the data stored as binary
key-value pairs. The second file is an index file containing a
key-value map of seek positions inside the data file, used to
access the data quickly (see the sketch after this list).
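As a concrete illustration, the following sketch writes a few log messages into a MapFile and then uses its index to look one up directly, which is exactly the fast single-message access described above. It uses the classic org.apache.hadoop.io.MapFile API; the HDFS path, key scheme, and log messages are made up for the example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class LogMapFileExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        String dir = "/logs/demo.map"; // hypothetical HDFS directory

        // Write: MapFile requires keys in ascending order, so a
        // monotonically increasing message id suits append-only log data.
        MapFile.Writer writer = new MapFile.Writer(conf, fs, dir,
                LongWritable.class, Text.class);
        try {
            writer.append(new LongWritable(1L), new Text("app1 INFO service started"));
            writer.append(new LongWritable(2L), new Text("app2 WARN disk 90% full"));
        } finally {
            IOUtils.closeStream(writer);
        }

        // Read: the index file is consulted to seek straight to the
        // entry instead of scanning the whole data file.
        MapFile.Reader reader = new MapFile.Reader(fs, dir, conf);
        try {
            Text value = new Text();
            reader.get(new LongWritable(2L), value);
            System.out.println(value); // prints "app2 WARN disk 90% full"
        } finally {
            IOUtils.closeStream(reader);
        }
    }
}

Because the index maps keys to seek positions in the data file, a lookup touches only a small portion of a file holding hundreds of thousands of messages, rather than reading it sequentially.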