A typical search requirement is that 95 percent of all search queries should display their
results in less than 10 seconds. Such a requirement is difficult to meet with a pure
Hadoop solution. In Hadoop we use map-reduce jobs to retrieve data, and a map-reduce
job that has to read all the data may run for hours. The only way to search in real time
is to build a search index on the stored data. Lucene, a specialized search framework,
is a very good partner for Hadoop. It is implemented in Java, which makes integration
with Java-based applications straightforward. Lucene is also highly scalable and has a
powerful query syntax.
Now let's look at how our solution works!
Lucene is able to distinguish multiple indexed fields in a single document. Log data
can be split into distinct fields such as the timestamp of the message, the log level, and the
message text itself. A Lucene index consists of documents; each document has a number of
fields, and the contents of a field consist of one or more terms. The number of unique terms
is one criterion for the memory requirements of an index.
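As a minimal sketch, a log message could be turned into such a document as follows. This assumes a recent Lucene release with the StringField/TextField API; the class and field names are illustrative, not taken from the original code.

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;

    public class LogDocumentFactory {

        // Builds one Lucene document per log message.
        public static Document toDocument(String timestamp, String level,
                                          String message) {
            Document doc = new Document();
            // Exact-match fields: indexed as a single term each.
            doc.add(new StringField("timestamp", timestamp, Field.Store.YES));
            doc.add(new StringField("level", level, Field.Store.YES));
            // Full-text field: tokenized into terms; every unique term
            // contributes to the memory footprint of the index.
            doc.add(new TextField("message", message, Field.Store.NO));
            return doc;
        }
    }

The message field is tokenized so that individual words are searchable, while timestamp and log level stay single terms suited to exact filtering.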
Each document needs a primary key field, which specifies how the document
can be retrieved. The primary key field contains the full HDFS path of the
MapFile that holds the log message, followed by the index of the log message inside
that MapFile (Figure 8-11). This enables us to access the referenced log message directly.
Figure 8-11. Document indexing example
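The lookup side could look like the following sketch. It assumes a primary key formatted as <MapFile path>#<record index>, a MapFile keyed by a LongWritable record number, and the Hadoop 2.x MapFile.Reader constructor; all of these details are assumptions for illustration.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.MapFile;
    import org.apache.hadoop.io.Text;

    public class LogLookup {

        // Resolves a primary key such as
        // "hdfs://namenode/logs/2013-05-01/part-00000#42"
        // back to the referenced log message.
        public static String lookup(String primaryKey) throws Exception {
            int sep = primaryKey.lastIndexOf('#');
            Path mapFile = new Path(primaryKey.substring(0, sep));
            long recordIndex = Long.parseLong(primaryKey.substring(sep + 1));

            MapFile.Reader reader =
                new MapFile.Reader(mapFile, new Configuration());
            try {
                Text message = new Text();
                // Assumption: the MapFile is keyed by the record number.
                reader.get(new LongWritable(recordIndex), message);
                return message.toString();
            } finally {
                reader.close();
            }
        }
    }

Because the MapFile keeps an index of its keys, this lookup seeks close to the record instead of scanning the whole file, which is what makes the direct access fast.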
Lucene can build up a full-text index of rather large files. However, one must pay
attention to the memory requirements, which depend heavily on the number of indexed
fields, the types of those fields, and whether the contents of a field have to be stored
in Lucene or not.
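To illustrate the trade-off, again under the assumed modern field API: an indexed-but-unstored field costs term memory, while a stored-but-unindexed field inflates the index files instead. The field names here are again hypothetical.

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StoredField;
    import org.apache.lucene.document.TextField;

    public class FieldChoices {

        public static Document build(String logLine) {
            Document doc = new Document();
            // Indexed but not stored: searchable; only the terms consume
            // memory, and the original text is fetched from HDFS via the
            // primary key.
            doc.add(new TextField("message", logLine, Field.Store.NO));
            // Stored but not indexed: returned with a search hit, but not
            // searchable; adds to the on-disk size of the index instead.
            doc.add(new StoredField("rawMessage", logLine));
            return doc;
        }
    }

In this solution the original messages already live in HDFS, so indexing without storing keeps the Lucene index as small as possible.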