A typical search requirement is that 95 percent of all search queries should display their
results in less than 10 seconds. Such a requirement is difficult to meet with a pure
Hadoop solution. In Hadoop we use map-reduce jobs to retrieve data, and a map-reduce
job that has to read all the data may run for hours. The only way to search in real time
is to build a search index on the stored data. Lucene, a specialized search framework,
is a very good partner for Hadoop. It is implemented in Java, which makes integration
with Java-based applications straightforward. Lucene is also highly scalable and has a
powerful query syntax.
Now let's look at how our solution works!
Lucene is able to distinguish multiple indexed fields in a single document. Log data
can be split into distinct fields such as the timestamp of the message, the log level, and the
message text itself. A Lucene index consists of documents; each document has a number of
fields, and the contents of a field consist of one or more terms. The number of unique terms
is one criterion for the memory requirements of an index.
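As a minimal sketch, a log message could be turned into such a document as follows. This assumes a recent Lucene release with the StringField/TextField API; the class and field names are illustrative, not taken from the original code.

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;

    public class LogDocumentFactory {

        // Builds one Lucene document per log message.
        public static Document toDocument(String timestamp, String level,
                                          String message) {
            Document doc = new Document();
            // Exact-match fields: indexed as a single term each.
            doc.add(new StringField("timestamp", timestamp, Field.Store.YES));
            doc.add(new StringField("level", level, Field.Store.YES));
            // Full-text field: tokenized into terms; every unique term
            // contributes to the memory footprint of the index.
            doc.add(new TextField("message", message, Field.Store.NO));
            return doc;
        }
    }

The message field is tokenized so that individual words are searchable, while timestamp and log level stay single terms suited to exact filtering.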
Each document needs a primary key field, which specifies how the document
can be retrieved. The primary key field contains the full HDFS path of the
MapFile that holds the log message, followed by the index of the log message inside
that MapFile (Figure 8-11). This enables us to access the referenced log message directly.
Figure 8-11. Document indexing example
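The lookup side could look like the following sketch. It assumes a primary key formatted as <MapFile path>#<record index>, a MapFile keyed by a LongWritable record number, and the Hadoop 2.x MapFile.Reader constructor; all of these details are assumptions for illustration.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.MapFile;
    import org.apache.hadoop.io.Text;

    public class LogLookup {

        // Resolves a primary key such as
        // "hdfs://namenode/logs/2013-05-01/part-00000#42"
        // back to the referenced log message.
        public static String lookup(String primaryKey) throws Exception {
            int sep = primaryKey.lastIndexOf('#');
            Path mapFile = new Path(primaryKey.substring(0, sep));
            long recordIndex = Long.parseLong(primaryKey.substring(sep + 1));

            MapFile.Reader reader =
                new MapFile.Reader(mapFile, new Configuration());
            try {
                Text message = new Text();
                // Assumption: the MapFile is keyed by the record number.
                reader.get(new LongWritable(recordIndex), message);
                return message.toString();
            } finally {
                reader.close();
            }
        }
    }

Because the MapFile keeps an index of its keys, this lookup seeks close to the record instead of scanning the whole file, which is what makes the direct access fast.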
Lucene can build up a full-text index of rather large files. However, one must pay
attention to the memory requirements, which depend heavily on the number of indexed
fields, the types of those fields, and whether the contents of a field have to be stored
in Lucene or not.
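To illustrate the trade-off, again under the assumed modern field API: an indexed-but-unstored field costs term memory, while a stored-but-unindexed field inflates the index files instead. The field names here are again hypothetical.

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StoredField;
    import org.apache.lucene.document.TextField;

    public class FieldChoices {

        public static Document build(String logLine) {
            Document doc = new Document();
            // Indexed but not stored: searchable; only the terms consume
            // memory, and the original text is fetched from HDFS via the
            // primary key.
            doc.add(new TextField("message", logLine, Field.Store.NO));
            // Stored but not indexed: returned with a search hit, but not
            // searchable; adds to the on-disk size of the index instead.
            doc.add(new StoredField("rawMessage", logLine));
            return doc;
        }
    }

In this solution the original messages already live in HDFS, so indexing without storing keeps the Lucene index as small as possible.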