By analyzing typical search queries, you can identify the fields that need to be indexed in order to run 95 percent of all search queries in real time. At a broad level, you need only six fields in the Lucene index:
- timestamp of the log message
- numeric ID of the application that created the log message
- numeric log level
- name of the application server the application is running on
- host name of the server the application is running on
- path in the Hadoop file system where the complete log message can be read
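As an illustration, a log document with these six fields might be assembled as follows. This is only a sketch against a recent Lucene Java API; the field names and field classes are assumptions, not the exact schema used here, and the timestamp handling is refined below.

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.IntPoint;
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.StringField;

public class LogDocumentBuilder {

    // Builds one Lucene document for a single log message using the six fields above.
    public static Document build(long timestampMillis, int applicationId, int logLevel,
                                 String appServer, String hostName, String hdfsPath) {
        Document doc = new Document();
        doc.add(new LongPoint("timestamp", timestampMillis));              // timestamp of the log message
        doc.add(new IntPoint("applicationId", applicationId));             // numeric ID of the application
        doc.add(new IntPoint("logLevel", logLevel));                       // numeric log level
        doc.add(new StringField("appServer", appServer, Field.Store.YES)); // application server name
        doc.add(new StringField("host", hostName, Field.Store.YES));       // host name of the server
        doc.add(new StoredField("hdfsPath", hdfsPath));                    // HDFS path of the complete message
        return doc;
    }
}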
The next step is to optimize the memory requirements for each field. The timestamp field can become tricky to manage, especially if you are capturing it down to the millisecond. This leads to a huge number of unique values in the timestamp field and drives up the memory requirements. On the other hand, search queries specify timestamps only to an accuracy of minutes; the higher accuracy is needed only to sort the search results.
A solution is to split the timestamp field into two separate fields in the Lucene index. One field stores the timestamp rounded to the minute and is indexed. The other field stores the timestamp with full accuracy and is only stored in Lucene, not indexed. With this approach you reduce the number of unique terms that Lucene needs to handle and therefore greatly reduce the impact on memory requirements. Another benefit is better performance when searching for date ranges. The downside is that you need to sort the result set yourself, using the detailed timestamp field, after getting the search results from Lucene.
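A sketch of this approach, again assuming recent Lucene classes and the hypothetical field names timestampMinute and timestampMillis, could look like this: index the minute-granularity value, store the millisecond value, query the coarse field for a range, and sort the hits yourself.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

public class TimestampSplitExample {

    // Indexing: one field rounded to the minute (indexed), one with full precision (stored only).
    static void addTimestamps(Document doc, long timestampMillis) {
        long minute = timestampMillis / 60_000L;                        // truncate to minute granularity
        doc.add(new LongPoint("timestampMinute", minute));              // indexed: few unique terms
        doc.add(new StoredField("timestampMillis", timestampMillis));   // stored only, not indexed
    }

    // Searching: query on the coarse field, then sort the hits by the precise stored value ourselves.
    static List<Document> searchRange(IndexSearcher searcher, long fromMillis, long toMillis)
            throws Exception {
        Query range = LongPoint.newRangeQuery("timestampMinute",
                fromMillis / 60_000L, toMillis / 60_000L);
        TopDocs hits = searcher.search(range, 1000);

        List<Document> results = new ArrayList<>();
        for (ScoreDoc hit : hits.scoreDocs) {
            results.add(searcher.doc(hit.doc));
        }
        results.sort(Comparator.comparingLong(
                (Document d) -> d.getField("timestampMillis").numericValue().longValue()));
        return results;
    }
}

The client-side sort is the price paid for keeping the number of unique indexed terms small.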
To achieve high availability, we can split the Lucene index into smaller parts (shards), each of which can be served by a data node. We can further allocate 6 GB of heap memory to Lucene on each data node, so that every data node is able to serve an index of up to 1 billion documents. Solr is a search platform based on Lucene. It provides a web-based interface to the index, which means we can use simple HTTP/REST requests to index documents, perform queries, and even move an index from one data node to another. Each data node runs a single Solr server that can host multiple Lucene indexes. New log messages are indexed into different shards, so that each index holds approximately the same number of documents. This balances the load across the shards and enables scalability. When a new data node is integrated into the cluster, the index shard on this data node is primarily used for indexing new documents.
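To give an idea of what these HTTP/REST requests look like, the following sketch indexes one log message and runs a query against a hypothetical Solr core named logs on one of the data nodes. The host name, field names, and values are illustrative assumptions carried over from the earlier sketches, and the fields would have to exist in the Solr schema.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SolrHttpExample {

    // Hypothetical Solr core named "logs" running on one of the data nodes.
    private static final String SOLR = "http://datanode1:8983/solr/logs";

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();

        // Index one log message: POST a JSON document to Solr's update handler.
        String doc = "[{\"id\":\"42-1\",\"applicationId\":42,\"logLevel\":3,"
                   + "\"host\":\"appserver07\",\"timestampMinute\":28493760,"
                   + "\"hdfsPath\":\"/logs/app42/part-0001\"}]";
        HttpRequest index = HttpRequest.newBuilder()
                .uri(URI.create(SOLR + "/update?commit=true"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(doc))
                .build();
        System.out.println(client.send(index, HttpResponse.BodyHandlers.ofString()).body());

        // Query: all log-level-3 messages of application 42 within a minute range.
        HttpRequest query = HttpRequest.newBuilder()
                .uri(URI.create(SOLR + "/select?q=applicationId:42+AND+logLevel:3"
                        + "&fq=timestampMinute:%5B28493760+TO+28493820%5D"))
                .build();
        System.out.println(client.send(query, HttpResponse.BodyHandlers.ofString()).body());
    }
}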
For performance reasons, the index data files are stored on the local file system of each data node (Figure 8-12). Each time an index is modified, it is backed up into the Hadoop file system. This makes it possible to quickly redeploy the index onto another data node in case the data node that originally hosted it fails.
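A backup and redeploy step of this kind could be sketched with Hadoop's FileSystem API as follows; the local and HDFS directory paths are illustrative assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class IndexBackup {

    // Copy the local Lucene/Solr index directory into HDFS after the index has been modified.
    public static void backup(String localIndexDir, String hdfsBackupDir) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        fs.copyFromLocalFile(false, true, new Path(localIndexDir), new Path(hdfsBackupDir));
    }

    // Restore the index onto the local disk of another data node after a failure.
    public static void restore(String hdfsBackupDir, String localIndexDir) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        fs.copyToLocalFile(new Path(hdfsBackupDir), new Path(localIndexDir));
    }
}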