By analyzing typical search queries, you can identify the fields that need to be indexed in order to run 95 percent of all search queries in real time. At a broad level, you need only six fields in the Lucene index:
- timestamp of the log message
- numeric ID of the application that created the log message
- numeric log level
- name of the application server the application is running on
- host name of the server the application is running on
- path in the Hadoop file system where the complete log message can be read
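As an illustration, a log document with these six fields might be assembled as follows. This is only a sketch against a recent Lucene Java API; the field names and field classes are assumptions, not the exact schema used here, and the timestamp handling is refined below.

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.IntPoint;
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.StringField;

public class LogDocumentBuilder {

    // Builds one Lucene document for a single log message using the six fields above.
    public static Document build(long timestampMillis, int applicationId, int logLevel,
                                 String appServer, String hostName, String hdfsPath) {
        Document doc = new Document();
        doc.add(new LongPoint("timestamp", timestampMillis));              // timestamp of the log message
        doc.add(new IntPoint("applicationId", applicationId));             // numeric ID of the application
        doc.add(new IntPoint("logLevel", logLevel));                       // numeric log level
        doc.add(new StringField("appServer", appServer, Field.Store.YES)); // application server name
        doc.add(new StringField("host", hostName, Field.Store.YES));       // host name of the server
        doc.add(new StoredField("hdfsPath", hdfsPath));                    // HDFS path of the complete message
        return doc;
    }
}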
The next step is to optimize the memory requirements for each field. The timestamp field can become tricky to manage, especially if you are capturing it down to the millisecond. This leads to a huge number of unique values in the timestamp field and drives up the memory requirements. On the other hand, search queries specify timestamps only to an accuracy of minutes; the higher accuracy is needed only to sort the search results.
A solution is to split the timestamp field into two separate fields in the Lucene index. One field stores the timestamp rounded to the minute and is indexed. The other field stores the timestamp with full accuracy and is only stored in Lucene, not indexed. With this approach you reduce the number of unique terms that Lucene needs to handle and therefore greatly reduce the impact on memory requirements. Another benefit is better performance when searching for date ranges. The downside is that you need to sort the result set yourself, using the detailed timestamp field, after getting the search results from Lucene.
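A sketch of this approach, again assuming recent Lucene classes and the hypothetical field names timestampMinute and timestampMillis, could look like this: index the minute-granularity value, store the millisecond value, query the coarse field for a range, and sort the hits yourself.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

public class TimestampSplitExample {

    // Indexing: one field rounded to the minute (indexed), one with full precision (stored only).
    static void addTimestamps(Document doc, long timestampMillis) {
        long minute = timestampMillis / 60_000L;                        // truncate to minute granularity
        doc.add(new LongPoint("timestampMinute", minute));              // indexed: few unique terms
        doc.add(new StoredField("timestampMillis", timestampMillis));   // stored only, not indexed
    }

    // Searching: query on the coarse field, then sort the hits by the precise stored value ourselves.
    static List<Document> searchRange(IndexSearcher searcher, long fromMillis, long toMillis)
            throws Exception {
        Query range = LongPoint.newRangeQuery("timestampMinute",
                fromMillis / 60_000L, toMillis / 60_000L);
        TopDocs hits = searcher.search(range, 1000);

        List<Document> results = new ArrayList<>();
        for (ScoreDoc hit : hits.scoreDocs) {
            results.add(searcher.doc(hit.doc));
        }
        results.sort(Comparator.comparingLong(
                (Document d) -> d.getField("timestampMillis").numericValue().longValue()));
        return results;
    }
}

The client-side sort is the price paid for keeping the number of unique indexed terms small.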
To achieve high availability, we can split the Lucene index into smaller parts (shards), each of which can be served by a data node. We can further allocate 6 GB of heap memory to Lucene on each data node, so that every data node is able to serve an index of up to 1 billion documents. Solr is a search platform based on Lucene. It provides a web-based interface to the index, which means we can use simple HTTP/REST requests to index documents, perform queries, and even move an index from one data node to another. Each data node runs a single Solr server that can host multiple Lucene indexes. New log messages are indexed into different shards, so that each index holds approximately the same number of documents. This balances the load across the shards and enables scalability. When a new data node is integrated into the cluster, the index shard on this data node is primarily used for indexing new documents.
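To give an idea of what these HTTP/REST requests look like, the following sketch indexes one log message and runs a query against a hypothetical Solr core named logs on one of the data nodes. The host name, field names, and values are illustrative assumptions carried over from the earlier sketches, and the fields would have to exist in the Solr schema.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SolrHttpExample {

    // Hypothetical Solr core named "logs" running on one of the data nodes.
    private static final String SOLR = "http://datanode1:8983/solr/logs";

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();

        // Index one log message: POST a JSON document to Solr's update handler.
        String doc = "[{\"id\":\"42-1\",\"applicationId\":42,\"logLevel\":3,"
                   + "\"host\":\"appserver07\",\"timestampMinute\":28493760,"
                   + "\"hdfsPath\":\"/logs/app42/part-0001\"}]";
        HttpRequest index = HttpRequest.newBuilder()
                .uri(URI.create(SOLR + "/update?commit=true"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(doc))
                .build();
        System.out.println(client.send(index, HttpResponse.BodyHandlers.ofString()).body());

        // Query: all log-level-3 messages of application 42 within a minute range.
        HttpRequest query = HttpRequest.newBuilder()
                .uri(URI.create(SOLR + "/select?q=applicationId:42+AND+logLevel:3"
                        + "&fq=timestampMinute:%5B28493760+TO+28493820%5D"))
                .build();
        System.out.println(client.send(query, HttpResponse.BodyHandlers.ofString()).body());
    }
}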
For performance reasons, the index data files are stored on the local file system of each data node (Figure 8-12). Each time an index is modified, it is backed up into the Hadoop file system. This makes it possible to quickly redeploy the index onto another data node in case the data node that originally hosted it fails.
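A backup and redeploy step of this kind could be sketched with Hadoop's FileSystem API as follows; the local and HDFS directory paths are illustrative assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class IndexBackup {

    // Copy the local Lucene/Solr index directory into HDFS after the index has been modified.
    public static void backup(String localIndexDir, String hdfsBackupDir) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        fs.copyFromLocalFile(false, true, new Path(localIndexDir), new Path(hdfsBackupDir));
    }

    // Restore the index onto the local disk of another data node after a failure.
    public static void restore(String hdfsBackupDir, String localIndexDir) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        fs.copyToLocalFile(new Path(hdfsBackupDir), new Path(localIndexDir));
    }
}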