Security information that is received from target data sources is ingested
by the Processing Layer and also included in the indexes that Data Explorer
builds for each target data source. This enables the granular role-based
security capabilities that we described earlier, ensuring that users receive
only the information that they are authorized to view, based on their security
permissions with each target data source.
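The idea of storing security information alongside indexed content can be sketched in a few lines. This is a minimal illustration only, not Data Explorer's actual API: the names SecureIndex, IndexedDoc, ingest, search, and allowed_groups are hypothetical, and the group-based access list is an assumed stand-in for whatever permissions a real data source provides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexedDoc:
    doc_id: str
    text: str
    # Hypothetical ACL: groups allowed to view the document, captured
    # from the source system at ingestion time.
    allowed_groups: frozenset


class SecureIndex:
    """Toy index that keeps each document's permissions next to its text."""

    def __init__(self):
        self._docs = []

    def ingest(self, doc_id, text, allowed_groups):
        self._docs.append(IndexedDoc(doc_id, text, frozenset(allowed_groups)))

    def search(self, term, user_groups):
        # Return only documents that match the term AND that the user is
        # authorized to view (at least one group in common with the ACL).
        term = term.lower()
        groups = set(user_groups)
        return [
            d.doc_id
            for d in self._docs
            if term in d.text.lower().split() and d.allowed_groups & groups
        ]


index = SecureIndex()
index.ingest("hr-001", "salary review for engineering", {"hr"})
index.ingest("wiki-042", "engineering onboarding guide", {"hr", "engineering"})

print(index.search("engineering", {"engineering"}))  # ['wiki-042']
print(index.search("engineering", {"hr"}))           # ['hr-001', 'wiki-042']
```

Because the permissions travel with the indexed entry, the same query returns different results for different users without a round trip to the source system at query time.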
Like the other main components of the IBM Big Data platform, Data Explorer
is designed to handle extremely high volumes of data by scaling out its
footprint to large numbers of servers. It's been used in production settings
to index trillions of records and petabytes of data.
From a high-availability perspective, the Data Explorer servers feature
master-master replication and failover capability. Whenever a server is taken
offline, all search and ingestion traffic is redirected to the remaining live
server. When the original server is put back online, each of its collections
automatically synchronizes with a peer. If a collection has been corrupted,
it is automatically restored. For planned outages, Data Explorer servers can
be upgraded, replaced, or taken out of the configuration without any
interruption of service (indexing or searching).
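The failover and resynchronization behavior described above can be pictured with a small simulation. This is a sketch under stated assumptions, not Data Explorer code: Server, Cluster, take_offline, and bring_online are hypothetical names, and copying a peer's collection wholesale is a simplification of whatever synchronization a real product performs.

```python
class Server:
    def __init__(self, name):
        self.name = name
        self.online = True
        self.collection = {}  # doc_id -> text


class Cluster:
    """Toy master-master pair: every live server accepts writes and reads."""

    def __init__(self, names):
        self.servers = {n: Server(n) for n in names}

    def _live(self):
        live = [s for s in self.servers.values() if s.online]
        if not live:
            raise RuntimeError("no live servers")
        return live

    def ingest(self, doc_id, text):
        # Writes go to every live server; offline servers miss them.
        for s in self._live():
            s.collection[doc_id] = text

    def search(self, term):
        # Reads are served by any live server (here, the first one).
        server = self._live()[0]
        return sorted(d for d, t in server.collection.items() if term in t.split())

    def take_offline(self, name):
        self.servers[name].online = False

    def bring_online(self, name):
        # On return, the server synchronizes its collection with a live peer.
        server = self.servers[name]
        server.collection = dict(self._live()[0].collection)
        server.online = True


cluster = Cluster(["a", "b"])
cluster.ingest("d1", "big data")
cluster.take_offline("a")
cluster.ingest("d2", "data explorer")   # only server b sees this write
print(cluster.search("data"))           # ['d1', 'd2'], served by b
cluster.bring_online("a")               # a catches up from b
print(cluster.servers["a"].collection == cluster.servers["b"].collection)  # True
```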
The Secret Sauce: Positional Indexes
An index is at the core of any search system and is a leading factor in query
performance. In Big Data implementations, differences in index structure,
size, management, and other characteristics are magnified because of the
higher scale and increased data complexity. Data Explorer has a distinct
advantage because it features a unique positional index structure that is more
compact and versatile than the indexes used by other search solutions on the
market today.
To truly appreciate why a positional index makes Data Explorer a superior
enterprise search platform, you need to understand the limitations of
conventional indexes, known as vector space indexes (see Figure 7-2).
When text is indexed using the vector space approach, all of the extracted
terms are weighted according to their frequency within the document
(weight is positively correlated with frequency). At query time, this
weighting is also influenced by the uniqueness of the search term in relation
to the full set of documents (weight is negatively correlated with the number
of occurrences across the full set of documents; quite simply, if a word
doesn't occur often, it's “special”). This balance between frequency and uniqueness