If Data Is the New Oil—You Need Data Exploration and Discovery - Harness the Power of Big Data - page 170

[Figure 7-2 A vector space index. Stages shown: 1. Documents, 2. Tokens, 3. Vectors, 4. Inverted Index]

is important, so that frequently used terms like “the,” which occur often in individual documents but also across the whole set of documents, don't skew search results. Results from the complete set of documents are used to build an inverted index, which maps each term and its weight to the locations of the term in the documents. When a query is issued against a search engine that uses a vector space index, a similar vector is calculated using just the terms in the search query. The documents whose vectors most closely match the query's vector are included in the top-ranked search results.
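To make the flow above concrete, here is a minimal sketch of a vector space index. It assumes TF-IDF weighting and cosine similarity, which are common choices but are not confirmed by the text as the exact scheme in use; the function names (`tokenize`, `build_index`, `search`) and the sample documents are hypothetical. For brevity, the inverted index here maps each term to document identifiers and weights rather than to locations inside the documents.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace; real tokenizers do far more."""
    return text.lower().split()


def build_index(docs):
    """Return (inverted_index, idf, norms) for a dict of doc_id -> text."""
    term_counts = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())

    # Assumed weighting: a term found in every document gets weight 0,
    # so words like "the" cannot skew the results.
    n_docs = len(docs)
    idf = {term: math.log(n_docs / df) for term, df in doc_freq.items()}

    inverted = defaultdict(dict)   # term -> {doc_id: tf-idf weight}
    squared = defaultdict(float)   # doc_id -> sum of squared weights
    for doc_id, counts in term_counts.items():
        for term, tf in counts.items():
            weight = tf * idf[term]
            if weight > 0:
                inverted[term][doc_id] = weight
                squared[doc_id] += weight * weight
    norms = {doc_id: math.sqrt(total) for doc_id, total in squared.items()}
    return dict(inverted), idf, norms


def search(query, inverted, idf, norms):
    """Rank documents by cosine similarity between query and document vectors."""
    counts = Counter(tokenize(query))
    query_weights = {term: tf * idf.get(term, 0.0) for term, tf in counts.items()}
    query_norm = math.sqrt(sum(w * w for w in query_weights.values()))
    if query_norm == 0:
        return []  # the query contains only unknown or zero-weight terms

    scores = defaultdict(float)
    for term, q_weight in query_weights.items():
        for doc_id, d_weight in inverted.get(term, {}).items():
            scores[doc_id] += q_weight * d_weight

    ranked = [(doc_id, score / (norms[doc_id] * query_norm))
              for doc_id, score in scores.items()]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


# Hypothetical sample data, for illustration only.
docs = {
    1: "the cat sat on the mat",
    2: "the dog chased the cat",
    3: "the bird sang",
}
inverted, idf, norms = build_index(docs)
print(search("the dog", inverted, idf, norms))
```

In this sketch, “the” appears in every sample document, so its weight is zero and it never enters the inverted index; only “dog” contributes to the ranking for the sample query.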

There are a number of limitations with the vector space approach, most of which stem from the fact that after a document has been reduced to a vector, it's impossible to reconstruct the full document flow. For example, it's impossible to consider portions of a document as separate units, and the only clues provided about such documents are the frequency and uniqueness of their indexed terms.
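A tiny illustration of this limitation, assuming a plain term-count vector (the helper name `term_vector` is hypothetical): two sentences with opposite meanings reduce to the same vector, because word order is discarded.

```python
from collections import Counter


def term_vector(text):
    """Reduce a text to term counts; word order is lost at this step."""
    return Counter(text.lower().split())


first = term_vector("dog bites man")
second = term_vector("man bites dog")

# Both texts yield identical vectors, so a vector-only index
# cannot tell them apart or recover the original word order.
print(first == second)
```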

Big Data and modern search applications require more than just information about term frequency and uniqueness. Positioning information is required to efficiently perform phrase or proximity searches, to use proximity as a ranking factor, or to generate dynamic summaries. So, when keeping track of document positions (for example, the proximity of multiple search terms within a document), it's necessary for conventional index solutions to create a structure in addition to their vector space index, usually a document-specific positional space index. Of course, as with most things in life, nothing is free: this additional index comes at a cost, because it takes more time to index documents and the resulting index requires a significantly larger volume footprint.
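The following sketch shows why position information matters for phrase and proximity searches. It is a generic positional index under simple assumptions (whitespace tokenization, positions counted in tokens), not a description of how any particular product stores its index; the function names and sample documents are hypothetical.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map term -> {doc_id: [token positions]} for a dict of doc_id -> text."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for position, term in enumerate(text.lower().split()):
            index[term][doc_id].append(position)
    # Convert to plain dicts so later lookups cannot add empty entries.
    return {term: dict(postings) for term, postings in index.items()}


def phrase_search(index, phrase):
    """Return sorted ids of documents that contain the exact phrase."""
    terms = phrase.lower().split()
    if not terms or any(term not in index for term in terms):
        return []

    # Only documents containing every term can contain the phrase.
    candidates = set(index[terms[0]])
    for term in terms[1:]:
        candidates &= set(index[term])

    hits = []
    for doc_id in candidates:
        later = [set(index[term][doc_id]) for term in terms[1:]]
        for start in index[terms[0]][doc_id]:
            # Each following term must sit exactly one token further on.
            if all(start + offset + 1 in positions
                   for offset, positions in enumerate(later)):
                hits.append(doc_id)
                break
    return sorted(hits)


def near_search(index, term_a, term_b, max_gap):
    """Return sorted ids of documents where the terms lie within max_gap tokens."""
    postings_a = index.get(term_a.lower(), {})
    postings_b = index.get(term_b.lower(), {})
    hits = []
    for doc_id in postings_a.keys() & postings_b.keys():
        if any(abs(pos_a - pos_b) <= max_gap
               for pos_a in postings_a[doc_id]
               for pos_b in postings_b[doc_id]):
            hits.append(doc_id)
    return sorted(hits)


# Hypothetical sample data, for illustration only.
docs = {
    1: "big data search engine",
    2: "search for big volumes of data",
    3: "data big",
}
index = build_positional_index(docs)
print(phrase_search(index, "big data"))
print(near_search(index, "big", "data", 1))
```

Note the cost the text describes: every occurrence of every term stores a position, so this structure grows with the total number of tokens rather than with the number of distinct terms per document.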

As we mentioned earlier, Data Explorer also uses a positional space index, but the difference here is that there is no underlying vector space index. The positional space index is more compact than the traditional vector space
