Database Reference
In-Depth Information
[Figure 7-2 A vector space index. The diagram shows four stages: 1. Documents, 2. Tokens, 3. Vectors, 4. Inverted Index.]
is important, so that frequently used terms like “the,” which can occur often
in individual documents—but also occur often in a whole set of documents—
don't skew search results. Results from a complete set of documents are used
to build an inverted index, which maps each term and its weight to locations
of the term in the documents. When queries are issued against a search
engine using a vector space index, a similar vector is calculated using just the
terms in the search query. The documents whose vectors most closely match
the query vector are returned as the top-ranked search results.
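
The flow described above (documents, tokens, weighted vectors, inverted index, query vector) can be sketched in a few lines of Python. This is a minimal illustration, not the implementation of any particular search engine. It assumes whitespace tokenization, TF-IDF as the weighting scheme, and cosine similarity as the measure of how "closely" two vectors match. The function names (tokenize, build_index, search) are hypothetical, chosen only for this example.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Assumption: lowercase whitespace tokenization stands in for a real analyzer.
    return text.lower().split()


def build_index(docs):
    """Build a TF-IDF inverted index from a dict of doc_id -> text.

    Returns (inverted, idf, norms) where inverted maps
    term -> {doc_id: weight} and norms maps doc_id -> vector length.
    """
    n = len(docs)
    term_counts = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for counts in term_counts.values():
        df.update(counts.keys())

    # A term that appears in every document (like "the") gets idf = log(1) = 0.
    idf = {term: math.log(n / count) for term, count in df.items()}

    inverted = defaultdict(dict)
    norms = {}
    for doc_id, counts in term_counts.items():
        weights = {term: tf * idf[term] for term, tf in counts.items()}
        norms[doc_id] = math.sqrt(sum(w * w for w in weights.values()))
        for term, weight in weights.items():
            if weight > 0:  # zero-weight terms cannot affect ranking
                inverted[term][doc_id] = weight
    return inverted, idf, norms


def search(query, inverted, idf, norms):
    """Rank documents by cosine similarity to the query vector."""
    q_counts = Counter(tokenize(query))
    q_weights = {t: tf * idf[t] for t, tf in q_counts.items() if t in idf}
    q_norm = math.sqrt(sum(w * w for w in q_weights.values()))

    # Accumulate dot products by walking only the postings of the query terms.
    dots = defaultdict(float)
    for term, q_weight in q_weights.items():
        for doc_id, d_weight in inverted.get(term, {}).items():
            dots[doc_id] += q_weight * d_weight

    ranked = [
        (doc_id, dot / (q_norm * norms[doc_id]))
        for doc_id, dot in dots.items()
        if q_norm > 0 and norms[doc_id] > 0
    ]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)
```

Note that in this sketch a term such as "the" that occurs in every document receives a weight of zero, so it never enters the inverted index and cannot skew the ranking. Note also that only weights are stored, not word positions, which is the limitation discussed next.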
There are a number of limitations with the vector space approach, most of
which stem from the fact that after a document has been reduced to a vector,
it's impossible to reconstruct the full document flow. For example, it's
impossible to consider portions of a document as separate units, and the only clues
provided about such documents are the frequency and uniqueness of their
indexed terms.
Big Data and modern search applications require more than just information
about term frequency and uniqueness. Positioning information is
required to efficiently perform phrase or proximity searches, to use proximity
as a ranking factor, or to generate dynamic summaries. So, to keep
track of positions within a document (for example, the proximity of multiple
search terms), conventional index solutions must create a second structure
in addition to their vector space index, usually a document-specific
positional space index. Of course, as with most things in life,
nothing is free: this additional index comes at a cost, because it takes more
time to index documents and the resulting index requires a significantly
larger volume footprint.
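
To make the idea of positioning information concrete, here is a minimal Python sketch of a positional index that supports phrase and proximity searches. It is an illustration under stated assumptions, not a description of how Data Explorer or any other product stores positions. It assumes whitespace tokenization and word offsets as positions, and the function names (build_positional_index, phrase_search, within) are hypothetical.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map term -> {doc_id: [word positions]} for a dict of doc_id -> text."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for position, term in enumerate(text.lower().split()):
            index[term][doc_id].append(position)
    return index


def phrase_search(index, phrase):
    """Return ids of documents that contain the terms of phrase consecutively."""
    terms = phrase.lower().split()
    if not terms or any(term not in index for term in terms):
        return []

    # Only documents containing every term can contain the phrase.
    candidates = set(index[terms[0]])
    for term in terms[1:]:
        candidates &= set(index[term])

    hits = []
    for doc_id in sorted(candidates):
        following = [set(index[term][doc_id]) for term in terms[1:]]
        # The phrase matches if, for some start position of the first term,
        # each later term appears at the next consecutive position.
        for start in index[terms[0]][doc_id]:
            if all(start + offset + 1 in positions
                   for offset, positions in enumerate(following)):
                hits.append(doc_id)
                break
    return hits


def within(index, term_a, term_b, max_distance):
    """Return ids of documents where the two terms occur within max_distance words."""
    term_a, term_b = term_a.lower(), term_b.lower()
    if term_a not in index or term_b not in index:
        return []
    hits = []
    for doc_id in sorted(set(index[term_a]) & set(index[term_b])):
        if any(abs(pos_a - pos_b) <= max_distance
               for pos_a in index[term_a][doc_id]
               for pos_b in index[term_b][doc_id]):
            hits.append(doc_id)
    return hits
```

Because every occurrence of every term is recorded, a structure like this grows with the total number of words rather than the number of distinct terms, which illustrates why maintaining it alongside a vector space index adds indexing time and storage.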
As we mentioned earlier, Data Explorer also uses a positional space
index—but the difference here is that there is no underlying vector space index.
The positional space index is more compact than the traditional vector space