Database Reference
In-Depth Information
4. Retain response i if probing a reachability index , described in 10.4.2.2,
certifies that i is-a a . This consumes some more time and eliminates a
fraction of responses. We realize that whales are not scientists (as far
as WordNet is concerned) and discard it.
5. In case fewer than k results survive, repeat with a larger k . Thisisvery
expensive and is best avoided.
The central issue is how to choose the registered subset R . Another issue is
the choice of k . We address these issues in the rest of this section.
(While selecting R , we pretend all roots of A are included in R as sentinels,
but we can avoid actually indexing these. While processing a query, in case no
g can be found, we can pretend every word is a potential candidate, a situation
that will essentially never arise given a reasonable algorithm for selecting R .)
In addition to the registered atype index, we need to build two other indices,
which we discuss at this point.
10.4.2.1
Forward index
The task of the forward index is to store the corpus in a highly compact
format on disk, and, given a document ID and a token offset (or range of
offsets), quickly return the tokens or token IDs at those offsets in the specified
document, using very few disk seeks. The forward index should occupy no
more space on disk than, say, a compressed (using gzip , say) version of the
original corpus. We cannot just use the original corpus as-is, because it is too
large, and ordinary compression inhibits random access.
In a first pass, we build the corpus lexicon, and count the frequency of each
token. Next we assign a byte-aligned code to each token. The most frequent
254 tokens get a 1-byte code, the next most frequent 65534 tokens get a 2-byte
code, etc. We use codes of sizes that are multiples of 8 bits because decoding
variable-length codes that are not byte-aligned, with random access, would
be messy. Our codes are suitably escaped so that we can read one byte and
decide if we need to read more bytes to complete the code.
The forward index is used for two purposes: to set up the context required
for scoring the candidate token, and to report a snippet with every hit in a
typical search engine. For these applications, we typically need to access short
contiguous token segments. We partition each document into segments of W
(configurable at indexing time) tokens.
In the second pass, we dump token codes linearly to a random-access file
without regard to their variable lengths. Then we build a persistent map
from (document ID, segment number) to begin and end offsets of code bytes
in the random access file. If W is configured suitably, 1-2 seeks are enough
to retrieve a token segment.
In case of ad hoc additions of documents to the corpus, long codes can be
assigned to new tokens starting from the end of the allocated range, and once
the counts of new tokens get fairly large, codes can be reassigned and the
Search WWH ::




Custom Search