Text Search-Enhanced with Types and Entities - Text Mining: Classification, Clustering, and Applications

Database Reference

In-Depth Information

4. Retain response i if probing a reachability index , described in 10.4.2.2,

certifies that i is-a a . This consumes some more time and eliminates a

fraction of responses. We realize that whales are not scientists (as far

as WordNet is concerned) and discard it.

5. In case fewer than k results survive, repeat with a larger k . Thisisvery

expensive and is best avoided.

The central issue is how to choose the registered subset R . Another issue is

the choice of k . We address these issues in the rest of this section.

(While selecting R , we pretend all roots of A are included in R as sentinels,

but we can avoid actually indexing these. While processing a query, in case no

g can be found, we can pretend every word is a potential candidate, a situation

that will essentially never arise given a reasonable algorithm for selecting R .)

In addition to the registered atype index, we need to build two other indices,

which we discuss at this point.

10.4.2.1

Forward index

The task of the forward index is to store the corpus in a highly compact

format on disk, and, given a document ID and a token offset (or range of

offsets), quickly return the tokens or token IDs at those offsets in the specified

document, using very few disk seeks. The forward index should occupy no

more space on disk than, say, a compressed (using gzip , say) version of the

original corpus. We cannot just use the original corpus as-is, because it is too

large, and ordinary compression inhibits random access.

In a first pass, we build the corpus lexicon, and count the frequency of each

token. Next we assign a byte-aligned code to each token. The most frequent

254 tokens get a 1-byte code, the next most frequent 65534 tokens get a 2-byte

code, etc. We use codes of sizes that are multiples of 8 bits because decoding

variable-length codes that are not byte-aligned, with random access, would

be messy. Our codes are suitably escaped so that we can read one byte and

decide if we need to read more bytes to complete the code.

The forward index is used for two purposes: to set up the context required

for scoring the candidate token, and to report a snippet with every hit in a

typical search engine. For these applications, we typically need to access short

contiguous token segments. We partition each document into segments of W

(configurable at indexing time) tokens.

In the second pass, we dump token codes linearly to a random-access file

without regard to their variable lengths. Then we build a persistent map

from (document ID, segment number) to begin and end offsets of code bytes

in the random access file. If W is configured suitably, 1-2 seeks are enough

to retrieve a token segment.

In case of ad hoc additions of documents to the corpus, long codes can be

assigned to new tokens starting from the end of the allocated range, and once

the counts of new tokens get fairly large, codes can be reassigned and the

Text Mining: Classification, Clustering, and Applications

Search WWH ::

Custom Search

Home