Database Reference
In-Depth Information
forward index rebuilt.
10.4.2.2 Reachability index
The task of the reachability index is to preprocess A with all its is-a
(hypernym) links and all corpus tokens and prepare a small index to be able to
quickly answer queries of the form “is type t 1 a generalization or specialization
of type t 2 ” or “is some sense of the string token w an instance of type t .” If the
index is very small we can store it in RAM, and we prefer to do so. Otherwise
it must be on disk.
Reachability indexing is a well-explored area (10; 34). The extreme points
of the storage/time trade-off are 1. doing nothing at indexing time and
initiating a shortest-path search at query time, and 2. precomputing and
storing reachability for all node pairs and answering queries by a table lookup.
If the is-a graph on the whole atype set A is a tree, a suitable prefix numbering
of nodes (15) enables O (1)-time query processing with O (1) storage overhead
per node. In case of general DAGs the problem is more dicult, with non-
trivial lower bounds (10).
The WordNet noun hierarchy is “almost” a tree. For our prototype we
just replicated nodes and numbered them multiple times to effectively make
the graph a tree. The blowup of storage was negligible. Figure 10.17 shows
the space taken by the forward and reachability index in comparison to the
corpus and a regular inverted index. Our overheads are very reasonable. The
forward index would be needed anyway by any text search system to be able
to provide snippets with query responses.
Size (GB)
Corpus/index
Original corpus
5.72
Gzipped corpus
1.33
Stem index
0.91
Reachability index
0.005
Forward index
1.16
FIGURE 10.21 : Sizes of the additional indices needed for pre-generalize and
post-filter query processing, compared to the usual indices for TREC 2000.
10.4.3 Atype Subset Index Space Model
In Section 10.4.5 we will propose a greedy cost-benefit style atype
registration approach that will trade off between the extra index space
required if an atype r is included in R , against the average query time saved
if it is included. In this section we tackle the space cost; in Section 10.4.4 we
Search WWH ::




Custom Search