the test set. However, the other rows for training years 1999 and 2001, while
showing slightly lower accuracy than year 2000, are still far above the IR
baseline. We should also note that TREC 1999, 2000 and 2001 questions vary
quite a bit in their style and distribution of atypes and words, so Figure 10.16
is also indicative of the robustness of our system.
10.4 Indexing and Query Processing
At this stage we have solved two problems.

- We presented an algorithm for analyzing the question syntax to identify
  the target answer type from a large type hierarchy.
- We designed a machine learning technique to fit a scoring function that
  rewards proximity between instances of the desired answer type and
  syntactic matches between other question words and the snippet around
  the mentions of the instances (a proximity feature of this kind is
  sketched after this list).
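To make the proximity idea concrete, here is a minimal sketch of such a feature. It is not the learned scoring function fitted in Section 10.3; the reciprocal decay and the window width are illustrative assumptions only.

```python
# Minimal sketch of a proximity-reward feature. The 1/distance decay and
# the window size are illustrative assumptions, not the learned scoring
# function described in Section 10.3.

def proximity_score(candidate_pos, matched_positions, window=10):
    """Reward question-word matches that fall close to a candidate
    answer-type mention at token offset candidate_pos."""
    score = 0.0
    for pos in matched_positions:
        dist = abs(pos - candidate_pos)
        if 0 < dist <= window:
            score += 1.0 / dist      # nearer matches contribute more
    return score

# Candidate mention at token 42; question words matched at tokens 40,
# 45 and 60 (the last one falls outside the window).
print(proximity_score(42, [40, 45, 60]))   # 1/2 + 1/3 = 0.8333...
```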
In this section we address two remaining issues related to system
performance.

- We propose a workload-guided system for preparing additional indexes
  to be used in type-cognizant proximity search.
- We outline the query execution algorithm that exploits the new indexes.

(These two tasks are interdependent: index preparation is optimized for
the query execution algorithm, and query execution depends on which
indexes are available.)
In Sections 10.2.3.2 and 10.3.3.1, on encountering a token, we pretended
that all hypernym ancestors of (all senses of) the token appear at the same
token position. In Section 10.3.3.1 we then indexed these together with the
original token. Naturally this increases the size of the inverted index; the
deeper the type hierarchy, the larger the bloat in index size.
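A minimal sketch of this indexing scheme follows. The toy `HYPERNYMS` table and the `hypernym_ancestors` lookup stand in for a WordNet-style type hierarchy; a real system would index synset or type identifiers rather than plain strings.

```python
from collections import defaultdict

# Toy hypernym lookup standing in for a WordNet-style type hierarchy.
HYPERNYMS = {
    "einstein": ["physicist", "scientist", "person", "entity"],
    "zurich":   ["city", "region", "location", "entity"],
}

def hypernym_ancestors(token):
    return HYPERNYMS.get(token, [])

def build_index(docs):
    """Positional inverted index: term -> list of (doc_id, position).
    Every hypernym ancestor is posted at the same position as the
    original token, which is what inflates the index size."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for pos, token in enumerate(text.lower().split()):
            index[token].append((doc_id, pos))
            for ancestor in hypernym_ancestors(token):
                index[ancestor].append((doc_id, pos))
    return index

idx = build_index({1: "Einstein moved to Zurich"})
print(idx["person"])    # [(1, 0)] -- same position as "einstein"
print(idx["location"])  # [(1, 3)] -- same position as "zurich"
```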
Limited-domain semantic search applications need to index only a handful of
named entity types such as person, place and time. For these applications,
the cost of indexing type tags along with tokens is not prohibitive. However,
large and deep type hierarchies are essential to support open-domain semantic
search. Consequently, the index space required for the type annotations becomes
very large compared to the standard inverted index (see Figure 10.17). The
overhead appears especially large because standard inverted indexes can be
compressed significantly (39).
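To see why plain postings compress so well, consider the standard gap-plus-variable-byte encoding sketched below. The document IDs are made up, and production systems use more elaborate codecs, but the principle is the same: sorted postings yield small gaps, and small gaps need few bytes.

```python
def vbyte_encode(gaps):
    """Variable-byte encode a list of positive integers (postings gaps).
    Each integer is split into 7-bit chunks; the high bit marks the
    final byte of each number."""
    out = bytearray()
    for n in gaps:
        chunks = []
        while True:
            chunks.append(n & 0x7F)
            n >>= 7
            if n == 0:
                break
        chunks.reverse()
        chunks[-1] |= 0x80           # flag the terminating byte
        out.extend(chunks)
    return bytes(out)

# Sorted document IDs are stored as gaps from the previous ID.
doc_ids = [3, 7, 8, 150, 152]
gaps = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]
print(gaps)                     # [3, 4, 1, 142, 2]
print(len(vbyte_encode(gaps)))  # 6 bytes, versus 20 with fixed 32-bit IDs
```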
For a reader who is familiar with the large skew in the frequency of words
in query logs, the natural questions at this point are whether a similar skew
exists in the atypes requested by question workloads, and whether that skew
can be exploited to reduce the index space overhead.
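One way to examine the first question is to tabulate how often each atype is requested in a query workload; a heavily skewed distribution would suggest preparing full indexes only for the most popular atypes. The log entries and atype labels below are purely hypothetical.

```python
from collections import Counter

# Hypothetical atype annotations extracted from a question workload.
workload_atypes = [
    "person#n#1", "person#n#1", "city#n#1", "person#n#1", "date#n#1",
    "city#n#1", "person#n#1", "animal#n#1", "date#n#1", "person#n#1",
]

counts = Counter(workload_atypes)
total = sum(counts.values())
for atype, freq in counts.most_common():
    print(f"{atype:12s} {freq:3d}  {freq / total:.0%}")
```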
 