the test set. However, the other rows for training years 1999 and 2001, while
showing slightly lower accuracy than year 2000, are still far above the IR
baseline. We should also note that TREC 1999, 2000 and 2001 questions vary
quite a bit in their style and distribution of atypes and words, so Figure 10.16
is also indicative of the robustness of our system.
10.4 Indexing and Query Processing
At this stage we have solved two problems.

- We presented an algorithm for analyzing the question syntax to identify
  the target answer type from a large type hierarchy.
- We designed a machine learning technique to fit a scoring function that
  rewards proximity between instances of the desired answer type and
  syntactic matches between other question words and the snippet around
  the mentions of the instances (a proximity feature of this kind is
  sketched after this list).
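To make the proximity idea concrete, here is a minimal sketch of such a feature. It is not the learned scoring function fitted in Section 10.3; the reciprocal decay and the window width are illustrative assumptions only.

```python
# Minimal sketch of a proximity-reward feature. The 1/distance decay and
# the window size are illustrative assumptions, not the learned scoring
# function described in Section 10.3.

def proximity_score(candidate_pos, matched_positions, window=10):
    """Reward question-word matches that fall close to a candidate
    answer-type mention at token offset candidate_pos."""
    score = 0.0
    for pos in matched_positions:
        dist = abs(pos - candidate_pos)
        if 0 < dist <= window:
            score += 1.0 / dist      # nearer matches contribute more
    return score

# Candidate mention at token 42; question words matched at tokens 40,
# 45 and 60 (the last one falls outside the window).
print(proximity_score(42, [40, 45, 60]))   # 1/2 + 1/3 = 0.8333...
```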
In this section we address two remaining issues related to system
performance.

- We propose a workload-guided system for preparing additional indexes
  to be used in type-cognizant proximity search.
- We outline the query execution algorithm that exploits the new indexes.

(These two tasks are interdependent: index preparation is optimized for
the query execution algorithm, and query execution depends on which
indexes are available.)
In Sections 10.2.3.2 and 10.3.3.1, on encountering a token, we pretended
that all hypernym ancestors of (all senses of) the token appear at the same
token position. In Section 10.3.3.1 we then indexed these together with the
original token. Naturally this increases the size of the inverted index; the
deeper the type hierarchy, the larger the bloat in index size.
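A minimal sketch of this indexing scheme follows. The toy `HYPERNYMS` table and the `hypernym_ancestors` lookup stand in for a WordNet-style type hierarchy; a real system would index synset or type identifiers rather than plain strings.

```python
from collections import defaultdict

# Toy hypernym lookup standing in for a WordNet-style type hierarchy.
HYPERNYMS = {
    "einstein": ["physicist", "scientist", "person", "entity"],
    "zurich":   ["city", "region", "location", "entity"],
}

def hypernym_ancestors(token):
    return HYPERNYMS.get(token, [])

def build_index(docs):
    """Positional inverted index: term -> list of (doc_id, position).
    Every hypernym ancestor is posted at the same position as the
    original token, which is what inflates the index size."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for pos, token in enumerate(text.lower().split()):
            index[token].append((doc_id, pos))
            for ancestor in hypernym_ancestors(token):
                index[ancestor].append((doc_id, pos))
    return index

idx = build_index({1: "Einstein moved to Zurich"})
print(idx["person"])    # [(1, 0)] -- same position as "einstein"
print(idx["location"])  # [(1, 3)] -- same position as "zurich"
```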
Limited-domain semantic search applications need to index only a handful of
named entity types such as person, place and time. For these applications,
the cost of indexing type tags along with tokens is not prohibitive. However,
large and deep type hierarchies are essential to support open-domain semantic
search. Consequently, the index space required for the type annotations becomes
very large compared to the standard inverted index (see Figure 10.17). The
overhead appears especially large because standard inverted indexes can be
compressed significantly (39).
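To see why plain postings compress so well, consider the standard gap-plus-variable-byte encoding sketched below. The document IDs are made up, and production systems use more elaborate codecs, but the principle is the same: sorted postings yield small gaps, and small gaps need few bytes.

```python
def vbyte_encode(gaps):
    """Variable-byte encode a list of positive integers (postings gaps).
    Each integer is split into 7-bit chunks; the high bit marks the
    final byte of each number."""
    out = bytearray()
    for n in gaps:
        chunks = []
        while True:
            chunks.append(n & 0x7F)
            n >>= 7
            if n == 0:
                break
        chunks.reverse()
        chunks[-1] |= 0x80           # flag the terminating byte
        out.extend(chunks)
    return bytes(out)

# Sorted document IDs are stored as gaps from the previous ID.
doc_ids = [3, 7, 8, 150, 152]
gaps = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]
print(gaps)                     # [3, 4, 1, 142, 2]
print(len(vbyte_encode(gaps)))  # 6 bytes, versus 20 with fixed 32-bit IDs
```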
For a reader who is familiar with the large skew in the frequency of words
in query logs, the natural questions at this point are whether a similar skew
exists in the atypes requested by question workloads, and whether that skew
can be exploited to reduce the index space overhead.
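One way to examine the first question is to tabulate how often each atype is requested in a query workload; a heavily skewed distribution would suggest preparing full indexes only for the most popular atypes. The log entries and atype labels below are purely hypothetical.

```python
from collections import Counter

# Hypothetical atype annotations extracted from a question workload.
workload_atypes = [
    "person#n#1", "person#n#1", "city#n#1", "person#n#1", "date#n#1",
    "city#n#1", "person#n#1", "animal#n#1", "date#n#1", "person#n#1",
]

counts = Counter(workload_atypes)
total = sum(counts.values())
for atype, freq in counts.most_common():
    print(f"{atype:12s} {freq:3d}  {freq / total:.0%}")
```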
 