Practical Approaches to the Many-Answer Problem - Advanced Database Query Systems

Database Reference

In-Depth Information

to each feature in an overall ranking. As one can observe, a fundamental requirement of top-k based ap-

proaches is that they require a scoring function that specifies which result from a large set of potential

answers to a query is most relevant to the user. Achieving this requirement is generally not a straight-

forward endeavor, especially when users do not really know what might be useful or relevant for them.

Automatic methods of ranking answers of database (DB) queries have been recently investigated to

overcome this problem, most of them being an adaptation of those employed in Information Retrieval

(IR). Before discussing these methods in detail, we first review existing Information Retrieval ranking

techniques. Then, we discuss related work on top-k query processing techniques in relational database

systems.

2.2.1.1 IR Ranking Techniques

Automated ranking of query results has been extensively investigated in Information Retrieval (Salton

& McGill, 1986; Robertson, 1997; Brin & Page, 1998; Bharat & Henzinger, 1998; Kleinberg, 1999;

Borodin, Roberts, Rosenthal & Tsaparas, 2001; Ng, Zheng & Jordan, 2001). Indeed, the Web pages that

are returned as a result to a query expressed by a set of keywords are automatically ranked so that the

more 'relevant' the page is to the query, the higher it is ranked. Furthermore, among the Web pages that

are equally relevant, those that are more 'important' should precede the less 'important' ones. Models

that have been used for this purpose include vector space and probabilistic information retrieval models

which are eventually combined with link-based ranking methods to offer the user not only relevant

documents but also high quality Web pages.

Vector Space Model

Vector space model (Salton, & McGill, 1986) represents documents as term vectors in an N -dimension-

al space where N corresponds to the number of terms in the collection. Each element in a vector captures

a term and its weight, defined as w

= * , where tf is the term frequency in the document and idf

is the inverse document frequency reflecting the general importance of the term in the entire document

collection. Specifically, idf decreases the weight of terms that occur in numerous documents and, there-

fore, have low discriminating value. Queries are also represented as vectors of keywords. Hence, given

a keyword query, the retrieved documents are ranked based on their similarity to the query. A range of

measures exists to calculate this similarity; the most common one is the cosine similarity measure defined

as the Cosine of the angle between the query and the document vector representations.

tf

idf

Probabilistic Model

In the probabilistic model (Robertson, 1997), documents and queries are also viewed as vectors, whereas

the vector space similarity measure is replaced by a probabilistic matching function. More precisely,

given a document collection D and a query q , probabilistic-based approaches formally rank documents

by decreasing order of their odds of relevance to non-relevance using a score function defined as follows:

P d

( | Re ) (Re )

( )

( | Re ) (Re )

l P

l

P

(Re | )

l d

P d

( | Re )

l

score d

( )

=

≈

P

l d

P d

l P

l

P d

l

P d

( )

Advanced Database Query Systems

Search WWH ::

Custom Search

Home