Database Reference
In-Depth Information
to each feature in an overall ranking. As one can observe, a fundamental requirement of top-k based ap-
proaches is that they require a scoring function that specifies which result from a large set of potential
answers to a query is most relevant to the user. Achieving this requirement is generally not a straight-
forward endeavor, especially when users do not really know what might be useful or relevant for them.
Automatic methods of ranking answers of database (DB) queries have been recently investigated to
overcome this problem, most of them being an adaptation of those employed in Information Retrieval
(IR). Before discussing these methods in detail, we first review existing Information Retrieval ranking
techniques. Then, we discuss related work on top-k query processing techniques in relational database
systems.
2.2.1.1 IR Ranking Techniques
Automated ranking of query results has been extensively investigated in Information Retrieval (Salton
& McGill, 1986; Robertson, 1997; Brin & Page, 1998; Bharat & Henzinger, 1998; Kleinberg, 1999;
Borodin, Roberts, Rosenthal & Tsaparas, 2001; Ng, Zheng & Jordan, 2001). Indeed, the Web pages that
are returned as a result to a query expressed by a set of keywords are automatically ranked so that the
more 'relevant' the page is to the query, the higher it is ranked. Furthermore, among the Web pages that
are equally relevant, those that are more 'important' should precede the less 'important' ones. Models
that have been used for this purpose include vector space and probabilistic information retrieval models
which are eventually combined with link-based ranking methods to offer the user not only relevant
documents but also high quality Web pages.
Vector Space Model
Vector space model (Salton, & McGill, 1986) represents documents as term vectors in an N -dimension-
al space where N corresponds to the number of terms in the collection. Each element in a vector captures
a term and its weight, defined as w
= * , where tf is the term frequency in the document and idf
is the inverse document frequency reflecting the general importance of the term in the entire document
collection. Specifically, idf decreases the weight of terms that occur in numerous documents and, there-
fore, have low discriminating value. Queries are also represented as vectors of keywords. Hence, given
a keyword query, the retrieved documents are ranked based on their similarity to the query. A range of
measures exists to calculate this similarity; the most common one is the cosine similarity measure defined
as the Cosine of the angle between the query and the document vector representations.
tf
idf
Probabilistic Model
In the probabilistic model (Robertson, 1997), documents and queries are also viewed as vectors, whereas
the vector space similarity measure is replaced by a probabilistic matching function. More precisely,
given a document collection D and a query q , probabilistic-based approaches formally rank documents
by decreasing order of their odds of relevance to non-relevance using a score function defined as follows:
P d
( | Re ) (Re )
( )
( | Re ) (Re )
l P
l
P
(Re | )
(Re | )
l d
P d
P d
( | Re )
( | Re )
l
score d
( )
=
=
P
l d
P d
l P
l
P d
l
P d
( )
 
Search WWH ::




Custom Search