Database Reference
In-Depth Information
sos, Sycara & Payne, 2000) and (MacArthur, Brodley, Ka & Broderick, 2002) use relevance feedback
from the user.
Chaudhuri et al's System
In (Chaudhuri & Das, 2003), the authors proposed a ranking function ( QF W ) that leverages workload
information to rank the answers to a database query. It is based on the frequency of occurrence of the
values of unspecified ix attributes. For example, consider a home-buyer searching for houses in HouseDB.
A query with a not very selective condition such as 'City = Paris AND #Bedrooms = 2' may result in too
many tuples in the answer, since there are many houses with two bedrooms in Paris. The proposed system
uses workload information and examines attributes other than City and #Bedrooms (i.e., attributes that
are not specified in the query) to rank the result set. Thus, if the workload contains many more queries
for houses in Paris's 15 th arrondissement (precinct) than for houses in Paris's 18 th arrondissement, the
system ranks two bedroom houses in the 15 th arrondissement higher than two bedroom houses in the
18 th arrondissement. The intuition is that if the 15 th arrondissement is a wonderful location, the workload
will contain many more queries for houses in the 15 th than for houses in the 18 th .
More formally, consider one relation R . Each tuple in R has N attributes A 1 ,…, A N . Further, let q be a
user's query with some number of attributes specified (e.g., A 1 , A 2 ,…, A i and i < N ) and the rest of them
unspecified (e.g., A i+1 ,…, A N ). The relevance score of an answer t is defined as follows:
F(t A )
F
.
N
å
QF (t) =
k
W
k=i+
1
max
where F(t.A k ) is the frequency of occurrence of value t.A k of attribute A k in the workload W and F max the
frequency of the most frequently occurring value in W .
PIR
In the PIR x system (Chaudhuri, Das, Hristidis & Weikum, 2004), the authors adapted and applied principles
of probabilistic models from Information Retrieval to structured data. Given a query, the proposed ranking
function depends on two factors: (a) a global score that captures the global importance of unspecified
attribute values, and (b) a conditional score that captures the strengths of dependencies (or correlations)
between specified and unspecified attribute values. For example, for the query 'City = Marseille AND
View = Waterfront', a house with 'SchoolDistrict = Excellent' gets a high rank because good school
districts are globally desirable. A house with also 'BoatDock = Yes' gets a high rank because people
desiring a waterfront are likely to want a boat dock. These scores are estimated using past workloads as
well as data analysis, e.g., past workload may reveal that a large fraction of users seeking houses with a
waterfront view have also requested boat docks. More precisely, under the same notations and hypotheses
that we have used for the previous approach, the relevance score of a tuple t is computed as follows:
P(t A |W)
P(t A | R)
.
.
P(t A | t
A ,W)
P(t A | t A ,R)
.
.
PIR(t) = P(
Re
Re
l | t)
N
N
i
k
l
k
.
.
P(
l | t)
k=i+
1
k=i+
1
l=
1
k
l
k
k . are simply the relative frequencies of each distinct
value t.Ak respectively in the workload W and in the relation R , while the quantities P(t A | t A ,W)
l
where the quantities P(t A |W)
k
.
and P(t A | R)
. .
k
Search WWH ::




Custom Search