Practical Approaches to the Many-Answer Problem - Advanced Database Query Systems

Database Reference

In-Depth Information

sos, Sycara & Payne, 2000) and (MacArthur, Brodley, Ka & Broderick, 2002) use relevance feedback

from the user.

Chaudhuri et al's System

In (Chaudhuri & Das, 2003), the authors proposed a ranking function ( QF W ) that leverages workload

information to rank the answers to a database query. It is based on the frequency of occurrence of the

values of unspecified ix attributes. For example, consider a home-buyer searching for houses in HouseDB.

A query with a not very selective condition such as 'City = Paris AND #Bedrooms = 2' may result in too

many tuples in the answer, since there are many houses with two bedrooms in Paris. The proposed system

uses workload information and examines attributes other than City and #Bedrooms (i.e., attributes that

are not specified in the query) to rank the result set. Thus, if the workload contains many more queries

for houses in Paris's 15 th arrondissement (precinct) than for houses in Paris's 18 th arrondissement, the

system ranks two bedroom houses in the 15 th arrondissement higher than two bedroom houses in the

18 th arrondissement. The intuition is that if the 15 th arrondissement is a wonderful location, the workload

will contain many more queries for houses in the 15 th than for houses in the 18 th .

More formally, consider one relation R . Each tuple in R has N attributes A 1 ,…, A N . Further, let q be a

user's query with some number of attributes specified (e.g., A 1 , A 2 ,…, A i and i < N ) and the rest of them

unspecified (e.g., A i+1 ,…, A N ). The relevance score of an answer t is defined as follows:

F(t A )

QF (t) =

k=i+

max

where F(t.A k ) is the frequency of occurrence of value t.A k of attribute A k in the workload W and F max the

frequency of the most frequently occurring value in W .

PIR

In the PIR x system (Chaudhuri, Das, Hristidis & Weikum, 2004), the authors adapted and applied principles

of probabilistic models from Information Retrieval to structured data. Given a query, the proposed ranking

function depends on two factors: (a) a global score that captures the global importance of unspecified

attribute values, and (b) a conditional score that captures the strengths of dependencies (or correlations)

between specified and unspecified attribute values. For example, for the query 'City = Marseille AND

View = Waterfront', a house with 'SchoolDistrict = Excellent' gets a high rank because good school

districts are globally desirable. A house with also 'BoatDock = Yes' gets a high rank because people

desiring a waterfront are likely to want a boat dock. These scores are estimated using past workloads as

well as data analysis, e.g., past workload may reveal that a large fraction of users seeking houses with a

waterfront view have also requested boat docks. More precisely, under the same notations and hypotheses

that we have used for the previous approach, the relevance score of a tuple t is computed as follows:

P(t A |W)

P(t A | R)

P(t A | t

A ,W)

P(t A | t A ,R)

PIR(t) = P(

l | t)

∏

≈

∗

l | t)

k=i+

k . are simply the relative frequencies of each distinct

value t.Ak respectively in the workload W and in the relation R , while the quantities P(t A | t A ,W)

where the quantities P(t A |W)

and P(t A | R)

. .

Advanced Database Query Systems

Search WWH ::

Custom Search

Home