Overview - Probabilistic Databases

Database Reference

In-Depth Information

out using “thresholding”, while Zhang et al. [ 2009 ] describe continuous skyline queries over sliding

windows on uncertain data elements regarding given probability thresholds. Jestes et al. [ 2010 ] ex-

tend the string similarity problem, which is used in many database queries, to probabilistic strings;

they consider both the “string level model”, consisting of a complete distribution on the possible

strings, and the “character level model”, where characters are independent events, and derive solutions

for the Expected Edit Distance (EED). Xu et al. [ 2010 ] generalize the simple selection problem to

probabilistic databases: the attribute in the data is uncertain and given by a probabilistic histogram,

and the value being searched is also uncertain. They use the Earth Mover's Distance to define the

similarity between the two uncertain values and describe techniques for computing it.

A class of applications of probabilistic databases is in inferring missing attribute values in a

deterministic database by mining portions of the data where those values are present. The result

is a probabilistic database since the missing values cannot be inferred exactly, but one can derive

a probability distribution on their possible values. Wolf et al. [ 2009 ] develop methods for mining

attribute correlations (in terms of Approximate Functional Dependencies), value distributions (in

the form of Naïve Bayes Classifiers), and selectivity estimates for that purpose. Stoyanovich et al.

[ 2011 ] use ensembles and develop an elegant and effective theory for inferring missing values from

various subsets of the defined attributes. Dasgupta et al. [ 2009 ] describe an interesting application

of probabilistic data for acquiring unbiased samples from online hidden database, which offer query

interfaces that return restricted answers (e.g., only top-k of the selected tuples), accompanied by a

total count of the total number of tuples.

Finally, we mention an important subarea of probabilistic databases that we do not cover in

this topic: ranking the query answers by using both a user defined scoring criterion and the tuple

probability, e.g., [ Cormode et al. , 2009b , Ge et al. , 2009 , Li et al. , 2009a , b , Soliman et al. , 2008 ,

2010 , Zhang and Chomicki , 2008 ]. It is often the case that the user can specify a particular ranking

criteria, for example, rank products by prices or rank locations by some distance, which has a well

defined semantics even on a deterministic database. If the database is probabilistic, then ranking

becomes quite challenging because the system needs to account both for the user defined criterion

and for the output probability.

1.4

BIBLIOGRAPHIC AND HISTORICAL NOTES

1.4.0.1 Early Work on Probabilistic Databases

Probabilistic databases are almost as old as traditional databases. Early work from the

80's [ Cavallo and Pittarelli , 1987 , Gelenbe and Hébrail , 1986 , Ghosh , 1986 , Lefons et al. , 1983 ]

described attributes as random variables. Attribute-level uncertainty as we understand it today, as

an uncertain value of an attribute, was popularized by the work of Barbará et al. [ 1992 ] who also

considered query processing and described a simple evaluation method for selection-join queries.

Motivated by the desire to merge databases with information retrieval, Fuhr [ 1990 ] and

Fuhr and Rölleke [ 1997 ] defined a more elaborate probabilistic data model, which is essentially

equivalent to the possible worlds semantics. A similar semantics is described by Zimányi [ 1997 ].

Search WWH ::

Custom Search

Home