Database Reference
In-Depth Information
out using “thresholding”, while Zhang et al. [ 2009 ] describe continuous skyline queries over sliding
windows on uncertain data elements regarding given probability thresholds. Jestes et al. [ 2010 ] ex-
tend the string similarity problem, which is used in many database queries, to probabilistic strings;
they consider both the “string level model”, consisting of a complete distribution on the possible
strings, and the “character level model”, where characters are independent events, and derive solutions
for the Expected Edit Distance (EED). Xu et al. [ 2010 ] generalize the simple selection problem to
probabilistic databases: the attribute in the data is uncertain and given by a probabilistic histogram,
and the value being searched is also uncertain. They use the Earth Mover's Distance to define the
similarity between the two uncertain values and describe techniques for computing it.
A class of applications of probabilistic databases is in inferring missing attribute values in a
deterministic database by mining portions of the data where those values are present. The result
is a probabilistic database since the missing values cannot be inferred exactly, but one can derive
a probability distribution on their possible values. Wolf et al. [ 2009 ] develop methods for mining
attribute correlations (in terms of Approximate Functional Dependencies), value distributions (in
the form of Naïve Bayes Classifiers), and selectivity estimates for that purpose. Stoyanovich et al.
[ 2011 ] use ensembles and develop an elegant and effective theory for inferring missing values from
various subsets of the defined attributes. Dasgupta et al. [ 2009 ] describe an interesting application
of probabilistic data for acquiring unbiased samples from online hidden database, which offer query
interfaces that return restricted answers (e.g., only top-k of the selected tuples), accompanied by a
total count of the total number of tuples.
Finally, we mention an important subarea of probabilistic databases that we do not cover in
this topic: ranking the query answers by using both a user defined scoring criterion and the tuple
probability, e.g., [ Cormode et al. , 2009b , Ge et al. , 2009 , Li et al. , 2009a , b , Soliman et al. , 2008 ,
2010 , Zhang and Chomicki , 2008 ]. It is often the case that the user can specify a particular ranking
criteria, for example, rank products by prices or rank locations by some distance, which has a well
defined semantics even on a deterministic database. If the database is probabilistic, then ranking
becomes quite challenging because the system needs to account both for the user defined criterion
and for the output probability.
1.4
BIBLIOGRAPHIC AND HISTORICAL NOTES
1.4.0.1 Early Work on Probabilistic Databases
Probabilistic databases are almost as old as traditional databases. Early work from the
80's [ Cavallo and Pittarelli , 1987 , Gelenbe and Hébrail , 1986 , Ghosh , 1986 , Lefons et al. , 1983 ]
described attributes as random variables. Attribute-level uncertainty as we understand it today, as
an uncertain value of an attribute, was popularized by the work of Barbará et al. [ 1992 ] who also
considered query processing and described a simple evaluation method for selection-join queries.
Motivated by the desire to merge databases with information retrieval, Fuhr [ 1990 ] and
Fuhr and Rölleke [ 1997 ] defined a more elaborate probabilistic data model, which is essentially
equivalent to the possible worlds semantics. A similar semantics is described by Zimányi [ 1997 ].
Search WWH ::




Custom Search