Overview - Probabilistic Databases

1. OVERVIEW

Scientific data management is a major application domain for probabilistic databases. One of the

early works recognizing this potential is by Nierman and Jagadish [ 2002 ]. They describe a system,

ProTDB (Probabilistic Tree Data Base), based on a probabilistic XML data model, and they apply it to

protein chemistry data from the bioinformatics domain. Detwiler et al. [ 2009 ] describe BioRank, a

mediator-based data integration system for exploratory queries that keeps track of the uncertainties

introduced by joining data elements across sources and the inherent uncertainty in scientific data.

The system uses this uncertainty to rank uncertain query results, in particular for predicting

protein functions. They show that the use of probabilities

increases the system's ability to predict less-known or previously unknown functions but is not more

effective for predicting well-known functions than deterministic methods. Potamias et al. [ 2010 ]

describe an application of probabilistic databases for the study of protein-protein interaction. They

consider the protein-protein interaction network (PPI) created by Krogan et al. [ 2006 ], where two

proteins are linked if it is likely that they interact, and model it as a probabilistic graph. Another

application of probabilistic graph databases to protein prediction is described by Zou et al. [ 2010 ].

Voronoi diagrams on uncertain data are considered by Cheng et al. [ 2010b ].
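
To make the idea of a probabilistic graph concrete, the following is a minimal sketch, not the method of Potamias et al. or Zou et al. It assumes that each edge exists independently with its own probability, and it computes the probability that two proteins are connected by enumerating every possible world. The protein names, the edge probabilities, and the function names are hypothetical, and exhaustive enumeration is only feasible for very small graphs.

```python
from itertools import product

# Hypothetical edges: (protein_u, protein_v, probability that the interaction exists).
# Assumption: edges are independent of one another.
EDGES = [("P1", "P2", 0.9), ("P2", "P3", 0.5), ("P1", "P3", 0.2)]


def connected(present_edges, source, target):
    """Depth-first search over the undirected edges present in one possible world."""
    adjacency = {}
    for u, v in present_edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    stack, seen = [source], {source}
    while stack:
        node = stack.pop()
        if node == target:
            return True
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False


def connection_probability(edges, source, target):
    """Sum the probabilities of all possible worlds in which source reaches target."""
    total = 0.0
    for world in product([True, False], repeat=len(edges)):
        prob = 1.0
        present = []
        for (u, v, p), keep in zip(edges, world):
            prob *= p if keep else 1.0 - p
            if keep:
                present.append((u, v))
        if connected(present, source, target):
            total += prob
    return total


result = connection_probability(EDGES, "P1", "P3")
print(round(result, 4))
```

For these made-up probabilities the result is 0.56, which matches the closed form 1 - (1 - 0.2)(1 - 0.9 * 0.5): the two proteins are connected unless both the direct edge and the two-hop path are absent.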

Dong et al. [ 2009 ] consider uncertainty in data integration; they introduce the concept of

probabilistic schema mappings and analyze their formal foundations. They consider two possible

semantics, by-table and by-tuple. Gal et al. [ 2009 ] study how to answer aggregate queries with

COUNT, AVG, SUM, MIN, and MAX over such mappings, by considering both by-table and

by-tuple semantics. Cheng et al. [ 2010a ] study the problem of managing possible mappings between

two heterogeneous XML schemas, and they propose a data structure for representing these mappings

that takes advantage of their high degree of overlap. van Keulen and de Keijzer [ 2009 ] consider user

feedback in probabilistic data integration. Fagin et al. [ 2010 ] consider probabilistic data exchange

and establish a foundational framework for this problem.
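
The difference between the by-table and by-tuple semantics can be illustrated with a small sketch. It is not the algorithm of Dong et al. or Gal et al.; it simply enumerates the possibilities for a COUNT query. Under by-table semantics, one mapping is chosen for the whole source table; under by-tuple semantics, each tuple picks its mapping independently. The table contents, the column names `a` and `b`, the probabilities, and the threshold are all hypothetical.

```python
from collections import defaultdict
from itertools import product

# Hypothetical source table: target attribute "x" may map to column "a" or column "b".
ROWS = [{"a": 5, "b": 20}, {"a": 15, "b": 8}, {"a": 30, "b": 40}]

# Hypothetical probabilistic schema mapping: (source column for x, probability).
MAPPINGS = [("a", 0.6), ("b", 0.4)]


def count_matching(rows, columns, threshold=10):
    """COUNT(*) WHERE x > threshold, reading x from the column chosen for each row."""
    return sum(1 for row, col in zip(rows, columns) if row[col] > threshold)


def by_table(rows, mappings):
    """One mapping applies to every tuple; the result is a distribution over counts."""
    dist = defaultdict(float)
    for col, p in mappings:
        dist[count_matching(rows, [col] * len(rows))] += p
    return dict(dist)


def by_tuple(rows, mappings):
    """Each tuple chooses a mapping independently; enumerate all mapping sequences."""
    dist = defaultdict(float)
    for choice in product(mappings, repeat=len(rows)):
        prob = 1.0
        for _, p in choice:
            prob *= p
        dist[count_matching(rows, [col for col, _ in choice])] += prob
    return dict(dist)


print(by_table(ROWS, MAPPINGS))
print(by_tuple(ROWS, MAPPINGS))
```

With this toy data both mappings yield a count of 2, so the by-table answer is certain, whereas the by-tuple answer spreads probability over the counts 1, 2, and 3. The by-tuple enumeration grows exponentially with the number of tuples, so it is meant only as an illustration on tiny inputs.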

Several researchers have recognized the need to redesign major components of data

management systems in order to cope with uncertain data. Cormode et al. [ 2009a ] and

Cormode and Garofalakis [ 2009 ] redesign the histogram synopses, both for internal DBMS decisions

(such as indexing and query planning) and for approximate query processing. Their histograms

retain the possible-worlds semantics of probabilistic data, allowing for more accurate, yet concise,

representation of the uncertainty characteristics of data and query results. Zhang et al. [ 2008 ] describe

a data mining algorithm on probabilistic data. They consider a collection of X-tuples and search for

approximately likely frequent items, with guaranteed high probability and accuracy. Rastogi et al.

[ 2008 ] describe how to redesign access control to data when the database is probabilistic. They

observe that access is often controlled by data; for example, a physician may access a patient's data

only if the database has a record that the physician treats that patient; but in probabilistic databases

the grant/deny decision is uncertain. The authors describe a new access control method that adds a

degree of noise to the data that is proportional to the degree of uncertainty of the access condition.

Atallah and Qi [ 2009 ] describe how to extend skyline computation to probabilistic databases, with-
