Advanced Techniques - Probabilistic Databases

Database Reference

In-Depth Information

Even with these techniques, query evaluation remains a major challenge. It can be observed

that a key challenge to efficiency is avoiding the selectivity trap : database queries are often highly

selective, through selections and joins, and many of the tuples of the input database(s) do not

survive the path through the query to the result. This has been already observed in early work on

online aggregation [ Hellerstein et al. , 1997 ]. The same issue applies to MCDBs, and a promising

solution is to postpone sampling as long as possible during query evaluation, working with summary

representations of all possible worlds until operators such as tuple probability computations force

the system to produce samples to proceed. The PIP system [ Kennedy and Koch , 2010 ] does just

this using pc-table representations of probabilistic databases. As shown there, pc-tables generalize

tuple bundles in their power to compactly represent many samples (in fact, all possible worlds).

Since, as discussed earlier, pc-tables are a strong representation system for relational algebra, the

relational algebra part of a query can be evaluated by transforming the c-table representation, without

touching the representation of the probability distribution modeled by the random variables in the

tuple conditions. Samples are only generated after the evaluation of the relational algebra operations

of the query is finished, and the costly and unnecessary sampling of data that would fall victim to

filtering in the relational algebra can be suitably counteracted while preserving the correctness of the

overall query result.

6.4 INDEXES AND MATERIALIZED VIEWS

In relational databases, indexes and materialized views are two powerful and popular techniques

to improve query performance. Achieving high query performance is a major technical challenge

for probabilistic databases, and researchers have naturally adapted indexing and materialized view

techniques to probabilistic databases. Probabilistic data, however, presents new challenges that are

not found in relational database management systems. A first conceptual reason that probabilistic

indexing differs from relational indexing is that probabilistic databases allow new types of queries.

For example, consider an environmental monitoring application where we are measuring the tem-

perature of a physical space. The sensors can report the temperature in the room only to within some

confidence. In this setting, one may ask “return the ids of all sensors whose temperature is in some critical

range with probability greater than 0 . 6 ” . To answer this query, one alternative is to probe each of the

sensors and ask them to report their temperature reading. Alternatively, one could use an index and

quickly identify those sensors that meet the above criteria.

Indexing and materialized view techniques also need to be rethought for technical reasons. For

example, in an RDBMS, each tuple can be processed independently. In turn, the RDBMS leverages

this freedom to layout the data to make query retrieval efficient. In contrast, tuples in a probabilistic

databases may be correlated in non-obvious ways.

6.4.1 INDEXES FOR PROBABILISTIC DATA

The first set of indexing for probabilistic databases were concerned with continuous probabilistic

databases to support queries called probabilistic threshold queries (PTQ) [ Cheng et al. , 2004 , Qi et al. ,

Search WWH ::

Custom Search

Home