application of probabilistic databases to the Named Entity Recognition (NER) problem. In NER,
each token in a text document must be labeled with an entity type, such as PER (a person entity such as Bill),
ORG (an organization such as IBM), LOC (a location such as New York City), MISC (a miscellaneous
entity, none of the above), or O (not a named entity). By combining Markov Chain Monte Carlo
with incremental view update techniques, they show considerable speedups on a corpus of 1788
New York Times articles from the year 2004. Fink et al. [2011a] describe a system that can answer
relational queries on probabilistic tables constructed by aggregating Web data using Google Squared,
and on other online data that can be brought into tabular form.
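To illustrate the kind of probabilistic output an MCMC-based NER system produces, the following sketch estimates per-token label marginals from a set of sampled possible worlds. The sentence, samples, and probabilities are hypothetical, chosen only to make the idea concrete:

```python
from collections import Counter

def label_marginals(worlds):
    """Estimate per-token label marginals from sampled possible worlds
    (e.g., produced by an MCMC sampler over labelings).
    Each world is a tuple of labels, one per token."""
    n = len(worlds)
    num_tokens = len(worlds[0])
    marginals = []
    for i in range(num_tokens):
        counts = Counter(world[i] for world in worlds)
        marginals.append({label: c / n for label, c in counts.items()})
    return marginals

# Hypothetical MCMC samples for the sentence "Bill joined IBM":
worlds = [
    ("PER", "O", "ORG"),
    ("PER", "O", "ORG"),
    ("PER", "O", "LOC"),
    ("O",   "O", "ORG"),
]
print(label_marginals(worlds)[0])  # marginal for "Bill": PER with probability 0.75
```

Each marginal distribution becomes one probabilistic tuple; a query such as "find all persons" then returns each token together with its PER probability.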
A related application is wrapper induction. Dalvi et al. [2009] describe an approach to robust
wrapper induction that uses a probabilistic model of changes to the data. The goal is for the wrapper
to remain robust under likely changes to the data sources.
RFID data management extracts and queries complex events over streams of readings of RFID
tags. Due to the noisy nature of RFID tag readings, these are usually converted into probabilistic
data, using techniques such as particle filters, and then stored in a probabilistic database [Diao et al.,
2009, Khoussainova et al., 2008, Ré et al., 2008, Tran et al., 2009].
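A minimal sketch of this conversion step is shown below: a bootstrap particle filter turns noisy antenna readings into a posterior distribution over a tag's location, which is exactly the kind of probabilistic tuple stored in the database. The zones, transition probability, and antenna accuracy are hypothetical parameters, not taken from the cited systems:

```python
import random
from collections import Counter

LOCATIONS = ["dock", "aisle", "shelf"]  # hypothetical warehouse zones

def step(particles, reading, stay=0.8, acc=0.9, rng=random):
    """One bootstrap particle-filter step: propagate each particle through
    a simple transition model, weight it by the likelihood of the observed
    antenna reading, then resample proportionally to the weights."""
    # Transition model: a tag stays in its zone w.p. `stay`, else moves uniformly.
    moved = [p if rng.random() < stay else rng.choice(LOCATIONS)
             for p in particles]
    # Observation model: the antenna reports the true zone w.p. `acc`.
    weights = [acc if p == reading else (1 - acc) / (len(LOCATIONS) - 1)
               for p in moved]
    return rng.choices(moved, weights=weights, k=len(particles))

def posterior(particles):
    """Marginal location distribution: the probabilistic tuple to be stored."""
    counts = Counter(particles)
    return {loc: counts[loc] / len(particles) for loc in LOCATIONS}

rng = random.Random(0)
particles = [rng.choice(LOCATIONS) for _ in range(1000)]
for reading in ["aisle", "aisle", "shelf"]:
    particles = step(particles, reading, rng=rng)
print(posterior(particles))  # e.g., most mass split between "aisle" and "shelf"
```

Downstream event queries ("was tag t near the dock?") are then answered over these distributions rather than over the raw, noisy readings.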
Probabilistic data is also used in data cleaning. Andritsos et al. [2006] show how to use a
simple BID data model to capture key violations in databases, which occur often when integrating
data from multiple sources. Antova et al. [2009] and Antova et al. [2007c] study data cleaning in a
general-purpose uncertain (respectively, probabilistic) database system, by iteratively removing possible
worlds from a representation of a large set of possible worlds. Given that only a limited amount of resources
is available to clean the database, Cheng et al. [2008] describe a technique for choosing the set of
uncertain objects to be cleaned so as to achieve the best improvement in the quality of query
answers. They develop a quality metric for a probabilistic database and investigate how such a
metric can be used for data cleaning purposes.
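The BID (block-independent-disjoint) encoding of key violations can be sketched as follows: conflicting tuples for the same key form a block of disjoint alternatives, blocks are independent, and an existential query multiplies out over blocks. The table contents and probabilities below are hypothetical, meant only to illustrate the model:

```python
def prob_exists(bid_table, predicate):
    """Probability that some tuple satisfying `predicate` is in the database,
    for a BID table given as {key: [(tuple, prob), ...]}: alternatives within
    a block are mutually exclusive, and distinct blocks are independent."""
    p_none = 1.0
    for alternatives in bid_table.values():
        # Within a block, at most one alternative is true, so probabilities add.
        p_block = sum(p for t, p in alternatives if predicate(t))
        # Across blocks, independence lets us multiply the failure probabilities.
        p_none *= 1.0 - p_block
    return 1.0 - p_none

# Hypothetical key-violating data obtained by integrating two sources:
people = {
    "Alice": [({"city": "NYC"}, 0.6), ({"city": "Boston"}, 0.4)],
    "Bob":   [({"city": "NYC"}, 0.5), ({"city": "LA"},     0.5)],
}
p = prob_exists(people, lambda t: t["city"] == "NYC")
# p = 1 - (1 - 0.6)(1 - 0.5) = 0.8
```

Cleaning then amounts to removing alternatives (possible worlds) from such a representation, or to choosing which blocks to resolve first so as to sharpen query answers the most.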
In entity resolution, entities from two different databases need to be matched, and the challenge
is that the same object may be represented differently in the two databases. In deduplication, we need
to eliminate duplicates from a collection of objects while facing the same challenge as before, namely
that an object may occur repeatedly under different representations. Probabilistic databases have been
proposed to deal with this problem too. Hassanzadeh and Miller [2009] keep duplicates when the
correct cleaning strategy is not certain and use an efficient probabilistic query-answering technique
to return query results along with the probability of each answer being correct. Sismanis et al. [2009]
propose an approach that maintains the data in an unresolved state and deals with entity
uncertainty dynamically at query time. Beskales et al. [2010] describe ProbClean, a duplicate elimination system
that compactly encodes the space of possible repairs.
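The common idea of query-time duplicate handling can be sketched as follows: the data is kept unresolved as a set of candidate clusterings of the records, each with a probability, and an answer's probability is the total weight of the clusterings that produce it. The records, clusterings, and weights are hypothetical and do not reproduce any of the cited systems:

```python
def answer_probs(clusterings, query):
    """Evaluate `query` in every candidate clustering; an answer's
    probability is the total weight of clusterings that produce it."""
    probs = {}
    for clusters, p in clusterings:
        for ans in query(clusters):
            probs[ans] = probs.get(ans, 0.0) + p
    return probs

# Hypothetical duplicate candidates: records r1 and r2 may denote one person.
clusterings = [
    ([{"r1", "r2"}],   0.7),  # world 1: r1 and r2 are duplicates
    ([{"r1"}, {"r2"}], 0.3),  # world 2: they are distinct entities
]

def count(clusters):
    """Query: how many distinct entities are there?"""
    return [len(clusters)]

print(answer_probs(clusterings, count))  # {1: 0.7, 2: 0.3}
```

Deferring resolution this way lets different queries weigh the same unresolved duplicates differently, instead of committing to one cleaning decision up front.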
Arumugam et al. [2010], Jampani et al. [2008], and Xu et al. [2009] describe applications of
probabilistic databases to business intelligence and financial risk assessment. Deutch et al. [2010b],
Deutch and Milo [2010], and Deutch [2011] consider applications of probabilistic data to business
processes.