application of probabilistic databases to the Named Entity Recognition (NER) problem. In NER,
each token in a text document must be labeled with an entity type, such as PER (a person entity such as Bill),
ORG (an organization such as IBM), LOC (a location such as New York City), MISC (a miscellaneous
entity, none of the above), or O (not a named entity). By combining Markov Chain Monte Carlo
with incremental view update techniques, they show considerable speedups on a corpus of 1788
New York Times articles from the year 2004. Fink et al. [2011a] describe a system that can answer
relational queries on probabilistic tables constructed by aggregating Web data using Google Squared,
and on other online data that can be brought into tabular form.
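To illustrate the kind of probabilistic output an MCMC-based NER system produces, the following sketch estimates per-token label marginals from a set of sampled possible worlds. The sentence, samples, and probabilities are hypothetical, chosen only to make the idea concrete:

```python
from collections import Counter

def label_marginals(worlds):
    """Estimate per-token label marginals from sampled possible worlds
    (e.g., produced by an MCMC sampler over labelings).
    Each world is a tuple of labels, one per token."""
    n = len(worlds)
    num_tokens = len(worlds[0])
    marginals = []
    for i in range(num_tokens):
        counts = Counter(world[i] for world in worlds)
        marginals.append({label: c / n for label, c in counts.items()})
    return marginals

# Hypothetical MCMC samples for the sentence "Bill joined IBM":
worlds = [
    ("PER", "O", "ORG"),
    ("PER", "O", "ORG"),
    ("PER", "O", "LOC"),
    ("O",   "O", "ORG"),
]
print(label_marginals(worlds)[0])  # marginal for "Bill": PER with probability 0.75
```

Each marginal distribution becomes one probabilistic tuple; a query such as "find all persons" then returns each token together with its PER probability.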
A related application is wrapper induction. Dalvi et al. [2009] describe an approach to robust
wrapper induction that uses a probabilistic model of changes to the data. The goal is for the wrapper
to remain robust under likely changes to the data sources.
RFID data management extracts and queries complex events over streams of readings of RFID
tags. Due to the noisy nature of RFID tag readings, these are usually converted into probabilistic
data, using techniques such as particle filters, and then stored in a probabilistic database [Diao et al.,
2009, Khoussainova et al., 2008, Ré et al., 2008, Tran et al., 2009].
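A minimal sketch of this conversion step is shown below: a bootstrap particle filter turns noisy antenna readings into a posterior distribution over a tag's location, which is exactly the kind of probabilistic tuple stored in the database. The zones, transition probability, and antenna accuracy are hypothetical parameters, not taken from the cited systems:

```python
import random
from collections import Counter

LOCATIONS = ["dock", "aisle", "shelf"]  # hypothetical warehouse zones

def step(particles, reading, stay=0.8, acc=0.9, rng=random):
    """One bootstrap particle-filter step: propagate each particle through
    a simple transition model, weight it by the likelihood of the observed
    antenna reading, then resample proportionally to the weights."""
    # Transition model: a tag stays in its zone w.p. `stay`, else moves uniformly.
    moved = [p if rng.random() < stay else rng.choice(LOCATIONS)
             for p in particles]
    # Observation model: the antenna reports the true zone w.p. `acc`.
    weights = [acc if p == reading else (1 - acc) / (len(LOCATIONS) - 1)
               for p in moved]
    return rng.choices(moved, weights=weights, k=len(particles))

def posterior(particles):
    """Marginal location distribution: the probabilistic tuple to be stored."""
    counts = Counter(particles)
    return {loc: counts[loc] / len(particles) for loc in LOCATIONS}

rng = random.Random(0)
particles = [rng.choice(LOCATIONS) for _ in range(1000)]
for reading in ["aisle", "aisle", "shelf"]:
    particles = step(particles, reading, rng=rng)
print(posterior(particles))  # e.g., most mass split between "aisle" and "shelf"
```

Downstream event queries ("was tag t near the dock?") are then answered over these distributions rather than over the raw, noisy readings.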
Probabilistic data is also used in data cleaning. Andritsos et al. [2006] show how to use a
simple BID data model to capture key violations in databases, which occur often when integrating
data from multiple sources. Antova et al. [2009] and Antova et al. [2007c] study data cleaning in a
general-purpose uncertain (respectively, probabilistic) database system, by iteratively removing possible
worlds from a representation of a large set of possible worlds. Given that only a limited amount of resources
is available to clean the database, Cheng et al. [2008] describe a technique for choosing the set of
uncertain objects to be cleaned so as to achieve the best improvement in the quality of query
answers. They develop a quality metric for a probabilistic database and investigate how such a
metric can be used for data cleaning purposes.
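The BID (block-independent-disjoint) encoding of key violations can be sketched as follows: conflicting tuples for the same key form a block of disjoint alternatives, blocks are independent, and an existential query multiplies out over blocks. The table contents and probabilities below are hypothetical, meant only to illustrate the model:

```python
def prob_exists(bid_table, predicate):
    """Probability that some tuple satisfying `predicate` is in the database,
    for a BID table given as {key: [(tuple, prob), ...]}: alternatives within
    a block are mutually exclusive, and distinct blocks are independent."""
    p_none = 1.0
    for alternatives in bid_table.values():
        # Within a block, at most one alternative is true, so probabilities add.
        p_block = sum(p for t, p in alternatives if predicate(t))
        # Across blocks, independence lets us multiply the failure probabilities.
        p_none *= 1.0 - p_block
    return 1.0 - p_none

# Hypothetical key-violating data obtained by integrating two sources:
people = {
    "Alice": [({"city": "NYC"}, 0.6), ({"city": "Boston"}, 0.4)],
    "Bob":   [({"city": "NYC"}, 0.5), ({"city": "LA"},     0.5)],
}
p = prob_exists(people, lambda t: t["city"] == "NYC")
# p = 1 - (1 - 0.6)(1 - 0.5) = 0.8
```

Cleaning then amounts to removing alternatives (possible worlds) from such a representation, or to choosing which blocks to resolve first so as to sharpen query answers the most.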
In entity resolution, entities from two different databases need to be matched, and the challenge
is that the same object may be represented differently in the two databases. In deduplication, we need
to eliminate duplicates from a collection of objects while facing the same challenge as before, namely
that an object may occur repeatedly under different representations. Probabilistic databases have been
proposed to deal with this problem too. Hassanzadeh and Miller [2009] keep duplicates when the
correct cleaning strategy is not certain and use an efficient probabilistic query-answering technique
to return query results along with the probability of each answer being correct. Sismanis et al. [2009]
propose an approach that maintains the data in an unresolved state and deals with entity
uncertainty dynamically at query time. Beskales et al. [2010] describe ProbClean, a duplicate elimination system
that compactly encodes the space of possible repairs.
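The common idea of query-time duplicate handling can be sketched as follows: the data is kept unresolved as a set of candidate clusterings of the records, each with a probability, and an answer's probability is the total weight of the clusterings that produce it. The records, clusterings, and weights are hypothetical and do not reproduce any of the cited systems:

```python
def answer_probs(clusterings, query):
    """Evaluate `query` in every candidate clustering; an answer's
    probability is the total weight of clusterings that produce it."""
    probs = {}
    for clusters, p in clusterings:
        for ans in query(clusters):
            probs[ans] = probs.get(ans, 0.0) + p
    return probs

# Hypothetical duplicate candidates: records r1 and r2 may denote one person.
clusterings = [
    ([{"r1", "r2"}],   0.7),  # world 1: r1 and r2 are duplicates
    ([{"r1"}, {"r2"}], 0.3),  # world 2: they are distinct entities
]

def count(clusters):
    """Query: how many distinct entities are there?"""
    return [len(clusters)]

print(answer_probs(clusterings, count))  # {1: 0.7, 2: 0.3}
```

Deferring resolution this way lets different queries weigh the same unresolved duplicates differently, instead of committing to one cleaning decision up front.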
Arumugam et al. [2010], Jampani et al. [2008], and Xu et al. [2009] describe applications of
probabilistic databases to business intelligence and financial risk assessment. Deutch et al. [2010b],
Deutch and Milo [2010], and Deutch [2011] consider applications of probabilistic data to business
processes.