Overview - Probabilistic Databases


CHAPTER 1

Overview

1.1 TWO EXAMPLES

NELL¹, the Never-Ending Language Learner, is a research project from CMU that learns over time to read the Web. It has been running continuously since January 2010. It crawls hundreds of millions of Web pages and extracts facts of the form (entity, relation, value). Some facts are shown in Figure 1.1. For example, NELL believes that “Mozart is a person who died at the age of 35” and that “biscutate_swift is an animal”.

NELL is an example of a large-scale Information Extraction (IE) system. An IE system extracts structured data, such as triples in the case of NELL, from a collection of unstructured data, such as Web pages, blogs, emails, Twitter feeds, etc. Data analytics tools today reach out to such external sources because they contain valuable and timely information.

The data extracted by an IE system is structured and can therefore be imported into a standard relational database system. For example, as of February 2011, NELL had extracted 537K triples of the form (entity, relation, value), which can be downloaded (in CSV format) and imported into, say, PostgreSQL. The relational schema for NELL can be either a single table of triples, or the data can be partitioned into distinct tables, one table for each distinct relation. For presentation purposes, we took the latter approach in Figure 1.2 and show a few tuples in two relations, ProducesProduct and HeadquarteredIn. For example, the triple (sony, ProducesProduct, walkman) extracted by NELL is inserted in the database table ProducesProduct as the tuple (sony, walkman). Data analytics can now be performed by merging the NELL data with other, offline database instances.
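The per-relation layout described above can be sketched as follows. This is a minimal illustration, not NELL's actual loader: it uses Python's built-in sqlite3 module in place of PostgreSQL, the column names of HeadquarteredIn are assumptions, and every sample triple other than (sony, ProducesProduct, walkman) is hypothetical.

```python
import sqlite3

# Sample triples of the form (entity, relation, value).
# Only the first triple appears in the text; the rest are hypothetical.
triples = [
    ("sony", "ProducesProduct", "walkman"),
    ("company_a", "ProducesProduct", "product_a"),
    ("company_a", "HeadquarteredIn", "san_jose"),
]

# One table per relation. Column names are assumptions made for this sketch.
SCHEMAS = {
    "ProducesProduct": ("Company", "Product"),
    "HeadquarteredIn": ("Company", "City"),
}


def load_triples(conn, triples):
    """Create one table per known relation and route each triple to its table."""
    for relation, (col1, col2) in SCHEMAS.items():
        conn.execute(
            f"CREATE TABLE IF NOT EXISTS {relation} ({col1} TEXT, {col2} TEXT)"
        )
    for entity, relation, value in triples:
        if relation not in SCHEMAS:
            continue  # skip relations this sketch does not model
        # The relation name selects the table; the tuple is (entity, value).
        conn.execute(f"INSERT INTO {relation} VALUES (?, ?)", (entity, value))


conn = sqlite3.connect(":memory:")
load_triples(conn, triples)

products = conn.execute(
    "SELECT Company, Product FROM ProducesProduct ORDER BY Company"
).fetchall()
headquarters = conn.execute(
    "SELECT Company, City FROM HeadquarteredIn"
).fetchall()
```

The alternative single-table schema would keep all three columns in one table and filter on the relation column at query time; the per-relation layout trades that flexibility for simpler queries.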

Most IE systems, including NELL, produce data that are probabilistic. Each fact has a probability, representing the system's confidence that the extraction is correct. While some facts have probability 1.0, most tuples have a probability that is < 1.0. In fact, 87% of the 537K tuples in NELL have a probability that is less than 1.0; that is, most of the data in NELL is uncertain. Traditional data cleaning methods simply remove tuples that are uncertain and cannot be repaired; this is clearly not applicable to large-scale IE systems because it would remove a lot of valuable data items. To use such data to its full potential, a database system must understand and process data with probabilistic semantics.
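To make the cost of discarding uncertain tuples concrete, here is a small sketch. The tuples and their probabilities are hypothetical and are not taken from NELL; the helper names keep_certain and fraction_uncertain are introduced only for this illustration.

```python
# Hypothetical extracted tuples: (company, product, probability).
# The probabilities are invented for illustration.
extracted = [
    ("sony", "walkman", 0.96),
    ("company_a", "product_a", 1.0),
    ("company_b", "product_b", 0.9),
    ("company_c", "product_c", 0.35),
]


def keep_certain(tuples):
    """Mimic cleaning by removal: keep only tuples with probability 1.0."""
    return [t for t in tuples if t[2] == 1.0]


def fraction_uncertain(tuples):
    """Fraction of tuples whose probability is below 1.0."""
    return sum(1 for t in tuples if t[2] < 1.0) / len(tuples)


certain = keep_certain(extracted)
lost = fraction_uncertain(extracted)
print(f"kept {len(certain)} of {len(extracted)} tuples; {lost:.0%} discarded")
```

In this toy data set, removing uncertain tuples discards three of the four facts, including ones the extractor is fairly confident about. Keeping the probability alongside each tuple preserves that information for query processing.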

Consider a simple query over the NELL database: “Retrieve all products manufactured by a company headquartered in San Jose”:

select x.Product, x.Company

1 http://rtw.ml.cmu.edu/rtw/
