CHAPTER 1
Overview
1.1 TWO EXAMPLES
NELL [1], the Never-Ending Language Learner, is a research project from CMU that learns over
time to read the Web. It has been running continuously since January 2010. It crawls hundreds of
millions of Web pages and extracts facts of the form (entity, relation, value). Some facts
are shown in Figure 1.1. For example, NELL believes that “Mozart is a person who died at the age
of 35” and that “biscutate_swift is an animal”.
NELL is an example of a large-scale Information Extraction (IE) system. An IE system
extracts structured data, such as triples in the case of NELL, from a collection of unstructured data,
such as Web pages, blogs, emails, and Twitter feeds. Data analytics tools today reach out to such
external sources because they contain valuable and timely information.
The data extracted by an IE system is structured and can therefore be imported into a standard
relational database system. For example, as of February 2011, NELL had extracted 537K triples of the
form (entity, relation, value), which can be downloaded (in CSV format) and imported into, say,
PostgreSQL. The relational schema for NELL can be either a single table of triples, or the data can
be partitioned into distinct tables, one table for each distinct relation. For presentation purposes, we
took the latter approach in Figure 1.2 and show a few tuples in two relations, ProducesProduct
and HeadquarteredIn. For example, the triple (sony, ProducesProduct, walkman) extracted
by NELL is inserted in the database table ProducesProduct as the tuple (sony, walkman). Data
analytics can now be performed by merging the NELL data with other, offline database instances.
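The partitioned schema described above can be sketched in a few lines. This is only an illustration under stated assumptions: SQLite stands in for PostgreSQL so that the example is self-contained, the column names `entity` and `value` and the helper `load_partitioned` are hypothetical, and only the (sony, ProducesProduct, walkman) triple comes from the text; the other rows are invented sample data.

```python
import sqlite3

# Triples in (entity, relation, value) form. Only the first triple appears
# in the text; the others are invented here purely for illustration.
triples = [
    ("sony", "ProducesProduct", "walkman"),
    ("ibm", "ProducesProduct", "personal_computer"),
    ("ibm", "HeadquarteredIn", "armonk"),
]


def load_partitioned(conn, triples):
    """Create one two-column table per distinct relation and insert each triple.

    Hypothetical helper: the column names (entity, value) are an assumption,
    not the schema shown in Figure 1.2.
    """
    for entity, relation, value in triples:
        # Table names cannot be bound as SQL parameters, so reject anything
        # that is not a plain identifier before splicing it into the statement.
        if not relation.isidentifier():
            raise ValueError(f"unsafe relation name: {relation!r}")
        conn.execute(
            f'CREATE TABLE IF NOT EXISTS "{relation}" (entity TEXT, value TEXT)'
        )
        conn.execute(f'INSERT INTO "{relation}" VALUES (?, ?)', (entity, value))


conn = sqlite3.connect(":memory:")  # in-memory stand-in for a PostgreSQL database
load_partitioned(conn, triples)

# The triple (sony, ProducesProduct, walkman) is now the tuple (sony, walkman).
rows = conn.execute(
    'SELECT entity, value FROM "ProducesProduct" ORDER BY entity'
).fetchall()
```

The single-table alternative would instead keep all three columns in one table and filter on the relation column at query time.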
Most IE systems, including NELL, produce data that are probabilistic. Each fact has a
probability, representing the system's confidence that the extraction is correct. While some facts have
probability 1.0, most tuples have a probability that is less than 1.0. In fact, 87% of the 537K tuples in
NELL have a probability below 1.0, so most of the data in NELL is uncertain. Traditional
data cleaning methods simply remove tuples that are uncertain and cannot be repaired; this is clearly
not applicable to large-scale IE systems because it would remove a lot of valuable data items. To use
such data to its full potential, a database system must understand and process data with probabilistic
semantics.
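To make the cost of discarding uncertain tuples concrete, the following minimal sketch splits a handful of facts by whether their probability equals 1.0. Everything in it is assumed for illustration: the facts, their probabilities, and the helper `split_by_certainty` are invented and are not NELL's actual data or cleaning procedure.

```python
# Each fact is (entity, relation, value, probability). The probabilities are
# invented for illustration and are not taken from NELL.
facts = [
    ("sony", "ProducesProduct", "walkman", 0.96),
    ("ibm", "ProducesProduct", "personal_computer", 1.0),
    ("ibm", "HeadquarteredIn", "armonk", 0.93),
    ("sony", "HeadquarteredIn", "tokyo", 0.88),
]


def split_by_certainty(facts):
    """Separate facts with probability 1.0 from those with probability < 1.0."""
    certain = [f for f in facts if f[3] >= 1.0]
    uncertain = [f for f in facts if f[3] < 1.0]
    return certain, uncertain


certain, uncertain = split_by_certainty(facts)

# A cleaning step that keeps only certain facts would drop this fraction.
fraction_lost = len(uncertain) / len(facts)
```

In this toy sample, three of the four facts would be thrown away. A probabilistic database would instead keep every tuple together with its probability and take that probability into account when answering queries.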
Consider a simple query over the NELL database: “Retrieve all products manufactured by a
company headquartered in San Jose”:
select x.Product, x.Company
[1] http://rtw.ml.cmu.edu/rtw/