Statistical Relational Data Integration for Information Extraction - Reasoning Web

This skepticism manifests itself in a recent surge of information extraction

projects such as the open information extraction [23] (OIE) and the never ending

language learning [13] (NELL) projects. Indeed, the OIE project explicitly defines

itself as open, meaning that it does not leverage ontologies or relational schemas.

The major argument supporting this position is that a relational schema or

ontology unnecessarily constrains what can be extracted from large web corpora.

The NELL project leverages a type system and a fixed set of relations, even

though recent work has moved towards (semi-)automatically extending the set

of relations. However, insight and expertise accumulated in the semantic web

community over the last 10 years is largely ignored. For instance, the project

does not employ canonical labels for its entities ('Argentina' refers to both the

national soccer team and the country itself) and makes no use of existing

knowledge representation formalisms even though it actually uses notions such as

range and domain restrictions implicitly. While this could be explained by the

specific applications the creators have in mind (improved keyword search and

natural language question answering, for instance), there are some reasonable

arguments in favor of not completely ignoring the existing body of work and

experience of the semantic web community. Other information extraction projects

such as DBpedia [4,59] and YAGO [87,36] are more in line with semantic web

technologies as they use unique canonical identifiers for entities (derived from

the URIs of the corresponding Wikipedia articles) and notions such as range

and domain restrictions that closely resemble the RDF standard. The advantage

of using these standardized RDF formalisms is that they enable the creation

of links across heterogeneous data sets and a unifying syntactic and semantic

framework for knowledge bases. DBpedia, for instance, has established itself as a

linking hub for the linked open data cloud. The existence of a relational schema

or ontology also facilitates relational query processing and the use of statistical

relational approaches such as Markov logic [80].
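
To make the role of canonical identifiers and of range and domain restrictions concrete, the following is a minimal sketch rather than the approach of any of the projects above. The identifiers, type names, and relations (`ex:Argentina`, `playsFor`, and so on) are hypothetical and chosen only for illustration. The sketch also treats the restrictions as constraints to be checked, which is a simplification: in RDF Schema, domain and range statements are used to infer types rather than to reject triples.

```python
from typing import Dict, List, NamedTuple


class RelationSchema(NamedTuple):
    """Domain and range restriction of one relation (hypothetical schema)."""
    domain_type: str
    range_type: str


# Hypothetical canonical identifiers: the ambiguous label 'Argentina'
# is split into two distinct entities with distinct types.
ENTITY_TYPES: Dict[str, str] = {
    "ex:Argentina": "Country",
    "ex:Argentina_national_football_team": "SportsTeam",
    "ex:Lionel_Messi": "Person",
}

# Hypothetical relations with explicit domain and range restrictions.
SCHEMA: Dict[str, RelationSchema] = {
    "playsFor": RelationSchema(domain_type="Person", range_type="SportsTeam"),
    "bornIn": RelationSchema(domain_type="Person", range_type="Country"),
}


def type_violations(subject: str, predicate: str, obj: str) -> List[str]:
    """Return the domain/range violations of a single extracted triple."""
    relation = SCHEMA[predicate]
    problems = []
    if ENTITY_TYPES.get(subject) != relation.domain_type:
        problems.append(f"subject {subject} is not a {relation.domain_type}")
    if ENTITY_TYPES.get(obj) != relation.range_type:
        problems.append(f"object {obj} is not a {relation.range_type}")
    return problems
```

With a bare label such as 'Argentina' an extractor cannot tell whether a `playsFor` fact is well typed. With canonical identifiers, `type_violations("ex:Lionel_Messi", "playsFor", "ex:Argentina")` reports a range violation, while the same call with `ex:Argentina_national_football_team` as the object reports none.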

The present lecture notes provide a brief overview of existing information

extraction projects ranging from those with a predetermined ontology (that is,

a relational schema), high-precision extractions, and limited coverage, to those

without any kind of schema, low-precision extractions, and broader coverage.

We do not take sides and instead focus on possible synergies that arise when we

consider each of the projects as disparate and heterogeneous knowledge bases

whose integration would not only broaden the amount of extracted knowledge

but also increase the extraction quality and provide relational schemas for facts

that were previously schema-less. We provide an overview of the problem areas of

ontology matching and object reconciliation from a semantic web perspective.

We then show how both the relational schema and the data can be jointly

modeled with statistical relational formalisms.
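
As a reminder of what such a formalism looks like, the standard definition of the distribution induced by a Markov logic network is sketched below. The notation is the usual one from the Markov logic literature and is an assumption here, since this excerpt does not fix its own symbols: $F_i$ is a first-order formula with weight $w_i$, $n_i(x)$ is the number of true groundings of $F_i$ in the possible world $x$, and $Z$ is the normalization constant.

```latex
P(X = x) \;=\; \frac{1}{Z} \exp\!\Big( \sum_{i} w_i \, n_i(x) \Big),
\qquad
Z \;=\; \sum_{x'} \exp\!\Big( \sum_{i} w_i \, n_i(x') \Big)
```

Under this reading, schema-level statements (for example, that two concepts correspond) and data-level statements (for example, that two individuals denote the same object) can appear as atoms in the same world $x$, so a single distribution scores them jointly.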

Ontology matching, or ontology alignment, is the problem of determining

correspondences between concepts, properties, and individuals of two or more

different formal ontologies [26]. The alignment of ontologies allows semantic

applications to exchange and enrich the data expressed in the respective

ontologies. An important result of the yearly ontology alignment evaluation initiative
