Statistical Relational Data Integration for Information Extraction - Reasoning Web

integer linear programming. Finally, we discuss some recent advances in the

development of efficient algorithms for probabilistic inference.

2 Information Extraction - The State of the Art

There are numerous information extraction (IE) projects, each focusing on particular subproblems of information extraction and knowledge base construction. We selected several representative projects without making a claim of completeness. Other IE projects that we are aware of but cannot cover here due to space considerations are Freebase [10] and DeepDive [72].

The following descriptions of the information extraction projects demonstrate that all of them use a combination of statistical and logical formalisms to extract facts and to improve the quality of the derived knowledge. Hence, information extraction projects are prime examples of settings where statistical relational learning and joint inference prove tremendously useful and are naturally applicable. It is also interesting to observe that many of these projects have strong commonalities despite their different objectives and premises. The main motivation for presenting the various approaches to knowledge base extraction is to demonstrate the importance of methods that combine probability and logic and to excite readers with a semantic web background about the data that these projects continuously aggregate. There are numerous research directions for young researchers to pursue.

2.1 YAGO

YAGO was introduced in [87]. Each entity in YAGO corresponds to an article in Wikipedia. Whenever Wikipedia's volunteer editors deem an entity worthy of a Wikipedia article, YAGO creates the corresponding entity in its knowledge base. The taxonomic backbone of YAGO is based on a hierarchy of user-created Wikipedia categories. YAGO establishes links between Wikipedia categories and synsets in WordNet [28].

YAGO has roughly 100 manually defined relations, such as locatedIn and hasPopulation. YAGO extracts instances of these relations from Wikipedia infoboxes (meta-data boxes). These instances are commonly denoted as facts: triples of an entity (the subject), a relation (the predicate), and another entity (the object). YAGO utilizes a set of manually created patterns that map categories and infobox attributes to fact templates. YAGO contains more than 80 million facts involving more than 9 million entities [36].
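To make the notion of a fact and of an attribute-to-template pattern concrete, the following is a minimal Python sketch. It is not YAGO's actual extraction code: the attribute names (population_total, subdivision_name), the entity ExampleCity, and the infobox values are made-up placeholders, and the real patterns are more elaborate than a dictionary lookup. Only the relation names locatedIn and hasPopulation are taken from the text above.

```python
from typing import Dict, List, NamedTuple


class Fact(NamedTuple):
    """A fact is a triple: subject entity, relation (predicate), object."""
    subject: str
    predicate: str
    obj: str


# Hypothetical patterns: infobox attribute name -> relation name.
# The attribute names are illustrative assumptions, not YAGO's real patterns.
ATTRIBUTE_TO_RELATION: Dict[str, str] = {
    "population_total": "hasPopulation",
    "subdivision_name": "locatedIn",
}


def extract_facts(entity: str, infobox: Dict[str, str]) -> List[Fact]:
    """Instantiate one fact per infobox attribute that matches a pattern."""
    facts = []
    for attribute, value in infobox.items():
        relation = ATTRIBUTE_TO_RELATION.get(attribute)
        if relation is not None:  # attributes without a pattern are skipped
            facts.append(Fact(entity, relation, value))
    return facts


# Toy infobox with placeholder values (not real data).
example_infobox = {
    "population_total": "12345",
    "subdivision_name": "ExampleCountry",
    "website": "example.org",  # no pattern defined, so no fact is produced
}
example_facts = extract_facts("ExampleCity", example_infobox)
for fact in example_facts:
    print(fact)
```

In this sketch each matching attribute yields exactly one triple whose subject is the entity described by the article; attributes without a pattern are ignored rather than guessed at.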

The YAGO knowledge base also utilizes a set of deterministic and probabilistic

rules. These declarative rules are used to ensure that facts do not contradict each

other in certain ways. For instance, some of these declarative rules specify the

domains and ranges of the relations and the definition of the classes of the YAGO

concept hierarchy. The rules can, for instance, be used to enforce that instances
