1. In the first approach (Section 3.2), a relation extraction system is trained based on the subsequence kernel from [2]. This kernel is further generalized so that words can be replaced with word classes, thus enabling the use of information coming from POS tagging, named entity recognition, chunking, or WordNet [3].
2. In the second approach (Section 3.3), the representation is centered on the shortest dependency path between the two entities in the dependency graph of the sentence (a brief illustration follows this list). Because syntactic analysis is essential in this method, its applicability is limited to domains where syntactic parsing achieves reasonable accuracy.
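As a rough illustration of the second representation, the sketch below builds an undirected graph from a hypothetical dependency parse and reads off the shortest path between two entity words. The sentence, the parse, and the use of the networkx library are illustrative assumptions, not part of the method described here:

    import networkx as nx

    # Hypothetical (head, dependent) edges from some dependency parser
    # for: "Protesters seized several pumping stations, holding 127
    # Shell workers hostage."  The parse shown is illustrative only.
    edges = [("seized", "Protesters"), ("seized", "stations"),
             ("stations", "several"), ("stations", "pumping"),
             ("seized", "holding"), ("holding", "workers"),
             ("workers", "127"), ("workers", "Shell"),
             ("holding", "hostage")]

    g = nx.Graph(edges)  # undirected view of the dependency tree
    # shortest dependency path between the two entities of interest
    print(nx.shortest_path(g, "Protesters", "workers"))
    # ['Protesters', 'seized', 'holding', 'workers']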
Entity recognition, a prerequisite for relation extraction, is usually cast as a sequence
tagging problem, in which words are tagged as being either outside any entity, or
inside a particular type of entity. Most approaches to entity tagging are therefore
based on probabilistic models for labeling sequences, such as Hidden Markov Mod-
els [4], Maximum Entropy Markov Models [5], or Conditional Random Fields [6],
and obtain reasonably high accuracy. In the two information extraction methods presented in this chapter, we assume that entity recognition has already been performed and focus only on the relation extraction task.
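For concreteness, this in/out tagging is commonly written in the BIO scheme, where B-X begins an entity of type X, I-X continues it, and O marks tokens outside any entity. The following is a minimal sketch with made-up labels (the tagger that would produce them is assumed, not shown) recovering entity spans from such a tag sequence:

    # Illustrative BIO-tagged output: B-PROT opens a protein mention,
    # I-PROT continues it, O marks tokens outside any entity.
    tagged = [("the", "O"), ("Rad53", "B-PROT"), ("protein", "O"),
              ("binds", "O"), ("Cdc7", "B-PROT"), ("-", "I-PROT"),
              ("Dbf4", "I-PROT")]

    def bio_spans(tagged):
        """Recover (start, end, type) spans from BIO labels; end is exclusive."""
        spans, start, etype = [], None, None
        for i, (_, tag) in enumerate(tagged + [("", "O")]):  # sentinel flushes last span
            if start is not None and not tag.startswith("I-"):
                spans.append((start, i, etype))
                start = None
            if tag.startswith("B-"):
                start, etype = i, tag[2:]
        return spans

    print(bio_spans(tagged))  # [(1, 2, 'PROT'), (4, 7, 'PROT')]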
3.2 Subsequence Kernels for Relation Extraction
One of the first approaches to extracting interactions between proteins from biomedical abstracts is that of Blaschke et al., described in [7, 8]. Their system is based on a set of manually developed rules, where each rule (or frame) is a sequence of words (or POS tags) and two protein-name tokens. Between every two adjacent words is a number indicating the maximum number of intervening words allowed when matching the rule to a sentence. An example rule is "interaction of (3) <P> (3) with (3) <P>", where '<P>' is used to denote a protein name. A sentence matches the rule if and only if it satisfies the word constraints in the given order and respects the respective word gaps.
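The matching semantics just described can be made precise with a small backtracking matcher. The sketch below is an illustrative reconstruction, not Blaschke et al.'s implementation, and it assumes protein names have already been replaced by the token <P>:

    def rule_matches(words, gaps, tokens):
        """Backtracking matcher for Blaschke-style frames: the rule words
        must occur in tokens in the given order, with at most gaps[i]
        intervening tokens between words[i] and words[i+1]."""
        def place(i, lo, hi):
            # try every admissible position for rule word i, then recurse
            for pos in range(lo, min(hi, len(tokens) - 1) + 1):
                if tokens[pos] == words[i]:
                    if i == len(words) - 1 or place(i + 1, pos + 1, pos + 1 + gaps[i]):
                        return True
            return False
        # the first rule word may occur anywhere in the sentence
        return place(0, 0, len(tokens) - 1)

    # Encoding of the example rule "interaction of (3) <P> (3) with (3) <P>":
    rule_words = ["interaction", "of", "<P>", "with", "<P>"]
    rule_gaps = [0, 3, 3, 3]  # max intervening words after each rule word
    sent = "the interaction of <P> with the human <P> was reported".split()
    print(rule_matches(rule_words, rule_gaps, sent))  # True

Backtracking is needed because committing to the earliest occurrence of a word can wrongly reject a sentence in which only a later occurrence satisfies the gap constraint.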
In [9], the authors describe ELCS (Extraction using Longest Common Subsequences), a method that automatically learns such rules. ELCS' rule representation is similar to that in [7, 8], except that it does not currently use POS tags, but allows disjunctions of words. An example rule learned by this system is "- (7) interaction (0) [between | of] (5) <P> (9) <P> (17) .". Words in square brackets separated by '|' indicate disjunctive lexical constraints, i.e., one of the given words must match the sentence at that position. The numbers in parentheses between adjacent constraints indicate the maximum number of unconstrained words allowed between the two.
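The same backtracking scheme as in the previous sketch covers such disjunctive rules once each constraint is a set of admissible words; again a sketch, not the ELCS implementation:

    def elcs_rule_matches(constraints, gaps, tokens):
        """As rule_matches above, but each constraint is a set of words,
        which accommodates ELCS-style disjunctions."""
        def place(i, lo, hi):
            for pos in range(lo, min(hi, len(tokens) - 1) + 1):
                if tokens[pos] in constraints[i]:
                    if i == len(constraints) - 1 or place(i + 1, pos + 1, pos + 1 + gaps[i]):
                        return True
            return False
        return place(0, 0, len(tokens) - 1)

    # Encoding of the example rule
    # "- (7) interaction (0) [between | of] (5) <P> (9) <P> (17) .":
    constraints = [{"-"}, {"interaction"}, {"between", "of"},
                   {"<P>"}, {"<P>"}, {"."}]
    gaps = [7, 0, 5, 9, 17]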
3.2.1 Capturing Relation Patterns with a String Kernel
Both the system of Blaschke et al. and ELCS perform relation extraction based on a limited set of matching rules, where a rule is simply a sparse (gappy) subsequence of words or POS tags anchored on the two protein-name tokens. Therefore, the two methods share a common limitation: either through manual selection (Blaschke et al.), or as a result of a greedy learning procedure (ELCS), they end up using only a subset of all possible anchored sparse subsequences. Ideally, all such anchored sparse subsequences would be used as features, with weights reflecting their relative accuracy. However, this feature space is far too high-dimensional to enumerate explicitly; instead, the dot product between such feature vectors can be computed implicitly with a kernel.
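The sketch below follows the standard dynamic program for a gap-weighted subsequence kernel in the spirit of [2], with the word-class generalization mentioned earlier: each token is a set of features (word, POS tag, entity class), a pair of positions contributes the number of features the two tokens share, and every spanned position is penalized by a decay factor lam. The feature sets and the value of lam are illustrative assumptions:

    def subseq_kernel(s, t, n, lam=0.75):
        """Gap-weighted subsequence kernel of length n over sequences of
        feature sets; a sketch, not the chapter's implementation."""
        def c(x, y):
            return len(x & y)  # number of common features at a position pair

        m, l = len(s), len(t)
        # Kp[i][p][q] is the auxiliary K'_i over prefixes s[:p] and t[:q]
        Kp = [[[1.0 if i == 0 else 0.0 for _ in range(l + 1)]
               for _ in range(m + 1)] for i in range(n)]
        for i in range(1, n):
            for p in range(1, m + 1):
                kpp = 0.0  # running K''_i as the prefix of t grows
                for q in range(1, l + 1):
                    kpp = lam * kpp + lam * lam * c(s[p - 1], t[q - 1]) * Kp[i - 1][p - 1][q - 1]
                    Kp[i][p][q] = lam * Kp[i][p - 1][q] + kpp
        # sum contributions of all position pairs completing a length-n match
        return sum(lam * lam * c(s[p - 1], t[q - 1]) * Kp[n - 1][p - 1][q - 1]
                   for p in range(1, m + 1) for q in range(1, l + 1))

    # Illustrative token representations: word plus POS tag, with protein
    # names already replaced by the anchor token <P>.
    s = [{"interaction", "NN"}, {"of", "IN"}, {"<P>"}, {"with", "IN"}, {"<P>"}]
    t = [{"interactions", "NNS"}, {"between", "IN"}, {"<P>"}, {"and", "CC"}, {"<P>"}]
    print(subseq_kernel(s, t, n=3))  # similarity driven by shared POS tags and <P> anchors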