3
Extracting Relations from Text:
From Word Sequences to Dependency Paths
Razvan C. Bunescu and Raymond J. Mooney
3.1 Introduction
Extracting semantic relationships between entities mentioned in text documents is
an important task in natural language processing. The various types of relationships
that are discovered between mentions of entities can provide useful structured infor-
mation to a text mining system [1]. Traditionally, the task specifies a predefined set
of entity types and relation types that are deemed to be relevant to a potential user
and that are likely to occur in a particular text collection. For example, information
extraction from newspaper articles is usually concerned with identifying mentions
of people, organizations, and locations, and with extracting useful relations between them.
Relevant relation types range from social relationships, to roles that people hold
inside an organization, to relations between organizations, to physical locations of
people and organizations. Scientific publications in the biomedical domain offer a
type of narrative that is very different from the newspaper discourse. A significant
effort is currently spent on automatically extracting relevant pieces of information
from Medline, an online collection of biomedical abstracts. Proteins, genes, and cells
are examples of relevant entities in this task, whereas subcellular localizations and
protein-protein interactions are two of the relation types that have received
significant attention recently. The inherent difficulty of the relation extraction task is
further compounded in the biomedical domain by the relative scarcity of tools able
to analyze the corresponding type of narrative. Most existing natural language processing
tools, such as tokenizers, sentence segmenters, part-of-speech (POS) taggers, and
shallow or full parsers, are trained on newspaper corpora, and consequently they incur
a loss in accuracy when applied to biomedical literature. Therefore, information
extraction systems developed for biological corpora need to be robust to POS or
parsing errors, or to give reasonable performance using shallower but more reliable
information, such as chunking instead of full parsing.
In this chapter, we present two recent approaches to relation extraction that
differ in terms of the kind of linguistic information they use:
1. In the first method (Section 3.2), each potential relation is represented implicitly
as a vector of features, where each feature corresponds to a word sequence an-
chored at the two entities forming the relationship. A relation extraction system