A Logical Framework for Template Creation and Information Extraction - Data Mining: Foundations and Practice


Every word has a set of attributes, such as its part of speech or its membership of a semantic class, which we now discuss. Although particular attributes are not a formal part of the framework, they are used in various illustrative examples throughout this paper.

Words that share a common stem, or root, typically share a common meaning, such as the words “sit”, “sitting” and “sits”. It is therefore common practice in information retrieval to index words according to their stem to improve retrieval performance [21]. Similarly, in information extraction, it is often helpful to identify words that share a common stem. The most common approach is to remove suffixes to produce a single stem for each word [21], although in principle each word could have multiple stems, for example if prefixes were removed independently of suffixes.
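
As a rough illustration of suffix removal, the following Python sketch strips a few English suffixes and groups words by the resulting stem. It is not the stemming algorithm of [21]: the suffix list, the minimum stem length and the function names `naive_stem` and `index_by_stem` are assumptions made only for this example, and the rule that undoes consonant doubling is deliberately crude (it would, for instance, shorten “falling” to “fal”).

```python
from collections import defaultdict

# Hypothetical suffix list for illustration only; a real stemmer uses many more rules.
SUFFIXES = ("ing", "ed", "s")


def naive_stem(word, min_stem_len=3):
    """Return one stem per word by removing the first matching suffix."""
    word = word.lower()
    for suffix in SUFFIXES:
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and len(stem) >= min_stem_len:
            # Undo consonant doubling, e.g. "sitt" -> "sit" (crude assumption).
            if stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
    return word


def index_by_stem(words):
    """Group words under their shared stem, as a stem-based index would."""
    index = defaultdict(set)
    for w in words:
        index[naive_stem(w)].add(w)
    return dict(index)
```

Under these assumptions, “sit”, “sitting” and “sits” all fall under the single stem “sit”, which is the behaviour the paragraph above describes.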

Words may also belong to pre-defined semantic categories, such as “business”, “country” or “protein”. One common way to define these semantic categories is by using gazetteers. A gazetteer is a named list of words and phrases that belong to the same category. Rather than simple lists, some ontologies are based on hierarchies or directed acyclic graphs, such as MeSH¹ and GO² respectively. In this framework, we are not concerned with the nature of such categories, but assume only that there exists some method for assigning such attributes to individual words.
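
One simple method for assigning such attributes is a gazetteer lookup. The sketch below is only an illustration: the gazetteer contents and the names `GAZETTEERS` and `semantic_categories` are hypothetical, and the framework itself does not prescribe this or any other method.

```python
# Hypothetical gazetteers: each named list holds the words of one category.
GAZETTEERS = {
    "country": {"france", "japan", "kenya"},
    "protein": {"insulin", "hemoglobin"},
}


def semantic_categories(word):
    """Return the set of gazetteer names whose list contains the word."""
    key = word.lower()
    return {name for name, entries in GAZETTEERS.items() if key in entries}
```

A word that appears in no gazetteer simply receives an empty set of semantic attributes, and a word listed in several gazetteers receives several.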

The role of each word in a sentence is defined by its part of speech, or lexical category. Common examples are noun, verb and adjective, although these are often subdivided into more precise categories such as “singular common noun”, “plural common noun”, “past tense verb” and so on. The part of speech can usually only be ascertained for a word in a given context. For example, compare “cut” in “He cut the bread” (a verb) with “The cut was deep” (a noun). In practice, an implementation may limit this to exactly one label per word, based on the context of that word. Following the Penn Treebank tags [17], in some examples we use the symbol “DT” to represent determiners such as “the”, “a” and “this”; “VB” to represent verbs in their base form, such as “sit” and “walk”; “VBD” to represent past-tense verbs, such as “sat” and “walked”; “NN” to represent common singular nouns, such as “cat” and “shed”; and so on.
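
To make the “exactly one label per word, based on context” idea concrete, here is a toy Python sketch that tags the two example sentences. It is not a real part-of-speech tagger: the lexicon, the single context rule (an ambiguous word directly after a determiner is tagged “NN”, otherwise “VBD”) and the placeholder label “UNK” are assumptions for illustration. “PRP” and “JJ” are the Penn Treebank tags for personal pronouns and adjectives.

```python
DETERMINERS = {"the", "a", "this"}

# Toy lexicon of unambiguous words; the entries are assumptions for this example.
LEXICON = {
    "he": "PRP", "bread": "NN", "was": "VBD", "deep": "JJ",
    "cat": "NN", "sat": "VBD", "walked": "VBD",
}

# Words whose label depends on context in this sketch.
AMBIGUOUS = {"cut"}


def tag(tokens):
    """Assign exactly one tag per token, using the previous tag as context."""
    tags = []
    for token in tokens:
        word = token.lower()
        if word in DETERMINERS:
            tags.append("DT")
        elif word in AMBIGUOUS:
            previous = tags[-1] if tags else None
            # Toy rule: a noun after a determiner, a past-tense verb otherwise.
            tags.append("NN" if previous == "DT" else "VBD")
        else:
            # "UNK" is a hypothetical placeholder, not a Penn Treebank tag.
            tags.append(LEXICON.get(word, "UNK"))
    return tags
```

With this rule, `tag("He cut the bread".split())` labels “cut” as “VBD”, while `tag("The cut was deep".split())` labels it as “NN”.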

We also introduce wildcards as an extension to the idea of word attributes. In regular expressions, a wildcard can “stand in” for a range of characters, and we use the same notion here to represent ranges of words. For example, we use the symbol “*” as the universal wildcard, which can be replaced by any word in the lexicon; every word therefore has the attribute “*”. We also use the symbol “?” to represent any word or no word at all. We discuss these wildcards further in Sect. 4.3.
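
The two wildcards can be sketched as a small word-level matcher. This Python example relies only on the semantics stated in the paragraph above (“*” stands for exactly one word, “?” for one word or none); the function name `matches` and the representation of patterns as lists of symbols are assumptions, and the fuller treatment in Sect. 4.3 may differ.

```python
def matches(pattern, words):
    """Return True if the whole word sequence fits the pattern.

    Pattern elements are literal words, "*" (any single word)
    or "?" (any single word, or no word at all).
    """
    if not pattern:
        return not words
    head, rest = pattern[0], pattern[1:]
    if head == "?":
        # Either "?" consumes nothing, or it consumes exactly one word.
        return matches(rest, words) or (bool(words) and matches(rest, words[1:]))
    if not words:
        return False
    if head == "*" or head == words[0]:
        return matches(rest, words[1:])
    return False
```

For example, the pattern `["the", "*", "sat"]` fits “the cat sat” but not “the sat”, whereas `["the", "?", "sat"]` fits both.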

¹ MeSH, Medical Subject Headings, http://www.nlm.nih.gov/mesh

² Gene Ontology, http://www.geneontology.org
