Every word has a set of attributes, such as its part-of-speech or its member-
ship of a semantic class, which we now discuss. Although particular attributes
are not a formal part of the framework, they are used in various illustrative
examples throughout this paper.
Words that share a common stem, or root, such as “sit”, “sitting” and “sits”,
typically share a common meaning. It is therefore common practice
in information retrieval to index words according to their stem to improve
retrieval performance [21]. Similarly in information extraction, it is often helpful to
identify words that share a common stem. The most common approach is to
remove suffixes to produce a single stem for each word [21], although in principle,
each word could have multiple stems, for example if prefixes were removed
independently of suffixes.
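
As a minimal sketch of suffix removal, the following Python fragment maps “sit”, “sitting” and “sits” to a single stem and indexes words by that stem. The suffix list, the minimum stem length and the function names are assumptions made for illustration; this is not the stemming algorithm of [21], which uses a far larger rule set.

```python
from collections import defaultdict

# Hypothetical, deliberately tiny rule set (an assumption, not the rules of [21]).
SUFFIXES = ("ing", "ed", "s")
MIN_STEM_LENGTH = 3
VOWELS = "aeiou"


def stem(word):
    """Return a single stem for the word by stripping at most one suffix."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            candidate = word[: -len(suffix)]
            # Undo consonant doubling left behind by "ing"/"ed", e.g. "sitt" -> "sit".
            if (
                suffix != "s"
                and candidate[-1] == candidate[-2]
                and candidate[-1] not in VOWELS
            ):
                candidate = candidate[:-1]
            return candidate
    return word


def index_by_stem(words):
    """Group words under their stem, as an index keyed by stem would."""
    index = defaultdict(set)
    for word in words:
        index[stem(word)].add(word)
    return dict(index)
```

Calling index_by_stem(["sit", "sitting", "sits"]) groups all three words under the key "sit". Allowing multiple stems per word, as mentioned above, would amount to returning a set of candidates from stem rather than a single string.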
Words may also belong to pre-defined semantic categories, such as “busi-
ness”, “country” or “protein”. One common way to define these semantic
categories is by using gazetteers. A gazetteer is a named list of words and
phrases that belong to the same category. Rather than simple lists, some ontologies
are based on hierarchies or directed acyclic graphs, such as MeSH¹
and GO² respectively. In this framework, we are not concerned with the nature
of such categories, but assume only that there exists some method for
assigning such attributes to individual words.
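
One possible method for assigning such attributes is a direct gazetteer lookup, sketched below in Python. The gazetteer contents and the name semantic_categories are hypothetical and chosen only to illustrate the idea; the framework itself does not prescribe any particular method.

```python
# Each gazetteer is a named list of words and phrases in the same category.
# The entries below are illustrative assumptions, not data from the paper.
GAZETTEERS = {
    "country": {"france", "japan", "new zealand"},
    "protein": {"insulin", "hemoglobin"},
}


def semantic_categories(term, gazetteers=GAZETTEERS):
    """Return the names of all gazetteers that list the given word or phrase."""
    key = term.lower()
    return {name for name, entries in gazetteers.items() if key in entries}
```

Here semantic_categories("Japan") yields the set containing "country", while a word that appears in no gazetteer receives no semantic attribute at all.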
The role of each word in a sentence is defined by its part of speech, or lexical
category. Common examples are noun, verb and adjective, although these
are often subdivided into more precise categories such as “singular common
noun”, “plural common noun”, “past tense verb” and so on. The part of
speech can usually only be ascertained for a word in a given context. For
example, compare “He cut the bread” to “The cut was deep”. In practice,
an implementation may limit this to exactly one label per word, based on
the context of that word. Following the Penn Treebank tags [17], in some
examples we use the symbol “DT” to represent determiners such as “the”,
“a” and “this”; “VB” to represent verbs in their base form, such as “sit”
and “walk”; “VBD” to represent past-tense verbs, such as “sat” and “walked”;
“NN” to represent common singular nouns, such as “cat” and “shed”; and
so on.
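
The dependence of the part of speech on context can be made concrete with the two example sentences above. In the Python sketch below, the Penn Treebank tags are assigned by hand rather than produced by a tagger, and the helper name tags_for is an assumption introduced for illustration.

```python
# Hand-tagged versions of the two example sentences (Penn Treebank tags).
# Each word occurrence carries exactly one label, chosen from its context.
TAGGED_SENTENCES = [
    [("He", "PRP"), ("cut", "VBD"), ("the", "DT"), ("bread", "NN")],
    [("The", "DT"), ("cut", "NN"), ("was", "VBD"), ("deep", "JJ")],
]


def tags_for(word, tagged_sentences=TAGGED_SENTENCES):
    """Collect every tag that the word receives across the given sentences."""
    target = word.lower()
    return {
        tag
        for sentence in tagged_sentences
        for token, tag in sentence
        if token.lower() == target
    }
```

Within each sentence “cut” has a single label, but tags_for("cut") returns both "VBD" and "NN", which is why the label cannot be stored against the word alone.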
We also introduce wildcards as an extension to the idea of word attributes.
In regular expressions, a wildcard can “stand in” for a range of characters,
and we use the same notion here to represent ranges of words. For example,
we use the symbol “*” as the universal wildcard which can be replaced by
any word in the lexicon. Then every word has the attribute “*”. We also use
the symbol “?” to represent any word or no word at all. We discuss these
wildcards further in Sect. 4.3.
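
The two wildcards can be sketched as a small matcher over sequences of attribute sets. This Python fragment is only an illustration under stated assumptions: it treats a pattern as a list of attribute symbols that must cover the whole word sequence, and the names attributes and matches are hypothetical. The treatment in Sect. 4.3 may differ.

```python
def attributes(word, tag):
    """Build the attribute set of a word: its form, its tag, and "*".

    Every word has the attribute "*", so the universal wildcard needs
    no special handling in the matcher.
    """
    return {word.lower(), tag, "*"}


def matches(pattern, words):
    """Return True if the pattern covers the whole sequence of attribute sets."""
    if not pattern:
        return not words
    head, rest = pattern[0], pattern[1:]
    if head == "?":
        # "?" stands for no word at all, or for exactly one arbitrary word.
        return matches(rest, words) or (bool(words) and matches(rest, words[1:]))
    # Any other symbol, including "*", must be an attribute of the next word.
    return bool(words) and head in words[0] and matches(rest, words[1:])


# Example sequence: "The cat sat", tagged by hand.
SENTENCE = [attributes("The", "DT"), attributes("cat", "NN"), attributes("sat", "VBD")]
```

With this sequence, the patterns ["DT", "*", "VBD"] and ["DT", "?", "NN", "VBD"] both match, whereas ["DT", "*"] does not because it leaves “sat” uncovered. One limitation of this sketch is that a literal token spelled “*” or “?” cannot be distinguished from the wildcard.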
1 MeSH, Medical Subject Headings, http://www.nlm.nih.gov/mesh
2 Gene Ontology, http://www.geneontology.org