and “dirty”) that an application might be presented with, and constructing test
suites to explore this feature space. In field linguistics, an unknown language is
approached by constructing questions to be answered about the language that
allow us to determine the elements of the language on all levels—phonemic and
phonetic (sounds), morphological (word formation), lexicon (words), syntactic
(phrasal structure)—and the ways in which they can combine. These questions
are formulated in sets called schedules that are assembled to elucidate specific
aspects of the language, in a procedure known as scheduled elicitation. The
software tester's test suites have a clear analogue in the “schedules” of the field
linguist. Like test suites, schedules include “dirty” data, as well—for example,
in studying the syntax of a language, the linguist will test the acceptability of
sentences that his or her theory of the language predicts to be ungrammati-
cal. Thus, even though there has not been extensive research into the field of
software testing of natural language processing applications, we already have a
well-developed methodology available to us for doing so, provided by the tech-
niques of descriptive linguistics.
An example of how the techniques of software testing and descriptive lin-
guistics can be merged in this way is provided in [6]. This paper looked at the
problem of testing named entity recognition systems. Named entity recognition
is the task of finding mentions of items of a specific semantic type in text. Com-
monly addressed semantic types have been human names, company names, and
locations (hence the term “named entity” recognition). [6] looked at the appli-
cation of named entity recognition to gene names. They constructed a test suite
based on analyzing the linguistic characteristics of gene names and the contexts
in which they can appear in a sentence. Linguistic characteristics of gene names
included orthographic and typographic features on the level of individual char-
acters, such as letter case, the presence or absence of punctuation marks (gene
names may contain hyphens, parentheses, and apostrophes), and the presence or
absence of numerals. (Gene names and symbols often contain numbers or letters
that indicate individual members of a family of genes. For example, the HSP
family of genes contains the genes HSP1, HSP2, HSP3, and HSP4.) Morphosyntactic
features addressed characteristics of the morpheme or word, such as the
presence or absence of participles, the presence or absence of genitives, and the
presence or absence of function words. The contextual features included whether
or not a gene name was an element of a list, its position in the sentence, and
whether or not it was part of an appositive construction. (Gene names can have
a dizzying variety of forms, as they may reflect the physical or behavioral char-
acteristics of an organism in which they are mutated, the normal function of the
gene when it is not mutated, or conditions with which they are associated. Thus,
we see gene names like pizza (reflecting the appearance of a fly's brain when the
gene is mutated), heat shock protein 60 (reflecting the function of the gene), and
muscular dystrophy (reflecting a disease with which the gene is associated). This
high range of variability adds greatly to the difficulty of gene name recognition.)
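The idea of building a test suite from linguistic features can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the test suite used in [6]: the feature values, the sentence templates, the base name "hsp", and the deliberately naive toy_recognizer are all hypothetical. Each test case crosses one orthographic choice (letter case), one typographic choice (numeral, with or without a hyphen), and one contextual choice (sentence position, list membership, apposition), and records the character span that a recognizer is expected to return.

```python
import re
from itertools import product

# Hypothetical feature dimensions, loosely modeled on those named in the text.
CASE = {"lower": str.lower, "upper": str.upper, "capitalized": str.capitalize}
SUFFIX = {"none": "", "numeral": "1", "hyphen_numeral": "-1"}
CONTEXT = {
    "sentence_initial": "{name} is expressed in the larva.",
    "sentence_final": "The mutation affects {name}.",
    "list_member": "We examined abc, {name}, and xyz.",
    "appositive": "The gene, {name}, is expressed in the larva.",
}

def generate_test_suite(base="hsp"):
    """Cross all feature values and build one labeled sentence per combination."""
    suite = []
    for case, suffix, context in product(CASE, SUFFIX, CONTEXT):
        name = CASE[case](base) + SUFFIX[suffix]
        template = CONTEXT[context]
        start = template.index("{name}")  # offset where the name is inserted
        suite.append({
            "features": (case, suffix, context),
            "name": name,
            "sentence": template.format(name=name),
            "expected": (start, start + len(name)),
        })
    return suite

def toy_recognizer(sentence):
    """Hypothetical system under test: upper-case runs with optional digits."""
    return [m.span() for m in re.finditer(r"\b[A-Z]{2,}\d*\b", sentence)]

def find_failures(recognizer, suite):
    """Return the test cases where the recognizer misses the expected span."""
    return [c for c in suite if recognizer(c["sentence"]) != [c["expected"]]]

if __name__ == "__main__":
    suite = generate_test_suite()
    failed = find_failures(toy_recognizer, suite)
    print(f"{len(failed)} of {len(suite)} test cases failed")
    for case in failed[:5]:
        print(case["features"], repr(case["sentence"]))
```

Because every test case carries its feature labels, failures can be grouped by feature to see which characteristic (for example, lower-case names or hyphenated numerals) a given system handles poorly.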
Five different gene name recognition systems were then examined. These features
of gene names and features of contexts were sufficient to find errors in
 