Assessment of Software Testing and Quality Assurance in Natural Language Processing Applications and a Linguistically Inspired Approach to Improving It - Trustworthy Eternal Systems via Evolving Software, Data and Knowledge


and “dirty”) that an application might be presented with, and constructing test

suites to explore this feature space. In field linguistics, an unknown language is

approached by constructing questions to be answered about the language that

allow us to determine the elements of the language on all levels—phonemic and

phonetic (sounds), morphological (word formation), lexical (words), syntactic

(phrasal structure)—and the ways in which they can combine. These questions

are formulated in sets called schedules that are assembled to elucidate specific

aspects of the language, in a procedure known as scheduled elicitation. The

software tester's test suites have a clear analogue in the “schedules” of the field

linguist. Like test suites, schedules include “dirty” data, as well—for example,

in studying the syntax of a language, the linguist will test the acceptability of

sentences that his or her theory of the language predicts to be ungrammati-

cal. Thus, even though there has not been extensive research into the field of

software testing of natural language processing applications, we already have a

well-developed methodology available to us for doing so, provided by the tech-

niques of descriptive linguistics.

An example of how the techniques of software testing and descriptive lin-

guistics can be merged in this way is provided in [6]. This paper looked at the

problem of testing named entity recognition systems. Named entity recognition

is the task of finding mentions of items of a specific semantic type in text. Com-

monly addressed semantic types have been human names, company names, and

locations (hence the term “named entity” recognition). [6] looked at the appli-

cation of named entity recognition to gene names. They constructed a test suite

based on analyzing the linguistic characteristics of gene names and the contexts

in which they can appear in a sentence. Linguistic characteristics of gene names

included orthographic and typographic features on the level of individual char-

acters, such as letter case, the presence or absence of punctuation marks (gene

names may contain hyphens, parentheses, and apostrophes), and the presence or

absence of numerals. (Gene names and symbols often contain numbers or letters

that indicate individual members of a family of genes. For example, the HSP

family of genes contains the genes HSP1, HSP2, HSP3, and HSP4.) Morphosyntactic

features addressed characteristics of the morpheme or word, such as the

presence or absence of participles, the presence or absence of genitives, and the

presence or absence of function words. The contextual features included whether

or not a gene name was an element of a list, its position in the sentence, and

whether or not it was part of an appositive construction. (Gene names can have

a dizzying variety of forms, as they may reflect the physical or behavioral char-

acteristics of an organism in which they are mutated, the normal function of the

gene when it is not mutated, or conditions with which they are associated. Thus,

we see gene names like pizza (reflecting the appearance of a fly's brain when the

gene is mutated), heat shock protein 60 (reflecting the function of the gene), and

muscular dystrophy (reflecting a disease with which the gene is associated). This

wide range of variability adds greatly to the difficulty of gene name recognition.)
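The idea of crossing name features with context features can be illustrated with a minimal sketch. It is not the test suite from [6]: the feature labels, sentence templates, and the deliberately naive regex-based recognizer below are assumptions made up for illustration, and only the gene names themselves (HSP1, HSP2, HSP3, heat shock protein 60) come from the text above.

```python
import re
from itertools import product

# Hypothetical name variants, one per orthographic/typographic feature.
NAME_VARIANTS = {
    "plain": "HSP1",
    "hyphen": "HSP-1",
    "lowercase": "hsp1",
    "multiword": "heat shock protein 60",
}

# Hypothetical sentence templates, one per contextual feature.
CONTEXT_TEMPLATES = {
    "sentence_initial": "{name} is expressed in muscle tissue.",
    "sentence_final": "The mutation affects {name}.",
    "list_element": "We examined HSP2, {name}, and HSP3.",
    "appositive": "The gene, {name}, is expressed in muscle tissue.",
}


def build_test_suite(names, contexts):
    """Cross every name variant with every context template."""
    suite = []
    for (name_feat, name), (ctx_feat, template) in product(
        names.items(), contexts.items()
    ):
        # The text before the placeholder is unchanged by formatting,
        # so the placeholder offset is the start of the expected span.
        start = template.index("{name}")
        suite.append(
            {
                "name_feature": name_feat,
                "context_feature": ctx_feat,
                "name": name,
                "sentence": template.format(name=name),
                "span": (start, start + len(name)),
            }
        )
    return suite


def naive_recognizer(sentence):
    """Toy stand-in for a real system: capital letters followed by digits."""
    return [m.span() for m in re.finditer(r"\b[A-Z]{2,}\d+\b", sentence)]


def run_suite(recognizer, suite):
    """Return the test cases whose expected span the recognizer missed.

    Only recall on the target name is checked; extra predictions are ignored.
    """
    return [case for case in suite if case["span"] not in recognizer(case["sentence"])]


if __name__ == "__main__":
    suite = build_test_suite(NAME_VARIANTS, CONTEXT_TEMPLATES)
    failures = run_suite(naive_recognizer, suite)
    for case in failures:
        print(case["name_feature"], case["context_feature"], repr(case["sentence"]))
```

Because every failure is labeled with the features that produced it, the output points to the class of input a system mishandles (for example, all hyphenated names) rather than to isolated sentences.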

Five different gene name recognition systems were then examined. These features

of gene names and features of contexts were sufficient to find errors in
