It should be noted that even with the structured test suite, our code coverage
was less than 60% overall, as predicted by Wiegers' work, which shows that
when software is developed without monitoring code coverage, typically only
50-60% of the code is executed by test suites [15] (p. 526). However, as soon as
we tried to increase our code coverage, we almost immediately uncovered two
“showstopper” bugs.
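For readers unfamiliar with coverage measurement, the following sketch shows one way such a figure can be obtained for a Python code base with the coverage.py package; the package name bionlp and the run_test_suite entry point are purely illustrative and are not part of our system.

    import coverage  # third-party package: coverage.py

    # Hypothetical illustration: measure statement coverage of a test suite.
    # "bionlp" is an assumed package name; run_test_suite() is an assumed
    # entry point that executes the structured tests.
    cov = coverage.Coverage(source=["bionlp"])
    cov.start()
    run_test_suite()
    cov.stop()
    cov.save()
    # Prints, per module, the percentage of statements executed by the tests,
    # revealing code that no test currently exercises.
    cov.report(show_missing=True)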
6 Discussion
Although our assay of the software testing status of biomedical natural language
processing applications was crude, the findings are consistent with the claim that
7/20 biomedical natural language processing web sites have not been subjected
to even the lowest, most superficial level of software testing. For the rest, we
cannot conclude that they have been adequately tested—only that they appear
to have benefited from at least the lowest, most superficial level of testing.
This absence of software testing and quality assurance comes despite the fact
that, like the mainstream NLP community, the biomedical natural language
processing community has paid considerable attention to software evaluation.
Some clarification of terminology is useful here. [10] distinguish between
gold-standard-based evaluation and feature-based evaluation. This distinction is
directly analogous to the one we draw between evaluating software with respect
to some metric (gold-standard-based evaluation) and testing it, that is,
attempting to find bugs (feature-based evaluation).
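To make the contrast concrete, the following sketch places a gold-standard-style evaluation next to a feature-based test that asserts an intended behavior on a constructed input. All names are hypothetical and do not come from any of the systems surveyed; a gene-name tagger is assumed to be exposed as a function that maps text to a set of entity strings.

    def evaluate_against_gold_standard(tagger, corpus):
        """Gold-standard-based evaluation: score output against annotations."""
        tp = fp = fn = 0
        for text, gold_entities in corpus:   # corpus of (text, annotated entity set) pairs
            predicted = tagger(text)
            tp += len(predicted & gold_entities)
            fp += len(predicted - gold_entities)
            fn += len(gold_entities - predicted)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        return precision, recall

    def test_handles_hyphenated_gene_name(tagger):
        """Feature-based testing: one constructed input, one intended behavior."""
        assert "BRCA-1" in tagger("Mutations in BRCA-1 were observed.")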
The biomedical natural language processing community has participated enthu-
siastically in software evaluation via shared tasks—agreed-upon task definitions
used to evaluate systems against a shared data set using centralized, third-party
evaluation with a corpus (or a document collection) as input and with an agreed-
upon implementation of a scoring metric. However, the community's investment
in testing its products has apparently been much smaller. It has been suggested
[20] that biomedical natural language processing applications are ready for use
by working bioscientists. If this is the case, we argue that there is a moral obli-
gation on the part of biomedical natural language processing practitioners to
exercise due diligence and ensure that their applications do not just perform
well against arbitrary metrics, but also behave as intended.
We showed in our experiments with building linguistically motivated test
suites that such test suites, informed by the techniques of descriptive linguistics,
are effective at granular characterization of performance across a wide variety of
named entity recognition systems. We also demonstrated the surprising finding
that such test suites could be used to predict global performance scores such
as precision, recall, and F-measure (although only recall was predicted in our
experiment) for specific equivalence classes (or, as linguists call them, natural
classes) of inputs.
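As an illustration of the idea (with hypothetical test items and class labels, not the actual suite described above), a linguistically motivated test suite can be organized by equivalence class and scored separately for each class, assuming the same tagger interface as in the sketch above.

    from collections import defaultdict

    # Hypothetical test items grouped into equivalence (natural) classes.
    TEST_SUITE = [
        # (equivalence class, input text, expected entity)
        ("single-token", "p53 regulates apoptosis.", "p53"),
        ("hyphenated", "Expression of BRCA-1 was reduced.", "BRCA-1"),
        ("multi-word", "tumor necrosis factor alpha was elevated.",
         "tumor necrosis factor alpha"),
    ]

    def recall_by_class(tagger):
        """Compute recall separately for each equivalence class of inputs."""
        hits, totals = defaultdict(int), defaultdict(int)
        for cls, text, expected in TEST_SUITE:
            totals[cls] += 1
            if expected in tagger(text):
                hits[cls] += 1
        return {cls: hits[cls] / totals[cls] for cls in totals}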
Drawing directly on a software engineering technique, we used a test suite
to test the commonly held, if tacit, assumption that large corpora are the best
testing material for natural language processing applications. We demonstrated