It should be noted that even with the structured test suite, our code coverage
was less than 60% overall, as predicted by Wiegers' work, which shows that
when software is developed without monitoring code coverage, typically only
50-60% of the code is executed by test suites [15] (p. 526). However, as soon as
we tried to increase our code coverage, we almost immediately uncovered two
“showstopper” bugs.
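For readers unfamiliar with coverage measurement, the following sketch shows one way such a figure can be obtained for a Python code base with the coverage.py package; the package name bionlp and the run_test_suite entry point are purely illustrative and are not part of our system.

    import coverage  # third-party package: coverage.py

    # Hypothetical illustration: measure statement coverage of a test suite.
    # "bionlp" is an assumed package name; run_test_suite() is an assumed
    # entry point that executes the structured tests.
    cov = coverage.Coverage(source=["bionlp"])
    cov.start()
    run_test_suite()
    cov.stop()
    cov.save()
    # Prints, per module, the percentage of statements executed by the tests,
    # revealing code that no test currently exercises.
    cov.report(show_missing=True)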
6 Discussion
Although our assay of the software testing status of biomedical natural language
processing applications was crude, the findings are consistent with the claim that
7/20 biomedical natural language processing web sites have not been subjected
to even the lowest, most superficial level of software testing. For the rest, we
cannot conclude that they have been adequately tested—only that they appear
to have benefited from at least the lowest, most superficial level of testing.
This absence of software testing and quality assurance comes despite the fact
that, like the mainstream NLP community, the biomedical natural language
processing community has paid considerable attention to software evaluation.
Some clarification of terminology is useful here. [10] distinguish between
gold-standard-based evaluation and feature-based evaluation. This distinction is
directly analogous to the one we draw between evaluating software with respect
to some metric (gold-standard-based evaluation) and testing it, that is,
attempting to find bugs (feature-based evaluation).
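To make the contrast concrete, the following sketch places a gold-standard-style evaluation next to a feature-based test that asserts an intended behavior on a constructed input. All names are hypothetical and do not come from any of the systems surveyed; a gene-name tagger is assumed to be exposed as a function that maps text to a set of entity strings.

    def evaluate_against_gold_standard(tagger, corpus):
        """Gold-standard-based evaluation: score output against annotations."""
        tp = fp = fn = 0
        for text, gold_entities in corpus:   # corpus of (text, annotated entity set) pairs
            predicted = tagger(text)
            tp += len(predicted & gold_entities)
            fp += len(predicted - gold_entities)
            fn += len(gold_entities - predicted)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        return precision, recall

    def test_handles_hyphenated_gene_name(tagger):
        """Feature-based testing: one constructed input, one intended behavior."""
        assert "BRCA-1" in tagger("Mutations in BRCA-1 were observed.")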
The biomedical natural language processing community has participated enthu-
siastically in software evaluation via shared tasks—agreed-upon task definitions
used to evaluate systems against a shared data set using centralized, third-party
evaluation with a corpus (or a document collection) as input and with an agreed-
upon implementation of a scoring metric. However, the community's investment
in testing its products has apparently been much smaller. It has been suggested
[20] that biomedical natural language processing applications are ready for use
by working bioscientists. If this is the case, we argue that there is a moral obli-
gation on the part of biomedical natural language processing practitioners to
exercise due diligence and ensure that their applications do not just perform
well against arbitrary metrics, but also behave as intended.
We showed in our experiments with building linguistically motivated test
suites that such test suites, informed by the techniques of descriptive linguistics,
are effective at granular characterization of performance across a wide variety of
named entity recognition systems. We also demonstrated the surprising finding
that such test suites could be used to predict global performance scores such
as precision, recall, and F-measure (although only recall was predicted in our
experiment) for specific equivalence classes (or, as linguists call them, natural
classes) of inputs.
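As an illustration of the idea (with hypothetical test items and class labels, not the actual suite described above), a linguistically motivated test suite can be organized by equivalence class and scored separately for each class, assuming the same tagger interface as in the sketch above.

    from collections import defaultdict

    # Hypothetical test items grouped into equivalence (natural) classes.
    TEST_SUITE = [
        # (equivalence class, input text, expected entity)
        ("single-token", "p53 regulates apoptosis.", "p53"),
        ("hyphenated", "Expression of BRCA-1 was reduced.", "BRCA-1"),
        ("multi-word", "tumor necrosis factor alpha was elevated.",
         "tumor necrosis factor alpha"),
    ]

    def recall_by_class(tagger):
        """Compute recall separately for each equivalence class of inputs."""
        hits, totals = defaultdict(int), defaultdict(int)
        for cls, text, expected in TEST_SUITE:
            totals[cls] += 1
            if expected in tagger(text):
                hits[cls] += 1
        return {cls: hits[cls] / totals[cls] for cls in totals}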
Drawing directly on a software engineering technique, we used a test suite
to test the commonly held, if tacit, assumption that large corpora are the best
testing material for natural language processing applications. We demonstrated