Assessment of Software Testing and Quality
Assurance in Natural Language Processing
Applications and a Linguistically Inspired
Approach to Improving It
K. Bretonnel Cohen, Lawrence E. Hunter, and Martha Palmer
Computational Bioscience Program,
University of Colorado School of Medicine,
Aurora, Colorado, USA
Department of Linguistics,
University of Colorado at Boulder,
Boulder, Colorado, USA
Abstract. Significant progress has been made in addressing the scientific
challenges of biomedical text mining. However, the transition from a
demonstration of scientific progress to the production of tools on which
a broader community can rely requires that fundamental software
engineering requirements be addressed. In this paper we characterize the
state of biomedical text mining software with respect to software testing
and quality assurance. Biomedical natural language processing software
was chosen because it frequently makes explicit claims to offer
production-quality services, rather than just research prototypes.
We examined twenty web sites offering a variety of text mining services.
On each web site, we performed the most basic software test known
to us and classified the results. Seven out of twenty web sites returned
either bad results or the worst class of results in response to this simple
test. We conclude that biomedical natural language processing tools
require greater attention to software quality.
We suggest a linguistically motivated approach to granular evaluation
of natural language processing applications, and show how it can be used
to detect performance errors of several systems and to predict overall
performance on specific equivalence classes of inputs.
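As a minimal sketch of what granular, per-class evaluation could look like (not the authors' implementation), the example below groups test inputs into equivalence classes and reports a pass rate for each class. The tagger `toy_gene_tagger`, the class names, and the test inputs are all hypothetical and chosen only for illustration.

```python
import re
from collections import defaultdict


def toy_gene_tagger(text):
    """Hypothetical system under test: tags lowercase-letters-plus-digits tokens."""
    return re.findall(r"\b[a-z]+\d+\b", text)


# Structured test suite (illustrative): each case is labeled with the
# equivalence class of input it represents, plus the expected output.
TEST_SUITE = [
    ("lowercase+digit", "p53 is mutated", ["p53"]),
    ("lowercase+digit", "loss of brca1", ["brca1"]),
    ("uppercase+digit", "BRCA1 is mutated", ["BRCA1"]),
    ("uppercase+digit", "TP53 expression", ["TP53"]),
    ("hyphenated", "IL-2 signalling", ["IL-2"]),
    ("no mention", "the cell divides", []),
]


def evaluate_by_class(tagger, suite):
    """Return the fraction of passing test cases for each equivalence class."""
    passed = defaultdict(int)
    total = defaultdict(int)
    for cls, text, expected in suite:
        total[cls] += 1
        if tagger(text) == expected:
            passed[cls] += 1
    return {cls: passed[cls] / total[cls] for cls in total}


results = evaluate_by_class(toy_gene_tagger, TEST_SUITE)
for cls, rate in results.items():
    print(f"{cls}: {rate:.2f}")
```

A single aggregate score would hide the fact that this toy tagger fails every uppercase and hyphenated case; the per-class breakdown makes that failure pattern visible.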
We also assess the ability of linguistically motivated test suites to
provide good software testing, as compared to large corpora of naturally
occurring data. We measure code coverage and find that it is considerably
higher when even small structured test suites are used than when large
corpora are used.
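The coverage comparison can be illustrated with a small sketch, assuming a hypothetical function under test (`classify_token`) and a simple line tracer built on Python's standard `sys.settrace`. It is not the measurement setup used in the paper; it only shows why a small structured suite can reach lines that a large but repetitive corpus never exercises.

```python
import sys


def classify_token(token):
    """Hypothetical function under test, with one branch per input class."""
    if not token:
        return "empty"
    if token.isdigit():
        return "number"
    if "-" in token:
        return "hyphenated"
    if token.isupper():
        return "acronym"
    return "word"


def lines_executed(func, inputs):
    """Return the set of line numbers inside func that run for the given inputs."""
    code = func.__code__
    seen = set()

    def tracer(frame, event, arg):
        # Only trace frames that belong to the function under test.
        if frame.f_code is code:
            if event == "line":
                seen.add(frame.f_lineno)
            return tracer
        return None

    sys.settrace(tracer)
    try:
        for item in inputs:
            func(item)
    finally:
        sys.settrace(None)
    return seen


# Illustrative "large corpus": many tokens, but all from one input class.
corpus = ["protein", "binds", "receptor"] * 1000
# Illustrative structured suite: one token per input class.
structured_suite = ["", "42", "IL-2", "DNA", "protein"]

corpus_lines = lines_executed(classify_token, corpus)
suite_lines = lines_executed(classify_token, structured_suite)
print(len(corpus_lines), "lines covered by", len(corpus), "corpus tokens")
print(len(suite_lines), "lines covered by", len(structured_suite), "suite tokens")
```

In this toy setting the five-item suite covers every line the 3000-token corpus covers and more, because each suite item targets a different branch.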
1 Introduction
Biomedical natural language processing tools and data generated by their
application are beginning to gain widespread use in biomedical research. Significant
Corresponding author.
 