Table 3. Application- and package-level coverage statistics using the test suite, the full corpus with the full set of rules, and the full corpus with two reduced sets of rules. The highest value in each row is in the Functional tests column. The last three columns are intentionally identical [7].

Metric                   Functional tests   Corpus, all rules   Corpus, nominal rules   Corpus, verbal rules
Overall line coverage    56%                41%                 41%                     41%
Overall branch coverage  41%                28%                 28%                     28%
Parser line coverage     55%                41%                 41%                     41%
Parser branch coverage   57%                29%                 29%                     29%
Rules line coverage      63%                42%                 42%                     42%
Rules branch coverage    71%                24%                 24%                     24%
Parser class coverage    88% (22/25)        80% (20/25)         80% (20/25)             80% (20/25)
Rules class coverage     100% (20/20)       90% (18/20)         90% (18/20)             90% (18/20)
coverage, and sometimes much higher coverage, as in the case of branch coverage for the rules components, where the corpus achieved 24% code coverage and the test suite achieved 71%. The last three columns show the results of an experiment in which we varied the size of the rule set. The coverage for the entire rule set, for the partition containing only nominal rules, and for the partition containing only verbal rules is identical, so the number of rules processed was not a determinant of code coverage.
In a further experiment, we examined how code coverage is affected by vari-
ations in the size of the corpus. We monitored coverage as increasingly larger
portions of the corpus were processed. The results for line coverage are shown
in Figure 1. (The results for branch coverage are very similar and are not shown.)
The x axis shows the number of sentences processed. The thick solid line indi-
cates line coverage for the entire application. The thin solid line indicates line
coverage for the rules package. The broken line and the right y axis indicate the
number of pattern matches.
As the figure shows quite clearly, increasing the size of the corpus does not lead
to increasing code coverage. It is 39% when a single sentence has been processed,
40% when 51 sentences have been processed, and 41%—the highest value that
it will reach—when 1,000 sentences have been processed. The coverage after
processing 191,478 sentences—the entire corpus of almost 4,000,000 words—is no
higher than it was at 1,000 sentences, and is barely higher than after processing
a single sentence.
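The plateau described above can be sketched with a toy model: treat each sentence as exercising a set of rules, and cumulative coverage as the number of distinct rules seen so far. The function name and the token-based corpus below are hypothetical illustrations, not the instrumentation actually used in the experiment (which measured line and branch coverage of the application code).

```python
# Toy model of the coverage plateau: "coverage" here is the count of
# distinct rules exercised so far. Once the common rules have fired,
# additional sentences add little or nothing.
def coverage_curve(corpus, rules_fired):
    """Return the cumulative distinct-rule count after each sentence."""
    seen = set()
    curve = []
    for sentence in corpus:
        seen.update(rules_fired(sentence))  # rules this sentence exercises
        curve.append(len(seen))
    return curve

# Hypothetical example: tokens stand in for the rules a sentence triggers.
corpus = ["the cat sleeps", "the dog sleeps", "a cat runs", "the cat runs"]
curve = coverage_curve(corpus, lambda s: set(s.split()))
# curve == [3, 4, 6, 6] -- the last sentence adds nothing new
```

The curve rises quickly on the first few sentences and then flattens, which mirrors the observed behavior: 39% after one sentence, 41% after 1,000, and no gain from the remaining 190,000.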
Thus, we see that the “naturally occurring data assumption” does not hold—
from an engineering perspective, there is a clear advantage to using structured
test suites.
This should not be taken as a claim that running an application against a
large corpus is bad. In fact, we routinely do this, and have found bugs that were
not uncovered in other ways. However, testing with a structured test suite should
remain a primary element of natural language processing software testing.