Is Branch Coverage a Good Measure of Testing Effectiveness? - Empirical Software Engineering and Verification

4.2 Branch Coverage as Measure of Testing Effectiveness

To assess branch coverage as a measure of testing effectiveness, one must understand that running random testing longer is the same as adding new test cases to a test suite. The reason is that testing for a longer time means that more routine calls are executed on the class under test. Each routine call is in effect the last line of a test case that contains all previous calls contributing to the state of the data used in that call (see [14] for a detailed explanation of test case construction and simplification). To push the analogy further, testing a class in different runs is the same as providing different test suites for that class.

Our experiments test production code in which the number of existing faults is unknown. They do not seed faults in the code but merely test for discrepancies between the contracts and the code. As a result, it is not possible to use the ratio of detected faults to the total number of faults to measure the effectiveness of testing. Instead, we assess testing effectiveness through two parameters: the number of faults detected and the speed at which those faults are detected.

Two results show that different faults can be detected at the same level of branch coverage: (1) in a test run, new faults were detected in a period where branch coverage hardly changed; (2) in different test runs for the same class, different faults were detected while almost the same branches were exercised. In other words, different test suites satisfying the same branch coverage criterion may detect different faults.

These two observations indicate that test adequacy in terms of branch coverage level is highly predictable, not only in how many branches are covered, but also in what the covered branches are. Applying random testing to a class always yields the same level of branch coverage adequacy. Also, for all the tested classes, the branch coverage adequacy level stabilizes after some time (1 hour in our case).
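The comparison between test runs described above can be illustrated with a small sketch. It is only an assumption about how such a comparison might be scripted, not the tooling used in the experiments: the `jaccard` helper, the run dictionaries, and all branch and fault identifiers are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections viewed as sets; 1.0 when both are empty."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Hypothetical data: the branch and fault identifiers are invented for illustration.
run_a = {"branches": {"b1", "b2", "b3", "b4"}, "faults": {"f1", "f2"}}
run_b = {"branches": {"b1", "b2", "b3", "b4"}, "faults": {"f2", "f3", "f4"}}

# How similar are the two runs in the branches they exercise?
branch_similarity = jaccard(run_a["branches"], run_b["branches"])

# How similar are they in the faults they reveal?
fault_similarity = jaccard(run_a["faults"], run_b["faults"])

# Faults revealed by exactly one of the two runs.
faults_in_one_run_only = run_a["faults"] ^ run_b["faults"]

print(branch_similarity, fault_similarity, sorted(faults_in_one_run_only))
```

In this invented example the two runs exercise identical branches yet share only one fault, which is the pattern the two observations describe.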

Although we do not know how many faults remain in the tested classes, it was astonishing to discover that over 50% of the faults found appear only in the period when branch coverage stagnates.
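A minimal sketch of how the share of faults found during stagnation could be computed from a test log follows. The definition of stagnation used here (coverage within a fixed tolerance of its final value) is an assumption made for illustration, as are the function names and the sample data; the text does not specify how the stagnation period was delimited.

```python
def stagnation_start(coverage_samples, tolerance=0.01):
    """Earliest sample time from which coverage stays within `tolerance` of its final value.

    coverage_samples: non-empty list of (time_in_minutes, coverage_fraction) pairs,
    sorted by time, with cumulative (non-decreasing) coverage.
    """
    final_coverage = coverage_samples[-1][1]
    for time, coverage in coverage_samples:
        if final_coverage - coverage <= tolerance:
            return time
    # Unreachable for tolerance >= 0: the last sample always satisfies the test.
    return coverage_samples[-1][0]


def fraction_found_during_stagnation(fault_times, coverage_samples, tolerance=0.01):
    """Fraction of faults whose first detection time falls after stagnation begins."""
    if not fault_times:
        return 0.0
    start = stagnation_start(coverage_samples, tolerance)
    late = [t for t in fault_times if t > start]
    return len(late) / len(fault_times)


# Hypothetical data, invented for illustration only.
coverage_samples = [(0, 0.0), (5, 0.55), (10, 0.78), (30, 0.80),
                    (60, 0.855), (120, 0.86), (360, 0.86)]
fault_times = [2, 4, 9, 70, 110, 200, 300, 350]  # minutes at which each fault was first seen

start = stagnation_start(coverage_samples)
share = fraction_found_during_stagnation(fault_times, coverage_samples)
print(start, share)
```

The tolerance is a free parameter: a looser value moves the start of stagnation earlier and therefore raises the computed share.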

These results provide evidence of the lack of reliability [8] of the branch coverage criterion achieved by random testing. Reliability requires that a test criterion always produce consistent results. In the experiments reported here, this goal requires that two test runs achieving the same branch coverage of a class deliver similar numbers of faults. But the results show that the numbers of faults found in different test runs differ from each other by at least 50%.

What about the speed of fault detection? In the first few minutes of random testing, branch coverage increases quickly, and the number of faults increases accordingly, with a strong correlation. This means that branch coverage is a good measure of testing effectiveness in the first few minutes. After a while, however, the branch coverage level hardly increases; the fault detection speed also slows down, but less dramatically than the growth of branch coverage. In fact, many faults are detected in the period where branch coverage hardly changes. This means that in the later period, branch coverage is not a good measure of testing effectiveness.

In general, to detect as many faults as possible, branch coverage is necessary but not sufficient.
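The contrast in speed between coverage growth and fault detection can be made concrete by comparing growth rates before and after a cutoff time. The sketch below is one possible way to do this; the `rate` and `slowdown` helpers, the 10-minute cutoff, and all numbers are hypothetical and are not taken from the experiments.

```python
def rate(samples, start, end):
    """Average increase per minute of a cumulative measure between two sample times.

    samples: list of (time_in_minutes, cumulative_value); `start` and `end`
    must both be sample times present in the list.
    """
    values = dict(samples)
    return (values[end] - values[start]) / (end - start)


def slowdown(samples, start, cutoff, end):
    """How many times slower the measure grows after `cutoff` than before it."""
    early = rate(samples, start, cutoff)
    late = rate(samples, cutoff, end)
    return float("inf") if late == 0 else early / late


# Hypothetical cumulative measurements, invented for illustration only.
coverage = [(0, 0), (10, 80), (360, 86)]  # branch coverage in percentage points
faults = [(0, 0), (10, 6), (360, 14)]     # number of distinct faults found so far

coverage_slowdown = slowdown(coverage, 0, 10, 360)
fault_slowdown = slowdown(faults, 0, 10, 360)
print(coverage_slowdown, fault_slowdown)
```

With these invented numbers both measures slow down after the cutoff, but coverage growth slows by a much larger factor than fault detection, which is the situation in which coverage stops tracking testing effectiveness.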
