Software verification: Testing vs. model checking

Dirk Beyer, Thomas Lemberger
[doi] [ISBN] [Google Scholar] [DBLP] [Citeseer]
Read: 23 June 2020

Hardware and Software: Verification and Testing
Springer International Publishing
Cham
Pages 99-114
2017
Note(s): FQL, Fuzz testing, Test Competition, SV Competition, KLEE Verifier, CBMC Verifier
Papers: holzer:cav:2008, holzer:hvc:2010, cadar:cacm:2013

Compares test generators against model checkers. The test generators are AFL-Fuzz, CPATiger, CREST-PPC, FShell, KLEE verifier, and PRTest. The model checkers are CBMC verifier, CPA-seq, ESBMC-incr, and ESBMC-kInd.

The benchmarks are all from the SV competition: 1490 programs with exactly one bug and 4293 programs with no bugs. Since the suite was designed to evaluate model checkers, and the model checkers themselves were developed against it, the choice of benchmarks is somewhat biased in their favour. Of the programs with one bug, 375 contain symbolic stub functions that make them unusable with the test generators.
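A minimal sketch of why symbolic stubs block test generators: SV-COMP-style tasks obtain inputs from functions like `__VERIFIER_nondet_int()`, which a model checker treats as returning an arbitrary value but which have no body for a test harness to link against. The stub body and the `error_reachable` function below are ours, added only so the sketch compiles.

```c
#include <assert.h>

/* Concrete stand-in for an SV-COMP-style symbolic stub. A model
 * checker treats __VERIFIER_nondet_int() as returning an arbitrary
 * int; the bare benchmark provides no implementation, so a test
 * generator must invent concrete values. This body is ours, added
 * only so the sketch runs. */
int __VERIFIER_nondet_int(void) { return 3; }

/* A tiny task in the benchmark style: the error location is reachable
 * only when the nondeterministic input equals 3. A model checker finds
 * this by solving the branch condition; a fuzzer has to stumble on the
 * value 3 by chance or coverage feedback. */
int error_reachable(int x) {
    if (x == 3) {
        return 1; /* a real task would call an error function here */
    }
    return 0;
}
```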

To enable the comparison, all the tools were given a common interface to convert tests into a standard format and to generate witnesses.

The results are presented in a cactus plot (as in the SV competition) and then broken down in more detail in a table. The clear winners are the model checkers, although KLEE verifier is competitive. Well-regarded fuzz testers like AFL-Fuzz find fewer than 80% of the bugs found by the worst of the model checkers. (That is 80% after excluding the benchmarks that include symbolic stub functions.)

The results also report the total number of bugs found by all testers combined, by all model checkers combined, and by all tools combined.

  • All the testers together find around 7% more bugs than the best single tester (KLEE verifier).
  • All the model checkers together find about 12-15% more bugs than the best single model checker (ESBMC-incr).
  • All the tools together find 20-25% more bugs than the best single tool (ESBMC-incr).

The observation that the combination of all tools outperforms the best single tool is an argument for portfolio test generators and hybrid fuzzer/model-checkers.
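The portfolio idea can be sketched as follows. This is not the paper's method, just an illustration under assumed names: each tool is abstracted as a function returning a verdict, and the portfolio returns the first conclusive answer. A real portfolio would run the tools in parallel under a shared time budget.

```c
#include <stddef.h>

/* Possible verdicts of a verification tool. */
typedef enum { UNKNOWN, BUG_FOUND, SAFE } verdict;

/* A "tool" abstracted as a function from a task name to a verdict. */
typedef verdict (*tool_fn)(const char *task);

/* Sequential portfolio: ask each tool in turn and return the first
 * conclusive verdict. Real portfolios (and hybrid fuzzer/model-checker
 * combinations) run tools in parallel; this loop only shows the idea. */
verdict run_portfolio(tool_fn tools[], size_t n, const char *task) {
    for (size_t i = 0; i < n; i++) {
        verdict v = tools[i](task);
        if (v != UNKNOWN)
            return v;
    }
    return UNKNOWN;
}

/* Toy portfolio members (names hypothetical): the fuzzer gives up,
 * the model checker finds the bug. */
verdict toy_fuzzer(const char *task) { (void)task; return UNKNOWN; }
verdict toy_model_checker(const char *task) { (void)task; return BUG_FOUND; }
```

The combined results above suggest why this helps: tools with different strengths find different bugs, so falling through to the next tool recovers answers the first one misses.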


  • TestCov: Robust test-suite execution and coverage measurement [beyer:ase:2019]