Finding code that explodes under symbolic evaluation

James Bornholt, Emina Torlak
Read: 10 February 2021

Proc. ACM Program. Lang. 2(OOPSLA)
Association for Computing Machinery
New York, NY, USA
October 2018
Note(s): symbolic evaluation, symbolic execution, bounded model checking, Rosette solver, case splitting, state merging, verification profiling, verification performance
Papers: galea:arxiv:2018, torlak:pldi:2014, kuznetsov:pldi:2012

Like galea:arxiv:2018 (which is also worth a read), this paper looks at how to find verification bottlenecks in symbolic evaluation. The authors propose a symbolic profiling approach that works for any symbolic evaluator (symbolic execution, BMC, or hybrids of the two) and, using a simple but effective metric to rank hotspots, highlights the root cause of performance problems.

Symbolic profiling (SymPro)

  • Two resources: symbolic heap and symbolic evaluation graph
    • Heap = constants, terms, etc. (with hash-consing used to avoid duplicates)
    • Graph = paths and merge points (a DAG) induced by the evaluation strategy
    • Track their evolution: where values are created, accessed, and sent to the solver; how paths are merged at control-flow joins. (A toy sketch of this model follows the list.)
    • Rank procedure calls by metrics that capture this evolution
  • The model is general (applies to any forward symbolic evaluator: symbolic execution, BMC, and hybrids), explainable (independent of the evaluator’s internal details), and actionable (points at root causes)
  • Key insight: “effective symbolic evaluation involves maximizing (opportunities for) concrete evaluation while minimizing path explosion”.
  • Challenge: naïve time profiling highlights the wrong part of the code: the consumption of symbolic terms rather than their construction. (This reminds me of profiling lazy evaluation in Haskell.)
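
A toy sketch of the tracked resources in Rust (my own illustration, not the paper’s code; the Term shape and all names are assumptions). It shows a hash-consed symbolic heap that records, for profiling, where each term was created:

    use std::collections::HashMap;

    #[derive(Clone, PartialEq, Eq, Hash, Debug)]
    enum Term {
        Const(i64),
        Var(&'static str),
        Add(usize, usize), // children are term ids in the heap
    }

    #[derive(Default)]
    struct SymbolicHeap {
        terms: Vec<Term>,
        interned: HashMap<Term, usize>, // hash-consing: structurally equal terms share an id
        created_at: Vec<&'static str>,  // profiling: the call site that created each term
    }

    impl SymbolicHeap {
        fn mk_term(&mut self, t: Term, srcloc: &'static str) -> usize {
            if let Some(&id) = self.interned.get(&t) {
                return id; // duplicate avoided by hash-consing
            }
            let id = self.terms.len();
            self.terms.push(t.clone());
            self.interned.insert(t, id);
            self.created_at.push(srcloc);
            id
        }
    }

    fn main() {
        let mut heap = SymbolicHeap::default();
        let x = heap.mk_term(Term::Var("x"), "query.rs:1");
        let one = heap.mk_term(Term::Const(1), "query.rs:2");
        let a = heap.mk_term(Term::Add(x, one), "query.rs:3");
        let b = heap.mk_term(Term::Add(x, one), "query.rs:4"); // returns the same id as `a`
        assert_eq!(a, b);
        println!("{} terms in heap", heap.terms.len()); // 3, not 4
    }

The evaluation graph would sit alongside this heap: states as nodes, step edges forming paths, and merge nodes at control-flow joins.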

List of antipatterns in solver-aided code

  • Algorithmic
  • Representation
    • Irregular data structures that cause divergence in control flow. (IIRC, Rosette groups data structures by shape (they call it “type”) and merges flows that have the same shape.)
  • Concreteness
    • Failing to keep values concrete leads to large evaluation graphs with many infeasible paths.
    • Introducing case splits can concretize symbolic values. E.g. “if x == 0 { … 0 … } else if x == 1 { … 1 … } else { … x … }” makes x concrete in the first two branches (and maybe the third branch is an error case?). Rosette’s for/all form does this (section 5.1). See the sketch after this list.
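
To make the case-splitting idea concrete, here is a toy Rust sketch (my own; the Value type and guard strings are assumptions, and a real evaluator performs the substitution automatically). Two of the three paths evaluate fully concretely; only the residual path keeps a symbolic term alive:

    #[derive(Clone, Debug)]
    enum Value {
        Concrete(i64),
        Symbolic(&'static str), // stand-in for a solver term over a variable
    }

    // Stand-in for an expensive branch body: concrete inputs run as ordinary
    // evaluation; symbolic inputs would build up solver terms.
    fn body(v: Value) -> Value {
        v
    }

    // Evaluate under a case split on x, returning (path guard, result) pairs.
    fn with_case_split(x: Value) -> Vec<(String, Value)> {
        match x {
            Value::Concrete(_) => vec![("true".to_string(), body(x))],
            Value::Symbolic(name) => vec![
                // Under the guard "x == 0" the evaluator can substitute the
                // concrete constant 0 for x, so this path runs concretely.
                (format!("{name} == 0"), body(Value::Concrete(0))),
                (format!("{name} == 1"), body(Value::Concrete(1))),
                // Only the residual path keeps x symbolic.
                (format!("{name} != 0 && {name} != 1"), body(Value::Symbolic(name))),
            ],
        }
    }

    fn main() {
        for (guard, result) in with_case_split(Value::Symbolic("x")) {
            println!("under {guard}: {result:?}");
        }
    }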

Profiling

  • Considered input-sensitive profiling [Coppa et al., 2012], but the correlation between input size and performance is often poor for symbolic evaluation, making this approach inaccurate and noisy.
  • Considered path-based profiling kuznetsov:pldi:2012: rank functions by the number of infeasible paths explored and the size of path conditions. But infeasible paths are not always the problem; measuring feasibility is expensive (it uses the solver); and the approach does not work for BMC.
  • These lead them to heap + graph model (above).
    • Symbolic profiler interface (I have transliterated the interface into pseudo-Rust; a fleshed-out guess follows this list).
      • mkTerm(Vec) -> term
      • step(state, Vec) -> Vec
      • merge(srcloc, Vec) -> state
      • solve(srcloc, expr) -> bool
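
A guess at fleshing out the transliterated interface into a full Rust trait; the element types of the Vecs and all names beyond the four methods are my assumptions, not the paper’s:

    struct Term;   // a node in the symbolic heap
    struct State;  // an evaluator state, i.e. a node in the evaluation graph
    struct SrcLoc; // source location, for attributing costs to call sites
    struct Expr;   // a query sent to the solver

    trait SymbolicProfiler {
        // A new term is created from child terms (hash-consing may return an existing one).
        fn mk_term(&mut self, children: Vec<Term>) -> Term;
        // One evaluation step may fork a state into several successor states.
        fn step(&mut self, state: State, inputs: Vec<Term>) -> Vec<State>;
        // A control-flow join merges several states back into one.
        fn merge(&mut self, at: SrcLoc, states: Vec<State>) -> State;
        // A solver query is issued; the profiler records where it happened and for how long it ran.
        fn solve(&mut self, at: SrcLoc, query: Expr) -> bool;
    }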

Visualization

  • Flamegraph: time vs. call stack colour-coded by a score
  • Score computed from five statistics
    • wall-clock time; #terms created; #unused terms; union size (sum of out-degrees); merge cases (sum of in-degrees). (See figure 3.)
    • Calculation of overall score: normalize the five statistics to the range 0.0 .. 1.0, then sum the results (a sketch follows this list)
    • An experiment with a machine-learning ranking scheme had high recall but poor precision (many false positives), so the manual scheme is the default.
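
A minimal sketch of the default ranking in Rust (my own; the field names and the max-normalization are assumptions consistent with “normalize to 0.0 .. 1.0 and sum”):

    struct CallStats {
        time_ms: f64,
        terms_created: f64,
        unused_terms: f64,
        union_size: f64,  // sum of out-degrees at merges
        merge_cases: f64, // sum of in-degrees at merges
    }

    // Normalize each statistic by its maximum across all calls, then sum:
    // a score in 0.0 .. 5.0; higher means more likely to be the bottleneck.
    fn scores(calls: &[CallStats]) -> Vec<f64> {
        let cols: Vec<fn(&CallStats) -> f64> = vec![
            |c| c.time_ms,
            |c| c.terms_created,
            |c| c.unused_terms,
            |c| c.union_size,
            |c| c.merge_cases,
        ];
        calls
            .iter()
            .map(|c| {
                cols.iter()
                    .map(|get| {
                        let max = calls.iter().map(|d| get(d)).fold(0.0_f64, f64::max);
                        if max > 0.0 { get(c) / max } else { 0.0 }
                    })
                    .sum()
            })
            .collect()
    }

    fn main() {
        let calls = vec![
            CallStats { time_ms: 10.0, terms_created: 500.0, unused_terms: 400.0,
                        union_size: 20.0, merge_cases: 8.0 },
            CallStats { time_ms: 90.0, terms_created: 100.0, unused_terms: 10.0,
                        union_size: 5.0, merge_cases: 2.0 },
        ];
        println!("{:?}", scores(&calls));
    }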

Extensive empirical evaluation

  • Ferrite (crash-safe file systems)
  • Cosette (SQL query equivalence)
  • Neutrons (radiotherapy system)
  • Also Bonsai (checks soundness of type systems), Quivela (verifies security of crypto protocols), Fluidics (synthesizes microfluidics protocols), RTR (refinement typechecker for Ruby).
  • Evaluates which of the metrics are most reliable in ranking bottlenecks.
  • User study: time taken to identify a performance issue: 4 benchmarks, with/without SymPro. (A bit small for statistical significance.)
  • Implemented on both Rosette and Jalangi. Tried on a red-black tree, a calculator parser, and a BDD.
    • Highlights the idea of using SymPro to develop efficient formal models of libraries and frameworks, because optimizing for reasoning and optimizing for concrete execution can require conflicting changes.

Bounded model-checking, Case splitting, Symbolic evaluation, Symbolic execution, Verification performance of code, Verification profiling of code

  • Evaluating manual intervention to address the challenges of bug finding with KLEE [galea:arxiv:2018]