Research risks in the Rust verification project
In larger, long-running research projects, it can be useful to make a list of all the risks and research questions you have and use that as one way to guide what you tackle first. This is a list that we considered as we were planning the Rust verification project. Although some of the questions are specific to our particular context (what we want to achieve, who we want to help, etc.), I think that many of the questions apply to any project that aims to help developers to verify their own code.
Here is the list of risks that we considered in this project and some of what we learned. Some of these have been tackled in “risk sprints” and are reported in other blog posts on this site; or in the paper about usability issues Towards making formal methods normal: meeting developers where they are; or in our 2021 and 2020 surveys of Rust verification tools. The majority of the others have not been tackled yet.
Complexity of Rust language or programs that tools can handle
Will tools be able to handle the full range of features found in the Rust language, libstd, libcore and commonly used libraries?
An incomplete list of potentially tricky features
- Unsafe code
- C code
- Assembly code
- Calls to C libraries, libc, system calls
- Loops and recursion
- Lazy initialization
- Dynamic traits
- Unsupported features
- Unsupported LLVM intrinsics
- Non-portable and processor-specific Rust code (e.g., hand-vectorized code)
- Concurrent code
- Async code
We mostly focused on the LLVM symbolic execution tool KLEE which meant that many basic features were handled from the beginning. But we had to work on foreign function interface (FFI) code, weak initializers and vector intrinsics.
The main thing that we have not made progress on are
- Concurrent code: this needs to be built into the verification tool and KLEE is only just adding concurrency support.
- Assembly code: it is possible that a binary lifter such as Trail of Bits McSema/Remill might be the way to solve it.
- Although we think that KLEE will probably accept LLVM code generated from async Rust, we don’t know how well it will be able to reason about it.
Scaling of tools
Will tools be able to handle large complex applications?
- Can it handle the Oak1 secure enclave?
- Can it handle the Oak secure enclave plus all of the crates that it transitively depends on?
- Can we parallelize verification to make effective use of modern computers? (e.g., swarm testing / swarm verification)
We found some promising ideas but have not investigated this yet.
Verification tool issues (tool questions)
There were many questions about the capabilities of tools. At the start of the project, we only understood a little about different software verification tools and Rust verification tools were in their infancy.
We chose to do most of our work with KLEE because we knew that it was a mature tool and the team worked hard to support extensions by other groups. This choice has worked well for us so far.
Are the differences between tools too large to create a continuum of tools: bounded model checkers, symbolic execution, model checkers, abstract interpreters, fuzzers, …
We want to be able to swap and combine tools so we need a common interface for using a broad range of tools. Fortunately, the SV-COMP competition has lead to some standardization (for C verification tools) although more may be needed to support report generation, etc.
Soundness of tools
Do tools give
These are often (confusingly) expressed in terms of false positives and false negatives and occasionally using the terms “recall / precision / specificity”.
Do the classes of issues detected match the bugs that user cares about?
This is really important. If the user disagrees that a reported ‘bug’ is worth fixing, they will not get much benefit from your tool.
Robustness/stability of tools
- Do the tools tend to crash?
- Are they buggy?
- Do they silently fail?
Fonseca’s empirical analysis of the correctness of verified systems is a cautionary tale.
Tools that are fast enough for occasional use may be unacceptably slow for daily use.
Ease of installing tools
We found that some tools were painful to install. Mostly because they depended on many other packages that were not part of standard Linux distributions. In theory, this is an easy problem to resolve; in practice, not so much.
Ease of use
Ease of configuration
Some research tools have 100s of configuration options. Are there a few useful sets of configurations? Can configuration be automated?
Ease of understanding results and other output
- Is it easy to distinguish success, failure and timeout?
- Do error reports have line numbers, traces, sample input values, etc.?
- If a tool checks 1000 verification conditions, is each result reported separately?
Differences between tools (incurring re-learning/training/practice costs)
Can we have a common interface for configuring tools and reporting results?
Do tricks that help one tool get results help tools of other types?
Cost of investment in tools (cf. robustness, etc.)
- Do we need to write our own tools?
- Can we work with developers of existing tools?
- Just how big is the gap between tools and where they need to be for reliable use in industry?
Is deployment of tools within a large company too hard?
Large companies rightly worry about introducing too many new dependencies into production flows.
The most effective way to improve tool scaling is modular verification but
Do all tools support modular verification?
(May be harder for fuzzing and symbolic execution.)
Can we focus the attention of tools on areas of particular interest such as
- Recently written code
- Code being re-reviewed
- My code vs. your code vs. 3rd party code
Do we lose either soundness of bugs or soundness of proofs?
Creating function contracts and/or mock implementations
This is essential to support modular verification.
- Tedious to write by hand (making cost/benefit ratio worse)?
- Granularity required too small?
Interfaces unstable → expensive to maintain?
This is especially likely for internal interfaces within a single project.
How easy is it to replace crates with mock crates?
In principle, cargo makes this easy - but writing a mock crate can be a large effort if the crate has a large API.
- How to focus bug search in new code?
- How to rank bugs?
- How to triage (group) bugs?
- How to avoid reporting previously found bugs?
- Localization of bugs
- Finding causes of bugs
- Tracing the infection chain
- Fix? Adding annotations or new specifications proofs during defect tracing?
- the number in codebase accumulates over time
- dealing with “Won’t fix” (true bugs that developers decide not to fix)
- Unfixable bugs in 3rd party software
These will be a real problem in practice. How can we deal with them?
How to check/improve the accuracy of environmental models? (e.g., if we have a broken model of Linux system calls, the proof could succeed but the code is broken)
- How to check/improve the accuracy of the model of the language semantics?
- Are there bugs in verifier?
- Are there bugs in compiler?
Is it too hard to write effective specifications?
Creating and maintaining specifications can be quite expensive.
Common mode failure
If the specification and the code are written by the same person/team, there is the risk of common mode failure where the same misunderstanding or oversight appears in both.
- Is bug quality too low?
- Are we swamped by too many unimportant bugs?
- Is the effort to find bugs too high?
- Is the effort to localize, diagnose and fix bugs too high?
- Is formal the most effective way of increasing quality and assurance? In particular, can we find as many bugs as cheaply as fuzzing?
- Too slow vs fuzzing?
Will it be used in engineering teams?
Our ultimate measure of success is whether engineering teams adopt some/all of our techniques and tools because they find that they help them achieve their goals more easily.
What if the Oak project re-re-writes Oak s/w stack in C++ to meet the needs of their users?
- How to avoid a bad first impression?
- Overselling / Mis-selling leads to disappointment?
- How to avoid a steep learning curve?
Gathering data in pilot studies
To persuade management that it is worth investing in the tools, training and use of a new technique, we will probably have to perform pilot studies to measure the cost-benefit.
- Is gathering data too much burden on software engineers?
- Hard to collect data as tools are being developed?
- Too much noise in the data?
Creates false sense of security and complacency?
If the tool initially catches lots of bugs, is there a risk that other complementary techniques like testing are neglected?
Cost of tool stack (to buy / to maintain / to support)?
Preconceptions of engineers and engineering management
Some of the concerns / preconceptions that we have encountered in the past are
- They have been previously oversold on formal - leading to disappointment as the inevitable costs, benefits and limitations become clear.
- Concern that formal requires expertise that most teams don’t have
- Formal verification is “only for evangelical zealots / pointy heads / academics”
- All cost, no benefit?
- Only gives one bit of data: correct or incorrect.
- Doesn’t scale to real systems?
- Are proofs maintainable as the system continues to evolve?
- Is continuous reverification sustainable as the system continues to evolve and as the original verification starts to move to other projects?
- Tells you what you already knew (i.e., no bugs found)
- Like a bad linter: only detects things that are not problems
Whether these concerns are justified or not, we will need to make it clear that our approach avoids these issues. And then to make sure that it does!
That’s a long list…
Does this mean that I think that projects working on (usability of) verification tools are doomed to failure? Not at all. Having a long list like this can help guide what parts of the problem we work on, what we encourage others to work on, what collaborations we seek, etc. For example, we decided early on that we should not develop our own tools but, instead, we should try to work with as many different verification tools as possible to spread the various risks around.
I hope you find the list useful in guiding your research.
The opinions expressed are my own views and not my employer's.