Addresses the problem of creating fuzzing test harnesses for libraries by automatically generating candidate harnesses from example uses in existing client code. The goal is to increase user productivity; a human is expected to examine the generated harnesses for legality, etc. before they are used to report bugs.
The approach has three stages: slicing, synthesis and evaluation.
- Slicing analyzes functions and extracts code related to the library under test using the ClangMR frontend hook.
- Synthesis turns slices into runnable code. Results of calls into other libraries are replaced with placeholders, and several candidate harnesses are generated with different combinations of concrete and fuzzable values (a minimal sketch of one such candidate follows this list).
- Evaluation weeds out candidates that fail within a time bound, since these are considered low-value. The remainder are passed to users, who can rank them, observe the increase in coverage, modify them (e.g., to choose better variable names and make other readability improvements), etc.
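To make the synthesis and evaluation stages concrete, here is a minimal sketch of the kind of candidate harness that might come out of such a pipeline. The library function `lib_parse_buffer`, the constant `LIB_STRICT`, and the stub body are invented stand-ins (a real candidate would call the actual library under test, with arguments taken from the sliced client code); only the libFuzzer entry point `LLVMFuzzerTestOneInput` is the standard interface.

```cpp
// Hypothetical candidate harness (sketch only). The library API below is a
// stand-in so the example compiles on its own; a real candidate would call
// into the actual library under test.
#include <cstddef>
#include <cstdint>

// Stand-in for the library under test (invented for this sketch).
enum { LIB_STRICT = 1 };
static int lib_parse_buffer(const uint8_t* data, size_t size, int flags) {
  return (data != nullptr && size > 0 && flags != 0) ? 0 : -1;
}

// libFuzzer entry point. Each candidate differs in which arguments keep the
// concrete values seen at the original call site and which become fuzzable.
extern "C" int LLVMFuzzerTestOneInput(const uint8_t* data, size_t size) {
  const int flags = LIB_STRICT;         // concrete value kept from the client
  lib_parse_buffer(data, size, flags);  // fuzzer bytes replace the client's input
  return 0;
}
```

Built with `clang++ -fsanitize=fuzzer,address`, a candidate like this runs for a bounded time; candidates that fail within that bound are the ones the evaluation stage weeds out.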
This tooling is embedded in a flow that
- uses heuristics to try to identify APIs worth fuzzing (e.g., it looks for names like “parse”, “load”, “open”, …); a sketch of such a filter follows this list.
- ranks APIs by the number of uses within Google’s internal codebase.
- presents generated harnesses as “code findings” next to the code that they test.
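The API-identification heuristic is described only at the level of “look for suggestive names”, so the sketch below shows one plausible shape for such a filter. The helper name and the exact rule are assumptions, not FUDGE’s actual implementation; the keyword list is limited to the names mentioned above.

```cpp
// Plausible sketch of a name-based filter for fuzz-worthy APIs; the helper
// name and the exact rule are invented, not FUDGE's actual heuristic.
#include <algorithm>
#include <array>
#include <cctype>
#include <string>
#include <string_view>

bool LooksFuzzWorthy(std::string_view function_name) {
  // Keywords mentioned in the notes above; the real list is presumably longer.
  static constexpr std::array<std::string_view, 3> kKeywords = {"parse", "load", "open"};
  std::string lower(function_name);
  std::transform(lower.begin(), lower.end(), lower.begin(),
                 [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
  return std::any_of(kKeywords.begin(), kKeywords.end(), [&](std::string_view kw) {
    return lower.find(kw) != std::string::npos;
  });
}
```

APIs that pass a filter like this are then ranked by how many call sites they have across the codebase, so the most widely used ones surface first.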
The paper describes several case studies with open-source (non-Google) code and reports several lessons learned.
- Choosing a suitable fuzz target (still) requires a human.
- API call sites exhibit good locality. (This allows the use of an intraprocedural analysis most of the time; see the sketch after this list.)
- Analyzing C++ is challenging due to a long tail of language features.
- Randomized algorithms have a good cost-value tradeoff for program synthesis.
- Program synthesis artifacts can be presented as code findings to developers.
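A hypothetical illustration of the locality lesson: everything needed to reach the library call below (the options struct and the input buffer) is set up inside a single client function, so an intraprocedural slice of that one function already yields a complete usage example. All names here are invented stand-ins, not code from the paper.

```cpp
// Invented client code illustrating call-site locality. Because the setup
// for lib_parse_buffer happens entirely inside HandleUpload, a slicer does
// not need to look across function boundaries to recover a usage example.
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-ins for the library under test (invented for this sketch).
struct LibOptions {
  bool strict = true;
};
static int lib_parse_buffer(const uint8_t* data, size_t size, const LibOptions& opts) {
  return (data != nullptr && size > 0 && opts.strict) ? 0 : -1;
}

// Client call site with good locality: options and buffer are both local.
bool HandleUpload(const std::vector<uint8_t>& payload) {
  LibOptions opts;
  opts.strict = true;
  return lib_parse_buffer(payload.data(), payload.size(), opts) == 0;
}

int main() {
  std::vector<uint8_t> payload = {0x50, 0x4b, 0x03, 0x04};  // arbitrary bytes
  return HandleUpload(payload) ? 0 : 1;
}
```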
A lot of the previous tools in this space were for Java.
Papers related to FUDGE: Fuzz driver generation at scale
- Lessons from building static analysis tools at Google [sadowski:cacm:2018]