It's been a bit quiet on this mailing list, but I've been quite busy
working on the plugin, so I thought I'd give an update on things.
I've spent the last month or so focusing on the "libcpychecker" code
embedded within the plugin sources. This is a substantial body of Python
code that uses the plugin to implement a set of checks for C code: it
can detect mistakes in the handling of object reference counts in
extension code written for CPython.
I've been working on raising the quality of this code, and the best way
to do that seems to be to run the code on itself, or rather, on the
gcc-python-plugin C code.
This has been interesting, in that the code was working OK on the simple
examples I'd written for the test suite (see tests/cpychecker in the
source tree), but throwing a large body of real-world C code at it
exposed lots of bugs (for example, I'd entirely forgotten to handle the
"switch" statement!).
So I've been slowly fixing issues in the analysis code - and it has
found real bugs in the underlying plugin code, which is satisfying!
I've also started generalizing the analyzer with a pluggable Python
interface for writing new types of static analysis that need to track
the possible flows through a C/C++ function. It should eventually be
possible to write Python hooks that check for e.g. glibc memory/fd
leaks, kernel object refcounting and memory management, etc., though
I'm trying to focus my own efforts on the CPython reference-leak
detector. The system is intended to support multiple plugins running at
once, so that you can have e.g. both glibc checking and Python checking
in the same run.
So, before, the project looked like this:
+----------------------------------------------------------------------+
| cpychecker: analyzer for refcounts in CPython extension modules      |
| (in Python)                                                          |
+---------------------------+ +----------------------------------------+
| gcc-python-plugin (in C)  |-| libpython2.7.so or libpython3.2.so     |
+---------------------------+ +----------------------------------------+
| gcc's cc1 (the compiler)  |
+---------------------------+
and it is now beginning to look like this:
+--------------------------------------------------------------+ +-----+
| analyzer for refcounts in CPython extension modules          | | etc |
+--------------------------------------------------------------+ +-----+
| static analysis engine (in python)                                   |
+---------------------------+ +----------------------------------------+
| gcc-python-plugin (in C)  |-| libpython2.7.so or libpython3.2.so     |
+---------------------------+ +----------------------------------------+
| gcc's cc1 (the compiler)  |
+---------------------------+
though the separation between the top two layers isn't as clean as it
ought to be yet.
If anyone's interested in writing another static analysis plugin (e.g.
libc malloc leaks), or in helping with the CPython checker, that would
be great.
The idea is that the flows of control through a function are modeled by
a tree of State objects, each of which describes the current program
counter location, along with the memory regions (l-values) we know
about, and the r-values that each region has (to track pointers and
arrays).
There are Transition objects, which link the State objects within the
tree.
There is a core "engine" which interprets GCC's internal representation
and generates Trace objects, each representing one possible path of
State and Transition objects through this tree.
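To make the relationship between State, Transition and Trace a bit more
concrete, here is a minimal, self-contained sketch. It does not use the
plugin's real classes: the field names, the iter_traces() helper and the
hand-built example tree are all assumptions for illustration only (the
real engine builds the tree by interpreting GCC's internal
representation, not by hand).

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List


@dataclass
class Transition:
    """An edge in the tree: why we moved on, and the State we reached."""
    desc: str
    dest: "State"


@dataclass
class State:
    """A location plus what is known about memory at that point."""
    loc: str  # stand-in for the program counter location
    values: Dict[str, str] = field(default_factory=dict)  # region -> r-value
    transitions: List[Transition] = field(default_factory=list)


@dataclass
class Trace:
    """One root-to-leaf path through the tree."""
    states: List[State]
    transitions: List[Transition]


def iter_traces(state: State, states=(), transitions=()) -> Iterator[Trace]:
    """Yield a Trace for every possible path starting at `state`."""
    states = states + (state,)
    if not state.transitions:
        # Leaf State: the path is complete.
        yield Trace(list(states), list(transitions))
        return
    for t in state.transitions:
        yield from iter_traces(t.dest, states, transitions + (t,))


def build_example() -> State:
    """Hand-built tree for: p = malloc(n); if (!p) return -1; use(p);"""
    entry = State("p = malloc(n)")
    failed = State("return -1", {"p": "NULL"})
    ok = State("use(p)", {"p": "non-NULL"})
    entry.transitions.append(Transition("when malloc() fails", failed))
    entry.transitions.append(Transition("when malloc() succeeds", ok))
    return entry


traces = list(iter_traces(build_example()))
for trace in traces:
    print(" -> ".join(s.loc for s in trace.states))
```

Each branch in the function becomes two Transitions out of one State, so
the number of Traces grows with the number of independent branches.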
Each State object can have additional "facets" of state: there's a
cpython facet, which adds the extra information about CPython
reference-counting. Other analyses could add additional facets (e.g.
the state of open file-descriptors within libc - for detecting code
paths that leak file-descriptors).
All of this is in Python, so new static analysis "facets" are Python
classes that call into a Python API. Hopefully we can make it simple to
add new static analysis hooks.
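As a rough idea of what a facet might look like, here is a toy sketch
with two facets attached to the same State, so that two analyses run
side by side. None of this is the plugin's actual API: the Facet base
class, the on_call() and problems_at_exit() hooks, and the use of plain
strings as symbolic values are all hypothetical.

```python
class Facet:
    """Hypothetical base class: one analysis's extra per-State bookkeeping."""

    def on_call(self, fnname, arg=None, result=None):
        """Update the facet for a call; by default the call is ignored."""

    def problems_at_exit(self):
        """Return messages for anything still wrong when the path ends."""
        return []


class FdFacet(Facet):
    """Tracks which (symbolic) file descriptors are currently open."""

    def __init__(self):
        self.open_fds = set()

    def on_call(self, fnname, arg=None, result=None):
        if fnname == "open":
            self.open_fds.add(result)
        elif fnname == "close":
            self.open_fds.discard(arg)

    def problems_at_exit(self):
        return ["file descriptor %s is never closed" % fd
                for fd in sorted(self.open_fds)]


class HeapFacet(Facet):
    """Tracks which (symbolic) malloc'd blocks have not been freed."""

    def __init__(self):
        self.live = set()

    def on_call(self, fnname, arg=None, result=None):
        if fnname == "malloc":
            self.live.add(result)
        elif fnname == "free":
            self.live.discard(arg)

    def problems_at_exit(self):
        return ["block %s is never freed" % blk for blk in sorted(self.live)]


class State:
    """Toy State that forwards every event to each of its facets."""

    def __init__(self, facets):
        self.facets = facets

    def call(self, fnname, arg=None, result=None):
        for facet in self.facets:
            facet.on_call(fnname, arg, result)

    def problems_at_exit(self):
        return [msg for facet in self.facets
                for msg in facet.problems_at_exit()]


# Simulate one path: fd = open(...); buf = malloc(...); free(buf); return;
state = State([FdFacet(), HeapFacet()])
state.call("open", result="fd")
state.call("malloc", result="buf")
state.call("free", arg="buf")
problems = state.problems_at_exit()
print(problems)
```

The point of the design is that each facet only knows about its own
domain, while the State just forwards events to all of them.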
This is somewhat simplistic compared to some approaches, but it does
mean that we can emit nice error messages when a problem is detected: I
hope that by closely modeling the problem domain, we can give
"higher-level" error messages to the user. You can see examples of this
in the HTML reports here:
http://dmalcolm.livejournal.com/6560.html
and here:
http://fedorapeople.org/~dmalcolm/blog/2011-07-15/
(again, that HTML reporting code can be reused for other types of code
analysis)
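For a feel of what a "higher-level" message could contain, here is a
small plain-text sketch (the real reports are HTML, and this is not the
plugin's reporting code). The format_report() helper, the function name
make_list and the wording of the messages are made up for illustration;
only the CPython API names in the strings are real.

```python
def format_report(function, problem, steps):
    """Render one problem, plus the path that leads to it, as plain text."""
    lines = ["In function '%s': %s" % (function, problem)]
    lines.append("  along this path:")
    for number, step in enumerate(steps, start=1):
        lines.append("    %d. %s" % (number, step))
    return "\n".join(lines)


report = format_report(
    "make_list",  # hypothetical function being analyzed
    "reference leak: the new list object is never released",
    [
        "when PyList_New() succeeds",
        "when PyLong_FromLong() fails",
        "returning NULL without Py_DECREF() on the list",
    ],
)
print(report)
```

Because each step comes from a Transition in the trace, the message can
describe the failing path in terms the extension author already uses.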
Anyone up for writing some other analysis hooks? (e.g. libc? kernel
code? your favorite library?)
Dave