Thank you Dave!
Those are exactly the kinds of ideas I was looking for.
Just a short summary of what we can do on the server right now to support this:
- it has all the rpm debuginfo packages, so getting symbol names or
source lines is not a problem (we actually do that already)
- it can extract backtrace from userspace coredumps
- and Fedora users are sending them...
- it can extract a backtrace from kernel coredumps
- though we've actually never seen a Fedora user send a kernel core
- it's not a problem to run some custom scripts during the analysis
- so far it takes the component->owner mapping from the Fedora pkgdb;
the bigger plan is to be more distro-agnostic, so we're not opposed to
using other data sources for the component->owner mapping
- we have all the backtraces from all the crashes processed by the
server, so we can do a lot of data mining (deduplication, finding the
common component across different crashes, ...)
To wrap it up: all of the ideas below are doable, but not without
your help (so get ready for some emails from us ;)). Almost every
package needs some special handling and we can't know them all, so it's
up to the maintainers and developers to let us know what kind of
information they need and how to get it. I can't promise it will be
implemented overnight, but if you shout loud enough...
One thing we're struggling with right now is the normalization of stack
traces, i.e. deciding which functions are important and which are not.
For the kernel, for example, there are stack traces full of warn_*
functions where only a few frames differ, and our logic flags them as
duplicates because the traces are so similar. We're working on this
problem, but progress is slow because making such decisions requires
knowledge of the specific program, so we would appreciate any help
with this matter.
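To make the normalization problem concrete, here is a rough sketch of the
kind of thing we mean (the frame names and the NOISE set are just
illustrative assumptions, not our actual implementation):

```python
import hashlib

# Frames that carry little identity: warning plumbing, stack dumpers, etc.
# (illustrative list -- the real set would come from maintainers)
NOISE = {"warn_slowpath_common", "warn_slowpath_fmt", "warn_slowpath_null",
         "dump_stack", "panic"}

def normalize(frames):
    """Drop noise frames and keep only the top few meaningful ones."""
    meaningful = [f for f in frames if f not in NOISE]
    return tuple(meaningful[:4])

def dedup_key(frames):
    """Stable hash used to cluster similar traces as duplicates."""
    return hashlib.sha1("\n".join(normalize(frames)).encode()).hexdigest()

a = ["warn_slowpath_fmt", "dump_stack", "ieee80211_rx", "netif_receive_skb"]
b = ["warn_slowpath_null", "ieee80211_rx", "netif_receive_skb"]
print(dedup_key(a) == dedup_key(b))  # True: both normalize to the same key
```

The hard part is exactly the NOISE set: which frames are "plumbing" differs
per program, which is why we need input from the people who know the code.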
On 02/20/2013 02:09 AM, Dave Jones wrote:
On Tue, Feb 19, 2013 at 10:10:38PM +0100, Jiri Moskovcak wrote:
> >>So if you want to hack this into a tool for use on kernel bugs, go for
> >...and please integrate with abrt! Let's have it all working together :)
> - I am all for it, the abrt server is exactly the place where these
> kind of things should be
What I have in mind is the cases where some human interaction is still necessary.
Adding heuristics on the server side for certain cases would help us, but
there are still a bunch of common operations we do that require a human
to make a judgment call before we make a change.
But, pursuing the server-side solution, here are some things that we'd find useful
that *could* be automated.
- Unlike most packages, we have individual maintainers for subcomponents
(this is where our bugzilla implementation sucks, because we can't file
by subcomponent). So when we get bugs against certain drivers,
filesystems, etc., we reassign them to the developers who signed up to
work on those areas.
This probably accounts for a significant percentage of our interactions with
bugzilla. I'm not sure what kind of heuristics you'd need to add to automate
assigning to the right person. Maybe you can pull the symbol from the IP,
translate that to a filename, and have a database of wildcards so you can do
drivers/net/wireless/* -> linville@
fs/btrfs/* -> zab@
Because it's not always easy to tell from a report which component is responsible,
sometimes parsing the Summary is necessary, which is the sort of thing
I meant by 'needs human to make a judgment call'. But if we can automate
the majority of the cases, it would still help a lot.
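A first cut at the wildcard table could be as simple as this (the two
entries are the examples above; the function name and fallback behaviour
are assumptions):

```python
from fnmatch import fnmatch

# pattern -> owner table, seeded from the examples above
OWNERS = [
    ("drivers/net/wireless/*", "linville@"),
    ("fs/btrfs/*",             "zab@"),
]

def assignee(source_file):
    """Return the owner whose pattern matches the crashing file, if any.

    Note fnmatch's '*' also matches '/', so the patterns cover nested
    directories. Returns None to fall back to human triage.
    """
    for pattern, owner in OWNERS:
        if fnmatch(source_file, pattern):
            return owner
    return None

print(assignee("fs/btrfs/extent_io.c"))                   # zab@
print(assignee("drivers/net/wireless/ath/ath9k/recv.c"))  # linville@
```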
- Similar to the previous item, but all graphics bugs get reassigned by us
immediately to xorg-x11-drv-* because those guys deal with both the X and
kernel modesetting/dri code. So any trace with 'i915', 'radeon' etc
can probably be auto-reassigned.
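This one is just substring matching on the trace; a minimal sketch
(module list from the two names above, everything else assumed):

```python
# Auto-detect graphics oopses by scanning the trace text for known
# driver module names, then reassign to the xorg-x11-drv-* folks.
GFX_MODULES = ("i915", "radeon")  # 'etc.' -- list would grow over time

def is_graphics_bug(trace_text):
    return any(mod in trace_text for mod in GFX_MODULES)

oops = "RIP: 0010: i915_gem_object_pin+0x41/0x1d0 [i915]"
print(is_graphics_bug(oops))  # True -> candidate for auto-reassignment
```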
- When we get 'general protection fault' bugs, it's useful to run the Code:
line of the oops through scripts/decodecode (from a kernel tree).
This disassembly will allow us to see what instruction caused the GPF.
(Note: *just* general protection faults, not every trace. Also, we
only really need the faulting instruction, not the whole disassembly).
Bonus points if it can suck the relevant data out of the debuginfo rpms
to map the code line to C code.
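scripts/decodecode itself reads the oops text on stdin, so the server
could just pipe the report through it. For the "only the faulting
instruction" part, the byte wrapped in <> on the Code: line is where the
fault was raised; a sketch of that parsing step (the sample line is made
up, and the regex/format are assumptions):

```python
import re

def parse_code_line(line):
    """Return (byte_list, index_of_faulting_byte) from an oops Code: line.

    The kernel marks the faulting byte by wrapping it in <>, e.g.
    'Code: 48 89 e5 <0f> 0b ...'.
    """
    tokens = line.split("Code:", 1)[1].split()
    raw, fault_idx = [], None
    for tok in tokens:
        m = re.fullmatch(r"<?([0-9a-f]{2})>?", tok)
        if m:
            if tok.startswith("<"):
                fault_idx = len(raw)
            raw.append(int(m.group(1), 16))
    return raw, fault_idx

code = "Code: 48 89 e5 0f 1f 44 00 00 <0f> 0b 5d c3"
bytes_, idx = parse_code_line(code)
print(f"faulting byte: {bytes_[idx]:#04x} at offset {idx}")
```

Feeding the bytes from the fault index onward into a disassembler (or
just decodecode) then gives the faulting instruction alone.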
- Extrapolating from the above, when we see certain register values in those
bugs, they usually hint at the cause of a bug. For example 0x6b6b6b6b is
SLAB_POISON, and usually means we tried to use memory after it was freed.
Adding a comment to point this out speeds up analysis.
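A sketch of such an annotation table: 0x6b6b6b6b is from the paragraph
above; the other entries reflect common values from the kernel's
include/linux/poison.h and should be double-checked against the actual
headers (the fuzzy-match window is also an assumption, meant to catch
addresses derived from a poisoned pointer plus a small struct offset):

```python
POISON_HINTS = {
    0x6B6B6B6B: "SLAB poison (use-after-free?)",
    0x5A5A5A5A: "SLAB poison (use of uninitialized memory?)",
    0x00100100: "LIST_POISON1 (use of a deleted list entry?)",
    0x00200200: "LIST_POISON2 (use of a deleted list entry?)",
}

def hint_for(value):
    """Return an explanatory comment for a suspicious register value."""
    for poison, text in POISON_HINTS.items():
        # Match nearby values too: poisoned pointer + small offset.
        if abs(value - poison) < 0x1000:
            return text
    return None

print(hint_for(0x6B6B6B8F))  # SLAB poison (use-after-free?)
```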
- Getting trickier... We see a *lot* of flaky hardware, where we tried to
dereference an address that had a single bit flipped in memory.
If the server side had some smarts so it knew what 'good' addresses look
like, it could detect the single-bit-flip case and guide the user to run
memtest86, saving us a round-trip.
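The bit-flip check itself is cheap once you have a set of known-good
addresses (say, ones that resolve to real symbols); the addresses below
are made up for illustration:

```python
def one_bit_flip(bad, good):
    """True if bad and good differ in exactly one bit."""
    diff = bad ^ good
    # diff is a power of two iff exactly one bit differs
    return diff != 0 and (diff & (diff - 1)) == 0

# Hypothetical addresses that resolve to valid kernel symbols.
known_good = [0xFFFFFFFF81123450, 0xFFFFFFFF81234560]
fault = 0xFFFFFFFF81123450 ^ (1 << 17)  # simulate one flipped bit

if any(one_bit_flip(fault, g) for g in known_good):
    print("possible bit flip -- suggest the user run memtest86")
```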
That's all I have right now, but there are probably a bunch of other
common operations we do which could be automated.