Re: backtrace parser and processor for ABRT

8 Sep 2010

      On Tue, 2010-08-31 at 16:30 +0200, Karel Klic wrote:
...
Hi,
I have created the next iteration of backtrace parsing and processing 
code, and I would like to see it included in the development version of 
ABRT.
What are the improvements compared to the current ABRT's backtrace code?

the backtrace duplication hash algorithm has been significantly

improved: it now removes much more irrelevant frames, and unifies and 
merges frames. For example, glib's functions are sometimes prefixed with 
IA__, and sometimes they are not (sometimes you see IA__g_logv, 
sometimes just g_logv), so the IA__ is stripped for the duplication 
hash. Another example: many backtraces were the same except that one 
crashed in strcpy_ssse3 and other in strcpy_sse2. So now all glibc's 
functions which depend on the instruction set available are unified for 
the hash purposes (both strcpy_sse2 and strcpy_ssse3 are renamed to strcpy)

the backtrace rating algorithm has been improved: it ignores

cleanup-after-crash frames even when they miss debug info, and it also 
fixes bug #592523

the "crash function name" detection has been improved: many irrelevant

frames are skipped during the crash frame detection, so the reported 
crash function is more often the actual place where the program crashed

new hand written parser: it parses much more backtraces correctly

(especially those with C++ frames, applications using boost library), 
and the code seems to be more readable than the current bison grammar. 
That is because some parts of the backtrace format are very difficult to 
express in the rules for the Bison GLR parser while keeping the memory 
usage and parsing time reasonable

the error messages returned by the hand written parser always include

precise location of the error (line:column) and an informative message

the new parser does not crash on "wrong" backtraces; the current bison

parser uses stack heavily, and when some line in a backtrace is too long 
(many kilobytes) bison puts too much stuff on the stack and crashes (I 
do not know how to fix that without rewriting most of the grammar, and 
that would probably introduce limits on other places); so the new parser 
fixes crashes #627698, #627680, #616988, #589962, #588129, #573333
You obviously put a lot of work into this thing. Impressive!
...
The new code has been written as a separate library (see the attached 
archive), because it became pretty large, containing a binary not used 
by the rest of ABRT, many tests, and several helper scripts. Within 
ABRT, it's used only by the CCpp plugin. I am not sure how to integrate 
it. Should we include it (as a separate project in a subdirectory) to 
ABRT's git repository?
I propose just giving it its own subdirectory in the ABRT code.
When some other project will want it factored out, then we surely
can separate it (or let that other project do it).
This allows you to not roll your own strbuf, xmalloc and such.
...
I spend many days hunting bugs in the code, so it is in a good shape now 
as far as I can tell. Most C functions are covered by unit tests, and I 
often run the parser on all ABRT-reported backtraces downloaded from 
Bugzilla (~27000) and it provides good results. So putting btparser into 
ABRT should not cause much disruption. See the attached patches for the 
integration code.
Review (in no particular order):
/**
 * Checks whether the frame represents a call of function with certain
 * function name.
 */
bool
btp_frame_eq0(struct btp_frame *frame,
              const char *function_name);
...
bool
btp_frame_eq1(struct btp_frame *frame,
              const char *function_name,
              const char *source_file);
These functions beg for better names. Maybe:
btp_frame_calls_func(frame, "func")
btp_frame_calls_func_in_file(frame, "func", "file.c")
struct btp_thread *
btp_thread_dup(struct btp_thread *thread,
               bool shallow,
               bool siblings);
struct btp_frame *
btp_frame_dup(struct btp_frame *frame,
              bool shallow,
              bool siblings);
You are using shallow copies only once for every of these functions,
because you copy an on-stack structures in these cases. If you'll
allocate them via malloc, then you'll never need shallow copies
and can drop that parameter and code.
thread.h
^^^^^^^^ The name is too generic. To forestall collisions,
I propose btp_thread.h
#ifndef BTPARSER_THREAD_H
#define BTPARSER_THREAD_H
#ifdef __cplusplus
extern "C" {
#endif
#include <stdio.h>
#include <stdbool.h>
^^^^^^^^ I think they are better moved up, above #ifdef __cplusplus.
(actually, stdio.h is not needed at all here...)
/* Declarations. */   <------- can be safely removed
struct btp_thread;
struct btp_frame;
struct btp_location;
char *name = btp_malloc(strlen(frame->function_name) - 3);
            strcpy(name, frame->function_name + 4);
            free(frame->function_name);
            frame->function_name = name;
You can just
            char *name = [x/btp_]strdup(frame->function_name + 4);
            free(frame->function_name);
            frame->function_name = name;
or even
            strcpy(frame->function_name, frame->function_name + 4);
sans the regretful fact that strcpy() is not safe on overlapped strings;
thus either open-code a safe version:
            char *p = frame->function_name;
            while ((*p = p[4]) != '\0') p++;
or create and use helper function overlapping_strcpy().
#define NORMALIZE_ARCH_SPECIFIC(func)                                   \
        if (btp_frame_eq2(frame, "__" func "_sse2", func ".S", "libc.so") || \
            btp_frame_eq2(frame, "__" func "_ssse3", func ".S", "libc.so") || \
            btp_frame_eq2(frame, "__" func "_ia32", func ".S", "libc.so")) \
        {                                                               \
            free(frame->function_name);                                 \
            frame->function_name = btp_strdup(func);                    \
        }
You can just strcpy instead of free + strdup.
normalize.c:
#include "normalize.h"
#include "normalize_dbus.h"
#include "normalize_gdk.h"
#include "normalize_glib.h"
#include "normalize_glibc.h"
#include "normalize_libstdcpp.h"
#include "normalize_linux.h"
#include "normalize_xorg.h"
You will need to touch this file every time you add a new normalizer.
You can reduce your future work by consolidating all normalize_foo.h
files into one header (they are all very similar and simple,
unlike corresponding .c files),
thus you will need to create one less file and add one less
#include per each new  btp_normalize_foo_thread().
static char *
findfirstabnul(char *input, const char *a, const char *b)
{
    size_t alen = strlen(a);
    size_t blen = strlen(b);
    char *p = input;
    while (*p)
    {
        if (strncmp(p, a, alen) == 0)
            return p;
        if (strncmp(p, b, blen) == 0)
            return p;
        ++p;
    }
    return p;
}
Looks like it is possible to use strcmp's instead of strncmp's
and drop {ab}len and strlen calls
bool
btp_parse_regexp(char **input, const char *pattern, char **result);
Can use a simpler API here:
/* Returns NULL on regexp error or no match */
char*
btp_parse_regexp(char **input, const char *pattern);
It is a bit better because it looks a bit simpler in C, and
   is more efficient (old one takes addresses of pointers,
   forcing them to use stack, not CPU registers).
   The difference is small but positive on both fronts.
Similarly:
bool
btp_backtrace_parse(char **input,
                    struct btp_backtrace **backtrace,
                    struct btp_location *location)
can use a bit different API:
/* Returns NULL on error */
struct btp_backtrace*
btp_backtrace_parse(char **input,
                    struct btp_location *location)
So this usage:
+    struct btp_backtrace *backtrace = NULL;
+    struct btp_location location;
+    btp_location_init(&location);
+    if (!btp_backtrace_parse(&backtrace_str_ptr, &backtrace, &location))
+    {...
turns into this:
+    struct btp_location location;
+    btp_location_init(&location);
+    struct btp_backtrace *backtrace = btp_backtrace_parse(&backtrace_str_ptr, &location);
+    if (!backtrace)
+    {...
Similarly for btp_thread_parse(), btp_frame_parse_file_location() etc...
static char *
btp_get_regerror(int errcode, regex_t *compiled)
{
    size_t length = regerror(errcode, compiled, NULL, 0);
    char *buffer = btp_malloc(length);
    (void) regerror(errcode, compiled, buffer, length);
    return buffer;
}
...
    int regresult = regcomp(&compiled, pattern, REG_EXTENDED);
    if (regresult != 0)
    {
        char *error = btp_get_regerror(regresult, &compiled);
        fprintf(stderr, "btp: Error while compiling regexp: %s", error);
        free(error);
...
    regresult = regexec(&compiled, *input, 1, &matchptr, 0);
    if (regresult != 0) /* REG_NOMATCH or some error */
    {
        if (regresult != REG_NOMATCH)
        {
            char *error = btp_get_regerror(regresult, &compiled);
            fprintf(stderr, "btp: Error while matching regexp: %s", error);
            free(error);
You do the same thing with btp_get_regerror result in all two cases:
print and free. Why not move these ops *into* btp_get_regerror instead?
/**
 * A location in the backtrace file with an attached message.
 * It's used for error reporting: the line and the column points to
 * the place where a parser error occurred, and the message explains
 * what the parser expected and didn't find on that place.
 */
struct btp_location
{
    /** Starts from 1. */
    int line;
    /** Starts from 0. */
    int column;
    /**
     * Error message related to the line and column.  Do not release
     * the memory this pointer points to.
     */
    const char *message;
};
IIUC you are trying to get accurate error reporting, and for that you are
recording the position while you consume input,
and record message and return with it on error.
End user uses this machinery like this:
btp_thread_parse(&input_char_ptr, ..., ptr_to_location_struct);
if (ERROR) {
    printf("%d:%d: %s, run for your lives", location.line, location.column, location.message);
}
The goal is sound, but the price you pay to achieve it seems to be
high in terms of code simplicity and readability.
I would like to make it simpler. How about 
(1) printing (logging) error message on the spot instead of propagating it back to caller?
(2) passing two char* pointers down to callees:
btp_thread_parse(&input_char_ptr, ...):
   start = input_char_ptr;
   while (...)
   -> btp_thread_parse(start, &input_char_ptr, ....)
      -> btp_frame_parse(start, &input_char_ptr, ....)
         if (ERROR DETECTED)
            btp_log_error(start, *input_char_ptr, "Titanic hit the iceberg")
            -> calculates line:col by walking from start
               to cur = *input_char_ptr and counting \n and other chars,
               then log("%d:%d: %s", line, col, msg)
This way, not only struct btp_location is not needed anymore,
but also a lot of "location->column += chars" statements
can be dropped, along with calls to btp_location_add* and btp_location_eat*.
They don't need to be replaced, they simply aren't needed at all.
Again, this looks like a well written chunk of code. Thanks!
-- 
vda

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

2009

Re: backtrace parser and processor for ABRT