Hi Fedorans,
we are facing a very strange Python segfault in an OLPC build, based on F14, for XO-1.5 (I know, do you remember _that_ far ago?).
The state of the OS and disk is well known, but we cannot make it happen at will. So I got my hands on a coredump, installed the exact same OS (so exact matching disk state as the machine that segfaults, same rpms, etc), and went for http://fedoraproject.org/wiki/StackTraces#Obtaining_a_stack_trace_from_a_cor...
I've yum-installed the matching -debuginfo packages for python, glibc and glib (apparently the segfault is calling glib).
Running gdb over the core, it complains "the .dynamic section for /usr/lib/python2.7 is not at the expected address", and suggests a yum install (of /usr/lib/debug/.build-id/<identifier>) that fails.
Full error msg from gdb at http://fpaste.org/YqGv/
I can see that some packages have invalid debuginfo subpackages (http://fedoraproject.org/wiki/Packaging:Debuginfo#Useless_or_incomplete_debu...) , but given Python's importance to the Fedora stack, my bet is that Python's debuginfo is fine, and there's a subtle user error in the middle...
To be clear -- we don't suspect Python or the Fedora stack. But something is rotten, and this segfault is the only clue we have.
This problem is fairly important for the OLPC team at this time, help on this track appreciated.
m
On Thu, 15 Mar 2012 21:18:16 +0100, Martin Langhoff wrote:
we are facing a very strange Python segfault in an OLPC build, based on F14, for XO-1.5 (I know, do you remember _that_ far ago?).
[...]
Full error msg from gdb at http://fpaste.org/YqGv/
All those GDB messages mean your system libraries do not match the core file.
Is it a fresh core file from last hours? Is it generated on the same machine you run GDB on?
F-14 is EOLed, its repositories incl. debuginfos are out of date, I do not think it matters to spend any time on F-14 at all.
Regards, Jan
On Thu, Mar 15, 2012 at 4:58 PM, Jan Kratochvil jan.kratochvil@redhat.com wrote:
All those GDB messages mean your system libraries do not match the core file.
Well, that should just not be. The machine that fails, and my machine have both been installed from the same disk image, which gets written to disk with a process equivalent to dd.
And both machines pass rpm -Va just fine. So the binaries should, um, be the same.
Is it a fresh core file from last hours? Is it generated on the same machine you run GDB on?
It is a core from yesterday, on a machine installed from a 'dd' disk image. The machine that fails is exactly on the opposite side of the world. dd'ing the same OS image on my machine doesn't trigger the failure. So there is something funny on the opposite side of the world.
F-14 is EOLed, its repositories incl. debuginfos are out of date
Not in this case, at least yum&rpm claim that the debuginfos match.
I do not think it matters to spend any time on F-14 at all.
As I stated in my earlier email, I don't want anyone to fix F14, I don't think F14 is to blame.
My questions are simpler:
- Is python 2.7 debuginfo in F14 known to be good or bad?
- If it's known to be good, are there any gotchas not documented in the StackTraces wikipage that could be tripping me up?
cheers,
m
On Thu, 15 Mar 2012 22:31:58 +0100, Martin Langhoff wrote:
Well, that should just not be. The machine that fails, and my machine have both been installed from the same disk image, which gets written to disk with a process equivalent to dd.
And both machines pass rpm -Va just fine. So the binaries should, um, be the same.
It is a core from yesterday,
There can be a difference if one of the machines has the files prelink-ed while the other one does not. prelink runs nightly (/etc/cron.daily/prelink). But it should already be fixed in your GDB version gdb-7.2-52.fc14; it was fixed in gdb-7.2-45.fc14 by:
[patch] [i386] Fix {,un}prelinked libraries for attach/core-load
http://sourceware.org/ml/gdb-patches/2011-02/msg00630.html
Still, I may have missed some case.
If one of the binaries is prelinked and the other is not (or vice versa), the message "is not at the expected address (wrong library or version mismatch?)" really is printed (more in the mail above), but the backtrace should work OK.
You can try, as an experiment, in /etc/sysconfig/prelink:

# Set this to no to disable prelinking altogether
# (if you change this from yes to no prelink -ua
# will be run next night to undo prelinking)
PRELINKING=no
           ^^
If it helps please contact me off-list, with your disk image. It assumes the system generating the core file was not prelinked.
That missing file:

Missing separate debuginfo for
Try: yum --disablerepo='*' --enablerepo='*-debuginfo' install /usr/lib/debug/.build-id/63/420e48a2edbae61166c708ebd2ff1a5aed1054

is probably for the kernel vDSO (as its name is empty), therefore the kernel rpm. One never knows what a build-id matches until Darkserver gets deployed, hopefully for F-17: https://fedoraproject.org/wiki/Darkserver
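The path in that yum suggestion follows a fixed layout: the first two hex digits of the build-id name a directory, and the remaining digits name the entry inside it. A minimal Python sketch of that mapping, assuming the /usr/lib/debug/.build-id root shown above (the function names are made up for illustration and are not part of gdb or yum):

```python
import posixpath

HEX_DIGITS = set("0123456789abcdef")


def build_id_to_debug_path(build_id, root="/usr/lib/debug/.build-id"):
    """Map a hex build-id to its path under the .build-id tree."""
    build_id = build_id.strip().lower()
    if len(build_id) < 3 or not set(build_id) <= HEX_DIGITS:
        raise ValueError("build-id must be a hex string of at least 3 digits")
    # The first byte (two hex digits) is the directory, the rest is the entry.
    return posixpath.join(root, build_id[:2], build_id[2:])


def debug_path_to_build_id(path):
    """Recover the hex build-id from a .build-id path (inverse of the above)."""
    head, tail = posixpath.split(path)
    return posixpath.basename(head) + tail
```

Going from the path back to the plain hex string is handy when searching mirrors or the net for a build-id.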
Thanks, Jan
On Fri, Mar 16, 2012 at 1:57 AM, Jan Kratochvil jan.kratochvil@redhat.com wrote:
And both machines pass rpm -Va just fine. So the binaries should, um, be the same.
It is a core from yesterday,
There can be difference one of the machines has the files prelink-ed while the other one does not. prelink runs nightly (/etc/cron.daily/prelink). But it
Thanks!
Prelink is not involved -- I doublechecked. In OLPC builds, we currently don't prelink due to http://dev.laptop.org/ticket/10898 , we just don't install prelink and don't run it during OS image creation. Even back then when we did, we disabled the cronjob :-)
should be already fixed in your GDB version gdb-7.2-52.fc14,
You got that one right :-)
If it helps please contact me off-list, with your disk image. It assumes the system generating the core file was not prelinked.
Uploading at http://dev.laptop.org/~martin/os5rw-brokenimg/Sandisk_1200908562DEN.img
Bear in mind - that'll contain 2 partitions. The 2nd partition is / but our initrd mounts it, and then chroots into a subdirectory. So when you mount it, you'll want to look into /versions/run/5/
(WTF is this? Root FS "snapshots" via hardlinked trees. Until we have btrfs running on these puppies, it's the best update fail-proof mechanism we have.)
That missing file: Missing separate debuginfo for Try: yum --disablerepo='*' --enablerepo='*-debuginfo' install /usr/lib/debug/.build-id/63/420e48a2edbae61166c708ebd2ff1a5aed1054
is probably for kernel vDSO (as its name is empty), therefore kernel rpm.
Argh, that could be. But our kernel is a custom built rpm, and we don't build -debuginfo. Here, have a fistful of my freshly-torn-out hair.
Now, at the time of this segfault, the dmesg reports a segfault in python2.7, inside calls to glib... (1) why are we then in the kernel and (2) why isn't gdb telling us anything about the python/glib part of the callstack?
still confused -
martin PS: On a different investigation track we think there may be some subtle/odd disk corruption that _passes_ rpm -Va and our own olpc-contents-verify, yet strikes at runtime. Could a subtly corrupt binary (ie: vmlinuz) lead here?
On Fri, 16 Mar 2012 20:46:16 +0100, Martin Langhoff wrote:
Argh, that could be. But our kernel is a custom built rpm,
You have found a bug for Fedora there. In the core file, readelf -l shows:

Program Headers:
  Type           Offset    VirtAddr   PhysAddr   FileSiz MemSiz   Flg Align
[...]
  LOAD           0x1933000 0xa7703000 0x00000000 0x00000 0x174000 R E 0x1000
                                                 ^^^^^^^
There is normally 0x1000 on x86* Fedora kernels due to:

$ cat /proc/self/coredump_filter
00000033

/usr/share/doc/kernel-doc-*/Documentation/filesystems/proc.txt:
- (bit 4) ELF header pages in file-backed private memory areas (it is effective only if the bit 2 is cleared)
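The symptom above, an executable LOAD segment whose FileSiz is zero, can be looked for mechanically. A rough Python sketch that scans the text printed by readelf -l follows; it assumes the one-line-per-segment layout shown above (32-bit output, or the wide format), and the function names are hypothetical:

```python
def parse_load_segments(readelf_output):
    """Parse LOAD lines from `readelf -l` text into dictionaries."""
    segments = []
    for line in readelf_output.splitlines():
        fields = line.split()
        # LOAD + 5 hex fields + at least one flag token + alignment.
        if len(fields) < 8 or fields[0] != "LOAD":
            continue
        offset, vaddr, paddr, filesz, memsz = (int(f, 16) for f in fields[1:6])
        # Flags may be printed with embedded spaces ("R E"); rejoin them.
        flags = "".join(fields[6:-1])
        segments.append({"offset": offset, "vaddr": vaddr, "paddr": paddr,
                         "filesz": filesz, "memsz": memsz, "flags": flags})
    return segments


def find_empty_exec_loads(readelf_output):
    """Return (vaddr, memsz) for executable LOAD segments with no file data."""
    return [(s["vaddr"], s["memsz"])
            for s in parse_load_segments(readelf_output)
            if "E" in s["flags"] and s["filesz"] == 0]
```

Any hit means that mapping contributed no bytes to the core file, so the ELF header page that carries the build-id note cannot be read from it.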
This way the build-id of the executable and of the shared libraries gets dumped into the core file, but it is missing with this OLPC kernel. Fedora GDB carries a build-id patch that is not yet upstreamed, and that patch did not expect such core files.
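To make the 00000033 value concrete, here is a small Python sketch that decodes a coredump_filter string into its set bits. The bit descriptions follow the kernel's proc.txt; the helper names are illustrative only:

```python
COREDUMP_FILTER_BITS = {
    0: "anonymous private memory",
    1: "anonymous shared memory",
    2: "file-backed private memory",
    3: "file-backed shared memory",
    4: "ELF header pages in file-backed private memory areas",
    5: "hugetlb private memory",
    6: "hugetlb shared memory",
}


def decode_coredump_filter(text):
    """Return the sorted list of bit numbers set in a hex filter string."""
    value = int(text.strip(), 16)
    return [bit for bit in sorted(COREDUMP_FILTER_BITS) if value & (1 << bit)]


def elf_header_bit_effective(text):
    """True if bit 4 is set and bit 2 is cleared, as proc.txt requires."""
    bits = decode_coredump_filter(text)
    return 4 in bits and 2 not in bits
```

For the Fedora default quoted above, decode_coredump_filter("00000033") yields bits 0, 1, 4 and 5, so the ELF header pages are dumped. On a live system the string can be read from /proc/self/coredump_filter.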
Going to push a fix for F-15+, but F-14 is EOLed; you can either use FSF GDB, patch Fedora GDB with this patch, or use F-15+ GDB etc.
That backtrace of "core.522" FYI is at: http://people.redhat.com/jkratoch/sandisk.bt
Thanks, Jan
--- gdb-7.2/gdb/solib-svr4.c.orig 2012-03-17 09:39:54.874090162 +0100
+++ gdb-7.2/gdb/solib-svr4.c 2012-03-17 09:42:12.561810807 +0100
@@ -1202,14 +1202,30 @@ svr4_current_sos (void)
             }
           else
             {
-              struct build_id *build_id;
+              struct build_id *build_id = NULL;
 
               strncpy (new->so_original_name, buffer, SO_NAME_MAX_PATH_SIZE - 1);
               new->so_original_name[SO_NAME_MAX_PATH_SIZE - 1] = '\0';
               /* May get overwritten below.  */
               strcpy (new->so_name, new->so_original_name);
 
-              build_id = build_id_addr_get (LM_DYNAMIC_FROM_LINK_MAP (new));
+              /* In the case the main executable was found according to its
+                 build-id (from a core file) prevent loading a different build
+                 of a library with accidentally the same SO_NAME.
+
+                 It suppresses bogus backtraces (and prints "??" there instead)
+                 if the on-disk files no longer match the running program
+                 version.
+
+                 If the main executable was not loaded according to its
+                 build-id do not do any build-id checking of the libraries.
+                 There may be missing build-ids dumped in the core file and we
+                 would map all the libraries to the only existing file loaded
+                 that time - the executable.  */
+
+              if (symfile_objfile != NULL
+                  && (symfile_objfile->flags & OBJF_BUILD_ID_CORE_LOADED) != 0)
+                build_id = build_id_addr_get (LM_DYNAMIC_FROM_LINK_MAP (new));
               if (build_id != NULL)
                 {
                   char *name, *build_id_filename;
@@ -1224,23 +1240,7 @@ svr4_current_sos (void)
                       xfree (name);
                     }
                   else
-                    {
-                      debug_print_missing (new->so_name, build_id_filename);
-
-                      /* In the case the main executable was found according to
-                         its build-id (from a core file) prevent loading
-                         a different build of a library with accidentally the
-                         same SO_NAME.
-
-                         It suppresses bogus backtraces (and prints "??" there
-                         instead) if the on-disk files no longer match the
-                         running program version.  */
-
-                      if (symfile_objfile != NULL
-                          && (symfile_objfile->flags
-                              & OBJF_BUILD_ID_CORE_LOADED) != 0)
-                        new->so_name[0] = 0;
-                    }
+                    debug_print_missing (new->so_name, build_id_filename);
 
                   xfree (build_id_filename);
                   xfree (build_id);
Hi Jan,
that's enormously useful -- thanks! I'll make sure we fix our kernel options so this isn't an issue in the future.
And I'll patch my gdb so I can read the other stacktraces.
cheers -
m
-- devel mailing list devel@lists.fedoraproject.org https://admin.fedoraproject.org/mailman/listinfo/devel
On 03/15/2012 01:58 PM, Jan Kratochvil wrote:
On Thu, 15 Mar 2012 21:18:16 +0100, Martin Langhoff wrote:
we are facing a very strange Python segfault in an OLPC build, based on F14, for XO-1.5 (I know, do you remember _that_ far ago?).
[...]
Full error msg from gdb at http://fpaste.org/YqGv/
All those GDB messages mean your system libraries do not match the core file.
F-14 is EOLed, its repositories incl. debuginfos are out of date, I do not think it matters to spend any time on F-14 at all.
Yes, F-14 is now 1.4 years old, but the situation should not be that bleak. In particular, the mirrors should contain _matching_ packages and debuginfo, and the ordinary install command for a particular debuginfo should succeed. If not, then perhaps a "by-hand" search for the matching debuginfo will succeed. Check the build-id "by hand", too. Search the net for any filename which contains that string. Look at several mirrors of F-14 for your CPU architecture.
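For the "by hand" check, readelf -n prints a "Build ID:" line for each GNU build-id note it finds in a binary or debuginfo file. A minimal Python sketch that pulls those IDs out of the printed text, so they can be compared with the one gdb asks for (the helper name is an assumption for illustration, not an existing tool):

```python
import re

# Matches lines such as "    Build ID: 63420e48a2ed..." in `readelf -n` output.
_BUILD_ID_RE = re.compile(r"Build ID:\s*([0-9a-fA-F]+)")


def extract_build_ids(readelf_notes_output):
    """Return every build-id (lowercase hex) found in `readelf -n` text."""
    return [m.group(1).lower()
            for m in _BUILD_ID_RE.finditer(readelf_notes_output)]
```

Feeding it the output of readelf -n for the installed library and for the candidate debuginfo file shows quickly whether the two really belong together.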
If manual search fails, then get the source, rebuild the offending package and its debuginfo, install them, re-create the crash, and analyze the new dumps.