Hacky patch that mlock()s rpmdb's environment mmap(2)s, in order to attempt to avoid spurious rpmdb corruption issues on Linux that seem to be somehow related to pagein/pageout occurring.
Signed-off-by: Lennert Buytenhek <buytenh@marvell.com>
Signed-off-by: Kedar Sovani <kedars@marvell.com>
---
 rpm-4.6.0-always-mlock.diff |   11 +++++++++++
 rpm.spec                    |    8 ++++++--
 2 files changed, 17 insertions(+), 2 deletions(-)
 create mode 100644 rpm-4.6.0-always-mlock.diff

diff --git a/rpm-4.6.0-always-mlock.diff b/rpm-4.6.0-always-mlock.diff
new file mode 100644
index 0000000..6af1b78
--- /dev/null
+++ b/rpm-4.6.0-always-mlock.diff
@@ -0,0 +1,11 @@
+diff -urp rpm-4.6.0-rc1.orig/lib/backend/db3.c rpm-4.6.0-rc1/lib/backend/db3.c
+--- rpm-4.6.0-rc1.orig/lib/backend/db3.c 2008-10-10 01:42:14.000000000 -0400
++++ rpm-4.6.0-rc1/lib/backend/db3.c 2008-11-20 22:19:55.000000000 -0500
+@@ -368,6 +368,7 @@ static int db_init(dbiIndex dbi, const c
+         }
+     }
+ 
++    eflags |= DB_LOCKDOWN;
+     rc = (dbenv->open)(dbenv, dbhome, eflags, dbi->dbi_perms);
+     rc = cvtdberr(dbi, "dbenv->open", rc, _debug);
+     if (rc)
diff --git a/rpm.spec b/rpm.spec
index b65281a..eef2967 100644
--- a/rpm.spec
+++ b/rpm.spec
@@ -18,7 +18,7 @@ Summary: The RPM package management system
 Name: rpm
 Version: %{rpmver}
-Release: 0.%{snapver}.7
+Release: 0.%{snapver}.7.fa1
 Group: System Environment/Base
 Url: http://www.rpm.org/
 Source0: http://rpm.org/releases/testing/%{name}-%{srcver}.tar.bz2
@@ -42,6 +42,7 @@ Patch205: rpm-4.6.0-rc1-file-debuginfo.patch
 
 # These are not yet upstream
 Patch300: rpm-4.5.90-posttrans.patch
+Patch400: rpm-4.6.0-always-mlock.diff
 
 # Partially GPL/LGPL dual-licensed and some bits with BSD
 # SourceLicense: (GPLv2+ and LGPLv2+ with exceptions) and BSD
@@ -178,7 +179,7 @@ that will manipulate RPM packages and databases.
 %patch203 -p1 -b .defaultdocdir
 %patch204 -p1 -b .fp-hash
 %patch205 -p1 -b .file-debuginfo
-
+%patch400 -p1 -b .always-mlock
 # needs a bit of upstream love first...
 #%patch300 -p1 -b .posttrans
 
@@ -368,6 +369,9 @@ exit 0
 %doc doc/librpm/html/*
 
 %changelog
+* Tue Dec 9 2008 Kedar Sovani <kedars@marvell.com>
+- always lock the rpm database
+
 * Fri Oct 31 2008 Panu Matilainen <pmatilai@redhat.com>
 - adjust find-debuginfo for "file" output change (#468129)
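For readers who have not used mlock() on a file mapping, the following is a minimal sketch of the effect the patch is after, written against plain POSIX calls. It is not how Berkeley DB implements DB_LOCKDOWN internally (as far as I understand its documentation, the flag asks the library to lock the shared environment regions into memory); the helper name `lock_shared_mapping` and the scratch path in the usage note are made up for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of rpm or Berkeley DB): map `len` bytes of
 * `path` as a shared writable mapping and try to pin it in RAM with mlock().
 * Returns 1 if the mapping was locked, 0 if mlock() was refused (for example
 * because of RLIMIT_MEMLOCK), and -1 on any other error.
 */
int lock_shared_mapping(const char *path, size_t len)
{
    assert(path != NULL && len > 0);

    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, (off_t)len) < 0) {
        close(fd);
        return -1;
    }

    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                       /* the mapping keeps the file referenced */
    if (p == MAP_FAILED)
        return -1;

    /* Locked pages stay resident; they are not paged out and back in. */
    int locked = (mlock(p, len) == 0);
    if (!locked)
        perror("mlock");

    memset(p, 0xA5, len);            /* dirty the pages through the mapping */
    if (msync(p, len, MS_SYNC) < 0) {    /* explicit write-back to the file */
        munmap(p, len);
        return -1;
    }
    munmap(p, len);                  /* unmapping also releases the lock */
    return locked;
}
```

Calling `lock_shared_mapping("/tmp/mlock_demo.bin", 4096)` from a small `main()` is enough to try it. Note that locking only keeps the pages resident; dirty data still has to reach the file through normal write-back, so this hides a paging problem rather than explaining it.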
On Thu, 2008-12-11 at 12:37 +0530, Kedar Sovani wrote:
> Hacky patch that mlock()s rpmdb's environment mmap(2)s, in order to attempt to avoid spurious rpmdb corruption issues on Linux that seem to be somehow related to pagein/pageout occurring.
Ick.
No.
On Thu, Dec 11, 2008 at 08:53:27AM +0000, David Woodhouse wrote:
> On Thu, 2008-12-11 at 12:37 +0530, Kedar Sovani wrote:
> > Hacky patch that mlock()s rpmdb's environment mmap(2)s, in order to attempt to avoid spurious rpmdb corruption issues on Linux that seem to be somehow related to pagein/pageout occurring.
> Ick.
>
> No.
The relevant questions are:
1. which kernel version is this occurring with?
2. what device is the swap on?
3. which drivers are being used?
If there are problems with paging causing corruption in userspace, it is a *serious* issue which can't be fixed by throwing hacks at userspace programs.
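As a rough aid for answering the first two questions on an affected machine, here is a small sketch that prints the running kernel release and the active swap areas. It assumes a Linux system where `/proc/swaps` is available; the function names are made up for illustration, and it does not attempt to answer the third question about drivers.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <sys/utsname.h>

/* Print the running kernel's name, release and machine type (roughly what
 * `uname -srm` reports). Returns 0 on success, -1 if uname() fails. */
int print_kernel_release(FILE *out)
{
    struct utsname u;

    assert(out != NULL);
    if (uname(&u) < 0)
        return -1;
    fprintf(out, "kernel: %s %s (%s)\n", u.sysname, u.release, u.machine);
    return 0;
}

/* Copy /proc/swaps (Linux-specific) to `out`: a header line followed by one
 * line per swap area. Returns the number of lines copied (assuming lines
 * shorter than the buffer), or -1 if the file cannot be opened. */
int print_swap_areas(FILE *out)
{
    char line[512];
    int lines = 0;
    FILE *f;

    assert(out != NULL);
    f = fopen("/proc/swaps", "r");
    if (f == NULL)
        return -1;
    while (fgets(line, sizeof line, f) != NULL) {
        fputs(line, out);
        lines++;
    }
    fclose(f);
    return lines;
}
```

Calling both functions with `stdout` from a small `main()` gives output that can be pasted into a bug report.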
On Thu, 2008-12-11 at 08:59 +0000, Russell King wrote:
> The relevant questions are:
> 1. which kernel version is this occurring with?
> 2. what device is the swap on?
> 3. which drivers are being used?
> If there are problems with paging causing corruption in userspace, it is a *serious* issue which can't be fixed by throwing hacks at userspace programs.
Hang on... I recall seeing this problem with rpm 4.4. Let me check whether it is still present in rpm 4.6, or whether this patch just crept in.
I'll dig further...
Kedar.
On Thu, Dec 11, 2008 at 08:59:40AM +0000, Russell King wrote:
> The relevant questions are:
> 1. which kernel version is this occurring with?
> 2. what device is the swap on?
> 3. which drivers are being used?
This issue goes back to May 2007 or so, when I noticed db4 corruption when using rpm. I started digging into it, and ran into an issue with fsx-linux, which you reported to linux-arch@ here:
http://marc.info/?l=linux-arch&m=118026300719763&w=2
Unfortunately, the issue seen with fsx-linux turned out to be unrelated to the rpm db4 corruption issue.
I applied the hacky rpm db4 database mlock() patch (which was never meant to go upstream!) to see if that would make it go away, and it seems to have worked: I haven't managed to reproduce the corruption since, and I haven't had any reports about it either.
Without the mlock patch, the corruption would happen even in qemu-system-arm, an environment in which cache aliasing effects don't exist, so I abandoned the theory of it being a cache aliasing issue at the time. Instead I theorised that a dirty page was somehow having its dirty data discarded and an older, stale copy swapped back in, although I've never been able to prove this. After spending a week unsuccessfully trying to hunt it down, I haven't spent any more time on it. (And everyone I mentioned this to seemed to agree that shared writeable mmap() is icky and yuck and booh and "hard to get right", which didn't increase my motivation to look into it further either.)
I don't even know if it's still an issue in recent kernels. Assuming that it _is_ indeed a kernel issue, I also don't know whether it's an arch/arm issue or a kernel-wide issue that simply occurs more often on ARM because ARM systems generally have less memory and therefore more memory pressure. (There are certainly enough reports of rpm database corruption on x86 as well, but in almost every report other factors are involved, such as people Ctrl-C'ing or killing rpm processes while they are manipulating the database.)
thanks, Lennert
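The "stale copy" theory above amounts to saying that data written through a shared writable mapping can later differ from what the file holds. A minimal sketch of that kind of consistency check follows. It is not fsx-linux and it is not a reproducer: it creates no memory pressure, so on a healthy system it should simply report a match. The function name and its return convention are assumptions made for this example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical check: fill a MAP_SHARED mapping of `path` with `pattern`,
 * flush it with msync(), then read the file back with read() and compare.
 * Returns 0 if every byte matches, 1 on a mismatch (what the stale-page
 * theory would predict), and -1 on a setup or I/O error.
 */
int check_mmap_write_visible(const char *path, size_t len, unsigned char pattern)
{
    assert(path != NULL && len > 0);

    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, (off_t)len) < 0) {
        close(fd);
        return -1;
    }

    unsigned char *map = mmap(NULL, len, PROT_READ | PROT_WRITE,
                              MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) {
        close(fd);
        return -1;
    }

    int result = -1;
    unsigned char *buf = malloc(len);

    memset(map, pattern, len);              /* write through the mapping */
    if (buf != NULL && msync(map, len, MS_SYNC) == 0 &&
        read(fd, buf, len) == (ssize_t)len) {   /* file offset is still 0 */
        result = 0;
        for (size_t i = 0; i < len; i++) {
            if (buf[i] != pattern) {        /* file content is stale */
                result = 1;
                break;
            }
        }
    }

    free(buf);
    munmap(map, len);
    close(fd);
    return result;
}
```

To approximate the original conditions one would have to run such a check repeatedly on a memory-constrained system, which this sketch deliberately leaves out.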
On Mon, 2009-01-05 at 10:31 +0100, Lennert Buytenhek wrote:
> I don't even know if it's still an issue in recent kernels. Assuming that it _is_ indeed a kernel issue, I also don't know whether it's an arch/arm issue or a kernel-wide issue that simply occurs more often on ARM because ARM systems generally have less memory and therefore more memory pressure. (There are certainly enough reports of rpm database corruption on x86 as well, but in almost every report other factors are involved, such as people Ctrl-C'ing or killing rpm processes while they are manipulating the database.)
I have been running a few systems with a lot of rpm activity without this patch, and I haven't seen the problem on them (possibly because of the rpm 4.4 to 4.6 transition?). I have taken the patch out of the F10 rpm patches that I had submitted earlier.
Kedar.
On Mon, Jan 05, 2009 at 03:12:30PM +0530, Kedar Sovani wrote:
> I have been running a few systems with a lot of rpm activity without this patch, and I haven't seen the problem on them (possibly because of the rpm 4.4 to 4.6 transition?). I have taken the patch out of the F10 rpm patches that I had submitted earlier.
What kernel are you running, 2.6.27/28?
On Mon, 2009-01-05 at 10:47 +0100, Lennert Buytenhek wrote:
> What kernel are you running, 2.6.27/28?
2.6.22.18
Kedar.
On Mon, Jan 05, 2009 at 03:21:31PM +0530, Kedar Sovani wrote:
> > What kernel are you running, 2.6.27/28?
> 2.6.22.18
If you are using the kernel tree I think you are using (the hacked-up 10MB-patch-against-mainline "2.6.22.18"), then that is unfortunately so far away from mainline that it's not a very useful datapoint.