On Mon, 31 Dec 2007 11:00:12 -0500, Dan Williams wrote:
On Sun, 2007-12-30 at 17:54 +0100, Michael Schwendt wrote:
> If in a failed job.log you see the message
>
> Job waited too long for repo to unlock. Killing it...
>
> please notify me.
>
> It's a problem in the plague server code that results in a denial of
> service for subsequent build jobs. I have a traceback from Dec 28th, but
> in the context of the source code it doesn't make sense yet (because a few
> lines earlier the code ensures that the files to be copied exist and are
> readable). Buildsys runs a slightly modified version that adds a bit more
> debug output in this area.
Maybe just trap the exception, print it out, and continue? That way at
least the server doesn't fall over, it just fails to copy one item.
The buildsys runs such a patched Repo.py already. It catches OSError,
IOError, unlocks the locks and prints/logs the results of the file access
check prior to when files are copied.
I also added a debug line in the package job code to see when it starts
deleting the copied files. Normally it waits until a callback tells it
that all files are copied.
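
For readers following along, the kind of guard described above could look
roughly like the sketch below. This is not the actual patched Repo.py; the
names copy_into_repo, repo_lock and the logger name are made up for
illustration. It assumes a plain threading lock, checks file access before
copying, logs that check if the copy fails, and always releases the lock.

```python
import logging
import os
import shutil

log = logging.getLogger("repo-sketch")  # hypothetical logger name


def copy_into_repo(files, dest_dir, repo_lock):
    """Copy each file into dest_dir while holding repo_lock.

    A failed copy is logged and skipped instead of aborting the loop.
    Returns the list of source paths that could not be copied.
    """
    failed = []
    # The "with" block releases the lock even if something unexpected raises,
    # so later jobs are not left waiting for the repo to unlock.
    with repo_lock:
        for src in files:
            # Record the access check made just before the copy.
            exists = os.path.exists(src)
            readable = os.access(src, os.R_OK)
            try:
                shutil.copy(src, dest_dir)
            except (OSError, IOError) as exc:
                log.error("copy of %s failed: %s (exists=%s, readable=%s)",
                          src, exc, exists, readable)
                failed.append(src)
    return failed
```

With this shape a single bad file only ends up in the returned list; the
remaining files are still copied and the lock is free afterwards.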
It might also help debugging to see if only specific files can't be
copied...
The offending file was copied, but shutil.copy() failed in its second part
when trying to copy the file mode. It didn't find the source file it had
just copied. :-}
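
To illustrate the two parts: shutil.copy() first copies the file contents
and then copies the permission bits, and the second step has to stat the
source again. The sketch below splits the two steps by hand. The function
name and the "between" hook are hypothetical and exist only to simulate the
source file vanishing in between; this is not code from plague.

```python
import logging
import os
import shutil

log = logging.getLogger("copy-sketch")  # hypothetical logger name


def copy_in_two_steps(src, dst, between=None):
    """Copy contents, then mode, like shutil.copy() does for a file target.

    "between" is an optional callable run after the contents are copied,
    used here only to simulate another actor touching the source.
    Returns True if both steps worked, False if only the contents were copied.
    """
    shutil.copyfile(src, dst)      # step 1: file contents
    if between is not None:
        between()
    try:
        shutil.copymode(src, dst)  # step 2: permission bits, stats src again
    except OSError as exc:
        log.warning("contents of %s copied, but mode copy failed: %s",
                    src, exc)
        return False
    return True
```

If something deletes the source between the two steps, for example by
passing between=lambda: os.remove(src), the destination file is complete
but the mode copy raises "No such file or directory", which matches the
symptom described above.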