#289: Missing autotest debug logs ----------------------+----------------------------------------------------- Reporter: kparal | Owner: Type: defect | Status: new Priority: major | Milestone: Hot issues Component: autotest | Keywords: ----------------------+----------------------------------------------------- I've found a test run where autotest debug logs are missing. What's wrong? We need them.
http://autoqa.fedoraproject.org/results/72107-autotest/qa06.c.fedoraproject....
#289: Missing autotest debug logs ----------------------+----------------------------------------------------- Reporter: kparal | Owner: Type: defect | Status: new Priority: major | Milestone: Hot issues Component: autotest | Resolution: Keywords: | ----------------------+----------------------------------------------------- Comment (by jlaska):
We ran into disk space issues on our autoqa/autotest server yesterday. I've resolved the issue by pruning old test results to free up space. I'm also using this experience to help plan system requirements for our next autoqa/autotest server.
I only see three results which exhibit this problem ... * http://autoqa.fedoraproject.org/results/70273-autotest/qa03.c.fedoraproject.... * http://autoqa.fedoraproject.org/results/72107-autotest/qa06.c.fedoraproject.... * http://autoqa.fedoraproject.org/results/72110-autotest/qa03.c.fedoraproject....
My suspicion is that this test couldn't upload all of it's logs due to disk space issues. However, I can't really confirm that idea presently. I can't think of anything to do with this ticket, other than to continue monitoring for the issue. If we are not seeing these types of issues over the next few days, I'm comfortable closing the ticket, and attributing it to the disk space shortage.
#289: Missing autotest debug logs ----------------------+----------------------------------------------------- Reporter: kparal | Owner: Type: defect | Status: new Priority: major | Milestone: Hot issues Component: autotest | Resolution: Keywords: | ----------------------+----------------------------------------------------- Comment (by kparal):
{{{ # cd /usr/share/autotest/results/ # df -h . Filesystem Size Used Avail Use% Mounted on /dev/mapper/GuestVolGroup00-root 18G 10G 6.4G 61% / # for R in *-autotest; do if [ -z "`find $R -type d -name 'debug'`" ]; then echo $R; fi; done 70273-autotest 72107-autotest 72110-autotest 72289-autotest 72807-autotest 72808-autotest 72809-autotest 72985-autotest 77085-autotest 77086-autotest 77087-autotest 77088-autotest 77089-autotest 77090-autotest }}}
Still happens.
#289: Missing autotest debug logs ----------------------+----------------------------------------------------- Reporter: kparal | Owner: Type: defect | Status: new Priority: major | Milestone: Hot issues Component: autotest | Resolution: Keywords: | ----------------------+----------------------------------------------------- Comment (by kparal):
It's probably another issue for the last occurrences, only queue.log is there.
http://autoqa.fedoraproject.org/results/77090-autotest/
And here's the tasks' status:
{{{ # atest job list -a | grep -E '770(8[5-9]|90)' 77085 autotest post-repo-update:conflicts.i386 Aborted=1(100.0%) 77086 autotest post-repo-update:repoclosure.i386 Aborted=1(100.0%) 77087 autotest post-repo-update:rats_sanity.i386 Aborted=1(100.0%) 77088 autotest post-repo-update:conflicts.x86_64 Aborted=1(100.0%) 77089 autotest post-repo-update:repoclosure.x86_64 Aborted=1(100.0%) 77090 autotest post-repo-update:rats_sanity.x86_64 Aborted=1(100.0%) }}}
According to watcher logs, the jobs were created ok, but then went aborted. Why?
All the logs in logs/ directory end by 2011-03-23 date. Should we create a separate issues for this?
#289: Missing autotest debug logs ----------------------+----------------------------------------------------- Reporter: kparal | Owner: Type: defect | Status: new Priority: major | Milestone: Hot issues Component: autotest | Resolution: Keywords: | ----------------------+----------------------------------------------------- Comment (by jlaska):
Replying to [comment:3 kparal]:
According to watcher logs, the jobs were created ok, but then went
aborted. Why?
The autotest schedule aborted them ''due to job timeout''. Which seems odd given the timestamps around the messages. Perhaps this message is emitted on autotestd shutdown or something? Not sure, will need to look into autotest.
{{{ # for JOB in $(seq 77085 77090); do echo "== Job#$JOB Log ==" ; grep $JOB /usr/share/autotest/logs/scheduler.log.2011-0* ; echo ; echo ; done == Job#77085 Log == /usr/share/autotest/logs/scheduler.log.2011-03-23-09.14.33:04/02 21:32:09 INFO |monitor_db:2770| creating block 77085/8 /usr/share/autotest/logs/scheduler.log.2011-03-23-09.14.33:04/02 21:32:09 INFO |monitor_db:2906| post-repo-update:conflicts.i386/4/None (job 77085, entry 77085) scheduled on 172.16.0.17, status=Queued /usr/share/autotest/logs/scheduler.log.2011-03-23-09.14.33:04/02 21:32:09 INFO |monitor_db:2815| 172.16.0.17/77085 (77085) Queued [active] -> Verifying /usr/share/autotest/logs/scheduler.log.2011-03-23-09.14.33:04/02 21:32:15 INFO |monitor_db:2815| 172.16.0.17/77085 (77085) Verifying [active] -> Verifying /usr/share/autotest/logs/scheduler.log.2011-03-23-09.14.33:04/02 21:32:21 INFO |monitor_db:2815| 172.16.0.17/77085 (77085) Verifying [active] -> Pending /usr/share/autotest/logs/scheduler.log.2011-03-23-09.14.33:04/02 21:32:21 INFO |monitor_db:2815| 172.16.0.17/77085 (77085) Pending [active] -> Starting /usr/share/autotest/logs/scheduler.log.2011-03-23-09.14.33:04/02 21:32:26 WARNI|monitor_db:0065| Aborting job 77085 due to job timeout /usr/share/autotest/logs/scheduler.log.2011-03-23-09.14.33:04/02 21:32:27 INFO |monitor_db:1054| Aborting 172.16.0.17/77085 (77085) Starting [active,aborted] /usr/share/autotest/logs/scheduler.log.2011-03-23-09.14.33:04/02 21:32:27 INFO |monitor_db:2815| 172.16.0.17/77085 (77085) Starting [active,aborted] -> Aborted
== Job#77086 Log == /usr/share/autotest/logs/scheduler.log.2011-03-23-09.14.33:04/02 21:32:09 INFO |monitor_db:2770| creating block 77086/6 /usr/share/autotest/logs/scheduler.log.2011-03-23-09.14.33:04/02 21:32:09 INFO |monitor_db:2906| post-repo-update:repoclosure.i386/4/None (job 77086, entry 77086) scheduled on qa05.c.fedoraproject.org, status=Queued /usr/share/autotest/logs/scheduler.log.2011-03-23-09.14.33:04/02 21:32:09 INFO |monitor_db:2815| qa05.c.fedoraproject.org/77086 (77086) Queued [active] -> Verifying /usr/share/autotest/logs/scheduler.log.2011-03-23-09.14.33:04/02 21:32:15 INFO |monitor_db:2815| qa05.c.fedoraproject.org/77086 (77086) Verifying [active] -> Verifying /usr/share/autotest/logs/scheduler.log.2011-03-23-09.14.33:04/02 21:32:21 INFO |monitor_db:2815| qa05.c.fedoraproject.org/77086 (77086) Verifying [active] -> Pending /usr/share/autotest/logs/scheduler.log.2011-03-23-09.14.33:04/02 21:32:21 INFO |monitor_db:2815| qa05.c.fedoraproject.org/77086 (77086) Pending [active] -> Starting /usr/share/autotest/logs/scheduler.log.2011-03-23-09.14.33:04/02 21:32:26 WARNI|monitor_db:0065| Aborting job 77086 due to job timeout /usr/share/autotest/logs/scheduler.log.2011-03-23-09.14.33:04/02 21:32:27 INFO |monitor_db:1054| Aborting qa05.c.fedoraproject.org/77086 (77086) Starting [active,aborted] /usr/share/autotest/logs/scheduler.log.2011-03-23-09.14.33:04/02 21:32:27 INFO |monitor_db:2815| qa05.c.fedoraproject.org/77086 (77086) Starting [active,aborted] -> Aborted
== Job#77087 Log == /usr/share/autotest/logs/scheduler.log.2011-03-23-09.14.33:04/02 21:32:09 INFO |monitor_db:2770| creating block 77087/7 /usr/share/autotest/logs/scheduler.log.2011-03-23-09.14.33:04/02 21:32:09 INFO |monitor_db:2906| post-repo-update:rats_sanity.i386/4/None (job 77087, entry 77087) scheduled on qa06.c.fedoraproject.org, status=Queued /usr/share/autotest/logs/scheduler.log.2011-03-23-09.14.33:04/02 21:32:09 INFO |monitor_db:2815| qa06.c.fedoraproject.org/77087 (77087) Queued [active] -> Verifying /usr/share/autotest/logs/scheduler.log.2011-03-23-09.14.33:04/02 21:32:15 INFO |monitor_db:2815| qa06.c.fedoraproject.org/77087 (77087) Verifying [active] -> Verifying /usr/share/autotest/logs/scheduler.log.2011-03-23-09.14.33:04/02 21:32:21 INFO |monitor_db:2815| qa06.c.fedoraproject.org/77087 (77087) Verifying [active] -> Pending /usr/share/autotest/logs/scheduler.log.2011-03-23-09.14.33:04/02 21:32:21 INFO |monitor_db:2815| qa06.c.fedoraproject.org/77087 (77087) Pending [active] -> Starting /usr/share/autotest/logs/scheduler.log.2011-03-23-09.14.33:04/02 21:32:26 WARNI|monitor_db:0065| Aborting job 77087 due to job timeout /usr/share/autotest/logs/scheduler.log.2011-03-23-09.14.33:04/02 21:32:27 INFO |monitor_db:1054| Aborting qa06.c.fedoraproject.org/77087 (77087) Starting [active,aborted] /usr/share/autotest/logs/scheduler.log.2011-03-23-09.14.33:04/02 21:32:27 INFO |monitor_db:2815| qa06.c.fedoraproject.org/77087 (77087) Starting [active,aborted] -> Aborted
== Job#77088 Log == /usr/share/autotest/logs/scheduler.log.2011-03-23-09.14.33:04/02 21:32:08 INFO |monitor_db:2770| creating block 77088/9 /usr/share/autotest/logs/scheduler.log.2011-03-23-09.14.33:04/02 21:32:08 INFO |monitor_db:2906| post-repo-update:conflicts.x86_64/1/None (job 77088, entry 77088) scheduled on 172.16.0.18, status=Queued /usr/share/autotest/logs/scheduler.log.2011-03-23-09.14.33:04/02 21:32:08 INFO |monitor_db:2815| 172.16.0.18/77088 (77088) Queued [active] -> Verifying /usr/share/autotest/logs/scheduler.log.2011-03-23-09.14.33:04/02 21:32:15 INFO |monitor_db:2815| 172.16.0.18/77088 (77088) Verifying [active] -> Verifying /usr/share/autotest/logs/scheduler.log.2011-03-23-09.14.33:04/02 21:32:21 INFO |monitor_db:2815| 172.16.0.18/77088 (77088) Verifying [active] -> Pending /usr/share/autotest/logs/scheduler.log.2011-03-23-09.14.33:04/02 21:32:21 INFO |monitor_db:2815| 172.16.0.18/77088 (77088) Pending [active] -> Starting /usr/share/autotest/logs/scheduler.log.2011-03-23-09.14.33:04/02 21:32:26 WARNI|monitor_db:0065| Aborting job 77088 due to job timeout /usr/share/autotest/logs/scheduler.log.2011-03-23-09.14.33:04/02 21:32:27 INFO |monitor_db:1054| Aborting 172.16.0.18/77088 (77088) Starting [active,aborted] /usr/share/autotest/logs/scheduler.log.2011-03-23-09.14.33:04/02 21:32:27 INFO |monitor_db:2815| 172.16.0.18/77088 (77088) Starting [active,aborted] -> Aborted
== Job#77089 Log == /usr/share/autotest/logs/scheduler.log.2011-03-23-09.14.33:04/02 21:32:15 INFO |monitor_db:2770| creating block 77089/2 /usr/share/autotest/logs/scheduler.log.2011-03-23-09.14.33:04/02 21:32:15 INFO |monitor_db:2906| post-repo-update:repoclosure.x86_64/1/None (job 77089, entry 77089) scheduled on qa01.c.fedoraproject.org, status=Queued /usr/share/autotest/logs/scheduler.log.2011-03-23-09.14.33:04/02 21:32:15 INFO |monitor_db:2815| qa01.c.fedoraproject.org/77089 (77089) Queued [active] -> Verifying /usr/share/autotest/logs/scheduler.log.2011-03-23-09.14.33:04/02 21:32:21 INFO |monitor_db:2815| qa01.c.fedoraproject.org/77089 (77089) Verifying [active] -> Verifying /usr/share/autotest/logs/scheduler.log.2011-03-23-09.14.33:04/02 21:32:26 WARNI|monitor_db:0065| Aborting job 77089 due to job timeout /usr/share/autotest/logs/scheduler.log.2011-03-23-09.14.33:04/02 21:32:27 WARNI|monitor_db:2435| initialized <class '__main__.HostQueueEntry'> 77089 instance requery is updating: {'aborted': (0, 1)} /usr/share/autotest/logs/scheduler.log.2011-03-23-09.14.33:04/02 21:32:27 INFO |monitor_db:1054| Aborting qa01.c.fedoraproject.org/77089 (77089) Verifying [active,aborted] /usr/share/autotest/logs/scheduler.log.2011-03-23-09.14.33:04/02 21:32:27 INFO |monitor_db:2815| qa01.c.fedoraproject.org/77089 (77089) Verifying [active,aborted] -> Aborted
== Job#77090 Log == /usr/share/autotest/logs/scheduler.log.2011-03-23-09.14.33:04/02 21:32:15 INFO |monitor_db:2770| creating block 77090/3 /usr/share/autotest/logs/scheduler.log.2011-03-23-09.14.33:04/02 21:32:15 INFO |monitor_db:2906| post-repo-update:rats_sanity.x86_64/1/None (job 77090, entry 77090) scheduled on qa02.c.fedoraproject.org, status=Queued /usr/share/autotest/logs/scheduler.log.2011-03-23-09.14.33:04/02 21:32:15 INFO |monitor_db:2815| qa02.c.fedoraproject.org/77090 (77090) Queued [active] -> Verifying /usr/share/autotest/logs/scheduler.log.2011-03-23-09.14.33:04/02 21:32:21 INFO |monitor_db:2815| qa02.c.fedoraproject.org/77090 (77090) Verifying [active] -> Verifying /usr/share/autotest/logs/scheduler.log.2011-03-23-09.14.33:04/02 21:32:26 WARNI|monitor_db:0065| Aborting job 77090 due to job timeout /usr/share/autotest/logs/scheduler.log.2011-03-23-09.14.33:04/02 21:32:27 WARNI|monitor_db:2435| initialized <class '__main__.HostQueueEntry'> 77090 instance requery is updating: {'aborted': (0, 1)} /usr/share/autotest/logs/scheduler.log.2011-03-23-09.14.33:04/02 21:32:27 INFO |monitor_db:1054| Aborting qa02.c.fedoraproject.org/77090 (77090) Verifying [active,aborted] /usr/share/autotest/logs/scheduler.log.2011-03-23-09.14.33:04/02 21:32:27 INFO |monitor_db:2815| qa02.c.fedoraproject.org/77090 (77090) Verifying [active,aborted] -> Aborted }}}
#289: Jobs are aborted prematurely ----------------------+----------------------------------------------------- Reporter: kparal | Owner: Type: defect | Status: new Priority: major | Milestone: Hot issues Component: autotest | Resolution: Keywords: | ----------------------+----------------------------------------------------- Changes (by kparal):
* summary: Missing autotest debug logs => Jobs are aborted prematurely
Comment:
This is still happening, I just reproduced it on our staging server with autotest 0.13.1.
Job 8636 was created at 2011-12-12 04:20:32. And it was aborted in 16 minutes: {{{ 12/12 04:36:22 INFO |monitor_db:0053| Running periodic cleanup 12/12 04:36:22 INFO |monitor_db:0063| Aborting all jobs that have timed out and are not complete 12/12 04:36:24 WARNI|monitor_db:0067| Aborting job 8636 due to job timeout 12/12 04:36:24 INFO |monitor_db:0075| Aborting all jobs that have passed maximum runtime 12/12 04:36:24 INFO |monitor_db:0146| Clear out blocks for all completed jobs. 12/12 04:36:24 INFO |monitor_db:0090| Cleaning db inconsistencies 12/12 04:36:24 INFO |monitor_db:0734| Aborting no host/8636 (8636) Queued [aborted] 12/12 04:36:24 INFO |scheduler_:0561| no host/8636 (8636) Queued [aborted] -> Aborted }}}
Current settings are: {{{ job_timeout_default: 1 job_max_runtime_hrs_default: 2 }}}
#289: Jobs are aborted prematurely ----------------------+------------------------- Reporter: kparal | Owner: Type: defect | Status: new Priority: major | Milestone: Hot issues Component: autotest | Resolution: Keywords: | ----------------------+-------------------------
Comment (by kparal):
Reported as https://github.com/autotest/autotest/issues/117
#289: Jobs are aborted prematurely [waiting upstream] ----------------------+------------------------- Reporter: kparal | Owner: Type: defect | Status: new Priority: major | Milestone: Hot issues Component: autotest | Resolution: Keywords: | Blocked By: Blocking: | ----------------------+-------------------------
autoqa-devel@lists.fedorahosted.org