Nir Soffer has uploaded a new change for review.
Change subject: sp: Fix stopping domain monitors ......................................................................
sp: Fix stopping domain monitors
Commit 2671777c69 fixed stopping of domain monitors by stopping them from StoragePool.__del__. This was a dangerous fix, because a stray reference to the pool may delay the invocation of __del__ until the referring object is deleted, delaying the stopping of domain monitors for an unlimited time. In the case of a memory leak, the pool may never be deleted and the domain monitors would never stop.
Fortunately, this fix did seem to work for a couple of years, until commit 7b1cc6a20f made the domain monitor pool-independent. Instead of one domain monitor object per pool, there is now a single shared domain monitor object used by all pool objects. With a shared domain monitor, the pool's __del__ method became deadly.
Shortly after commit 7b1cc6a20f was merged, a new random error appeared in CI jobs, claiming that there is no storage space left while storage actually has plenty of space. The root cause of the error was an anonymous thread running at unexpected times and silently stopping all domain monitors. Adding some logging revealed that this thread was started from the StoragePool.__del__ method. This thread was running 14 minutes after the original pool was disconnected, stopping domain monitors used by the current pool object. It seems that a task was holding a reference to the old pool, and when the task finished, the old pool was finally destroyed.
We seem to stop monitoring when we should, so stopping monitoring in the __del__ method is redundant. This patch removes the deadly method.
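The hazard described above can be reproduced in a few lines of plain Python. This is a minimal sketch: SharedMonitor, Pool, and the attribute names are illustrative stand-ins, not vdsm's actual API.

```python
import gc


class SharedMonitor:
    """Stands in for the single domain monitor shared by all pools."""
    def __init__(self):
        self.stopped = False

    def stop_all(self):
        self.stopped = True


monitor = SharedMonitor()


class Pool:
    """Stands in for StoragePool."""
    def __del__(self):
        # Dangerous: runs only when the *last* reference dies, which may be
        # long after disconnect - and it stops the monitor that is shared
        # with the pool object currently in use.
        monitor.stop_all()


old_pool = Pool()
task = {"pool": old_pool}   # a task keeps a stray reference to the old pool
del old_pool                # "disconnect": __del__ does NOT run yet
assert monitor.stopped is False

task.clear()                # much later the task finishes...
gc.collect()
assert monitor.stopped is True  # ...and only now the shared monitor is stopped
```

On CPython the reference-count drop in task.clear() triggers __del__ immediately; the gc.collect() call is only there for runtimes without deterministic reference counting.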
Change-Id: Iad80984ab44dd4a4e36b92330eb9320cf68f1580
Relates-To: https://bugzilla.redhat.com/705058
Relates-To: http://gerrit.ovirt.org/19762
Bug-Url: https://bugzilla.redhat.com/1032925
Signed-off-by: Nir Soffer nsoffer@redhat.com
---
M vdsm/storage/sp.py
1 file changed, 0 insertions(+), 4 deletions(-)
git pull ssh://gerrit.ovirt.org:29418/vdsm refs/changes/58/22058/1
diff --git a/vdsm/storage/sp.py b/vdsm/storage/sp.py
index 31e0bcd..91b27d5 100644
--- a/vdsm/storage/sp.py
+++ b/vdsm/storage/sp.py
@@ -140,10 +140,6 @@ def getSpmStatus(self):
         return self.spmRole, self.getSpmLver(), self.getSpmId()
 
-    def __del__(self):
-        if len(self.domainMonitor.poolMonitoredDomains) > 0:
-            threading.Thread(target=self.stopMonitoringDomains).start()
-
     @unsecured
     def forceFreeSpm(self):
         # DO NOT USE, STUPID, HERE ONLY FOR BC
Liron Ar has posted comments on this change.
Change subject: sp: Fix stopping domain monitors ......................................................................
Patch Set 1: Code-Review+1
Sergey Gotliv has posted comments on this change.
Change subject: sp: Fix stopping domain monitors ......................................................................
Patch Set 1: Code-Review+1
oVirt Jenkins CI Server has posted comments on this change.
Change subject: sp: Fix stopping domain monitors ......................................................................
Patch Set 1:
Build Successful
http://jenkins.ovirt.org/job/vdsm_pep8_gerrit/5904/ : SUCCESS
http://jenkins.ovirt.org/job/vdsm_unit_tests_gerrit_el/5108/ : SUCCESS
http://jenkins.ovirt.org/job/vdsm_unit_tests_gerrit/5996/ : SUCCESS
Nir Soffer has posted comments on this change.
Change subject: sp: Fix stopping domain monitors ......................................................................
Patch Set 1: Verified+1
Verified in the CI - ran 10 jobs (like the one from the bug); 9 completed successfully, one had an (unrelated) engine install failure.
See http://jenkins-ci.eng.lab.tlv.redhat.com/view/04%20developers/job/vdsm_33_de...
Jobs 43-52
Allon Mureinik has posted comments on this change.
Change subject: sp: Fix stopping domain monitors ......................................................................
Patch Set 1: Code-Review+1
Federico Simoncelli has posted comments on this change.
Change subject: sp: Fix stopping domain monitors ......................................................................
Patch Set 1: Code-Review+1
(1 comment)
Code is fine but the commit message should be improved.
....................................................
Commit Message

Line 12: delaying stopping of domain monitors for unlimited time. In case of a memory
Line 13: leak, the pool may never be deleted and domain monitors would never stop.
Line 14: 
Line 15: Fortunately, this fix did seem to work for couple of years, until commit
Line 16: 7b1cc6a20f made the domain monitor pool independent. Instead of one domain

I don't think you should mention 2671777c69. At its time it was maybe overzealous but correct. (Reading these two paragraphs gives the impression that it was working by accident.)

Line 17: monitor object per pool, there is now a single shared domain monitor object
Line 18: used by all pool objects. Having a shared domain monitor, the pool __del__
Line 19: method became deadly.
Line 20:
Nir Soffer has posted comments on this change.
Change subject: sp: Fix stopping domain monitors ......................................................................
Patch Set 1:
(1 comment)
....................................................
Commit Message

Line 12: delaying stopping of domain monitors for unlimited time. In case of a memory
Line 13: leak, the pool may never be deleted and domain monitors would never stop.
Line 14: 
Line 15: Fortunately, this fix did seem to work for couple of years, until commit
Line 16: 7b1cc6a20f made the domain monitor pool independent. Instead of one domain

But it was working by accident - stopping monitoring after 14 minutes does not look like the intended behavior.

Line 17: monitor object per pool, there is now a single shared domain monitor object
Line 18: used by all pool objects. Having a shared domain monitor, the pool __del__
Line 19: method became deadly.
Line 20:
Saggi Mizrahi has posted comments on this change.
Change subject: sp: Fix stopping domain monitors ......................................................................
Patch Set 1: Code-Review+1
Federico Simoncelli has posted comments on this change.
Change subject: sp: Fix stopping domain monitors ......................................................................
Patch Set 1: Code-Review+2
Dan Kenigsberg has posted comments on this change.
Change subject: sp: Fix stopping domain monitors ......................................................................
Patch Set 1:
(1 comment)
....................................................
Commit Message

Line 22: jobs, where there is no storage space left, while storage has plenty of space.
Line 23: The root cause of the error was anonymous thread running at unexpected times
Line 24: and stopping silently all domain monitors. Adding some logging revealed that
Line 25: this thread was started from StoragePool.__del__ method. This thread was
Line 26: running 14 minutes after the original pool was disconnected, stopping domain

Your patch seems fine to me, but I am not sure we completely understand the root cause.

It may well be that __del__ was called 14 minutes after the sp object has lost context, but it's unlikely. It is more likely that it happened much earlier, but the created thread was blocking on a long-disconnected storage.

I suspect that we failed to stop monitoring when we should have, and that sp.__del__ was called while self.domainMonitor.poolMonitoredDomains still had a couple of monitored domains. This suspicion is based on cases where we've seen a destroyStoragePool being received without a prior disconnectStoragePool, and after a disconnectStorageServer.

Could this be the case here?

Line 27: monitors used by the current pool object. It seems that a task was holding a
Line 28: reference to the old pool, and when the task was finished, the old pool was
Line 29: finally destroyed.
Line 30:
Nir Soffer has posted comments on this change.
Change subject: sp: Fix stopping domain monitors ......................................................................
Patch Set 1:
(1 comment)
....................................................
Commit Message

Line 22: jobs, where there is no storage space left, while storage has plenty of space.
Line 23: The root cause of the error was anonymous thread running at unexpected times
Line 24: and stopping silently all domain monitors. Adding some logging revealed that
Line 25: this thread was started from StoragePool.__del__ method. This thread was
Line 26: running 14 minutes after the original pool was disconnected, stopping domain

I'm pretty sure about my description - you can see it in the log of this job: http://jenkins-ci.eng.lab.tlv.redhat.com/job/rhevm_3.3_automation_coretools_... (see vdsm.log of cinteg04).

I suspect that the delay is caused by a copyImage task that keeps a reference to the pool. __del__ was called when this task was committed. We are still investigating this.

Line 27: monitors used by the current pool object. It seems that a task was holding a
Line 28: reference to the old pool, and when the task was finished, the old pool was
Line 29: finally destroyed.
Line 30:
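If a long-running task really does keep the old pool alive, one generic way to avoid delaying its destruction (sketched here in plain Python, not vdsm code) is for the task to hold only a weak reference to the pool:

```python
import gc
import weakref


class Pool:
    """Illustrative stand-in for StoragePool."""


pool = Pool()
task_ref = weakref.ref(pool)  # the task keeps a weak reference only

assert task_ref() is pool     # usable while the pool is alive
del pool                      # disconnect drops the last strong reference
gc.collect()                  # for runtimes without reference counting
assert task_ref() is None     # the task no longer delays pool destruction
```

With this pattern the pool is destroyed as soon as disconnect drops the last strong reference, regardless of how long the task lives.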
Dan Kenigsberg has posted comments on this change.
Change subject: sp: Fix stopping domain monitors ......................................................................
Patch Set 1:
(1 comment)
....................................................
Commit Message

Line 22: jobs, where there is no storage space left, while storage has plenty of space.
Line 23: The root cause of the error was anonymous thread running at unexpected times
Line 24: and stopping silently all domain monitors. Adding some logging revealed that
Line 25: this thread was started from StoragePool.__del__ method. This thread was
Line 26: running 14 minutes after the original pool was disconnected, stopping domain

If a copyImage (which is an SPM operation) takes place while we are members of another pool, we have very serious things to worry about: that copyImage might be copying stuff into an area used by the new SPM. We should have lost our SPMness before being added to a new pool - either by spmStop, by a vdsm suicide, or by host fencing.

Line 27: monitors used by the current pool object. It seems that a task was holding a
Line 28: reference to the old pool, and when the task was finished, the old pool was
Line 29: finally destroyed.
Line 30:
Nir Soffer has posted comments on this change.
Change subject: sp: Fix stopping domain monitors ......................................................................
Patch Set 1:
(1 comment)
....................................................
Commit Message

Line 22: jobs, where there is no storage space left, while storage has plenty of space.
Line 23: The root cause of the error was anonymous thread running at unexpected times
Line 24: and stopping silently all domain monitors. Adding some logging revealed that
Line 25: this thread was started from StoragePool.__del__ method. This thread was
Line 26: running 14 minutes after the original pool was disconnected, stopping domain

We are worried about additional issues and are investigating.

Line 27: monitors used by the current pool object. It seems that a task was holding a
Line 28: reference to the old pool, and when the task was finished, the old pool was
Line 29: finally destroyed.
Line 30:
Dan Kenigsberg has posted comments on this change.
Change subject: sp: Fix stopping domain monitors ......................................................................
Patch Set 1:
(1 comment)
....................................................
Commit Message

Line 22: jobs, where there is no storage space left, while storage has plenty of space.
Line 23: The root cause of the error was anonymous thread running at unexpected times
Line 24: and stopping silently all domain monitors. Adding some logging revealed that
Line 25: this thread was started from StoragePool.__del__ method. This thread was
Line 26: running 14 minutes after the original pool was disconnected, stopping domain

As long as we're clear that there's a deeper "root cause".

Line 27: monitors used by the current pool object. It seems that a task was holding a
Line 28: reference to the old pool, and when the task was finished, the old pool was
Line 29: finally destroyed.
Line 30:
Dan Kenigsberg has submitted this change and it was merged.
Change subject: sp: Fix stopping domain monitors ......................................................................
sp: Fix stopping domain monitors
Commit 2671777c69 fixed stopping of domain monitors by stopping them from StoragePool.__del__. This was a dangerous fix, because a stray reference to the pool may delay the invocation of __del__ until the referring object is deleted, delaying the stopping of domain monitors for an unlimited time. In the case of a memory leak, the pool may never be deleted and the domain monitors would never stop.
Fortunately, this fix did seem to work for a couple of years, until commit 7b1cc6a20f made the domain monitor pool-independent. Instead of one domain monitor object per pool, there is now a single shared domain monitor object used by all pool objects. With a shared domain monitor, the pool's __del__ method became deadly.
Shortly after commit 7b1cc6a20f was merged, a new random error appeared in CI jobs, claiming that there is no storage space left while storage actually has plenty of space. The root cause of the error was an anonymous thread running at unexpected times and silently stopping all domain monitors. Adding some logging revealed that this thread was started from the StoragePool.__del__ method. This thread was running 14 minutes after the original pool was disconnected, stopping domain monitors used by the current pool object. It seems that a task was holding a reference to the old pool, and when the task finished, the old pool was finally destroyed.
We seem to stop monitoring when we should, so stopping monitoring in the __del__ method is redundant. This patch removes the deadly method.
Change-Id: Iad80984ab44dd4a4e36b92330eb9320cf68f1580
Relates-To: https://bugzilla.redhat.com/705058
Relates-To: http://gerrit.ovirt.org/19762
Bug-Url: https://bugzilla.redhat.com/1032925
Signed-off-by: Nir Soffer nsoffer@redhat.com
Reviewed-on: http://gerrit.ovirt.org/22058
Reviewed-by: Liron Ar laravot@redhat.com
Reviewed-by: Sergey Gotliv sgotliv@redhat.com
Reviewed-by: Allon Mureinik amureini@redhat.com
Reviewed-by: Saggi Mizrahi smizrahi@redhat.com
Reviewed-by: Federico Simoncelli fsimonce@redhat.com
---
M vdsm/storage/sp.py
1 file changed, 0 insertions(+), 4 deletions(-)
Approvals:
  Nir Soffer: Verified
  Saggi Mizrahi: Looks good to me, but someone else must approve
  Federico Simoncelli: Looks good to me, approved
  Sergey Gotliv: Looks good to me, but someone else must approve
  Allon Mureinik: Looks good to me, but someone else must approve
  Liron Ar: Looks good to me, but someone else must approve