Liron Aravot has uploaded a new change for review.
Change subject: hsm: prepareForShutdown - operations order ......................................................................
hsm: prepareForShutdown - operations order
Currently cleanupMasterMount() is executed before taskMgr.prepareForShutdown() and before the domain monitor is stopped, that causes to errors during the tasks abortion (becasue of failure to access the path) and to erros when stopping the domain monitoring thread on that domain.
The solution introduced in that patch is to change the order of the operations - so that the cleanupMasterMount() will be executed only after the other operations are executed.
Change-Id: I9edd84317b08a17db80e265053edaf69582c2a51 Bug-Url: https://bugzilla.redhat.com/show_bug.cgi?id=1161934 Signed-off-by: Liron Aravot laravot@redhat.com --- M vdsm/storage/hsm.py 1 file changed, 1 insertion(+), 1 deletion(-)
git pull ssh://gerrit.ovirt.org:29418/vdsm refs/changes/62/36162/1
diff --git a/vdsm/storage/hsm.py b/vdsm/storage/hsm.py index c8aaf93..8f71d71 100644 --- a/vdsm/storage/hsm.py +++ b/vdsm/storage/hsm.py @@ -3433,7 +3433,6 @@ # stop spm tasks if spm etc.) try: self._connectionMonitor.stopMonitoring() - sp.StoragePool.cleanupMasterMount() self.__releaseLocks()
for spUUID in self.pools: @@ -3454,6 +3453,7 @@ exc_info=True)
self.taskMng.prepareForShutdown() + sp.StoragePool.cleanupMasterMount() except: pass
Nir Soffer has posted comments on this change.
Change subject: hsm: prepareForShutdown - operations order ......................................................................
Patch Set 1:
(1 comment)
http://gerrit.ovirt.org/#/c/36162/1/vdsm/storage/hsm.py File vdsm/storage/hsm.py:
Line 3432 Line 3433 Line 3434 Line 3435 Line 3436 I think that it will be more correct to move
self.taskMng.prepareForShutdown()
Before
sp.StoragePool.cleanupMasterMount()
Since you should not access shared storage after you release the locks.
But this will increase the chance that we do not reach the next line, and leave stale locks around, because preparing taskMng for shutdown was too slow and vdsm was killed by the service (after 10 seconds).
We need little more thinking.
oVirt Jenkins CI Server has posted comments on this change.
Change subject: hsm: prepareForShutdown - operations order ......................................................................
Patch Set 1:
Build Failed
http://jenkins.ovirt.org/job/vdsm_master_unit_tests_gerrit_el/13455/ : FAILURE
http://jenkins.ovirt.org/job/vdsm_master_unit-tests_created/14412/ : FAILURE
http://jenkins.ovirt.org/job/vdsm_master_pep8_gerrit/14244/ : SUCCESS
oVirt Jenkins CI Server has posted comments on this change.
Change subject: hsm: prepareForShutdown - operations order ......................................................................
Patch Set 2:
Build Failed
http://jenkins.ovirt.org/job/vdsm_master_unit_tests_gerrit_el/13457/ : FAILURE
http://jenkins.ovirt.org/job/vdsm_master_unit-tests_created/14414/ : FAILURE
http://jenkins.ovirt.org/job/vdsm_master_pep8_gerrit/14246/ : SUCCESS
oVirt Jenkins CI Server has posted comments on this change.
Change subject: hsm: prepareForShutdown - operations order ......................................................................
Patch Set 3:
Build Failed
http://jenkins.ovirt.org/job/vdsm_master_unit_tests_gerrit_el/13458/ : FAILURE
http://jenkins.ovirt.org/job/vdsm_master_unit-tests_created/14415/ : FAILURE
http://jenkins.ovirt.org/job/vdsm_master_pep8_gerrit/14247/ : SUCCESS
oVirt Jenkins CI Server has posted comments on this change.
Change subject: hsm: prepareForShutdown - operations order ......................................................................
Patch Set 4:
Build Failed
http://jenkins.ovirt.org/job/vdsm_master_unit_tests_gerrit_el/13459/ : FAILURE
http://jenkins.ovirt.org/job/vdsm_master_unit-tests_created/14416/ : FAILURE
http://jenkins.ovirt.org/job/vdsm_master_pep8_gerrit/14248/ : SUCCESS
Liron Aravot has posted comments on this change.
Change subject: hsm: prepareForShutdown - operations order ......................................................................
Patch Set 1:
(1 comment)
http://gerrit.ovirt.org/#/c/36162/1/vdsm/storage/hsm.py File vdsm/storage/hsm.py:
Line 3432 Line 3433 Line 3434 Line 3435 Line 3436
I think that it will be more correct to move
i agree, as of today self.taskMng.prepareForShutdown() already access the shared storage after releaseLocks - but it doesn't access the tasks dir (which it will after this change). Moving self.taskMng.prepareForShutdown() to be before self.__releaseLocks() may be trickier in terms of verification imo - but let's here other opinions on that.
Allon Mureinik has posted comments on this change.
Change subject: hsm: prepareForShutdown - operations order ......................................................................
Patch Set 4:
ping?
Allon Mureinik has posted comments on this change.
Change subject: hsm: prepareForShutdown - operations order ......................................................................
Patch Set 4:
PING?
Allon Mureinik has posted comments on this change.
Change subject: hsm: prepareForShutdown - operations order ......................................................................
Patch Set 4:
PING?!
Allon Mureinik has posted comments on this change.
Change subject: hsm: prepareForShutdown - operations order ......................................................................
Patch Set 4:
??????? ??:::::::?? iiii ??:::::::::::? i::::i ?:::::????:::::? iiii ?::::? ?::::? ?::::? ?::::? ppppp ppppppppp iiiiiiinnnn nnnnnnnn ggggggggg ggggg?????? ?::::? p::::ppp:::::::::p i:::::in:::nn::::::::nn g:::::::::ggg::::g ?::::? p:::::::::::::::::p i::::in::::::::::::::nn g:::::::::::::::::g ?::::? pp::::::ppppp::::::p i::::inn:::::::::::::::ng::::::ggggg::::::gg ?::::? p:::::p p:::::p i::::i n:::::nnnn:::::ng:::::g g:::::g ?::::? p:::::p p:::::p i::::i n::::n n::::ng:::::g g:::::g ?::::? p:::::p p:::::p i::::i n::::n n::::ng:::::g g:::::g ?::::? p:::::p p::::::p i::::i n::::n n::::ng::::::g g:::::g ??::?? p:::::ppppp:::::::pi::::::i n::::n n::::ng:::::::ggggg:::::g ???? p::::::::::::::::p i::::::i n::::n n::::n g::::::::::::::::g p::::::::::::::pp i::::::i n::::n n::::n gg::::::::::::::g ??? p::::::pppppppp iiiiiiii nnnnnn nnnnnn gggggggg::::::g ??:?? p:::::p g:::::g ??? p:::::p gggggg g:::::g p:::::::p g:::::gg gg:::::g p:::::::p g::::::ggg:::::::g p:::::::p gg:::::::::::::g ppppppppp ggg::::::ggg gggggg
Allon Mureinik has posted comments on this change.
Change subject: hsm: prepareForShutdown - operations order ......................................................................
Patch Set 4: Code-Review+1
Nir Soffer has posted comments on this change.
Change subject: hsm: prepareForShutdown - operations order ......................................................................
Patch Set 4: Code-Review-1
(7 comments)
https://gerrit.ovirt.org/#/c/36162/4//COMMIT_MSG Commit Message:
Line 8: Line 9: Currently cleanupMasterMount() is executed before Line 10: taskMgr.prepareForShutdown() that causes to errors on updating the tasks Line 11: status during the tasks abortion (becasue of failure to access the path Line 12: of the tasks dir). Fixing this requires separation of stopping a other threads and waiting for the thread to complete.
This is implemented partially in the domain monitor (cutting down shutdown time from 120 seconds to 10 with 50 threads).
All entities should implement: - stop - signal need to stop, non-blocking - wait - wait until entity is stopped
During shutdown, we should ask all entities to stop, then wait for all of them.
So fixing this require much larger change, best leave code as is for now. Line 13: Line 14: The solution introduced in that patch is to change the order of the Line 15: operations - so that the cleanupMasterMount() will be executed only Line 16: after the tasks abortion is executed.
Line 12: of the tasks dir). Line 13: Line 14: The solution introduced in that patch is to change the order of the Line 15: operations - so that the cleanupMasterMount() will be executed only Line 16: after the tasks abortion is executed. This may leave behind stale master mounts, worth then tasks with incorrect status. Line 17: Line 18: Change-Id: I9edd84317b08a17db80e265053edaf69582c2a51 Line 19: Bug-Url: https://bugzilla.redhat.com/show_bug.cgi?id=1161934
https://gerrit.ovirt.org/#/c/36162/4/vdsm/storage/hsm.py File vdsm/storage/hsm.py:
Line 3418 Line 3419 Line 3420 Line 3421 Line 3422 Typically we have 10 seconds to complete this, before vdsm is killed (SIGKILL).
Line 3433 Line 3434 Line 3435 Line 3436 Line 3437 After this I would not touch shared storage.
Line 3439 Line 3440 Line 3441 Line 3442 Line 3443 This may block for long time if a task is blocked.
Line 3447 Line 3448 Line 3449 Line 3450 Line 3451 This may take several seconds (10 seconds for 50 storage domains, caused by sanlock), so there will be no time left to cleanup master mount after this.
Line 3452: self.log.warning("Failed to stop RepoStats thread", Line 3453: exc_info=True) Line 3454: Line 3455: self.taskMng.prepareForShutdown() Line 3456: sp.StoragePool.cleanupMasterMount() This may never run or vdsm may be killed in the middle. Line 3457: except: Line 3458: pass Line 3459: Line 3460: @classmethod
Allon Mureinik has posted comments on this change.
Change subject: hsm: prepareForShutdown - operations order ......................................................................
Patch Set 4: -Code-Review
?
Liron Aravot has posted comments on this change.
Change subject: hsm: prepareForShutdown - operations order ......................................................................
Patch Set 4:
(1 comment)
https://gerrit.ovirt.org/#/c/36162/4/vdsm/storage/hsm.py File vdsm/storage/hsm.py:
Line 3452: self.log.warning("Failed to stop RepoStats thread", Line 3453: exc_info=True) Line 3454: Line 3455: self.taskMng.prepareForShutdown() Line 3456: sp.StoragePool.cleanupMasterMount()
This may never run or vdsm may be killed in the middle.
So what do you suggest? the current situation is problematic as well...it's either to replace the order or to pass on the tasks status update as it needs that mount. Line 3457: except: Line 3458: pass Line 3459: Line 3460: @classmethod
Nir Soffer has posted comments on this change.
Change subject: hsm: prepareForShutdown - operations order ......................................................................
Patch Set 4:
(1 comment)
https://gerrit.ovirt.org/#/c/36162/4/vdsm/storage/hsm.py File vdsm/storage/hsm.py:
Line 3452: self.log.warning("Failed to stop RepoStats thread", Line 3453: exc_info=True) Line 3454: Line 3455: self.taskMng.prepareForShutdown() Line 3456: sp.StoragePool.cleanupMasterMount()
So what do you suggest? the current situation is problematic as well...it's
The easy fix would be to remove the code that always fail in the current way we shut down, since this code was never correct.
The long term fix is to redesign the system so we can stop/cleanup concurretnly, and only then wait for all threads. Line 3457: except: Line 3458: pass Line 3459: Line 3460: @classmethod
Liron Aravot has posted comments on this change.
Change subject: hsm: prepareForShutdown - operations order ......................................................................
Patch Set 4:
(1 comment)
https://gerrit.ovirt.org/#/c/36162/4/vdsm/storage/hsm.py File vdsm/storage/hsm.py:
Line 3452: self.log.warning("Failed to stop RepoStats thread", Line 3453: exc_info=True) Line 3454: Line 3455: self.taskMng.prepareForShutdown() Line 3456: sp.StoragePool.cleanupMasterMount()
The easy fix would be to remove the code that always fail in the current wa
Nir, the current code doesn't always fail..it will fail only for block storage while working on any other storage type..so i wouldn't just remove it. I can move prepare for shutdown "up", so it'll be executed immediately after cleanupMasterMount() if you prefer it.
what do you think? Line 3457: except: Line 3458: pass Line 3459: Line 3460: @classmethod
automation@ovirt.org has posted comments on this change.
Change subject: hsm: prepareForShutdown - operations order ......................................................................
Patch Set 5:
* Update tracker::#1161934::OK * Check Bug-Url::OK * Check Public Bug::#1161934::OK, public bug * Check Product::#1161934::OK, Correct product Red Hat Enterprise Virtualization Manager * Check TM::SKIP, not in a monitored branch (ovirt-3.5 ovirt-3.4 ovirt-3.3 ovirt-3.2) * Check merged to previous::IGNORE, Not in stable branch (['ovirt-3.5', 'ovirt-3.4', 'ovirt-3.3'])
vdsm-patches@lists.fedorahosted.org