Why not let the kernel/init process reap the zombies instead of the
parent process?
signal.signal(signal.SIGCHLD, signal.SIG_IGN)
Just for portability: do BSD and System V not support SIG_IGN for SIGCHLD?
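For what it's worth, POSIX.1-2001 adopted the System V semantics here: when SIGCHLD's disposition is SIG_IGN, the kernel reaps terminated children itself and no zombies are left. A minimal standalone sketch (POSIX only; the sleep just gives the kernel time) to check the behavior on a given platform:

```python
import errno
import os
import signal
import time

# With SIGCHLD set to SIG_IGN, the kernel reaps terminated children
# itself (POSIX.1-2001 / System V semantics), leaving no zombies.
signal.signal(signal.SIGCHLD, signal.SIG_IGN)

pid = os.fork()
if pid == 0:
    os._exit(0)      # child: exit immediately
time.sleep(0.5)      # parent: give the kernel a moment to reap it

try:
    os.waitpid(pid, 0)
    reaped = False   # we collected the status ourselves: no auto-reap
except OSError as e:
    # ECHILD means there was nothing left to wait for -- the kernel
    # already reaped the child.
    reaped = (e.errno == errno.ECHILD)

print("auto-reaped by kernel:", reaped)
```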
On 01/25/2013 03:31 PM, Mark Wu wrote:
On 01/25/2013 03:20 PM, Mark Wu wrote:
> Great work!
> The default action for SIGCHLD is to ignore it, so no problems were
> reported before the zombie reaper installed a signal handler.
> But I still have one question: the python multiprocessing.manager code
> runs in a new thread, and according to the implementation of
> python's signal handling, only the main thread can receive signals.
> So how is the signal delivered to the server thread?
>
Is it possible to reap the zombie processes in the main thread
asynchronously and periodically? It could be safer than using a
signal handler.
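A periodic reaper in the main thread could be as simple as the following sketch (illustrative only, not vdsm's actual code): call it from the main loop every few seconds and it collects whatever children have already exited, without ever blocking.

```python
import os

def reap_zombies():
    """Collect every already-exited child without blocking.

    Safe to call periodically from the main thread; returns the
    (pid, status) pairs that were reaped.
    """
    reaped = []
    while True:
        try:
            pid, status = os.waitpid(-1, os.WNOHANG)
        except OSError:
            break            # ECHILD: no children at all
        if pid == 0:
            break            # children exist, but none has exited yet
        reaped.append((pid, status))
    return reaped
```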
>
> On Fri 25 Jan 2013 12:30:39 PM CST, Royce Lv wrote:
>>
>> Hi,
>> I reproduced this issue, and I believe it's a python bug.
>> 1. How to reproduce:
>> With the attached test case: put it under /usr/share/vdsm/tests/ and
>> run # ./run_tests.sh superVdsmTests.py
>> and the issue will be reproduced.
>> 2. Log analysis:
>> We notice a strange pattern in this log: connectStorageServer is
>> called twice; the first supervdsm call succeeds, the second fails
>> because of validateAccess().
>> That is because the first validateAccess call returns normally
>> and leaves a child process behind. When the second validateAccess
>> call arrives and the multiprocessing manager is receiving the method
>> message, that is just when the first child exits and SIGCHLD arrives.
>> The signal interrupts the manager's blocking recv() system call;
>> python's managers.py should handle EINTR and retry recv() like we do
>> in vdsm, but it does not, so the second call raises an error.
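The retry that vdsm does (and managers.py here does not) amounts to a loop like the sketch below; the helper name is made up for illustration and is not vdsm's actual code. (Python 3.5+, via PEP 475, retries interrupted system calls automatically, which removes this whole class of bug.)

```python
import errno

def retry_on_eintr(func, *args, **kwargs):
    """Call func(), retrying whenever the underlying system call is
    interrupted by a signal such as SIGCHLD.

    On Python 2.x an interrupted call surfaces as IOError/OSError with
    errno == EINTR; anything else is re-raised unchanged.
    """
    while True:
        try:
            return func(*args, **kwargs)
        except (IOError, OSError) as e:
            if e.errno != errno.EINTR:
                raise

# With such a wrapper, the manager's receive would be written as
#     request = retry_on_eintr(recv)
# instead of a bare recv().
```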
>> >Thread-18::DEBUG::2013-01-22
>> 10:41:03,570::misc::85::Storage.Misc.excCmd::(<lambda>)
>> '/usr/bin/sudo -n /bin/mount -t nfs -o
>> soft,nosharecache,timeo=600,retrans=6,nfsvers=3
>> 192.168.0.1:/ovirt/silvermoon
>> /rhev/data-center/mnt/192.168.0.1:_ovirt_silvermoon' (cwd None)
>> >Thread-18::DEBUG::2013-01-22
>> 10:41:03,607::misc::85::Storage.Misc.excCmd::(<lambda>)
>> '/usr/bin/sudo -n /bin/mount -t nfs -o
>> soft,nosharecache,timeo=600,retrans=6,nfsvers=3
>> 192.168.0.1:/ovirt/undercity
>> /rhev/data-center/mnt/192.168.0.1:_ovirt_undercity' (cwd None)
>> >Thread-18::ERROR::2013-01-22
>> 10:41:03,627::hsm::2215::Storage.HSM::(connectStorageServer) Could
>> not connect to storageServer
>> >Traceback (most recent call last):
>> >  File "/usr/share/vdsm/storage/hsm.py", line 2211, in connectStorageServer
>> >    conObj.connect()
>> >  File "/usr/share/vdsm/storage/storageServer.py", line 303, in connect
>> >    return self._mountCon.connect()
>> >  File "/usr/share/vdsm/storage/storageServer.py", line 209, in connect
>> >    fileSD.validateDirAccess(self.getMountObj().getRecord().fs_file)
>> >  File "/usr/share/vdsm/storage/fileSD.py", line 55, in validateDirAccess
>> >    (os.R_OK | os.X_OK))
>> >  File "/usr/share/vdsm/supervdsm.py", line 81, in __call__
>> >    return callMethod()
>> >  File "/usr/share/vdsm/supervdsm.py", line 72, in <lambda>
>> >    **kwargs)
>> >  File "<string>", line 2, in validateAccess
>> >  File "/usr/lib64/python2.6/multiprocessing/managers.py", line 740, in _callmethod
>> >    raise convert_to_error(kind, result)
>> The vdsm side receives a RemoteError because the supervdsm server's
>> multiprocessing manager raised an error of kind '#TRACEBACK':
>> >RemoteError:
>> The upper part is the traceback from the client side; the following
>> part is from the server side:
>> >---------------------------------------------------------------------------
>>
>> >Traceback (most recent call last):
>> > File "/usr/lib64/python2.6/multiprocessing/managers.py", line
>> 214, in serve_client
>> > request = recv()
>> >IOError: [Errno 4] Interrupted system call
>> >---------------------------------------------------------------------------
>>
>>
>> Corresponding Python source code: managers.py (server side)
>>     def serve_client(self, conn):
>>         '''
>>         Handle requests from the proxies in a particular process/thread
>>         '''
>>         util.debug('starting server thread to service %r',
>>                    threading.current_thread().name)
>>         recv = conn.recv
>>         send = conn.send
>>         id_to_obj = self.id_to_obj
>>         while not self.stop:
>>             try:
>>                 methodname = obj = None
>>                 request = recv()   <------ this line was interrupted by SIGCHLD
>>                 ident, methodname, args, kwds = request
>>                 obj, exposed, gettypeid = id_to_obj[ident]
>>                 if methodname not in exposed:
>>                     raise AttributeError(
>>                         'method %r of %r object is not in exposed=%r' %
>>                         (methodname, type(obj), exposed)
>>                         )
>>                 function = getattr(obj, methodname)
>>                 try:
>>                     res = function(*args, **kwds)
>>                 except Exception, e:
>>                     msg = ('#ERROR', e)
>>                 else:
>>                     typeid = gettypeid and gettypeid.get(methodname, None)
>>                     if typeid:
>>                         rident, rexposed = self.create(conn, typeid, res)
>>                         token = Token(typeid, self.address, rident)
>>                         msg = ('#PROXY', (rexposed, token))
>>                     else:
>>                         msg = ('#RETURN', res)
>>             except AttributeError:
>>                 if methodname is None:
>>                     msg = ('#TRACEBACK', format_exc())
>>                 else:
>>                     try:
>>                         fallback_func = self.fallback_mapping[methodname]
>>                         result = fallback_func(
>>                             self, conn, ident, obj, *args, **kwds
>>                             )
>>                         msg = ('#RETURN', result)
>>                     except Exception:
>>                         msg = ('#TRACEBACK', format_exc())
>>
>>             except EOFError:
>>                 util.debug('got EOF -- exiting thread serving %r',
>>                            threading.current_thread().name)
>>                 sys.exit(0)
>>
>>             except Exception:   <------ does not handle IOError/EINTR;
>>                                         should retry recv() here
>>                 msg = ('#TRACEBACK', format_exc())
>>
>>
>> 3. Actions we will take:
>> (1) As a workaround, we can first remove the zombie reaper from the
>> supervdsm server.
>> (2) I'll see whether python has a fixed version for this.
>> (3) Yaniv is working on changing the vdsm/svdsm communication channel
>> to a pipe and handling it ourselves; I believe we'll get rid of this
>> once that is properly handled.
>>
>>
>> On 01/25/2013 06:00 AM, Dead Horse wrote:
>>> Tried some manual edits to SD states in the dbase. The net result was
>>> I was able to get a node active. However, when reconstructing the
>>> master storage domain kicked in, it was unable to do so. It was also
>>> not able to recognize the other SDs, with similar failure modes to the
>>> unrecognized master above. Guessing the newer VDSM version borked
>>> things pretty good. So, as this is a test harness and the SD data is
>>> not worth saving, I just smoked all the SDs, ran engine-cleanup, and
>>> started fresh; all is well now.
>>>
>>> - DHC
>>>
>>>
>>> On Thu, Jan 24, 2013 at 11:53 AM, Dead Horse
>>> <deadhorseconsulting(a)gmail.com
>>> <mailto:deadhorseconsulting@gmail.com>> wrote:
>>>
>>> This test harness setup here consists of two servers tied to NFS
>>> storage via IB (NFS mounts are via IPoIB; NFS over RDMA is
>>> disabled). All storage domains are NFS. The issue does occur
>>> with both servers when attempting to bring them out of
>>> maintenance mode, with the end result being non-operational due to
>>> storage attach failure.
>>>
>>> The current issue now is that with a working older commit the
>>> master storage domain is "stuck" in state "locked", and I see a
>>> secondary issue wherein VDSM cannot seem to find or contact the
>>> master storage domain even though it is there. I can mount the
>>> master storage domain manually, and all content appears to be
>>> accounted for accordingly on either host.
>>>
>>> Here is the current contents of the master storage domain
>>> metadata:
>>> CLASS=Data
>>> DESCRIPTION=orgrimmar
>>> IOOPTIMEOUTSEC=1
>>> LEASERETRIES=3
>>> LEASETIMESEC=5
>>> LOCKPOLICY=
>>> LOCKRENEWALINTERVALSEC=5
>>> MASTER_VERSION=417
>>> POOL_DESCRIPTION=Azeroth
>>>
POOL_DOMAINS=0549ee91-4498-4130-8c23-4c173b5c0959:Active,d8b55105-c90a-465d-9803-8130da9a671e:Active,67534cca-1327-462a-b455-a04464084b31:Active,c331a800-839d-4d23-9059-870a7471240a:Active,f8984825-ff8d-43d9-91db-0d0959f8bae9:Active,c434056e-96be-4702-8beb-82a408a5c8cb:Active,f7da73c7-b5fe-48b6-93a0-0c773018c94f:Active,82e3b34a-6f89-4299-8cd8-2cc8f973a3b4:Active,e615c975-6b00-469f-8fb6-ff58ae3fdb2c:Active,5bc86532-55f7-4a91-a52c-fad261f322d5:Active,1130b87a-3b34-45d6-8016-d435825c68ef:Active
>>>
>>> POOL_SPM_ID=1
>>> POOL_SPM_LVER=6
>>> POOL_UUID=f90a0d1c-06ca-11e2-a05b-00151712f280
>>> REMOTE_PATH=192.168.0.1:/ovirt/orgrimmar
>>> ROLE=Master
>>> SDUUID=67534cca-1327-462a-b455-a04464084b31
>>> TYPE=NFS
>>> VERSION=3
>>> _SHA_CKSUM=1442bb078fd8c9468d241ff141e9bf53839f0721
>>>
>>> So now with the older working commit I get the
>>> "StoragePoolMasterNotFound: Cannot find master domain" error
>>> (prior details above when I worked backwards to that commit).
>>>
>>> This is odd as the nodes can definitely reach the master storage
>>> domain:
>>>
>>> showmount from one of the el6.3 nodes:
>>> [root@kezan ~]# showmount -e 192.168.0.1
>>> Export list for 192.168.0.1:
>>> /ovirt/orgrimmar 192.168.0.0/16
>>>
>>> mount/ls from one of the nodes:
>>> [root@kezan ~]# mount 192.168.0.1:/ovirt/orgrimmar /mnt
>>> [root@kezan ~]# ls -al
>>> /mnt/67534cca-1327-462a-b455-a04464084b31/dom_md/
>>> total 1100
>>> drwxr-xr-x 2 vdsm kvm 4096 Jan 24 11:44 .
>>> drwxr-xr-x 5 vdsm kvm 4096 Oct 19 16:16 ..
>>> -rw-rw---- 1 vdsm kvm 1048576 Jan 19 22:09 ids
>>> -rw-rw---- 1 vdsm kvm 0 Sep 25 00:46 inbox
>>> -rw-rw---- 1 vdsm kvm 2097152 Jan 10 13:33 leases
>>> -rw-r--r-- 1 vdsm kvm 903 Jan 10 13:39 metadata
>>> -rw-rw---- 1 vdsm kvm 0 Sep 25 00:46 outbox
>>>
>>>
>>> - DHC
>>>
>>>
>>>
>>> On Thu, Jan 24, 2013 at 7:51 AM, ybronhei <ybronhei(a)redhat.com
>>> <mailto:ybronhei@redhat.com>> wrote:
>>>
>>> On 01/24/2013 12:44 AM, Dead Horse wrote:
>>>
>>> I narrowed down on the commit where the originally
>>> reported issue crept in:
>>> commit fc3a44f71d2ef202cff18d7203b9e4165b546621
>>> Building and testing with this commit or subsequent
>>> commits yields the original issue.
>>>
>>> Interesting.. it might be related to this commit, and we're
>>> trying to reproduce it.
>>>
>>> Did you try to remove that code and run again? Does it work
>>> without the addition of the zombieReaper?
>>> Does the connectivity to the storage work well? When you run
>>> 'ls' on the mounted folder, do you see the files without a
>>> long delay? It might be related to a too-long timeout when
>>> validating access to this mount.
>>> We are working on that.. any additional info can help.
>>>
>>> Thanks.
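As a rough illustration of bounding such a validation (this is not vdsm's actual validateDirAccess; the helper below is made up, and signal.alarm only works from the main thread), one could wrap the access check in an alarm-based timeout so a hung NFS mount fails fast instead of stalling:

```python
import os
import signal

class AccessTimeout(Exception):
    pass

def check_dir_access(path, mode=os.R_OK | os.X_OK, timeout=5):
    """Return whether `path` is accessible with `mode`, raising
    AccessTimeout if the check hangs (e.g. on a dead NFS mount).
    Main-thread only: uses SIGALRM to bound the call.
    """
    def on_alarm(signum, frame):
        raise AccessTimeout("access check on %r timed out" % path)

    old_handler = signal.signal(signal.SIGALRM, on_alarm)
    signal.alarm(timeout)
    try:
        return os.access(path, mode)
    finally:
        signal.alarm(0)                       # cancel the pending alarm
        signal.signal(signal.SIGALRM, old_handler)
```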
>>>
>>>
>>> - DHC
>>>
>>>
>>> On Wed, Jan 23, 2013 at 3:56 PM, Dead Horse
>>> <deadhorseconsulting(a)gmail.com
>>> <mailto:deadhorseconsulting@gmail.com>>wrote:
>>>
>>> Indeed reverting back to an older vdsm clears up the
>>> above issue. However,
>>> now the issue I see is:
>>> Thread-18::ERROR::2013-01-23
>>> 15:50:42,885::task::833::TaskManager.Task::(_setError)
>>> Task=`08709e68-bcbc-40d8-843a-d69d4df40ac6`::Unexpected
>>> error
>>>
>>> Traceback (most recent call last):
>>>   File "/usr/share/vdsm/storage/task.py", line 840, in _run
>>>     return fn(*args, **kargs)
>>>   File "/usr/share/vdsm/logUtils.py", line 42, in wrapper
>>>     res = f(*args, **kwargs)
>>>   File "/usr/share/vdsm/storage/hsm.py", line 923, in connectStoragePool
>>>     masterVersion, options)
>>>   File "/usr/share/vdsm/storage/hsm.py", line 970, in _connectStoragePool
>>>     res = pool.connect(hostID, scsiKey, msdUUID, masterVersion)
>>>   File "/usr/share/vdsm/storage/sp.py", line 643, in connect
>>>     self.__rebuild(msdUUID=msdUUID, masterVersion=masterVersion)
>>>   File "/usr/share/vdsm/storage/sp.py", line 1167, in __rebuild
>>>     self.masterDomain = self.getMasterDomain(msdUUID=msdUUID,
>>>                                              masterVersion=masterVersion)
>>>   File "/usr/share/vdsm/storage/sp.py", line 1506, in getMasterDomain
>>>     raise se.StoragePoolMasterNotFound(self.spUUID, msdUUID)
>>> StoragePoolMasterNotFound: Cannot find master domain:
>>> 'spUUID=f90a0d1c-06ca-11e2-a05b-00151712f280,
>>> msdUUID=67534cca-1327-462a-b455-a04464084b31'
>>> Thread-18::DEBUG::2013-01-23
>>> 15:50:42,887::task::852::TaskManager.Task::(_run)
>>> Task=`08709e68-bcbc-40d8-843a-d69d4df40ac6`::Task._run:
>>> 08709e68-bcbc-40d8-843a-d69d4df40ac6
>>> ('f90a0d1c-06ca-11e2-a05b-00151712f280', 2,
>>> 'f90a0d1c-06ca-11e2-a05b-00151712f280',
>>> '67534cca-1327-462a-b455-a04464084b31', 433) {}
>>> failed - stopping task
>>>
>>> This is with vdsm built from
>>> commit 25a2d8572ad32352227c98a86631300fbd6523c1
>>> - DHC
>>>
>>>
>>> On Wed, Jan 23, 2013 at 10:44 AM, Dead Horse <
>>> deadhorseconsulting(a)gmail.com
>>> <mailto:deadhorseconsulting@gmail.com>> wrote:
>>>
>>> VDSM was built from:
>>> commit 166138e37e75767b32227746bb671b1dab9cdd5e
>>>
>>> Attached is the full vdsm log
>>>
>>> I should also note that from engine perspective
>>> it sees the master
>>> storage domain as locked and the others as
>>> unknown.
>>>
>>>
>>> On Wed, Jan 23, 2013 at 2:49 AM, Dan Kenigsberg
>>> <danken(a)redhat.com
>>> <mailto:danken@redhat.com>>wrote:
>>>
>>> On Tue, Jan 22, 2013 at 04:02:24PM -0600,
>>> Dead Horse wrote:
>>>
>>> Any ideas on this one? (from VDSM log):
>>> Thread-25::DEBUG::2013-01-22
>>> 15:35:29,065::BindingXMLRPC::914::vds::(wrapper) client
>>> [3.57.111.30]::call getCapabilities with () {}
>>> Thread-25::ERROR::2013-01-22
>>> 15:35:29,113::netinfo::159::root::(speed) cannot read ib0 speed
>>> Traceback (most recent call last):
>>>   File "/usr/lib64/python2.6/site-packages/vdsm/netinfo.py",
>>>   line 155, in speed
>>>     s = int(file('/sys/class/net/%s/speed' % dev).read())
>>> IOError: [Errno 22] Invalid argument
>>>
>>> Causes VDSM to fail to attach storage
>>>
>>> Causes VDSM to fail to attach storage
>>>
>>>
>>> I doubt that this is the cause of the
>>> failure, as vdsm has always
>>> reported "0" for ib devices, and still does.
>>>
>>> It happens only when you call getCapabilities, so it
>>> isn't related to the flow, and it can't affect the storage.
>>> Dan: I guess this is not the issue, but why the IOError?
>>>
>>>
>>> Does a former version work with your Engine?
>>> Could you share more of your vdsm.log? I
>>> suppose the culprit lies in
>>> one of the storage-related commands, not in
>>> statistics retrieval.
>>>
>>>
>>> Engine side sees:
>>> ERROR [org.ovirt.engine.core.bll.storage.NFSStorageHelper]
>>> (QuartzScheduler_Worker-96) [553ef26e]
>>> The connection with details
>>> 192.168.0.1:/ovirt/ds failed because of
>>> error code 100 and error message
>>> is: general exception
>>> 2013-01-22 15:35:30,160 INFO
>>> [org.ovirt.engine.core.bll.SetNonOperationalVdsCommand]
>>> (QuartzScheduler_Worker-96) [1ab78378]
>>> Running command:
>>> SetNonOperationalVdsCommand internal:
>>> true. Entities affected : ID:
>>> 8970b3fe-1faf-11e2-bc1f-00151712f280
>>> Type: VDS
>>> 2013-01-22 15:35:30,200 INFO
>>> [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand]
>>> (QuartzScheduler_Worker-96) [1ab78378]
>>> START,
>>> SetVdsStatusVDSCommand(HostName = kezan,
>>> HostId =
>>> 8970b3fe-1faf-11e2-bc1f-00151712f280,
>>> status=NonOperational,
>>> nonOperationalReason=STORAGE_DOMAIN_UNREACHABLE),
>>> log id: 4af5c4cd
>>> 2013-01-22 15:35:30,211 INFO
>>> [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand]
>>> (QuartzScheduler_Worker-96) [1ab78378]
>>> FINISH, SetVdsStatusVDSCommand, log id: 4af5c4cd
>>> 2013-01-22 15:35:30,242 ERROR
>>> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
>>> (QuartzScheduler_Worker-96) [1ab78378]
>>> Try to add duplicate audit log
>>> values with the same name. Type:
>>> VDS_SET_NONOPERATIONAL_DOMAIN. Value:
>>> storagepoolname
>>>
>>> Engine = latest master
>>> VDSM = latest master
>>>
>>>
>>> Since "latest master" is an unstable
>>> reference by definition, I'm sure
>>> that history would thank you if you posted the
>>> exact version (git hash?)
>>> of the code.
>>>
>>> node = el6
>>>
>>>
>>> _______________________________________________
>>> Users mailing list
>>> Users(a)ovirt.org <mailto:Users@ovirt.org>
>>>
http://lists.ovirt.org/mailman/listinfo/users
>>>
>>>
>>>
>>> --
>>> Yaniv Bronhaim.
>>> RedHat, Israel
>>> 09-7692289
>>> 054-7744187
>>>
>>>
>>>
>>
>>
>>
>>
>
>
_______________________________________________
vdsm-devel mailing list
vdsm-devel(a)lists.fedorahosted.org
https://lists.fedorahosted.org/mailman/listinfo/vdsm-devel
--
Sheldon Feng(冯少合)<shaohef(a)linux.vnet.ibm.com>
IBM Linux Technology Center