On 02/04/11 - 10:54:45AM, Jan Provaznik wrote:
On 02/03/2011 06:38 PM, Chris Lalancette wrote:
>On 02/03/11 - 03:53:59PM, jprovazn(a)redhat.com wrote:
>>From: Jan Provaznik<jprovazn(a)redhat.com>
>>
>>While-true loop w/o any sleep eats all system resources.
>
>Hm, I'm not sure that this patch is what is needed. The fact of the matter is
>that in both condor_refreshd and dbomatic, they should be blocking based on
>events. While your patch will stop them from consuming all resources, it may
>mask an underlying problem.
>
>condor_refreshd should block indefinitely waiting from an update from condor
>(which should only happen once every 15 minutes). Similarly, dbomatic should
>block indefinitely waiting on a change to EventLog (which again, shouldn't
>happen that often).
>
>What's the failure scenario that you saw? Is it repeatable?
>
Actually sleep doesn't mask any problem (wrapping in rescue w/o
logging is example of masking), it just reduces "damages" of
while-true loop -> especially filling disk with big logfile.
OK. So the first bug is that we should never have a failure that doesn't
print an error. I know we have periodically had reports of this, but someone
needs to track down exactly why that happens sometimes. We should always at
least print an error.
condor_refreshd - weakness is in calling condormatic_classads_sync
before waiting for data on socket. Examples of failing scenarios:
- cloud_account has nil instance_key
- DB is not initialized (there are no tables in DB) when you are
starting the script
OK, I see what can happen. This is a bug, though, and we should just fix it
instead of putting in sleeps. My first reaction to this would be to move the
first condormatic_classads_sync outside of the while true loop, and then add
another call to it at the very bottom of the loop (after the select returns).
That way, if the initial one fails, we still go into the select, wait 60
seconds, and try again. We'll get the sleep for "free", while still
waiting
on the select socket.
dbomatic - haven't caught error with this, but only because
notifier
apparently wraps code in its own rescue block, but it doesn't matter
- generally you can't rely on that code in 'while block' won't crash
or that it will be blocked by waiting for data so I believe it's
always good idea to add sleep into busy-waiting blocks.
The thing is, I actually prefer not to have the sleep, because it shows you
the problem faster so you can fix it :).
--
Chris Lalancette