On 02/04/2011 11:13 PM, Chris Lalancette wrote:
On 02/04/11 - 10:54:45AM, Jan Provaznik wrote:
> On 02/03/2011 06:38 PM, Chris Lalancette wrote:
>> On 02/03/11 - 03:53:59PM, jprovazn(a)redhat.com wrote:
>>> From: Jan Provaznik<jprovazn(a)redhat.com>
>>>
>>> While-true loop w/o any sleep eats all system resources.
>>
>> Hm, I'm not sure that this patch is what is needed. The fact of the matter
is
>> that in both condor_refreshd and dbomatic, they should be blocking based on
>> events. While your patch will stop them from consuming all resources, it may
>> mask an underlying problem.
>>
>> condor_refreshd should block indefinitely waiting from an update from condor
>> (which should only happen once every 15 minutes). Similarly, dbomatic should
>> block indefinitely waiting on a change to EventLog (which again, shouldn't
>> happen that often).
>>
>> What's the failure scenario that you saw? Is it repeatable?
>>
>
> Actually sleep doesn't mask any problem (wrapping in rescue w/o
> logging is example of masking), it just reduces "damages" of
> while-true loop -> especially filling disk with big logfile.
OK. So the first bug is that we should never have a failure that doesn't
print an error. I know we have periodically had reports of this, but someone
needs to track down exactly why that happens sometimes. We should always at
least print an error.
Of course, each error should be logged, but that's not the point.
Filling disk space was only one example. More info bellow.
>
> condor_refreshd - weakness is in calling condormatic_classads_sync
> before waiting for data on socket. Examples of failing scenarios:
> - cloud_account has nil instance_key
> - DB is not initialized (there are no tables in DB) when you are
> starting the script
OK, I see what can happen. This is a bug, though, and we should just fix it
instead of putting in sleeps. My first reaction to this would be to move the
first condormatic_classads_sync outside of the while true loop, and then add
another call to it at the very bottom of the loop (after the select returns).
That way, if the initial one fails, we still go into the select, wait 60
seconds, and try again. We'll get the sleep for "free", while still
waiting
on the select socket.
Sure, fixing these bugs is good but that's not the point too. Moving
condormatic_classads_sync to the end is good idea, but still not
sufficient - if select raises exception (for example pipe is broken) you
are still in same problem.
> dbomatic - haven't caught error with this, but only because
notifier
> apparently wraps code in its own rescue block, but it doesn't matter
> - generally you can't rely on that code in 'while block' won't crash
> or that it will be blocked by waiting for data so I believe it's
> always good idea to add sleep into busy-waiting blocks.
The thing is, I actually prefer not to have the sleep, because it shows you
the problem faster so you can fix it :).
Don't understand this, what do you mean by "it shows you the problem
faster"? I still don't understand your averse to using sleep in rescue
block.
In general you should always use sleep in while-true where whole loop is
wrapped in exception block.
while true
begin
# some code, no matter what, everytime something can be broken
rescue
sleep # this prevents script to be wasteful if exception is raised
end
end
More info to our code. Relying on blocking feature of 'notifier.run' and
'select' methods is bad idea - it can throw exception too.
Dbomatic:
while true
begin
notifier.run
rescue => e
logger.error "#{e.backtrace.shift}: #{e.message}"
e.backtrace.each do |step|
logger.error "\tfrom #{step}"
end
sleep 5
end
end
As you noticed in some previous mail - notifier.run should run all the
time. But what if notifier.run raises exception (no matter why)? Then
w/o using sleep whole server will have problems (disk space because of
logging, CPU load,...). With putting sleep into rescue block (code
above) you prevent server from being overloaded because of while-true loop.
condor_refreshd (with change you suggested and sleep):
while true
begin
results = select([socket], nil, nil, timeout * 60)
# if results was nil, we timed out waiting, and we need to do a sync
# but not read from the socket
if not results.nil? and results[0][0] == socket
socket.recvfrom(1024)
logger.info "Doing classad sync after getting data from the socket"
else
logger.info "Doing classad sync after timeout"
end
condormatic_classads_sync
rescue => e
logger.error "#{e.backtrace.shift}: #{e.message}"
e.backtrace.each do |step|
logger.error "\tfrom #{step}"
end
sleep 5
end
end
In this case, what if select raises exception? W/o calling sleep same
problems again. With sleep, you are OK.
Jan