On 01/23/2014 04:04 PM, Vivek Goyal wrote:
On Thu, Jan 23, 2014 at 10:53:51AM +0100, Marek Grac wrote:
[..]
>> I think this is a problem. How would we know in advance how long
>> the dump will take to finish? It will vary depending on so many
>> things (size of memory, speed of network, etc.).
> You don't need to know this in advance. This is set on the cluster
> side, and the administrator should be able to set this timeout to a
> proper value.
How would the cluster admin know how long it will take to save the dump
and what the right value for this parameter is?
Documentation, but mainly it is a matter of experience and testing. It
was the same in previous versions.
>> Why can't this value be very high by default? Or this value can act
>> more like a watchdog. As long as you keep getting ticks, you keep
>> resetting an internal counter. If you don't get a tick (a message from
>> the node which is saving the vmcore) for 60 seconds, then you assume
>> that something went wrong with the node and power cycle it.
>>
>> Trying to keep an upper limit of 60 seconds and assuming the dump will
>> finish in this time will not help.
> This is a general fence agent setting in the cluster, and fence_kdump
> is the only agent that uses a 'ticking' mechanism; all others should
> finish in a much more fixed time. Setting this value for the kdump
> agent is fine, as fence_kdump itself contains a different timeout
> mechanism which is based on 'ticks'. I agree that it should be
> explained in documentation/kbase, but it is not something that can be
> changed at the fence agent level.
So are you saying that the 60 seconds above is not the total time taken
to dump? Instead it is the duration in which at least one message from
fence_kdump should be received, and the timer will reset. And it should
receive another message within 60 seconds, and it keeps going like
this.
IOW, as long as fence_kdump keeps sending a message to the manager/nodes
every 60 seconds, theoretically the dump could take infinitely long?
Nope. The default is 60 seconds for a fence agent; after that the cluster
decides that the agent has failed. This is tunable.
If you set this value to a really high number (like 1 day), it will
work with fence_kdump, because if there is no 'tick' fence_kdump fails on
its own and the cluster timeout is never applied. In general we can say
that the admin can set it to such a high number without risk. But if
there is a problem in fence_kdump (we believe there is not), it is
possible that the node will continue and potentially destroy data. I
wanted to add a link to a fence_kdump technical paper, but unfortunately
it is not online anymore (I will contact the author).
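
To make the interplay of the two timeouts easier to picture, here is a
rough sketch in Python. It is only a toy model of what is described
above, not code from fence_kdump or from the cluster stack: the function
name, its parameters and the returned strings are all made up for
illustration, and the case where the dump finishes normally is not
modelled.

```python
def fence_outcome(tick_times, tick_timeout, cluster_timeout):
    """Toy model: which limit ends the fence agent run first?

    tick_times      -- seconds since agent start at which a 'tick' arrives
    tick_timeout    -- agent-side limit on the gap between ticks (assumed)
    cluster_timeout -- cluster-side limit on the whole agent run
                       (60 seconds by default, tunable)
    All names here are hypothetical; this is not the real implementation.
    """
    last_tick = 0.0  # treat the agent start as the first reference point
    for t in sorted(tick_times):
        if t > cluster_timeout:
            break  # the cluster has already given up on the agent
        if t - last_tick > tick_timeout:
            break  # the tick arrived too late, the agent already failed
        last_tick = t  # each tick in time resets the agent-side timer

    tick_deadline = last_tick + tick_timeout
    if tick_deadline <= cluster_timeout:
        return ("agent failed: no tick", tick_deadline)
    return ("cluster timeout", cluster_timeout)


# Made-up scenario: a tick every 10 seconds for 5 minutes, then silence.
ticks = list(range(10, 301, 10))
print(fence_outcome(ticks, tick_timeout=60, cluster_timeout=60))
print(fence_outcome(ticks, tick_timeout=60, cluster_timeout=86400))
```

With these made-up numbers, the first call is cut off by the cluster
timeout at 60 seconds even though ticks are still arriving, while the
second call (cluster timeout of 1 day) ends only when the ticks stop,
at 360 seconds.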
m,