On 01/13/2014 11:23 AM, WANG Chao wrote:
This is a patchset to add fence kdump support.
In a cluster environment, fence kdump is used to notify all the other
nodes that the current node has crashed, which stops it from being fenced off.
The patchset has the following features:
1. rebuild the kdump initrd based on the timestamp of the fence kdump config
or the cluster configuration.
2. set up the required working environment for fence kdump in the 2nd kernel.
3. fence_kdump_send notifies the other nodes, before the dumping process
starts, to stop the crashed node from being fenced off.
4. add kdump-in-cluster-environment.txt
Hi,
I have tested this patch in my cluster environment (2-node virtual
cluster) and it works correctly. Great work, guys.
There are some steps which are not intuitive enough, so I'm including my
test scenario:
1) standard installation and configuration of kexec-tools
2) standard installation of corosync/pacemaker cluster
3) set up fence_kdump [the integration is not seamless, but the problem is
not in kexec-tools]
pcs stonith update myfence pcmk_monitor_action=metadata --force
pcs stonith update myfence pcmk_status_action=metadata --force
pcs stonith update myfence pcmk_reboot_action=off --force
if you have a lot of memory, you should set fence_kdump to wait longer
(default 60 seconds)
pcs stonith update myfence pcmk_reboot_timeout=600 --force
this is an example output of my fence agent (pcs stonith show myfence):
Resource: myfence (class=stonith type=fence_kdump)
Attributes: pcmk_host_list="r7a r7b" pcmk_host_check=static-list
pcmk_monitor_action=metadata pcmk_status_action=metadata
pcmk_reboot_action=off
Operations: monitor interval=60s (myfence-monitor-interval-60s)
4) kdumpctl restart (on both nodes)
5) on node B run: echo c > /proc/sysrq-trigger
6) on node A check /var/log/messages, you should find there:
Jan 13 12:24:39 nodeA fence_kdump[10862]: waiting for message from '192.168.122.52'
Jan 13 12:24:41 nodeA fence_kdump[10862]: received valid message from '192.168.122.52'
This means that nodeB has executed fence_kdump_send.
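If you want to script the check from step 6, something like the following
could work. It is only a rough sketch: the function name
fence_kdump_msg_received and its arguments are made up for illustration and
are not part of kexec-tools or fence-agents, and it assumes that each syslog
entry occupies a single line in the log file and that the message text
matches the entries quoted in step 6.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with kexec-tools or fence-agents).
# usage: fence_kdump_msg_received LOGFILE IP
# Returns 0 if LOGFILE contains a fence_kdump entry saying that a valid
# message was received from IP, non-zero otherwise.
fence_kdump_msg_received() {
    logfile=$1
    ip=$2
    # First grep: -F treats the pattern as a fixed string, so the dots in
    # the IP address are matched literally; the closing quote keeps
    # 192.168.122.5 from matching 192.168.122.52.
    # Second grep: keep only lines logged by fence_kdump[PID].
    grep -F "received valid message from '$ip'" "$logfile" \
        | grep 'fence_kdump\[' > /dev/null
}
```

On node A it could be used like this (the log path and address are the ones
from my setup):

    fence_kdump_msg_received /var/log/messages 192.168.122.52 && echo "nodeB sent fence_kdump message"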
m,