On Mon, May 26, 2014 at 03:30:42PM +0800, WANG Chao wrote:
When the kdump service starts with an ssh host as the dump target, it
connects to the ssh host after network-online.target and creates the dump
directory there, to make sure the host is ready to receive a dump.
However, even after network-online.target has been reached, the particular
network resource we are interested in may not yet be ready for connecting
to the specified ssh host, so the connection attempt fails.
What we should really do is wait for that specific network resource
instead of relying entirely on network-online.target, but that is
relatively complicated to implement. A simple and direct solution is to
retry the connection to the configured ssh host as many times as needed.
To avoid an infinite loop, we give up and fail after a timeout. I set this
timeout to 180 seconds; generally speaking, 180 seconds should be enough
for almost any kind of network to come up and be ready.
Signed-off-by: WANG Chao <chaowang(a)redhat.com>
---
kdumpctl | 15 +++++++++++++--
1 file changed, 13 insertions(+), 2 deletions(-)
diff --git a/kdumpctl b/kdumpctl
index 9cae0c4..0bd6021 100755
--- a/kdumpctl
+++ b/kdumpctl
@@ -381,8 +381,19 @@ function check_ssh_config()
function check_ssh_target()
{
local _ret
- ssh -q -i $SSH_KEY_LOCATION -o BatchMode=yes $DUMP_TARGET mkdir -p $SAVE_PATH
- _ret=$?
+ local _start _delta
+
+ # Time out after 180 seconds; hopefully that is enough.
+ _start=$(date +%s)
+ while : ; do
+ ssh -q -i $SSH_KEY_LOCATION -o BatchMode=yes $DUMP_TARGET mkdir -p $SAVE_PATH
+ _ret=$?
+ _delta=$(($(date +%s) - $_start))
+ if [[ $_ret -eq 0 || $_delta -gt 180 ]]; then
+ break
+ fi
+ done
+
Hi Chao,
A few comments.
- I think we should sleep for a while before retrying ssh, say for 2
seconds.
- I think we need to print a brief message when retrying as well as when
giving up. Something like:
"ssh to $target failed. Will retry after 2 seconds"
"ssh to $target failed after multiple tries."
- We need to define the 180-second timeout in kdump-lib.sh and use it
everywhere.
- We also have ssh operations in dracut-kdump.sh, so shouldn't this retry
logic apply everywhere and not just in kdumpctl? The same issue will
arise in the second kernel if the network is not up, won't it?
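For illustration, the first three suggestions could be combined roughly as follows. This is a minimal sketch under stated assumptions: the variables `KDUMP_SSH_RETRY_TIMEOUT` and `KDUMP_SSH_RETRY_INTERVAL` and the helper `retry_ssh_cmd` are hypothetical names (they do not exist in kdump-lib.sh), and the messages are the ones proposed above.

```shell
#!/bin/bash
# --- would go in kdump-lib.sh (hypothetical names) ---
# One shared timeout so kdumpctl and other callers agree on the value;
# the ${VAR:-default} form lets a caller override it beforehand.
KDUMP_SSH_RETRY_TIMEOUT=${KDUMP_SSH_RETRY_TIMEOUT:-180}   # seconds
KDUMP_SSH_RETRY_INTERVAL=${KDUMP_SSH_RETRY_INTERVAL:-2}   # seconds between tries

# Hypothetical helper.
# Usage: retry_ssh_cmd TARGET COMMAND [ARGS...]
# Runs COMMAND until it succeeds or the shared timeout expires, sleeping
# between attempts and reporting each retry and the final failure on stderr.
# Returns 0 on success, else the exit status of the last attempt.
retry_ssh_cmd()
{
    local _target=$1
    shift
    local _start _ret _elapsed

    _start=$(date +%s)
    while : ; do
        "$@"
        _ret=$?
        if [ "$_ret" -eq 0 ]; then
            return 0
        fi

        _elapsed=$(( $(date +%s) - _start ))
        if [ "$_elapsed" -ge "$KDUMP_SSH_RETRY_TIMEOUT" ]; then
            echo "ssh to $_target failed after multiple tries." >&2
            return "$_ret"
        fi

        echo "ssh to $_target failed. Will retry after $KDUMP_SSH_RETRY_INTERVAL seconds" >&2
        sleep "$KDUMP_SSH_RETRY_INTERVAL"
    done
}
```

A call site in check_ssh_target() might then read `retry_ssh_cmd "$DUMP_TARGET" ssh -q -i "$SSH_KEY_LOCATION" -o BatchMode=yes "$DUMP_TARGET" mkdir -p "$SAVE_PATH"`. Assuming dracut-kdump.sh can source the same library in the second kernel, it could wrap its ssh calls the same way, which addresses the last point as well.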
Thanks
Vivek
if [ $_ret -ne 0 ]; then
echo "Could not create $DUMP_TARGET:$SAVE_PATH, you probably need to run
\"kdumpctl propagate\"" >&2
return 1
--
1.9.3
_______________________________________________
kexec mailing list
kexec(a)lists.fedoraproject.org
https://lists.fedoraproject.org/mailman/listinfo/kexec