Currently, systemd uses 90s as the default mount unit timeout. In some cases that is not enough: the mount times out, which in turn makes the kdump dump fail, even though the device would actually become ready after a while. We've met several such issues.
So we add "x-systemd.device-timeout=600" (600s should be long enough) to the mount options as the default timeout if no "x-systemd.device-timeout=X" is specified. It can be overridden through the /etc/fstab mount options, so users can specify other timeout values if they want to.
Note: this is different from rd.timeout, which was introduced by the dracut initqueue.
Signed-off-by: Xunlei Pang <xlpang@redhat.com>
---
 mkdumprd | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/mkdumprd b/mkdumprd
index d3ecbd6..30f8ba6 100644
--- a/mkdumprd
+++ b/mkdumprd
@@ -104,6 +104,11 @@ to_mount() {
     _options=$(echo $_options | sed 's/noauto//')
     _options=${_options/#ro/rw} #mount fs target as rw in 2nd kernel
 
+    # use 600s as default systemd mount timeout if none
+    if ! strstr $_options "x-systemd.device-timeout"; then
+        _options="$_options,x-systemd.device-timeout=600"
+    fi
+
     _mntopts="$_target $_fstype $_options"
     #for non-nfs _dev converting to use udev persistent name
     if [ -b "$_source" ]; then
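The behaviour the patch describes can be tried outside mkdumprd with a small standalone sketch. It is only an illustration: the local strstr below is an assumed stand-in for the helper that mkdumprd uses, and the function name add_default_device_timeout and its optional second argument are hypothetical, not part of the patch.

```shell
#!/bin/bash
# Assumed stand-in for the strstr helper used by mkdumprd:
# succeeds if $2 occurs anywhere in $1.
strstr() {
    [ "${1#*"$2"}" != "$1" ]
}

# Append the default device timeout unless the options already carry one.
# The function name and the optional second argument are illustrative only.
add_default_device_timeout() {
    local _options="$1"
    local _default="${2:-600}"

    if ! strstr "$_options" "x-systemd.device-timeout"; then
        _options="$_options,x-systemd.device-timeout=$_default"
    fi
    echo "$_options"
}

# No timeout given: the 600s default is appended.
add_default_device_timeout "rw,relatime"
# Timeout already given (e.g. via /etc/fstab): the options stay untouched.
add_default_device_timeout "rw,x-systemd.device-timeout=120"
```

The first call prints rw,relatime,x-systemd.device-timeout=600 and the second prints its input unchanged, which is how a value from /etc/fstab takes precedence over the default.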
On 07/26/17 at 04:12pm, Xunlei Pang wrote:
> Note: this is different from rd.timeout, which was introduced by the dracut initqueue.
So what happens if rd.timeout=xxx is added at the same time?
-- 
1.8.3.1
_______________________________________________
kexec mailing list -- kexec@lists.fedoraproject.org
To unsubscribe send an email to kexec-leave@lists.fedoraproject.org
On 07/26/2017 at 04:51 PM, Baoquan He wrote:
> So what happens if rd.timeout=xxx is added at the same time?
If the device is not added to the initqueue, it doesn't matter. A multipath device, for example, has no wait_for_dev() call.
The systemd mount unit is a separate task from the dracut initqueue; I think they run in parallel.
Regards, Xunlei
On 07/26/2017 at 05:38 PM, Xunlei Pang wrote:
> If the device is not added to the initqueue, it doesn't matter. A multipath device, for example, has no wait_for_dev() call.
For devices such as LVM volumes, the default dracut rd.timeout will take effect before the systemd mount timeout once this patch is applied; I can't see any problem with that at present.
Maybe we can also define kdump's default values for rd.retry and rd.timeout, but I think that is a separate issue from the one this patch tries to solve.
Regards, Xunlei
On 07/26/17 at 05:54pm, Xunlei Pang wrote:
> Maybe we can also define kdump's default values for rd.retry and rd.timeout, but I think that is a separate issue from the one this patch tries to solve.
The kdump kernel is a thin version of the normal kernel and should need less time than the normal kernel. So if the normal kernel works, there is no reason for kdump to need more time; if more time is needed, the normal kernel needs it too.
Just a passing comment; Dave must have a better one.
On 07/26/2017 at 06:33 PM, Baoquan He wrote:
> The kdump kernel is a thin version of the normal kernel and should need less time than the normal kernel. So if the normal kernel works, there is no reason for kdump to need more time; if more time is needed, the normal kernel needs it too.
That is a good point; however, it is not always the case. According to our tests and the issues we've met, the timeout happened only under kdump.
The kdump boot and the normal boot are a bit different. For example, the first (root) mount may differ after the "remove root=X" feature on Fedora; different boot command lines (like nr_cpus=1) might affect the boot process; there are fewer components under kdump than under the normal kernel; the hardware state differs because of the hot reboot; and so on.
So my point is that the kdump mount happens at a different time/stage than the mount under the normal kernel. One notable fact is that the normal kernel's mount happens after switch-root to the real root fs, while the kdump kernel's mount happens at the initramfs stage.
Regards, Xunlei
On 07/26/2017 at 07:29 PM, Xunlei Pang wrote:
> So my point is that the kdump mount happens at a different time/stage than the mount under the normal kernel. One notable fact is that the normal kernel's mount happens after switch-root to the real root fs, while the kdump kernel's mount happens at the initramfs stage.
Another guess: with hardware iSCSI, say, on a normal reboot the hardware gets more time to become ready while the BIOS boots, whereas on a kexec boot it is not given enough time to get ready. Anyway, it would be good to have a hardware expert give us a reasonable explanation.
Regards, Xunlei
On 07/26/17 at 07:38pm, Xunlei Pang wrote:
> Different boot command lines (like nr_cpus=1) might affect the boot process,
Yes, kdump mostly boots with nr_cpus=1, and that matters for the parallel initialization of boot-time system services. However, kdump strips all unnecessary services and units, and that matters too: most of the services that run before the mount and the device wait are critical but quick ones, so parallelism may not gain much.
The most important thing is that 600 seconds is 10 minutes. That is so long that I believe almost everyone will think the boot has hung, and they will reboot without waiting 10 minutes to see whether any further hint is printed. With this change, for a long time we will keep getting reports that kdump hangs during boot without a clear reason or message. We will have to ask for the system information, try to reproduce the problem, and finally find out: 'Ah, OK, it isn't a hang, it's a mount timeout failure.'
> Another guess: with hardware iSCSI, say, on a normal reboot the hardware gets more time to become ready while the BIOS boots, whereas on a kexec boot it is not given enough time to get ready. Anyway, it would be good to have a hardware expert give us a reasonable explanation.
Yes, if possible, a worst-case time might be needed. It is like 273 degrees centigrade being absolute zero, just because it is the lowest known temperature in nature, at which substances solidify. We can use that worst case as the maximum timeout value; if one day it is exceeded, we can increase the timeout value, or ask the developers of the mounted device why it takes so much longer.
On 07/27/17 at 10:11am, Baoquan He wrote:
> Yes, if possible, a worst-case time might be needed. It is like 273
minus
> degrees centigrade being absolute zero, just because it is the lowest known temperature in nature, at which substances solidify.
On 07/27/2017 at 10:11 AM, Baoquan He wrote:
> We can use that worst case as the maximum timeout value; if one day it is exceeded, we can increase the timeout value, or ask the developers of the mounted device why it takes so much longer.
600s is based on the issue found on the large and slow DragonHawk machine; it is the slowest machine I've ever met, and 600s works for it.
We suggested a workaround to them: manually append the timeout option in /etc/fstab. We also published a KBase article to explain that.
Regards, Xunlei
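The /etc/fstab workaround mentioned above amounts to adding the timeout to the options field of the dump target's entry. A minimal sketch follows, assuming a made-up device and mount point; the helper name fstab_add_device_timeout is hypothetical, and the value 700 is just one of the figures discussed in this thread.

```shell
#!/bin/bash
# Append x-systemd.device-timeout=<seconds> to the options (4th) field of one
# fstab line, unless that field already sets it. Illustrative helper only.
fstab_add_device_timeout() {
    local _line="$1"
    local _secs="$2"

    echo "$_line" | awk -v opt="x-systemd.device-timeout" -v secs="$_secs" '{
        if (index($4, opt "=") == 0)
            $4 = $4 "," opt "=" secs
        print
    }'
}

# Hypothetical dump target entry; the device and mount point are made up.
fstab_add_device_timeout "/dev/mapper/vg-dump /var/crash xfs defaults 0 2" 700
```

This prints the entry with defaults,x-systemd.device-timeout=700 in the options field. An entry that already sets the option is printed unchanged, so a value chosen by the administrator is never overwritten.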
On 07/27/17 at 10:37am, Xunlei Pang wrote:
> 600s is based on the issue found on the large and slow DragonHawk machine; it is the slowest machine I've ever met, and 600s works for it.
> We suggested a workaround to them: manually append the timeout option in /etc/fstab. We also published a KBase article to explain that.
I suggested 600s, but they actually tested 700s and used 700 in the KBase article, so we can probably use 700 here as well.
As for Bao's concern, this should still be better than a kdump failure when no timeout is specified. It is impossible to determine a minimum value. For server machines, which crash in the background, it should not be a big problem.
For desktop users, if the serial console or the graphics driver works, won't we see the systemd timeout progress messages? If so, that should be fine too.
Thanks Dave
On 07/27/17 at 01:13pm, Dave Young wrote:
> I suggested 600s, but they actually tested 700s and used 700 in the KBase article, so we can probably use 700 here as well.
> As for Bao's concern, this should still be better than a kdump failure when no timeout is specified. It is impossible to determine a minimum value. For server machines, which crash in the background, it should not be a big problem.
Well, this might not be good. Personally, I would suggest not doing it. We can't make the default timeout value about 7 times as large as before just because of one special system. A default value is for most systems; this will bring agony to us, to QA, and to all other testers. Whenever we run a simple mount test case, we will have to wait 10 minutes each time to find out whether it is really a hang or just a mount failure.
A default value is for the general case, not for a special case. A simple example: real estate developers won't design and build every room in China with doors 3 meters high and beds 3 meters long just because Yao Ming is 2.26 meters tall and lives in China.
Just my personal opinion; I won't object to this strongly.
> For desktop users, if the serial console or the graphics driver works, won't we see the systemd timeout progress messages? If so, that should be fine too.
Hmm, it's an idea.
Regards, Xunlei _______________________________________________ kexec mailing list -- kexec@lists.fedoraproject.org To unsubscribe send an email to kexec-leave@lists.fedoraproject.org
On 07/27/17 at 01:52pm, Baoquan He wrote:
On 07/27/17 at 01:13pm, Dave Young wrote:
On 07/27/17 at 10:37am, Xunlei Pang wrote:
On 07/27/2017 at 10:11 AM, Baoquan He wrote:
On 07/26/17 at 07:38pm, Xunlei Pang wrote:
>> Maybe we can also define kdump's default value of rd.retry and rd.timeout,
>> anyway I think that should be a different issue from this patch tries to solve.
> Kdump is thin version of normal kernel, should need less time than
> normal kernel. So if normal kernel works, then no reason kdump need more
> time. if more time needed, normal kernel need too.
What you said is a good point; however, it is not always the case according to our tests and the issues we've met, where the timeout happened only under kdump.
kdump and the normal kernel are a bit different; for example, the first (root) mount may differ after the "remove root=X" feature on Fedora. Different boot cmdlines (like nr_cpus=1) might affect the boot process,
Yes, kdump mostly uses nr_cpus=1, and that matters for the parallel initialization of boot-time system services. On the other hand, kdump strips out all unnecessary services and units, which also matters, since most of the services that run before mounting and device waiting are critical but quick. Parallelism may not gain much.
The most important thing is that 500 seconds is almost 10 minutes. That is so long that I believe almost everyone will think the boot has hung and will reboot without waiting 10 minutes to see whether any further hint message is printed. With this change, it will take us a long time to get reports that kdump hung during boot without a clear reason or message. We will have to ask for the system information, try to reproduce, and finally find out that 'Ah, ok, it doesn't hang, it's a mount timeout failure'.
there will be fewer components under kdump than under the normal kernel, different hardware status due to hot reboot, etc.
So my point is that the kdump mount happens at a different time/stage than under the normal kernel. One notable fact is that the normal kernel mount happens after switch-root to the real root fs, while the kdump kernel mount happens at the initramfs stage.
> Well, this might not be good. Personally, I would suggest not doing it. We can't make the default timeout about 7 times larger than before just because of one special system. A default value is for most systems; this will bring agony to us, QA and all other testers. Whenever we run a simple mount test case, we would have to wait 10 minutes each time to find out whether it's really a hang or just a mount failure.
> A default value is for the general case, not for a special case. A simple example: real estate developers won't build every room in China with 3-meter-high doors and 3-meter-long beds just because Yao Ming is 2.26 meters tall and lives in China.
I understand your concern. The thing is that the interface for tuning this from the user side is not straightforward: we cannot add another option in kdump.conf, so one needs to modify the /etc/fstab line.
We have 2+ reports now. I suppose they are about the FCoE card problem, so this is probably not a special case like the large machine that needs the timeout. Maybe Xunlei can try on the FCoE machine to see whether a smaller value also works.
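For reference, the /etc/fstab override mentioned above works because the patch under discussion only appends a default when the mount options do not already carry a device timeout. A minimal sketch of that check follows. The helper name `add_default_device_timeout` is hypothetical, and a plain `case` match stands in for the `strstr` helper used in the patch; empty option strings are not handled.

```shell
#!/bin/sh
# Sketch only: add_default_device_timeout is a hypothetical helper, not part of mkdumprd.
# It prints the mount options, appending a default x-systemd.device-timeout
# unless the options already contain one.
add_default_device_timeout() {
    _opts=$1      # comma-separated mount options, e.g. taken from /etc/fstab
    _default=$2   # default timeout in seconds
    case "$_opts" in
        *x-systemd.device-timeout=*)
            # The user already chose a value; keep it untouched.
            printf '%s\n' "$_opts"
            ;;
        *)
            printf '%s\n' "$_opts,x-systemd.device-timeout=$_default"
            ;;
    esac
}

# No timeout given: the default is appended.
add_default_device_timeout "rw,relatime" 600
# Timeout already set by the user: the options are left as they are.
add_default_device_timeout "rw,x-systemd.device-timeout=700" 600
```

With this shape, the only tuning knob a user has is the options field of the dump target's /etc/fstab line, which is the usability concern raised above.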
On 07/27/2017 at 01:52 PM, Baoquan He wrote:
> Well, this might not be good. Personally, I would suggest not doing it. We can't make the default timeout about 7 times larger than before just because of one special system. A default value is for most systems; this will bring agony to us, QA and all other testers. Whenever we run a simple mount test case, we would have to wait 10 minutes each time to find out whether it's really a hang or just a mount failure.
> A default value is for the general case, not for a special case. A simple example: real estate developers won't build every room in China with 3-meter-high doors and 3-meter-long beds just because Yao Ming is 2.26 meters tall and lives in China.
> Just my personal opinion; I won't object to this strongly.
Hmm, that is an issue. Another drawback I thought of is that it might significantly increase the system's error recovery time when kdump fails, since it will wait several more minutes with this change.
But on the other hand, a dump target may not be listed in /etc/fstab, in which case users have no good way to configure the timeout value.
I have another idea: we can reuse rd.timeout=X for the kdump mount timeout, with a default a little greater than rd.retry, say 200s. If an explicit rd.timeout is specified, we will use it as the mount timeout value. This way we only roughly double the default 90s, which I think is acceptable (normally the dracut initqueue also waits for the default rd.retry=180s). After all, things are a little different under kdump, per the previous discussion: the default 90s tends to be insufficient on iSCSI multipath machines in the case of a kexec boot, which skips the BIOS.
Thoughts?
Regards, Xunlei
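The rd.timeout proposal above could be sketched roughly as follows. This is only an illustration of the selection rule (an explicit rd.timeout=X wins, otherwise fall back to 200s), not code from mkdumprd. The helper name `kdump_mount_timeout` is hypothetical, and the command line is passed in as a string so the sketch does not depend on /proc/cmdline.

```shell
#!/bin/sh
# Sketch only: kdump_mount_timeout is a hypothetical helper illustrating the proposal.
# Given a kernel command line string, it prints the mount timeout to use:
# an explicit rd.timeout=X wins, otherwise the proposed default of 200s.
kdump_mount_timeout() {
    _cmdline=$1
    _timeout=200   # proposed default, slightly above the default rd.retry=180
    # Rely on word splitting to walk the space-separated arguments.
    for _arg in $_cmdline; do
        case "$_arg" in
            rd.timeout=*) _timeout=${_arg#rd.timeout=} ;;
        esac
    done
    printf '%s\n' "$_timeout"
}

# No rd.timeout on the command line: the default applies.
kdump_mount_timeout "nr_cpus=1 irqpoll"
# Explicit rd.timeout: that value is used as the mount timeout.
kdump_mount_timeout "nr_cpus=1 rd.timeout=300"
```

The printed value would then be fed into the x-systemd.device-timeout mount option, so the kernel command line rather than /etc/fstab becomes the tuning interface.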
On 07/27/17 at 02:29pm, Xunlei Pang wrote:
> I have another idea: we can reuse rd.timeout=X for the kdump mount timeout, with a default a little greater than rd.retry, say 200s. If an explicit rd.timeout is specified, we will use it as the mount timeout value. This way we only roughly double the default 90s, which I think is acceptable (normally the dracut initqueue also waits for the default rd.retry=180s). After all, things are a little different under kdump, per the previous discussion: the default 90s tends to be insufficient on iSCSI multipath machines in the case of a kexec boot, which skips the BIOS.
It looks like a better solution; people can then tune it via the kernel cmdline. But does the default rd.retry work for the FCoE case?
On 07/27/2017 at 03:03 PM, Dave Young wrote:
> It looks like a better solution; people can then tune it via the kernel cmdline. But does the default rd.retry work for the FCoE case?
rd.retry and rd.timeout don't work for multipath (which could sit on top of pure hardware iSCSI or FCoE), as it is not added to the dracut initqueue, so it would be good if we could unify them.
Hi Xunlei,
On 07/27/17 at 03:03pm, Dave Young wrote:
> It looks like a better solution; people can then tune it via the kernel cmdline. But does the default rd.retry work for the FCoE case?
Thinking about this again, 300s will probably work for most cases. We can consider the general rd.timeout later if there are reports other than the very, very large machine case. The lazy way avoids having more options to support, and maybe it will just work for a long time.
What do you think?
On 07/31/2017 at 11:18 AM, Dave Young wrote:
> Thinking about this again, 300s will probably work for most cases. We can consider the general rd.timeout later if there are reports other than the very, very large machine case. The lazy way avoids having more options to support, and maybe it will just work for a long time.
> What do you think?
I was considering 200s, which is a little greater than the default rd.retry (180s), but 300s also looks good to me.