On 07/27/2017 at 01:52 PM, Baoquan He wrote:
On 07/27/17 at 01:13pm, Dave Young wrote:
On 07/27/17 at 10:37am, Xunlei Pang wrote:
On 07/27/2017 at 10:11 AM, Baoquan He wrote:
On 07/26/17 at 07:38pm, Xunlei Pang wrote:
>> Maybe we can also define kdump's default value of rd.retry and rd.timeout,
>> anyway I think that should be a different issue from this patch tries to solve.
> Kdump is thin version of normal kernel, should need less time than
> normal kernel. So if normal kernel works, then no reason kdump need more
> time. if more time needed, normal kernel need too.
What you said is a good point, however it is not always the case according to our tests and issues we've met that timeout happened only under kdump.
The kdump kernel and the normal kernel are a bit different, for example the first (root) mount may be different after the "remove root=X" feature on Fedora. Different boot cmdlines (like nr_cpus=1) might affect the boot process,
Yes, kdump mostly uses nr_cpus=1, and that matters for the parallel initialization of system services at boot. But kdump also strips out all unnecessary services and units, which matters too, since most of the services that run before mounting and device waiting are critical but quick ones. Parallelism may not gain much.
The most important thing is that 500 seconds is almost 10 minutes. That is so long that I believe almost everyone will think the boot has hung, and they will reboot without waiting 10 minutes to see whether any further hint message is printed. With this change, for a long time we will keep getting reports that kdump hangs during boot without a clear reason or message. We will have to ask for the system information, try to reproduce it, and only then find out that 'Ah, ok, it doesn't hang, it's a mount timeout failure'.
there will be fewer components under kdump than under the normal kernel, the hardware state differs due to the hot reboot, etc.
So my point is that the kdump mount happens at a different time/stage than the mount under the normal kernel. One notable fact is that the normal kernel mount happens after switch-root to the real root fs, while the kdump kernel mount happens at the initramfs stage.
Another guess is that, say hardware iscsi, for normal reboot, the hardware can get more time during boot via BIOS, and for kexec boot it was not given enough time to get ready. Anyway it would be good to have some hardware guy give us a reasonable explanation.
Yes, if possible, a worst-case time consumption might be needed. It is like -273 degrees centigrade being absolute zero just because it's the lowest known temperature in nature. We can use that worst case as the biggest timeout value; if one day it's exceeded, we can increase the timeout value, or ask the developers of the mounted device why it takes so much longer.
600s is based on the issue found on the large and slow DragonHawk machine, and it is the slowest machine I've ever met, 600s can work for it.
We've suggested a workaround for them by manually appending the timeout option in /etc/fstab and published a KBase to explain that.
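A minimal sketch of that kind of workaround, for illustration only. The thread does not say which option the KBase uses; x-systemd.device-timeout is assumed here because it is a timeout option that systemd reads from /etc/fstab. The device, mount point, and the helper name add_mount_option are hypothetical.

```python
def add_mount_option(fstab_line: str, option: str) -> str:
    """Return fstab_line with `option` set in its options field (4th column)."""
    fields = fstab_line.split()
    # Leave blank lines, comments, and malformed entries untouched.
    if len(fields) < 4 or fields[0].startswith("#"):
        return fstab_line
    name = option.partition("=")[0]
    # Drop any earlier setting of the same option, then append the new one.
    opts = [o for o in fields[3].split(",") if o.partition("=")[0] != name]
    opts.append(option)
    fields[3] = ",".join(opts)
    # Note: columns are re-joined with single spaces.
    return " ".join(fields)


# Hypothetical dump target entry; 700s is the value mentioned in the thread.
entry = "/dev/mapper/mpatha /var/crash xfs defaults 0 0"
print(add_mount_option(entry, "x-systemd.device-timeout=700s"))
```

This only rewrites one line of text; it does nothing for a dump target that has no /etc/fstab entry at all.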
I suggested using 600s, but they actually tested 700s and used 700 in the KBase, so probably we can use 700 here as well.
For Bao's concern, it should be better than a kdump failure without the timeout specified. It is impossible to get a minimum value. For server machines that crash in the background it should not be a big problem.
Well, this might not be good. I personally would suggest that we please don't do it. We can't make this default timeout value 7 times as big as before just because of one special system. A default value is for most systems; this will bring agony to us, QA and all other testers. We will have to wait 10 minutes each time to find out whether it's really a hang or just a mount failure, whenever we do a simple mount test case.
A default value is for the general case, not for a special case. A simple example: real estate developers won't design and build every room in China with doors 3 meters high and beds 3 meters long just because Yao Ming is 2.26 meters tall and lives in China.
Just my personal opinion, I won't object to this strongly.
Hmm, it's an issue, besides, another drawback I thought of is that it might increase the system error recovery time significantly in case of kdump failure as it will wait more minutes after this.
But on the other hand, we can have a dump target not in /etc/fstab, so users do not have a good way to configure the timeout value.
I have another idea: we can reuse rd.timeout=X for the kdump mount timeout, with a default value a little greater than rd.retry, say 200s. If an explicit rd.timeout is specified, we will use it as the mount timeout value. In this way we only double the default 90s, which I think is acceptable (normally the dracut initqueue also waits for the default rd.retry=180s). After all, things are a little different under kdump per the previous discussion: the default 90s tends to be insufficient on iSCSI multipath machines in case of a kexec boot, which skips the BIOS.
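The selection rule in this proposal could be sketched roughly as follows. This is only an illustration of the idea, not code from kexec-tools or dracut; the function name kdump_mount_timeout and the sample command lines are made up, and 200 is the "say 200s" figure from the proposal.

```python
DEFAULT_KDUMP_MOUNT_TIMEOUT = 200  # seconds, a little above rd.retry=180


def kdump_mount_timeout(cmdline: str,
                        default: int = DEFAULT_KDUMP_MOUNT_TIMEOUT) -> int:
    """Pick the mount timeout: an explicit rd.timeout=X wins, else the default."""
    timeout = default
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        # Ignore malformed values such as "rd.timeout=" or "rd.timeout=abc".
        if key == "rd.timeout" and sep and value.isdecimal():
            timeout = int(value)  # the last occurrence on the cmdline wins
    return timeout


# Hypothetical kdump kernel command lines.
print(kdump_mount_timeout("nr_cpus=1 irqpoll reset_devices"))
print(kdump_mount_timeout("nr_cpus=1 rd.timeout=700"))
```

With this rule, a site with unusually slow storage can raise the value on the kdump command line without changing the default for everyone else.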
Thoughts?
Regards, Xunlei
For desktop users, if the serial console or graphics driver works, will we see the systemd timeout progress messages? If so, it should also be fine.
Hmm, it's an idea.
Regards, Xunlei
_______________________________________________
kexec mailing list -- kexec@lists.fedoraproject.org