On 07/26/17 at 07:38pm, Xunlei Pang wrote:
Maybe we can also define kdump's default values for rd.retry and rd.timeout. Anyway, I think that is a different issue from the one this patch tries to solve.
Kdump is a thin version of the normal kernel and should need less time than the normal kernel. So if the normal kernel works, there is no reason for kdump to need more time. If more time is needed, the normal kernel needs it too.
That is a good point. However, it is not always the case: according to our tests and the issues we have met, the timeout happened only under kdump.
kdump and the normal kernel are a bit different. For example, the first (root) mount may be different after the "remove root=X" feature on Fedora. Different boot cmdlines (like nr_cpus=1) might affect the boot process,
Yes, kdump mostly uses nr_cpus=1, and that matters for the parallel initialization of boot system services. But kdump also strips all unnecessary services and units, and that matters too, since most of the services that run before mount and device waiting are critical but quick ones. Parallelizing may not gain much.
The most important thing is that 500 seconds is almost 10 minutes. That is so long that I believe almost everyone will think the boot has hung and will reboot without waiting 10 minutes to see whether any further hint messages are printed. With this change, we will for a long time keep getting reports that kdump hangs during boot without a clear reason or message. We will have to ask for the system information, try to reproduce, and finally find out: 'Ah, OK, it doesn't hang, it's a mount timeout failure.'
there will be fewer components under kdump than under the normal kernel, different hardware status due to hot reboot, etc.
So my point is that the kdump mount happens at a different time/stage than under the normal kernel. One notable fact is that the normal kernel mount happens after switch-root to the real root fs, while the kdump kernel mount happens at the initramfs stage.
Another guess is that with, say, hardware iSCSI, a normal reboot gives the hardware more time to get ready during boot via the BIOS, while a kexec boot does not give it enough time. Anyway, it would be good to have some hardware guy give us a reasonable explanation.
Yes, if possible, a worst-case time consumption might be needed. It is like how -273 degrees Celsius is absolute zero just because it is the lowest temperature known in nature. We can use the worst known time as the biggest timeout value. If one day it is exceeded, we can increase the timeout value, or ask the developers of the mounted device why it takes so much longer.
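
The "worst known time as the biggest timeout" idea above could be sketched roughly as below. This is only an illustration: the function names, the 1.5x safety margin, and the 30-second floor are hypothetical choices, not anything taken from kdump or dracut, and any measurements fed to it would have to come from real test runs.

```python
import math


def pick_mount_timeout(observed_seconds, margin=1.5, floor=30):
    """Derive a timeout (whole seconds) from the worst observed mount time.

    observed_seconds: mount durations measured on real systems (caller-supplied).
    margin and floor are hypothetical tuning knobs for this sketch.
    """
    if not observed_seconds:
        # No measurements yet: fall back to the floor value.
        return floor
    worst = max(observed_seconds)
    return max(floor, math.ceil(worst * margin))


def update_on_exceed(current_timeout, new_observation, margin=1.5):
    """Raise the timeout only when a new observation exceeds the current one."""
    if new_observation > current_timeout:
        return math.ceil(new_observation * margin)
    return current_timeout
```

The point of keeping the update step separate is that an exceeded timeout is also a signal worth reporting to the device developers, not just a number to bump.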