On 01/06/2017 at 06:11 PM, Pratyush Anand wrote:
Hi Xunlei,
Thanks a lot for making things better and automating wherever possible :-)
On Friday 06 January 2017 12:07 PM, Xunlei Pang wrote:
> Check the number of cpus for x86_64 kdump kernel to boot with.
> We met an issue for x86_64: kdump runs out of vectors with the
> default "nr_cpus=1", when requesting tons of irqs.
>
> This patch detects such situation and warns users about the risk.
>
> Signed-off-by: Xunlei Pang <xlpang(a)redhat.com>
> ---
> v1->v2:
> - When detecting risky cpu vectors, we just warn users instead of
> modifying "nr_cpus=X" forcibly.
> - Improved code comments.
> - Replaced nr_old with nr_origin, and improved some logic.
>
> kdumpctl | 81 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 81 insertions(+)
>
> diff --git a/kdumpctl b/kdumpctl
> index b2068cc..b6fc1f9 100755
> --- a/kdumpctl
> +++ b/kdumpctl
> @@ -105,6 +105,85 @@ append_cmdline()
> echo $cmdline
> }
>
> +# Check the number of cpus for kdump kernel to boot with.
> +# We met an issue for x86_64: kdump runs out of vectors with
> +# "nr_cpus=1" when requesting tons of irqs, so here we check
> +# "nr_cpus=X" and warn users if kdump probably can't work.
> +check_kdump_cpus()
> +{
> + local nr nr_search nr_origin nr_min nr_max
> + local arch=$(uname -m) cmdline=$KDUMP_COMMANDLINE
> +
> + # Special treatment for x86_64 only currently.
> + if [ $arch != "x86_64" ]; then
> + return
> + fi
> +
> + # We only care about "nr_cpus=X" format for x86.
> + nr_search=$(echo $cmdline | grep -o "nr_cpus=[0-9]*" | wc -l)
Maybe we can just have:
nr_cpus=$(echo $cmdline | grep -o "nr_cpus=[0-9]*")
> + if [ $nr_search -eq 0 ] ; then
and then check for [[ -z $nr_cpus ]]
> + # Do not need to process if no valid "nr_cpus=X" specified.
> + return
> + fi
> +
> + # Get value X of "nr_cpus=X"
> + nr_search=$(echo $cmdline | grep -o "nr_cpus=[0-9]*" | cut -d "=" -f2 | grep "[0-9]" | sort)
and the same nr_cpus can then be reused here.
I've improved the logic, please see v3.
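A minimal sketch of the suggestion above, not the v3 code: extract the "nr_cpus=X" values once, test the result for emptiness, and reuse it to pick the smallest positive value. The function name pick_nr_cpus is hypothetical, and "sort -n" (numeric ordering) is an assumption added here so that 10 does not sort before 2. The POSIX test [ -z "$values" ] is used in place of [[ -z $nr_cpus ]].

```shell
# Hypothetical helper, not part of the patch: print the smallest positive X
# among all "nr_cpus=X" tokens in the given command line, or print nothing.
pick_nr_cpus()
{
    local cmdline="$1" values nr

    # Extract every numeric value in one pass; sort -n orders numerically.
    values=$(echo "$cmdline" | grep -o "nr_cpus=[0-9][0-9]*" | cut -d "=" -f2 | sort -n)

    # No valid "nr_cpus=X" specified: nothing to process.
    if [ -z "$values" ]; then
        return 1
    fi

    # Values are sorted ascending, so the first positive one is the minimum.
    for nr in $values; do
        if [ "$nr" -gt 0 ]; then
            echo "$nr"
            return 0
        fi
    done
    return 1
}

pick_nr_cpus "root=/dev/sda1 nr_cpus=8 irqpoll nr_cpus=2"
```

The example call prints 2. A command line with only "nr_cpus=0" prints nothing and returns non-zero, which corresponds to the "Wrong nr_cpus=" warning case in the patch.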
> + # In case there are multiple "nr_cpus=X", get the minimum value.
> + for nr in $nr_search; do
> + if [ $nr -gt 0 ]; then
> + nr_origin=$nr
> + break
> + fi
> + done
> + if [ -z "$nr_origin" ]; then
> + echo "Warning: Wrong \"nr_cpus=\" kernel cmdline detected"
> + return
> + fi
> +
> + # Online cpus in first kernel.
> + nr_max=$(nproc)
> +
> + # Calculate estimated minimum cpus required by irqs(vectors).
> + nr_min=$(ls /proc/irq/ -l | grep ^d | wc -l)
> +
> + # We roughly use 256-32(see kernel FIRST_EXTERNAL_VECTOR)=224 as
> + # maximum supported vectors can be allocated to io devices percpu.
> + # As nr_min is a ballpark figure, also some high-numbered vectors
> + # are consumed by the kernel(see FIRST_SYSTEM_VECTOR), we need a
> + # variance for safety.
> + #
> + # We got a large machine with 240 cpus, 6TB memory, 8 iommus, and
> + # 12 io-apics, 132 irqs under /proc/irq/, it can boot successfully
> + # with "nr_cpus=1". (256-32-132)=92, so choosing 64 as the variance
Why guess that? If I have not missed anything, the number of vectors
needed by the kernel seems to be fixed.
From arch/x86/include/asm/irq_vectors.h:
vector 128 seems to be fixed for the system call.
If we have CONFIG_X86_LOCAL_APIC, then vectors 0xef to 0xff are used by the
kernel. So this variance should have a fixed value of 1 for
!CONFIG_X86_LOCAL_APIC and 18 otherwise, no?
So maybe we can use 18 for all cases.
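The count of 18 above can be checked with a few lines of shell arithmetic. This is only a sketch: the vector numbers are taken from the comment above, not from any particular kernel tree, and the variable names are made up for illustration.

```shell
# Assumed values, copied from the review comment above.
first_system_vector=$((0xef))   # lowest vector used by the kernel with a local APIC
last_vector=$((0xff))           # highest vector number
syscall_vectors=1               # vector 0x80 (128) for the system call

# Inclusive range 0xef..0xff plus the system call vector.
reserved=$((last_vector - first_system_vector + 1 + syscall_vectors))
echo "$reserved"
```

This prints 18: 17 vectors in the range 0xef to 0xff, plus one for vector 0x80.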
Hmm, I pondered quite a lot here, because it's hard to decide on one exact
value we should rely on :-)

Yes, FIRST_EXTERNAL_VECTOR(32) is defined by the x86 architecture, and it is
unlikely to change. FIRST_SYSTEM_VECTOR, however, is assigned by the kernel
and may vary between kernel versions, e.g. the latest kernel and linux-3.10
have different FIRST_SYSTEM_VECTOR values, and more system vectors could be
added to the kernel in the future. There are also other rare kernel-internal
reserved vectors, e.g. 0x80 is reserved as the system call vector.

Additionally, there may be cases where one irq has an explicit affinity, and
then multiple vectors (one for each cpu) are allocated. So I just gave a
flexible variance. We can't know precisely whether it can boot without
knowing all the details, but if it is above the threshold we selected, kdump
is very likely to fail to boot.

For example, suppose we choose the threshold 256-32-18-1=205 and a system has
204 external device interrupts, of which several (say 6) have 0x3 affinity
explicitly set. They consume 6 more vectors, 210 in total, so the actual
calculated nr_min should be 2 instead of 1.

But if you think we were a little too careful with variance 64, I guess 32
should be good enough; after all, one more cpu is added for the
multiple-affinity cases:
if [ $nr_min -gt 1 ]; then
nr_min=$(($nr_min + 1))
nr_min=$(($nr_min + $nr_min % 2))
fi
Regards,
Xunlei
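To compare the variances discussed above, the patch's arithmetic can be wrapped in a small function that takes the variance as a parameter. This is a sketch under stated assumptions: estimate_min_cpus and its arguments are hypothetical names, the irq count is passed in rather than read from /proc/irq, and the final clamp to the number of online cpus is left out.

```shell
# Hypothetical helper mirroring the arithmetic in the patch.
#   $1: number of irqs (the patch counts directories under /proc/irq)
#   $2: variance, i.e. vectors held back per cpu for safety
estimate_min_cpus()
{
    local nr_irqs="$1" variance="$2" per_cpu nr_min

    # 256 vectors, minus the 32 below FIRST_EXTERNAL_VECTOR, minus the variance.
    per_cpu=$((256 - 32 - variance))

    # Ceiling division: cpus needed to hold nr_irqs vectors.
    nr_min=$(( (nr_irqs + per_cpu - 1) / per_cpu ))

    if [ "$nr_min" -gt 1 ]; then
        # One spare cpu for multi-cpu affinity, then round up to an even number.
        nr_min=$((nr_min + 1))
        nr_min=$((nr_min + nr_min % 2))
    fi
    echo "$nr_min"
}

estimate_min_cpus 132 64   # the 240-cpu machine from the patch comment
estimate_min_cpus 204 19   # threshold 205, affinity ignored
estimate_min_cpus 210 19   # same system once the 6 extra vectors are counted
```

The three calls print 1, 1 and 4. In the last case the raw ceiling is 2, as in the example above; the extra cpu and the rounding to an even number then raise it to 4.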
> + # seems ok. Then we get the max external irqs supported per cpu:
> + # (256-32-64)=160 as the dividend.
> + nr_min=$(($nr_min + 160 - 1))
> + nr_min=$(($nr_min / 160))
> + if [ $nr_min -gt 1 ]; then
> + # The system seems to have tons of interrupts. while interrupts
> + # with multiple-cpu affinity can consume multiple vectors, with
> + # one vector for each cpu within the affinity mask. Fortunately
> + # for x2apic which is widely used on large modern machines, in
> + # default case of boot, device bringup etc will use a single cpu
> + # for the interrupt affinity to minimize vector pressure.
> + #
> + # For further safety, we add one more cpu and round it up to an
> + # even number which is commonly-used.
> + nr_min=$(($nr_min + 1))
> + nr_min=$(($nr_min + $nr_min % 2))
> + fi
> +
> + if [ $nr_min -gt $nr_max ]; then
> + nr_min=$nr_max
> + fi
> +
> + if [ $nr_origin -ge $nr_min ]; then
> + return
> + fi
> +
> + echo "Warning: CPU vectors under pressure with \"nr_cpus=$nr_origin\", please try \"nr_cpus=$nr_min\" or more"
> +}
> +
> # This function performs a series of edits on the command line.
> # Store the final result in global $KDUMP_COMMANDLINE.
> prepare_cmdline()
> @@ -134,6 +213,8 @@ prepare_cmdline()
> fi
>
> KDUMP_COMMANDLINE=$cmdline
> +
> + check_kdump_cpus
> }
>
>
>
~Pratyush