When AMD sme/sev enabled, it will require extra memory reserved for swiotlb. Currently the swiotlb kernel parameter is removed by /etc/sysconfig/kdump for 2nd kernel, so a 64MB memory for swiotlb will be reserved by default [1], and the 2nd kernel will fail of OOM with the default crashkernel value.
swiotlb memory is mandatory for AMD sme/sev. For rhel8, kernel commit 0adb0f4 ("[x86] Reserve at most 64M of SWIOTLB memory for crashkernel") implemented the reservation in the kernel side, as part of the kernel crashkernel=auto implementation.
For rhel9 however, crashkernel=auto has been mirrored to userspace kexec-tools, so this patch will enable the swiotlb reservation for AMD sme/sev in kexec-tools as well.
[1]: https://elixir.bootlin.com/linux/v5.14/source/arch/x86/mm/mem_encrypt.c#L200
Signed-off-by: Tao Liu ltao@redhat.com --- kdump-lib-initramfs.sh | 10 ++++++++++ kdump-lib.sh | 11 +++++++++-- kdumpctl | 7 ++++++- 3 files changed, 25 insertions(+), 3 deletions(-)
diff --git a/kdump-lib-initramfs.sh b/kdump-lib-initramfs.sh index 9be0fe9..699babd 100755 --- a/kdump-lib-initramfs.sh +++ b/kdump-lib-initramfs.sh @@ -8,6 +8,11 @@ KDUMP_CONFIG_FILE="/etc/kdump.conf" FENCE_KDUMP_CONFIG_FILE="/etc/sysconfig/fence_kdump" FENCE_KDUMP_SEND="/usr/libexec/fence_kdump_send"
+# Reserve 64MB for swiotlb when AMD sme/sev enabled. +AMD_SME_SEV_SWIOTLB=64 +# Copied from include/linux/swiotlb.h +IO_TLB_SHIFT=11 + # Read kdump config in well formated style kdump_read_conf() { @@ -130,3 +135,8 @@ is_fs_dump_target() { [ -n "$(kdump_get_conf_val "ext[234]|xfs|btrfs|minix")" ] } + +is_amd_sme_sev_enabled() +{ + dmesg | grep -sq "AMD Memory Encryption Features active" +} diff --git a/kdump-lib.sh b/kdump-lib.sh index 4ed5035..8ea6bfe 100755 --- a/kdump-lib.sh +++ b/kdump-lib.sh @@ -831,7 +831,7 @@ get_recommend_size() # $1 dump mode, if not specified, dump_mode will be judged by is_fadump_capable kdump_get_arch_recommend_crashkernel() { - local _arch _ck_cmdline _dump_mode + local _arch _ck_cmdline _dump_mode _mem_enc
if [[ -z "$1" ]]; then if is_fadump_capable; then @@ -845,7 +845,14 @@ kdump_get_arch_recommend_crashkernel()
_arch=$(uname -m)
- if [[ $_arch == "x86_64" ]] || [[ $_arch == "s390x" ]]; then + if [[ $_arch == "x86_64" ]]; then + if is_amd_sme_sev_enabled; then + _mem_enc=$AMD_SME_SEV_SWIOTLB + else + _mem_enc=0 + fi + _ck_cmdline="1G-4G:$((192+$_mem_enc))M,4G-64G:$((256+$_mem_enc))M,64G-:$((512+$_mem_enc))M" + elif [[ $_arch == "s390x" ]]; then _ck_cmdline="1G-4G:192M,4G-64G:256M,64G-:512M" elif [[ $_arch == "aarch64" ]]; then _ck_cmdline="2G-:448M" diff --git a/kdumpctl b/kdumpctl index 1869753..50aafda 100755 --- a/kdumpctl +++ b/kdumpctl @@ -21,6 +21,7 @@ FADUMP_REGISTER_SYS_NODE="/sys/kernel/fadump_registered" #kdump shall be the default dump mode DEFAULT_DUMP_MODE="kdump" image_time=0 +size_mb=$((1024 * 1024))
standard_kexec_args="-d -p"
@@ -622,6 +623,11 @@ load_kdump() { local ret
+ # Force 2nd kernel use swiotlb for AMD sme/sve to 64MB at most. + if is_amd_sme_sev_enabled; then + KDUMP_COMMANDLINE_APPEND+=" swiotlb=$((($AMD_SME_SEV_SWIOTLB * $size_mb) >> $IO_TLB_SHIFT)) " + fi + KEXEC_ARGS=$(prepare_kexec_args "${KEXEC_ARGS}") KDUMP_COMMANDLINE=$(prepare_cmdline "${KDUMP_COMMANDLINE}" "${KDUMP_COMMANDLINE_REMOVE}" "${KDUMP_COMMANDLINE_APPEND}")
@@ -1189,7 +1195,6 @@ do_estimate() local -A large_mods local baseline local kernel_size mod_size initrd_size baseline_size runtime_size reserved_size estimated_size recommended_size - local size_mb=$((1024 * 1024))
setup_initrd if [[ ! -f $TARGET_INITRD ]]; then
Hi Tao,
On Tue, 29 Mar 2022 20:45:33 +0800 Tao Liu ltao@redhat.com wrote:
When AMD sme/sev enabled, it will require extra memory reserved for swiotlb. Currently the swiotlb kernel parameter is removed by /etc/sysconfig/kdump for 2nd kernel, so a 64MB memory for swiotlb will be reserved by default [1], and the 2nd kernel will fail of OOM with the default crashkernel value.
swiotlb memory is mandatory for AMD sme/sev. For rhel8, kernel commit 0adb0f4 ("[x86] Reserve at most 64M of SWIOTLB memory for crashkernel")
is that commit id correct? I cannot find it in my rhel-8 clone...
implemented the reservation in the kernel side, as part of the kernel crashkernel=auto implementation.
For rhel9 however, crashkernel=auto has been mirrored to userspace kexec-tools, so this patch will enable the swiotlb reservation for AMD sme/sev in kexec-tools as well.
Not really about this patch but I find it weird that we adjust the crashkernel=auto value for swiotlb but not for other cases wher we know that we need a larger crashkernel, e.g. when the dump target is on a LUKS encrypted device. I understand that in the past, when crashkernel=auto was handled by the kernel, we simply didn't know about what the dump target will be when we allocated the crashkernel during boot. But today, when crashkernel=auto is handled in userspace we can.
Signed-off-by: Tao Liu ltao@redhat.com
kdump-lib-initramfs.sh | 10 ++++++++++ kdump-lib.sh | 11 +++++++++-- kdumpctl | 7 ++++++- 3 files changed, 25 insertions(+), 3 deletions(-)
diff --git a/kdump-lib-initramfs.sh b/kdump-lib-initramfs.sh index 9be0fe9..699babd 100755 --- a/kdump-lib-initramfs.sh +++ b/kdump-lib-initramfs.sh @@ -8,6 +8,11 @@ KDUMP_CONFIG_FILE="/etc/kdump.conf" FENCE_KDUMP_CONFIG_FILE="/etc/sysconfig/fence_kdump" FENCE_KDUMP_SEND="/usr/libexec/fence_kdump_send"
+# Reserve 64MB for swiotlb when AMD sme/sev enabled. +AMD_SME_SEV_SWIOTLB=64 +# Copied from include/linux/swiotlb.h +IO_TLB_SHIFT=11
# Read kdump config in well formated style kdump_read_conf() { @@ -130,3 +135,8 @@ is_fs_dump_target() { [ -n "$(kdump_get_conf_val "ext[234]|xfs|btrfs|minix")" ] }
+is_amd_sme_sev_enabled() +{
- dmesg | grep -sq "AMD Memory Encryption Features active"
Is there a more reliable way, e.g. a sysfs attribute, to determine if SEV is enabled?
The dmesg has the problem that it is stored in a ring buffer. Which means if it runs full old entries are overwritten by new ones. So you cannot fully rely on it.
Thanks Philipp
+} diff --git a/kdump-lib.sh b/kdump-lib.sh index 4ed5035..8ea6bfe 100755 --- a/kdump-lib.sh +++ b/kdump-lib.sh @@ -831,7 +831,7 @@ get_recommend_size() # $1 dump mode, if not specified, dump_mode will be judged by is_fadump_capable kdump_get_arch_recommend_crashkernel() {
- local _arch _ck_cmdline _dump_mode
local _arch _ck_cmdline _dump_mode _mem_enc
if [[ -z "$1" ]]; then if is_fadump_capable; then
@@ -845,7 +845,14 @@ kdump_get_arch_recommend_crashkernel()
_arch=$(uname -m)
- if [[ $_arch == "x86_64" ]] || [[ $_arch == "s390x" ]]; then
- if [[ $_arch == "x86_64" ]]; then
if is_amd_sme_sev_enabled; then
_mem_enc=$AMD_SME_SEV_SWIOTLB
else
_mem_enc=0
fi
_ck_cmdline="1G-4G:$((192+$_mem_enc))M,4G-64G:$((256+$_mem_enc))M,64G-:$((512+$_mem_enc))M"
- elif [[ $_arch == "s390x" ]]; then _ck_cmdline="1G-4G:192M,4G-64G:256M,64G-:512M" elif [[ $_arch == "aarch64" ]]; then _ck_cmdline="2G-:448M"
diff --git a/kdumpctl b/kdumpctl index 1869753..50aafda 100755 --- a/kdumpctl +++ b/kdumpctl @@ -21,6 +21,7 @@ FADUMP_REGISTER_SYS_NODE="/sys/kernel/fadump_registered" #kdump shall be the default dump mode DEFAULT_DUMP_MODE="kdump" image_time=0 +size_mb=$((1024 * 1024))
standard_kexec_args="-d -p"
@@ -622,6 +623,11 @@ load_kdump() { local ret
- # Force 2nd kernel use swiotlb for AMD sme/sve to 64MB at most.
- if is_amd_sme_sev_enabled; then
KDUMP_COMMANDLINE_APPEND+=" swiotlb=$((($AMD_SME_SEV_SWIOTLB * $size_mb) >> $IO_TLB_SHIFT)) "
- fi
- KEXEC_ARGS=$(prepare_kexec_args "${KEXEC_ARGS}") KDUMP_COMMANDLINE=$(prepare_cmdline "${KDUMP_COMMANDLINE}" "${KDUMP_COMMANDLINE_REMOVE}" "${KDUMP_COMMANDLINE_APPEND}")
@@ -1189,7 +1195,6 @@ do_estimate() local -A large_mods local baseline local kernel_size mod_size initrd_size baseline_size runtime_size reserved_size estimated_size recommended_size
local size_mb=$((1024 * 1024))
setup_initrd if [[ ! -f $TARGET_INITRD ]]; then
Hi Philipp,
On Wed, Mar 30, 2022 at 11:34 PM Philipp Rudo prudo@redhat.com wrote:
Hi Tao,
On Tue, 29 Mar 2022 20:45:33 +0800 Tao Liu ltao@redhat.com wrote:
When AMD sme/sev enabled, it will require extra memory reserved for swiotlb. Currently the swiotlb kernel parameter is removed by /etc/sysconfig/kdump for 2nd kernel, so a 64MB memory for swiotlb will be reserved by default [1], and the 2nd kernel will fail of OOM with the default crashkernel value.
swiotlb memory is mandatory for AMD sme/sev. For rhel8, kernel commit 0adb0f4 ("[x86] Reserve at most 64M of SWIOTLB memory for crashkernel")
is that commit id correct? I cannot find it in my rhel-8 clone...
Sorry, it's ab30d3f, I copied the wrong commit id...
implemented the reservation in the kernel side, as part of the kernel crashkernel=auto implementation.
For rhel9 however, crashkernel=auto has been mirrored to userspace kexec-tools, so this patch will enable the swiotlb reservation for AMD sme/sev in kexec-tools as well.
Not really about this patch but I find it weird that we adjust the crashkernel=auto value for swiotlb but not for other cases wher we know that we need a larger crashkernel, e.g. when the dump target is on a LUKS encrypted device. I understand that in the past, when crashkernel=auto was handled by the kernel, we simply didn't know about what the dump target will be when we allocated the crashkernel during boot. But today, when crashkernel=auto is handled in userspace we can.
I didn't have much context on LUKS, by looking into the code, I see only "kdumpctl estimate" has the LUKS related crashkernel value increment. It needs the user to adjust crashkernel value manually by estimate-and-set. Maybe we can merge the crashkernel adjustment of swiotlb and luks together, so luks don't need to "kdumpctl estimate" manually anymore?
Signed-off-by: Tao Liu ltao@redhat.com
kdump-lib-initramfs.sh | 10 ++++++++++ kdump-lib.sh | 11 +++++++++-- kdumpctl | 7 ++++++- 3 files changed, 25 insertions(+), 3 deletions(-)
diff --git a/kdump-lib-initramfs.sh b/kdump-lib-initramfs.sh index 9be0fe9..699babd 100755 --- a/kdump-lib-initramfs.sh +++ b/kdump-lib-initramfs.sh @@ -8,6 +8,11 @@ KDUMP_CONFIG_FILE="/etc/kdump.conf" FENCE_KDUMP_CONFIG_FILE="/etc/sysconfig/fence_kdump" FENCE_KDUMP_SEND="/usr/libexec/fence_kdump_send"
+# Reserve 64MB for swiotlb when AMD sme/sev enabled. +AMD_SME_SEV_SWIOTLB=64 +# Copied from include/linux/swiotlb.h +IO_TLB_SHIFT=11
# Read kdump config in well formated style kdump_read_conf() { @@ -130,3 +135,8 @@ is_fs_dump_target() { [ -n "$(kdump_get_conf_val "ext[234]|xfs|btrfs|minix")" ] }
+is_amd_sme_sev_enabled() +{
dmesg | grep -sq "AMD Memory Encryption Features active"
Is there a more reliable way, e.g. a sysfs attribute, to determine if SEV is enabled?
The dmesg has the problem that it is stored in a ring buffer. Which means if it runs full old entries are overwritten by new ones. So you cannot fully rely on it.
I agree with your point, dmesg isn't the ideal way to detect sme/sev enablement. However I didn't find a better way to do it.
Actually we need to verify the following:
1) sme enabled on host 2) sev enabled on host 3) sev enabled on VM guest
Only case 2 can be verified through $(cat /sys/module/kvm_amd/parameters/sev), $(dmesg | grep -sq "AMD Memory Encryption Features active") is generic for the 3 cases though...
Thanks, Tao Liu
Thanks Philipp
+} diff --git a/kdump-lib.sh b/kdump-lib.sh index 4ed5035..8ea6bfe 100755 --- a/kdump-lib.sh +++ b/kdump-lib.sh @@ -831,7 +831,7 @@ get_recommend_size() # $1 dump mode, if not specified, dump_mode will be judged by is_fadump_capable kdump_get_arch_recommend_crashkernel() {
local _arch _ck_cmdline _dump_mode
local _arch _ck_cmdline _dump_mode _mem_enc if [[ -z "$1" ]]; then if is_fadump_capable; then
@@ -845,7 +845,14 @@ kdump_get_arch_recommend_crashkernel()
_arch=$(uname -m)
if [[ $_arch == "x86_64" ]] || [[ $_arch == "s390x" ]]; then
if [[ $_arch == "x86_64" ]]; then
if is_amd_sme_sev_enabled; then
_mem_enc=$AMD_SME_SEV_SWIOTLB
else
_mem_enc=0
fi
_ck_cmdline="1G-4G:$((192+$_mem_enc))M,4G-64G:$((256+$_mem_enc))M,64G-:$((512+$_mem_enc))M"
elif [[ $_arch == "s390x" ]]; then _ck_cmdline="1G-4G:192M,4G-64G:256M,64G-:512M" elif [[ $_arch == "aarch64" ]]; then _ck_cmdline="2G-:448M"
diff --git a/kdumpctl b/kdumpctl index 1869753..50aafda 100755 --- a/kdumpctl +++ b/kdumpctl @@ -21,6 +21,7 @@ FADUMP_REGISTER_SYS_NODE="/sys/kernel/fadump_registered" #kdump shall be the default dump mode DEFAULT_DUMP_MODE="kdump" image_time=0 +size_mb=$((1024 * 1024))
standard_kexec_args="-d -p"
@@ -622,6 +623,11 @@ load_kdump() { local ret
# Force 2nd kernel use swiotlb for AMD sme/sve to 64MB at most.
if is_amd_sme_sev_enabled; then
KDUMP_COMMANDLINE_APPEND+=" swiotlb=$((($AMD_SME_SEV_SWIOTLB * $size_mb) >> $IO_TLB_SHIFT)) "
fi
KEXEC_ARGS=$(prepare_kexec_args "${KEXEC_ARGS}") KDUMP_COMMANDLINE=$(prepare_cmdline "${KDUMP_COMMANDLINE}" "${KDUMP_COMMANDLINE_REMOVE}" "${KDUMP_COMMANDLINE_APPEND}")
@@ -1189,7 +1195,6 @@ do_estimate() local -A large_mods local baseline local kernel_size mod_size initrd_size baseline_size runtime_size reserved_size estimated_size recommended_size
local size_mb=$((1024 * 1024)) setup_initrd if [[ ! -f $TARGET_INITRD ]]; then
Hi Tao,
On Sat, Apr 02, 2022 at 02:14:38PM +0800, Tao Liu wrote:
Hi Philipp,
On Wed, Mar 30, 2022 at 11:34 PM Philipp Rudo prudo@redhat.com wrote:
[...]
implemented the reservation in the kernel side, as part of the kernel crashkernel=auto implementation.
For swiotlb, I think Dave and Emma prefer to do it in kernel space. Maybe you can try a kernel space solution first like [1]?
[1] https://lore.kernel.org/lkml/20190910151341.14986-3-kasong@redhat.com/
For rhel9 however, crashkernel=auto has been mirrored to userspace kexec-tools, so this patch will enable the swiotlb reservation for AMD sme/sev in kexec-tools as well.
Not really about this patch but I find it weird that we adjust the crashkernel=auto value for swiotlb but not for other cases wher we know that we need a larger crashkernel, e.g. when the dump target is on a LUKS encrypted device. I understand that in the past, when crashkernel=auto was handled by the kernel, we simply didn't know about what the dump target will be when we allocated the crashkernel during boot. But today, when crashkernel=auto is handled in userspace we can.
I didn't have much context on LUKS, by looking into the code, I see only "kdumpctl estimate" has the LUKS related crashkernel value increment. It needs the user to adjust crashkernel value manually by estimate-and-set. Maybe we can merge the crashkernel adjustment of swiotlb and luks together, so luks don't need to "kdumpctl estimate" manually anymore?
For LUKS, a kernel-space solution by reusing LUKS master key is preferred because a) LUKS requires too much memory b) we can't expect the user to wait at the screen to input the passphrase to open the disk when kernel crashes. Let's see how far my work [2] can go.
[2] https://lore.kernel.org/lkml/20220321014150.w6wux5azabweu7dr@Rk/T/
Hi Coiby, Hi Tao,
On Sat, 2 Apr 2022 16:43:37 +0800 Coiby Xu coxu@redhat.com wrote:
Hi Tao,
On Sat, Apr 02, 2022 at 02:14:38PM +0800, Tao Liu wrote:
Hi Philipp,
On Wed, Mar 30, 2022 at 11:34 PM Philipp Rudo prudo@redhat.com wrote:
[...]
implemented the reservation in the kernel side, as part of the kernel crashkernel=auto implementation.
For swiotlb, I think Dave and Emma prefer to do it in kernel space. Maybe you can try a kernel space solution first like [1]?
[1] https://lore.kernel.org/lkml/20190910151341.14986-3-kasong@redhat.com/
For rhel9 however, crashkernel=auto has been mirrored to userspace kexec-tools, so this patch will enable the swiotlb reservation for AMD sme/sev in kexec-tools as well.
Not really about this patch but I find it weird that we adjust the crashkernel=auto value for swiotlb but not for other cases wher we know that we need a larger crashkernel, e.g. when the dump target is on a LUKS encrypted device. I understand that in the past, when crashkernel=auto was handled by the kernel, we simply didn't know about what the dump target will be when we allocated the crashkernel during boot. But today, when crashkernel=auto is handled in userspace we can.
I didn't have much context on LUKS, by looking into the code, I see only "kdumpctl estimate" has the LUKS related crashkernel value increment. It needs the user to adjust crashkernel value manually by estimate-and-set. Maybe we can merge the crashkernel adjustment of swiotlb and luks together, so luks don't need to "kdumpctl estimate" manually anymore?
For LUKS, a kernel-space solution by reusing LUKS master key is preferred because a) LUKS requires too much memory b) we can't expect the user to wait at the screen to input the passphrase to open the disk when kernel crashes. Let's see how far my work [2] can go.
you are right that it is better to solve the problem with LUKS in the kernel. I only used it as an example where today we know that we need more memory but don't automatically increase the crashkernel. So why do it for swiotlb but for others? What's the reason for this inconsistency?
And a little bit more general. What is the relation between 'kdumpctl reset-crashkernel' and 'kdumctl estimate'? The way I see it the workflow we suggest to users currently is
1. $ kdumpctl reser-crashkernel 2. -> test if it works, if not 3. $ kdumpctl estimate 4. -> set crashkernel manually to suggested value 5. -> test if it works, if not 6. -> set crashkernel manually to larger value 7. -> test if it works, if not go back to 6
I don't understand why we need the separate 'kdumpctl estimate' step in between. In my opinion the following workflow would be much simpler from a user perspective
1. $ kdumpctl reset-crashkernel (calls kdumpctl estimate internally using current default values as minimum values) 2. -> test if it woks, if not 3. -> set crashkernel manually larger value 4. -> test if it works, if not go back to 3
Thanks Philipp
Hi Philipp,
On Tue, Apr 05, 2022 at 11:37:02AM +0200, Philipp Rudo wrote:
Hi Coiby, Hi Tao,
On Sat, 2 Apr 2022 16:43:37 +0800 Coiby Xu coxu@redhat.com wrote:
Hi Tao,
On Sat, Apr 02, 2022 at 02:14:38PM +0800, Tao Liu wrote:
Hi Philipp,
On Wed, Mar 30, 2022 at 11:34 PM Philipp Rudo prudo@redhat.com wrote:
[...]
implemented the reservation in the kernel side, as part of the kernel crashkernel=auto implementation.
For swiotlb, I think Dave and Emma prefer to do it in kernel space. Maybe you can try a kernel space solution first like [1]?
[1] https://lore.kernel.org/lkml/20190910151341.14986-3-kasong@redhat.com/
For rhel9 however, crashkernel=auto has been mirrored to userspace kexec-tools, so this patch will enable the swiotlb reservation for AMD sme/sev in kexec-tools as well.
Not really about this patch but I find it weird that we adjust the crashkernel=auto value for swiotlb but not for other cases wher we know that we need a larger crashkernel, e.g. when the dump target is on a LUKS encrypted device. I understand that in the past, when crashkernel=auto was handled by the kernel, we simply didn't know about what the dump target will be when we allocated the crashkernel during boot. But today, when crashkernel=auto is handled in userspace we can.
I didn't have much context on LUKS, by looking into the code, I see only "kdumpctl estimate" has the LUKS related crashkernel value increment. It needs the user to adjust crashkernel value manually by estimate-and-set. Maybe we can merge the crashkernel adjustment of swiotlb and luks together, so luks don't need to "kdumpctl estimate" manually anymore?
For LUKS, a kernel-space solution by reusing LUKS master key is preferred because a) LUKS requires too much memory b) we can't expect the user to wait at the screen to input the passphrase to open the disk when kernel crashes. Let's see how far my work [2] can go.
you are right that it is better to solve the problem with LUKS in the kernel. I only used it as an example where today we know that we need more memory but don't automatically increase the crashkernel. So why do it for swiotlb but for others? What's the reason for this inconsistency?
And a little bit more general. What is the relation between 'kdumpctl reset-crashkernel' and 'kdumctl estimate'? The way I see it the workflow we suggest to users currently is
- $ kdumpctl reser-crashkernel
- -> test if it works, if not
- $ kdumpctl estimate
- -> set crashkernel manually to suggested value
- -> test if it works, if not
- -> set crashkernel manually to larger value
- -> test if it works, if not go back to 6
I don't understand why we need the separate 'kdumpctl estimate' step in between. In my opinion the following workflow would be much simpler from a user perspective
- $ kdumpctl reset-crashkernel (calls kdumpctl estimate internally using current default values as minimum values)
- -> test if it woks, if not
- -> set crashkernel manually larger value
- -> test if it works, if not go back to 3
Thanks for sharing your thoughts! This approach also looks better to me! Previously I also tried addressing the swiotlb case in "kdumpctl reset-crashkernel" but QE was worried customer requests to address other cases may flood in and we would quickly reach our capacity to process them. On the other end, I think I need more time to evaluate how "kdumpcl estimate" (including the --reboot feature, Karui's patches still under review) works. "kdumpctl estimate --reboot" seems more promising to me because it offers a general way to estimate the memory requirement of kdump kernel and we don't need to address memory requirement case by case. My plan is to unify "reset-crashkernel" and estimate after evaluating how well "estimate" works. For example, one issue with "estimate --reboot" is it fails to find out the peak memory consumption.
Thanks Philipp
Hi Coiby,
On Wed, 6 Apr 2022 18:31:19 +0800 Coiby Xu coxu@redhat.com wrote:
Hi Philipp,
On Tue, Apr 05, 2022 at 11:37:02AM +0200, Philipp Rudo wrote:
Hi Coiby, Hi Tao,
On Sat, 2 Apr 2022 16:43:37 +0800 Coiby Xu coxu@redhat.com wrote:
Hi Tao,
On Sat, Apr 02, 2022 at 02:14:38PM +0800, Tao Liu wrote:
Hi Philipp,
On Wed, Mar 30, 2022 at 11:34 PM Philipp Rudo prudo@redhat.com wrote:
[...]
implemented the reservation in the kernel side, as part of the kernel crashkernel=auto implementation.
For swiotlb, I think Dave and Emma prefer to do it in kernel space. Maybe you can try a kernel space solution first like [1]?
[1] https://lore.kernel.org/lkml/20190910151341.14986-3-kasong@redhat.com/
For rhel9 however, crashkernel=auto has been mirrored to userspace kexec-tools, so this patch will enable the swiotlb reservation for AMD sme/sev in kexec-tools as well.
Not really about this patch but I find it weird that we adjust the crashkernel=auto value for swiotlb but not for other cases wher we know that we need a larger crashkernel, e.g. when the dump target is on a LUKS encrypted device. I understand that in the past, when crashkernel=auto was handled by the kernel, we simply didn't know about what the dump target will be when we allocated the crashkernel during boot. But today, when crashkernel=auto is handled in userspace we can.
I didn't have much context on LUKS, by looking into the code, I see only "kdumpctl estimate" has the LUKS related crashkernel value increment. It needs the user to adjust crashkernel value manually by estimate-and-set. Maybe we can merge the crashkernel adjustment of swiotlb and luks together, so luks don't need to "kdumpctl estimate" manually anymore?
For LUKS, a kernel-space solution by reusing LUKS master key is preferred because a) LUKS requires too much memory b) we can't expect the user to wait at the screen to input the passphrase to open the disk when kernel crashes. Let's see how far my work [2] can go.
you are right that it is better to solve the problem with LUKS in the kernel. I only used it as an example where today we know that we need more memory but don't automatically increase the crashkernel. So why do it for swiotlb but for others? What's the reason for this inconsistency?
And a little bit more general. What is the relation between 'kdumpctl reset-crashkernel' and 'kdumctl estimate'? The way I see it the workflow we suggest to users currently is
- $ kdumpctl reser-crashkernel
- -> test if it works, if not
- $ kdumpctl estimate
- -> set crashkernel manually to suggested value
- -> test if it works, if not
- -> set crashkernel manually to larger value
- -> test if it works, if not go back to 6
I don't understand why we need the separate 'kdumpctl estimate' step in between. In my opinion the following workflow would be much simpler from a user perspective
- $ kdumpctl reset-crashkernel (calls kdumpctl estimate internally using current default values as minimum values)
- -> test if it woks, if not
- -> set crashkernel manually larger value
- -> test if it works, if not go back to 3
Thanks for sharing your thoughts! This approach also looks better to me! Previously I also tried addressing the swiotlb case in "kdumpctl reset-crashkernel" but QE was worried customer requests to address other cases may flood in and we would quickly reach our capacity to process them.
that's indeed a problem...
On the other end, I think I need more time to evaluate how "kdumpcl estimate" (including the --reboot feature, Karui's patches still under review) works. "kdumpctl estimate --reboot" seems more promising to me because it offers a general way to estimate the memory requirement of kdump kernel and we don't need to address memory requirement case by case.
Yes, that's a big benefit. The downside is that you need an extra reboot. Which can take quite some time on larger systems...
My plan is to unify "reset-crashkernel" and estimate after evaluating how well "estimate" works.
Sounds like a plan.
For example, one issue with "estimate --reboot" is it fails to find out the peak memory consumption.
Unfortunately that is a fundamental problem with dynamic memory allocation for which there is no perfect solution. We can always only tell what the requirements are right _now_ but not what they will be when the system crashes. So unless someone finds a magic crystal ball that can predict the future we have to live with that problem and try to get better with every problem we encounter...
Thanks Philipp
On Tue, Apr 5, 2022 at 5:37 PM Philipp Rudo prudo@redhat.com wrote:
Hi Coiby, Hi Tao,
On Sat, 2 Apr 2022 16:43:37 +0800 Coiby Xu coxu@redhat.com wrote:
Hi Tao,
On Sat, Apr 02, 2022 at 02:14:38PM +0800, Tao Liu wrote:
Hi Philipp,
On Wed, Mar 30, 2022 at 11:34 PM Philipp Rudo prudo@redhat.com wrote:
[...]
implemented the reservation in the kernel side, as part of the kernel crashkernel=auto implementation.
For swiotlb, I think Dave and Emma prefer to do it in kernel space. Maybe you can try a kernel space solution first like [1]?
[1] https://lore.kernel.org/lkml/20190910151341.14986-3-kasong@redhat.com/
For rhel9 however, crashkernel=auto has been mirrored to userspace kexec-tools, so this patch will enable the swiotlb reservation for AMD sme/sev in kexec-tools as well.
Not really about this patch but I find it weird that we adjust the crashkernel=auto value for swiotlb but not for other cases wher we know that we need a larger crashkernel, e.g. when the dump target is on a LUKS encrypted device. I understand that in the past, when crashkernel=auto was handled by the kernel, we simply didn't know about what the dump target will be when we allocated the crashkernel during boot. But today, when crashkernel=auto is handled in userspace we can.
I didn't have much context on LUKS, by looking into the code, I see only "kdumpctl estimate" has the LUKS related crashkernel value increment. It needs the user to adjust crashkernel value manually by estimate-and-set. Maybe we can merge the crashkernel adjustment of swiotlb and luks together, so luks don't need to "kdumpctl estimate" manually anymore?
For LUKS, a kernel-space solution by reusing LUKS master key is preferred because a) LUKS requires too much memory b) we can't expect the user to wait at the screen to input the passphrase to open the disk when kernel crashes. Let's see how far my work [2] can go.
Yeah, after I played around with LUKS for a while, I agree with your point. kernel side implementation is better for luks.
Just for curiosity, as for the case b), can't we use a password file instead of inputting passphrase manually every time? Like:
$ cat /etc/fstab /dev/mapper/mydata /mnt ext4 defaults 0 0 $ cat /etc/crypttab mydata /dev/vdb /root/lukspass $ cryptsetup luksAddKey /dev/vdb /root/lukspass
If /root/lukspass is accessible, luks device can be automatically mounted, thus no user interaction needed.
you are right that it is better to solve the problem with LUKS in the kernel. I only used it as an example where today we know that we need more memory but don't automatically increase the crashkernel. So why do it for swiotlb but for others? What's the reason for this inconsistency?
By looking into the "kdumpctl estimate" code, the memory estimating for luks is calculated dynamically, depending on how many crypt devices there are, and their "Memory:" field status.
However for amd sme/sev swiotlb, the reserved memory for 2nd kernel is 64M, see rhel-8 kernel commit ab30d3f ("[x86] Reserve at most 64M of SWIOTLB memory for crashkernel").
I implement it in the kexec-tools side, because the rhel-8 patch is part of craskernel=auto implementation. As far as I know crashkernel=auto has been mirrored to kexec-tools side, so it is reasonable for me to make it in kexec-tools side as well.
And a little bit more general. What is the relation between 'kdumpctl reset-crashkernel' and 'kdumctl estimate'? The way I see it the workflow we suggest to users currently is
- $ kdumpctl reser-crashkernel
- -> test if it works, if not
- $ kdumpctl estimate
- -> set crashkernel manually to suggested value
- -> test if it works, if not
- -> set crashkernel manually to larger value
- -> test if it works, if not go back to 6
Thanks for the summarization. I happened to lose the context, which confused me a lot when reading the code.
I don't understand why we need the separate 'kdumpctl estimate' step in between. In my opinion the following workflow would be much simpler from a user perspective
- $ kdumpctl reset-crashkernel (calls kdumpctl estimate internally using current default values as minimum values)
- -> test if it woks, if not
- -> set crashkernel manually larger value
- -> test if it works, if not go back to 3
Agreed!
Thanks, Tao Liu
Thanks Philipp
Hi Tao,
On Wed, 6 Apr 2022 18:55:38 +0800 Tao Liu ltao@redhat.com wrote:
On Tue, Apr 5, 2022 at 5:37 PM Philipp Rudo prudo@redhat.com wrote:
Hi Coiby, Hi Tao,
On Sat, 2 Apr 2022 16:43:37 +0800 Coiby Xu coxu@redhat.com wrote:
Hi Tao,
On Sat, Apr 02, 2022 at 02:14:38PM +0800, Tao Liu wrote:
Hi Philipp,
On Wed, Mar 30, 2022 at 11:34 PM Philipp Rudo prudo@redhat.com wrote:
[...]
implemented the reservation in the kernel side, as part of the kernel crashkernel=auto implementation.
For swiotlb, I think Dave and Emma prefer to do it in kernel space. Maybe you can try a kernel space solution first like [1]?
[1] https://lore.kernel.org/lkml/20190910151341.14986-3-kasong@redhat.com/
For rhel9 however, crashkernel=auto has been mirrored to userspace kexec-tools, so this patch will enable the swiotlb reservation for AMD sme/sev in kexec-tools as well.
Not really about this patch but I find it weird that we adjust the crashkernel=auto value for swiotlb but not for other cases wher we know that we need a larger crashkernel, e.g. when the dump target is on a LUKS encrypted device. I understand that in the past, when crashkernel=auto was handled by the kernel, we simply didn't know about what the dump target will be when we allocated the crashkernel during boot. But today, when crashkernel=auto is handled in userspace we can.
I didn't have much context on LUKS, by looking into the code, I see only "kdumpctl estimate" has the LUKS related crashkernel value increment. It needs the user to adjust crashkernel value manually by estimate-and-set. Maybe we can merge the crashkernel adjustment of swiotlb and luks together, so luks don't need to "kdumpctl estimate" manually anymore?
For LUKS, a kernel-space solution by reusing LUKS master key is preferred because a) LUKS requires too much memory b) we can't expect the user to wait at the screen to input the passphrase to open the disk when kernel crashes. Let's see how far my work [2] can go.
Yeah, after I played around with LUKS for a while, I agree with your point. kernel side implementation is better for luks.
Just for curiosity, as for the case b), can't we use a password file instead of inputting passphrase manually every time? Like:
$ cat /etc/fstab /dev/mapper/mydata /mnt ext4 defaults 0 0 $ cat /etc/crypttab mydata /dev/vdb /root/lukspass $ cryptsetup luksAddKey /dev/vdb /root/lukspass
If /root/lukspass is accessible, luks device can be automatically mounted, thus no user interaction needed.
Sounds like an interesting idea. My only concern is that we introduce a potential security risk, e.g. somebody might extract the key file from the initrd and use it to access a disk without permission. But I don't have much experience in the area. So I think it's definitely something worth to further investigate.
you are right that it is better to solve the problem with LUKS in the kernel. I only used it as an example where today we know that we need more memory but don't automatically increase the crashkernel. So why do it for swiotlb but for others? What's the reason for this inconsistency?
By looking into the "kdumpctl estimate" code, the memory estimating for luks is calculated dynamically, depending on how many crypt devices there are, and their "Memory:" field status.
However for amd sme/sev swiotlb, the reserved memory for 2nd kernel is 64M, see rhel-8 kernel commit ab30d3f ("[x86] Reserve at most 64M of SWIOTLB memory for crashkernel").
I implement it in the kexec-tools side, because the rhel-8 patch is part of craskernel=auto implementation. As far as I know crashkernel=auto has been mirrored to kexec-tools side, so it is reasonable for me to make it in kexec-tools side as well.
I fully understand. But remember that the crashkernel needs to be reserved extremely early during boot. Even before the initramfs is mounted. So when crashkernel=auto was implemented in the kernel we simply didn't had information about many things, like what the root device is and if it is encrypted. swiotlb was one of the few exceptions. By moving crashkernel=auto to the kexec-tools we now have the information and could use it. If we decide to do so (see Coibys mail).
And a little bit more general. What is the relation between 'kdumpctl reset-crashkernel' and 'kdumctl estimate'? The way I see it the workflow we suggest to users currently is
- $ kdumpctl reser-crashkernel
- -> test if it works, if not
- $ kdumpctl estimate
- -> set crashkernel manually to suggested value
- -> test if it works, if not
- -> set crashkernel manually to larger value
- -> test if it works, if not go back to 6
Thanks for the summarization. I happened to lose the context, which confused me a lot when reading the code.
np
I don't understand why we need the separate 'kdumpctl estimate' step in between. In my opinion the following workflow would be much simpler from a user perspective
- $ kdumpctl reset-crashkernel (calls kdumpctl estimate internally using current default values as minimum values)
- -> test if it woks, if not
- -> set crashkernel manually larger value
- -> test if it works, if not go back to 3
Agreed!
Thanks Philipp
On Thu, Apr 7, 2022 at 10:44 PM Philipp Rudo prudo@redhat.com wrote:
Hi Tao,
On Wed, 6 Apr 2022 18:55:38 +0800 Tao Liu ltao@redhat.com wrote:
On Tue, Apr 5, 2022 at 5:37 PM Philipp Rudo prudo@redhat.com wrote:
Hi Coiby, Hi Tao,
On Sat, 2 Apr 2022 16:43:37 +0800 Coiby Xu coxu@redhat.com wrote:
Hi Tao,
On Sat, Apr 02, 2022 at 02:14:38PM +0800, Tao Liu wrote:
Hi Philipp,
On Wed, Mar 30, 2022 at 11:34 PM Philipp Rudo prudo@redhat.com wrote:
[...]
> implemented the reservation in the kernel side, as part of the kernel > crashkernel=auto implementation.
For swiotlb, I think Dave and Emma prefer to do it in kernel space. Maybe you can try a kernel space solution first like [1]?
[1] https://lore.kernel.org/lkml/20190910151341.14986-3-kasong@redhat.com/
> > For rhel9 however, crashkernel=auto has been mirrored to userspace > kexec-tools, so this patch will enable the swiotlb reservation for AMD > sme/sev in kexec-tools as well.
Not really about this patch but I find it weird that we adjust the crashkernel=auto value for swiotlb but not for other cases wher we know that we need a larger crashkernel, e.g. when the dump target is on a LUKS encrypted device. I understand that in the past, when crashkernel=auto was handled by the kernel, we simply didn't know about what the dump target will be when we allocated the crashkernel during boot. But today, when crashkernel=auto is handled in userspace we can.
I didn't have much context on LUKS, by looking into the code, I see only "kdumpctl estimate" has the LUKS related crashkernel value increment. It needs the user to adjust crashkernel value manually by estimate-and-set. Maybe we can merge the crashkernel adjustment of swiotlb and luks together, so luks don't need to "kdumpctl estimate" manually anymore?
For LUKS, a kernel-space solution by reusing LUKS master key is preferred because a) LUKS requires too much memory b) we can't expect the user to wait at the screen to input the passphrase to open the disk when kernel crashes. Let's see how far my work [2] can go.
Yeah, after I played around with LUKS for a while, I agree with your point. kernel side implementation is better for luks.
Just for curiosity, as for the case b), can't we use a password file instead of inputting passphrase manually every time? Like:
$ cat /etc/fstab /dev/mapper/mydata /mnt ext4 defaults 0 0 $ cat /etc/crypttab mydata /dev/vdb /root/lukspass $ cryptsetup luksAddKey /dev/vdb /root/lukspass
If /root/lukspass is accessible, luks device can be automatically mounted, thus no user interaction needed.
Sounds like an interesting idea. My only concern is that we introduce a potential security risk, e.g. somebody might extract the key file from the initrd and use it to access a disk without permission. But I don't have much experience in the area. So I think it's definitely something worth to further investigate.
Yes, that indeed is a security risk... I think the /root/lukspass is a file which can be managed just like /etc/passwd, /etc/shadow or /root/.ssh/id_rsa, which only the root user/process has access. The initramfs is also only accessible to root user for the file permission is 600 on my machine...
you are right that it is better to solve the problem with LUKS in the kernel. I only used it as an example where today we know that we need more memory but don't automatically increase the crashkernel. So why do it for swiotlb but for others? What's the reason for this inconsistency?
By looking into the "kdumpctl estimate" code, the memory estimating for luks is calculated dynamically, depending on how many crypt devices there are, and their "Memory:" field status.
However for amd sme/sev swiotlb, the reserved memory for 2nd kernel is 64M, see rhel-8 kernel commit ab30d3f ("[x86] Reserve at most 64M of SWIOTLB memory for crashkernel").
I implement it in the kexec-tools side, because the rhel-8 patch is part of craskernel=auto implementation. As far as I know crashkernel=auto has been mirrored to kexec-tools side, so it is reasonable for me to make it in kexec-tools side as well.
I fully understand. But remember that the crashkernel needs to be reserved extremely early during boot. Even before the initramfs is mounted. So when crashkernel=auto was implemented in the kernel we simply didn't had information about many things, like what the root device is and if it is encrypted. swiotlb was one of the few exceptions. By moving crashkernel=auto to the kexec-tools we now have the information and could use it. If we decide to do so (see Coibys mail).
Thanks for the explanation!
Thanks, Tao Liu
And a little bit more general. What is the relation between 'kdumpctl reset-crashkernel' and 'kdumctl estimate'? The way I see it the workflow we suggest to users currently is
- $ kdumpctl reser-crashkernel
- -> test if it works, if not
- $ kdumpctl estimate
- -> set crashkernel manually to suggested value
- -> test if it works, if not
- -> set crashkernel manually to larger value
- -> test if it works, if not go back to 6
Thanks for the summarization. I happened to lose the context, which confused me a lot when reading the code.
np
I don't understand why we need the separate 'kdumpctl estimate' step in between. In my opinion the following workflow would be much simpler from a user perspective
- $ kdumpctl reset-crashkernel (calls kdumpctl estimate internally using current default values as minimum values)
- -> test if it woks, if not
- -> set crashkernel manually larger value
- -> test if it works, if not go back to 3
Agreed!
Thanks Philipp
On Thu, Apr 07, 2022 at 04:44:01PM +0200, Philipp Rudo wrote:
Hi Tao,
On Wed, 6 Apr 2022 18:55:38 +0800 Tao Liu ltao@redhat.com wrote:
On Tue, Apr 5, 2022 at 5:37 PM Philipp Rudo prudo@redhat.com wrote:
Hi Coiby, Hi Tao,
On Sat, 2 Apr 2022 16:43:37 +0800 Coiby Xu coxu@redhat.com wrote:
Hi Tao,
On Sat, Apr 02, 2022 at 02:14:38PM +0800, Tao Liu wrote:
Hi Philipp,
On Wed, Mar 30, 2022 at 11:34 PM Philipp Rudo prudo@redhat.com wrote:
[...]
> implemented the reservation in the kernel side, as part of the kernel > crashkernel=auto implementation.
For swiotlb, I think Dave and Emma prefer to do it in kernel space. Maybe you can try a kernel space solution first like [1]?
[1] https://lore.kernel.org/lkml/20190910151341.14986-3-kasong@redhat.com/
> > For rhel9 however, crashkernel=auto has been mirrored to userspace > kexec-tools, so this patch will enable the swiotlb reservation for AMD > sme/sev in kexec-tools as well.
Not really about this patch but I find it weird that we adjust the crashkernel=auto value for swiotlb but not for other cases wher we know that we need a larger crashkernel, e.g. when the dump target is on a LUKS encrypted device. I understand that in the past, when crashkernel=auto was handled by the kernel, we simply didn't know about what the dump target will be when we allocated the crashkernel during boot. But today, when crashkernel=auto is handled in userspace we can.
I didn't have much context on LUKS, by looking into the code, I see only "kdumpctl estimate" has the LUKS related crashkernel value increment. It needs the user to adjust crashkernel value manually by estimate-and-set. Maybe we can merge the crashkernel adjustment of swiotlb and luks together, so luks don't need to "kdumpctl estimate" manually anymore?
For LUKS, a kernel-space solution by reusing LUKS master key is preferred because a) LUKS requires too much memory b) we can't expect the user to wait at the screen to input the passphrase to open the disk when kernel crashes. Let's see how far my work [2] can go.
Yeah, after I played around with LUKS for a while, I agree with your point. kernel side implementation is better for luks.
Just for curiosity, as for the case b), can't we use a password file instead of inputting passphrase manually every time? Like:
$ cat /etc/fstab /dev/mapper/mydata /mnt ext4 defaults 0 0 $ cat /etc/crypttab mydata /dev/vdb /root/lukspass $ cryptsetup luksAddKey /dev/vdb /root/lukspass
If /root/lukspass is accessible, luks device can be automatically mounted, thus no user interaction needed.
Sounds like an interesting idea. My only concern is that we introduce a potential security risk, e.g. somebody might extract the key file from the initrd and use it to access a disk without permission. But I don't have much experience in the area. So I think it's definitely something worth to further investigate.
Yes, although I don't know how it may be exploited, I think some people may object of the idea of using a passpharse file. Btw, this only partially solves the LUKS problem because the second issue is using the passpharse to compute the master key consumes the same amount of memory as using the password manually by the user.
you are right that it is better to solve the problem with LUKS in the kernel. I only used it as an example where today we know that we need more memory but don't automatically increase the crashkernel. So why do it for swiotlb but for others? What's the reason for this inconsistency?
By looking into the "kdumpctl estimate" code, the memory estimating for luks is calculated dynamically, depending on how many crypt devices there are, and their "Memory:" field status.
However for amd sme/sev swiotlb, the reserved memory for 2nd kernel is 64M, see rhel-8 kernel commit ab30d3f ("[x86] Reserve at most 64M of SWIOTLB memory for crashkernel").
Note the extra memory is not also dynamic for the sme/sev case for the kernel implementation,
if (mem_encrypt_active()) mem_enc_req = min(ALIGN(swiotlb_size_or_default(), SZ_1M), 64UL << 20); else mem_enc_req = 0;
I implement it in the kexec-tools side, because the rhel-8 patch is part of craskernel=auto implementation. As far as I know crashkernel=auto has been mirrored to kexec-tools side, so it is reasonable for me to make it in kexec-tools side as well.
I fully understand. But remember that the crashkernel needs to be reserved extremely early during boot. Even before the initramfs is mounted. So when crashkernel=auto was implemented in the kernel we simply didn't had information about many things, like what the root device is and if it is encrypted. swiotlb was one of the few exceptions. By moving crashkernel=auto to the kexec-tools we now have the information and could use it. If we decide to do so (see Coibys mail).
Actually, implementing it in kernel has one benefit which is we can calculate the memory requirement before allocating kdump memory so there is no need to reboot the system.
There is also another problem for implementing it in kexec-tools. Currently kexec-tools assume the default crashernel values are static and only try to update the kernel cmdline parameter crashkernel when the package is updated. For the case of the user dynamically toggling sme/sev on/off, we need a mechanism to automatically update the crashkernel parameter and then reboot the system. And this is not a concern if we implement it in kernel side.