The current method for debugging kdump memory usage is dracut's "rd.memdebug=[0-3]", which is not enough for debugging kernel modules. For example, when we want to find out which kernel module consumes a large amount of memory, "rd.memdebug" does not help much.
A better way is needed, which would be very useful for kdump OOM debugging.
This patch series uses kernel tracing to track slab and buddy allocation calls during kernel module loading (module_init), so that we can analyze the trace data and get the total memory consumption.
The traced memory events, under /sys/kernel/debug/tracing/events, are:
  kmem/mm_page_alloc
  kmem/mm_page_free
  kmem/kmalloc
  kmem/kmalloc_node
  kmem/kmem_cache_alloc
  kmem/kmem_cache_alloc_node
We also inspect the following events to detect module loading:
  module/module_load
  module/module_put
We can get the module name and task pid from the "module_load" event, which also marks the beginning of loading; a "module_put" called by the same task pid implies the end of loading. So the memory events recorded in between by the same task pid are attributed to the loading of this module (i.e. modprobe or module_init()).
With this information, we can record the approximate total memory consumption of each kernel module's loading.
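The attribution rule described above can be sketched in a few lines of bash. This is an illustration only, not the script from the patch: the function name account_kmalloc and the sample records are made up, only kmalloc events are counted, and the records are assumed to follow the "task cpu flags timestamp event" layout.

```shell
#!/bin/bash
# Illustrative only: attribute kmalloc events to the module being loaded
# by the same task, between its module_load and module_put events.
# Reads trace lines on stdin, prints "<module>: <bytes>" per module.
account_kmalloc()
{
    local -A current_module   # task -> module currently being loaded
    local -A nr_kmalloc       # module -> accumulated bytes_alloc
    local task cpu flags ts event name bytes

    while read -r task cpu flags ts event; do
        [[ $task = "#" ]] && continue          # skip header comments
        case $event in
            module_load:*)
                name=${event#*: }
                current_module[$task]=$name
                nr_kmalloc[$name]=0
                ;;
            module_put:*)
                unset "current_module[$task]"  # loading finished
                ;;
            kmalloc:*)
                name=${current_module[$task]}
                [[ $name ]] || continue        # task is not loading a module
                bytes=${event##*bytes_alloc=}
                bytes=${bytes%% *}
                nr_kmalloc[$name]=$(( ${nr_kmalloc[$name]} + bytes ))
                ;;
        esac
    done

    for name in "${!nr_kmalloc[@]}"; do
        echo "$name: ${nr_kmalloc[$name]}"
    done
}

# Made-up sample records in the field layout the parser expects.
sample_trace='# tracer: nop
modprobe-101 [000] .... 1.000001: module_load: foo
modprobe-101 [000] .... 1.000002: kmalloc: bytes_req=100 bytes_alloc=128 gfp_flags=GFP_KERNEL
systemd-1 [001] .... 1.000003: kmalloc: bytes_req=30 bytes_alloc=32 gfp_flags=GFP_KERNEL
modprobe-101 [000] .... 1.000004: module_put: foo
modprobe-101 [000] .... 1.000005: kmalloc: bytes_req=60 bytes_alloc=64 gfp_flags=GFP_KERNEL'

account_kmalloc <<< "$sample_trace"
```

With the sample above this prints "foo: 128": the allocation by the other task and the one made after module_put are not counted.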
One major flaw of this method is that the trace ring buffer consumes a lot of memory. If it is too small, old records may be overwritten by subsequent records. The script sets the trace ring buffer to 10MB by default, but users can override this via the standard kernel boot parameter "trace_buf_size".
Users should increase the crash kernel memory reservation as needed after setting a large trace ring buffer size, in case OOM happens during debugging.
Usage:
1) Pass "rd.memdebug" on the kdump kernel cmdline using "KDUMP_COMMANDLINE_APPEND" in /etc/sysconfig/kdump.
2) Pass the extra "trace_buf_size=nn[KMG]" to specify the trace ring buffer size (per cpu) as needed.
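Because the ring buffer size is per cpu, a rough estimate of the extra memory to reserve is the per-cpu size multiplied by the number of CPUs the kdump kernel brings up. The helper below is a hypothetical sketch of that arithmetic, not part of the patch, and it ignores any other tracing overhead.

```shell
#!/bin/bash
# Rough estimate only: the ring buffer size is per CPU, so the total
# grows with the number of CPUs the kdump kernel brings up.
# $1: per-CPU buffer size in KB   $2: number of CPUs
trace_buf_total_kb()
{
    echo $(( $1 * $2 ))
}

# 10MB per CPU (the script's default) on a single-CPU kdump kernel
trace_buf_total_kb 10240 1
# The same setting if four CPUs were online
trace_buf_total_kb 10240 4
```

The two calls print 10240 and 40960 (KB) respectively.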
Xunlei Pang (2):
  memdebug-ko: add dracut-memdebug-ko.sh to debug kernel module memory consumption
  module-setup: apply kernel module memory debug support

 dracut-kdump.sh        |  11 ++++
 dracut-memdebug-ko.sh  | 144 +++++++++++++++++++++++++++++++++++++++++++++++++
 dracut-module-setup.sh |  12 +++++
 kdumpctl               |  14 +++++
 kexec-tools.spec       |   2 +
 5 files changed, 183 insertions(+)
 create mode 100755 dracut-memdebug-ko.sh
Add dracut-memdebug-ko.sh and install it into the dracut kdump module.
The principle is to use kernel tracing to track slab and buddy allocation calls during kernel module loading (module_init), so that we can analyze the trace data and get the total memory consumption.
One major flaw of this method is that it consumes a lot of memory; users should increase the crash kernel memory reservation as needed.
Signed-off-by: Xunlei Pang <xlpang@redhat.com>
---
 dracut-memdebug-ko.sh | 151 ++++++++++++++++++++++++++++++++++++++++++++++++++
 kexec-tools.spec      |   2 +
 2 files changed, 153 insertions(+)
 create mode 100755 dracut-memdebug-ko.sh

diff --git a/dracut-memdebug-ko.sh b/dracut-memdebug-ko.sh
new file mode 100755
index 0000000..bb404ce
--- /dev/null
+++ b/dracut-memdebug-ko.sh
@@ -0,0 +1,151 @@
+#Debug the number of memory through alloc_pages and kmalloc consumed
+#by kernel modules during loading(i.e. modprobe).
+#NOTE: kmalloc may trigger alloc_pages, thus resulting in double account.
+
+if ! [[ -e /sys/kernel/debug/tracing ]]; then
+    mount none -t debugfs /sys/kernel/debug
+    if ! [[ -d /sys/kernel/debug/tracing ]]; then
+        warn "Mount debugfs failed, can't activate trace, skip kernel module memory analyzing!"
+        return 0
+    fi
+
+    # 10MB should be big enough for most cases?
+    # If the current ring buffer size is the default one(contains "expanded" keyword),
+    # set it to 10MB. Users can set other size via "trace_buf_size" kernel boot command.
+    cat /proc/cmdline | grep -q "trace_buf_size="
+    if [[ $? -ne 0 ]]; then
+        echo 10240 > /sys/kernel/debug/tracing/buffer_size_kb
+    fi
+
+    # Prepare trace for the first time
+    echo 1 > /sys/kernel/debug/tracing/events/kmem/mm_page_alloc/enable
+    echo 1 > /sys/kernel/debug/tracing/events/kmem/mm_page_free/enable
+    echo 1 > /sys/kernel/debug/tracing/events/kmem/kmalloc/enable
+    echo 1 > /sys/kernel/debug/tracing/events/kmem/kmalloc_node/enable
+    echo 1 > /sys/kernel/debug/tracing/events/kmem/kmem_cache_alloc/enable
+    echo 1 > /sys/kernel/debug/tracing/events/kmem/kmem_cache_alloc_node/enable
+    echo 1 > /sys/kernel/debug/tracing/events/module/module_load/enable
+    echo 1 > /sys/kernel/debug/tracing/events/module/module_put/enable
+    echo 1 > /sys/kernel/debug/tracing/tracing_on
+    return 0
+fi
+
+echo 0 > /sys/kernel/debug/tracing/tracing_on
+TMPFILE=/tmp/tmp$$$$
+cp /sys/kernel/debug/tracing/trace $TMPFILE -f
+# Clear old trace data after copied away
+echo > /sys/kernel/debug/tracing/trace
+
+#Indexed by task pid.
+declare -A current_module
+
+#Indexed by module name.
+declare -A module_loaded
+declare -A nr_alloc_pages
+declare -A nr_alloc_pages_peak
+declare -A nr_kmalloc
+#For x86: If the request size of kmalloc is greater than 2*PAGE_SIZE, SLUB will use buddy instead.
+#So we maintain the statistics of the large kmalloc requests. For ppc64, PAGE_SIZE is not 4096,
+#but the large kmalloc request is not very common, this information is just to give some tips.
+declare -A nr_kmalloc_above8192
+
+declare -A nr_kmem_cache_alloc
+
+# $1: order of pages
+order_to_pages()
+{
+    local pages=1
+    local order=$1
+
+    while [[ $order != 0 ]]; do
+        order=$((order-1))
+        pages=$(($pages*2))
+    done
+
+    echo $pages
+}
+
+while read pid cpu flags ts function ;
+do
+    #Skip comment lines
+    if [[ $pid = "#" ]]; then
+        continue
+    fi
+
+    if [[ $function = module_load* ]]; then
+        #One module is being loaded, save the task pid for tracking.
+        module_name=${function#*: }
+        module_names+=" $module_name"
+        current_module[$pid]="$module_name"
+        [[ ${module_loaded[$module_name]} ]] && warn "\"$module_name\" was loaded multiple times!"
+        unset module_loaded[$module_name]
+        nr_alloc_pages[$module_name]=0
+        nr_alloc_pages_peak[$module_name]=0
+        nr_kmalloc[$module_name]=0
+        nr_kmalloc_above8192[$module_name]=0
+        nr_kmem_cache_alloc[$module_name]=0
+    fi
+
+    if ! [[ ${current_module[$pid]} ]]; then
+        continue
+    fi
+
+    if [[ $function = module_put* ]]; then
+        #Mark the module as loaded
+        module_loaded[${current_module[$pid]}]=1
+        #module has been loaded when module_put is called, untrack the task
+        unset current_module[$pid]
+        continue
+    fi
+
+    #Once we get here, the task is being tracked(is loading a module).
+    #Get the module name.
+    module_name=${current_module[$pid]}
+
+    if [[ $function = mm_page_alloc* ]]; then
+        order=$(echo $function | sed -e 's/.*order=\([0-9]*\) .*/\1/')
+        nr_alloc_pages[$module_name]=$((${nr_alloc_pages[$module_name]}+$(order_to_pages $order)))
+        if [[ ${nr_alloc_pages[$module_name]} -gt ${nr_alloc_pages_peak[$module_name]} ]]; then
+            nr_alloc_pages_peak[$module_name]=${nr_alloc_pages[$module_name]}
+        fi
+    fi
+
+    if [[ $function = mm_page_free* ]]; then
+        order=$(echo $function | sed -e 's/.*order=\([0-9]*\)/\1/')
+        nr_alloc_pages[$module_name]=$((${nr_alloc_pages[$module_name]}-$(order_to_pages $order)))
+    fi
+
+    if [[ $function = kmalloc* ]]; then
+        bytes_alloc=$(echo $function | sed -e 's/.*bytes_alloc=\([0-9]*\).*/\1/')
+        nr_kmalloc[$module_name]=$((${nr_kmalloc[$module_name]}+$bytes_alloc))
+        if [[ $bytes_alloc -gt 8192 ]]; then
+            nr_kmalloc_above8192[$module_name]=$((${nr_kmalloc_above8192[$module_name]}+$bytes_alloc))
+        fi
+    fi
+
+    if [[ $function = kmem_cache_alloc* ]]; then
+        bytes_alloc=$(echo $function | sed -e 's/.*bytes_alloc=\([0-9]*\).*/\1/')
+        nr_kmem_cache_alloc[$module_name]=$((${nr_kmem_cache_alloc[$module_name]}+$bytes_alloc))
+    fi
+done < $TMPFILE
+
+echo -e "\n\n== debug_mem for kernel modules during loading begin ==" >&2
+for i in $module_names; do
+    status="finished"
+    if ! [[ ${module_loaded[$i]} ]]; then
+        status="loading"
+    fi
+    echo -e "[Module Name] "$i" (loading status: $status)" >&2
+    echo -e "alloc_pages: consumed ${nr_alloc_pages[$i]} pages (peak: ${nr_alloc_pages_peak[$i]} pages)" >&2
+    echo -e "kmalloc: consumed ${nr_kmalloc[$i]} bytes (above8192: ${nr_kmalloc_above8192[$i]} bytes)\n" >&2
+    echo -e "kmem_cache_alloc: consumed ${nr_kmem_cache_alloc[$i]} bytes\n" >&2
+done
+echo -e "== debug_mem for kernel modules during loading end ==\n\n" >&2
+
+unset module_names
+unset module_loaded
+
+rm $TMPFILE -f
+echo 1 > /sys/kernel/debug/tracing/tracing_on
+
+return 0
diff --git a/kexec-tools.spec b/kexec-tools.spec
index 0bbaf72..1f0b7f5 100644
--- a/kexec-tools.spec
+++ b/kexec-tools.spec
@@ -40,6 +40,7 @@ Source103: dracut-kdump-error-handler.sh
 Source104: dracut-kdump-emergency.service
 Source105: dracut-kdump-error-handler.service
 Source106: dracut-kdump-capture.service
+Source107: dracut-memdebug-ko.sh
 
 Requires(post): systemd-units
 Requires(preun): systemd-units
@@ -203,6 +204,7 @@ cp %{SOURCE103} $RPM_BUILD_ROOT/etc/kdump-adv-conf/kdump_dracut_modules/99kdumpb
 cp %{SOURCE104} $RPM_BUILD_ROOT/etc/kdump-adv-conf/kdump_dracut_modules/99kdumpbase/%{remove_dracut_prefix %{SOURCE104}}
 cp %{SOURCE105} $RPM_BUILD_ROOT/etc/kdump-adv-conf/kdump_dracut_modules/99kdumpbase/%{remove_dracut_prefix %{SOURCE105}}
 cp %{SOURCE106} $RPM_BUILD_ROOT/etc/kdump-adv-conf/kdump_dracut_modules/99kdumpbase/%{remove_dracut_prefix %{SOURCE106}}
+cp %{SOURCE107} $RPM_BUILD_ROOT/etc/kdump-adv-conf/kdump_dracut_modules/99kdumpbase/%{remove_dracut_prefix %{SOURCE107}}
 chmod 755 $RPM_BUILD_ROOT/etc/kdump-adv-conf/kdump_dracut_modules/99kdumpbase/%{remove_dracut_prefix %{SOURCE100}}
 chmod 755 $RPM_BUILD_ROOT/etc/kdump-adv-conf/kdump_dracut_modules/99kdumpbase/%{remove_dracut_prefix %{SOURCE101}}
Hi, Xunlei
Nice work. Thanks for the effort.
Several things I would like to clarify:
*) Can it be put in dracut? In the kdump scripts we just install a hook at kdump_pre; other hooks go into the dracut code.
*) How did you arrive at the 10M trace buffer size? Is it based on some test results?
*) Could we avoid the temp file?
*) You trace the mem alloc functions, but do we need to trace *_free as well?
Thanks
Dave
--
1.8.3.1
_______________________________________________
kexec mailing list -- kexec@lists.fedoraproject.org
To unsubscribe send an email to kexec-leave@lists.fedoraproject.org
On 2016/10/14 at 14:31, Dave Young wrote:
> Hi, Xunlei
>
> Nice work. Thanks for the effort.
>
> Several things I would like to clarify:
>
> *) Can it be put in dracut? In the kdump scripts we just install a hook at kdump_pre; other hooks go into the dracut code.

I can try to do that, but I am not sure whether Harald would like this approach, because it consumes a lot of memory.

> *) How did you arrive at the 10M trace buffer size? Is it based on some test results?

Yes, it is based on some test results. But users can override it by passing "trace_buf_size" on the cmdline.

> *) Could we avoid the temp file?

I copy the trace to a temp file because I found that reading directly from the trace file under /sys is very slow.

> *) You trace the mem alloc functions, but do we need to trace *_free as well?

I added it in case of alloc/free during module_init; for general cases, it should be fine. I can remove it, which will also avoid some trace data, if you think so :-)

Also, slab related functions normally don't cause large memory allocations, but I encountered such a case with the "qxl" module.

Regards,
Xunlei

> Thanks
> Dave
> +# $1: order of pages
> +order_to_pages()
> +{
> +    local pages=1
> +    local order=$1
> +
> +    while [[ $order != 0 ]]; do
> +        order=$((order-1))
> +        pages=$(($pages*2))
> +    done
> +
> +    echo $pages
> +}

For bash we can use $((2 ** $order)) to get the number.

Thanks
Dave
On 2016/10/14 at 14:33, Dave Young wrote:
> > +# $1: order of pages
> > +order_to_pages()
> > +{
> > +    local pages=1
> > +    local order=$1
> > +
> > +    while [[ $order != 0 ]]; do
> > +        order=$((order-1))
> > +        pages=$(($pages*2))
> > +    done
> > +
> > +    echo $pages
> > +}
>
> For bash we can use $((2 ** $order)) to get the number.

Indeed, will do.

Regards,
Xunlei
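For reference, the suggestion above could look roughly like the following. This is a sketch of the idea, not the follow-up version of the patch; the shift-based variant and its name order_to_pages_shift are an added illustration.

```shell
#!/bin/bash
# $1: order of pages
# Bash arithmetic supports exponentiation, so no loop is needed.
order_to_pages()
{
    echo $((2 ** $1))
}

# Equivalent form using a left shift (hypothetical alternative).
order_to_pages_shift()
{
    echo $((1 << $1))
}

order_to_pages 0
order_to_pages 3
```

Both helpers return 1 for order 0 and 8 for order 3, matching the loop in the posted patch.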
Hi Xunlei,
Thanks for this work. It would be a great help in debugging kdump oom memory issues.
On Monday 10 October 2016 01:13 PM, Xunlei Pang wrote:
> Add dracut-memdebug-ko.sh and install it into the dracut kdump module.
>
> The principle is to use kernel tracing to track slab and buddy allocation calls during kernel module loading (module_init), so that we can analyze the trace data and get the total memory consumption.
>
> One major flaw of this method is that it consumes a lot of memory; users should increase the crash kernel memory reservation as needed.
>
> Signed-off-by: Xunlei Pang <xlpang@redhat.com>
> ---
>  dracut-memdebug-ko.sh | 151 ++++++++++++++++++++++++++++++++++++++++++++++++++
>  kexec-tools.spec      |   2 +
>  2 files changed, 153 insertions(+)
>  create mode 100755 dracut-memdebug-ko.sh
>
> diff --git a/dracut-memdebug-ko.sh b/dracut-memdebug-ko.sh
> new file mode 100755
> index 0000000..bb404ce
> --- /dev/null
> +++ b/dracut-memdebug-ko.sh
> @@ -0,0 +1,151 @@
> +#Debug the number of memory through alloc_pages and kmalloc consumed
> +#by kernel modules during loading(i.e. modprobe).
> +#NOTE: kmalloc may trigger alloc_pages, thus resulting in double account.
> +
> +if ! [[ -e /sys/kernel/debug/tracing ]]; then
> +    mount none -t debugfs /sys/kernel/debug

Maybe you can also check whether tracefs is mounted on /sys/kernel/tracing? Trace access through debugfs will be obsolete soon.

> +    if ! [[ -d /sys/kernel/debug/tracing ]]; then
> +        warn "Mount debugfs failed, can't activate trace, skip kernel module memory analyzing!"
> +        return 0
> +    fi
> +
> +    # 10MB should be big enough for most cases?
> +    # If the current ring buffer size is the default one(contains "expanded" keyword),
> +    # set it to 10MB. Users can set other size via "trace_buf_size" kernel boot command.
> +    cat /proc/cmdline | grep -q "trace_buf_size="
> +    if [[ $? -ne 0 ]]; then
> +        echo 10240 > /sys/kernel/debug/tracing/buffer_size_kb
> +    fi
> +
> +    # Prepare trace for the first time

What if /sys/kernel/debug/tracing did exist, but the kmem events were not enabled?

> +    echo 1 > /sys/kernel/debug/tracing/events/kmem/mm_page_alloc/enable
> +    echo 1 > /sys/kernel/debug/tracing/events/kmem/mm_page_free/enable
> +    echo 1 > /sys/kernel/debug/tracing/events/kmem/kmalloc/enable
> +    echo 1 > /sys/kernel/debug/tracing/events/kmem/kmalloc_node/enable
> +    echo 1 > /sys/kernel/debug/tracing/events/kmem/kmem_cache_alloc/enable
> +    echo 1 > /sys/kernel/debug/tracing/events/kmem/kmem_cache_alloc_node/enable
> +    echo 1 > /sys/kernel/debug/tracing/events/module/module_load/enable
> +    echo 1 > /sys/kernel/debug/tracing/events/module/module_put/enable

It would be better to clear old trace data at this point, just to make sure that it is empty.

> +    echo 1 > /sys/kernel/debug/tracing/tracing_on
> +    return 0
> +fi
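Two of the review points on this block (preferring tracefs, and checking whether the events are already enabled) could be approached roughly as follows. This is a sketch, not part of the posted patch: the helper names find_tracing_dir and event_enabled are hypothetical, and it assumes that the presence of an "events" subdirectory is a good enough sign that a tracing directory is actually usable.

```shell
#!/bin/bash
# Print the first usable tracing directory among the candidates.
# With no arguments, try the tracefs mount point first, then the
# debugfs path used by the patch.
find_tracing_dir()
{
    local dir
    [[ $# -eq 0 ]] && set -- /sys/kernel/tracing /sys/kernel/debug/tracing
    for dir in "$@"; do
        # An empty directory means nothing is mounted there.
        if [[ -d $dir/events ]]; then
            echo "$dir"
            return 0
        fi
    done
    return 1
}

# $1: tracing directory   $2: event, e.g. kmem/kmalloc
# Succeeds if the event's "enable" file starts with 1.
event_enabled()
{
    local file="$1/events/$2/enable" state
    [[ -r $file ]] || return 1
    state=$(< "$file")
    [[ $state = 1* ]]
}

# Harmless probe: only prints the directory that would be used.
find_tracing_dir || echo "no tracing directory found" >&2
```

A caller could then clear "$dir/trace" and enable any event for which event_enabled fails, instead of keying everything on whether the directory exists.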
> +
> +echo 0 > /sys/kernel/debug/tracing/tracing_on
> +TMPFILE=/tmp/tmp$$$$
> +cp /sys/kernel/debug/tracing/trace $TMPFILE -f
> +# Clear old trace data after copied away
> +echo > /sys/kernel/debug/tracing/trace
> +
> +#Indexed by task pid.
> +declare -A current_module
> +
> +#Indexed by module name.
> +declare -A module_loaded
> +declare -A nr_alloc_pages
> +declare -A nr_alloc_pages_peak
> +declare -A nr_kmalloc
> +#For x86: If the request size of kmalloc is greater than 2*PAGE_SIZE, SLUB will use buddy instead.
> +#So we maintain the statistics of the large kmalloc requests. For ppc64, PAGE_SIZE is not 4096,
> +#but the large kmalloc request is not very common, this information is just to give some tips.
> +declare -A nr_kmalloc_above8192
> +
> +declare -A nr_kmem_cache_alloc
> +
> +# $1: order of pages
> +order_to_pages()
> +{
> +    local pages=1
> +    local order=$1
> +
> +    while [[ $order != 0 ]]; do
> +        order=$((order-1))
> +        pages=$(($pages*2))
> +    done
> +
> +    echo $pages
> +}
> +
> +while read pid cpu flags ts function ;
> +do
> +    #Skip comment lines
> +    if [[ $pid = "#" ]]; then
> +        continue
> +    fi
> +
> +    if [[ $function = module_load* ]]; then
> +        #One module is being loaded, save the task pid for tracking.
> +        module_name=${function#*: }
> +        module_names+=" $module_name"
> +        current_module[$pid]="$module_name"
> +        [[ ${module_loaded[$module_name]} ]] && warn "\"$module_name\" was loaded multiple times!"
> +        unset module_loaded[$module_name]
> +        nr_alloc_pages[$module_name]=0
> +        nr_alloc_pages_peak[$module_name]=0
> +        nr_kmalloc[$module_name]=0
> +        nr_kmalloc_above8192[$module_name]=0
> +        nr_kmem_cache_alloc[$module_name]=0
> +    fi
> +
> +    if ! [[ ${current_module[$pid]} ]]; then
> +        continue
> +    fi
> +
> +    if [[ $function = module_put* ]]; then
> +        #Mark the module as loaded
> +        module_loaded[${current_module[$pid]}]=1
> +        #module has been loaded when module_put is called, untrack the task
> +        unset current_module[$pid]
> +        continue
> +    fi
> +
> +    #Once we get here, the task is being tracked(is loading a module).
> +    #Get the module name.
> +    module_name=${current_module[$pid]}
> +
> +    if [[ $function = mm_page_alloc* ]]; then
> +        order=$(echo $function | sed -e 's/.*order=\([0-9]*\) .*/\1/')
> +        nr_alloc_pages[$module_name]=$((${nr_alloc_pages[$module_name]}+$(order_to_pages $order)))

Maybe I am missing something: since we do not track mm_page_free(), wouldn't nr_alloc_pages[$module_name] always be growing? In that case, how can it provide the correct peak memory usage?

> +        if [[ ${nr_alloc_pages[$module_name]} -gt ${nr_alloc_pages_peak[$module_name]} ]]; then
> +            nr_alloc_pages_peak[$module_name]=${nr_alloc_pages[$module_name]}
> +        fi
> +    fi
> +
> +    if [[ $function = mm_page_free* ]]; then
> +        order=$(echo $function | sed -e 's/.*order=\([0-9]*\)/\1/')
> +        nr_alloc_pages[$module_name]=$((${nr_alloc_pages[$module_name]}-$(order_to_pages $order)))
> +    fi
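The quoted code does handle mm_page_free in the block just above, so the running count can fall as well as rise, and the peak only records the highest value seen. A minimal sketch of that bookkeeping, using a hypothetical helper net_and_peak that takes signed page deltas instead of parsing trace records:

```shell
#!/bin/bash
# Feed signed page deltas (allocations positive, frees negative) and
# print the final net count and the peak reached along the way.
net_and_peak()
{
    local cur=0 peak=0 delta
    for delta in "$@"; do
        cur=$((cur + delta))
        (( cur > peak )) && peak=$cur
    done
    echo "$cur $peak"
}

# An order-2 allocation, its free, then an order-0 allocation.
net_and_peak 4 -4 1
```

This prints "1 4": one page is still held at the end, while the peak during loading was four pages.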
- if [[ $function = kmalloc* ]]; then
bytes_alloc=$(echo $function | sed -e 's/.*bytes_alloc=\([0-9]*\).*/\1/')
nr_kmalloc[$module_name]=$((${nr_kmalloc[$module_name]}+$bytes_alloc))
if [[ $bytes_alloc -gt 8192 ]]; then
nr_kmalloc_above8192[$module_name]=$((${nr_kmalloc_above8192[$module_name]}+$bytes_alloc))
fi
- fi
- if [[ $function = kmem_cache_alloc* ]]; then
bytes_alloc=$(echo $function | sed -e 's/.*bytes_alloc=\([0-9]*\).*/\1/')
nr_kmem_cache_alloc[$module_name]=$((${nr_kmem_cache_alloc[$module_name]}+$bytes_alloc))
- fi
+done < $TMPFILE
+echo -e "\n\n== debug_mem for kernel modules during loading begin ==" >&2
+for i in $module_names; do
+    status="finished"
+    if ! [[ ${module_loaded[$i]} ]]; then
+        status="loading"
+    fi
+    echo -e "[Module Name] "$i" (loading status: $status)" >&2
+    echo -e "alloc_pages: consumed ${nr_alloc_pages[$i]} pages (peak: ${nr_alloc_pages_peak[$i]} pages)" >&2
+    echo -e "kmalloc: consumed ${nr_kmalloc[$i]} bytes (above8192: ${nr_kmalloc_above8192[$i]} bytes)\n" >&2
+    echo -e "kmem_cache_alloc: consumed ${nr_kmem_cache_alloc[$i]} bytes\n" >&2
+done
+echo -e "== debug_mem for kernel modules during loading end ==\n\n" >&2
+unset module_names
+unset module_loaded
+rm $TMPFILE -f
+echo 1 > /sys/kernel/debug/tracing/tracing_on
+return 0
diff --git a/kexec-tools.spec b/kexec-tools.spec
index 0bbaf72..1f0b7f5 100644
--- a/kexec-tools.spec
+++ b/kexec-tools.spec
@@ -40,6 +40,7 @@ Source103: dracut-kdump-error-handler.sh
 Source104: dracut-kdump-emergency.service
 Source105: dracut-kdump-error-handler.service
 Source106: dracut-kdump-capture.service
+Source107: dracut-memdebug-ko.sh

 Requires(post): systemd-units
 Requires(preun): systemd-units
@@ -203,6 +204,7 @@ cp %{SOURCE103} $RPM_BUILD_ROOT/etc/kdump-adv-conf/kdump_dracut_modules/99kdumpb
 cp %{SOURCE104} $RPM_BUILD_ROOT/etc/kdump-adv-conf/kdump_dracut_modules/99kdumpbase/%{remove_dracut_prefix %{SOURCE104}}
 cp %{SOURCE105} $RPM_BUILD_ROOT/etc/kdump-adv-conf/kdump_dracut_modules/99kdumpbase/%{remove_dracut_prefix %{SOURCE105}}
 cp %{SOURCE106} $RPM_BUILD_ROOT/etc/kdump-adv-conf/kdump_dracut_modules/99kdumpbase/%{remove_dracut_prefix %{SOURCE106}}
+cp %{SOURCE107} $RPM_BUILD_ROOT/etc/kdump-adv-conf/kdump_dracut_modules/99kdumpbase/%{remove_dracut_prefix %{SOURCE107}}
 chmod 755 $RPM_BUILD_ROOT/etc/kdump-adv-conf/kdump_dracut_modules/99kdumpbase/%{remove_dracut_prefix %{SOURCE100}}
 chmod 755 $RPM_BUILD_ROOT/etc/kdump-adv-conf/kdump_dracut_modules/99kdumpbase/%{remove_dracut_prefix %{SOURCE101}}
There are a couple of things I am still trying to understand. I will comment if I find something.
~Pratyush
On 2016/10/17 at 18:06, Pratyush Anand wrote:
Hi Xunlei,
Thanks for this work. It would be a great help in debugging kdump oom memory issues.
On Monday 10 October 2016 01:13 PM, Xunlei Pang wrote:
Add dracut-memdebug-ko.sh, install it to the dracut kdump module.
The principle is to use kernel trace to track slab and buddy allocation calls during kernel module loading(module_init), thus we can analyze all the trace data and get the total memory consumption.
One major flaw of this method is that it consumes a lot of memory; users should increase the crash kernel memory reservation as needed.
Signed-off-by: Xunlei Pang xlpang@redhat.com
 dracut-memdebug-ko.sh | 151 ++++++++++++++++++++++++++++++++++++++++++++++++++
 kexec-tools.spec      |   2 +
 2 files changed, 153 insertions(+)
 create mode 100755 dracut-memdebug-ko.sh

diff --git a/dracut-memdebug-ko.sh b/dracut-memdebug-ko.sh
new file mode 100755
index 0000000..bb404ce
--- /dev/null
+++ b/dracut-memdebug-ko.sh
@@ -0,0 +1,151 @@
+#Debug the number of memory through alloc_pages and kmalloc consumed
+#by kernel modules during loading(i.e. modprobe).
+#NOTE: kmalloc may trigger alloc_pages, thus resulting in double account.
+if ! [[ -e /sys/kernel/debug/tracing ]]; then
+    mount none -t debugfs /sys/kernel/debug

Maybe you could also check whether tracefs is mounted on /sys/kernel/tracing? Trace access through debugfs will be obsolete soon.

OK, thanks for the information.
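A minimal sketch of the check suggested above, assuming the script only needs to know which tracing directory is usable. The helper name `find_tracing_dir` and its optional root argument are hypothetical (the argument exists only so the function can be exercised against a fake tree); it prefers the tracefs location and falls back to the debugfs one.

```shell
# Hypothetical helper, not part of the posted patch.
# $1 (optional): sysfs root, defaults to /sys.
# Prints the first tracing directory that contains "tracing_on".
find_tracing_dir() {
    local root=${1:-/sys}
    local d
    for d in "$root/kernel/tracing" "$root/kernel/debug/tracing"; do
        if [ -e "$d/tracing_on" ]; then
            echo "$d"
            return 0
        fi
    done
    # Neither location is available; the caller decides what to mount.
    return 1
}
```

The script could then store the result once (for example in a `TRACE_DIR` variable, also hypothetical) and use it in place of the hard-coded /sys/kernel/debug/tracing paths.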
+    if ! [[ -d /sys/kernel/debug/tracing ]]; then
+        warn "Mount debugfs failed, can't activate trace, skip kernel module memory analyzing!"
+        return 0
+    fi
+    # 10MB should be big enough for most cases?
+    # If the current ring buffer size is the default one(contains "expanded" keyword),
+    # set it to 10MB. Users can set other size via "trace_buf_size" kernel boot command.
+    cat /proc/cmdline | grep -q "trace_buf_size="
+    if [[ $? -ne 0 ]]; then
+        echo 10240 > /sys/kernel/debug/tracing/buffer_size_kb
+    fi
+    # Prepare trace for the first time

What if /sys/kernel/debug/tracing did exist, but the kmem events were not enabled?

Correct, we always need to set these events.
+    echo 1 > /sys/kernel/debug/tracing/events/kmem/mm_page_alloc/enable
+    echo 1 > /sys/kernel/debug/tracing/events/kmem/mm_page_free/enable
+    echo 1 > /sys/kernel/debug/tracing/events/kmem/kmalloc/enable
+    echo 1 > /sys/kernel/debug/tracing/events/kmem/kmalloc_node/enable
+    echo 1 > /sys/kernel/debug/tracing/events/kmem/kmem_cache_alloc/enable
+    echo 1 > /sys/kernel/debug/tracing/events/kmem/kmem_cache_alloc_node/enable
+    echo 1 > /sys/kernel/debug/tracing/events/module/module_load/enable
+    echo 1 > /sys/kernel/debug/tracing/events/module/module_put/enable

It would be better to clear the old trace data at this point, just to make sure that the buffer is empty.

I will improve the logic.
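One way the two review points above (always enable the events, and start from an empty buffer) might be combined is sketched below. The function name `prepare_trace` is hypothetical, and the tracing directory is passed in as an argument rather than hard-coded; the event list is the one from the patch.

```shell
# Hypothetical helper, not part of the posted patch.
# $1: tracing directory, e.g. /sys/kernel/debug/tracing
prepare_trace() {
    local dir=$1
    local ev
    # Stop recording while the setup is changed.
    echo 0 > "$dir/tracing_on"
    # Writing to "trace" discards whatever is currently in the ring buffer.
    echo > "$dir/trace"
    # Enable the events unconditionally, whether or not tracing was set up before.
    for ev in kmem/mm_page_alloc kmem/mm_page_free kmem/kmalloc \
              kmem/kmalloc_node kmem/kmem_cache_alloc \
              kmem/kmem_cache_alloc_node \
              module/module_load module/module_put; do
        echo 1 > "$dir/events/$ev/enable"
    done
    echo 1 > "$dir/tracing_on"
}
```

Calling this on the first invocation regardless of whether the tracing directory already existed would address the case raised in the review where debugfs is mounted but the kmem events are still disabled.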
+    echo 1 > /sys/kernel/debug/tracing/tracing_on
+    return 0
+fi
+echo 0 > /sys/kernel/debug/tracing/tracing_on
+TMPFILE=/tmp/tmp$$$$
+cp /sys/kernel/debug/tracing/trace $TMPFILE -f
+# Clear old trace data after copied away
+echo > /sys/kernel/debug/tracing/trace
+#Indexed by task pid.
+declare -A current_module
+#Indexed by module name.
+declare -A module_loaded
+declare -A nr_alloc_pages
+declare -A nr_alloc_pages_peak
+declare -A nr_kmalloc
+#For x86: If the request size of kmalloc is greater than 2*PAGE_SIZE, SLUB will use buddy instead.
+#So we maintain the statistics of the large kmalloc requests. For ppc64, PAGE_SIZE is not 4096,
+#but the large kmalloc request is not very common, this information is just to give some tips.
+declare -A nr_kmalloc_above8192
+declare -A nr_kmem_cache_alloc
+    if [[ $function = mm_page_alloc* ]]; then
+        order=$(echo $function | sed -e 's/.*order=\([0-9]*\) .*/\1/')
+        nr_alloc_pages[$module_name]=$((${nr_alloc_pages[$module_name]}+$(order_to_pages $order)))

Maybe I am missing something: since we do not track mm_page_free(), wouldn't nr_alloc_pages[$module_name] always keep growing? In that case, how can it provide the correct peak memory usage?

We do track mm_page_free here (see the code below), but I will consider removing it later.
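To make the exchange above concrete, here is a small self-contained sketch of how the current and peak page counts behave when both mm_page_alloc and mm_page_free are accounted. It is an illustration, not the patch's code: the function name `page_balance` is hypothetical, it follows a single task instead of using per-module arrays, and it uses a shift (`1 << order`) where the patch calls order_to_pages.

```shell
# Hypothetical illustration, not part of the posted patch.
# $1: first trace column ("comm-pid") of the task to follow.
# Reads trace text on stdin, prints "<current pages> <peak pages>".
page_balance() {
    local task=$1 cur=0 peak=0 order
    local who cpu flags ts event
    while read -r who cpu flags ts event; do
        [ "$who" = "$task" ] || continue
        case $event in
            mm_page_alloc:*|mm_page_free:*)
                # Keep only the digits that follow "order=".
                order=${event##*order=}
                order=${order%% *}
                ;;
            *) continue ;;
        esac
        case $event in
            mm_page_alloc:*) cur=$((cur + (1 << order))) ;;
            mm_page_free:*)  cur=$((cur - (1 << order))) ;;
        esac
        # The peak only moves when the running balance exceeds it.
        [ "$cur" -gt "$peak" ] && peak=$cur
    done
    echo "$cur $peak"
}
```

Because frees lower the running balance but never the peak, the two numbers diverge as soon as a module releases pages during loading, which is the distinction the review question is about.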
+        if [[ ${nr_alloc_pages[$module_name]} -gt ${nr_alloc_pages_peak[$module_name]} ]]; then
+            nr_alloc_pages_peak[$module_name]=${nr_alloc_pages[$module_name]}
+        fi
+    fi
+    if [[ $function = mm_page_free* ]]; then
+        order=$(echo $function | sed -e 's/.*order=\([0-9]*\)/\1/')
+        nr_alloc_pages[$module_name]=$((${nr_alloc_pages[$module_name]}-$(order_to_pages $order)))
+    fi
There are a couple of things I am still trying to understand. I will comment if I find something.
Thanks, Xunlei
~Pratyush
If "rd.memdebug" is present, we monitor the kernel module memory consumption as follows:
1) use the various dracut inst_hook stages to monitor.
2) monitor at kdump's pre_dump stage.

Signed-off-by: Xunlei Pang xlpang@redhat.com
---
 dracut-kdump.sh        | 11 +++++++++++
 dracut-module-setup.sh | 12 ++++++++++++
 kdumpctl               | 14 ++++++++++++++
 3 files changed, 37 insertions(+)

diff --git a/dracut-kdump.sh b/dracut-kdump.sh
index 42ba37f..e8f08eb 100755
--- a/dracut-kdump.sh
+++ b/dracut-kdump.sh
@@ -33,6 +33,17 @@ do_kdump_pre()
     if [ -n "$KDUMP_PRE" ]; then
         "$KDUMP_PRE"
     fi
+
+    # If cmdline hook exists, we know that memdebug-ko.sh
+    # was activated.
+    #
+    # Execute memdebug-ko.sh before dumping, at this point
+    # all kernel modules are supposed to be loaded.
+    if [ -f /lib/dracut/hooks/cmdline/99-memdebug-ko.sh ]; then
+        . /lib/dracut/hooks/cmdline/99-memdebug-ko.sh
+        # This is the last call of trace, turn off it.
+        echo 0 > /sys/kernel/debug/tracing/tracing_on
+    fi
 }

 do_kdump_post()
diff --git a/dracut-module-setup.sh b/dracut-module-setup.sh
index 68e0ff8..27d13b2 100755
--- a/dracut-module-setup.sh
+++ b/dracut-module-setup.sh
@@ -733,4 +733,16 @@ install() {
     # target. Ideally all this should be pushed into dracut iscsi module
     # at some point of time.
     kdump_check_iscsi_targets
+
+    # Setup memdebug-ko if there is any "/tmp/.kdump.memdebug.ko.tmp* which
+    # has been created by kdump temporarily as a hint.
+    ls /tmp/.kdump.memdebug.ko.tmp* > /dev/null 2>&1
+    if [[ $? -eq 0 ]]; then
+        inst_hook cmdline 99 "$moddir/memdebug-ko.sh"
+        inst_hook pre-udev 99 "$moddir/memdebug-ko.sh"
+        inst_hook pre-trigger 99 "$moddir/memdebug-ko.sh"
+        inst_hook initqueue 99 "$moddir/memdebug-ko.sh"
+        inst_hook pre-mount 99 "$moddir/memdebug-ko.sh"
+        inst_hook cleanup 99 "$moddir/memdebug-ko.sh"
+    fi
 }
diff --git a/kdumpctl b/kdumpctl
index 250f87b..928e163 100755
--- a/kdumpctl
+++ b/kdumpctl
@@ -178,7 +178,21 @@ rebuild_fadump_initrd()

 rebuild_kdump_initrd()
 {
+    # Clean up memdebug.ko temp file if any
+    rm /tmp/.kdump.memdebug.ko.tmp* -rf
+
+    echo "$(prepare_cmdline)" | grep -q "rd.memdebug"
+    if [ $? -eq 0 ]; then
+        # As a hint to inst_hook memdebug.ko.sh for dracut 99kdumpbase,
+        # see dracut-module-setup.sh.
+        touch /tmp/.kdump.memdebug.ko.tmp$$$$
+    fi
+
     $MKDUMPRD $TARGET_INITRD $kdump_kver
+
+    # Clean up memdebug.ko temp file if any
+    rm /tmp/.kdump.memdebug.ko.tmp* -rf
+
     if [ $? != 0 ]; then
         echo "mkdumprd: failed to make kdump initrd" >&2
         return 1
On Monday 10 October 2016 01:13 PM, Xunlei Pang wrote:
     $MKDUMPRD $TARGET_INITRD $kdump_kver

The return value needs to be saved here so that it can be checked after the rm command; otherwise $? will hold rm's exit status instead of mkdumprd's.

+
+    # Clean up memdebug.ko temp file if any
+    rm /tmp/.kdump.memdebug.ko.tmp* -rf
+
     if [ $? != 0 ]; then
         echo "mkdumprd: failed to make kdump initrd" >&2
         return 1
~Pratyush
On 2016/10/17 at 18:08, Pratyush Anand wrote:
On Monday 10 October 2016 01:13 PM, Xunlei Pang wrote:
     $MKDUMPRD $TARGET_INITRD $kdump_kver

The return value needs to be saved here so that it can be checked after the rm command; otherwise $? will hold rm's exit status instead of mkdumprd's.

Indeed.

Regards, Xunlei

+
+    # Clean up memdebug.ko temp file if any
+    rm /tmp/.kdump.memdebug.ko.tmp* -rf
+
     if [ $? != 0 ]; then
         echo "mkdumprd: failed to make kdump initrd" >&2
         return 1
~Pratyush
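The fix agreed on above could look roughly like the following sketch. It is not the code that was eventually committed: the wrapper name `run_then_cleanup` is hypothetical, and the command and the glob of hint files are passed as arguments so the pattern can be shown in isolation. The point is only that the command's exit status is captured before the cleanup runs.

```shell
# Hypothetical wrapper, not part of the posted patch.
# $1: glob of temporary hint files to remove afterwards
# remaining arguments: the command to run (e.g. the mkdumprd invocation)
run_then_cleanup() {
    local hint_glob=$1
    local ret=0
    shift
    # Capture the command's status immediately, before anything else runs.
    "$@" || ret=$?
    # Left unquoted on purpose so the glob expands; rm -f ignores a missing match.
    rm -rf $hint_glob
    return $ret
}
```

In rebuild_kdump_initrd() the same effect can be had inline by assigning `$?` to a local variable right after the `$MKDUMPRD` line and testing that variable after the rm.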
On 2016/10/10 at 15:43, Xunlei Pang wrote:
The current method for kdump memory debugging is to use dracut's "rd.memdebug=[0-3]", which is not enough for debugging kernel modules. For example, when we want to find out which kernel module consumes a large amount of memory, "rd.memdebug" won't help much.
A better way is needed to meet this requirement; it is very useful for kdump OOM debugging.
The principle of this patch series is to use kernel trace to track slab and buddy allocation calls during kernel module loading (module_init), so that we can analyze all the trace data and get the total memory consumption.
The trace events include the following memory calls under /sys/kernel/debug/tracing/events:
    kmem/mm_page_alloc
    kmem/mm_page_free
    kmem/kmalloc
    kmem/kmalloc_node
    kmem/kmem_cache_alloc
    kmem/kmem_cache_alloc_node
We also inspect the following events to detect module loading:
    module/module_load
    module/module_put
We can get the module name and task pid from the "module_load" event, which also marks the beginning of the loading, and module_put called by the same task pid implies the end of the loading. So the memory events recorded in between by the same task pid are consumed by this module during loading (i.e. modprobe or module_init()).
With this information, we can record approximately the total memory consumption involved in loading each kernel module.
One major flaw of this method is that the trace ring buffer consumes a lot of memory. If it is too small, old records may be overwritten by subsequent records. The trace ring buffer is set to be 10MB by default, but it can be overridden by users via the standard kernel boot parameter "trace_buf_size".
For some large systems, the amount of memory consumption is appreciable.
Does anyone have a good idea for reducing the memory consumed by tracing?
I previously tried to use a trace filter to track the specific processes doing modprobe. According to the test results, modules are normally inserted by processes named "systemd-udevd" or "modprobe", but sometimes I found that an unknown process also inserts a module, so doing so is unreliable.
I am currently thinking of monitoring buddy events only, as all large slab allocations actually fall back to the buddy allocator to request pages. This would considerably reduce the required trace buffer size.
Regards, Xunlei
Users should increase the crash kernel memory reservation as needed after setting a large trace ring buffer size, in case OOM happens during debugging.
Usage:
1) Pass "rd.memdebug" to the kdump kernel cmdline using "KDUMP_COMMANDLINE_APPEND" in /etc/sysconfig/kdump.
2) Pass the extra "trace_buf_size=nn[KMG]" to specify the trace ring buffer size (per CPU) as needed.
Xunlei Pang (2):
  memdebug-ko: add dracut-memdebug-ko.sh to debug kernel module memory consumption
  module-setup: apply kernel module memory debug support

 dracut-kdump.sh        |  11 ++++
 dracut-memdebug-ko.sh  | 144 +++++++++++++++++++++++++++++++++++++++++++++++++
 dracut-module-setup.sh |  12 +++++
 kdumpctl               |  14 +++++
 kexec-tools.spec       |   2 +
 5 files changed, 183 insertions(+)
 create mode 100755 dracut-memdebug-ko.sh
On 2016/10/25 at 11:27, Xunlei Pang wrote:
I previously tried to use a trace filter to track the specific processes doing modprobe. According to the test results, modules are normally inserted by processes named "systemd-udevd" or "modprobe", but sometimes I found that an unknown process also inserts a module, so doing so is unreliable.
The "module_load" event trace output normally looks like:
    modprobe-1887 [001] .... 13412.516997: module_load: nfs
But sometimes it looks like:
    <...>-1558 [000] .... 12961.529111: module_load: fscache
So the task name (the first column) is lost.
After some digging I found that this is due to an insufficient trace cmdline buffer. Two files are related:
    tracing/saved_cmdlines
    tracing/saved_cmdlines_size
I finally found that if the cmdlines buffer runs out, the trace output loses the task comm and shows "<...>" in front of the pid.
So we actually are able to monitor the module loading processes using the filter.
There are three kinds of known applications that load kernel modules: "systemd-udevd", "modprobe" and "insmod".
Using them as the filter considerably reduces trace memory consumption; normally only tens of lines are needed in the trace buffer, so the default size suffices.
Regards, Xunlei
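The message above does not show how the filter is set, so the following is only a sketch of one possible way to do it. It assumes the running kernel's event filters accept the `comm` field (this may not hold on older kernels), the function name `limit_trace_to_loaders` is hypothetical, and the value 1024 for saved_cmdlines_size is an arbitrary example chosen to make "<...>" entries less likely.

```shell
# Hypothetical helper, not part of the posted patch.
# Assumption: event filters on this kernel support the "comm" field.
# $1: tracing directory, e.g. /sys/kernel/debug/tracing
limit_trace_to_loaders() {
    local dir=$1
    local sub
    local expr='comm == "systemd-udevd" || comm == "modprobe" || comm == "insmod"'
    # Keep more comm-pid pairs so task names are less likely to show as "<...>".
    echo 1024 > "$dir/saved_cmdlines_size"
    # A subsystem-level filter file applies the expression to every event below it.
    for sub in kmem module; do
        echo "$expr" > "$dir/events/$sub/filter"
    done
}
```

With a filter like this in place, only events raised by the three known loader programs would be recorded, which is what lets the default ring buffer size suffice in the description above.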