On Mon, May 30, 2022 at 11:28:42AM +0200, Zdenek Kabelac wrote:
On 30. 05. 22 at 4:34, Baoquan He wrote:
> On 05/27/22 at 11:39am, Vivek Goyal wrote:
> > On Fri, May 27, 2022 at 04:59:38PM +0200, Zdenek Kabelac wrote:
> > > On 27. 05. 22 at 16:50, Vivek Goyal wrote:
> > > > On Fri, May 27, 2022 at 04:42:25PM +0200, Zdenek Kabelac wrote:
> > > > > On 27. 05. 22 at 14:20, Vivek Goyal wrote:
> > > > > > On Fri, May 27, 2022 at 02:45:14PM +0800, Tao Liu wrote:
> > > > > > > If lvm2 thinp is enabled in kdump, lvm2-monitor.service is
> > > > > > > needed to monitor and autoextend the size of the thin pool.
> > > > > > > Otherwise a vmcore dumped to a target without enough space
> > > > > > > will be incomplete and unusable for further analysis.
> > > > > > >
> > > > > > > In this patch, lvm2-monitor.service will be started before
> > > > > > > kdump-capture.service in the 2nd kernel, then stopped in the
> > > > > > > kdump post.d phase, so thin pool monitoring and size
> > > > > > > autoextension are ensured during kdump.
> > > > > > >
> > > > > > > Signed-off-by: Tao Liu <ltao(a)redhat.com>
> > > > > > > ---
> > > > > > >  dracut-lvm2-monitor.service | 15 +++++++++++++++
> > > > > > >  dracut-module-setup.sh      | 16 ++++++++++++++++
> > > > > > >  kexec-tools.spec            |  2 ++
> > > > > > >  3 files changed, 33 insertions(+)
> > > > > > > create mode 100644 dracut-lvm2-monitor.service
> > > > > > >
> > > > > > > diff --git a/dracut-lvm2-monitor.service b/dracut-lvm2-monitor.service
> > > > > > This seems to be a copy of /lib/systemd/system/lvm2-monitor.service.
> > > > > > Wondering if we can directly include that file in the initramfs when
> > > > > > generating the image. But I am fuzzy on the details of the dracut
> > > > > > implementation. It has been too long since I played with it. So Bao
> > > > > > and the kdump team will be best placed to comment on this.
> > > > > >
> > > > > This is quite interesting - monitoring should in fact never be
> > > > > started within the 'ramdisk', so I'm actually wondering what this
> > > > > service file is doing there.
> > > > >
> > > > > The design was to start 'monitoring' of devices just after the switch
> > > > > to the 'rootfs' - since running 'dmeventd' out of the ramdisk does not
> > > > > make any sense at all.
> > > > Hi Zdenek,
> > > >
> > > > In the case of kdump, we save the core dump from the initramfs context
> > > > and reboot back into the primary kernel. That's why we need dm
> > > > monitoring (and thin pool auto extension) to work from inside the
> > > > initramfs context.
> > > >
> > > So IMHO this still does not look like the best approach. AFAIK the
> > > lvm.conf within the ramdisk is also a modified version.
> > >
> > > It looks like there should be a better alternative - like checking
> > > 'after' activation that there is 'enough' room in the thin-pool for use
> > > with the thinLV - that should be 'computable' - and in case the size is
> > > not good enough, trying to extend the thin-pool prior to use/mount of
> > > the thinLV (the free space in the thin-pool (%DATA & %METADATA) and the
> > > occupancy of the thinLV's %DATA can be obtained with the 'lvs' tool).
> > One potential problem here is that we don't know the size of the vmcore
> > in advance. It gets filtered and saved, and we don't know in advance how
> > many kernel pages there will be.
> >
> > Is that still right, Bao?
> Yes, it's still right.
>
> We have features in makedumpfile to estimate the expected disk space for
> vmcore dumping. E.g. with 2TB of system RAM, running makedumpfile may
> report that 256GB of disk space is needed to store the vmcore after
> filtering out zero pages, unused pages, etc. However, that estimation is
> done in the 1st kernel, and the running kernel allocates pages
> dynamically. So the estimation only gives very rough, order-of-magnitude
> data. E.g. if you have 1TB of memory while the disk space is only 200GB,
> that's obviously not enough.
>
> > Technically speaking, one could first run makedumpfile just to determine
> > what the size of the vmcore will be, and then actually save the vmcore in
> > a second round. But that would double the filtering time.
> Yeah. Besides, the memory content of the system changes dynamically all
> the time. E.g. whether your Oracle DB is running or not, the user space
> data is definitely not the same. And doing the work in two passes would
> require manual intervention; automation is still preferred if it can be
> achieved.
>
> > > Running the very resource hungry dmeventd (it locks all the process
> > > memory in RAM - which could be many many MB) in the kdump environment
> > > is IMHO not the best option here - I'd prefer to avoid execution of
> > > dmeventd in this ramfs image.
> > I understand. We also want to keep the size of the kdump initramfs to a
> > minimum.
> Right.
>
> I talked to Tao; he tested on a kvm guest with 500M of memory and 100M of
> disk space to trigger the insufficient-disk-space case. Tao said dmeventd
> consumes about 40MB when executing. I am not familiar with dmeventd, but
> if its memory cost stays roughly constant at 40M no matter how much disk
> space needs to be extended at one time, we can adjust our kdump script to
> increase the default crashkernel= value when lvm2 thinp is detected. That
> looks acceptable on the kdump side.
Dmeventd runs in 'mlockall()' mode - the whole executable with all its
libraries and all memory allocations is pinned in RAM (so IMHO 40MiB is a
rather low estimate).

The reason for this is that in the normal 'running' mode lvm2 protects
dmeventd from being blocked when it would need to page in from the rootfs
while it has suspended the DM device with the rootfs on it - by having the
whole binary mlocked in RAM it cannot 'deadlock' waiting on itself when it
suspends a given DM device.

For the kdump execution environment in the ramdisk this condition is not
really relevant (but dmeventd was not designed to be executed in such an
environment). However, as mentioned in my previous post, it's actually more
useful to run 'lvm lvextend --use-policies' with the given thin-pool name
in a plain shell loop running in parallel - it basically gives the same
result with far less memory 'obstruction' and with far better control as
well (i.e. leaving the user a defined minimum to be sure the system can
actually boot afterwards - so dumping only when there really is some
space...).
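
Something along these lines should do (an untested sketch - the "vg/pool"
name and the 5 second interval are just placeholders):

    # start before the dump, keep extending the pool per the autoextend policy
    while true; do
        lvm lvextend --use-policies vg/pool
        sleep 5
    done &
    extend_pid=$!

    # ... run makedumpfile / cp of /proc/vmcore here ...

    kill "$extend_pid"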
Hi Zdenek,

Is running "lvm lvextend --use-policies" racy as well? I mean, is it
possible that the dump process fills up the pool before lvextend gets a
chance to extend it? Or is it fine even if the thin pool gets full - once
it is extended again, will it unblock the dumping process automatically?
But this still does not protect against filling up the data LV completely
and making the rootfs unusable/unbootable.
Bao mentioned that makedumpfile has the capability to estimate the size of
the core dump. Maybe we should run that in the second kernel instead,
extend the thin pool accordingly and then initiate the dump. For core
collectors like dd/cp, we know the size of /proc/vmcore and we can use
that instead to make sure there is enough free space in the thin pool,
otherwise abort.
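
For the dd/cp case it could be something as simple as the following (a
rough sketch only - "vg/pool" and the abort-on-failure policy are just
placeholders for whatever the kdump scripts decide to do):

    # size of the full, unfiltered dump
    need=$(stat -c %s /proc/vmcore)

    # bytes still free in the thin-pool data area
    pool_size=$(lvs --noheadings --units b --nosuffix -o lv_size vg/pool)
    data_pct=$(lvs --noheadings -o data_percent vg/pool)
    free=$(awk -v s="$pool_size" -v p="$data_pct" \
               'BEGIN { printf "%d", s * (100 - p) / 100 }')

    if [ "$free" -lt "$need" ]; then
        # try to grow the pool by the missing amount; abort the dump on failure
        lvextend -L "+$((need - free))b" vg/pool || exit 1
    fi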
So yes, I think estimating the amount of space required for the dump and
then extending the thin pool accordingly is probably the best way to go
about it, given the options.
Thanks
Vivek
Regards
Zdenek