On Fri, Mar 8, 2013 at 6:55 PM, Frederick Grose <fgrose(a)gmail.com> wrote:
On Thu, Mar 7, 2013 at 10:41 AM, <John.Florian(a)dart.biz> wrote:
> > From: Frederick Grose <fgrose(a)gmail.com>
> > On Wed, Mar 6, 2013 at 3:59 PM, <John.Florian(a)dart.biz> wrote:
> <snip>
> > root@aos-61:46 # # Let's now make it all go wonky:
> > root@aos-61:46 # time dd if=/dev/zero of=/foo
> > Bus error
> >
> > real 1m15.775s
> > user 0m2.818s
> > sys 0m24.129s
> > root@aos-61:46 #
> > root@aos-61:46 # ls /root
> > -bash: /bin/ls: Input/output error
> > root@aos-61:46 # df -h
> > -bash: /usr/bin/df: Input/output error
> >
> > root@aos-61:46 # mount
> >
> > -bash: /usr/bin/mount: Input/output error
> >
> > root@aos-61:46 # cat /proc/meminfo
> >
> > -bash: /usr/bin/cat: Input/output error
> >
> >
> > Is this expected? Is there anything I can do, e.g., configuration-wise,
> > that can prevent this? Ideally this would fail much like any
> > other full disk situation. I understand that the overlay consumes
> > space, i.e., memory, for this file growth, including file removals,
> > but I'd at least like to be able to remotely reboot a system when in
> > this state. However, I can't even do that because the reboot command
> > will either return the same I/O error or it may succeed but get the
> > I/O error when systemd tries to read
> > /usr/lib/systemd/system/reboot.target.
> >
> > I dug around in bugzilla, but found nothing there. I can file a
> > bug, but which package is likely at fault here?
> > --
> > John Florian
> >
> > See https://fedoraproject.org/wiki/LiveOS_image for some background
> > and potential workarounds.
> >
> > --Fred --
>
>
> There's really not much on that page that helps me here. I'm trying to
> use Live images for a mostly-stateless embedded appliance OS deployed to
> hundreds or thousands of devices. I realize that the COW design is always
> going to be limited, but a more graceful failure mode is really needed,
> somehow. For our use, the biggest gain in stability here actually comes
> from systemd's journal with its trim-before-write approach instead of the
> legacy write-now, trim-asynchronously approach we used to have. However,
> that only covers one specific use case: logging. Writing to proper
> persistent storage allows me to avoid the root file system overlay, but
> most of these embedded devices use CF or SD cards for storage, which have
> limited write cycles that must be respected.
>
> Is there a way to implement an artificial capacity limit that would
> prevent processes from exhausting the overlay so that the reserve might be
> used for recording the event and rebooting back to a safer state?
>
> At the very least, I think this page could benefit from a little
> stronger, more explicit wording of this failure case. While it talks a
> little about some work-arounds, it actually says very little about why they
> are needed. Only in the "Overlay Recovery" section does it hint at the
> crash potential.
>
> --
> John Florian
>
Thank you for the review! I've updated the wiki page based on your comments:
https://fedoraproject.org/wiki/LiveOS_image
Documenting that a temporary overlay is a 0.5 GiB sparse file in a RAM
filesystem gave me the idea to try using an overlay size greater than
available memory, and hope that kernel out-of-memory warnings would
intervene before the device-mapper filesystem invalidation.
I modified /usr/sbin/dmsquash-live-root in the initramfs to create a
temporary 512 GiB sparse overlay (1024-byte blocks x 512*1024*1024):
dd if=/dev/null of=/overlay bs=1024 count=1 seek=$((512*1024*1024)) 2> /dev/null
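As a quick check on the size that dd invocation produces, here is a minimal
arithmetic sketch. dd seeks `seek` blocks of `bs` bytes into the output
file, and since /dev/null supplies no data, the file simply ends at that
offset with nothing allocated. The function name sparse_size is made up
for illustration.

```shell
#!/bin/sh
# Hypothetical helper: apparent size of a file created with
#   dd if=/dev/null of=FILE bs=BS count=1 seek=SEEK
# dd seeks SEEK blocks of BS bytes; with no input data the file is
# left truncated at that offset, so it is entirely sparse.
sparse_size() {
    bs=$1
    seek=$2
    bytes=$((bs * seek))
    echo "bytes=$bytes GiB=$((bytes / 1024 / 1024 / 1024)) sectors=$((bytes / 512))"
}

sparse_size 1024 $((512 * 1024 * 1024))
```

The sector count it prints can be compared with the total that
dmsetup status reports for the live-rw snapshot.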
Then, after booting an updated Fedora 18 Live desktop (LiveUSB, read only)
and running your failure demo,
time dd if=/dev/zero of=/foo
I got out-of-memory warnings after a file of about 450 MiB was written and
the command returned--no crash!
Some post test output:
[root@localhost ~]# dmsetup status
live-osimg-min: 0 8388608 snapshot 2584/2584 24
live-rw: 0 8388608 snapshot 921720/1073741824 3600
top - 18:11:53 up 17 min, 3 users, load average: 0.68, 0.75, 0.57
Tasks: 182 total, 2 running, 180 sleeping, 0 stopped, 0 zombie
%Cpu(s): 1.6 us, 1.6 sy, 0.0 ni, 96.5 id, 0.0 wa, 0.2 hi, 0.0 si, 0.0 st
KiB Mem: 3339812 total, 3260284 used, 79528 free, 316384 buffers
KiB Swap: 3341308 total, 0 used, 3341308 free, 1948108 cached
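To read the snapshot lines above more easily, here is a small sketch that
converts one line of dmsetup status output into MiB used and GiB total.
It assumes the snapshot status layout seen in this output
(allocated/total, then metadata, all in 512-byte sectors); the function
name snapshot_usage is only illustrative.

```shell
#!/bin/sh
# Sketch: summarize one `dmsetup status` line for a snapshot target.
# Assumed layout, as in the output above:
#   NAME: START LENGTH snapshot ALLOCATED/TOTAL METADATA   (512-byte sectors)
snapshot_usage() {
    line=$1
    rest=${line#*snapshot }     # drop everything through "snapshot "
    frac=${rest%% *}            # ALLOCATED/TOTAL, or a state word
    case $frac in
        */*) ;;
        *) echo "state=$frac"; return 0 ;;   # e.g. "Invalid"
    esac
    used=${frac%/*}
    total=${frac#*/}
    # 2048 sectors per MiB, 2097152 sectors per GiB
    echo "used_MiB=$((used / 2048)) total_GiB=$((total / 2097152))"
}

snapshot_usage "live-rw: 0 8388608 snapshot 921720/1073741824 3600"
```

For the live-rw line this gives the roughly 450 MiB written in the test,
against the oversized overlay's nominal total.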
You might test this method in your systems and let us know how it works.
--Fred
Pardon my bad observations; my conclusion above IS WRONG and is not
supported by that test.
I deceived myself with an unfamiliar error message, and actually seem to
have tested James Heather's method in my last test.
My root filesystem size was 4 GiB with about 450 MiB free. An
out-of-disk-space warning is what actually popped up and caused the test
command to exit before any other failure or crash.
To retest the oversized overlay hypothesis, I resized the LiveUSB root
filesystem to 12 GiB and repeated the test on it as an attached LiveOS
filesystem, /dev/mapper/dm-PCBV6p (mounted at /mnt/a).
[root@localhost a]# dmsetup status
dm-PCBV6p: 0 25165824 snapshot 24/1073741824 16
[root@localhost ~]# dmsetup table
dm-PCBV6p: 0 25165824 snapshot 7:9 7:10 P 8
[root@localhost ~]# losetup -a
/dev/loop8: [2081]:1833 (/run/media/fgrose/LIVE/LiveOS/squashfs.img)
/dev/loop9: [1800]:3 (/run/media/livemnt-squash-mRJNIA/LiveOS/rootfs.img)
/dev/loop10: [0017]:58601 (/run/media/tmpvJjuX7)
/dev/loop11: [2081]:1832 (/run/media/fgrose/LIVE/LiveOS/home.img)
[root@localhost ~]# df -Th
Filesystem Type Size Used Avail Use% Mounted on
devtmpfs devtmpfs 1.6G 0 1.6G 0% /dev
tmpfs tmpfs 1.6G 152K 1.6G 1% /dev/shm
tmpfs tmpfs 1.6G 3.3M 1.6G 1% /run
tmpfs tmpfs 1.6G 0 1.6G 0% /sys/fs/cgroup
/dev/sda1 ext4 18G 8.8G 7.7G 54% /
tmpfs tmpfs 1.6G 28K 1.6G 1% /tmp
/dev/sdc1 vfat 15G 8.8G 6.2G 59% /run/media/fgrose/LIVE
/dev/loop8 squashfs 929M 929M 0 100% /run/media/livemnt-squash-mRJNIA
/dev/mapper/dm-PCBV6p ext4 12G 3.4G 8.2G 30% /mnt/a
/dev/loop11 ext4 380M 35M 325M 10% /mnt/a/home
[root@localhost ~]# mount
proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
sysfs on /sys type sysfs (rw,nosuid,nodev,noexec,relatime,seclabel)
devtmpfs on /dev type devtmpfs (rw,nosuid,seclabel,size=1652304k,nr_inodes=413076,mode=755)
securityfs on /sys/kernel/security type securityfs (rw,nosuid,nodev,noexec,relatime)
selinuxfs on /sys/fs/selinux type selinuxfs (rw,relatime)
tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev,seclabel)
devpts on /dev/pts type devpts (rw,nosuid,noexec,relatime,seclabel,gid=5,mode=620,ptmxmode=000)
tmpfs on /run type tmpfs (rw,nosuid,nodev,seclabel,mode=755)
tmpfs on /sys/fs/cgroup type tmpfs (rw,nosuid,nodev,noexec,seclabel,mode=755)
cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd)
cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpuacct,cpu)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices)
cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)
cgroup on /sys/fs/cgroup/net_cls type cgroup (rw,nosuid,nodev,noexec,relatime,net_cls)
cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatime,perf_event)
/dev/sda1 on / type ext4 (rw,relatime,seclabel,data=ordered)
rpc_pipefs on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw,relatime)
systemd-1 on /proc/sys/fs/binfmt_misc type autofs (rw,relatime,fd=37,pgrp=1,timeout=300,minproto=5,maxproto=5,direct)
configfs on /sys/kernel/config type configfs (rw,relatime)
mqueue on /dev/mqueue type mqueue (rw,relatime,seclabel)
debugfs on /sys/kernel/debug type debugfs (rw,relatime)
tmpfs on /tmp type tmpfs (rw,seclabel)
hugetlbfs on /dev/hugepages type hugetlbfs (rw,relatime,seclabel)
gvfsd-fuse on /run/user/1000/gvfs type fuse.gvfsd-fuse (rw,nosuid,nodev,relatime,user_id=1000,group_id=1000)
/dev/sdc1 on /run/media/fgrose/LIVE type vfat (rw,nosuid,nodev,relatime,uid=1000,gid=1000,fmask=0022,dmask=0077,codepage=437,iocharset=ascii,shortname=mixed,showexec,utf8,flush,errors=remount-ro,uhelper=udisks2)
/dev/sdb2 on /var/cache/yum type ext4 (rw,relatime,seclabel,data=ordered)
/dev/loop8 on /run/media/livemnt-squash-mRJNIA type squashfs (ro,relatime,seclabel)
/dev/mapper/dm-PCBV6p on /mnt/a type ext4 (rw,relatime,seclabel,data=ordered)
/dev/loop11 on /mnt/a/home type ext4 (rw,relatime,seclabel,data=ordered)
/dev/sdc1 on /mnt/a/run/initramfs/live type vfat (rw,nosuid,nodev,relatime,uid=1000,gid=1000,fmask=0022,dmask=0077,codepage=437,iocharset=ascii,shortname=mixed,showexec,utf8,flush,errors=remount-ro)
The target filesystem, /dev/mapper/dm-PCBV6p, did become invalid and was
switched to read-only, which led to this test command output:
[root@localhost a]# time dd if=/dev/zero of=foo
dd: writing to ‘foo’: Read-only file system
4029694+0 records in
4029693+0 records out
2063202816 bytes (2.1 GB) copied, 40.0696 s, 51.5 MB/s
real 0m40.079s
user 0m3.799s
sys 0m32.422s
In a separate terminal I manually monitored the dmsetup status:
[root@localhost ~]# dmsetup status
dm-PCBV6p: 0 25165824 snapshot 184/1073741824 16
[root@localhost ~]# dmsetup status
dm-PCBV6p: 0 25165824 snapshot 192/1073741824 16
[root@localhost ~]# dmsetup status
dm-PCBV6p: 0 25165824 snapshot 192/1073741824 16
[root@localhost ~]# dmsetup status
dm-PCBV6p: 0 25165824 snapshot 263440/1073741824 1040
[root@localhost ~]# dmsetup status
dm-PCBV6p: 0 25165824 snapshot 366360/1073741824 1440
[root@localhost ~]# dmsetup status
dm-PCBV6p: 0 25165824 snapshot 526608/1073741824 2064
[root@localhost ~]# dmsetup status
dm-PCBV6p: 0 25165824 snapshot 703280/1073741824 2752
[root@localhost ~]# dmsetup status
dm-PCBV6p: 0 25165824 snapshot 904600/1073741824 3528
[root@localhost ~]# dmsetup status
dm-PCBV6p: 0 25165824 snapshot 1131568/1073741824 4416
[root@localhost ~]# dmsetup status
dm-PCBV6p: 0 25165824 snapshot 1383288/1073741824 5392
[root@localhost ~]# dmsetup status
dm-PCBV6p: 0 25165824 snapshot 1579432/1073741824 6160
[root@localhost ~]# dmsetup status
dm-PCBV6p: 0 25165824 snapshot 1579432/1073741824 6160
[root@localhost ~]# dmsetup status
dm-PCBV6p: 0 25165824 snapshot 1731568/1073741824 6752
[root@localhost ~]# dmsetup status
dm-PCBV6p: 0 25165824 snapshot 2191040/1073741824 8536
[root@localhost ~]# dmsetup status
dm-PCBV6p: 0 25165824 snapshot 2420840/1073741824 9432
[root@localhost ~]# dmsetup status
dm-PCBV6p: 0 25165824 snapshot 2632232/1073741824 10256
[root@localhost ~]# dmsetup status
dm-PCBV6p: 0 25165824 snapshot 2632288/1073741824 10256
[root@localhost ~]# dmsetup status
dm-PCBV6p: 0 25165824 snapshot 2964616/1073741824 11544
[root@localhost ~]# dmsetup status
dm-PCBV6p: 0 25165824 snapshot 3208632/1073741824 12496
[root@localhost ~]# dmsetup status
dm-PCBV6p: 0 25165824 snapshot Invalid
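The manual polling above could be automated along these lines. This is
only a sketch: it classifies one status line against a caller-chosen
allocation limit, and the limit value and the function name overlay_state
are assumptions for illustration. It does not prevent invalidation, since
a fast writer can outrun any polling interval; it only shows how a reserve
threshold might be detected.

```shell
#!/bin/sh
# Sketch: classify one `dmsetup status` snapshot line against an
# allocation limit given in 512-byte sectors. The limit is a
# hypothetical reserve threshold, not a device-mapper setting.
overlay_state() {
    line=$1
    limit=$2
    rest=${line#*snapshot }     # drop everything through "snapshot "
    frac=${rest%% *}            # ALLOCATED/TOTAL, or "Invalid"
    case $frac in
        */*) used=${frac%/*} ;;
        *) echo invalid; return 0 ;;
    esac
    if [ "$used" -ge "$limit" ]; then
        echo over-limit
    else
        echo ok
    fi
}

# Example limit: 1 GiB = 2097152 sectors.
overlay_state "dm-PCBV6p: 0 25165824 snapshot 1731568/1073741824 6752" 2097152
overlay_state "dm-PCBV6p: 0 25165824 snapshot 2191040/1073741824 8536" 2097152
overlay_state "dm-PCBV6p: 0 25165824 snapshot Invalid" 2097152
```

A loop such as `while sleep 1; do overlay_state "$(dmsetup status dm-PCBV6p)" 2097152; done`
would feed it live data; what to do on over-limit (log, reboot) is left open.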
[root@localhost ~]# df -Th
Filesystem Type Size Used Avail Use% Mounted on
devtmpfs devtmpfs 1.6G 0 1.6G 0% /dev
tmpfs tmpfs 1.6G 504K 1.6G 1% /dev/shm
tmpfs tmpfs 1.6G 1000M 632M 62% /run
tmpfs tmpfs 1.6G 0 1.6G 0% /sys/fs/cgroup
/dev/sda1 ext4 18G 8.7G 7.7G 54% /
tmpfs tmpfs 1.6G 68K 1.6G 1% /tmp
/dev/sdc1 vfat 15G 8.8G 6.2G 59% /run/media/fgrose/LIVE
/dev/loop8 squashfs 929M 929M 0 100% /run/media/livemnt-squash-mRJNIA
/dev/mapper/dm-PCBV6p ext4 12G 4.6G 7.1G 40% /mnt/a
/dev/loop11 ext4 380M 35M 325M 10% /mnt/a/home
The invalidation occurred at about the 1.6 GB size limit applied to the /run
tmpfs, which holds the overlay file attached as /dev/loop10:
[root@localhost ~]# losetup /dev/loop10
/dev/loop10: [0017]:58601 (/run/media/tmpvJjuX7)
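For a rough sense of how close the last polled allocation was to that
limit, the sector count can be compared with the tmpfs size. This sketch
assumes /run uses the tmpfs default of half of RAM (3339812 KiB / 2, from
the top output in this thread); df only reports it as 1.6G, so the
percentage is approximate. The last reading before "Invalid" is also not
the exact point of failure. The function name alloc_vs_tmpfs is
illustrative.

```shell
#!/bin/sh
# Sketch: express a snapshot allocation (512-byte sectors) in KiB and
# as a percentage of a tmpfs size given in KiB.
# Assumption: the /run tmpfs is sized at half of MemTotal.
alloc_vs_tmpfs() {
    sectors=$1
    tmpfs_kib=$2
    used_kib=$((sectors / 2))   # two 512-byte sectors per KiB
    echo "used_KiB=$used_kib percent=$((used_kib * 100 / tmpfs_kib))"
}

# 3208632 is the last allocation seen before the snapshot went Invalid.
alloc_vs_tmpfs 3208632 $((3339812 / 2))
```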
[root@localhost ~]# ls /mnt/a
ls: cannot access /mnt/a/.readahead: Input/output error
top - 00:26:06 up 13 min, 4 users, load average: 0.55, 0.68, 0.44
Tasks: 204 total, 2 running, 202 sleeping, 0 stopped, 0 zombie
%Cpu(s): 4.3 us, 1.6 sy, 0.0 ni, 92.4 id, 1.3 wa, 0.3 hi, 0.0 si, 0.0 st
KiB Mem: 3339812 total, 3256176 used, 83636 free, 68956 buffers
KiB Swap: 3341308 total, 0 used, 3341308 free, 2312664 cached
Notice that swap was not used, but free memory got down to about 82 MiB
(83636 KiB).
When I tested the above on the booted LiveUSB, 2-3 GiB of swap was
in use before the fatal crash.
So, by the above test method, an oversized overlay DOES NOT prevent
device-mapper snapshot invalidation.
--Fred