So xen1 went down today and I was helping bring things back up. I didn't know to look in /var/log/messages for the messages from xenGuestsRunning.sh. I was wondering this:
Would it make sense to have xenGuestsRunning run every hour and re-make the symlinks in /etc/xen/auto for the guests which should be running on the machine? Also, if for some reason the xen guests can't be started automatically due to other complexities (iSCSI, memory overcommit, etc.), we could have xenGuestsRunning auto-generate a script which can be run to re-create the xen guests which should be running.
I'd be willing to put the script together, I just wanted to ask if there was a good reason NOT to do this, so I don't waste time if I've missed something.
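A minimal sketch of the idea, in Python rather than as a change to xenGuestsRunning.sh itself. The function names, the `expected` list of guest names, and the assumption that each guest's config file lives at /etc/xen/<guest> are illustrative guesses; only /etc/xen/auto and `xm create` come from the existing setup.

```python
import os


def remake_auto_links(expected, config_dir="/etc/xen", auto_dir="/etc/xen/auto"):
    """Re-create missing symlinks in auto_dir for the guests in `expected`.

    Returns the names of the guests whose link was (re)created.
    `expected` is a hypothetical list of guest names for this host.
    """
    created = []
    os.makedirs(auto_dir, exist_ok=True)
    for guest in expected:
        target = os.path.join(config_dir, guest)
        link = os.path.join(auto_dir, guest)
        if not os.path.exists(target):
            # No config file for this guest on this host: nothing to link.
            continue
        if os.path.islink(link):
            if os.readlink(link) == target:
                continue  # already correct
            os.remove(link)  # stale link pointing somewhere else
        elif os.path.lexists(link):
            continue  # a real file or directory: leave it for a human
        os.symlink(target, link)
        created.append(guest)
    return created


def recovery_script(expected, config_dir="/etc/xen"):
    """Build a shell script that starts each expected guest by hand."""
    lines = ["#!/bin/sh"]
    lines += ["xm create %s" % os.path.join(config_dir, g) for g in expected]
    return "\n".join(lines) + "\n"
```

Run hourly from cron, the first function would cover the symlink case; the second would only write out the script for someone to run once iSCSI and memory are sorted out.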
thanks, -sv
On Fri, 18 Apr 2008, seth vidal wrote:
The only reason we haven't done this already is the inability to detect if the box is already up somewhere (which is something we need already). Consider this scenario:
app1 running on xen1 (which is having high load from koji1 also on xen1)
People complain about the wiki.
We move app1 to a more free box, xen7.
high load causes CRASH
xen1 reboots. Attempts to bring app1 up (already up on xen7)
Two machines try to write to the same disk - DOOM.
There is a bit of hope in this: it's happened before, and it seems that the second guest sees the disk is already mounted and gets stuck at an fsck shell. As long as we realize that that condition potentially means the box is already up and needs to be checked... we're fine. If someone tries to type the root password and fsck the disk... DOOM.
This is all a sign of a larger problem: the lack of open source management tools for virtualization on more than one host at a time. I'm a huge fan of automation, so in general I'd like to see the plan above implemented, but I think we need to alter the xm creation scripts (I'm not sure what this involves) so that they make sure hosts don't come up on the wrong xen host.
-Mike
On Fri, 2008-04-18 at 15:33 -0500, Mike McGrath wrote:
Okay, so maybe we need a really-xen-startup init script which:
1. happens AFTER network, etc. are up so iSCSI items work
2. provides a locking capability so it can talk to 'something else' to find out which domains are already locked and allocated, to determine if it should start them (stale locks left behind by a crash are easy to work around with a good db)
3. notifies on restart.
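To make the locking part concrete, here is a rough sketch. It uses sqlite3 purely as a stand-in for the 'something else'; a real setup would need a registry every xen host can reach. The table layout and the function names are assumptions made up for this example.

```python
import sqlite3


def open_registry(path=":memory:"):
    """Open (or create) the hypothetical guest-lock registry."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS locks ("
        "guest TEXT PRIMARY KEY, host TEXT NOT NULL)"
    )
    return db


def try_claim(db, guest, host):
    """Return True if `host` is allowed to start `guest`."""
    try:
        with db:  # commits on success, rolls back on error
            db.execute(
                "INSERT INTO locks (guest, host) VALUES (?, ?)", (guest, host)
            )
        return True
    except sqlite3.IntegrityError:
        # Someone already holds the lock; find out who.
        row = db.execute(
            "SELECT host FROM locks WHERE guest = ?", (guest,)
        ).fetchone()
        # A lock we held before crashing is stale but still ours: reuse it.
        return row is not None and row[0] == host


def move_guest(db, guest, new_host):
    """Record that `guest` now belongs on `new_host`."""
    with db:
        db.execute("UPDATE locks SET host = ? WHERE guest = ?", (new_host, guest))
```

In the app1 scenario, moving app1 to xen7 would update the registry, so when xen1 reboots its claim for app1 is refused and the guest is skipped (and reported) instead of started.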
just a few thoughts...
thanks -sv
On Fri, Apr 18, 2008 at 03:33:22PM -0500, Mike McGrath wrote:
What about using gfs (or maybe just nfs from the netapp) to keep the contents of /etc/xen/auto "synced" between hosts? Having /etc/xen/auto as a link to [sharedarea]/auto-xenXX and making sure that the entries there are unique will be a good start for a solution I think.
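As a sketch of the "entries are unique" check, assuming the shared area holds one directory per host named auto-<host> with one entry per guest (the layout and the function name are assumptions for illustration):

```python
import os
from collections import defaultdict


def find_duplicate_guests(shared_area):
    """Return {guest: [hosts]} for guests listed under more than one host."""
    owners = defaultdict(list)
    for entry in sorted(os.listdir(shared_area)):
        host_dir = os.path.join(shared_area, entry)
        if not (entry.startswith("auto-") and os.path.isdir(host_dir)):
            continue  # not one of the per-host auto directories
        host = entry[len("auto-"):]
        for guest in os.listdir(host_dir):
            owners[guest].append(host)
    return {g: hosts for g, hosts in owners.items() if len(hosts) > 1}
```

An empty result means every guest is set to autostart on at most one host; anything else names the guests that need attention before a reboot.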
Kostas
On Fri, 18 Apr 2008, Kostas Georgiou wrote:
Network file shares are ill-suited for tasks like configuration management IMHO.
-Mike
On Fri, Apr 18, 2008 at 04:39:18PM -0500, Mike McGrath wrote:
I don't disagree, I don't like it either. Red Hat Magazine had an article[1] recently about xen failover using conga. It might give some ideas even if you prefer not to use gfs (probably a bad idea to have host images in gfs anyway).
Kostas
[1] http://www.redhatmagazine.com/2007/08/23/automated-failover-and-recovery-of-...
On Fri, 18 Apr 2008, Kostas Georgiou wrote:
I'd played a little bit with this not too long ago, actually, but found luci / ricci to be a bit too temperamental for that particular job. Plus there's no need for gfs / nfs in that particular instance if you're using clvm.
-Mike
On Fri, 18 Apr 2008, seth vidal wrote:
One more gotcha we need to fix: at some point xen stopped honoring the MAXMEM setting and wouldn't let guests go above the default mem setting. This caused some of our xen configs to get overcommitted (total memory size is greater than the memory on the box, and upon starting we'd have to xm mem-set some of the boxes lower).
I'm almost positive this is fixed now so we can go through, set mem to something lower and have maxmem behave properly.
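For going through the configs, something like the following could flag overcommitted hosts. It is only a sketch: it assumes the guest configs use plain `memory = N` / `maxmem = N` lines with values in MB, and the function names and the way host memory is passed in are made up for the example.

```python
import re

# Matches lines such as `memory = 512` or `maxmem = "1024"`.
_SETTING = re.compile(r"^\s*(memory|maxmem)\s*=\s*[\"']?(\d+)", re.MULTILINE)


def guest_memory(config_text):
    """Return the memory/maxmem values (MB) found in one guest config."""
    return {key: int(value) for key, value in _SETTING.findall(config_text)}


def overcommit_mb(configs, host_mb, key="memory"):
    """Return how many MB the guests' `key` settings exceed host_mb by (0 if none)."""
    total = sum(guest_memory(text).get(key, 0) for text in configs)
    return max(0, total - host_mb)
```

A non-zero result for `key="memory"` would mean the guests can't all start with their configured mem, which is the case where we currently have to xm mem-set some of them lower.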
-Mike
infrastructure@lists.fedoraproject.org