Clement brought up that a spring cleaning of our ansible playbooks would be a good idea. The need was painfully obvious during our previous update/reboot cycles, where services were not updated or restarted correctly and systems did not come up cleanly after a reboot.
I have opened https://pagure.io/fedora-infrastructure/issue/7695 as the tracking ticket for this problem. I am proposing that we do major updates something like the following in the future; we can tweak the process as we find better ways to do them in clusters later.
If you maintain a service, take that playbook and add comments covering the following (see the sketch after this list):
a. Who is the current maintainer
b. Date when that was last updated
c. Who tested the upgrade and when
d. General comments to explain what things are doing
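To be concrete, I am picturing a comment header at the top of each manual/upgrade playbook, something like this minimal sketch (the names and dates here are made up):

# Maintainer: kevin (FAS username of whoever currently owns this playbook)
# Last updated: 2019-04-05
# Last tested: 2019-04-01, upgrade run against staging by smooge
# Purpose: upgrade the <service> packages, run any needed migrations,
#          and restart things in the right order.

Plain comments are enough; the point is that the next person running it knows who to ask and how stale it is.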
If the playbook should be retired, removed, killed, etc., please do so. My goal is to make our update schedule look something like this (a sketch of the reboot-and-verify step follows the schedule):
Day 1:
a. Run update playbooks on staging instances.
b. Fix any problems shown by those.
c. Run the general update vhost_update on staging instances.
d. Reboot staging instances.
e. Fix problems found from this.
Day 2:
a. Assess whether day 1 was a complete failure, and stop the upgrade cycle if so.
b. Run update playbooks on low priority systems.
c. Fix any problems shown by those.
d. Run the general update vhost_update on staging instances.
e. Reboot staging instances.
f. Fix problems found from this.
Day 3:
a. Assess whether day 2 was a complete failure, and stop the upgrade cycle if so.
b. Run update playbooks on high priority systems.
c. Fix any problems shown by those.
d. Run the general update vhost_update on staging instances.
e. Reboot staging instances.
f. Fix problems found from this.
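For the "reboot ... and fix problems" steps, something along these lines is what I am picturing. This is only a sketch; the "staging" group name and the httpd check are placeholders for whatever each playbook actually needs:

---
- name: reboot staging instances and verify they come back
  hosts: staging
  serial: 5                  # small batches, so one bad batch stops the run early
  tasks:
    - name: reboot the host and wait for it to return
      reboot:
        reboot_timeout: 600

    - name: gather service states after the reboot
      service_facts:

    - name: fail loudly if a key service did not come back (httpd as an example)
      fail:
        msg: "httpd is not running after the reboot"
      when:
        - "'httpd.service' in ansible_facts.services"
        - "ansible_facts.services['httpd.service'].state != 'running'"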
This should cut down on the extra-long hours and extended outages we have had in the last couple of reboot cycles.
On 4/5/19 8:59 AM, Stephen John Smoogen wrote:
> Clement brought up that a spring cleaning of our ansible playbooks would be a good idea. The need was painfully obvious during our previous update/reboot cycles, where services were not updated or restarted correctly and systems did not come up cleanly after a reboot.
> I have opened https://pagure.io/fedora-infrastructure/issue/7695 as the tracking ticket for this problem. I am proposing that we do major updates something like the following in the future; we can tweak the process as we find better ways to do them in clusters later.
Thanks for bringing this up, smooge!
I completely agree we should clean up the manual/upgrade/* playbooks and make sure they are good and up to date.
> If you maintain a service, take that playbook and add comments covering: a. who is the current maintainer, b. when it was last updated, c. who tested the upgrade and when, d. general comments to explain what things are doing.
> If the playbook should be retired, removed, killed, etc., please do so.
I think we need things like a maintainer listed for all our apps. ;)
> My goal is to make our update schedule look something like this:
> Day 1: a. Run update playbooks on staging instances. b. Fix any problems shown by those. c. Run the general update vhost_update on staging instances. d. Reboot staging instances. e. Fix problems found from this.
> Day 2: a. Assess whether day 1 was a complete failure, and stop the upgrade cycle if so. b. Run update playbooks on low priority systems. c. Fix any problems shown by those. d. Run the general update vhost_update on staging instances. e. Reboot staging instances. f. Fix problems found from this.
> Day 3: a. Assess whether day 2 was a complete failure, and stop the upgrade cycle if so. b. Run update playbooks on high priority systems. c. Fix any problems shown by those. d. Run the general update vhost_update on staging instances. e. Reboot staging instances. f. Fix problems found from this.
Should all of those have staging? Or should it be staging, then build, then the rest? Or staging, low pri, then high pri?
> This should cut down on the extra-long hours and extended outages we have had in the last couple of reboot cycles.
Well, so some background (not for smooge, as he knows all this, but for others reading):
In the past, the way we did mass update/reboot cycles has changed a few times. The most recent incarnation has been doing staging on a Friday or Monday, then doing the 'build' machines on one day (basically anything on a bvirthost), and then doing 'the rest' on the next day. Sometimes, due to time, we have compressed those two days into one (long) day. During these we list out all the virthosts/hardware machines, and the sysadmins take them, update them, reboot them, and confirm they come up. Then at the end we look at nagios and clear up any alerts before calling it done.

The reason for this separation was that each group had different 'users': we could announce the build outage to just devel-announce, and the 'everything else' outage to announce. Of course, now we have to announce the staging outage at least to the CentOS folks, due to keeping pkgs01.stg in sync via repospanner.
One thing I have done a few times that I think helped a LOT time-wise is to just apply updates to everything before the reboot cycle. This saves us all the time waiting for updates to apply (which on some machines can take a really long time). Of course this means machines run for a while with updates applied but processes not yet restarted. Something like the short pass sketched below is what I mean.
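A sketch only; 'all' would be narrowed to whichever groups are in that cycle, and the generic package module just hands off to yum/dnf on our hosts:

---
- name: pre-apply pending updates ahead of the reboot cycle
  hosts: all
  serial: 20                 # don't hammer the proxies/mirrors all at once
  tasks:
    - name: apply all pending package updates (no service restarts here)
      package:
        name: '*'
        state: latest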
I am not sure we will easily be able to separate out which 'manual' playbooks to run for which servers. Additionally, in most cases updates to our apps are done outside our update cycles (i.e., pagure would be updated to 5.4 manually when it's out/desired; we wouldn't expect a pending app update when we do our normal OS update cycles).
So some radical ideas:
* What if we just auto-applied all updates daily? (We already apply security updates daily on Fedora instances.) This would break things from time to time, but I suspect only particular things, not everything all at once. We would also still need reboot cycles.
* What if we just auto-applied security updates daily everywhere? (This would reduce some of the breakage from applying all updates.) Reboots would still be needed. (A rough sketch of this follows this list.)
* I have thought about someday doing reboots with no outage at all. Unfortunately, that requires database clustering. When I last looked, the clustering options were all horrible, but they might be better these days. If we had that, however, we could just do reboots whenever we liked.
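For the security-updates-only idea, this is roughly what I have in mind. A sketch assuming dnf-automatic, so our RHEL/CentOS 7 machines would need the yum-cron equivalent instead:

---
- name: auto-apply security updates daily via dnf-automatic
  hosts: all
  tasks:
    - name: install dnf-automatic
      package:
        name: dnf-automatic
        state: present

    - name: download and apply only security updates
      ini_file:
        path: /etc/dnf/automatic.conf
        section: commands
        option: "{{ item.option }}"
        value: "{{ item.value }}"
      loop:
        - { option: upgrade_type, value: security }
        - { option: apply_updates, value: 'yes' }

    - name: turn on the daily timer
      systemd:
        name: dnf-automatic.timer
        state: started
        enabled: yes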
I'm not sure there's a great answer here... I think when we decide to do an update/reboot cycle we should apply updates up front to save time/pain, and I suppose if we can easily determine which manual playbooks to run, we could run those too.
kevin
On Fri, 5 Apr 2019 at 13:15, Kevin Fenzi kevin@scrye.com wrote:
> Should all of those have staging? Or should it be staging, then build, then the rest? Or staging, low pri, then high pri?
So I was leaving the definitions vague, as some build boxes are high priority (no redundancy, or a major outage if they go down) and some are low priority because they have a large amount of redundancy. I was figuring:
low priority:
- proxies not shared on high priority external virthosts
- builders and other high redundancy build systems
- openshift systems IF updated and drained properly
- other external services which have a low SLE
high priority:
services with no redundancy:
- databases
- pagure
- src
- etc etc
services with high outage effects:
- koji?
- etc etc
On 4/5/19 10:34 AM, Stephen John Smoogen wrote:
> On Fri, 5 Apr 2019 at 13:15, Kevin Fenzi kevin@scrye.com wrote:
> So I was leaving the definitions vague, as some build boxes are high priority (no redundancy, or a major outage if they go down) and some are low priority because they have a large amount of redundancy. I was figuring:
Yeah, but you still need to notify people about those, and the audiences are different, so you end up notifying everyone and mentioning the specific services. ;(
> low priority:
> - proxies not shared on high priority external virthosts
> - builders and other high redundancy build systems
> - openshift systems IF updated and drained properly
> - other external services which have a low SLE
Yeah, many or all of these could be done anytime; users shouldn't notice an outage (proxies can be taken out of dns, builders disabled, openshift can migrate things around), etc. For the openshift bit, the drain sketch below is the sort of thing I mean.
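A sketch only; the node and master names here are made up, and it assumes oc is configured on the master we delegate to:

---
- name: drain an openshift node so pods migrate before we reboot it
  hosts: os-node01.stg.example.com
  tasks:
    - name: cordon and drain the node
      command: >
        oc adm drain {{ inventory_hostname }}
        --ignore-daemonsets --delete-local-data
      delegate_to: os-master01.stg.example.com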
> high priority:
> services with no redundancy:
> - databases
> - pagure
> - src
> - etc etc
> services with high outage effects:
> - koji?
> - etc etc
Yeah, although even here... for example, we can switch to koji02 and reboot koji01 without people really noticing. Of course kojidb is another matter.
kevin
On Sun, 7 Apr 2019 at 13:44, Kevin Fenzi kevin@scrye.com wrote:
> On 4/5/19 10:34 AM, Stephen John Smoogen wrote:
> > On Fri, 5 Apr 2019 at 13:15, Kevin Fenzi kevin@scrye.com wrote:
> > So I was leaving the definitions vague, as some build boxes are high priority (no redundancy, or a major outage if they go down) and some are low priority because they have a large amount of redundancy. I was figuring:
> Yeah, but you still need to notify people about those, and the audiences are different, so you end up notifying everyone and mentioning the specific services. ;(
Yes. That should still be part of the process; I should have included it.
> > low priority:
> > - proxies not shared on high priority external virthosts
> > - builders and other high redundancy build systems
> > - openshift systems IF updated and drained properly
> > - other external services which have a low SLE
> Yeah, many or all of these could be done anytime; users shouldn't notice an outage (proxies can be taken out of dns, builders disabled, openshift can migrate things around), etc.
If there is one thing I have noticed in the last 10 years, it is that someone always notices... and they probably had something Important(TM) that the reboot broke, and they want to make sure we notify next time. Even weekend reboots of builders seem to cause some module build to fail and draw an angry developer comment in #fedora-devel (at least the last 3 reboot cycles). The proxies get similar issues: even if we pull one out of DNS, we don't control everyone else's DNS servers, and cached records can take 24 hours to go away.
> > high priority:
> > services with no redundancy:
> > - databases
> > - pagure
> > - src
> > - etc etc
> > services with high outage effects:
> > - koji?
> > - etc etc
> Yeah, although even here... for example, we can switch to koji02 and reboot koji01 without people really noticing. Of course kojidb is another matter.
That was the reason for the ?. Some things I know we can do, and with others I try and then find out that 02 is actually not a redundant system but is used for something else: 'oh yeah, I would have renamed that but was waiting until we redesigned after the next major overhaul.' So I figure we will need to go through each service and categorize this, but we need to do that anyway for other reasons.