Greetings.
Last night koji alerted due to slowness. It was not backups or anything, but rather the database hitting the limit I raised in commit c678f73b:
-autovacuum_freeze_max_age = 200000000 # maximum XID age before forced vacuum
+autovacuum_freeze_max_age = 300000000 # maximum XID age before forced vacuum
What this means, basically: postgres records in each table row the xid (transaction id) of the transaction that wrote it, so it can tell which transactions can 'see' which rows. However, the xid is a 32-bit value, meaning there can only be about 2.1 billion transactions before it 'wraps around'. Before that happens, all the 'old' xids need to be gone or postgres will get confused. It removes the old xids by marking old rows as 'frozen' (so any other transaction can see them).
So, this value tells the autovacuum daemon when to start processing the table, freezing old xids, so that by the time the wraparound happens everything will be set.
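For reference, this is visible from the database side: each table's distance from the limit can be checked with a query like this (a sketch, assuming psql access to the koji db):

```sql
-- age(relfrozenxid) is how many xids old the table's oldest unfrozen row is;
-- autovacuum forces a freeze vacuum once it exceeds autovacuum_freeze_max_age.
SELECT relname, age(relfrozenxid) AS xid_age
FROM pg_class
WHERE relkind = 'r'
ORDER BY xid_age DESC
LIMIT 10;
```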
Unfortunately, it's doing this on the buildroot_listing table, which is: public | buildroot_listing | table | koji | 219 GB |
So, the i/o load is heavy and koji is slow to respond to real requests.
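(That size figure is from \d+-style output; the same number can be had directly with the standard size functions, e.g.:)

```sql
-- Total on-disk size of the table, including indexes and toast data
SELECT pg_size_pretty(pg_total_relation_size('public.buildroot_listing'));
```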
There are (at least) three things we could do:
1. Bump autovacuum_freeze_max_age up to 600 million. The 100 million bump I did in January gave us about 1.5 months, so if we go to 600 million, we might last until June, when we will be migrating to the new datacenter. 600 million is still a long way from 2.1 billion, so it should be fine. At that point I hope to move db-koji01 to a RHEL 8 instance and a much newer postgresql. We could also run the vacuum during that downtime and let it finish.
2. Just let it finish now. Things will be slow, and I don't know for how long. Users will complain and it will take longer for people to get things done, but in the end we should be in better shape, and there's basically no action we need to take (other than handling complaints).
3. Schedule an outage and take the db offline and run the vacuum. This might be quicker than letting the autovac finish, I am not sure.
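For reference, option 1 would just be the same kind of one-line postgresql.conf change as before (a sketch, value as proposed above):

```
autovacuum_freeze_max_age = 600000000 # maximum XID age before forced vacuum
```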
Thoughts? Please +1 the freeze break for the option you think is best, or feel free to ask for more info or suggest other options.
kevin
On Sun, 2020-03-01 at 11:26 -0800, Kevin Fenzi wrote:
Greetings.
Last night koji alerted due to slowness. It was not backups or anything, but rather the database hitting the limit I raised in commit c678f73b:
-autovacuum_freeze_max_age = 200000000 # maximum XID age before forced vacuum
+autovacuum_freeze_max_age = 300000000 # maximum XID age before forced vacuum
What this means, basically: postgres records in each table row the xid (transaction id) of the transaction that wrote it, so it can tell which transactions can 'see' which rows. However, the xid is a 32-bit value, meaning there can only be about 2.1 billion transactions before it 'wraps around'. Before that happens, all the 'old' xids need to be gone or postgres will get confused. It removes the old xids by marking old rows as 'frozen' (so any other transaction can see them).
So, this value tells the autovacuum daemon when to start processing the table, freezing old xids, so that by the time the wraparound happens everything will be set.
Unfortunately, it's doing this on the buildroot_listing table, which is: public | buildroot_listing | table | koji | 219 GB |
So, the i/o load is heavy and koji is slow to respond to real requests.
There are (at least) three things we could do:
- Bump autovacuum_freeze_max_age up to 600 million. The 100 million bump I did in January gave us about 1.5 months, so if we go to 600 million, we might last until June, when we will be migrating to the new datacenter. 600 million is still a long way from 2.1 billion, so it should be fine. At that point I hope to move db-koji01 to a RHEL 8 instance and a much newer postgresql. We could also run the vacuum during that downtime and let it finish.
- Just let it finish now. Things will be slow, I don't know for how
long. Users will complain and it will take longer for people to get things done, but at the end we should be in better shape and there's basically no action we need to take (other than handling complaints)
- Schedule an outage and take the db offline and run the vacuum. This
might be quicker than letting the autovac finish, I am not sure.
Thoughts? Please +1 the freeze break for the option you think is best, or feel free to ask for more info or suggest other options.
I'm fine with 1 or 2, but think we should definitely *not* do 3 unless a) we know quite precisely how long it will take and b) that is substantially faster than the online autovac.
On Mon, Mar 02, 2020 at 08:17:38AM -0800, Adam Williamson wrote:
I'm fine with 1 or 2, but think we should definitely *not* do 3 unless a) we know quite precisely how long it will take and b) that is substantially faster than the online autovac.
Yeah.
Just as a status update, the auto vac has been running about 1.5 days so far. Things are slow, but there hasn't been a flood of complaints yet.
Load on the db server is around 100 or so, when it's normally about 20
I did disable backups last night because backups + this auto vac definitely make it unusable.
kevin
On Mon, 2 Mar 2020 at 18:53, Kevin Fenzi <kevin@scrye.com> wrote:
On Mon, Mar 02, 2020 at 08:17:38AM -0800, Adam Williamson wrote:
I'm fine with 1 or 2, but think we should definitely *not* do 3 unless a) we know quite precisely how long it will take and b) that is substantially faster than the online autovac.
Yeah.
Just as a status update, the auto vac has been running about 1.5 days so far. Things are slow, but there hasn't been a flood of complaints yet.
Should we try to bite the bullet? Since we are in the freeze for the F32 Beta, there might also be less activity.
Is there any way to see some sort of progress?
Load on the db server is around 100 or so, when it's normally about 20
I did disable backups last night because backups + this auto vac definitely make it unusable.
kevin

_______________________________________________
infrastructure mailing list -- infrastructure@lists.fedoraproject.org
To unsubscribe send an email to infrastructure-leave@lists.fedoraproject.org
Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: https://lists.fedoraproject.org/archives/list/infrastructure@lists.fedorapro...
On Mon, Mar 02, 2020 at 07:52:18PM +0100, Clement Verna wrote:
On Mon, 2 Mar 2020 at 18:53, Kevin Fenzi <kevin@scrye.com> wrote:
On Mon, Mar 02, 2020 at 08:17:38AM -0800, Adam Williamson wrote:
I'm fine with 1 or 2, but think we should definitely *not* do 3 unless a) we know quite precisely how long it will take and b) that is substantially faster than the online autovac.
Yeah.
Just as a status update, the auto vac has been running about 1.5 days so far. Things are slow, but there hasn't been a flood of complaints yet.
Should we try to bite the bullet? Since we are in the freeze for the F32 Beta, there might also be less activity.
Is there any way to see some sort of progress?
Sadly no. There's a progress view you can query... in postgresql 9.6. :( We are on 9.2, and it doesn't have it.
So, all I can do is look and see if it's still running or not. :(
koji | buildroot_listing_rpms | | RowExclusiveLock | t | postgres | autovacuum: VACUUM public.buildroot_listing (to prevent wraparound) | 2020-03-01 05:35:18.412833+00 | 1 day 18:58:54.265952 | 31654
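(For the record, on postgresql 9.6+ we could watch the actual progress; a sketch for after the upgrade:)

```sql
-- pg_stat_progress_vacuum (new in 9.6) reports per-vacuum progress:
-- heap blocks scanned so far vs. total heap blocks in the table.
SELECT p.pid, c.relname, p.phase, p.heap_blks_scanned, p.heap_blks_total
FROM pg_stat_progress_vacuum p
JOIN pg_class c ON c.oid = p.relid;
```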
kevin
On Mon, Mar 02, 2020 at 09:45:10AM -0800, Kevin Fenzi wrote:
On Mon, Mar 02, 2020 at 08:17:38AM -0800, Adam Williamson wrote:
I'm fine with 1 or 2, but think we should definitely *not* do 3 unless a) we know quite precisely how long it will take and b) that is substantially faster than the online autovac.
Yeah.
Just as a status update, the auto vac has been running about 1.5 days so far. Things are slow, but there hasn't been a flood of complaints yet.
In this case I'd say, let's keep it running for a while until we start to see a flood of complaints, at which point we can re-evaluate and see if we want to go with option #1.
I did disable backups last night because backups + this auto vac definitely make it unusable.
Having no backups is always annoying but not having a usable koji is as well, so +1 to that!
Pierre
And our laziness and indecision pays off!
The autovac is finally done and load should be back to normal.
I am going to re-enable backups.
Thanks everyone.
kevin
On Tue, 2020-03-03 at 13:33 -0800, Kevin Fenzi wrote:
And our laziness and indecision pays off!
Woohoo, the Adam Strategy wins again!
On Tue, 3 Mar 2020 at 22:39, Kevin Fenzi <kevin@scrye.com> wrote:
And our laziness and indecision pays off!
The autovac is finally done and load should be back to normal.
Should we lower the autovacuum max age, so that the next time it runs it does not take as long?
I am going to re-enable backups.
Thanks everyone.
kevin
Is there an ETA for the next vacuum process? If we can wait for some months, it might be worth upgrading to a newer RHEL and postgres after the DC migration but before the next autovacuum, instead of lowering the max age now and forcing another autovacuum run on the current old postgres.
On 2020-03-04 09:41, Clement Verna wrote:
On Tue, 3 Mar 2020 at 22:39, Kevin Fenzi <kevin@scrye.com> wrote:
And our laziness and indecision pays off! The autovac is finally done and load should be back to normal.
Should we lower the autovacuum max age, so that the next time it runs it does not take as long?
I am going to re-enable backups. Thanks everyone. kevin
On Wed, 4 Mar 2020 at 04:00, Julen Landa Alustiza <jlanda@fedoraproject.org> wrote:
Is there an ETA for the next vacuum process? If we can wait for some
There is no ETA, as it depends on the number of builds that have been done since the last autovac. A large number means it is further into the future but takes much longer to do; a smaller number means it will happen sooner but finishes faster. The number of builds per unit time is not something that averages very well: you have days where things seem regular, and then a mass of modules and koschei and other items all hit for a couple of days, etc.
months, it might be worth upgrading to a newer RHEL and postgres after the DC migration but before the next autovacuum, instead of lowering the max age now and forcing another autovacuum run on the current old postgres.
On 2020-03-04 09:41, Clement Verna wrote:
On Tue, 3 Mar 2020 at 22:39, Kevin Fenzi <kevin@scrye.com> wrote:
And our laziness and indecision pays off! The autovac is finally done and load should be back to normal.
Should we lower the autovacuum max age, so that the next time it runs it does not take as long?
I am going to re-enable backups. Thanks everyone.
kevin
On Wed, Mar 04, 2020 at 07:45:29AM -0500, Stephen John Smoogen wrote:
On Wed, 4 Mar 2020 at 04:00, Julen Landa Alustiza <jlanda@fedoraproject.org> wrote:
Is there an ETA for the next vacuum process? If we can wait for some
There is no ETA, as it depends on the number of builds that have been done since the last autovac. A large number means it is further into the future but takes much longer to do; a smaller number means it will happen sooner but finishes faster. The number of builds per unit time is not something that averages very well: you have days where things seem regular, and then a mass of modules and koschei and other items all hit for a couple of days, etc.
Well, it took about 13 years to reach 300 million, but it has obviously been speeding up since the early years. The last 100 million took about 1.5 months (since I upped it from the default 200 million to 300 million).
So, I think we have about 4.5 months until the next one, which should be enough time to get to the migration. :) If we hit it early, I'd just bump the value then.
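The arithmetic behind that estimate, as a quick sanity check (assuming xid consumption stays near the recent rate):

```python
# ~100 million xids consumed over the last 1.5 months, and the freeze
# limit is 300 million, so one full cycle at the recent rate is about:
xids_per_month = 100_000_000 / 1.5
months_until_next = 300_000_000 / xids_per_month
print(round(months_until_next, 1))  # 4.5
```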
kevin