I'd apply this later tonight when things are not very busy, or early tomorrow morning (given +1s).
We have been having the cluster fall over for still unknown reasons, but this patch should at least help prevent those failures:
First, we increase the net_ticktime parameter from its default of 60 to 120 seconds. RabbitMQ (via the Erlang VM) sends a tick to each other cluster member every net_ticktime/4 seconds, and a member that stays silent for roughly net_ticktime plus up to 25% is assumed to be down. All these VMs are on the same network and in the same datacenter, but perhaps heavy load from other VMs sometimes causes a tick to not arrive in time? http://www.rabbitmq.com/nettick.html
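To put numbers on that (my arithmetic, based on the nettick docs): with net_ticktime at 120, a tick goes out every 120/4 = 30 seconds, and a silent peer is declared down after somewhere between 120 and 150 seconds (net_ticktime plus up to 25%), versus 60-75 seconds with the default. The kernel section we end up with would look like this (just a sketch of that one app section, comments mine):

    [
     %% net_ticktime is in seconds; Erlang ticks peers every net_ticktime/4.
     %% tick interval:    120 / 4 = 30s
     %% detection window: 120s .. 120 * 1.25 = 150s
     {kernel, [{net_ticktime, 120}]}
    ].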
Also, set our partitioning strategy to autoheal. Currently (with pause_minority), if some cluster member gets booted out, it gets paused and stops processing entirely. With autoheal, RabbitMQ tries to pick a 'winning' partition and restarts all the nodes that are not in that partition. https://www.rabbitmq.com/partitions.html
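If we want to check where a node thinks it stands after this lands, rabbitmqctl cluster_status reports any partitions the node has seen; an empty partitions entry means the cluster is whole. Roughly like this (illustrative output with made-up hostnames; the exact format varies by RabbitMQ version):

    $ rabbitmqctl cluster_status
    Cluster status of node rabbit@host01 ...
    [{nodes,[{disc,[rabbit@host01,rabbit@host02,rabbit@host03]}]},
     {running_nodes,[rabbit@host01,rabbit@host02,rabbit@host03]},
     {partitions,[]}]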
Hopefully the first change will make partitions less likely, and the second will let them repair without causing massive pain to the cluster.
Signed-off-by: Kevin Fenzi <kevin@scrye.com>
---
 roles/rabbitmq_cluster/templates/rabbitmq.config | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)
diff --git a/roles/rabbitmq_cluster/templates/rabbitmq.config b/roles/rabbitmq_cluster/templates/rabbitmq.config
index 5c38dbd..82dd444 100644
--- a/roles/rabbitmq_cluster/templates/rabbitmq.config
+++ b/roles/rabbitmq_cluster/templates/rabbitmq.config
@@ -21,7 +21,7 @@
 
     %% How to respond to cluster partitions.
    %% Documentation: https://www.rabbitmq.com/partitions.html
-    {cluster_partition_handling, pause_minority},
+    {cluster_partition_handling, autoheal},
 
    %% And some general config
    {log_levels, [{connection, none}]},
@@ -29,9 +29,7 @@
    {heartbeat, 600},
    {channel_max, 128}
  ]},
- {kernel,
-   [
-   ]},
+ {kernel, [{net_ticktime, 120}]},
 {rabbitmq_management,
  [
   {listener, [{port, 15672},
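For readability, here's roughly what the touched parts of the template look like after the patch (trimmed, and assuming the usual {rabbit, [...]} wrapper that the hunks don't show):

    [
     {rabbit,
      [
       %% ...other rabbit settings unchanged...
       {cluster_partition_handling, autoheal},
       {log_levels, [{connection, none}]},
       {heartbeat, 600},
       {channel_max, 128}
      ]},
     {kernel, [{net_ticktime, 120}]},
     %% ...rabbitmq_management section unchanged...
    ].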
On Thu, 2020-03-12 at 19:32 +0000, Kevin Fenzi wrote:
We have been having the cluster fall over for still unknown reasons, but this patch should at least help prevent those failures:
(if I get a vote) +1, makes sense to me.
We have been having the cluster fall over for still unknown reasons, but this patch should at least help prevent those failures
I wish I understood what's actually going on, but +1 on those changes to see if they help. If they do, we may consider reverting to the defaults when we upgrade to a newer version?
A.
On Fri, Mar 13, 2020 at 12:50:01PM +0100, Aurelien Bompard wrote:
We have been having the cluster fall over for still unknown reasons, but this patch should at least help prevent those failures
I wish I understood what's actually going on, but +1 on those changes to see if they help. If they do, we may consider reverting to the defaults when we upgrade to a newer version?
yeah, we could.
My only theory right now is that the VM hosts those VMs are on are under high network load (two of them have download servers on them) and sometimes packets get dropped... but it seems far-fetched. ;(
kevin
+1
On Thu, 12 Mar 2020 at 20:40, Kevin Fenzi <kevin@scrye.com> wrote:
We have been having the cluster fall over for still unknown reasons, but this patch should at least help prevent those failures:
[...]
At this point I think we are in a full outage and this doesn't need +1s in order to make things go. But I have +1'd it.
On Thu, 12 Mar 2020 at 15:32, Kevin Fenzi <kevin@scrye.com> wrote:
I'd apply this later tonight when things are not very busy, or early tomorrow morning (given +1s).