replication from 1.2.8.3 to 1.2.10.4

Wednesday, 11 July 2012

Is replication from a 1.2.8.3 server to a 1.2.10.4 server known to work or not work? 
We're having changelog issues.

Background:

We have an ldap service consisting of 3 masters, 2 hubs and 16 slaves.  All were running
1.2.8.3 since last summer with no issues.  This summer, we decided to bring them all up to
the latest stable release, 1.2.10.4.  We can't afford a lot of downtime for the
service as a whole, but with the redundancy level we have, we can take down a machine or
two at a time without user impact.

We started with one slave, did a clean install of 1.2.10.4 on it, set up replication
agreements from our 1.2.8.3 hubs to it and watched it for a week or so.  Everything looked
fine, so we started rolling through the rest of the slave servers, got them all running
1.2.10.4 and so far haven't seen any problems.

A couple of days ago, I did one of our two hubs.  The first time I bring up the daemon
after doing the initial import of our ldap data everything seems fine.  However, we start
seeing errors the first time we restart:

[11/Jul/2012:10:43:58 -0400] - slapd shutting down - signaling operation threads
[11/Jul/2012:10:43:58 -0400] - slapd shutting down - waiting for 2 threads to terminate
[11/Jul/2012:10:44:01 -0400] - slapd shutting down - closing down internal subsystems and
plugins
[11/Jul/2012:10:44:02 -0400] - Waiting for 4 database threads to stop
[11/Jul/2012:10:44:04 -0400] - All database threads now stopped
[11/Jul/2012:10:44:04 -0400] - slapd stopped.
[11/Jul/2012:10:45:00 -0400] - 389-Directory/1.2.10.4 B2012.101.2023 starting up
[11/Jul/2012:10:45:07 -0400] NSMMReplicationPlugin - ruv_compare_ruv: the max CSN
[4ffdca7e000000330000] from RUV [changelog max RUV] is larger than the max CSN
[4ffb605d000000330000] from RUV [database RUV] for element [{replica 51}
4ffb602b000300330000 4ffdca7e000000330000]
[11/Jul/2012:10:45:07 -0400] NSMMReplicationPlugin - replica_check_for_data_reload:
Warning: data for replica
ou=accounts,ou=gtaccounts,ou=departments,dc=gted,dc=gatech,dc=edu does not match the data
in the changelog. Recreating the changelog file. This could affect replication with
replica's consumers in which case the consumers should be reinitialized.
[11/Jul/2012:10:45:07 -0400] NSMMReplicationPlugin - ruv_compare_ruv: the max CSN
[4ffdca70000000340000] from RUV [changelog max RUV] is larger than the max CSN
[4ffb7098000100340000] from RUV [database RUV] for element [{replica 52}
4ffb6ea2000000340000 4ffdca70000000340000]
[11/Jul/2012:10:45:07 -0400] NSMMReplicationPlugin - replica_check_for_data_reload:
Warning: data for replica ou=people,dc=gted,dc=gatech,dc=edu does not match the data in
the changelog. Recreating the changelog file. This could affect replication with
replica's consumers in which case the consumers should be reinitialized.
[11/Jul/2012:10:45:08 -0400] - slapd started.  Listening on All Interfaces port 389 for
LDAP requests
[11/Jul/2012:10:45:08 -0400] - Listening on All Interfaces port 636 for LDAPS requests

The _second_ restart is even worse, we get more error messages (see below) and then the
daemon dies after it says it's listening on it's ports:

[11/Jul/2012:10:45:32 -0400] - slapd shutting down - signaling operation threads
[11/Jul/2012:10:45:32 -0400] - slapd shutting down - waiting for 29 threads to terminate
[11/Jul/2012:10:45:34 -0400] - slapd shutting down - closing down internal subsystems and
plugins
[11/Jul/2012:10:45:35 -0400] - Waiting for 4 database threads to stop
[11/Jul/2012:10:45:36 -0400] - All database threads now stopped
[11/Jul/2012:10:45:36 -0400] - slapd stopped.
[11/Jul/2012:10:46:11 -0400] - 389-Directory/1.2.10.4 B2012.101.2023 starting up
[11/Jul/2012:10:46:11 -0400] NSMMReplicationPlugin - ruv_compare_ruv: RUV [changelog max
RUV] does not contain element [{replica 68 ldap://gtedm3.iam.gatech.edu:389}
4be339e6000000440000 4ffdc9a1000000440000] which is present in RUV [database RUV]
[11/Jul/2012:10:46:11 -0400] NSMMReplicationPlugin - ruv_compare_ruv: RUV [changelog max
RUV] does not contain element [{replica 71 ldap://gtedm4.iam.gatech.edu:389}
4be6031e000000470000 4ffdc9a8000000470000] which is present in RUV [database RUV]
[11/Jul/2012:10:46:11 -0400] NSMMReplicationPlugin - ruv_compare_ruv: the max CSN
[4ffb62a2000100330000] from RUV [changelog max RUV] is larger than the max CSN
[4ffb605d000000330000] from RUV [database RUV] for element [{replica 51}
4ffb605d000000330000 4ffb62a2000100330000]
[11/Jul/2012:10:46:11 -0400] NSMMReplicationPlugin - replica_check_for_data_reload:
Warning: data for replica
ou=accounts,ou=gtaccounts,ou=departments,dc=gted,dc=gatech,dc=edu does not match the data
in the changelog. Recreating the changelog file. This could affect replication with
replica's consumers in which case the consumers should be reinitialized.
[11/Jul/2012:10:46:11 -0400] NSMMReplicationPlugin - ruv_compare_ruv: RUV [changelog max
RUV] does not contain element [{replica 69 ldap://gtedm3.iam.gatech.edu:389}
4be339e4000000450000 4ffdc9a2000000450000] which is present in RUV [database RUV]
[11/Jul/2012:10:46:11 -0400] NSMMReplicationPlugin - ruv_compare_ruv: RUV [changelog max
RUV] does not contain element [{replica 72 ldap://gtedm4.iam.gatech.edu:389}
4be6031d000000480000 4ffdc9a9000300480000] which is present in RUV [database RUV]
[11/Jul/2012:10:46:11 -0400] NSMMReplicationPlugin - ruv_compare_ruv: the max CSN
[4ffb78bc000000340000] from RUV [changelog max RUV] is larger than the max CSN
[4ffb7098000100340000] from RUV [database RUV] for element [{replica 52}
4ffb7098000100340000 4ffb78bc000000340000]
[11/Jul/2012:10:46:11 -0400] NSMMReplicationPlugin - replica_check_for_data_reload:
Warning: data for replica ou=people,dc=gted,dc=gatech,dc=edu does not match the data in
the changelog. Recreating the changelog file. This could affect replication with
replica's consumers in which case the consumers should be reinitialized.
[11/Jul/2012:10:46:11 -0400] - slapd started.  Listening on All Interfaces port 389 for
LDAP requests
[11/Jul/2012:10:46:11 -0400] - Listening on All Interfaces port 636 for LDAPS requests

At this point, the only way I've found to get it back is to clean out the changelog
and db directories and re-import the ldap data from scratch.  Essentially we can't
restart without having to re-import.  I've done this a couple of times already and
it's entirely reproducible.

I've checked and ensured that there's no obsolete masters that need to be
CLEANRUVed.  I've also noticed that the errors _seem_ to be only affecting our second
and third suffix.  We have three suffixes defined, but I haven't seen any error
messages for the first one.

Has anyone seen anything like this?  We're not sure if this is a general 1.2.10.4
issue or if it only occurs if when replicating from 1.2.8.3 to 1.2.10.4.  If it's the
former, we cannot proceed with getting the rest of the servers up to 1.2.10.4.  If
it's the latter, then we need to expedite getting everything up to 1.2.10.4.

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

2009

2008

2007

2006

2005