Hi Rich,

 

It has been a while since we discussed the bug that turned out to be SPARC-specific.

In the meantime, I have gotten access to the OpenCSW build environment, so I can build the source and test the fix you mentioned: the fix to the atomic operations (please see the email thread below for the bug details).

 

For your reference, we use 389 DS, version 1.2.11.15, on Solaris SPARC. The bug cannot be reproduced on Solaris x86 or on Red Hat Linux x86.

Furthermore, the only way 389 DS 1.2.11.15 on Solaris SPARC works correctly in a multi-master replication topology is when all the other servers are on Solaris x86 platforms and the SPARC server is the one used to initialize all the others.

 

Could you please direct me to where in the source code the fix should be applied?

 

Thank you,

Jovan

 

Jovan Vukotić • Senior Software Engineer • Ambit Treasury Management • SunGard • Banking • Bulevar Milutina Milankovića 136b, Belgrade, Serbia • tel: +381.11.6555-66-1 • jovan.vukotic@sungard.com

 

 

 

From: Rich Megginson [mailto:rmeggins@redhat.com]
Sent: Friday, June 28, 2013 4:17 PM
To: Vukotic, Jovan
Cc: 389-users@lists.fedoraproject.org; Mehta, Cyrus
Subject: Re: [389-users] FW: fresh replica reports "reloading ruv failed " just after successfull initialization

 

On 06/28/2013 03:30 AM, Jovan.VUKOTIC@sungard.com wrote:

Rich,

 

No, I do not build the code myself.


ok - looks like CSW packages.

I'm not sure if things are going to work correctly until we get the atomic op bug fixed.  Unfortunately we don't have the means to build and test on Sparc.  Is there someone who can help us build and test some fixes?
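
The kind of fix being discussed usually amounts to replacing a plain read-modify-write on shared data with an atomic primitive. Below is a minimal sketch in C, assuming GCC or a recent Sun/Oracle Studio compiler (both provide the __sync builtins); the variable and function names are hypothetical stand-ins, not the actual 389 DS call sites:

#include <stdint.h>

static uint32_t refcnt;            /* shared across worker threads */

/* Racy: the compiler emits separate load, add, and store
 * instructions, so two threads can read the same old value and
 * one increment is silently lost. */
void refcnt_incr_racy(void)
{
    refcnt++;
}

/* Atomic: compiles to a compare-and-swap (cas) sequence on SPARC
 * and a locked add on x86.  For 64-bit counters on 32-bit SPARC
 * builds the 64-bit builtin may be unavailable, in which case a
 * mutex-protected fallback is the safe choice. */
void refcnt_incr_atomic(void)
{
    (void)__sync_add_and_fetch(&refcnt, 1);
}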


 

At the moment, with the error log level set to 40960 (32768 + 8192), I got a few more error messages, but they do not tell me anything:

 

[28/Jun/2013:05:06:03 -0400] - cache_add_tentative concurrency detected
[28/Jun/2013:05:06:09 -0400] NSMMReplicationPlugin - _replica_configure_ruv: failed to create replica ruv tombstone entry (dc=xxxxxx,dc=com); LDAP error - 68
[28/Jun/2013:05:06:39 -0400] - cache_add_tentative concurrency detected
[28/Jun/2013:05:06:39 -0400] NSMMReplicationPlugin - _replica_configure_ruv: failed to create replica ruv tombstone entry (dc=xxxxxxx,dc=com); LDAP error - 68
[28/Jun/2013:05:07:00 -0400] - Changelog purge skipped anchor csn 51c5ec28000000020000
[28/Jun/2013:05:07:09 -0400] - cache_add_tentative concurrency detected
[28/Jun/2013:05:07:09 -0400] NSMMReplicationPlugin - _replica_configure_ruv: failed to create replica ruv tombstone entry (dc=xxxxxx,dc=com); LDAP error - 68
[28/Jun/2013:05:07:39 -0400] - cache_add_tentative concurrency detected
[28/Jun/2013:05:07:39 -0400] NSMMReplicationPlugin - _replica_configure_ruv: failed to create replica ruv tombstone entry (dc=xxxxxxx,dc=com); LDAP error - 68
[28/Jun/2013:05:08:09 -0400] - cache_add_tentative concurrency detected
[28/Jun/2013:05:08:09 -0400] NSMMReplicationPlugin - _replica_configure_ruv: failed to create replica ruv tombstone entry (dc=xxxxxxx,dc=com); LDAP error - 68
[28/Jun/2013:05:08:39 -0400] - cache_add_tentative concurrency detected
[28/Jun/2013:05:08:39 -0400] NSMMReplicationPlugin - _replica_configure_ruv: failed to create replica ruv tombstone entry (dc=xxxxxx,dc=com); LDAP error - 68
[28/Jun/2013:05:09:04 -0400] NSMMReplicationPlugin - changelog program - _cl5GetDBFile: found DB object 13f5c40 for database /var/opt/csw/lib/dirsrv/slapd-inst-dr02/changelogdb/686eae02-1dd211b2-b3b3aede-af5e4e28_51c5c8ae000000020000.db4
[28/Jun/2013:05:09:04 -0400] NSMMReplicationPlugin - changelog program - cl5GetOperationCount: found DB object 13f5c40
[28/Jun/2013:05:09:09 -0400] - cache_add_tentative concurrency detected
[28/Jun/2013:05:09:09 -0400] NSMMReplicationPlugin - _replica_configure_ruv: failed to create replica ruv tombstone entry (dc=xxxxxxx,dc=com); LDAP error - 68
[28/Jun/2013:05:09:39 -0400] - cache_add_tentative concurrency detected
[28/Jun/2013:05:09:39 -0400] NSMMReplicationPlugin - _replica_configure_ruv: failed to create replica ruv tombstone entry (dc=xxxxxxx,dc=com); LDAP error - 68
[28/Jun/2013:05:10:09 -0400] - cache_add_tentative concurrency detected
[28/Jun/2013:05:10:09 -0400] NSMMReplicationPlugin - _replica_configure_ruv: failed to create replica ruv tombstone entry (dc=xxxxxxx,dc=com); LDAP error - 68
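
For reference, the error log level used above is the sum of the documented level flags: 8192 enables replication debugging, and 32768 should be entry-cache debugging (consistent with the cache_add_tentative messages). Assuming the default cn=Directory Manager root DN, such a level can be set on cn=config with an ordinary ldapmodify, e.g.:

ldapmodify -x -D "cn=directory manager" -W <<EOF
dn: cn=config
changetype: modify
replace: nsslapd-errorlog-level
nsslapd-errorlog-level: 40960
EOF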

 

 

Thanks,

 

Jovan Vukotić • Senior Software Engineer • Ambit Treasury Management • SunGard • Banking • Bulevar Milutina Milankovića 136b, Belgrade, Serbia • tel: +381.11.6555-66-1 • jovan.vukotic@sungard.com

Join the online conversation with SunGard’s customers, partners and Industry experts and find an event near you at: www.sungard.com/ten.

 

 

 

 

From: Rich Megginson [mailto:rmeggins@redhat.com]
Sent: Thursday, June 27, 2013 6:20 PM
To: Vukotic, Jovan
Cc: 389-users@lists.fedoraproject.org; Mehta, Cyrus
Subject: Re: [389-users] FW: fresh replica reports "reloading ruv failed " just after successfull initialization

 

On 06/27/2013 09:14 AM, Jovan.VUKOTIC@sungard.com wrote:

Rich,

 

On Linux x86_64 and Solaris x86_64 the error cannot be reproduced; it occurs only on Solaris SPARC.

 

On the other hand, Solaris SPARC works fine only if it is the first master replica in the multi-master array, that is, the one that initializes the other replicas.

 

Do you, perhaps, have any suggestion as to how to tune Solaris SPARC platform?


I think there is a bug in the way we handle atomic operations on SPARC.  We don't develop or test on SPARC, so it's not surprising we have a bug in this area.  Do you build the code yourself?



I am going to add more detailed logging to the errors file.

 

Thanks,
Jovan

 

Jovan Vukotić • Senior Software Engineer • Ambit Treasury Management • SunGard • Banking • Bulevar Milutina Milankovića 136b, Belgrade, Serbia • tel: +381.11.6555-66-1 • jovan.vukotic@sungard.com

Join the online conversation with SunGard’s customers, partners and Industry experts and find an event near you at: www.sungard.com/ten.

 

 

 

From: Rich Megginson [mailto:rmeggins@redhat.com]
Sent: Monday, June 24, 2013 10:45 PM
To: General discussion list for the 389 Directory server project.
Cc: Vukotic, Jovan; Mehta, Cyrus
Subject: Re: [389-users] FW: fresh replica reports "reloading ruv failed " just after successfull initialization

 

On 06/24/2013 09:34 AM, Jovan.VUKOTIC@sungard.com wrote:

Hi,

 

I would like to link the issue I reported on Saturday with bug 723937, filed some two years ago.

There, just as in my case, leftover dn/entry cache entries were reported prior to the initialization of the master replica.

 

I repeated the replication configuration today: the multi-master replica was initialized by another replica while having only one entry (the root object) in its userRoot database prior to the initialization.

First, two entries were found, then 5… and then 918 (which matches the number of entries in the master database):

 

[24/Jun/2013:08:16:03 -0400] - entrycache_clear_int: there are still 2 entries in the entry cache.
[24/Jun/2013:08:16:03 -0400] - dncache_clear_int: there are still 2 dn's in the dn cache. :/
[24/Jun/2013:08:16:03 -0400] - WARNING: Import is running with nsslapd-db-private-import-mem on; No other process is allowed to access the database
[24/Jun/2013:08:16:07 -0400] - import userRoot: Workers finished; cleaning up...
[24/Jun/2013:08:16:07 -0400] - import userRoot: Workers cleaned up.
[24/Jun/2013:08:16:07 -0400] - import userRoot: Indexing complete. Post-processing...
[24/Jun/2013:08:16:07 -0400] - import userRoot: Generating numSubordinates complete.
[24/Jun/2013:08:16:07 -0400] - import userRoot: Flushing caches...
[24/Jun/2013:08:16:07 -0400] - import userRoot: Closing files...
[24/Jun/2013:08:16:07 -0400] - entrycache_clear_int: there are still 5 entries in the entry cache.
[24/Jun/2013:08:16:07 -0400] - dncache_clear_int: there are still 918 dn's in the dn cache. :/
[24/Jun/2013:08:16:07 -0400] - import userRoot: Import complete. Processed 918 entries in 4 seconds. (229.50 entries/sec)
[24/Jun/2013:08:16:07 -0400] NSMMReplicationPlugin - multimaster_be_state_change: replica dc=xxxxxx,dc=com is coming online; enabling replication
[24/Jun/2013:08:16:07 -0400] NSMMReplicationPlugin - _replica_configure_ruv: failed to create replica ruv tombstone entry (dc=xxxxxx,dc=com); LDAP error - 68

 

I would like to add that all the replicas that could not be configured due to the reported errors were installed on Solaris 10 on SPARC processors, whereas the only replica that was initialized successfully was installed on Solaris 10 on i386 processors.


Any chance you could try to reproduce this on a Linux x86_64 system?




 

Thanks,
Jovan

Jovan Vukotić • Senior Software Engineer • Ambit Treasury Management • SunGard • Banking • Bulevar Milutina Milankovića 136b, Belgrade, Serbia • tel: +381.11.6555-66-1 • jovan.vukotic@sungard.com


 

Join the online conversation with SunGard’s customers, partners and Industry experts and find an event near you at: www.sungard.com/ten.

 

 

 

From: Vukotic, Jovan
Sent: Saturday, June 22, 2013 11:59 PM
To: '389-users@lists.fedoraproject.org'
Subject: fresh replica reports "reloading ruv failed " just after successfull initialization.

 

Hi,

 

We have four 389 DS servers, version 1.2.11, that we are organizing into a multi-master replication topology.

 

After I enabled all four multi-master replicas and initialized them from the one reference replica, M1, incremental replication started, but it turned out that only two of them were participating in replication: the reference M1 and M2 (replication works in both directions).

I tried to fix M3 and M4 in the following way:

M3 example:

I removed the replication agreement M1-M3 (M2-M3 did not exist, and M4 was switched off).

After several database restores of the pre-replication state and reconfigurations of that replica, I removed the 389 DS instance M3 completely and reinstalled it: remove-ds-admin.pl + setup-ds-admin.pl. I configured TLS/SSL (as before), restarted the DS, and enabled the replica from the 389 Console.

Then I returned to M1, recreated the agreement, and initialized M3. It was successful again, in the sense that M3 imported all the data, but immediately after that, errors that look strange to me were reported (see the excerpt below).

What confuses me is that LDAP error 68 means that an entry already exists… even for a new replica. Why a tombstone?
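
As a side note, the RUV tombstone that _replica_configure_ruv tries to create is an ordinary entry in the replicated suffix with a fixed nsuniqueid, so the copy that is already there (and causing LDAP error 68) can be inspected with a plain search. A sketch, assuming the default cn=Directory Manager bind DN and a placeholder suffix:

ldapsearch -x -D "cn=directory manager" -W -b "dc=example,dc=com" \
  "(&(objectclass=nstombstone)(nsuniqueid=ffffffff-ffffffff-ffffffff-ffffffff))" nsds50ruv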

 

Or, to make a long story short: is the only remedy to reinstall all four replicas again?

 

[22/Jun/2013:16:30:50 -0400] - All database threads now stopped          // this is from a backup done before the replication configuration
[22/Jun/2013:16:43:25 -0400] NSMMReplicationPlugin - multimaster_be_state_change: replica xxxxxxxxxx is going off line; disabling replication
[22/Jun/2013:16:43:25 -0400] - entrycache_clear_int: there are still 20 entries in the entry cache.
[22/Jun/2013:16:43:25 -0400] - dncache_clear_int: there are still 20 dn's in the dn cache. :/
[22/Jun/2013:16:43:25 -0400] - WARNING: Import is running with nsslapd-db-private-import-mem on; No other process is allowed to access the database
[22/Jun/2013:16:43:30 -0400] - import userRoot: Workers finished; cleaning up...
[22/Jun/2013:16:43:30 -0400] - import userRoot: Workers cleaned up.
[22/Jun/2013:16:43:30 -0400] - import userRoot: Indexing complete. Post-processing...
[22/Jun/2013:16:43:30 -0400] - import userRoot: Generating numSubordinates complete.
[22/Jun/2013:16:43:30 -0400] - import userRoot: Flushing caches...
[22/Jun/2013:16:43:30 -0400] - import userRoot: Closing files...
[22/Jun/2013:16:43:30 -0400] - entrycache_clear_int: there are still 20 entries in the entry cache.
[22/Jun/2013:16:43:30 -0400] - dncache_clear_int: there are still 917 dn's in the dn cache. :/
[22/Jun/2013:16:43:30 -0400] - import userRoot: Import complete. Processed 917 entries in 4 seconds. (229.25 entries/sec)
[22/Jun/2013:16:43:30 -0400] NSMMReplicationPlugin - multimaster_be_state_change: replica xxxxxxxxxxx is coming online; enabling replication
[22/Jun/2013:16:43:30 -0400] NSMMReplicationPlugin - _replica_configure_ruv: failed to create replica ruv tombstone entry (xxxxxxxxxx); LDAP error - 68
[22/Jun/2013:16:43:30 -0400] NSMMReplicationPlugin - replica_enable_replication: reloading ruv failed
[22/Jun/2013:16:43:32 -0400] NSMMReplicationPlugin - _replica_configure_ruv: failed to create replica ruv tombstone entry (xxxxxxxxx); LDAP error - 68
[22/Jun/2013:16:44:02 -0400] NSMMReplicationPlugin - _replica_configure_ruv: failed to create replica ruv tombstone entry (xxxxxxxxxx); LDAP error - 68
[22/Jun/2013:16:44:32 -0400] NSMMReplicationPlugin - _replica_configure_ruv: failed to create replica ruv tombstone entry (xxxxxxxxx); LDAP error - 68
[22/Jun/2013:16:45:02 -0400] NSMMReplicationPlugin - _replica_configure_ruv: failed to create replica ruv tombstone entry (xxxxxxxx); LDAP error - 68
[22/Jun/2013:16:45:32 -0400] NSMMReplicationPlugin - _replica_configure_ruv: failed to create replica ruv tombstone entry (xxxxxxxxx); LDAP error - 68
[22/Jun/2013:16:46:02 -0400] NSMMReplicationPlugin - _replica_configure_ruv: failed to create replica ruv tombstone entry (xxxxxxxxx); LDAP error - 68

 

Any help will be appreciated.

Thank you.

 

 

Jovan Vukotić • Senior Software Engineer • Ambit Treasury Management • SunGard • Banking • Bulevar Milutina Milankovića 136b, Belgrade, Serbia • tel: +381.11.6555-66-1 • jovan.vukotic@sungard.com


 

Join the online conversation with SunGard’s customers, partners and Industry experts and find an event near you at: www.sungard.com/ten.

 






--
389 users mailing list
389-users@lists.fedoraproject.org
https://admin.fedoraproject.org/mailman/listinfo/389-users