[389-users] Can't locate CSN

Discussion:

[389-users] Can't locate CSN - replica issue

Marco Favero

2021-06-01 12:29:43 UTC

Hello,

I'm upgrading three multimaster servers from 389-ds 1.3.9.1 to 1.4.3.22:

- srv1
- srv2
- srv3

srv1 replicates to srv2 and srv3.
srv2 replicates to srv1 and srv3.
srv3 replicates to srv1 and srv2.

Suppose I reinstall srv3 with 389-ds 1.4.3.22 and initialize it from srv1. This succeeds as expected, and replication is fine.

Then I reinstall srv2 and initialize it from srv3. This also succeeds as expected, but right at the end of the initialization the agreement from srv3 to srv1 stops working.

The console shows "Error (18) Can't acquire replica (Incremental update transient warning. Backing off, will retry update later.)" as the status of the agreement from srv3 to srv1. In the logs I see errors like:

repl_plugin_name_cl - agmt="srv3 to srv1" (srv1:389): CSN 596c6868000075320000 not found, we aren't as up to date, or we purged
clcache_load_buffer - Can't locate CSN 596c6868000075320000 in the changelog (DB rc=-30988). If replication stops, the consumer may need to be reinitialized.
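
For readers following along, the CSN in these log lines can be split into its fields. The sketch below assumes the usual 389-ds CSN layout of 20 hex digits (8 for the timestamp in seconds, 4 for the sequence number, 4 for the replica ID, 4 for the sub-sequence number); the names `CSN` and `parse_csn` are made up for this example and are not part of any 389-ds tool.

```python
from datetime import datetime, timezone
from typing import NamedTuple


class CSN(NamedTuple):
    timestamp: datetime  # when the change was generated (UTC)
    seqnum: int          # sequence number within that second
    replica_id: int      # ID of the replica that originated the change
    subseqnum: int       # sub-sequence number


def parse_csn(csn: str) -> CSN:
    """Split a 20-hex-digit CSN string into its four fields (assumed layout)."""
    if len(csn) != 20:
        raise ValueError(f"expected 20 hex digits, got {len(csn)}")
    seconds = int(csn[0:8], 16)
    return CSN(
        timestamp=datetime.fromtimestamp(seconds, tz=timezone.utc),
        seqnum=int(csn[8:12], 16),
        replica_id=int(csn[12:16], 16),
        subseqnum=int(csn[16:20], 16),
    )


if __name__ == "__main__":
    # The CSN reported in the error log above.
    print(parse_csn("596c6868000075320000"))
```

Under that assumed layout, the CSN from the log decodes to replica ID 30002 and a timestamp in July 2017, which is far older than a 7-day purge window. That is only an observation from the decoded fields, not a diagnosis.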

The changelog DB Maximum Age is "7d", equal to the default nsDS5ReplicaPurgeDelay for the suffix.

This always happens, for every suffix.

To resolve the issue I have to reinitialize from srv3 to srv1 again and, once that initialization has finished, from srv3 to srv2.

To summarize:

1) install srv2 OK
2) initialize srv1 to srv3 OK
3) initialize srv2 to srv3: the agreement srv1 to srv3 stops working
4) initialize srv1 to srv3 again

I would like to know how to configure the Directory Server in order to avoid the above scenario.
The problem is very similar to

https://access.redhat.com/solutions/2690611

but that document says that the problem was already fixed in 389-ds-base-1.3.5.10-15.el7_3 or later.

Could you help me?

Thank you very much
Marco
_______________________________________________
389-users mailing list -- 389-***@lists.fedoraproject.org
To unsubscribe send an email to 389-users-***@lists.fedoraproject.org
Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: https://lists.fedoraproject.org/archives/list/389-***@lists.fedoraproject.org
Do not reply to spam on the list, report it: https://pagu

Marc Sauton

2021-06-01 17:45:52 UTC

does srv2 run 1.4.3.22 ?

you could try to delete the BDB region files, first stop the LDAP service,
then delete the files
/var/lib/dirsrv/slapd-xx/db/__db.00*

or try a more recent 1.4.4.15 or 1.4.4.16
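
A minimal sketch of the file-removal step, for illustration only. It assumes the instance has already been stopped and that the region files match the `__db.00*` pattern named above; the function name `remove_region_files` is hypothetical, and `slapd-xx` is the same placeholder used in the path above.

```python
import glob
import os
from typing import List


def remove_region_files(db_dir: str, dry_run: bool = True) -> List[str]:
    """Find (and optionally delete) BDB region files named __db.00* in db_dir.

    Call with dry_run=False only after the directory server is stopped.
    Returns the sorted list of matching paths.
    """
    matches = sorted(glob.glob(os.path.join(db_dir, "__db.00*")))
    if not dry_run:
        for path in matches:
            os.remove(path)
    return matches


if __name__ == "__main__":
    # Dry run: only list what would be removed.
    for path in remove_region_files("/var/lib/dirsrv/slapd-xx/db"):
        print("would remove", path)
```

The dry-run default is a deliberate choice here, so that nothing is deleted until the listed files have been checked.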

Thanks,
M.

Marco Favero

2021-06-03 06:53:54 UTC

Post by Marc Sauton
does srv2 run 1.4.3.22 ?

Yes, it is a fresh installation.

Post by Marc Sauton
you could try to delete the BDB region files, first stop the LDAP service,
then delete the files
/var/lib/dirsrv/slapd-xx/db/__db.00*

Hmm... I keep these files on a RAM filesystem. srv2 and srv3 are fresh installations and have only received their first initialization.

Post by Marc Sauton
or try a more recent 1.4.4.15 or 1.4.4.16

I'll see if I have a chance to try.

Thank you very much
Kind Regards
Marco

Marco Favero

2021-06-07 06:51:44 UTC

Even after I fix the issue above as described, if I restart a replica the suppliers stop sending updates and report:

"The remote replica has a different database generation ID than the local database. You may have to reinitialize the remote replica, or the local replica."

So, after a restart, I have to reinitialize the restarted server for it to receive updates :(

If I reboot the replica instead of restarting dirsrv (which deletes the __db.00* files in /dev/shm), the problem is the same.
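
One way to check the "different database generation ID" message is to compare the `{replicageneration}` element of the nsds50ruv values held by the supplier and the consumer. The sketch below only does the comparison on values that have already been retrieved; the function names are made up for this example, and the second generation value in the usage section is invented purely to show a mismatch.

```python
from typing import Iterable, Optional

GENERATION_PREFIX = "{replicageneration}"


def replica_generation(ruv_values: Iterable[str]) -> Optional[str]:
    """Return the generation ID from a list of nsds50ruv values, if present."""
    for value in ruv_values:
        value = value.strip()
        if value.startswith(GENERATION_PREFIX):
            return value[len(GENERATION_PREFIX):].strip()
    return None


def same_generation(supplier_ruv: Iterable[str], consumer_ruv: Iterable[str]) -> bool:
    """True only if both sides expose a generation ID and the IDs are equal."""
    supplier = replica_generation(supplier_ruv)
    consumer = replica_generation(consumer_ruv)
    return supplier is not None and supplier == consumer


if __name__ == "__main__":
    supplier = ["{replicageneration} 60704f730000c3500000"]
    consumer = ["{replicageneration} 60bdaaaa0000c3500000"]  # invented value
    print(same_generation(supplier, consumer))
```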

Kind Regards
Marco

Marco Favero

2021-06-07 07:39:18 UTC

Gasp, I suspect the problem is here. In the agreements I see

dn: cn=it 2--\3E1,cn=replica,cn=c\3Dit,cn=mapping tree,cn=config
objectClass: top
objectClass: nsds5replicationagreement
cn: it 2-->1
cn: it 2--\>1
nsDS5ReplicaRoot: c=it
description: it 2-->1
nsDS5ReplicaHost: srv1.example.com
nsDS5ReplicaPort: 389
nsDS5ReplicaBindMethod: simple
nsDS5ReplicaTransportInfo: LDAP
nsDS5ReplicaBindDN: cn=replication manager,cn=config
nsds50ruv: {replicageneration} 60704f730000c3500000
nsds50ruv: {replica 50001 ldap://srv1.example.com:389} 607424dd0000c3510000 60ba18fb0000c3510000
nsds50ruv: {replica 50000 ldap://srv.example.com:389} 6074264a0000c3500000 60ba190f0000c3500000
nsds50ruv: {replica 50002 ldap://srv2.example.com:389} 607426410000c3520000 60ba19050000c3520000
nsruvReplicaLastModified: {replica 50001 ldap://srv1.example.com:389} 00000000
nsruvReplicaLastModified: {replica 50000 ldap://srv.example.com:389} 00000000
nsruvReplicaLastModified: {replica 50002 ldap://srv2.example.com:389} 00000000
nsds5replicareapactive: 0
nsds5replicaLastUpdateStart: 20210604124542Z
nsds5replicaLastUpdateEnd: 20210604124542Z
nsds5replicaChangesSentSinceStartup:: NTAwMDI6NC8wIA==

Replica ID 50000 corresponds to srv3.example.com, the first host installed in the set of three multimaster servers. The load balancer host is srv.example.com. As suggested by dscreate, I put the balancer host in the "full_machine_name" parameter for all LDAP servers. For a reason I don't know, full_machine_name (the load balancer host) was written into the RUV instead of the FQDN of the machine hosting the dirsrv installation: in this case, srv.example.com instead of srv3.example.com.
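
To make the replica-ID-to-host mapping easier to see, the nsds50ruv values from the agreement entry above (with the LDIF line folding undone) can be parsed as follows. This is a sketch under the assumption that each element looks like `{replica <id> <url>} <min CSN> <max CSN>`; `RuvElement`, `parse_ruv_element` and `hosts_by_replica_id` are names invented for this example.

```python
import re
from typing import Dict, Iterable, NamedTuple, Optional
from urllib.parse import urlparse


class RuvElement(NamedTuple):
    replica_id: int
    host: str
    port: Optional[int]
    min_csn: Optional[str]
    max_csn: Optional[str]


# {replica <id> <url>} optionally followed by a min CSN and a max CSN
_RUV_RE = re.compile(r"^\{replica\s+(\d+)\s+([^}\s]+)\}\s*(\S+)?\s*(\S+)?\s*$")


def parse_ruv_element(value: str) -> Optional[RuvElement]:
    """Parse one nsds50ruv value; return None for non-replica elements."""
    match = _RUV_RE.match(value.strip())
    if match is None:
        return None  # for example the {replicageneration} element
    rid, url, min_csn, max_csn = match.groups()
    parsed = urlparse(url)
    return RuvElement(int(rid), parsed.hostname or "", parsed.port, min_csn, max_csn)


def hosts_by_replica_id(values: Iterable[str]) -> Dict[int, str]:
    """Map each replica ID to the host name recorded in its RUV element."""
    elements = (parse_ruv_element(v) for v in values)
    return {e.replica_id: e.host for e in elements if e is not None}


if __name__ == "__main__":
    ruv = [
        "{replicageneration} 60704f730000c3500000",
        "{replica 50001 ldap://srv1.example.com:389} 607424dd0000c3510000 60ba18fb0000c3510000",
        "{replica 50000 ldap://srv.example.com:389} 6074264a0000c3500000 60ba190f0000c3500000",
        "{replica 50002 ldap://srv2.example.com:389} 607426410000c3520000 60ba19050000c3520000",
    ]
    for rid, host in sorted(hosts_by_replica_id(ruv).items()):
        print(rid, host)
```

Run against the values above, the entry for replica ID 50000 is the one carrying the balancer name srv.example.com.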

I suspect that reinstalling all servers with their own hostname in "full_machine_name" would resolve my issue.

Any ideas?

Thank you very much

Thierry Bordaz

2021-06-07 09:13:12 UTC

Hi Marco,

the hostname in the RUV (nsds50ruv) comes from the 'nsslapd-localhost'
attribute in the 'cn=config' entry (dse.ldif). I am unsure of the impact
of this erroneous value (srv.example.com instead of srv3.example.com)
in the RUV.

IMHO what matters for the RA to start a replication session is
nsds5ReplicaHost and replicageneration. Of course it would be better
for the hosts in the RUV elements to be valid, but I am not sure that
explains why srv1->srv3 stopped working.

If you can reproduce the problem, I would recommend that you enable
replication logging (nsslapd-errorlog-level: 8192) on both sides (srv1
and srv3) and reproduce the failure of the RA. Then isolate, from the
access logs and error logs, the replication session that fails.
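
As a rough illustration of those two steps, the sketch below builds an LDIF modify that sets nsslapd-errorlog-level to 8192 on cn=config (it could be fed to a tool such as ldapmodify), and filters error-log lines for one agreement. The `agmt="..."` marker is taken from the log lines quoted earlier in this thread; the function names are invented for this example.

```python
from typing import Iterable, List

REPL_LOG_LEVEL = 8192  # replication logging level suggested above


def errorlog_level_ldif(level: int = REPL_LOG_LEVEL) -> str:
    """Build an LDIF modify that sets nsslapd-errorlog-level on cn=config."""
    return (
        "dn: cn=config\n"
        "changetype: modify\n"
        "replace: nsslapd-errorlog-level\n"
        f"nsslapd-errorlog-level: {level}\n"
    )


def agreement_lines(lines: Iterable[str], agmt_name: str) -> List[str]:
    """Keep only the log lines that mention the given replication agreement."""
    marker = f'agmt="{agmt_name}"'
    return [line.rstrip("\n") for line in lines if marker in line]


if __name__ == "__main__":
    print(errorlog_level_ldif())
    sample = [
        'repl_plugin_name_cl - agmt="srv3 to srv1" (srv1:389): CSN 596c6868000075320000 not found\n',
        "clcache_load_buffer - Can't locate CSN 596c6868000075320000 in the changelog\n",
    ]
    for line in agreement_lines(sample, "srv3 to srv1"):
        print(line)
```

Lines such as the clcache_load_buffer message carry no agreement name, so in practice the filter would likely need to be combined with a time window around the failing session.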

regards
thierry
