Replication doesn't start, floods log with errors

I have a fresh install of 6.7.15 with a free license running on one host with one aggregator and one leaf.

I have the exact same setup on another host and I’m trying to test replication by replicating a database from one host (cluster) to the other. So from the secondary host I run:

replicate database test from root:'PW'@10.0.2.12;

From then on I get an infinite loop of error messages in the logs for both the leaf and the aggregator.

Below are some sample messages from the logs. Let me know if I need to look for some other messages, there are so many it’s hard for me to tell what’s relevant.

2424463364 2019-04-02 21:20:14.552 ERROR: ProcessHandshakeResponsePacket() failed. Sending back 1045: Access denied for user 'distributed'@'localhost' (using password: YES)
2424463562 2019-04-02 21:20:14.552 ERROR: Failed to connect to '127.0.0.1:3307/test_1': Bad user or password.

2412029554 2019-04-02 21:20:06.559 ERROR: ProcessNotifications Thread 115087: AdjustReferenceDatabases: Failure updating reference databases at LSN 59. Succeeded updating 1, needed 2.
2412455322 2019-04-02 21:20:06.985 ERROR: ProcessHandshakeResponsePacket() failed. Sending back 1045: Access denied for user 'distributed'@'localhost' (using password: YES)
2412455473 2019-04-02 21:20:06.985 ERROR: Thread 99999: SetConnFromMetadataNode: Connection failure connecting to node id 1: distributed:x@'127.0.0.1':3306. Failed with error: Access denied for user 'distributed'@'localhost' (using password: YES)

The errors indicate a login failure when one cluster tries to retrieve data from the other. This connection uses an internal user (called distributed). It looks like you may have a different user that is shadowing it, which would cause these login failures.

Did you create any users like distributed@127.0.0.1?
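If you want to double-check, something like the following on the Master Aggregator should list the existing accounts and their grants (a sketch - I'm assuming information_schema.USERS is available on 6.7; SHOW GRANTS for the current user is the MySQL-compatible fallback):

select * from information_schema.users;
show grants;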

No, I’ve never had a user called distributed. I did have a few other users, so I deleted them all except root, but I’m still getting the same result.

Here is what’s in the log on the primary cluster side:

3391936656896 2019-04-03 02:05:46.647 INFO: Thread 99999: SlaveOperationCommon: cluster log: Entered slave operation 0.
3391936657004 2019-04-03 02:05:46.647 WARN: Attempted to set network write buffer size to 0x800000 bytes, but the value is capped at 0x34000 bytes.
3391936657069 2019-04-03 02:05:46.647 WARN: Attempted to set network read buffer size to 0x800000 bytes, but the value is capped at 0x34000 bytes.
3391936807866 2019-04-03 02:05:46.798 INFO: Thread 99999: PerformHandshakeMaster: cluster log: Received replication request from node -5793996282194787538, starting LSN 0x0, previous page term 0.
3391936807924 2019-04-03 02:05:46.798 INFO: Thread 99999: SendLogRange: cluster log: Sending data to node -5793996282194787538, startLSN 0x0, endLSN 0x6d.
3391936807980 2019-04-03 02:05:46.798 INFO: Thread 99999: SendFile: cluster log: Sending file id 0x0 to node -5793996282194787538, starting offset 0x0, requested size 0x6d000.
3391936808031 2019-04-03 02:05:46.798 INFO: Thread 99999: SendFile: cluster log: Sending file id 0x0 to node -5793996282194787538, starting offset 0x0, actual size 0x6d000.
3391936814831 2019-04-03 02:05:46.805 INFO: Thread 99999: operator(): cluster log: Adding slave id 3, node id -5793996282194787538 at barrier number 20.
3391936814894 2019-04-03 02:05:46.805 INFO: Thread 99999: AddSlave: cluster log: Addition mentioned above took effect at LSN 0x6d.
3392081725113 2019-04-03 02:08:11.715 ERROR: Thread 115105: operator(): cluster log: Disconnecting slave at node -5793996282194787538 because of connection error.
3392081725215 2019-04-03 02:08:11.715 INFO: Thread 115105: operator(): cluster: Removing slave id 3, node id -5793996282194787538 at barrier number 21.
3392081725251 2019-04-03 02:08:11.715 INFO: Thread 115105: RemoveSlave: cluster log: Removal mentioned above took effect at LSN 0x6d.
3392081725272 2019-04-03 02:08:11.715 INFO: Thread 115105: DisconnectSlave: cluster log: Removed slave at node -5793996282194787538 because of disconnect.

I see this in one of the secondary cluster logs:

19552115386 2019-04-03 02:05:46.645 WARN: Attempted to set network write buffer size to 0x800000 bytes, but the value is capped at 0x34000 bytes.
19552115448 2019-04-03 02:05:46.645 WARN: Attempted to set network read buffer size to 0x800000 bytes, but the value is capped at 0x34000 bytes.
19552161376 2019-04-03 02:05:46.691 ERROR: ProcessNotifications Thread 115087: AdjustReferenceDatabases: Failure updating reference databases at LSN 69. Succeeded updating 1, needed 2.

Is there some Linux configuration that needs to be done? This is on CentOS 7. I haven’t done all of the Linux tuning recommendations yet, but none of them seem to apply to this.

I applied the Linux tuning below, which got rid of the warnings about the network buffers, but it didn’t change the outcome; I still get the same errors.

net.core.wmem_max = 12582912
net.core.rmem_max = 12582912
net.ipv4.tcp_rmem = 10240 87380 12582912
net.ipv4.tcp_wmem = 10240 87380 12582912

The warnings in the tracelog about the network buffer limits are unrelated - they can have an effect on replication performance, but they don’t affect correctness.

From looking at your original tracelogs, it seems like the DR replicas are trying to connect to '127.0.0.1' or 'localhost'. How did you set up your primary cluster? What’s the output of SHOW LEAVES and SHOW AGGREGATORS?

From the tracelogs it seems like you set up your primary cluster with non-public IPs (bootstrap aggregator '127.0.0.1' and/or add leaf root:PW@'127.0.0.1':3307). That does not work with DR: the address you provide in the REPLICATE command is only used for the initial setup, and after that the addresses used when setting up the primary cluster are what the data is actually replicated from, because the leaves in the secondary cluster connect directly to the primary cluster’s leaves. The addresses don’t need to be public, but they do need to be routable from the secondary cluster.
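For reference, a primary cluster that works with DR would have been set up with addresses the secondary can reach - a minimal sketch, assuming 10.0.2.12 is the primary host's routable IP (as in your REPLICATE command) and 3307 is the leaf port:

bootstrap aggregator '10.0.2.12';
add leaf root:'PW'@'10.0.2.12':3307;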

memsql> show aggregators\G
*************************** 1. row ***************************
                        Host: 127.0.0.1
                        Port: 3306
                       State: online
          Opened_Connections: 1
Average_Roundtrip_Latency_ms: NULL
           Master_Aggregator: 1
                      NodeId: 1
1 row in set (0.00 sec)

memsql> show leaves\G
*************************** 1. row ***************************
                        Host: 127.0.0.1
                        Port: 3307
          Availability_Group: 1
                   Pair_Host: NULL
                   Pair_Port: NULL
                       State: online
          Opened_Connections: 33
Average_Roundtrip_Latency_ms: 0.331
                      NodeId: 2
1 row in set (0.00 sec)

Is it possible to change the IP addresses or do I need to reinstall?

There is a way to change the addresses without reinstalling. You can change the leaf’s address by removing it and re-adding it:

remove leaf '127.0.0.1':3307;
add leaf root:'$PW'@'$PUBLIC_IP':3307;

In a bigger cluster with High Availability, you would do this one availability group at a time to stay online, but for a test environment this is fine (your data should all still be there when you re-add the leaf).
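As a concrete example with the addresses from this thread (assuming 10.0.2.12 is the primary host's routable IP and PW is the root password), plus a quick check that the leaf comes back online at the new address:

remove leaf '127.0.0.1':3307;
add leaf root:'PW'@'10.0.2.12':3307;
show leaves;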

For the Master Aggregator (MA), you can run the following command on it to change its address:

bootstrap aggregator '$PUBLIC_IP' force;

Be extremely careful with this second command, especially if running it in a bigger cluster (avoid it there at all costs). If you run it on a Child Aggregator by accident, you are essentially running aggregator set as master while skipping a few checks, with all the associated risks (data loss / cluster metadata inconsistency - it’s a command meant only for emergencies). If you run it on the Master Aggregator, it will just change its address in metadata, so it is safe in your case.
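Once both changes are in, it's worth confirming they took effect before retrying replication - a sketch of the checks and the retry, reusing the command from the top of the thread (if the earlier attempt left a partial test database behind on the secondary, drop it there before retrying):

-- on the primary master aggregator:
show aggregators;
show leaves;
-- on the secondary master aggregator:
replicate database test from root:'PW'@10.0.2.12;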

Replication works now, thank you!