Leaf won't leave "Recovering" state (split brain)

Hey! So I ran memsql-admin register-node -y --json --force ... and then memsql-admin start-node -y --memsql-id <ID> against a leaf node (from the master aggregator node, of course) last night, and when I woke up the leaf was still in the “Recovering” state. I called start-node again and waited about 10 minutes. Here’s what memsql-admin list-nodes shows me:

+------------+--------+-------------+------+---------------+--------------+---------+----------------+--------------------+--------------+
| MemSQL ID  |  Role  |    Host     | Port | Process State | Connectable? | Version | Recovery State | Availability Group | Bind Address |
+------------+--------+-------------+------+---------------+--------------+---------+----------------+--------------------+--------------+
| 967171DBBC | Master | 10.0.45.3   | 3306 | Running       | True         | 6.8.11  | Online         |                    | 0.0.0.0      |
| 101B4127FD | Leaf   | 10.0.39.177 | 3306 | Running       | True         | 6.8.11  | Recovering     | 1                  | 0.0.0.0      |
+------------+--------+-------------+------+---------------+--------------+---------+----------------+--------------------+--------------+

And here’s what sudo tail -n 100 /mnt/raid/memsql/tracelogs/memsql.log shows me so far:

133270214 2019-10-31 15:02:19.696  ERROR: Thread 115068: Fn: Some error happened during Replication Management. Processed 2 out of 4
139790686 2019-10-31 15:02:26.217   INFO: Recovering secondary key `deft_2`.`datoms_ref`.`ivt`
143277894 2019-10-31 15:02:29.704  ERROR: Thread 115068: Fn: Some error happened during Replication Management. Processed 2 out of 4
153284812 2019-10-31 15:02:39.711  ERROR: Thread 115068: Fn: Some error happened during Replication Management. Processed 2 out of 4
159675906 2019-10-31 15:02:46.102   INFO: Recovering secondary key `deft_1`.`datoms_string`.`eavt`
162415437 2019-10-31 15:02:48.841   INFO: Recovering secondary key `deft_2`.`datoms_ref`.`eavt`
163297285 2019-10-31 15:02:49.723  ERROR: Thread 115068: Fn: Some error happened during Replication Management. Processed 2 out of 4
166546699 2019-10-31 15:02:52.973   INFO: Recovering secondary key `deft_2`.`datoms_ref`.`avt`
173305009 2019-10-31 15:02:59.731  ERROR: Thread 115068: Fn: Some error happened during Replication Management. Processed 2 out of 4
174770068 2019-10-31 15:03:01.196   INFO: Recovering secondary key `deft_1`.`datoms_string`.`avt`
180796892 2019-10-31 15:03:07.223   INFO: Recovering secondary key `deft_2`.`datoms_ref`.`vt`
183314803 2019-10-31 15:03:09.741  ERROR: Thread 115068: Fn: Some error happened during Replication Management. Processed 2 out of 4
193329751 2019-10-31 15:03:19.756  ERROR: Thread 115068: Fn: Some error happened during Replication Management. Processed 2 out of 4
203011969 2019-10-31 15:03:29.438   INFO: Recovering secondary key `deft_2`.`datoms_ref`.`tt`
203347296 2019-10-31 15:03:29.773  ERROR: Thread 115068: Fn: Some error happened during Replication Management. Processed 2 out of 4
208494557 2019-10-31 15:03:34.921   INFO: Recovering secondary key `deft_1`.`datoms_string`.`vt`
208868212 2019-10-31 15:03:35.294   INFO: Recovering secondary key `deft_2`.`datoms_ref`.`p`
212574665 2019-10-31 15:03:39.001   INFO: Preparing to recover secondary keys for table `deft_2`.`datoms_short`
212574729 2019-10-31 15:03:39.001   INFO: Recovering secondary key `deft_2`.`datoms_short`.`ieavt`
212575379 2019-10-31 15:03:39.001   INFO: Recovering secondary key `deft_2`.`datoms_short`.`iavt`
212575804 2019-10-31 15:03:39.002   INFO: Recovering secondary key `deft_2`.`datoms_short`.`ivt`
212576035 2019-10-31 15:03:39.002   INFO: Recovering secondary key `deft_2`.`datoms_short`.`eavt`
212576386 2019-10-31 15:03:39.002   INFO: Recovering secondary key `deft_2`.`datoms_short`.`avt`
212576803 2019-10-31 15:03:39.003   INFO: Recovering secondary key `deft_2`.`datoms_short`.`vt`
212577053 2019-10-31 15:03:39.003   INFO: Recovering secondary key `deft_2`.`datoms_short`.`tt`
212577469 2019-10-31 15:03:39.003   INFO: Recovering secondary key `deft_2`.`datoms_short`.`p`
212577863 2019-10-31 15:03:39.004   INFO: Preparing to recover secondary keys for table `deft_2`.`datoms_string`
212765739 2019-10-31 15:03:39.192   INFO: Recovering secondary key `deft_2`.`datoms_string`.`ieavt`
213354939 2019-10-31 15:03:39.781  ERROR: Thread 115068: Fn: Some error happened during Replication Management. Processed 2 out of 4
220860837 2019-10-31 15:03:47.287   INFO: Recovering secondary key `deft_2`.`datoms_string`.`iavt`
223367782 2019-10-31 15:03:49.794  ERROR: Thread 115068: Fn: Some error happened during Replication Management. Processed 2 out of 4
233382269 2019-10-31 15:03:59.808  ERROR: Thread 115068: Fn: Some error happened during Replication Management. Processed 2 out of 4
243391441 2019-10-31 15:04:09.817  ERROR: Thread 115068: Fn: Some error happened during Replication Management. Processed 2 out of 4
245806903 2019-10-31 15:04:12.233   INFO: Recovering secondary key `deft_2`.`datoms_string`.`ivt`
253402606 2019-10-31 15:04:19.829  ERROR: Thread 115068: Fn: Some error happened during Replication Management. Processed 2 out of 4

295026679 2019-10-31 15:05:01.453   INFO: Clearing snapshot version for database `deft_1` as it is about to go online.
295062983 2019-10-31 15:05:01.489   INFO: RecoverDatabaseThread[deft_1]: End
295483958 2019-10-31 15:05:01.910   INFO: STOP REPLICATING deft_1
295484017 2019-10-31 15:05:01.910   INFO: Thread 115083: StopSlaveThreads: Stopping slave thread for database `deft_1`.
295484070 2019-10-31 15:05:01.910   INFO: Completed STOP REPLICATING deft_1
295484093 2019-10-31 15:05:01.910   INFO: DB deft_1: has original Id 11395525453320891103.
295484104 2019-10-31 15:05:01.910   INFO: PROMOTE ALL LOCKS RELEASED deft_1
303494218 2019-10-31 15:05:09.920  ERROR: Thread 115068: Fn: Some error happened during Replication Management. Processed 3 out of 4

423633071 2019-10-31 15:07:10.059  ERROR: Thread 115068: Fn: Some error happened during Replication Management. Processed 3 out of 4
426466282 2019-10-31 15:07:12.892   INFO: Recovering secondary key `deft_2`.`datoms_string`.`p`
430241989 2019-10-31 15:07:16.668   INFO: [deft_2] Scanning columnstore segments: Completed, elapsed: 0 ms, disk: 0 ms
430242484 2019-10-31 15:07:16.668   INFO: Clearing snapshot version for database `deft_2` as it is about to go online.
430279079 2019-10-31 15:07:16.705   INFO: RecoverDatabaseThread[deft_2]: End
430279189 2019-10-31 15:07:16.705   INFO: Completed recovery of user databases.
430640805 2019-10-31 15:07:17.067   INFO: STOP REPLICATING deft_2
430640881 2019-10-31 15:07:17.067   INFO: Thread 115080: StopSlaveThreads: Stopping slave thread for database `deft_2`.
430640930 2019-10-31 15:07:17.067   INFO: Completed STOP REPLICATING deft_2
430640977 2019-10-31 15:07:17.067   INFO: DB deft_2: has original Id 11395525453320891103.
430640995 2019-10-31 15:07:17.067   INFO: PROMOTE ALL LOCKS RELEASED deft_2

The logs do show repeated errors, but there does seem to be forward progress of some sort. However, memsql-admin list-nodes still shows the leaf as “Recovering”, and the log hasn’t been written to in 8 minutes. I have a pretty grim outlook at the moment, since my first thought is that the data is corrupted. Luckily I do have backups, but they’re potentially slightly after the last known good state; we’ll see.

Any insight you can bring to bear on this would be extremely appreciated since we’re dead in the water till we fix this! Thanks!

Alex,

That leaf log looks good. Once a leaf logs the message Completed recovery of user databases, it has finished replaying data into memory, so there is no data corruption.

What does the master aggregator log look like? Have you always had just one leaf in this cluster? Typically a MemSQL cluster has at least two leaves so that you can enable high availability. I would strongly recommend deploying a second leaf so that you can benefit from the resiliency high availability provides. For more information, see the MemSQL docs: https://docs.memsql.com/v6.8/guides/cluster-management/operations/managing-high-availability/

Best,

Thanks so much for taking a look! :slight_smile: That does make me feel better. It’s still strange that it’s stuck in a Recovering state. The latest output of the master aggregator log is a bunch of statistics queries; prior to that, it says there is an availability group with no leaves. I have an urgent meeting, so I’ll have to get back to you with more specifics within an hour or so.

We’ve always had just one leaf in the cluster, but we just upped the instance size from 16 GB to 32 GB RAM. We actually have the leaf running on an AWS spot instance, heh. We’re an early-stage startup, so running lean is much more critical for us than high availability at this point. We can tolerate outages of an hour or longer, but the current half-day outage is a bit much.

@sgl-memsql said that, based on an analysis of the cluster report, the master aggregator failed to perform the automatic rebalance of partitions when the leaf came online because high availability is enabled and there is only one leaf in the cluster. The answer is to either disable high availability or deploy a second leaf. In our case we’re disabling HA. (Thanks so much for your help @sgl-memsql!)

EDIT: @@GLOBAL.redundancy_level is already set to 1, so high availability already seems to be disabled. Not sure what’s up; researching more. Note that memsql-admin show-leaves says “no leaves found”, whereas memsql-admin list-nodes clearly shows the leaf in question. However, I’m told this is likely because show-leaves is meant to show only online leaves.

@sgl-memsql (et al.) — This is still a problem. I tried the nuclear option of scrapping the master aggregator node, configuring it afresh, bringing it back online, and reconnecting the leaf node, but no luck. I’ve scoured the internet for this issue, also with no luck. The fundamental issue is this:

On the leaf node it says in the log:

421201526 2019-10-31 22:57:49.637   INFO: RecoverDatabaseThread[deft_2]: End
421201656 2019-10-31 22:57:49.638   INFO: Completed recovery of user databases.
421230918 2019-10-31 22:57:49.667   INFO: STOP REPLICATING deft_2
421231013 2019-10-31 22:57:49.667   INFO: Thread 115085: StopSlaveThreads: Stopping slave thread for database `deft_2`.
421231075 2019-10-31 22:57:49.667   INFO: Completed STOP REPLICATING deft_2
421231129 2019-10-31 22:57:49.667   INFO: DB deft_2: has original Id 11395525453320891103.
421231158 2019-10-31 22:57:49.667   INFO: PROMOTE ALL LOCKS RELEASED deft_2

However on the leaf node:

sudo memsqlctl list-nodes
+------------+------+------+---------------+--------------+---------+----------------+--------------------+
| MemSQL ID  | Role | Port | Process State | Connectable? | Version | Recovery State | Availability Group |
+------------+------+------+---------------+--------------+---------+----------------+--------------------+
| 101B4127FD | Leaf | 3306 | Running       | True         | 6.8.11  | Recovering     | 1                  |
+------------+------+------+---------------+--------------+---------+----------------+--------------------+

And on the master node:

memsql-admin list-nodes
+------------+--------+-------------+------+---------------+--------------+---------+----------------+--------------------+--------------+
| MemSQL ID  |  Role  |    Host     | Port | Process State | Connectable? | Version | Recovery State | Availability Group | Bind Address |
+------------+--------+-------------+------+---------------+--------------+---------+----------------+--------------------+--------------+
| 4E0E616787 | Master | 10.0.61.70  | 3306 | Running       | True         | 6.8.11  | Online         |                    | 0.0.0.0      |
| 101B4127FD | Leaf   | 10.0.39.177 | 3306 | Running       | True         | 6.8.11  | Recovering     | 1                  | 0.0.0.0      |
+------------+--------+-------------+------+---------------+--------------+---------+----------------+--------------------+--------------+

And to be clear:

memsql-admin query --sql 'SELECT @@GLOBAL.redundancy_level' --host <master host> --password <password>
memsql-admin query --sql 'SELECT @@GLOBAL.redundancy_level' --host <leaf host> --password <password>
+---------------------------+
| @@GLOBAL.redundancy_level |
+---------------------------+
| 1                         |
+---------------------------+

Maybe this is the root cause?

memsql-admin add-leaf -y --host $leaf_ip --password $password
# > stderr: Node already has role Leaf

memsql-admin remove-leaf -y --host $leaf_ip
# > stderr: Node 10.0.39.177:3306 is not a leaf in the same cluster as the master aggregator

# Latest log on leaf node (referencing a now nonexistent IP)
966314676 2019-11-01 17:41:27.255  ERROR: Failure during heartbeat connecting to 10.0.45.3:3306

Also you asked for the master aggregator log so here it is:

02157778 2019-11-01 15:17:34.810   INFO: Completed recovery of user databases.
02159103 2019-11-01 15:17:34.812   INFO: Thread -1: SkynetInitInternal: Initializing Skynet.
02159145 2019-11-01 15:17:34.812   INFO: Consensus node started...
...
12256100 2019-11-01 15:17:44.909   INFO: Transitioning node to aggregator after setting identity
12256115 2019-11-01 15:17:44.909   INFO: Setting aggregator id to 1
12258688 2019-11-01 15:17:44.911   INFO: Thread 100000: SetIdentity: Dropping all connections after setting aggregator identity.
12258779 2019-11-01 15:17:44.911   INFO: Consensus state moving from NonVotingMember to Leader
12258850 2019-11-01 15:17:44.911   INFO: Used cluster capacity: 0 units, Maximum cluster capacity: 4 units
12260864 2019-11-01 15:17:44.913   INFO: Cluster Bootstrapped, node id=1
323500084 2019-11-01 15:22:56.153   INFO: Query information_schema.'SELECT ...'
3161514142 2019-11-01 16:10:14.167   INFO: Query information_schema.'SELECT @@basedir' submitted for asynchronous compilation
3161617934 2019-11-01 16:10:14.271   INFO: Query information_schema.'SELECT @@basedir' submitted 112 milliseconds ago, queued for 10 milliseconds, compiled asynchronously in 102 milliseconds
3359497920 2019-11-01 16:13:32.151   INFO: Query information_schema.'SELECT @@GLOBAL.redundancy_level' submitted for asynchronous compilation
3359602226 2019-11-01 16:13:32.255   INFO: Query information_schema.'SELECT @@GLOBAL.redundancy_level' submitted 112 milliseconds ago, queued for 10 milliseconds, compiled asynchronously in 102 milliseconds

Not sure what I’m missing here. Would love your thoughts! Let me know if you’d like me to Slack you the latest cluster report.

I think I’ve diagnosed the issue (solution as yet unknown). Looks like there’s a split brain / split cluster situation going on.

For reference, the master aggregator is running on 10.0.61.70, and the leaf is running on 10.0.39.177.

The leaf node is showing an unrelated IP (a killed AWS spot instance) as the leaf:

memsql-admin query --host $leaf_ip --password $password --sql "select * from information_schema.leaves"
+-------------+------+--------------------+-----------+-----------+--------+--------------------+---------------------------+---------+
|    HOST     | PORT | AVAILABILITY_GROUP | PAIR_HOST | PAIR_PORT | STATE  | OPENED_CONNECTIONS | AVERAGE_ROUNDTRIP_LATENCY | NODE_ID |
+-------------+------+--------------------+-----------+-----------+--------+--------------------+---------------------------+---------+
| 10.0.50.142 | 3306 | 1                  | NULL      | NULL      | online | 0                  | NULL                      | 2       |
+-------------+------+--------------------+-----------+-----------+--------+--------------------+---------------------------+---------+

However the master aggregator shows no leaves:

memsql-admin query --host $master_aggregator_ip --password $password --sql "select * from information_schema.leaves"
Empty result set

And the leaf shows yet another unrelated IP (a killed AWS spot instance) as the master aggregator:

memsql-admin query --host $leaf_ip --password $password --sql "select * from information_schema.aggregators"
+-----------+------+--------+--------------------+---------------------------+-------------------+--------+---------+
|   HOST    | PORT | STATE  | OPENED_CONNECTIONS | AVERAGE_ROUNDTRIP_LATENCY | MASTER_AGGREGATOR |  ROLE  | NODE_ID |
+-----------+------+--------+--------------------+---------------------------+-------------------+--------+---------+
| 10.0.45.3 | 3306 | online | 0                  | NULL                      | 1                 | Leader | 1       |
+-----------+------+--------+--------------------+---------------------------+-------------------+--------+---------+

While the master aggregator shows itself:

memsql-admin query --host $master_aggregator_ip --password $password --sql "select * from information_schema.aggregators"
+------------+------+--------+--------------------+---------------------------+-------------------+--------+---------+
|    HOST    | PORT | STATE  | OPENED_CONNECTIONS | AVERAGE_ROUNDTRIP_LATENCY | MASTER_AGGREGATOR |  ROLE  | NODE_ID |
+------------+------+--------+--------------------+---------------------------+-------------------+--------+---------+
| 10.0.61.70 | 3306 | online | 1                  | NULL                      | 1                 | Leader | 1       |
+------------+------+--------+--------------------+---------------------------+-------------------+--------+---------+

How do we put the split brain back together?

Made a last-ditch effort and used NAT on both nodes to map the old IPs to the new ones, to see whether that would help:

# Master destination forwarding (executed on both nodes)
sudo iptables -t nat -A OUTPUT -d 10.0.45.3 -j DNAT --to-destination 10.0.61.70
# Leaf destination forwarding (executed on both nodes)
sudo iptables -t nat -A OUTPUT -d 10.0.50.142 -j DNAT --to-destination 10.0.39.177
# Leaf source forwarding (on master node)
sudo iptables -t nat -A INPUT -p tcp -d 10.0.61.70 -s 10.0.39.177 -j SNAT --to-source 10.0.50.142
# Master source forwarding (on leaf node)
sudo iptables -t nat -A INPUT -p tcp -d 10.0.39.177 -s 10.0.61.70 -j SNAT --to-source 10.0.45.3
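(For anyone trying this at home: a quick way to check whether rules like these are actually matching traffic is to watch the per-rule packet counters and probe the rewritten address. These are standard iptables/netcat commands, nothing MemSQL-specific; the IPs are the ones from this thread.)

```shell
# List the NAT OUTPUT rules with per-rule packet/byte counters;
# nonzero counters mean the DNAT rules are matching traffic.
sudo iptables -t nat -L OUTPUT -n -v --line-numbers

# Probe the old master IP; with the DNAT rule in place, this
# connection should actually land on 10.0.61.70:3306.
nc -zv 10.0.45.3 3306
```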

Now master and leaf say the same thing about aggregators, which is good:

sudo memsqlctl query --password $password --sql "select * from information_schema.aggregators"
+-----------+------+--------+--------------------+---------------------------+-------------------+--------+---------+
|   HOST    | PORT | STATE  | OPENED_CONNECTIONS | AVERAGE_ROUNDTRIP_LATENCY | MASTER_AGGREGATOR |  ROLE  | NODE_ID |
+-----------+------+--------+--------------------+---------------------------+-------------------+--------+---------+
| 10.0.45.3 | 3306 | online | 0                  | NULL                      | 1                 | Leader | 1       |
+-----------+------+--------+--------------------+---------------------------+-------------------+--------+---------+

However they differ about leaves:

# On master
sudo memsqlctl query --password $password --sql "select * from information_schema.leaves"
Empty result set

# On leaf
sudo memsqlctl query --password $password --sql "select * from information_schema.leaves"
+-------------+------+--------------------+-----------+-----------+--------+--------------------+---------------------------+---------+
|    HOST     | PORT | AVAILABILITY_GROUP | PAIR_HOST | PAIR_PORT | STATE  | OPENED_CONNECTIONS | AVERAGE_ROUNDTRIP_LATENCY | NODE_ID |
+-------------+------+--------------------+-----------+-----------+--------+--------------------+---------------------------+---------+
| 10.0.50.142 | 3306 | 1                  | NULL      | NULL      | online | 0                  | NULL                      | 2       |
+-------------+------+--------------------+-----------+-----------+--------+--------------------+---------------------------+---------+

And, tellingly, the leaf says:

423497575 2019-11-02 16:43:24.161   INFO: Completed recovery of user databases.
...
733031514 2019-11-02 16:48:33.695  ERROR: Failure during heartbeat connecting to 10.0.45.3:3306

And the master says:

7942257491 2019-11-02 18:44:55.644  ERROR: ProcessHandshakeResponsePacket() failed. Sending back 1045: Access denied for user 'distributed'@'10.0.50.142' (using password: YES)

When I do

sudo memsqlctl query --sql "ADD LEAF root:'$password'@'10.0.50.142':3306" --password $password

I get

Error 1007: The node 10.0.50.142:3306 already has a cluster database and is already member of a cluster, or has a vestigial cluster database. Delete the data directory on the aggregator before joining it to a cluster

So, unless I’m missing something, it looks like reattaching the split brain is not possible. (Please let me know if I’m wrong @sgl-memsql!)

Tragically, we’re planning on scrapping our DB (while saving a copy of the /mnt/raid/memsql directory just in case), since the leaf node can’t come online for us to do a backup+restore: it can’t find a master aggregator it can successfully connect to. Luckily we’re an early-stage startup, so it’s not the worst thing that could happen, and I’m glad it happened now rather than later, but it’s still a loss.

To make sure this doesn’t happen again, we’re going to peg each node to a stable hostname rather than an IP (e.g. node-0.leaf.db.us-west-2.prod.deft.private), since private IPs are transient for us given that we use spot instances for cost reasons. It might be good to add that recommendation to the documentation so people don’t have to wade through the problem we ran into.

Hi Alex,

I can hopefully shed some light on this, based on what you shared:

The status that you see from memsql-admin refers to more than just whether the data has finished recovering; it’s a summary of the state of the entire leaf, so it won’t show Online until the leaf has rejoined the cluster. In this case, from what you said, the leaf had a different IP when it started than the one passed to ADD LEAF, so the master aggregator does not know how to connect to it, which is what started your current predicament.

I made a request to our docs team to mention that the addresses used when running bootstrap / add leaf / add aggregator should be stable addresses, to avoid issues like this in the future.

If you do encounter this kind of situation again, the solution is to run REMOVE LEAF on the master aggregator, followed by ADD LEAF with the new address.
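For future readers, that sequence would look roughly like the sketch below, run against the master aggregator. This is a sketch, not a recipe: the addresses and password are placeholders from this thread, and the FORCE clause (which I believe skips rebalancing when the old address is unreachable) should be checked against the REMOVE LEAF docs for your version.

```shell
# Forget the leaf under its old, dead address.
memsql-admin query --password $password \
  --sql "REMOVE LEAF '10.0.50.142':3306 FORCE"

# Re-add the same host under its new (ideally stable) address.
memsql-admin query --password $password \
  --sql "ADD LEAF root:'$password'@'10.0.39.177':3306"
```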

Since you cleaned out the master aggregator, however, that is no longer possible. The master aggregator is the central cluster metadata manager; if you wipe its state, you end up with an unusable cluster, unless you have a child aggregator you can run AGGREGATOR SET AS MASTER on.

At this point you effectively have two clusters running, which memsqlctl is not designed to represent: one cluster with just a master aggregator, and one cluster with a dead master aggregator and a leaf. Stitching two clusters together is something we block, to avoid merging incompatible state from different clusters.

The recommended course of action is what you have suggested: scrap the leaf node, configure it anew, add it to the cluster (being careful to use a stable address), and restore everything from backup. I would also recommend reconfiguring the master aggregator with a stable address during bootstrap aggregator; otherwise you might hit the same issue with the master.
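Concretely, the rebuild might look something like this. The hostnames are illustrative (borrowed from Alex's naming scheme earlier in the thread), and the exact flags vary by Toolbox version, so treat it as an outline rather than a copy-paste recipe:

```shell
# 1. Bootstrap the master aggregator under a stable DNS name.
memsql-admin bootstrap-aggregator -y \
  --host master.db.us-west-2.prod.deft.private --password $password

# 2. Register and start a fresh leaf node.
memsql-admin register-node -y \
  --host node-0.leaf.db.us-west-2.prod.deft.private
memsql-admin start-node -y --memsql-id <leaf-id>

# 3. Add the leaf to the cluster by the same stable name.
memsql-admin add-leaf -y \
  --host node-0.leaf.db.us-west-2.prod.deft.private --password $password
```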

In the future, if you need to change the address of a node, you need to remove it from the cluster and re-add it. Make sure you always have an available master aggregator; otherwise you could end up losing the database.


Hey @rodrigo! Thank you so much for taking the time to bring your insights to bear on the situation. They really helped in confirming our findings and in refining our planned course of action and future remediation. I appreciate you making that request to the docs team!