Partition has no master instance


#1

Every few minutes, MemSQL seems to crash. When I try to run a query, it says:

ERROR 1735 (HY000): Unable to connect to leaf @127.0.0.1:3307 with user distributed, using password YES: [2004] Error reading from socket. Errno=104 (Connection reset by peer)

If I then try running another query, it says:

ERROR 1777 (HY000): Partition records:0 has no master instance.

After a few minutes, it comes back and starts working, but then it goes back down and gives the same error.

It seemed similar to the “mysql has gone away” error, so I tried increasing the max allowed packet size to 600 MB, but no luck. There are around 40 connections to the server, which has over 100 GB of RAM and 24 cores. I am not seeing anything in the memsql.log files that indicates what's wrong.

This is on version 6.7. Any idea what might be causing this, or how to debug it?


#2

Here is a discussion of error 1735: https://docs.memsql.com/troubleshooting/latest/error-codes/#error-1735-unable-to-connect-timed-out-reading-from-socket


#3

Thanks Hanson, that definitely got me much closer! I tried the following part of the documentation:

One way to verify connectivity is to run the command FILL CONNECTION POOLS on all MemSQL nodes. If this fails with the same error, then a node is unable to connect to another node.

Upon running “FILL CONNECTION POOLS”, I did indeed get an error:

ERROR 1735 (HY000): Unable to connect to leaf @127.0.0.1:3307 with user distributed, using password YES: [2004] Cannot connect to '127.0.0.1':3307. Errno=99 (Cannot assign requested address)

Both nodes are on the same machine and were set up using the “cluster in a box” automated installation. I checked the UFW rules and nothing is being blocked (I even tried disabling UFW). Also, a user is set up to be able to connect from any IP, and the bind-address for both nodes is set to 0.0.0.0. Interestingly, the server is working and I can run all queries, but when I run memsqlctl list-nodes it says the leaf node is False for connectable and Unknown for recovery state.

[Screenshot: nodes]

I restarted the node, but after running “FILL CONNECTION POOLS” it went back to looking like that screenshot.
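
A plain TCP probe can complement FILL CONNECTION POOLS by showing whether each node's port accepts connections at all. Below is a minimal Python sketch, not an official MemSQL tool. It assumes the leaf listens on 3307 (taken from the error above) and the aggregator on 3306 (an assumption, not stated in this thread). probe_tcp is a hypothetical helper name.

```python
import errno
import socket


def probe_tcp(host, port, timeout=2.0):
    """Attempt one TCP connection and return (ok, detail)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, "connected"
    except OSError as exc:
        # Timeouts carry no errno, so fall back to a generic label.
        name = errno.errorcode.get(exc.errno, "unknown") if exc.errno else "unknown"
        return False, f"{name}: {exc}"


if __name__ == "__main__":
    # 3307 is the leaf port from the error message; 3306 is an assumed aggregator port.
    for port in (3306, 3307):
        ok, detail = probe_tcp("127.0.0.1", port)
        print(f"127.0.0.1:{port} -> {'ok' if ok else 'FAILED'} ({detail})")
```

If the probe reports EADDRNOTAVAIL (errno 99, as in the error above), that usually points at the connecting side, for example no free local port, rather than at a firewall rule. ECONNREFUSED would instead suggest that nothing is listening on the port.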

Does the documentation go into more depth on what to do if the nodes can’t communicate with each other?

UPDATE:

I fixed the error with FILL CONNECTION POOLS by going into each node’s memsql.cnf file and commenting out the socket parameter so the nodes use TCP. I’m running into a different error now: if I try to sort by a column on the table, I get “ERROR 1777 (HY000): Partition records:2 has no master instance.” The table has around 17 million rows. The query uses the shard key in the WHERE clause, so it’s running on around 8 million of those records. The server has over 100 GB of RAM. My hunch is that I have something misconfigured.
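
For anyone repeating the memsql.cnf change on several nodes, the edit can be sketched in Python. This is a rough illustration that assumes memsql.cnf uses MySQL-style "key = value" lines where a leading "#" marks a comment. The function name comment_out_option and the sample socket path are hypothetical, and the sketch works on text only, so reading and writing the real file is left to the caller.

```python
def comment_out_option(cnf_text, option="socket"):
    """Return cnf_text with any active `option` line commented out."""
    out = []
    for line in cnf_text.splitlines():
        key = line.split("=", 1)[0].strip()
        already_commented = line.lstrip().startswith(("#", ";"))
        if key == option and not already_commented:
            line = "# " + line
        out.append(line)
    return "\n".join(out) + "\n"


# Hypothetical file contents; the real path and values will differ per node.
sample = "[server]\nport = 3307\nsocket = /tmp/example.sock\n"
print(comment_out_option(sample))
```

Running the function twice leaves the text unchanged, so it is safe to apply to a file that has already been edited. The node presumably needs a restart before the change takes effect.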


#4

Please generate a cluster report and send it to us at bug-report@memsql.com so we can check what is going on in more detail (https://docs.memsql.com/operational-manual/v6.7/generating-a-cluster-report/).