2 Server Cluster Config: External Nodes Stay in an Offline State After Restart - SSH Issue?

Hey all,

I’m currently running MemSQL in a 2-server setup on the same private network: 2 hosts and 4 nodes total (1 host and 2 nodes per server). Both servers are running CentOS 7.x:

  • Host 1 (Local Host 192.168.1.3): Master & Leaf
  • Host 2 (192.168.1.5): Aggregator & Leaf

I’m able to configure each node successfully and everything works fine until I restart either the aggregator or the second leaf node on Server 2 from Server 1. Once restarted, those nodes stay in a permanent offline state according to MemSQL Studio.

Because these nodes exist on another server on the network, I’m unsure if there’s a missed handshake due to an improper SSH configuration, or a firewall config preventing those nodes on Server 2 from fully appearing online to Server 1. My SSH keys are saved at /root/.ssh/id_rsa on both servers as the ‘root’ user, and that path was passed to the ‘--identity-file’ flag during the installation process.

I’d temporarily disabled the firewall across both servers, and there’s still no luck with these offline nodes. I am, however, able to ping each server and verify that TCP ports 22, 443, 3306, 3307, 3308, 3309, 3310, and 8080 are open and accessible via ‘nc -zv’. IP routes between the servers exist. The only other thing that stands out is that I’m defining a custom ‘--base-install-dir’ for each node during the ‘memsql-admin create-node’ process, while giving the user ‘memsql’ full access to those directories.

In summary, I know I at least have a working configuration when creating the nodes and assigning aggregators / leaves to them, since I can see a fully online cluster. The problem comes when I need to restart those nodes for any reason; they refuse to fully come back online.

Below are the steps taken during the installation process, from SSH key generation through firewall config, and finally the MemSQL installation steps. Please let me know if I missed any steps needed for a successful 2-server cluster configuration on the same local network. Thanks.

SSH Key Generation for Server 1, as ‘root’:

ssh-keygen
ssh-copy-id root@192.168.1.5

SSH Key Generation for Server 2, as ‘root’:

ssh-keygen
ssh-copy-id root@192.168.1.3
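
To sanity-check the key setup before moving on (just a verification step on my side, not from the docs), a non-interactive login from each server to the other should succeed without a password prompt:

ssh -i /root/.ssh/id_rsa root@192.168.1.5 hostname    # from Server 1
ssh -i /root/.ssh/id_rsa root@192.168.1.3 hostname    # from Server 2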

Firewall config changes on both servers:

firewall-cmd --add-port=22/tcp --permanent
firewall-cmd --add-port=443/tcp --permanent
firewall-cmd --add-port=3306/tcp --permanent
firewall-cmd --add-port=3307/tcp --permanent
firewall-cmd --add-port=3308/tcp --permanent
firewall-cmd --add-port=3309/tcp --permanent
firewall-cmd --add-port=3310/tcp --permanent
firewall-cmd --add-port=8080/tcp --permanent
firewall-cmd --reload
firewall-cmd --list-all
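
As a quick check that the rules took effect, the same ports can be probed with ‘nc -zv’ from the opposite server (adjust the target IP per server), for example:

for p in 22 443 3306 3307 3308 3309 3310 8080; do nc -zv 192.168.1.5 $p; done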

Folders created on both servers, giving user ‘memsql’ full access:

mkdir /memsql_base_dir
chown memsql:memsql /memsql_base_dir
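
A quick write test as the ‘memsql’ user can confirm the permissions (just a sanity check; it assumes the ‘memsql’ user already exists):

sudo -u memsql touch /memsql_base_dir/.write_test
rm /memsql_base_dir/.write_test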

Installation Steps:

sudo yum-config-manager --add-repo https://release.memsql.com/production/rpm/x86_64/repodata/memsql.repo
sudo yum repolist
rpm -q which
sudo yum install -y memsql-toolbox memsql-client memsql-studio
memsql-toolbox-config register-host --localhost --host 192.168.1.3
memsql-toolbox-config register-host --identity-file /root/.ssh/id_rsa --host 192.168.1.5
memsql-deploy install --all
memsql-admin create-node --host 192.168.1.3 --password "" -P 3307 --base-install-dir /memsql_base_dir/master
memsql-admin create-node --host 192.168.1.3 --password "" -P 3308 --base-install-dir /memsql_base_dir/leaf
memsql-admin create-node --host 192.168.1.5 --password "" -P 3309 --base-install-dir /memsql_base_dir/aggregator
memsql-admin create-node --host 192.168.1.5 --password "" -P 3310 --base-install-dir /memsql_base_dir/leaf
memsql-admin bootstrap-aggregator --memsql-id <MemSQL_ID> --license <license>
memsql-admin add-aggregator --memsql-id <MemSQL_ID>
memsql-admin add-leaf --memsql-id <MemSQL_ID>
memsql-admin add-leaf --memsql-id <MemSQL_ID>
memsql-admin optimize
memsql-admin restart-node --all
sudo systemctl enable memsql-studio
sudo systemctl start memsql-studio
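
At this point, listing the registered hosts and nodes is a quick way to confirm everything registered as expected (these are the same commands I use further down in this thread):

memsql-toolbox-config list-hosts
memsql-admin list-nodes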

Why do you suspect ssh configuration is an issue? Did you get any errors or warnings about that?

And how exactly did you restart the nodes?

Did you try stopping everything and then starting everything to see if everything would come up healthy?

What does memsql-admin list-nodes say?

Hanson, I don’t know for certain whether the SSH config is the issue, but I suspected it could be since I was able to fully verify routes between all servers, as well as disable the firewall, yet still had this connectivity issue. Of course, I could be completely wrong.

I’m restarting the nodes using the following command:

memsql-admin restart-node --memsql-id <MemSQL_ID>


The only reason I noticed the connectivity issue was that I was originally attempting to resize my nodes to a maximum memory of 16 GB. Below is the command I used:

memsql-admin update-config --key maximum_memory --value 16384 --memsql-id <MemSQL_ID>


I had another terminal window logged into mysql on the master node’s server, to verify whether the max memory change was applied properly. After restarting, the node reported a NULL value for max memory as well as an offline state. Below are the queries I ran in mysql and the resulting table:

use information_schema
select IP_ADDR, STATE, MAX_MEMORY_MB from mv_nodes;

+--------------+---------+---------------+
| IP_ADDR      | STATE   | MAX_MEMORY_MB |
+--------------+---------+---------------+
| 192.168.1.5  | offline |          NULL |
| 127.0.0.1    | online  |         16384 |
+--------------+---------+---------------+


From there, I was uncertain whether the problem was the memory resizing or something else. I then decided to remove all nodes, reinstall just two nodes (1 master on Server 1 and 1 aggregator on Server 2), and restart the node on Server 2 without making any config changes. I received the same offline state.


Running:

memsql-admin list-nodes

Returns:

+------------+------------+-------------+------+---------------+--------------+---------+----------------+--------------------+--------------+
| MemSQL ID  | Role       | Host        | Port | Process State | Connectable? | Version | Recovery State | Availability Group | Bind Address |
+------------+------------+-------------+------+---------------+--------------+---------+----------------+--------------------+--------------+
| 327B22A875 | Master     | 127.0.0.1   | 3307 | Running       | True         | 7.0.15  | Online         |                    | 0.0.0.0      |
| 9F475F10B3 | Aggregator | 192.168.1.5 | 3308 | Running       | True         | 7.0.15  | Online         |                    | 0.0.0.0      |
+------------+------------+-------------+------+---------------+--------------+---------+----------------+--------------------+--------------+

Yet I still see the same offline status in MemSQL Studio.

Can you connect to your new master and run queries?

What does

select * from information_schema.mv_nodes;

say?

Here’s what I’ve done.

I’ve reinstalled a fresh 3-node setup: Master on Server 1, Aggregator + 1 Leaf on Server 2, without restarting any nodes. Everything appears online, and I’m able to connect via mysql to run queries, including creating a new database.


select * from information_schema.mv_nodes;

| ID | IP_ADDR | PORT | TYPE | STATE | AVAILABILITY_GROUP | NUM_CPUS | MAX_MEMORY_MB | MAX_TABLE_MEMORY_MB | MEMORY_USED_MB | TABLE_MEMORY_USED_MB | TOTAL_DATA_DISK_MB | AVAILABLE_DATA_DISK_MB | UPTIME |
| 10 | 192.168.3.21 | 3309 | LEAF | online | 1 | 24 | 118484 | 106635 | 432 | 51 | 563086 | 529825 | 363 |
| 9 | 192.168.2.21 | 3308 | CA | online | NULL | 24 | 118484 | 106635 | 365 | 25 | 563086 | 529825 | 404 |
| 1 | 127.0.0.1 | 3307 | MA | online | NULL | 24 | 16384 | 14745 | 386 | 27 | 12488913 | 3645256 | 107724 |


If I reboot the aggregator node with ‘memsql-admin restart-node’, I am still able to run queries / create a new database. Here’s what the ‘mv_nodes’ query returns:

| ID | IP_ADDR | PORT | TYPE | STATE | AVAILABILITY_GROUP | NUM_CPUS | MAX_MEMORY_MB | MAX_TABLE_MEMORY_MB | MEMORY_USED_MB | TABLE_MEMORY_USED_MB | TOTAL_DATA_DISK_MB | AVAILABLE_DATA_DISK_MB | UPTIME |
| 10 | 192.168.2.21 | 3309 | LEAF | online | 1 | 24 | 118484 | 106635 | 492 | 79 | 563086 | 525598 | 843 |
| 9 | 192.168.2.21 | 3308 | CA | offline | NULL | NULL | NULL | NULL | NULL | NULL | 140446977353872 | 1112 | NULL |
| 1 | 127.0.0.1 | 3307 | MA | online | NULL | 24 | 16384 | 14745 | 393 | 30 | 12488913 | 3643623 | 108203 |


If I reboot the leaf node, I am no longer able to run queries / create a new database; here’s the error I receive:

ERROR 1748 (HY000): You must add at least one leaf to the system to run CREATE DATABASE


Here’s what the ‘mv_nodes’ query returns:

| ID | IP_ADDR | PORT | TYPE | STATE | AVAILABILITY_GROUP | NUM_CPUS | MAX_MEMORY_MB | MAX_TABLE_MEMORY_MB | MEMORY_USED_MB | TABLE_MEMORY_USED_MB | TOTAL_DATA_DISK_MB | AVAILABLE_DATA_DISK_MB | UPTIME |
| 10 | 192.168.2.21 | 3309 | LEAF | offline | 1 | NULL | NULL | NULL | NULL | NULL | 0 | 4294967298 | NULL |
| 9 | 192.168.2.21 | 3308 | CA | offline | NULL | NULL | NULL | NULL | NULL | NULL | 0 | 4294967298 | NULL |
| 1 | 127.0.0.1 | 3307 | MA | online | NULL | 24 | 16384 | 14745 | 395 | 30 | 12488913 | 3644731 | 108455 |

If you do memsql-admin restart-node --all then do the nodes come back online?

I’m reaching the limits of my knowledge on this, so I’ll wait for others to chime in if they can.

One thing that is mildly concerning to me is that the MA is at 127.0.0.1 and not an actual network IP address. Maybe you didn’t specify the IP in one of the steps when you added it? But I don’t know whether that is a problem or not.
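
It might also be worth comparing what the master itself reports, by running the cluster metadata commands from a SQL prompt on the master, something like:

SHOW AGGREGATORS;
SHOW LEAVES;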

I think I may’ve installed the master node with:

memsql-admin create-node --host 127.0.0.1 --password "" -P 3307 --base-install-dir /memsql_base_dir/master


The primary host was also installed with:

memsql-toolbox-config register-host --localhost --host 127.0.0.1


I can try to redo everything while including the actual server IP for the primary host & master node, and see if that resolves anything. Will report back with the results.
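
In other words, the plan is to register the primary host and create the master node with the server’s real LAN IP rather than 127.0.0.1, roughly along these lines (<Server_1_IP> being a placeholder for the actual address):

memsql-toolbox-config register-host --localhost --host <Server_1_IP>
memsql-admin create-node --host <Server_1_IP> --password "" -P 3307 --base-install-dir /memsql_base_dir/master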

Running ‘restart-node --all’ did not seem to help anything.

memsql-admin restart-node --all
Toolbox is about to perform the following actions:
· Restart all nodes in the cluster

Would you like to continue? [y/N]: y
The target user user on host 127.0.0.1 does not have the privileges to perform this action.
Please enter your password to proceed with sudo. (For details, see SingleStore Tools Issues · SingleStore Documentation).
sudo password for user@127.0.0.1:
✓ Successfully connected to host 127.0.0.1
sudo password for user@192.168.2.21:
✓ Successfully connected to host 192.168.2.21
sudo password for user@192.168.2.31:
✓ Successfully connected to host 192.168.2.31
✓ Stopped Master node on 127.0.0.1 (1/1)
✓ Successfully stopped Master node on 1 host
✓ Stopped Master node
✓ Stopped Aggregator nodes on 192.168.2.21 (1/1)
✓ Successfully stopped Aggregator nodes on 1 host
✓ Stopped Aggregator node
✓ Stopped Leaf nodes on 192.168.2.21 (1/1)
✓ Successfully stopped Leaf nodes on 1 host
✓ Stopped Leaf node
✓ Started Leaf nodes on 192.168.2.21 (1/1)
✓ Successfully started Leaf nodes on 1 host
✓ Successfully connected to Leaf node
✓ Started Aggregator nodes on 192.168.2.21 (1/1)
✓ Successfully started Aggregator nodes on 1 host
✓ Successfully connected to Aggregator node
✓ Started Master node on 127.0.0.1 (1/1)
✓ Successfully started Master node on 1 host
✓ Successfully connected to Master node
Operation completed successfully

Looks like replacing the localhost IP with the server’s actual 192.x address did the trick! I no longer receive an error when rebooting any of the external nodes.


memsql-toolbox-config list-hosts

+--------------+------------+--------------+-------------------+
| Host         | Local Host | SSH address  | Identity File     |
+--------------+------------+--------------+-------------------+
| 192.168.2.11 | Yes        |              |                   |
| 192.168.2.21 | No         | 192.168.2.21 | /root/.ssh/id_rsa |
| 192.168.2.31 | No         | 192.168.2.31 | /root/.ssh/id_rsa |
+--------------+------------+--------------+-------------------+


memsql-admin list-nodes

+------------+------------+--------------+------+---------------+--------------+---------+----------------+--------------------+--------------+
| MemSQL ID  | Role       | Host         | Port | Process State | Connectable? | Version | Recovery State | Availability Group | Bind Address |
+------------+------------+--------------+------+---------------+--------------+---------+----------------+--------------------+--------------+
| 8B0C23AE8F | Master     | 192.168.2.11 | 3307 | Running       | True         | 7.0.15  | Online         |                    | 0.0.0.0      |
| 5EB34D5011 | Aggregator | 192.168.2.21 | 3308 | Running       | True         | 7.0.15  | Online         |                    | 0.0.0.0      |
| 7D89A20929 | Leaf       | 192.168.2.21 | 3309 | Running       | True         | 7.0.15  | Online         | 1                  | 0.0.0.0      |
+------------+------------+--------------+------+---------------+--------------+---------+----------------+--------------------+--------------+

Now, one last thing: it looks like I’m able to successfully resize each of the nodes to the following config:

Master - 16GB
Aggregator - 16GB
Leaf - 48GB

select IP_ADDR, STATE, MAX_MEMORY_MB from mv_nodes;

+--------------+--------+---------------+
| IP_ADDR      | STATE  | MAX_MEMORY_MB |
+--------------+--------+---------------+
| 192.168.2.21 | online |         49152 |
| 192.168.2.21 | online |         16384 |
| 192.168.2.11 | online |         16384 |
+--------------+--------+---------------+
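
For reference, these resizes follow the same ‘update-config’ pattern shown earlier in the thread, one per node (values in MB; the MemSQL IDs below are placeholders for the actual node IDs):

memsql-admin update-config --key maximum_memory --value 16384 --memsql-id <Master_ID>
memsql-admin update-config --key maximum_memory --value 16384 --memsql-id <Aggregator_ID>
memsql-admin update-config --key maximum_memory --value 49152 --memsql-id <Leaf_ID>

(As earlier in the thread, a node restart may be needed before the new values show up in mv_nodes.)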

I’d like to add a 2nd leaf on my 3rd server, using the remaining RAM under the free 128 GB license; however, I’m unable to, since MemSQL states that would use a total of 8 license units out of the 4 provided.

How is this possible when the maximum RAM of the other nodes falls under 128 GB?

memsql-admin add-leaf --memsql-id DE426A7EA9
Toolbox will perform the following actions on host 192.168.2.11:
· Run ‘memsqlctl add-leaf --host 192.168.2.31 --port 3310 --user root --password ●●●●●●’

Would you like to continue? [y/N]: y
✘ Failed to run ‘memsqlctl add-leaf’
error running memsqlctl: error running command: "/usr/bin/memsqlctl" "--json" "--yes" "add-leaf" "--host" "192.168.2.31" "--port" "3310" "--user" "root" "--secure-password" "<license>: exit status 1
stderr: Error 2374: Leaf node could not be added because you are using the MemSQL free license which has a limit of 4 license units and after adding the node you would be using 8 license units. To purchase more capacity, contact team@memsql.com. See https://docs.memsql.com/redir/capacity-limit-error for instructions on how to create a cluster with up to 4 license units.
Could not add leaf to cluster. Verify network connectivity between the node at 192.168.2.31:3310 and the master aggregator.
Please ensure that you have entered the correct password for the node.
If the node does not yet exist, run ‘memsql-admin create-node’ and then retry this command.

Hi ymonye!
First off, awesome job setting up your cluster and working through this! You hit the nail on the head: the problem is that you had defined the master with its localhost IP address. I am going to file a task to see if we can raise an error when we detect that this has happened, since it will probably happen to someone else. Easy mistake to make!

Regarding your second question, I fear the issue is related to how we calculate license units. The gist of the issue is that we calculate license units based on the resources of the host machine, not the resources allocated to an individual leaf node. So even though you constrained the maximum memory on the node, the license calculation will ignore that constraint. You can learn more about how our unit-based licenses work here: SingleStoreDB Cloud · SingleStore Documentation

I’m sorry to say that the best workaround here is to run on smaller machines, such that the total RAM/cores available across the hosts running leaf nodes stays at or under 128 GB / 32 cores.
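
To put rough numbers on it: the free license’s 128 GB / 32 core ceiling across 4 units works out to about 32 GB of RAM and 8 vCPUs per unit, and units are counted against the leaf host’s resources, rounded up. Your leaf host reports 24 cores and what looks like roughly 128 GB of physical RAM, so it already counts as 4 units on its own, even with the node capped at 48 GB; adding a second leaf host of similar size would bring the total to 8 units, which is what the error message reports. (These are approximate figures based on the output above.)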

On the plus side, we only license based on leaf nodes so that may give you a bit more flexibility to have a larger cluster than you were planning.

Please let us know if you have any other questions! Thanks for using MemSQL!

Thanks Carl,

All is well, I have a meeting with the product team to see if additional licenses make sense.

Cheers
