Problem with socket ETIMEDOUT

Dear MemSQL team,
we have a 6.5.19 cluster with 6 aggregators and 20 leaves, each with 64 GB RAM. We added 5 new leaves because we were running out of memory. After adding the new leaves to the cluster through the memsql-ops UI, I manually copied some partitions from the existing 20 leaves to the new ones: immediately after COPY PARTITION, I promoted the new partition on the new leaf and dropped the old one (a sketch of the commands is shown after the log excerpt below).

Normally every leaf has about 250 partitions, while the new ones sometimes have as few as 50. What happens is that a new leaf (any of the new ones) becomes unresponsive: memsql-ops reports it in an unknown state and I cannot even ping it. The only thing I can do is a hard reset of the whole server. This is what is in the logs:
12618508212 2020-04-01 12:38:10.704 WARN: socket (908) ETIMEDOUT in poll
12621514187 2020-04-01 12:38:13.710 WARN: socket (2973) ETIMEDOUT in poll
12621514195 2020-04-01 12:38:13.710 WARN: socket (1789) ETIMEDOUT in poll
12624520143 2020-04-01 12:38:16.716 WARN: socket (1803) ETIMEDOUT in poll
12624520213 2020-04-01 12:38:16.716 WARN: socket (2477) ETIMEDOUT in poll
12625025963 2020-04-01 12:38:17.222 WARN: socket (1342) ETIMEDOUT in poll
12627526315 2020-04-01 12:38:19.722 WARN: socket (4140) ETIMEDOUT in send
12627526343 2020-04-01 12:38:19.722 WARN: socket (1401) ETIMEDOUT in recv
12627526509 2020-04-01 12:38:19.723 WARN: socket (1849) ETIMEDOUT in recv
12627526519 2020-04-01 12:38:19.723 WARN: socket (1383) ETIMEDOUT in recv
12627526528 2020-04-01 12:38:19.723 WARN: socket (2500) ETIMEDOUT in recv
12627526538 2020-04-01 12:38:19.723 WARN: socket (2053) ETIMEDOUT in recv
12627526547 2020-04-01 12:38:19.723 WARN: socket (1878) ETIMEDOUT in recv
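
For reference, the manual move was done with commands roughly like the following, all run on the master aggregator (from memory, so the exact syntax may not be verbatim; the database name, partition ordinal and host names are placeholders, not the real ones):

SHOW PARTITIONS ON mydb;                            -- see where partition 12 of "mydb" currently lives
COPY PARTITION mydb:12 TO 'new-leaf-host':3306;     -- create a copy of the partition on the new leaf
PROMOTE PARTITION mydb:12 ON 'new-leaf-host':3306;  -- make the copy the master partition
DROP PARTITION mydb:12 ON 'old-leaf-host':3306;     -- remove the old instance from the original leaf
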

Sometimes it happens shortly after copying, promoting and dropping the partitions, and sometimes hours later. When I reset the server and attach that leaf again, everything runs fine for a few hours, then it happens again and again and again, until I copy the partitions back from the new leaves.

The new leaves have exactly the same configuration as the other leaves. I have tried changing timeouts and sysctl variables, but nothing really helps. MemSQL just floods all the connection sockets, making the whole server unreachable.

What can we try, and why is it happening only on the new servers? Any help would be really appreciated. Thank you!

The problem still persists. I have tried changing sysctl variables such as:
vm.max_map_count = 1000000000
vm.min_free_kbytes = 500000
fs.file-max = 10000000
fs.nr_open = 10000000
net.ipv4.tcp_keepalive_time = 200
net.ipv4.tcp_keepalive_intvl = 200
net.ipv4.tcp_keepalive_probes = 5
net.ipv4.tcp_max_syn_backlog = 100000
net.core.netdev_max_backlog = 100000
net.core.somaxconn = 65534
net.ipv4.ip_local_port_range = 1024 64999
net.ipv4.tcp_fin_timeout = 15
net.ipv4.tcp_tw_reuse = 0 (also tried 1)
net.ipv4.tcp_tw_recycle = 0 (also tried 1)
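
For the record, the values above go into a sysctl drop-in roughly like this (the file name is arbitrary):

# values listed above saved in /etc/sysctl.d/99-memsql-tuning.conf, then loaded with:
sysctl -p /etc/sysctl.d/99-memsql-tuning.conf
# verify what the running kernel actually uses:
sysctl net.ipv4.tcp_keepalive_time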

I have also tried other values, but it clearly has no impact on the problem. Is it a bug in MemSQL?

Hi Tomas,

You shouldn't need to manually copy and promote partitions; you can run REBALANCE PARTITIONS to do that for you.
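For example, on the master aggregator (the database name here is only a placeholder):

EXPLAIN REBALANCE PARTITIONS ON mydb;   -- preview the operations the rebalancer would run
REBALANCE PARTITIONS ON mydb;           -- copy, promote and drop partitions automatically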

We likely need to see a cluster report to get more details about what is going on. Are the ulimits around file descriptors set as well (the “increase file descriptor …” part of SingleStoreDB Cloud · SingleStore Documentation)? If you have access to MemSQL support, I would open a ticket with a cluster report.
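A quick way to check (the numbers below are only an illustration; the recommended values are on that documentation page):

ulimit -n                              # open-file limit for the current shell
cat /proc/$(pgrep -o memsqld)/limits   # limits a running memsqld process actually has
# persistent settings go in /etc/security/limits.conf, e.g.:
#   memsql soft nofile 1024000
#   memsql hard nofile 1024000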

-Adam

Thank you for your response, Adam.

I had the file descriptor limits set properly. I ran REBALANCE PARTITIONS on one database yesterday and got almost 500k "WARN: socket ETIMEDOUT" messages in the logs, but after that none of the new nodes crashed. I will run REBALANCE on more databases today and see what happens.

Thank you.