We provisioned a MemSQL 6.7.14 cluster using the MemSQL-provided AWS CloudFormation template, with eight r4.2xlarge leaf instances (configured for HA) and two m4.large instances acting as aggregators. Without any significant increase in data in the cluster or any configuration change, we recently started seeing leaf nodes go offline sporadically (every few hours) after a spurt of messages like the following in their tracelogs:
11150142746 2019-07-08 00:24:09.581 WARN: Failed to allocate 8388608 bytes of memory from the operating system (Error 12: Cannot allocate memory). This is usually due to a miss-configured operating system or virtualization technology. See https://docs.memsql.com/troubleshooting/latest/memory-errors.
The messages are sometimes followed by log lines like the following (some kind of crash information?):
query: _REPL 0 3002 -1 16 1 0 0
query: _REPL 0 4000 -1 16 1 0 0
Then the memsqld process appears to restart:
34 2019-07-08 23:22:39.922 INFO: Log opened
01737143 2019-07-08 23:22:41.659 INFO: Initializing OpenSSL
01738123 2019-07-08 23:22:41.660 INFO: MemSQL version hash: fa416b0a536adcfcf95d0607be2d6086a0d58796 (Mon Mar 4 15:00:38 2019 -0500)
...
The memory allocation errors usually start to appear amid other log messages about replication, though not always.
According to our Datadog monitoring, the usable RAM on the host remains high at the time of the error (~24 GiB total: 7 GiB immediately free, plus another 17 GiB reclaimable from the Linux page cache). Likewise, querying the information_schema.mv_nodes table consistently shows plenty of headroom between the memory used and the maximum memory available for all nodes, e.g. for one affected node:
MAX_MEMORY_MB = 55261
MEMORY_USED_MB = 30751
MAX_TABLE_MEMORY_MB = 49734
TABLE_MEMORY_USED_MB = 24540
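For reference, we pull those numbers with a stock MySQL client against the master aggregator (MemSQL speaks the MySQL wire protocol), roughly like this; the host and credentials below are placeholders:

# Run against the master aggregator; host/user are placeholders.
mysql -h <master-aggregator> -P 3306 -u root -p \
  -e "SELECT * FROM information_schema.mv_nodes\G"

The four figures above are the MAX_MEMORY_MB, MEMORY_USED_MB, MAX_TABLE_MEMORY_MB, and TABLE_MEMORY_USED_MB columns from that output.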
The error message makes it sound like the CloudFormation template may have missed some OS-level configuration at install time. I've spot-checked the sysctl, hugepage, etc. settings mentioned in https://docs.memsql.com/installation/v6.7/system-requirements/#configure-linux-vm-settings on a few of the impacted hosts and didn't see anything out of line.
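Concretely, the spot check on each host looked roughly like the following (the sysctl keys listed are the ones I recall the linked doc calling out, so treat them as approximate; the recommended values are in the doc itself):

# kernel VM settings from the MemSQL system-requirements page
sysctl vm.max_map_count vm.min_free_kbytes vm.overcommit_memory vm.overcommit_ratio

# transparent hugepages should be disabled ("never" selected in brackets)
cat /sys/kernel/mm/transparent_hugepage/enabled
cat /sys/kernel/mm/transparent_hugepage/defrag

# open-file limit for the user running memsqld
ulimit -n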
Has anyone else experienced a similar problem running MemSQL on AWS? Is it possible the OS really is misconfigured, as the memory warning message suggests, or is that statement a catch-all and a red herring?