Failed to allocate memory, "miss-configured operating system or VM"


#1

We provisioned a MemSQL 6.7.14 cluster using the MemSQL-provided AWS CloudFormation template, with eight r4.2xlarge leaf instances (configured for HA) and two m4.large instances acting as aggregators. Without any significant increase in the data in the cluster or any configuration change, we recently started seeing leaf nodes go offline sporadically (every few hours) after a spurt of messages in their tracelogs like the following:

11150142746 2019-07-08 00:24:09.581   WARN: Failed to allocate 8388608 bytes of memory from the
 operating system (Error 12: Cannot allocate memory). This is usually due to a miss-configured operating
 system or virtualization technology. See https://docs.memsql.com/troubleshooting/latest/memory-errors.

The messages are sometimes followed by log lines like the following (some kind of crash information?):

query: _REPL 0 3002 -1 16 1 0 0
query: _REPL 0 4000 -1 16 1 0 0

Then the memsqld process appears to restart:

34 2019-07-08 23:22:39.922 INFO: Log opened
01737143 2019-07-08 23:22:41.659   INFO: Initializing OpenSSL
01738123 2019-07-08 23:22:41.660   INFO: MemSQL version hash: fa416b0a536adcfcf95d0607be2d6086a0d58796 (Mon Mar 4 15:00:38 2019 -0500)
...

The memory allocation error usually starts to appear amidst other log messages about replication, but not always.

According to our DataDog monitoring, usable RAM on the host remains high at the time of the error (~24 GiB in total: ~7 GiB immediately free plus another ~17 GiB reclaimable from the Linux page cache). Likewise, querying the information_schema.mv_nodes table consistently shows plenty of headroom between memory used and the maximum memory available for every node, e.g. for one affected node:

MAX_MEMORY_MB        = 55261
MEMORY_USED_MB       = 30751
MAX_TABLE_MEMORY_MB  = 49734
TABLE_MEMORY_USED_MB = 24540
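
For reference, this is roughly the query behind those numbers. A sketch only: run it from an aggregator with any MySQL-compatible client; the host and credentials below are placeholders.

# Placeholder host/credentials; adjust for your cluster.
mysql -h <aggregator-host> -P 3306 -u root -p --table -e "
  SELECT MAX_MEMORY_MB, MEMORY_USED_MB, MAX_TABLE_MEMORY_MB, TABLE_MEMORY_USED_MB
  FROM information_schema.MV_NODES;"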

The error message makes it sound like the CloudFormation template may have missed some configuration at the OS level at the time of install. I’ve spot-checked the sysctl, hugepage, etc. settings mentioned in https://docs.memsql.com/installation/v6.7/system-requirements/#configure-linux-vm-settings on a few of the impacted hosts and didn’t see anything mismatched.
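
In case it helps anyone else, the spot-check amounted to roughly the following (a sketch, not an exhaustive list; verify the exact settings and recommended values against the linked system-requirements page):

# A few of the kernel settings typically covered in the linked docs.
sysctl vm.max_map_count vm.min_free_kbytes vm.overcommit_memory
# Transparent huge pages should be disabled ("[never]" selected).
cat /sys/kernel/mm/transparent_hugepage/enabled
cat /sys/kernel/mm/transparent_hugepage/defrag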

Has anyone else experienced a similar problem running MemSQL on AWS? Is it possible the OS really is misconfigured, as the warning message suggests, or is that wording a catch-all and a red herring?


#2

That error means Linux refused a memory allocation request while MemSQL's memory use was still under the maximum_memory variable. (You'd see a different error if MemSQL itself refused the allocation because its memory use had reached maximum_memory.)
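
One quick way to tell whether the OS itself is the limiting factor is to look at the kernel's overcommit accounting on the leaf host around the time of the error (a sketch using standard Linux interfaces, nothing MemSQL-specific):

# Kernel overcommit policy (0 = heuristic, 1 = always allow, 2 = strict accounting).
cat /proc/sys/vm/overcommit_memory
# Under strict accounting, Committed_AS approaching CommitLimit can produce
# ENOMEM even while MemFree/MemAvailable still look healthy.
grep -E 'MemFree|MemAvailable|CommitLimit|Committed_AS' /proc/meminfo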

After that, run memsql-report check --all as a sanity check on the configuration.
(https://docs.memsql.com/memsql-tools-reference/latest/check/)

If things look okay in that output, then generate a cluster report (memsql-report collect) and send it to bug-report@memsql.com.
(https://docs.memsql.com/memsql-tools-reference/latest/collect/)


#3

Thanks @adam. I collected a report on one of the affected nodes right after a crash like the one I described, using:

memsql-report collect-local --all

I then ran a check on the report. (The memsql-report tool tells me I have to specify a path to the report file when running check. ¯\_(ツ)_/¯)

memsql-report check --all --report-path ./report-2019-07-09T120753.tar.gz

Every line in the check output indicates a PASS. I’ll file a ticket with support as you suggest.

(Ticket link: https://support.memsql.com/hc/en-us/requests/8622)


#4

I should also mention …

I noticed there are core dump files on each of the impacted leaf nodes, with timestamps coinciding with the times they went offline. I'll make sure to mention this in the ticket, but wanted to include it here in public for posterity as well.
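
For anyone who wants to do the same correlation, this is roughly how I lined them up. The dump location depends on your kernel.core_pattern (and any coredump handler), so treat the path below as an assumption:

# Where this host is configured to write core dumps.
cat /proc/sys/kernel/core_pattern
# Assumed location (memsqld's working directory); adjust to match core_pattern.
ls -lh /var/lib/memsql/*/core* 2>/dev/null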