ERROR: Failed to complete ANALYZE MEMORY for workload management

s.stgermain · September 11, 2019, 3:11pm

Hi,
Over the past two days my aggregator has become unresponsive on two occasions. In both instances, going through the memsql.log, I see the following:

165904625914 2019-09-11 02:52:06.205  ERROR: Failed to complete ANALYZE MEMORY for workload management: Error [2005] Leaf Error (127.0.0.1:3306): Timed out reading from socket after 5 seconds
166141621732 2019-09-11 02:56:04.207   WARN: socket (366) ETIMEDOUT in send
166141621749 2019-09-11 02:56:04.207   WARN: socket (384) ETIMEDOUT in send
166517812638 2019-09-11 03:02:20.089  ERROR: Failed to complete ANALYZE MEMORY for workload management: Error [2005] Leaf Error (:0): Timed out reading from socket after 5 seconds

These are the last entries in the log. The cluster does not recover on its own, and I have to restart it.
Any ideas what this might indicate, or how to go about troubleshooting it?

My current cluster consists of 1 aggregator (c5.2xl) and 3 leaf nodes (c5.4xl).

jack · September 11, 2019, 9:23pm

This error may be related to having run out of threads. Do you see any “This workload needs more threads.” errors in the memsql.log? Or are there any other related errors/warnings just before the point where the aggregator became unresponsive?
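A minimal sketch for checking this, assuming the line layout seen in the excerpts above (a numeric counter, a timestamp, a level, then the message). `parse_line` and `scan_log` are hypothetical helper names used for illustration, not part of any MemSQL tooling, and the layout is inferred from this thread rather than from a documented format.

```python
import re
from datetime import datetime

# Assumed layout, inferred from the log excerpts in this thread:
#   <counter> <YYYY-MM-DD HH:MM:SS.mmm> <LEVEL>: <message>
LINE_RE = re.compile(
    r"^(?P<counter>\d+)\s+"
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3})\s+"
    r"(?P<level>[A-Z]+):\s+(?P<msg>.*)$"
)


def parse_line(line):
    """Return (timestamp, level, message), or None if the line does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S.%f")
    return ts, m.group("level"), m.group("msg")


def scan_log(lines, needle="This workload needs more threads"):
    """Collect entries containing `needle`, plus every ERROR/WARN entry."""
    needle_hits = []
    problems = []
    for line in lines:
        entry = parse_line(line)
        if entry is None:
            continue  # skip lines that do not follow the assumed layout
        _, level, msg = entry
        if level in ("ERROR", "WARN"):
            problems.append(entry)
        if needle in msg:
            needle_hits.append(entry)
    return needle_hits, problems
```

To use it, pass an open file handle for the node's memsql.log (the path depends on the installation) to `scan_log`, then look at the last few entries in `problems` before the aggregator stopped responding.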

s.stgermain · September 11, 2019, 10:00pm

Hi, thanks for your response. No, we do not see that error or any others on the aggregator. However, we did find the following error on the leaves (about 9 mins prior):

206769094700 2019-09-11 02:41:35.045 ERROR: Nonfatal buffer manager memory allocation failure. Memory use (27828.125000 MB) has reached the maximum_memory parameter (28028 MB).

It is unclear to me whether this is related, but it is the most significant information we have found around the time we lose the aggregator.
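For a sense of how close that leaf was to its limit, the two figures in the message can be pulled out and compared. The sketch below assumes the message wording is exactly as in the line quoted above; `memory_headroom` is a hypothetical helper name. For the quoted line it gives 199.875 MB of headroom, or roughly 99.3% of maximum_memory in use.

```python
import re

# Pattern based on the wording of the leaf error quoted in this thread.
MEM_RE = re.compile(
    r"Memory use \((?P<used>[\d.]+) MB\) has reached the "
    r"maximum_memory parameter \((?P<limit>[\d.]+) MB\)"
)


def memory_headroom(message):
    """Return (used_mb, limit_mb, headroom_mb, percent_used), or None if no match."""
    m = MEM_RE.search(message)
    if m is None:
        return None
    used = float(m.group("used"))
    limit = float(m.group("limit"))
    return used, limit, limit - used, 100.0 * used / limit
```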

jack · September 13, 2019, 4:39pm

For reference, our support team received and reviewed your cluster report off-thread. It appears you were running out of memory because of a bug in an earlier version of MemSQL that was subsequently fixed in 6.8.4. The bug caused queries to get stuck in the compilation state and consume a lot of memory for query compilation.