Pipeline with S3, how do you find why batches are failing


#1

Pipeline with S3, how do you find why batches are failing?

I created the pipeline fine, I have run a test of the pipeline just fine as well.

but when I start the pipeline all I get is failed batches and no data being inserted.

Only thing I can seem to get a message from is when I did PROFILE PIPELINE.

It came back with:
ERROR 1712 ER_MEMSQL_OOM: Leaf Error (xxx.xxx.xxx.xxx:3306): Memory used by MemSQL (6989.25 Mb) has reached the ‘maximum_memory’ setting (7189 Mb) on this node. Possible causes include (1) available query execution memory has been used up for table memory (in use table memory: 4208.50 Mb) and (2) the query is large and complex and requires more query execution memory than is available (in use query execution memory 0.00 Mb). See https://docs.memsql.com/troubleshooting/latest/memory-errors for additional information.

Does this mean I need more ram on the servers? I know I am working with some large files but this was just for testing, so followed the setup and it asked for 4 servers of 8GB of ram so that is what I did. If I need to load up a bigger server for testing I can, but want to make sure what server need this if that is the problem.

Thank you


#2

Welcome to the support forum and thanks for trying out S3 pipelines. It looks like at least one node in your cluster is running out of memory.

Let’s review how S3 pipelines work first. MemSQL aggregator will query for a list of files at the specified S3 location, pick a subset of them as a batch and assign each file to a MemSQL partition to be ingested. The partition will download the file to the leaf first, load it into memory, and perform an insert into your desired table. The available memory for this operation is maximum_memory minus table memory.

One possibility in your case is that the leaf could be trying to load too many files at once, and their total size does not fit into a little less than 3 Gb of available memory. Keep in mind that a file may take up more space when loaded into memory than on disk, due to the overhead of table data structures, or if your destination table is rowstore.

You can lower the parallelism of pipeline batches by using MAX_PARTITIONS_PER_BATCH syntax in your CREATE PIPELINE statement. It is described more fully in the docs https://docs.memsql.com/sql-reference/v6.7/create-pipeline/

Other things you could try is using fewer partitions per leaf. If that is not possible, you could try using an AGGREGATOR pipeline. This could help for a pipeline that loads a single large S3 file and if aggregators have more available memory than leaves.

hope this helps and let us know if you are still having trouble.


#3

I was using Rowstore and with how much data I was trying to load in it was more memory then I had in my cluster. Once I changed over to Columnstore, I was able to get the data loaded in just fine.