File system pipeline into stored procedure

anandh.nagarajan · May 5, 2020, 10:57pm

Our challenge : What is the best way to load large amount of data (about 20 million records, 18GB memsql size) thru FS pipelines without making the table unavailable for read

We load about 30 tables into memsql daily night from a NAS drive by running the pipelines in foreground. During this load we make the entire cluster unavailable and switch the user traffic to other memsql cluster that we have. During this load, we truncate the entire table and reload the entire data. Once we finish loading all the tables, we make the cluster active again and follow the same process for the second cluster.

But we have another set of tables that receive data at various times after the initial load. We need to load these tables also in its entirety.

We developed a pipeline which will write into a SP and SP will check the value for each row and update if the row is already there or insert if the row is not there. Since SP process one row at a time,it takes a long time to process.

Is there any better way of handling this ? Our requirment is that the table should be available for read when we are loading the data.

Thanks.

hanson · May 5, 2020, 11:44pm

While you are updating a table with a pipeline, the table will be available for read. MemSQL has multi-version concurrency control and readers and writers don’t block each other under normal circumstances.

You said that you process one row at a time to do upsert logic in an SP. Can you process the data in batches with set-oriented SQL statements? That could speed things up a lot. Maybe you can do this even though we don’t support INSERT ON DUPLICATE KEY UPDATE on columnstores yet. I’m assuming you are using a columnstore.

If you are using a rowstore you can use an upsert statement to save a lot of time. Again, try to process rows in batches, set-oriented, rather than one at a time. Maybe you can take a batch of say, 10000 rows from one file, and run one statement to figure out what to do with the inserts and one to figure out what to do with the updates. You may or may not need to use temp tables for intermediate data.

Finally, some people use a staged process to do what you are trying to do, loading data into rowstore tables to do INSERT ON DUP KEY UPDATE (upsert) and then migrating data to a columnstore after de-duplication. This can get complex, so only do that if a single-table solution won’t work.

I don’t know how to control batch sizes on columnstore loads, other than to control the size of the input files. Maybe somebody else can comment about that.

anandh.nagarajan · May 6, 2020, 12:08pm

Hanson, thanks for your reply. All our tables are rowstore. Should have mentioned it earlier.

I was able to use the upsert logic and it works quite well…i was able to update 20 million rows within like 30 secs…the input file was split into 16 files ( same as the no of partitions) and loaded using a PIPELINE into a SP.

one question that i still have is, how we do handle the deletes…the rows which are there currently in the table but not in the file…

One idea is to have a staging table loaded with the same data and compare the main and staging table…and then delete any entry that is available in main table but not in the staging table…

is there any better way to do this…