Best way to replicate table (Columnstore -> Memory) periodically?



We have a scenario with two tables:
One COLUMNSTORE table (on disk) that has all our data from the past 6 months.
One rowstore table (in memory) that has data only from the last 4 weeks (the same data as the other table, with fewer columns).

We have this setup because we need some queries answered faster (hence the in-memory table).

What is the best approach to replicate this data and keep only the last 30 days?
We thought about pipelines, but they didn’t have an option to “drop” observations older than 30 days each day.

Currently we are using a Python script that runs each day and:

  • Drops everything inside the in-memory table
  • Loads data from the last 30 days
  • Repeats each day (which of course costs time and bandwidth)
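For reference, the daily full-refresh job above boils down to two statements. This is a minimal sketch; the table names (`events_rowstore`, `events_columnstore`) and columns (`id`, `event_date`, `value`) are placeholders, not the asker's actual schema:

```python
import datetime

def full_refresh_statements(retention_days=30, today=None):
    """The two statements the current daily job runs: wipe the
    in-memory rowstore table, then reload the last 30 days from
    the columnstore table. Table and column names are placeholders."""
    today = today or datetime.date.today()
    cutoff = (today - datetime.timedelta(days=retention_days)).isoformat()
    return [
        # Step 1: drop everything inside the in-memory table.
        "DELETE FROM events_rowstore",
        # Step 2: re-copy the whole 30-day window from the columnstore table.
        "INSERT INTO events_rowstore (id, event_date, value) "
        "SELECT id, event_date, value FROM events_columnstore "
        f"WHERE event_date >= '{cutoff}'",
    ]
```

The expensive part is the second statement: it rescans and re-copies the entire 30-day window every day even though only about one day of data actually changed.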

We would like to know if MemSQL has a better approach for this.



Can you describe your scenario? Why do you use columnstore/rowstore, and what performance expectations do you have for each?


The columnstore table has data from the past year (it doesn’t fit into memory) and is used for answering analytical questions that aggregate data (big picture). Here we need better performance for analytical queries that would take much longer in regular databases.

The rowstore table includes only data from the last month (it does fit into memory) and is used for answering simple operational queries. It gives us a quick snapshot of specific data from the last month; these simple queries help the operational team execute current actions.


Typically people write a stored procedure. In the SP you can open a transaction, run INSERT...SELECT to move data into the columnstore table, DELETE expired rows from the rowstore table, and LOAD new data into the rowstore table.
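The rotation inside that stored procedure could be sketched like this (a Python sketch that just builds the statement list; `events_staging`, `events_columnstore`, `events_rowstore`, and `event_date` are hypothetical names, and this is not MemSQL’s actual stored-procedure syntax):

```python
import datetime

def rotation_statements(retention_days=30, today=None):
    """Build the statements for the nightly rotation: move freshly
    loaded rows into the columnstore table, then trim the rowstore
    table to the retention window, all in one transaction.
    Every table and column name here is a placeholder."""
    today = today or datetime.date.today()
    cutoff = (today - datetime.timedelta(days=retention_days)).isoformat()
    return [
        "START TRANSACTION",
        # Copy newly loaded rows into the long-term columnstore table.
        "INSERT INTO events_columnstore SELECT * FROM events_staging",
        # Drop rows that fell out of the 30-day window from the rowstore table.
        f"DELETE FROM events_rowstore WHERE event_date < '{cutoff}'",
        "COMMIT",
    ]
```

Because both statements run inside one transaction, readers never see a state where a row has left the rowstore window but is missing from the columnstore table.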

The next version may allow you to run the whole workload off of the columnstore - we are adding columnstore indexes. If you are interested, please reach out and share your operational workload - we will be able to tell whether the next version’s columnstore (we call it SingleStore) will do it all.


@ssalgadom a variation of what Nikita described that seems to fit your scenario is:

  • Direct all INSERTs and LOADs to both tables.
  • Target UPDATEs and DELETEs to both tables too, unless you are sure only one table needs them.
  • Have query-router logic in a stored proc or external client application that routes the query to one table or the other, depending on the date range covered by the query and the type of query. You could even use dynamic SQL in a stored proc to substitute the table name into the query string to avoid redundancy.
  • Run a job nightly to delete all data older than 30 days from the rowstore table.
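The routing step above could be sketched as follows (a minimal example assuming the rowstore table holds exactly the last 30 days; the table names are hypothetical):

```python
import datetime

ROWSTORE_HORIZON_DAYS = 30  # retention window of the in-memory table

def route_table(query_start, query_end, today=None):
    """Route a query to the rowstore table when its whole date range
    falls inside the last 30 days; otherwise send it to the
    columnstore table. Table names are placeholders."""
    today = today or datetime.date.today()
    horizon = today - datetime.timedelta(days=ROWSTORE_HORIZON_DAYS)
    if horizon <= query_start and query_end <= today:
        return "events_rowstore"
    return "events_columnstore"
```

A stored proc could do the same with dynamic SQL, substituting the returned table name into the query string so the query text only has to be written once.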


Thanks Nikita and Hanson for your answers; they gave us a better idea of how to handle this scenario.
(looking forward for next version, but not in a hurry :wink: )