ENGINEERING

Replicating PostgreSQL into MemSQL’s Columnstore

Oryan Moshe

Thanks to Oryan Moshe for this awesome blog post, which originally appeared on DEV Community. In the blog post, Oryan describes how to achieve the high performance of MemSQL’s columnstore for queries while keeping transaction data in PostgreSQL for updates – Ed.

Making the impossible hard

MemSQL cluster with capacity of 550GB that has 914GB of data in it.
I… don’t think this is how it should look like.

So it’s this time of the year again, we need to upgrade our MemSQL cluster and expand our contract to fit the new cluster topology.

We really outdid ourselves this time. Expanding to a 1TB cluster is impressive, especially when it’s completely not justified.

The background

Wait. A 1TB Cluster?

Yeah yeah, call us spoiled, but querying on PostgreSQL (PG from now on) is just not the same.

Sure, you can get ok speeds if you’re using the correct indexes and optimize your queries, but it’s not even comparable to the performance you get from the memory based rowstore in MemSQL (Mem from now on), or the insanely fast aggregations of the columnstore.

A Short (Short) Summary of Mem’s Different Storage Types

So we basically have 2 types of storage in Mem, rowstore and columnstore.

The rowstore is stored pretty much like any other database, but in the memory instead of the disk (crazy fast). This means each row is stored together with all of its columns.

The columnstore is sort of a transposed rowstore. Instead of storing rows, we store columns (thank you Captain Obvious), which allows us to make aggregations stupid fast. (Think about it; instead of going to each row and summing the “cost” column, we can just go to the “cost” column and sum it up.) The columnstore is stored on the disk.

(The MemSQL blog has an article on making the most of both rowstore and columnstore tables. – Ed.)

The issue is MemSQL’s license costs more as we have more memory in our cluster, not to mention the cost of the machines themselves (1TB of memory isn’t exactly cheap).

“So why not store everything in the columnstore? It’s cheaper both license and infrastructure wise, and it’s stupid fast!,” you might ask (if you talk to yourself while reading tech articles).

So here’s the catch – the way the data is stored in a columnstore makes it incredibly fast in aggregated queries, and allows amazing compression, but updating a row is slow.

(Some people here at MemSQL are thinking creatively about a possible solution to the problem of slow updates to columnstore tables that Oryan mentions here; stay tuned. – Ed.)

How slow? If we need to update some columns for rows in a specific day, it’s faster for us to delete the data from this day and re-insert the updated one instead of updating the existing rows.

So, How Do We Store Our Data?

Well, in my team we use 7 flavors of databases (might be more, can’t really keep track these days) but the main ones are PostgreSQL, hosted and managed by AWS RDS (for transactional processing) and MemSQL, hosted on EC2 and managed by yours truly (for analytical processing – including, but not limited to, analytics and transactions.).

Instinctively, most of our data is stored in PG (excluding some large columnstore tables containing North of 8B records).

The problem is, once you go Mem you never go back, so we created a replication service that can replicate a row from PG to Mem’s rowstore in real-time. This allows us to enrich our columnstore-only tables, create ETLs, and most importantly, speed up queries.

If you’re here, you either use Mem and thus know its performance, or just like to go around dev.to, reading random articles about niche DBs. If you’re the latter, let me hit you with some numbers.

A completely reasonable query, consisting of 6 joins, took 30 minutes to run on PG. After optimizing it for 2–3 hours, adding indexes, banging my head against the wall and praying for a swift ending, I was able to cut it down to 3 minutes.

Taking exactly the original query (the 30 minutes one) and running it on Mem, it took 1.87 seconds.

The Real Deal

Problems Definition, AKA What’s Making Me Lose Sleep

So Mem is expensive, we’re almost at our new license limit (after more than doubling it) and there’s no way we can go back to querying exclusively on PG.

The solution seems simple: move big tables to the columnstore, free up some memory so you don’t have to increase your license, and upgrade your machines.

For this article I’ll use our table touch_points as an example, it’s our largest (both in memory and row count) table stored in a rowstore - it has over 180M rows, and weighs more than 190GB.
Why is it in our rowstore? First, cause we replicate it from PG, and so far our service only supports replicating to rowstore tables. But, more importantly, it needs to be updated. Out of 30 columns, 2 might get updated  -  visitor_id and cost.

Solutions

The First Solution

So this was the “correct” solution, design-wise.

In short, using ActiveRecord callbacks, I kept 2 tables up to date. One is the touch_points table in the columnstore, containing all columns that exist presently on touch_points except the 2 that get updated. Other than touch_points, I created a table called touch_points_extra_data in the rowstore, containing the 2 missing columns and 1 ID column that allows me to connect the 2 tables.

As I said, this was the correct solution design-wise. The problem is that so much could go wrong. With so many moving parts, all dependent on Tails hooks, we were sure to get out of sync sometime. Not to mention the fact that we’ll have to edit all of our queries from touch_points to add that extra JOIN.

The Second Solution, AKA “The Bruteforce”

So we realized our top priority is to keep the data correct, and we were willing to make some compromises (foreshadowing).

I decided to replicate the whole table, as is, from PG once in a while. This way we can make sure that (up to the moment of replicating) our data will be identical in both DBs.

The compromise is that we are used to having this data updated in real time, and now it’ll be outdated until the next replication. This is a compromise I’m willing to take.

The Technical Part

Easier Said Than Done

So apparently replicating a whole table from one DB to another isn’t as straightforward as you would think. Especially when the two DBs run on different engines entirely.

The first thing I tried is using pg_dump, with the plain file format (which essentially creates a file with loads of INSERT statements) and then convert it to MySQL syntax and load to Mem.

Sounds great, right? I started the pg_dump, and 5 hours later it wasn’t even close to finishing, while the dump file was already at 60GB. pg_dump with the plain option is the most inefficient way to store data. 5 hours delay in replication is unacceptable.

If at First You Don’t Succeed… Fail Again

The next thing I tried was using the COPY command of PG, this command can copy (duh) a table, or a query into a FILE, a PROGRAM, or STDOUT.

First I tried using the STDOUT option (the simplest one, and it doesn’t create a footprint of a huge dump file).

psql -U read_user -h very-cool-hostname.rds.amazonaws.com -p 5432 -d very_cool_db -c\
"\COPY (SELECT * FROM touch_points) TO STDOUT\
WITH(DELIMITER ',', FORMAT CSV, NULL 'NULL', QUOTE '\"');" > touch_points.csv

And it worked! I got a “dump” file from PG containing our whole touch_points table, in just under 20 minutes.

Now we just need to import it to Mem, but why do I need the file? I can just pipe the result right from PG straight into Mem!

So I needed to create the part where Mem receives this csv-like table and loads it into the db. Luckily Mem is MySQL-compatible and provides us with the LOAD DATA clause!

LOAD DATA LOCAL INFILE '/dev/stdin'
  SKIP DUPLICATE KEY ERRORS
  INTO TABLE touch_points_columnstore
  FIELDS
    TERMINATED BY ','
    ENCLOSED BY '"'
    ESCAPED BY ''
  LINES
    TERMINATED BY '\n'
  MAX_ERRORS 1000000;

Now, as I said we want to pipe that data right into Mem, so we need to create a connection to our DB:

mysql -h memsql.very-cool-hostname.com -u write_user -P 3306 -D very_cool_db\
-p'4m4z1nglyS3cur3P455w0rd' -A --local-infile --default-auth=mysql_native_password -e\
"LOAD DATA LOCAL INFILE '/dev/stdin' SKIP DUPLICATE KEY ERRORS\
INTO TABLE touch_points_columnstore FIELDS TERMINATED BY ','\
ENCLOSED BY '\\\"' ESCAPED BY '' LINES TERMINATED BY '\\n' MAX_ERRORS 1000000;"

And then just pipe the data from PG to that connection!

psql -U read_user -h very-cool-hostname.rds.amazonaws.com -p 5432 -d very_cool_db -c\
"\COPY (SELECT * FROM touch_points) TO STDOUT\
WITH(DELIMITER ',', FORMAT CSV, NULL 'NULL', QUOTE '\"');" |\
mysql -h memsql.very-cool-hostname.com -u write_user -P 3306 -D very_cool_db\
-p'4m4z1nglyS3cur3P455w0rd' -A --local-infile --default-auth=mysql_native_password -e\
"LOAD DATA LOCAL INFILE '/dev/stdin' SKIP DUPLICATE KEY ERRORS\
INTO TABLE touch_points_columnstore FIELDS TERMINATED BY ','\
ENCLOSED BY '\\\"' ESCAPED BY '' LINES TERMINATED BY '\\n' MAX_ERRORS 1000000;"

And… It worked! But it took 2 hours to complete. I’m sure we can do better than that.

Compression is Your Friend

So two cool things important to understand about loading data into Mem are:

  1. When inserting a data file into Mem, it copies the file locally to the aggregator and splits the file between the nodes of the cluster, speeding up the data load significantly.
  2. Mem supports receiving gzip-compressed data files.

Combining these two pieces of information made me understand that creating the file in the middle maybe isn’t as bad as I thought.

I can compress that file, making storage a non-issue. It’ll also speed up the transfer of the file to the aggregator (before splitting) by cutting out most of the network related latency, and it’ll allow Mem to split the data between the nodes.

Let’s do it!

First of all I need to modify the PG part so instead of piping the content to STDIN, it pipes it to a PROGRAM, and in our case, gzip.

psql -U read_user -h very-cool-hostname.rds.amazonaws.com -p 5432 -d very_cool_db -c\
"\COPY (SELECT * FROM touch_points) TO PROGRAM 'gzip > /data/tmp/replication/touch_points_columnstore.gz'\
WITH(DELIMITER ',', FORMAT CSV, NULL 'NULL', QUOTE '\"');"

After we created this tmp file we need to load it. Luckily the only thing we have to do is to change the source of the input file!
Our finished script looks like this:

psql -U read_user -h very-cool-hostname.rds.amazonaws.com -p 5432 -d very_cool_db -c\
"\COPY (SELECT * FROM touch_points) TO PROGRAM 'gzip > /data/tmp/replication/touch_points_columnstore.gz'\
WITH(DELIMITER ',', FORMAT CSV, NULL 'NULL', QUOTE '\"');" &&\
mysql -h memsql.very-cool-hostname.com -u write_user -P 3306 -D very_cool_db\
-p'4m4z1nglyS3cur3P455w0rd' -A --local-infile --default-auth=mysql_native_password -e\
"LOAD DATA LOCAL INFILE '/data/tmp/replication/touch_points_columnstore.gz' SKIP DUPLICATE KEY ERRORS\
INTO TABLE touch_points_columnstore FIELDS TERMINATED BY ','\
ENCLOSED BY '\\\"' ESCAPED BY '' LINES TERMINATED BY '\\n' MAX_ERRORS 1000000;"

And that’s it!

The created file weighs 7GB, and the whole process takes less than 20 minutes, so we can run it once an hour and have semi-realtime data!

Obviously this wasn’t the end, I wrapped it up in a nice Rails module that allows me to replicate any query from PG to Mem easily, including truncating the old data and using 2 tables to minimize the downtime during replication.

Feel free to contact me with any questions! (Twitter: @oryanmoshe. Github: oryanmoshe.)

MemSQL Helios eclipse
Introducing
MemSQL Helios
The World’s Fastest Cloud Database