Performance Improvements in MemSQL 6.5

Eric Hanson
Eric Hanson

With MemSQL 6.5, the fastest database just got a lot faster. MemSQL is a performance-oriented product that capitalizes on several techniques to give excellent overall speed. These techniques include optimized disk-based and in-memory data structures, parallelism, scale out, compilation of queries to machine code, vectorization, and single instruction, multiple data (SIMD).

In MemSQL 6.5, we advanced performance in many dimensions, including query execution (QE), query optimization (QO), data loading, and system management. MemSQL 6.0 was already fast. Very fast. But we were not satisfied. QE performance is always a high priority for us since it’s what users see most when exploring data. We improved QE performance for the following operations:

  1. Shuffle (data redistribution)
  2. High-cardinality GROUP BY
  3. IN-list filtering
  4. Filtering of data stored with integer run-length encoding (RLE)
  5. Internal memory allocation

For many users, the improved shuffle and high-cardinality GROUP BY performance will be the most noticeable advancements. In an internal test workload that we run based on the TPC-H benchmark, the average query speed improved by 2.2X, over a set of 22 queries. The IN-list filter, integer RLE filtering, and memory allocation improvements can all give 10X or more query speedups in the right situation.

The query optimizer and statistics systems have improved significantly. The blazing-fast SIMD QE we used to demonstrate a trillion-row-per-second query with our 6.0 release had gotten so fast that our query optimizer didn’t always know when to use it. So we re-calibrated the optimizer to use it when it’s best suited. Also, we’ve improved our statistics (histograms) to have much higher resolution, which will lead to better query plans.

Finally, loading and system management have improved. Loading from Amazon S3 cloud storage using our Pipelines feature can start in seconds in MemSQL 6.5 versus minutes in MemSQL 6.0 for large S3 buckets. Load speed for wide comma-separated value list files with quoted strings is over twice as fast. Columnstore loading and all other operations that require large sorts are far faster. And finally, restart recover for clusters with hundreds of databases is more than ten times as fast.

Query Performance

Here, we’ll examine all the query performance improvements in MemSQL 6.5 in more depth.

Fast Shuffle

MemSQL is a distributed, shared-nothing database. Applications connect to aggregator nodes. Aggregators compile queries, start execution, and assemble results. The data is stored across multiple leaf nodes using hash partitioning (sharding). The execution of queries requires data to be moved between leaf nodes, as well as from leaves to aggregators.

Data movement takes time. And we don’t like it when queries take too much time. Which lead us to create Project Sanic. When Ya Gotta Go Fast! (Google images of “Sanic” and you’ll see our project mascot). Sanic is all about speeding up data movement in MemSQL.

Before MemSQL 6.5, shuffle heavily relied on converting data to text form, then back to an internal binary form for processing at the receiving node. 6.5 shuffles data primarily in binary format. For certain types of data, especially numbers and dates, this saves a lot of CPU time on both the sending and receiving ends, because we don’t have to take time to convert to string form and back again.

Consider a simple example. Suppose we create this table:

create table t(a int, b int, shard(a), key(a));

Then we insert about 164 million rows into it, with unique integer values for every row for both columns a and b.

This table is sharded (hash partitioned) by column a across leaf nodes. Each leaf node has an index on column a. Now, suppose we want to run this query:

select count(distinct b) from t;

A highly parallelizable version of this query repartitions data by column b using hashing, then performs a local distincting operation on each leaf. Finally, these results are concatenated together at the aggregator, and the result is returned to the client.

The query above runs on a 4-leaf system with 16 cores and 16 partitions total in the following times:

MemSQL 6.0 MemSQL 6.5 Speedup (times)
26.57 sec 8.74 sec 3.04

This healthy 3.04 times speedup is due completely to shuffle performance improvements.

High-Cardinality GROUP BY

High-cardinality GROUP BY queries are ones that have many distinct grouping keys (say 10s of thousands or more). In 6.0, these queries computed a local grouping for each partition, and each of these was forwarded to the aggregator, which did the final grouping. If there were many partitions per node, this could have involved a lot of data transmission causing a network bottleneck and leaving the aggregator with a lot of work to do to form the final aggregate set.

In 6.5, performance of queries like this is improved by (a) doing a local aggregate at the leaf level of all the partitions on that leaf, (b) spilling the rows to the network if it turns out almost every GROUP BY key is distinct so the local aggregation is not helping, and (c) reducing memory usage by having each thread handle a subset of the keys.

The realistic best-case scenario query for this improvement is a GROUP BY that is shuffling (not shard key matching) and each partition has a large set of keys but the set of keys for each partition is roughly the same. Shard-key-matching GROUP BY is not affected by this change.

The key part of this improvement is the local aggregate at the leaf level, since it allows more of the work to be done by the leaves and less by the aggregators, and it reduces network traffic.

Performance Improvement:
This improvement can give several-times speedups for some queries. For example, query q22 in one of our internal test workloads derived from TPC-H speeded up by more than a factor of 2 based solely on this change.

IN-list Filter Improvement

Performance has been improved for queries that use IN-list filters. 6.0 required a full scan of the IN-list for each row. 6.5 uses a Bloom filter to check if there is any match to help increase scan speed.

For IN-list filters that disqualify the large majority of rows, this can give order-of-magnitude speedups. This improvement works for row store and column store tables. It is always enabled.

Here’s an example that shows the dramatic speedup possible for IN-list queries in 6.5. For a table t(i int) with 8 million rows of unique integers, in row store format, we ran the following queries using 6.0 and 6.5:

Query 6.5 6.0
select count(*) from t where i in (1, 100, 200); 0.23 sec same
select count(*) from t where i in (1, 2, ... 904); 0.20 sec (60x speedup!) 12.0 sec

The second query, with an IN-list containing 904 elements, speeds up tremendously, while the first one, with a short IN-list, stays the same.

Integer RLE Filter Pushdown

Run-length encoding (RLE) is one of several compression strategies used for our columnstore tables. MemSQL 6.5 can now push integer filters down to RLE-encoded data in the columnstore scan layer, giving significant speedup.Here’s a performance example. The table t(i int) is a columnstore with 10 million rows and 1,000 unique values, keyed on i. So it will be encoded with RLE.

select count(*) from t where i = 2;

The performance results on two cores with two partitions are as follows:

6.0 6.5
0.11 sec 0.002 sec

As you can see, this enhancement can provide order-of-magnitude improvements. It’s always on; there is no need to set any switches to control it.

Internal Memory Allocation Speedup

Query execution memory allocation performance was significantly improved. Some TPC-DS benchmark queries speed up by over 2X based on this change. This microbenchmark speeded up by a factor of 20 compared to MemSQL 6.0:

select count(substr(str, 0, 9)) from t;

The improvement is in the internal memory allocator used by queries that contain non-trivial string expressions or create temporary rows. So queries such as this can become far faster.

Query Optimizer and Statistics

QE improvements are fantastic, but the QO system is always critical to make sure the right QE operators are used in the best possible way. Our query optimizer is the guidance system of the MemSQL DBMS.

Automatic Selection of Hash Group to Get Operations on Encoded Data

As we publicized in our Trillion Rows Per Second demoin MemSQL 6.0 we improved our query execution dramatically for single-table GROUP BY/aggregate queries, by using a new implementation of HashGroupBy that uses SIMD and operations on encoded data. In 6.5, we’ve improved the costing code in the query optimizer by calibrating it so it understands more accurately the relative speed of HashGroupBy and StreamingGroupBy operations. So queries like this now routinely get a HashGroupBy instead of StreamingGroupBy:

select stock_symbol, count(*) as c 
from trade
group by stock_symbol 
order by c desc 
limit 10;

The table trade is a columnstore, sharded on id or stock_symbol and with key on stock_symbol.

In effect, our query optimizer has caught up with the new capabilities available in our query execution system. The result is that queries may now run up to 80 times faster, automatically, with no need for hints or other workarounds.

Advanced Histograms for Improved Filter Cardinality Estimation

MemSQL 6.5 introduces a new type of histogram to represent what we sometimes refer to as “range” statistics. These histograms are termed “advanced histograms” and the existing histograms will be called “legacy histograms.” The advanced histograms hold substantially more information than the legacy histograms for certain data distributions. So, non-uniform data distributions with skew typically have much better estimates for less frequent values, and also better average estimates, than in MemSQL 6.0. For example, for this data distribution:

| v    | count(*) |
|    1 |   320000 |
|    2 |      320 |
|    3 |      352 |
|    4 |      256 |
|    5 |        1 |
|    6 |      512 |
|    7 |      512 |
|    8 |      544 |
|    9 |      512 |
|   10 |      608 |
|   11 |        3 |
|   12 |      608 |
|   13 |      576 |
|   14 |      288 |
|   15 |      288 |
|   16 |      288 |
|   17 |      160 |
|   18 |      352 |
|   19 |        2 |
|   20 |      480 |

And this query:

select count(*) from t where v = <constant>;

We get these estimates:

Value of <constant> actual 6.0 estimated 6.5 estimated
1 320,000 320,047 320,174
3 352 450 366
5 1 450 1

In 6.0, the high-frequency values (such as 1) crowded out information about the lower-frequency values (such as 5) in the histogram, causing loss of resolution. In 6.5, this crowding-out effect happens much less quickly. Normally up to 50 data values can be represented nearly exactly, regardless of data frequency skew. In a larger query, errors such as this can compound and cause a sub-optimal strategy to be chosen for a join (e.g. hash instead of nested loop), or a distributed operation (e.g. shuffle instead of broadcast). Since these errors are reduced in 6.5, overall query performance will improve.

Before-and-After Results, 6.0 to 6.5, TPC-H

Query performance in one of our internal test workloads derived from the TPC-H benchmark improved dramatically, about 2.2X on average, from 6.0 to 6.5. The primary reason for this is improved shuffle performance and, to a lesser extent, the high-cardinality GROUP BY performance improvement. Also, query q19 improved due to a rewrite to extract common conditions for OR ((a or b) or (a or c)) → (a and (b or c)).

MemSQL TPC-H Scale Factor 1000, May 2018, Hardware: 3 Amazon Web Services (AWS) r3.8xlarge leaves, 32 vcores each
Query 6.0 Time (sec) 6.5 Time (sec) Speedup (%) Speedup (times) Notes                                     
q1 5.824 5.38 7.62% 1.1
q2 3.537 2.977 15.83% 1.2
q3 45.661 18.052 60.47% 2.5
q4 6.437 6.47 -0.51% 1.0
q5 22.127 12.669 42.74% 1.7
q6 3.334 3.392 -1.74% 1.0
q7 17.478 9.641 44.84% 1.8
q8 4.97 4.023 19.05% 1.2
q9 78.55 33.866 56.89% 2.3
q10 15.395 11.287 26.68% 1.4
q11 3.654 3.495 4.35% 1.0
q12 3.821 4.252 -11.28% 0.9
q13 62.558 39.001 37.66% 1.6
q14 75.244 21.528 71.39% 3.5
q15 24.668 9.931 59.74% 2.5
q16 40.461 12.24 69.75% 3.3
q17 7.21 3.21 55.48% 2.2
q18 300 72.387 75.87% 4.1 query timed out with error on 6.0
q19 6.691 4.998 25.30% 1.3
q20 70.276 26 63.00% 2.7
q21 149.832 143.061 4.52% 1.0
q22 46.297 9.69 79.07% 4.8
Averages 33.0 20.8 37% 2.0

†q18 did not run in under 300 sec on 6.0 so its time is given as the timeout of 300 sec.

Load Performance

Before you can experience the benefits of improved query speed, you have to get data into your database. Load performance is historically a strength of MemSQL. We’ve made it even better 6.5.

Speedup of CSV File Parsing

LOAD DATA commands have been made substantially faster. For wide tables with long comma-separated value (CSV) input lines, using strings enclosed with quotes or other characters, speedup of more than 2X is possible.

We micro-optimized parsing. The most important change only applies to LOAD DATA with a FIELDS [OPTIONALLY] ENCLOSED BY clause, and is always “on.” It’s expected to be most beneficial for tables with many columns and relatively long lines. It only affects work done on the aggregator during the LOAD, and it doesn’t apply to pipelines.

Loading 6GB of data – with average CSV line length of 1KB – into a 170 column table on a cluster on AWS with an m4.10xlarge aggregator + 4 m4.4xlarge leaves and 32 partitions total, we saw:

Query OK, 5029152 rows affected (31.14 sec)

Before the change, the same load took about twice as long:

Query OK, 5029152 rows affected (1 min 9.76 sec)

Pipelines from S3 Speedup

The time for the CREATE PIPELINE statement to return, when used on Amazon S3 cloud storage, has been reduced dramatically for S3 prefixes with very large numbers of files under them. For example, for a million files under a prefix, it previously took about 5 minutes, and now it typically takes under 2 seconds.

In addition, under an S3 path with a large number of files, when a new file is added, it will be loaded and visible in the MemSQL database much more quickly than before, typically in seconds rather than minutes.

Previously the time taken to create an S3 pipeline grows with the number of objects in the source S3 bucket (more precisely, the number of objects matching the specified prefix) and is unbounded. For example, create pipeline took about 8 minutes on a bucket with 1.6 million objects. The root cause of the issue is the long time it can take to list out all the files under and S3 bucket. In 6.5 we fixed this issue and the number of network round trips taken equals (number of MemSQL partitions) / 1,000, rounded up. Usually this translates to no more than 2 seconds.

The second use case that is improved in 6.5 follows the “continuous loading” pattern where news files are added to the bucket and we expect to be able to query the data in MemSQL as soon as possible after the files are uploaded. Previously, this latency grew linearly with the number of existing files in the bucket. For example, suppose there are 200,000 files and a new file is uploaded. The rows from these files will not start to be loaded for about 20 seconds. If there are 2 million files, the latency would be over 3 minutes. Since S3 is often used for historical data, it is not uncommon that buckets grow to the extent that real-time analytics becomes infeasible. (Note that the overall bandwidth is not affected and therefore one-time load from S3 using MemSQL pipelines prior to 6.5 is not problematic.) In 6.5, this latency is not affected by the number of files in the bucket, given that newly added keys are greater in sort order.

Best practices: To take best advantage of this improvement, it is recommended that customers add keys to their S3 bucket in ascending sort order. A random order of names in a large set of files can still cause a long delay before seeing the contents of a new file in the database. For example, if you name files with a string of the form “prefix_name_<timestamp>” where <timestamp> is of the form YYYY-MM-DD-HH:MM:SS.FFF then the file names will naturally be in sort order, so you’ll benefit from this new improvement.

Columnstore Compression Performance

Columnstore compression performance has been improved, particularly for dictionary-compressed column segments with under 2^16 values.

Overall, LOAD and heavy INSERT workloads for low- to medium-cardinality strings are about 2X faster.

This enhancement improves the speed of the background merger and flusher too.

Sort Speed Performance

The speed of the internal sort routine was improved. This is related to the improvements in the internal memory allocator that were previously mentioned. This improves performance of the background merger used by the columnstore, and any other internal system processes that sort data.


MemSQL has fundamental architectural advantages for performance compared with legacy SQL database systems, including scale out, compilation, in-memory structures, vectorization, and use of SIMD. In the 6.5 release, we’ve built on these architectural pillars to deliver tremendous performance gains. Some queries have speeded up over 60X, and some entire workloads have more than doubled in speed, with no application changes required.

MemSQL is a database on the move. We have a talented engineering team that is keeping up a tremendous pace of improvement from release to release. Given our fundamental advantages, run-anywhere flexibility, and great performance that is only getting better, MemSQL is a great train to get aboard with your demanding data management applications.

MemSQL Helios eclipse
MemSQL Helios
The World’s Fastest Cloud Database