Benchmark: Filtering on integers is significantly faster than on small varchars on rowstore, but not on columnstore

A month ago we found that joining on integer / binary columns is significantly faster than char(32):

The binary data type was 2.3 times faster than char. The bigint data type was 1.5 times faster than binary and 3.5 times faster than char.
What is best practice for CLUSTERED COLUMNSTORE keys

Today, we found that this is also the case when filtering data, even with full index coverage, but only on rowstore tables; with columnstore tables we observed the opposite.

I’ll share our example to show the performance improvement:

We have a table with keywords. Each keyword has a location (geotarget). The original table used varchar to store a two-letter ISO country code, e.g. DK:

CREATE TABLE `keywords-varchar` (
  `id` binary(16) NOT NULL,
  `location` varchar(255) CHARACTER SET utf8 COLLATE utf8_unicode_ci NOT NULL,
  PRIMARY KEY (`id`),
  KEY `location` (`location`)
)

We decided to test with location IDs consisting of four-digit numbers, e.g. 1000:

CREATE TABLE `keywords-int` (
  `id` binary(16) NOT NULL,
  `location` int(10) unsigned DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `location` (`location`)
)

We used a stored procedure to measure execution time while excluding client communication (inspired by @hanson’s blog post).
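The measure_q procedure itself isn’t shown above; a minimal MySQL-style sketch of how such a procedure could work (the name, signature, and microsecond granularity are assumptions, not the original code) might look like:

```sql
-- Hypothetical sketch: run a query `iterations` times via a prepared
-- statement and report total elapsed time, measured entirely on the
-- server so client round-trips are excluded.
DELIMITER //
CREATE PROCEDURE measure_q(q TEXT, iterations INT)
BEGIN
  DECLARE i INT DEFAULT 0;
  DECLARE started DATETIME(6);
  SET @stmt = q;
  PREPARE ps FROM @stmt;
  SET started = NOW(6);
  WHILE i < iterations DO
    EXECUTE ps;
    SET i = i + 1;
  END WHILE;
  SELECT TIMESTAMPDIFF(MICROSECOND, started, NOW(6)) AS total_elapsed;
  DEALLOCATE PREPARE ps;
END //
DELIMITER ;
```

This keeps the measurement server-side, which is the point of the approach: the numbers below compare query execution cost, not network or client overhead.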

Results based on ~20M rows (all with the same location):

call measure_q("select count(*) from `keywords-int` where location = 1000", 100);

155,289

call measure_q("select count(*) from `keywords-varchar` where location = 'DK'", 100);

482,675

Integers are 3 times faster!

The above queries are just meant to illustrate the performance difference. With real application queries we’re seeing 10x faster execution.


Funny thing: we tried using mediumint(8) as well to reduce memory usage, but it performs nowhere near as well as int(10). Not sure if we’re missing something:

CREATE TABLE `keywords-mediumint` (
  `id` binary(16) NOT NULL,
  `location` mediumint(8) unsigned DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `location` (`location`)
)
call measure_q("select count(*) from `keywords-mediumint` where location = 2208", 100);

389,273

Disclaimer: in this example the location column had a cardinality of one (every row held the same value). The results might be different with higher cardinality. I’ll provide an update when we migrate our production environment.


However, we tried the same with our columnstore table (50M rows) and saw execution times 10-20 ms slower with integers. Apparently, columnstore must have some varchar optimization in its compression, but based on the rowstore results I wouldn’t have expected columnstore to be slower with integers.

But since 3-byte integers are not native to CPUs, a mediumint value needs to be converted before it can be used (memcpy into a 4- or 8-byte integer), causing overhead, while 4-byte integers are native to CPUs and can be used directly.
Source: https://stackoverflow.com/a/57293103/5412658

Could this be the reason why int(10) unsigned is faster than mediumint(8) unsigned in MemSQL as well?

Thanks for sharing. Columnstore query execution for equality filters is quite different from rowstore because columnstore uses compression and can run filters on compressed data. So the comparison overhead per row might be quite different.

In 6.8 and earlier, you can’t do index seeks on a columnstore, which factors into the performance you’ll get.

In 7.0, you can use hash indexes on columnstores, which again will change things.

Thanks for the explanation. We’re excited to try out the hash indexes on columnstore.

What about our finding with mediumint being quite slower than int on rowstore?

Regarding mediumint, we don’t know a definite reason for that. Sticking with int or bigint probably makes sense for good performance unless you really are trying to save every byte, given what you saw, but I can’t be sure.

The equality lookup time for only one value might be affected by data distribution, like if the number of unique values retrieved was different. Make sure they were the same.

If it really matters to you, you could try a more comprehensive test like using an assortment of random values rather than just one value in the lookups, making sure the number of values matched by the WHERE clause was always the same. And then gather PROFILE JSON plans for the different queries too and compare them. Feel free to attach them and we’ll take a look if we have time.