What is the best practice for CLUSTERED COLUMNSTORE keys?

I’m using columnar fact tables for our star schemas, with dimensions as in-memory tables.

Typically, if a fact table contains a time dimension, I use it as the key. But this is more of a heuristic I've developed, and I'm not sure what the best practice is here.

What’s the right way to decide what the key(s) for a columnar table should be, and in what order should they be declared?

Here’s an example table definition:

CREATE TABLE fact_survey_submissions (
  -- Time
  `timestamp` DATETIME NOT NULL,
  
  -- Foreign Keys
  account_id VARCHAR(32) NOT NULL,
  user_id VARCHAR(32) NOT NULL,
  response_id VARCHAR(32) NOT NULL,
  survey_id VARCHAR(32) NOT NULL,

  -- Measures
  survey_rating TINYINT,

  KEY(`timestamp`) USING CLUSTERED COLUMNSTORE
);

The SingleStore documentation (SingleStoreDB Cloud · SingleStore Documentation) has a lot of good advice on this question. (The first part is background on what the key is; the advice starts with “The single most important consideration for columnstore keys”…)
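
To make that concrete, here is a sketch of the kind of query that benefits when `timestamp` is the columnstore key (the date range here is made up):

-- Because the table is sorted on `timestamp`, this range filter can skip
-- entire row segments whose min/max timestamps fall outside the range
-- (segment elimination) instead of scanning the whole table.
SELECT COUNT(*)
FROM fact_survey_submissions
WHERE `timestamp` >= '2020-01-01' AND `timestamp` < '2020-02-01';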

This is unrelated to your sort key decision, but I would recommend using INT or BIGINT for your foreign key columns assuming you will be joining them to dimension tables. Joins on integer columns are faster than on strings.
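
For example, the table above with integer foreign keys would look something like this (just a sketch; assigning the surrogate key values is a separate ETL concern):

CREATE TABLE fact_survey_submissions (
  `timestamp` DATETIME NOT NULL,

  -- Integer surrogate keys instead of source-system strings
  account_id BIGINT NOT NULL,
  user_id BIGINT NOT NULL,
  response_id BIGINT NOT NULL,
  survey_id BIGINT NOT NULL,

  survey_rating TINYINT,

  KEY(`timestamp`) USING CLUSTERED COLUMNSTORE
);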

1 Like

Thanks Hanson. It’s on my todo list to turn these into proper surrogate keys instead of reusing the primary keys from the source system. I know this is bad practice, but for now they are bona fide strings.

In response to Hanson: I guess you might be storing UUIDs as your foreign keys. If that is the case, BINARY(16) might be a better choice until you get the chance to migrate to integers. We’re currently testing the binary type as a replacement for CHAR(32), and we’re seeing a huge performance improvement with JOINs.

Bonus: we’ll cut space consumption by a factor of 2.75 with binary instead of varchar. However, columnstore compresses the data, so the actual savings will be less.
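
For reference, the packing is a one-liner (a sketch; the UUID literal here is made up):

-- Pack a dashed UUID string into BINARY(16), then unpack it back to hex for display
SELECT
  UNHEX(REPLACE('123e4567-e89b-12d3-a456-426614174000', '-', '')) AS uuid_bin,
  LOWER(HEX(UNHEX(REPLACE('123e4567-e89b-12d3-a456-426614174000', '-', '')))) AS uuid_hex;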

@hanson, do you have any insight into how binary compares to integers? Would integers also beat binary significantly?

There are performance optimizations for joins on “integer” data types, which internally means any structured type that fits in 8 bytes or less, including INT, BIGINT, DATE, DATETIME, etc., but not strings. So joining on BIGINT will be faster than joining on BINARY(16), by a lot in some cases.
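
For example, a join like this qualifies for the fast path once both key columns are BIGINT (dim_survey here is a hypothetical dimension table):

-- Both sides of the join condition fit in 8 bytes, so the optimized
-- integer hash join applies.
SELECT d.survey_id, AVG(f.survey_rating) AS avg_rating
FROM fact_survey_submissions f
JOIN dim_survey d ON f.survey_id = d.survey_id
GROUP BY d.survey_id;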

1 Like

I did a simple test comparing:

  • char(32) (96 bytes)
  • binary(16) (16 bytes)
  • bigint(20) (8 bytes)

I ran a basic inner join query on my local machine, counting the matching rows. Table A had 426,252 rows and table B had 74,226 rows. Every row in table B had a match in table A.
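
The query was essentially this (table and column names here are illustrative):

-- join_key was declared as char(32), binary(16), and bigint(20)
-- in the three respective runs
SELECT COUNT(*)
FROM table_a a
INNER JOIN table_b b ON a.join_key = b.join_key;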

The binary data type was 2.3 times faster than char. The bigint data type was 1.5 times faster than binary and 3.5 times faster than char.

This is in line with @hanson’s statement, confirming that integers are significantly faster in e.g. JOIN queries. However, the execution times for simple indexed lookups remain the same.

2 Likes

@mpskovvang was that for rowstore or columnstore tables? Columnstore has more sophisticated join optimizations for hash joins on integer columns, so the perf difference for joins on integers may be bigger for columnstores than for rowstores.

Hi everyone!

I noticed that there has been increased interest in table data structures / keys, especially for columnstore.

We do have a document with guidelines on optimizing keys and data structures, but it seems it does not satisfy everyone’s technical needs. We are curious what you’d like to see improved in it, what’s missing, etc., so we can better serve our customers.

@scott

Thanks @Jacky. That document looks pretty good. I just missed it when I made my initial post.

@hanson the test was against two rowstore tables. We do have some columnstore tables joining with rowstore tables as well, so it could be interesting to test the performance in such cases too.

Hi Scott,

I agree with Hanson: using integers for your fact table surrogate keys is a better approach.
It is also good practice to have two dimensions for date / time: one for dates only and one for time of day.
Slicing data by date is quite different from slicing it by time. What is also quite important is the grain of your fact table: is the minimal grain an hour, a minute, or a second?
I’m also assuming that account_id, user_id, response_id, and survey_id point to dimensions. It’s good practice to use surrogate keys: account_dim_id, user_dim_id, etc. (see the sketch below).
Depending on your dimension type (type 1, type 2, type 3), you want your surrogate key to be totally independent from your OLTP system. The account_id (referred to as the natural key in the data warehouse world) should be stored in account_dim (the account dimension).
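
A sketch of the pattern (table and column names are illustrative):

CREATE TABLE account_dim (
  account_dim_id BIGINT NOT NULL AUTO_INCREMENT PRIMARY KEY,  -- surrogate key, owned by the warehouse
  account_id VARCHAR(32) NOT NULL,                            -- natural key from the OLTP system
  account_name VARCHAR(100)
);

The fact table then stores account_dim_id, and account_id lives only in the dimension.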

Glad to see users using MemSQL as a data warehouse (facts and dimensions). Still waiting to see a MemSQL adapter for dbt (getdbt.com).

1 Like