What is best practice for CLUSTERED COLUMNSTORE keys


#1

I’m using columnar fact tables for our star schemas, with dimensions as in-memory tables.

Typically, if a fact table contains a time dimension, I use it as the key. But this is more a heuristic I developed and I’m not quite sure what is best practice here.

What’s the right way to decide what the key(s) for a columnar table should be, and in what order should they be declared?

Here’s an example table definition:

CREATE TABLE fact_survey_submissions (
  -- Time
  `timestamp` DATETIME NOT NULL,
  
  -- Foreign Keys
  account_id VARCHAR(32) NOT NULL,
  user_id VARCHAR(32) NOT NULL,
  response_id VARCHAR(32) NOT NULL,
  survey_id VARCHAR(32) NOT NULL,

  -- Measures
  survey_rating TINYINT,

  KEY(`timestamp`) USING CLUSTERED COLUMNSTORE
);

#2

https://docs.memsql.com/tutorials/v6.7/optimizing-table-data-structures/#choosing-a-columnstore-key has a lot of good advice on this question. (The first part is background on what the key is, the advice starts with “The single most important consideration for columnstore keys”…)


#3

This is unrelated to your sort key decision, but I would recommend using INT or BIGINT for your foreign key columns assuming you will be joining them to dimension tables. Joins on integer columns are faster than on strings.