Columnstore key guidelines


#1

I’m creating quick reference guidelines for key selection based on the MemSQL documentation for my team. Here is what I have for columnstore keys.

  • A columnstore table has one columnstore key, up to one shard key, and no other keys including primary keys.
  • The columnstore key should consist of the fields on which there are always, or most commonly, filters ordered from lowest to highest cardinality (from least to most unique).
  • Queries that join columnstore tables on columnstore key fields perform better with less memory overhead.
  • In general, the columnstore key should not be more precise than common query filters. For example, it may be better to include a date rather timestamp field followed by another commonly filtered field to leverage both.

I have a few questions regarding these guidelines.

  1. If a query does not filter or join on the first field in a columnstore key, are there any circumstances in which the remaining key fields have any benefit?
  2. When should shard key fields be or not be included in the columnstore key? If a table is always joined on a shard key, should it always be included in the columnstore key? If not, why does a common filter have different benefits from a common join?
  3. If common filter fields are not yet known, are the shard and/or (undefined) primary key fields generally the best default option? (I think so because we can at least expect them to be used in joins.)
  4. The documentation states that columnstore tables joined on columnstore key fields perform better with less memory overhead. Is this only true for joins on the full columnstore key? If it can be a subset, does it have to be the first field(s) similar to question 1?
  5. The documentation focuses on queries dictating key selection but I think that it may be a good guideline to suggest that key selection can dictate queries. One key cannot accommodate all queries, so we must guide users to query in a way that leverages the keys effectively. In some cases, a table’s data distribution may dictate a specific key and thereby specific kinds of queries to perform effectively. Does that sound reasonable?

#2

The answers to your questions depend on your data and workload so there are no exact answers. The answers below try to address common cases.

  1. In general, no. However, if you do key(A,B) and a has low cardinality so that many segments have the same A value, then a filter only on B might benefit from segment elimination.
  2. Usually you want the shard key to be on a different column that the columnstore key. Reserve the columnstore key for the datetime sort order in most cases. You don’t want to shard on that normally because then your query work will be skewed when queries filter on datetime.
  3. Not clear what you mean. Sharding on one of the join columns is often a good idea.
  4. It should help if you are joining on a prefix of the key.
  5. Certainly, your physical DB design choices may make some queries perform worse so users won’t want to run them much. As long as our design makes the most common queries run fast enough, you should be okay.

#3

Thanks for your response!

  1. I understand why you don’t necessarily want to shard on everything in the columnstore key, but why wouldn’t you include the shard key in the columnstore key if you always join and/or filter on it? Why would it be different from any other common join or filter?
  2. I mean to ask whether the shard and/or (undefined) primary key are a good default option for the columnstore key when common joins and filters are not otherwise known. I say “undefined” because we may know what the primary key would be even if we cannot define it on a columnstore table.

#4

re: 2., I can see a situation where you might have a columnstore fact table with a date_key column and that column would be used both as the columnstore key (for range elimination) and to join on. But sharding on that column would probably be a bad idea because data volume is usually skewed by date, so you could get partition size skew. Plus, you would not be able to get full parallelism for a date range filter or a join with the date dimension table.

re: 3., sharding on a unique column that is not used for range filters is often a good idea because it tends to give very uniform partition sizes, i.e. it avoids skew.


#5

Here’s some good documentation about how to design and optimize table structures:


#6

I’m asking about columnstore key guidelines and it seems that you’re speaking to shard key guidelines. For example, question 3 concerns a good default option for columnstore keys and you’re describing a good idea for shard keys. Could you please reconsider questions 2 and 3 in the context of columnstore keys? To restate them:

  1. If we know that we will always join or filter on shard key fields for a table, should we generally include those fields in its columnstore key?
  2. Are primary and/or shard key fields a good default option for columnstore key fields when other common joins and filters are not known?

#7

For 2, yes, since it can enable better segment elimination. However, keep an eye open for use of merge join on that column in plans for such queries; it may be the case that hash join is faster. The presence of the key could lead to an optimizer error in some cases (choosing a less efficient join strategy due to a costing error). Not likely but it could happen. Leaving the key off could enable hash join which can be faster.

For 3, if you are doing range filter queries on those fields, then yes, otherwise no. You can just omit specifying a key in the later case, (sometimes called “keyless sharded”).


#8

Is there any benefit to specifying additional columnstore key fields after the point of uniqueness? I don’t think so because the sort has been fully determined.


#9

Correct – no benefit after uniqueness


#10

I have a couple more questions on this topic:

  1. Do columnstore keys benefit aggregation? For example, filters and joins aside, if I am aggregating by date, will including the date in the columnstore key make aggregation any more efficient?
  2. Is it a good practice to always define an unenforced unique constraint on columnstore tables where applicable? If nothing else, it’s helpful documentation. It also informs statistics (with or without the rely option) and query plans (with the rely option).