Decimal data type in columnstore table


#1

Hi -

I was reviewing our use of the Decimal data type in our columnstore table and noticed this bit in the MemSQL docs (https://docs.memsql.com/sql-reference/v6.7/datatypes/):

When creating decimal type columns that are aggregated in queries, for best performance, it is recommended to use precision of 18 digits or less if your application can accommodate this (i.e. the extra precision is not needed to represent meaningful information). Decimal values of 18 digits or less can be processed more efficiently than others because they can be handled internally as 64-bit integers during aggregation. Processing these 64-bit values takes much less time than interpreting decimal values of more than 18 digits, which must be handled in a more general way.

Currently we use DECIMAL(20, 5). When I switch to DECIMAL(18, 5) I notice two things:

  1. When we switch to DECIMAL(18, 5), the disk usage for this column jumps by about 80% (80GB => 147GB)
  2. Performance of aggregations on the DECIMAL(18, 5) seems to only be advantageous once I’m selecting enough rows (e.g. the entire table there’s a clear improvement in performance). With a small set of data being selected, performance seems to be better when using DECIMAL(20, 5).

My suspicion for the second point is that performance is better when using DECIMAL(20, 5) because the data is compressed better and allows MemSQL to more quickly scan the data.

I was a little surprised to see the disk usage go up for the smaller decimal column type, but perhaps that has to do with the way in which MemSQL changes the underlying data representation.

Does this align with expectations?

Thanks!


#2

That is interesting. I’d speculate that the type of compression encodings used changed due to some artifact of how compression is implemented that changes when you go above 18 digits (64 bits) in a decimal value. You can run this query to see your encodings on the before and after version of the table to check this.

SELECT database_name, 
       table_name, 
       column_name, 
       encoding, 
       cardinality, 
       null_count, 
       Count(*), 
       Sum(compressed_size), 
       Sum(uncompressed_size) 
FROM   information_schema.columnar_segments cs 
       LEFT JOIN (SELECT table_name, 
                         column_name, 
                         database_name, 
                         cardinality, 
                         null_count 
                  FROM   information_schema.optimizer_statistics 
                  GROUP  BY 1, 2, 3, 4, 5) os 
                    using (database_name, table_name, column_name) 
GROUP  BY 1, 2, 3, 4, 5, 6
ORDER  BY 1, 2, 3, 4, 5, 6;

#3

I expect that the issue is that MemSQL 6.7 doesn’t support integer dictionary encoding for columnstore tables. When decimal goes to 18 digits or less, it is stored internally as an integer. Depending on your data distribution, when you lose ability to get dictionary compression, the data can take more space. We’re looking at adding support for this in the future but I can’t give you a timeline.


#4

Looks like this is exactly it. In the DECIMAL(20, 5) table, the encoding is StringDictionary. In the DECIMAL(18, 5) table, the encoding is IntegerRunLength.

Sounds like we should stick with DECIMAL(20, 5) until support for integer dictionary encoding is supported.

Appreciate your insights!