Function in filters evaluated multiple times

Hello,

I have been benchmarking a query with following kind of filters against columnstore table:

  AND convert_tz(time, 'utc', dr.tz) >= date('2019-01-01') AND time >= dr.begin_time
  AND convert_tz(time, 'utc', dr.tz) < date('2020-01-01') AND time < dr.end_time
  AND (time(convert_tz(time, 'utc', dr.tz)) >= daily_start_time OR daily_start_time IS NULL)
  AND (time(convert_tz(time, 'utc', dr.tz)) < daily_end_time OR daily_end_time IS NULL)

It seems that every convert_tz added adds more time to the query, so it seems that even if the function has exactly same parameters 4 times, the function is evaluated four times also? If I leave out convert_tz one-by-one I get 1 seconds faster performance per every removed conversion.

Is there any workaround for this?

Is that a join predicate or a single-table filter? What’s the whole query and what’s the profile plan?

Hello,

and thank you for looking into this. Here are table definitions and queries + profiles.

-------- table definitions ------------

CREATE REFERENCE TABLE `all_devices_ref` (
  `id` int(11) NOT NULL,
  `device` int(11) DEFAULT NULL,
  `group` int(11) DEFAULT NULL,
  `category` int(11) DEFAULT NULL,
  `begin_time` datetime(6) DEFAULT NULL,
  `end_time` datetime(6) DEFAULT NULL,
  `daily_start_time` time DEFAULT NULL,
  `daily_end_time` time DEFAULT NULL,
  `tz` text CHARACTER SET utf8 COLLATE utf8_general_ci,
  PRIMARY KEY (`id`),
  KEY `all_devices_ref_group_index` (`group`),
  KEY `all_devices_ref_category_index` (`category`),
  KEY `all_devices_ref_device_index` (`device`)
);

CREATE TABLE result (
     device INT,
     time DATETIME(6),
     idx SMALLINT,
     count SMALLINT,
     KEY (device, time) USING CLUSTERED COLUMNSTORE,
     SHARD KEY (device)
);

------ query with 4 time zone conversions -------

SELECT idx,
       SUM(COUNT)
   FROM result sr
   JOIN all_devices_ref dr ON sr.device = dr.device
   AND dr.group = ?
   AND dr.category in (?)
  WHERE sr.time >= '2018-12-28'
  AND sr.time < '2020-01-02'
  AND convert_tz(time, 'utc', dr.tz) >= date('2019-01-01') AND time >= dr.begin_time
  AND convert_tz(time, 'utc', dr.tz) < date('2020-01-01') AND time < dr.end_time
  AND (time(convert_tz(time, 'utc', dr.tz)) >= daily_start_time OR daily_start_time IS NULL)
  AND (time(convert_tz(time, 'utc', dr.tz)) < daily_end_time OR daily_end_time IS NULL)
GROUP BY 1

----- profile with 4 time zone conversions ------

Project [remote_0.idx, `SUM(COUNT)`] est_rows:4 actual_rows: 4 exec_time: 0ms start_time: 00:00:05.075 network_traffic: 0.041000 KB network_time: 0ms
HashGroupBy [SUM(remote_0.`SUM(COUNT)`) AS `SUM(COUNT)`] groups:[remote_0.idx] actual_rows: 4 exec_time: 0ms start_time: 00:00:02.909 memory_usage: 131.072006 KB
Gather partitions:all est_rows:4 alias:remote_0 actual_rows: 48 exec_time: 0ms start_time: 00:00:05.075 end_time: 00:00:05.075
Project [sr.idx, `SUM(COUNT)`] est_rows:4 est_select_cost:3,586 actual_rows: 48 exec_time: 0ms start_time: [00:00:02.908, 00:00:05.074] network_traffic: 0.372000 KB network_time: 0ms
HashGroupBy [SUM(sr.count) AS `SUM(COUNT)`] groups:[sr.idx] actual_rows: 48 exec_time: 19ms start_time: [00:00:00.008, 00:00:00.020] memory_usage: 1,572.864014 KB
Filter [CONVERT_TZ(sr.time,?,dr.tz) >= DATE(?) AND sr.time >= dr.begin_time AND CONVERT_TZ(sr.time,?,dr.tz) < DATE(?) AND sr.time < dr.end_time AND (CAST(CONVERT_TZ(sr.time,?,dr.tz) AS TIME(6)) >= dr.daily_start_time OR dr.daily_start_time IS NULL) AND (CAST(CONVERT_TZ(sr.time,?,dr.tz) AS TIME(6)) < dr.daily_end_time OR dr.daily_end_time IS NULL)] actual_rows: 39,909,406 exec_time: 0ms
HashJoin [sr.device = dr.device] actual_rows: 40,804,557 exec_time: 2,330ms start_time: [00:00:00.003, 00:00:00.016]
|---HashTableBuild shared_build:yes alias:dr actual_rows: 1,185 exec_time: 0ms start_time: 00:00:00.001 memory_usage: 786.432007 KB
|   Project [dr_0.tz, dr_0.begin_time, dr_0.end_time, dr_0.daily_start_time, dr_0.daily_end_time, dr_0.device] est_rows:1,793 actual_rows: 1,185 exec_time: 1ms start_time: [00:00:00.000, 00:00:00.001]
|   ChoosePlan
|   |   :estimate
|   |       SELECT COUNT(*) AS cost FROM bench.all_devices_ref WHERE dr_0.group = ?
|   |       SELECT COUNT(*) AS cost FROM bench.all_devices_ref WHERE dr_0.category IN (...)
|   |---Filter [dr_0.end_time > dr_0.begin_time AND dr_0.begin_time < (?!:>datetime(6) NULL) AND dr_0.end_time > (?!:>datetime(6) NULL) AND dr_0.category IN (...)] actual_rows: 790 exec_time: 0ms
|   |   IndexRangeScan bench.all_devices_ref AS dr_0, KEY all_devices_ref_group_index (group) scan:[group = ?] table_type:reference_rowstore est_table_rows:573,784 est_filtered:1,794 actual_rows: 894 exec_time: 0ms start_time: [00:00:00.000, 00:00:00.001]
|   +---Filter [dr_0.end_time > dr_0.begin_time AND dr_0.begin_time < (?!:>datetime(6) NULL) AND dr_0.end_time > (?!:>datetime(6) NULL) AND dr_0.group = ?] actual_rows: 395 exec_time: 0ms
|       IndexRangeScan bench.all_devices_ref AS dr_0, KEY all_devices_ref_category_index (category) scan:[category IN (...)] table_type:reference_rowstore est_table_rows:573,784 est_filtered:1,794 actual_rows: 527 exec_time: 0ms start_time: 00:00:00.001
ColumnStoreFilter [sr.time >= (?!:>datetime(6) NULL) AND sr.time < (?!:>datetime(6) NULL) AND sr.device = dr.device bloom] actual_rows: 40,833,720 exec_time: 64ms start_time: [00:00:00.001, 00:00:00.002] total_rows_in: 1,217,921,159 average_filters_per_row: 1.109892 average_index_filters_per_row: 0.000000 average_bloom_filters_per_row: 1.000000
ColumnStoreScan bench.result AS sr, KEY device (device, time) USING CLUSTERED COLUMNSTORE table_type:sharded_columnstore est_table_rows:1,968,061,528 est_filtered:457,859,507 actual_rows: 1,217,921,159 exec_time: 2,648ms start_time: 00:00:00.001 memory_usage: 6,291.456055 KB segments_scanned: 1,195 segments_skipped: 734 segments_fully_contained: 0
Compile Total Time: 0ms

------- query without conversion —

SELECT idx,
       SUM(COUNT)
   FROM result sr
   JOIN all_devices_ref dr ON sr.device = dr.device
   AND dr.group = ?
   AND dr.category in (?)
  WHERE sr.time >= '2018-12-28'
  AND sr.time < '2020-01-02'
  AND time >= date('2019-01-01') AND time >= dr.begin_time
  AND time < date('2020-01-01') AND time < dr.end_time
  AND (time(time) >= daily_start_time OR daily_start_time IS NULL)
  AND (time(time) < daily_end_time OR daily_end_time IS NULL)
GROUP BY 1

------ profile without conversion -----

Project [remote_0.idx, `SUM(COUNT)`] est_rows:4 actual_rows: 4 exec_time: 0ms start_time: 00:00:00.824 network_traffic: 0.041000 KB network_time: 0ms
HashGroupBy [SUM(remote_0.`SUM(COUNT)`) AS `SUM(COUNT)`] groups:[remote_0.idx] actual_rows: 4 exec_time: 0ms start_time: 00:00:00.449 memory_usage: 131.072006 KB
Gather partitions:all est_rows:4 alias:remote_0 actual_rows: 48 exec_time: 0ms start_time: 00:00:00.824 end_time: 00:00:00.824
Project [sr.idx, `SUM(COUNT)`] est_rows:4 est_select_cost:896 actual_rows: 48 exec_time: 0ms start_time: [00:00:00.449, 00:00:00.823] network_traffic: 0.372000 KB network_time: 0ms
HashGroupBy [SUM(sr.count) AS `SUM(COUNT)`] groups:[sr.idx] actual_rows: 48 exec_time: 16ms start_time: [00:00:00.006, 00:00:00.019] memory_usage: 1,572.864014 KB
Filter [sr.time >= dr.begin_time AND sr.time < dr.end_time AND (CAST(sr.time AS TIME(6)) >= dr.daily_start_time OR dr.daily_start_time IS NULL) AND (CAST(sr.time AS TIME(6)) < dr.daily_end_time OR dr.daily_end_time IS NULL)] actual_rows: 39,456,921 exec_time: 0ms
HashJoin [sr.device = dr.device] actual_rows: 40,429,012 exec_time: 233ms start_time: [00:00:00.004, 00:00:00.016]
|---HashTableBuild shared_build:yes alias:dr actual_rows: 1,185 exec_time: 0ms start_time: 00:00:00.001 memory_usage: 393.216003 KB
|   Project [dr_0.begin_time, dr_0.end_time, dr_0.daily_start_time, dr_0.daily_end_time, dr_0.device] est_rows:448 actual_rows: 1,185 exec_time: 0ms start_time: [00:00:00.000, 00:00:00.001]
|   ChoosePlan
|   |   :estimate
|   |       SELECT COUNT(*) AS cost FROM bench.all_devices_ref WHERE dr_0.group = ?
|   |       SELECT COUNT(*) AS cost FROM bench.all_devices_ref WHERE dr_0.category IN (...)
|   |---Filter [dr_0.end_time > dr_0.begin_time AND dr_0.begin_time < (?!:>datetime(6) NULL) AND dr_0.begin_time < DATE(?) AND dr_0.end_time > (?!:>datetime(6) NULL) AND dr_0.end_time > DATE(?) AND dr_0.category IN (...)] actual_rows: 790 exec_time: 0ms
|   |   IndexRangeScan bench.all_devices_ref AS dr_0, KEY all_devices_ref_group_index (group) scan:[group = ?] table_type:reference_rowstore est_table_rows:573,784 est_filtered:449 actual_rows: 894 exec_time: 0ms start_time: [00:00:00.000, 00:00:00.001]
|   +---Filter [dr_0.end_time > dr_0.begin_time AND dr_0.begin_time < (?!:>datetime(6) NULL) AND dr_0.begin_time < DATE(?) AND dr_0.end_time > (?!:>datetime(6) NULL) AND dr_0.end_time > DATE(?) AND dr_0.group = ?] actual_rows: 395 exec_time: 0ms
|       IndexRangeScan bench.all_devices_ref AS dr_0, KEY all_devices_ref_category_index (category) scan:[category IN (...)] table_type:reference_rowstore est_table_rows:573,784 est_filtered:449 actual_rows: 527 exec_time: 0ms start_time: 00:00:00.001
ColumnStoreFilter [sr.time >= (?!:>datetime(6) NULL) AND sr.time < (?!:>datetime(6) NULL) AND sr.time >= DATE(?) AND sr.time < DATE(?) AND sr.device = dr.device bloom] actual_rows: 40,450,167 exec_time: 99ms start_time: [00:00:00.001, 00:00:00.002] total_rows_in: 1,217,921,159 average_filters_per_row: 1.110044 average_index_filters_per_row: 0.000000 average_bloom_filters_per_row: 0.999311
ColumnStoreScan bench.result AS sr, KEY device (device, time) USING CLUSTERED COLUMNSTORE table_type:sharded_columnstore est_table_rows:1,968,061,528 est_filtered:451,442,123 actual_rows: 1,217,921,159 exec_time: 473ms start_time: 00:00:00.001 memory_usage: 6,291.456055 KB segments_scanned: 1,195 segments_skipped: 734 segments_fully_contained: 0
Compile Total Time: 0ms

---------- query with 1 time zone conversion -------

PROFILE
SELECT idx,
       SUM(COUNT)

   FROM result sr
   JOIN all_devices_ref dr ON sr.device = dr.device
   AND dr.group = ?
   AND dr.category in (?)
  WHERE sr.time >= '2018-12-28'
  AND sr.time < '2020-01-02'
  AND time >= date('2019-01-01') AND time >= dr.begin_time
  AND time < date('2020-01-01') AND time < dr.end_time
  AND (time(time) >= daily_start_time OR daily_start_time IS NULL)
  AND (time(convert_tz(time, 'utc', dr.tz)) < daily_end_time OR daily_end_time IS NULL)
GROUP BY 1

-------- profile with 1 time zone conversion -------

Project [remote_0.idx, `SUM(COUNT)`] est_rows:4 actual_rows: 4 exec_time: 0ms start_time: 00:00:01.893 network_traffic: 0.041000 KB network_time: 0ms
HashGroupBy [SUM(remote_0.`SUM(COUNT)`) AS `SUM(COUNT)`] groups:[remote_0.idx] actual_rows: 4 exec_time: 0ms start_time: 00:00:01.084 memory_usage: 131.072006 KB
Gather partitions:all est_rows:4 alias:remote_0 actual_rows: 48 exec_time: 0ms start_time: 00:00:01.893 end_time: 00:00:01.893
Project [sr.idx, `SUM(COUNT)`] est_rows:4 est_select_cost:896 actual_rows: 48 exec_time: 0ms start_time: [00:00:01.083, 00:00:01.893] network_traffic: 0.372000 KB network_time: 0ms
HashGroupBy [SUM(sr.count) AS `SUM(COUNT)`] groups:[sr.idx] actual_rows: 48 exec_time: 25ms start_time: [00:00:00.004, 00:00:00.020] memory_usage: 1,572.864014 KB
Filter [sr.time >= dr.begin_time AND sr.time < dr.end_time AND (CAST(sr.time AS TIME(6)) >= dr.daily_start_time OR dr.daily_start_time IS NULL) AND (CAST(CONVERT_TZ(sr.time,?,dr.tz) AS TIME(6)) < dr.daily_end_time OR dr.daily_end_time IS NULL)] actual_rows: 39,413,991 exec_time: 0ms
HashJoin [sr.device = dr.device] actual_rows: 40,429,012 exec_time: 710ms start_time: [00:00:00.004, 00:00:00.020]
|---HashTableBuild shared_build:yes alias:dr actual_rows: 1,185 exec_time: 0ms start_time: 00:00:00.001 memory_usage: 786.432007 KB
|   Project [dr_0.begin_time, dr_0.end_time, dr_0.daily_start_time, dr_0.tz, dr_0.daily_end_time, dr_0.device] est_rows:448 actual_rows: 1,185 exec_time: 1ms start_time: [00:00:00.000, 00:00:00.001]
|   ChoosePlan
|   |   :estimate
|   |       SELECT COUNT(*) AS cost FROM bench.all_devices_ref WHERE dr_0.group = ?
|   |       SELECT COUNT(*) AS cost FROM bench.all_devices_ref WHERE dr_0.category IN (...)
|   |---Filter [dr_0.end_time > dr_0.begin_time AND dr_0.begin_time < (?!:>datetime(6) NULL) AND dr_0.begin_time < DATE(?) AND dr_0.end_time > (?!:>datetime(6) NULL) AND dr_0.end_time > DATE(?) AND dr_0.category IN (...)] actual_rows: 790 exec_time: 0ms
|   |   IndexRangeScan bench.all_devices_ref AS dr_0, KEY all_devices_ref_group_index (group) scan:[group = ?] table_type:reference_rowstore est_table_rows:573,784 est_filtered:449 actual_rows: 894 exec_time: 1ms start_time: [00:00:00.000, 00:00:00.001]
|   +---Filter [dr_0.end_time > dr_0.begin_time AND dr_0.begin_time < (?!:>datetime(6) NULL) AND dr_0.begin_time < DATE(?) AND dr_0.end_time > (?!:>datetime(6) NULL) AND dr_0.end_time > DATE(?) AND dr_0.group = ?] actual_rows: 395 exec_time: 0ms
|       IndexRangeScan bench.all_devices_ref AS dr_0, KEY all_devices_ref_category_index (category) scan:[category IN (...)] table_type:reference_rowstore est_table_rows:573,784 est_filtered:449 actual_rows: 527 exec_time: 0ms start_time: 00:00:00.001
ColumnStoreFilter [sr.time >= (?!:>datetime(6) NULL) AND sr.time < (?!:>datetime(6) NULL) AND sr.time >= DATE(?) AND sr.time < DATE(?) AND sr.device = dr.device bloom] actual_rows: 40,450,167 exec_time: 82ms start_time: [00:00:00.001, 00:00:00.002] total_rows_in: 1,217,921,159 average_filters_per_row: 1.110056 average_index_filters_per_row: 0.000000 average_bloom_filters_per_row: 0.999324
ColumnStoreScan bench.result AS sr, KEY device (device, time) USING CLUSTERED COLUMNSTORE table_type:sharded_columnstore est_table_rows:1,968,061,528 est_filtered:451,442,123 actual_rows: 1,217,921,159 exec_time: 1,068ms start_time: 00:00:00.001 memory_usage: 6,291.456055 KB segments_scanned: 1,195 segments_skipped: 734 segments_fully_contained: 0
Compile Total Time: 0ms

So as you can see, nothing else changes than the amount of convert_tz function calls with same parameters. So it seems that function results are not cached? but instead calculated every time function is called.

Unfortunately in our use case, time zone conversions are needed, as data is gathered on multiple time zones and needs to be projected to local time before aggregating. Also devices time zone can change after data has been gathered (might sound weird but there is good reason behind this.) and query results need to update too. That’s why results can’t be recorded in local time (which would solve this problem.)
Otherwise we would need to do a lot of recalculation of the results.

Edit: I’m running benchmark on AWS EKS cluster with 1 aggregator and 3 leaf nodes on 4 m5.2xlarge instances. memsql version is centos-7.0.12-55436a2b96

Any comments on this one?