Query with many LIKE clauses makes MemSQL unavailable (100% CPU)


#1

Our clients have the possibility to apply custom filters and one client recently added almost 40 filters making the query quite long and slow.

The interesseting part is that after profiling the basic query and the one with the custom filters, even through the plans a quite different, the filtering part is performant in both causes, but the last ColumnStoreScan part is many times slower for the “complex” query, but it actually scans just as many rows checking the same amount of segments.

What is causing the increased load time? Is the reel filtering first happening in the ColumnStoreScan part?

Basic query:

select `clicks`, `impressions`, `ctr`, `position`, `page_snapshots`.`latest_snapshot_at`, `pages`.* from `pages` inner join (select page, sum(clicks) as clicks, sum(impressions) as impressions, sum(ctr * impressions) / sum(impressions) as ctr, sum(position * impressions) / sum(impressions) as position from `gsc_search_analytics` where `gsc_search_analytics`.`site_id` = 40 and `search_type` = 'web' and `gsc_search_analytics`.`country` is null and `gsc_search_analytics`.`device` is null and `gsc_search_analytics`.`query` is null and `gsc_search_analytics`.`date` between '2018-12-31 00:00:00' and '2019-01-27 00:00:00' and `page` is not null group by `page`) as `gsc_search_analytics` on `pages`.`url` = `gsc_search_analytics`.`page` left join (select page_id, max(created_at) as latest_snapshot_at from `page_snapshots` where `view_id` = 50 group by `page_id`) as `page_snapshots` on `page_snapshots`.`page_id` = `pages`.`id` where `pages`.`site_id` = 40 order by `clicks` desc limit 50 offset 0

Complex query:

select `clicks`, `impressions`, `ctr`, `position`, `page_snapshots`.`latest_snapshot_at`, `pages`.* from `pages` inner join (select page, sum(clicks) as clicks, sum(impressions) as impressions, sum(ctr * impressions) / sum(impressions) as ctr, sum(position * impressions) / sum(impressions) as position from `gsc_search_analytics` where `gsc_search_analytics`.`site_id` = 40 and `country` = 'fin' and `query` not like '%karkkainen%' and `query` not like '%kärkkäinen.com%' and `query` not like '%kärkkäisen verkkokauppa%' and `query` not like '%karkainen%' and `query` not like '%verkkokauppa kärkkäinen%' and `query` not like '%karkkainen%' and `query` not like '%kärkkäinen. fi%' and `query` not like '%kärkköinen%' and `query` not like '%kärkkäinrn%' and `query` not like '%kärkkäinen.com %' and `query` not like '%kärkkäine%' and `query` not like '%j kärkkäinen%' and `query` not like '%kärkkäinem%' and `query` not like '%karkkainen.fi%' and `query` not like '%j.kärkkäinen%' and `query` not like '%kärkkäinn%' and `query` not like '%www.kärkkäinen%' and `query` not like '%kärkkäonen%' and `query` not like '%tavaratalo kärkkäinen%' and `query` not like '%krkkäinen%' and `query` not like '%karkainen%' and `query` not like '%kärkkäine%' and `query` not like '%kärkkäinem%' and `query` not like '%kärkkänen%' and `query` not like '%www.karkkainen.fi%' and `query` not like '%kärkkäisen nettikauppa%' and `query` not like '%nettikauppa kärkkäinen%' and `query` not like '%kärkköinen%' and `query` not like '%kärkkäinrn%' and `query` not like '%kärkläinen%' and `query` not like '%kärkkäinwn%' and `query` not like '%kärkkäimen%' and `query` not like '%kärkkäinen.%' and `query` not like '%käkkäinen%' and `query` not like '%käekkäinen%' and `query` not like '%käekkäinen%' and `query` not like '%kärkkäinem%' and `query` not like '%kärkkänen%' and `search_type` = 'web' and `gsc_search_analytics`.`device` is null and `gsc_search_analytics`.`date` between '2018-12-31 00:00:00' and '2019-01-27 00:00:00' and `page` is not null group by `page`) as `gsc_search_analytics` on `pages`.`url` = `gsc_search_analytics`.`page` left join (select page_id, max(created_at) as latest_snapshot_at from `page_snapshots` where `view_id` = 57 group by `page_id`) as `page_snapshots` on `page_snapshots`.`page_id` = `pages`.`id` where `pages`.`site_id` = 40 order by `clicks` desc limit 50 offset 0;

I tried to merge the many LIKE clauses into a single RLIKE clause, but that made the query even slower:

select `clicks`, `impressions`, `ctr`, `position`, `page_snapshots`.`latest_snapshot_at`, `pages`.* from `pages` inner join (select page, sum(clicks) as clicks, sum(impressions) as impressions, sum(ctr * impressions) / sum(impressions) as ctr, sum(position * impressions) / sum(impressions) as position from `gsc_search_analytics` where `gsc_search_analytics`.`site_id` = 40 and `country` = 'fin' and `query` not rlike 'karkkainen|kärkkäinen.com|kärkkäisen verkkokauppa|karkainen|verkkokauppa kärkkäinen|karkkainen|kärkkäinen. fi|kärkköinen|kärkkäinrn|kärkkäinen.com|kärkkäine|j kärkkäinen|kärkkäinem|karkkainen.fi|j.kärkkäinen|kärkkäinn|www.kärkkäinen|kärkkäonen|tavaratalo kärkkäinen|krkkäinen|karkainen|kärkkäine|kärkkäinem|kärkkänen|www.karkkainen.fi|kärkkäisen nettikauppa|nettikauppa kärkkäinen|kärkköinen|kärkkäinrn|kärkläinen|kärkkäinwn|kärkkäimen|kärkkäinen.|käkkäinen|käekkäinen|käekkäinen|kärkkäinem|kärkkänen' and `search_type` = 'web' and `gsc_search_analytics`.`device` is null and `gsc_search_analytics`.`date` between '2018-12-31 00:00:00' and '2019-01-27 00:00:00' and `page` is not null group by `page`) as `gsc_search_analytics` on `pages`.`url` = `gsc_search_analytics`.`page` left join (select page_id, max(created_at) as latest_snapshot_at from `page_snapshots` where `view_id` = 57 group by `page_id`) as `page_snapshots` on `page_snapshots`.`page_id` = `pages`.`id` where `pages`.`site_id` = 40 order by `clicks` desc limit 50 offset 0;

The JSON profiles can be found here: https://gist.github.com/mpskovvang/00909300c384ce533cb43f1e429f4841


#2

Hello,

Can you please share tables DDL used in these queries.

Thanks
Jaimin


#3

Can you share the result of the following queries, this should help us better understand the performance characteristics of the query:

SELECT Count(DISTINCT query) 
FROM   gsc_search_analytics 
WHERE  `search_type` = 'web' 
       AND `gsc_search_analytics`.`device` IS NULL 
       AND `gsc_search_analytics`.`date` BETWEEN 
           '2018-12-31 00:00:00' AND '2019-01-27 00:00:00' 
       AND `page` IS NOT NULL 
       AND site_id = 40; 

SELECT encoding, 
       Count(*) 
FROM   information_schema.columnar_segments 
WHERE  column_name = 'query' 
       AND table_name = 'gsc_search_analytics' 
GROUP  BY 1;

#4

Thank you @jshah and @rob for having a look.

Table DDL:

CREATE TABLE `gsc_search_analytics` ( `id` bigint(20) unsigned NOT NULL, `site_id` bigint(20) unsigned NOT NULL, `url_id` char(32) CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL, `keyword_id` char(32) CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL, `date` date NOT NULL, `search_type` varchar(255) CHARACTER SET utf8 COLLATE utf8_unicode_ci NOT NULL, `country` varchar(255) CHARACTER SET utf8 COLLATE utf8_unicode_ci DEFAULT NULL, `device` varchar(255) CHARACTER SET utf8 COLLATE utf8_unicode_ci DEFAULT NULL, `page` varchar(2048) CHARACTER SET utf8 COLLATE utf8_unicode_ci DEFAULT NULL, `query` varchar(255) CHARACTER SET utf8 COLLATE utf8_unicode_ci DEFAULT NULL, `clicks` int(10) unsigned NOT NULL, `impressions` int(10) unsigned NOT NULL, `ctr` double NOT NULL, `position` double NOT NULL, KEY `site_id` (`site_id`,`date`,`search_type`,`country`,`device`,`page`,`query`) /*!90619 USING CLUSTERED COLUMNSTORE */, /*!90618 SHARD */ KEY `date` (`date`,`id`) ) /*!90621 AUTOSTATS_ENABLED=TRUE */

Result from first query:

Count(DISTINCT query)
200568

Result from second query:

encoding Count(*)
LZ4 4
StringDictionary 379
String 3

#5

How long does this query take?

SELECT `clicks`, 
       `impressions`, 
       `ctr`, 
       `position`, 
       `page_snapshots`.`latest_snapshot_at`, 
       `pages`.* 
FROM   `pages` 
       INNER JOIN (SELECT page, 
                          Sum(clicks)                                    AS 
                          clicks, 
                          Sum(impressions)                               AS 
                          impressions 
                                         , 
                          Sum(ctr * impressions) / 
                          Sum(impressions)      AS ctr, 
                          Sum(position * impressions) / Sum(impressions) AS 
                          position 
                   FROM   `gsc_search_analytics` with(disable_encoded_filters=1)
                   WHERE  `gsc_search_analytics`.`site_id` = 40 
                          AND `country` = 'fin' 
                          AND `query` NOT LIKE '%karkkainen%' 
                          AND `query` NOT LIKE '%kärkkäinen.com%' 
                          AND `query` NOT LIKE '%kärkkäisen verkkokauppa%' 
                          AND `query` NOT LIKE '%karkainen%' 
                          AND `query` NOT LIKE '%verkkokauppa kärkkäinen%' 
                          AND `query` NOT LIKE '%karkkainen%' 
                          AND `query` NOT LIKE '%kärkkäinen. fi%' 
                          AND `query` NOT LIKE '%kärkköinen%' 
                          AND `query` NOT LIKE '%kärkkäinrn%' 
                          AND `query` NOT LIKE '%kärkkäinen.com %' 
                          AND `query` NOT LIKE '%kärkkäine%' 
                          AND `query` NOT LIKE '%j kärkkäinen%' 
                          AND `query` NOT LIKE '%kärkkäinem%' 
                          AND `query` NOT LIKE '%karkkainen.fi%' 
                          AND `query` NOT LIKE '%j.kärkkäinen%' 
                          AND `query` NOT LIKE '%kärkkäinn%' 
                          AND `query` NOT LIKE '%www.kärkkäinen%' 
                          AND `query` NOT LIKE '%kärkkäonen%' 
                          AND `query` NOT LIKE '%tavaratalo kärkkäinen%' 
                          AND `query` NOT LIKE '%krkkäinen%' 
                          AND `query` NOT LIKE '%karkainen%' 
                          AND `query` NOT LIKE '%kärkkäine%' 
                          AND `query` NOT LIKE '%kärkkäinem%' 
                          AND `query` NOT LIKE '%kärkkänen%' 
                          AND `query` NOT LIKE '%www.karkkainen.fi%' 
                          AND `query` NOT LIKE '%kärkkäisen nettikauppa%' 
                          AND `query` NOT LIKE '%nettikauppa kärkkäinen%' 
                          AND `query` NOT LIKE '%kärkköinen%' 
                          AND `query` NOT LIKE '%kärkkäinrn%' 
                          AND `query` NOT LIKE '%kärkläinen%' 
                          AND `query` NOT LIKE '%kärkkäinwn%' 
                          AND `query` NOT LIKE '%kärkkäimen%' 
                          AND `query` NOT LIKE '%kärkkäinen.%' 
                          AND `query` NOT LIKE '%käkkäinen%' 
                          AND `query` NOT LIKE '%käekkäinen%' 
                          AND `query` NOT LIKE '%käekkäinen%' 
                          AND `query` NOT LIKE '%kärkkäinem%' 
                          AND `query` NOT LIKE '%kärkkänen%' 
                          AND `search_type` = 'web' 
                          AND `gsc_search_analytics`.`device` IS NULL 
                          AND `gsc_search_analytics`.`date` BETWEEN 
                              '2018-12-31 00:00:00' AND '2019-01-27 00:00:00' 
                          AND `page` IS NOT NULL 
                   GROUP  BY `page`) AS `gsc_search_analytics` 
               ON `pages`.`url` = `gsc_search_analytics`.`page` 
       LEFT JOIN (SELECT page_id, 
                         Max(created_at) AS latest_snapshot_at 
                  FROM   `page_snapshots` 
                  WHERE  `view_id` = 57 
                  GROUP  BY `page_id`) AS `page_snapshots` 
              ON `page_snapshots`.`page_id` = `pages`.`id` 
WHERE  `pages`.`site_id` = 40 
ORDER  BY `clicks` DESC 
LIMIT  50 offset 0;

#6

It took 10.68 sec the first time and 4.09 sec the second time, so it’s actually not as bad as first noted.

If I removes the many LIKE clauses the time reduces to 3.35 sec the first time and 2.19 sec the second time.

I also tried to remove the with(disable_encoded_filters=1) part from the original query but it didn’t affect the execution time.

The CPU usage increased to 20-40% from 10% while running these queries. However, executing time is much better than debugging last day. Could this be related to some background processes?

The page view consists of 3 queries (one of them cached per day). The query mentioned in this topic is one of them while the other is this one:

select date(date_trunc('day', date)) as date, min(date) as min, max(date) as max, sum(clicks) as clicks, sum(impressions) as impressions, sum(ctr * impressions) / sum(impressions) as ctr, sum(position * impressions) / sum(impressions) as position from `gsc_search_analytics` where `gsc_search_analytics`.`site_id` = 40 and `country` = 'fin' and `query` not like '%karkkainen%' and `query` not like '%kärkkäisen verkkokauppa%' and `query` not like '%karkainen%' and `query` not like '%kärkköinen%' and `query` not like '%kärkkäinrn%' and `query` not like '%kärkkäine%' and `query` not like '%kärkkäinem%' and `query` not like '%kärkkäinn%' and `query` not like '%kärkkäonen%' and `query` not like '%krkkäinen%' and `query` not like '%kärkkäine%' and `query` not like '%kärkkäinem%' and `query` not like '%kärkkänen%' and `query` not like '%kärkköinen%' and `query` not like '%kärkkäinrn%' and `query` not like '%kärkläinen%' and `query` not like '%kärkkäinwn%' and `query` not like '%kärkkäimen%' and `query` not like '%käkkäinen%' and `query` not like '%käekkäinen%' and `query` not like '%käekkäinen%' and `query` not like '%kärkkäinem%' and `query` not like '%kärkkäinen%' and `search_type` = 'web' and `gsc_search_analytics`.`device` is null and `gsc_search_analytics`.`date` between '2019-01-01 00:00:00' and '2019-01-28 00:00:00' and `gsc_search_analytics`.`page` is null group by rollup (date);

Note: The above query is slighty different because we had our client removing some of the filters.

Both queries executes approx. with the same time.

Perhaps the queries are fine and we just needs to add more leafs/aggregators to our cluster? We’re currently running a single 4 vCPU + 15 GB RAM node with a default cluster-in-a-box setup.


#7

Can you share DDL script pages table ?

Thanks
Jaimin


#8

The query may be getting better segment elimination now as the background merger has sorted the data better. You can confirm this by taking a look at the segments_scanned/segments_eliminated attributes in the profile.

It looks like the best way to further improve performance will be to add more leaf CPU. If you find you’re having further performance issues with LIKE you could try putting a fulltext index on the query column.


#9

The DDL for pages:

CREATE TABLE `pages` ( `id` bigint(20) unsigned NOT NULL, `project_id` bigint(20) unsigned NOT NULL, `site_id` bigint(20) unsigned NOT NULL, `user_id` bigint(20) unsigned DEFAULT NULL, `url_id` char(32) CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL, `url` varchar(4096) CHARACTER SET utf8 COLLATE utf8_unicode_ci NOT NULL, `title` text CHARACTER SET utf8 COLLATE utf8_unicode_ci DEFAULT NULL, `body` longtext CHARACTER SET utf8 COLLATE utf8_general_ci, `meta_description` text CHARACTER SET utf8 COLLATE utf8_unicode_ci DEFAULT NULL, `meta_keywords` text CHARACTER SET utf8 COLLATE utf8_unicode_ci DEFAULT NULL, `meta_canonical` text CHARACTER SET utf8 COLLATE utf8_unicode_ci DEFAULT NULL, `meta_index` tinyint(1) DEFAULT NULL, `meta_follow` tinyint(1) DEFAULT NULL, `keyphrases` JSON COLLATE utf8_bin, `status` varchar(50) CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL, `optimized_at` datetime DEFAULT NULL, `crawled_at` datetime DEFAULT NULL, `created_at` datetime DEFAULT NULL, `updated_at` datetime DEFAULT NULL, PRIMARY KEY (`id`) )

and DDL for page_snapshots:

CREATE TABLE `page_snapshots` ( `id` bigint(20) unsigned NOT NULL, `project_id` bigint(20) unsigned NOT NULL, `view_id` bigint(20) unsigned NOT NULL, `page_id` bigint(20) unsigned NOT NULL, `user_id` bigint(20) unsigned DEFAULT NULL, `payload` JSON COLLATE utf8_bin DEFAULT NULL, `note` text CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL, `created_at` datetime DEFAULT NULL, `updated_at` datetime DEFAULT NULL, PRIMARY KEY (`id`), KEY `page_snapshots_page_id_view_id_created_at_index` (`page_id`,`view_id`,`created_at`) )

The FULLTEXT index is a great suggestion. I’ll diffenently give it a try and see if it performs better.

I just profiled the query again and the ColumnStoreScan has reduced to 2,61 sec. The segments_scanned has reduced to 26 and segments_skipped is now 350. Is segments_eliminated the same as segments_skipped?

I just noted the slowest part is the TableScan on the pages table because there’s no index on the site_id column so it stick to the PRIMARY key. I’ll add this index as well and see how this affects performance.

Thanks for your help so far. I’ll follow up when the changes has been made.


#10

The index on site_id in pages table made the plan use IndexRangeScan instead of TableScan and reduced executing time to 3,89 sec from 7,1 sec. The red indicator on the Visual Explain is gone. :+1:

Excited to see if the FULLTEXT index is just as efficent.


#11

It’s awesome you were able to get this figured out. Feel free to reach out any time again if you have further questions.