How to optimize a query with JOINs and ORDER BY?


#1

Is it possible to optimize a query with both JOINs and ORDER BY as long as the ORDER BY clause only targets columns from the first table?

This is probably better explained with an example:

select
  keywords.*
from
  keywords
  inner join site_keyword on keywords.id = site_keyword.keyword_id and site_keyword.site_id = 1
order by
  keywords.volume desc,
  keywords.id asc
limit
  50 offset 0;

In MemSQL Studio the above query executes in approx. 1 sec (Visual Explain says 130 ms).

If the ORDER BY clause is skipped the query executes in approx. 75 ms (Visual Explain says 5 ms):

select
  keywords.*
from
  keywords
  inner join site_keyword on keywords.id = site_keyword.keyword_id and site_keyword.site_id = 1
limit
  50 offset 0;

We tried with an index like alter table keywords add index (volume desc, id); but without any notable improvement.

Can anything be done to improve the first query? The keywords table contains approx. 35M rows while the site_keyword for this particular query contains approx. 350.000 rows.

The index works great without the JOIN clause with an incredible speed of approx. 75 ms (Visual Explain says 2 ms).


#2

could it be that it is trying to sort the keywords table before the JOIN?
Can you post an explain?
Maybe try to re-write it like this:

with temp as (
select
keywords.*
from
keywords
inner join site_keyword on keywords.id = site_keyword.keyword_id and site_keyword.site_id = 1
)
select * from temp
order by
temp.volume desc,
temp.id asc
limit
50 offset 0;


#3

@zmeidav thanks for your suggestion. Unfortunately, it seems to slow it down a bit further.

Apparently, it sorts in the end of the execution:

Top offset:0 limit:50
GatherMerge [remote_0.volume DESC, remote_0.id] partitions:all est_rows:50 alias:remote_0
Project [keywords.id, keywords.keyword, keywords.location, keywords.language, keywords.volume, keywords.cpc, keywords.cmp, keywords.trends, keywords.visibility, keywords.indexed_at, keywords.estimated_at, keywords.suggestions_at, keywords.created_at, keywords.updated_at] est_rows:50 est_select_cost:692,630
TopSort limit:[?] [keywords.volume DESC, keywords.id]
NestedLoopJoin
|---IndexSeek laravel.keywords, PRIMARY KEY (id) scan:[id = r0.keyword_id] est_table_rows:35,258,462 est_filtered:35,258,462
TableScan r0 storage:list stream:yes est_table_rows:346,315
Repartition [site_keyword.keyword_id] AS r0 shard_key:[keyword_id] est_rows:346,315
IndexRangeScan laravel.site_keyword, PRIMARY KEY (site_id, keyword_id) scan:[site_id = 1] est_table_rows:7,790,339 est_filtered:346,316

From another look at the profile it seems like the repartition (~120 ms) is the bottleneck when ordering with the joined table. Without the ordering part the repartition only takes ~5 ms. The execution time might not be affected by the ordering at all but instead by the repartition. Without the ordering it can probably just return after fetching the first 50 rows while the ordering has to repartition all data in order to complete the query.


#4

What is the shard key definitions on your tables? I think if you could convert it to local join it will work faster as it will not have to repartition?


#5

The shard key is currently site_id, keyword_id (the same as the primary key).

The keywords table is sharded by id.

I guess you’re right a local join would be faster. A reference table is unfortunately not an option, but perhaps the query can be joined locally by swapping the columns in the shard key or skipping the site_id column? It is worth a try.

At least the documentation mentions Local/Collocated Distributed Table Join which is probably what I need.

Try to make the shard key a subset of the joined columns. (Optimizing Table Data Structures)

I’ll post the results after testing the swapped shard key.


#6

i think if you use shard key just as keyword_id, it will do a local join.
Would be interesting to know how it behaves afterwards.