MemSQL slow deletes on columnstore Database

Hey there,

I am testing a columnstore table with 1 crore (10 million) rows that has the following structure:

create table tags
(
    id int not null,
    sub_id int,
    key_name varchar(150) not null,
    key_value varchar(501),
    appid varchar(50),
    KEY (`appid`,`key_name`) USING CLUSTERED COLUMNSTORE,
    SHARD KEY (`id`)
);

I need to delete rows from the tags table. I tried deleting 1000 rows at once with 1000 OR conditions, using the following query:

DELETE FROM tags WHERE ( appid='XXXX' AND key_name='XXXX' ) OR ( appid='YYYY' AND key_name='YYYY' ) OR ....

It took 102 seconds to complete.

I also tried deleting rows individually; 1000 individual deletes took 7.7 seconds.
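Each individual delete was roughly of this form (the values are placeholders):

DELETE FROM tags WHERE appid = 'XXXX' AND key_name = 'XXXX';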

My questions are:

  1. Why is it taking so long to delete when it fetches data with lightning speed?
  2. Is there any other way to delete rows more efficiently?

Thanks in advance.

For the 102 second case, what part of it was compile time vs. execution time?

If your predicates always have appid equal to key_name, then try this:

DELETE FROM tags WHERE appid=key_name AND appid IN ('XXX', 'YYY', ...);

I think this should compile faster and probably execute faster as well (an IN list can be optimized more easily than multiple OR conditions). Also, unlike multiple OR conditions, if you modify the IN list the query won’t need to be recompiled.

Also, as @hanson hinted, query time includes both compilation and execution time. Compilation is a one-time cost per new query; if you run a query twice, the second run doesn’t incur the cost of compilation. More complex queries take longer to compile.

The first time I executed the query it took around 120 seconds; from the next run onwards it takes around 102 seconds. So the 102 seconds should be the execution time, I hope.

Both appid and key_name contain different data, so I can’t use the IN clause. And for the query mentioned above, execution is taking almost 102 seconds while compilation takes around 8 seconds.

Consider also a multi-table, set-oriented DELETE statement, based on this syntax:

DELETE tbl_name FROM table_references
    [WHERE expr]
    [LIMIT row_count]

Like this:

create table tmp(appid varchar(50), key_name varchar(150));
-- insert the constants from the long OR condition into table tmp
insert into tmp values ('const1_1', 'const1_2'), ('const2_1', 'const2_2'), ...;
delete tags from tags, tmp
where tags.appid = tmp.appid and tags.key_name = tmp.key_name;

Also, MemSQL 7.0 (currently in beta) has improved performance for finding individual rows in a columnstore. That might help too.

Note that for each row deleted, MemSQL 6.8 must scan a 1M row chunk (segment) to find the row before marking it deleted. Depending on the query, MemSQL 7.0 may be able to seek into the segment to find the row (a.k.a. “sub-segment access”) and not have to scan the whole thing. Hash indexes on columnstores in 7.0 can help as well but they only work on single columns so that may not be useful to you since your lookups are on two columns. You could conceivably put the concatenated value of both columns in one column and put a hash index on it, but that would require some potentially complex changes to your app.
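A rough sketch of that workaround could look like the following (tags_concat and appid_key are made-up names, and this assumes a secondary hash index is allowed on a varchar column):

create table tags_concat
(
    id int not null,
    sub_id int,
    key_name varchar(150) not null,
    key_value varchar(501),
    appid varchar(50),
    -- hypothetical column holding CONCAT(appid, ':', key_name), kept in sync by the app
    appid_key varchar(201),
    KEY (appid, key_name) USING CLUSTERED COLUMNSTORE,
    KEY (appid_key) USING HASH,
    SHARD KEY (id)
);

-- deletes would then filter on the single concatenated column, e.g.:
-- DELETE FROM tags_concat WHERE appid_key IN ('app1:key1', 'app2:key2');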

Row stores with regular skip list indexes on the lookup columns of course can delete data much faster.
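For comparison, a rowstore version of the table with an ordinary skip list index on the lookup columns would be a sketch like this (same columns as the original tags table are assumed; keep in mind rowstore tables are held in memory):

create table tags_rowstore
(
    id int not null,
    sub_id int,
    key_name varchar(150) not null,
    key_value varchar(501),
    appid varchar(50),
    KEY (appid, key_name),   -- rowstore indexes are skip lists by default
    SHARD KEY (id)
);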

Thanks hanson, the first solution gave better performance. I want to try the second solution too, but I was unable to create a HASH index on the columnstore.
I tried creating an index like

create unique index test_hash ON tags2 (id) USING HASH;

but I am getting an error:

ER_MEMSQL_FEATURE_LOCKDOWN: Feature 'INDEX on COLUMNAR table' is not supported by MemSQL.

You need to install the MemSQL 7.0 beta to try secondary indexes on columnstores.

Yes, we tried to create the secondary index in MemSQL 7.0 itself and are still getting the error.

We do not support unique indexes in 7.0.
In beta 1, we also do not support alter table add index or create index on an existing table (although they will be supported in our next beta and in 7.0 GA).

So for now you will have to create a new table with a secondary hash index, and dump in your data.
Try something like

create table t (a int, b int, key(a) using clustered columnstore, key(b) using hash)
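For example, assuming a new table (say, tags_new) that keeps the same columns as the original tags table and adds the hash key at create time, the existing rows could be copied over with an INSERT ... SELECT:

-- copy the existing rows into the new table that was created
-- with the secondary hash index already in place
INSERT INTO tags_new (id, sub_id, key_name, key_value, appid)
SELECT id, sub_id, key_name, key_value, appid
FROM tags;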

There is also a known bug in beta 1 in the interaction between columnstore secondary indexes and ALTER. Do not run any ALTER on a table with a secondary hash index.

The documentation for secondary columnstore hash indexes is at:

The query is not able to scan using the secondary hash index we created.

Our table definition:

create table tags2
(
    id int not null,
    sub_id int,
    key_name varchar(150) not null,
    key_value_string varchar(501),
    appid varchar(50),
    KEY (appid, key_name, key_value_string, sub_id) USING CLUSTERED COLUMNSTORE,
    KEY (id) USING HASH
);

The query we are running:

delete from tags2 where id in ( 1000_IDS );

Profiling of the query:

PROFILE
Delete partitions:all actual_rows: 0
    Filter [tags2.id IN (…)] actual_rows: 0 exec_time: 2,135ms start_time: [00:00:00.003, 00:00:00.026]
        ColumnStoreScan testdb.tags2, KEY appid (appid, key_name, key_value_string, sub_id) USING CLUSTERED COLUMNSTORE actual_rows: 8,399,872 exec_time: 22ms start_time: [00:00:00.000, 00:00:00.006] memory_usage: 10,223.616211 KB segments_scanned: 9 segments_skipped: 7 segments_fully_contained: 0

It’s taking a long time on the Filter, and it is not using the HASH index we created.

From your profile output, it seems like the hash index is working.

segments_skipped: 7

This is telling you that 7 segments were skipped. A segment may be skipped by segment elimination or by the hash index filter. For your specific query, segment elimination probably is not going to work, so the hash index is the only explanation for those segments being eliminated.
We are currently (as of beta 1) not showing any direct information about the use of secondary hash indexes in profile or explain output, so it’s kind of tricky. But in general, if you are seeing a lot of segments being skipped and your filter does not match the sort key, then the secondary hash index is probably being used properly.

If you have 1000 elements in your IN clause, then we have to do 1000 secondary hash index seeks and compute their union. This is probably not going to be faster than doing a full table scan.