Delete duplicate rows in column store tables

Hello,

Is there any way I can avoid having duplicate values in COLUMNSTORE tables? (I have a unique ID value for each row.)

When I create the table, I can’t use primary keys, and when I create a pipeline or load data, it doesn’t let me use the DUPLICATE KEY … option.

Any workaround?

Thanks,

Santiago

We are working to enable this feature in a future release. In the meantime, you can create a view that deduplicates rows in the table:

create table t(id int, data text, key(id) using clustered columnstore);
create view dt as select * from t group by id;  -- dt exposes one row per id
insert into t values(1, 'a');
insert into t values(1, 'a');

select * from dt;

Alternatively, be sure to put a unique ID or GUID on every row. Then you can look for duplicates (ignoring the ID or GUID column) and if you find them, you can delete them with a DELETE statement.
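
For example, a minimal sketch of that approach, using an assumed table t2 where row_guid is the per-row unique identifier and (id, data) are the columns being checked for duplicates:

create table t2(row_guid varchar(36), id int, data text, key(id) using clustered columnstore);

-- Find duplicated rows, ignoring the row_guid column:
select id, data, count(*) as copies
from t2
group by id, data
having count(*) > 1;

-- Delete every copy except the one with the smallest row_guid:
delete a
from t2 a
inner join t2 b
  on a.id = b.id
 and a.data = b.data
 and a.row_guid > b.row_guid;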

Or stage the data through a rowstore table to remove duplicates there, and then move it into the columnstore table.
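
A minimal sketch of the staging idea, assuming the default table type is rowstore in this version and that duplicates are rows with a repeated id (names are illustrative):

create table t_stage(id int primary key, data text);  -- rowstore staging table; the primary key enforces uniqueness

insert ignore into t_stage values (1, 'a'), (1, 'a'), (2, 'b');  -- rows with a duplicate id are silently skipped

insert into t select * from t_stage;  -- move the deduplicated rows into the columnstore table t
truncate table t_stage;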

Hello. I have a follow-up question regarding your solution. Could you please describe the deduplication behavior of “select * from t group by id;”? If there are multiple records with the same key (id) but different non-key fields (data), which record will be returned? Will it be the same record every time, or is there a way to ensure a particular one is chosen by using some kind of sort? For example, after “insert into t values(1, 'b');” it appears to be whichever record was inserted first, although I’d guess that it’s non-deterministic and depends on which leaf finishes first.

You’re right, the results are not guaranteed, so a query like that is generally not recommended: which row you get back is not well-defined.

An approach using window functions that returns a deterministic row based on an explicit ordering is:

select *
from (
  select *,
         row_number() over (partition by id order by columns_to_order_by) as rn
  from t
) v
where rn = 1;  -- columns_to_order_by is a placeholder for whatever ordering picks the row to keep
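
For example, the dedup view from earlier in the thread could be redefined this way (ordering by data here is purely illustrative; order by whatever column identifies the row you want to keep):

create or replace view dt as
select id, data
from (
  select *,
         row_number() over (partition by id order by data) as rn
  from t
) v
where rn = 1;  -- deterministically keeps, for each id, the row with the smallest data value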

Assuming you have a timestamp column that shows which record is newest:

DELETE t1
FROM mytable t1
INNER JOIN mytable t2
  ON t1.id = t2.id AND t1.Timestamp < t2.Timestamp;  -- "mytable" is a placeholder for your table name

That will delete everything from the table except the latest record for each Id.

Create a staging table to do this in, then have a procedure that runs the dedupe and moves the data to the long-term table:

DELETE FROM target t WHERE t.Id IN (SELECT Id FROM staging);  -- remove rows that are about to be replaced by the staged data
INSERT INTO target SELECT DISTINCT * FROM staging;            -- DISTINCT drops any duplicates within staging
TRUNCATE staging;                                             -- clear the staging table for the next batch

Everything gets moved over to the target table with no duplicates for end users. I tested this process yesterday, joining 450 million rows onto itself to dedupe down to 215, then doing the compare and delete/insert/truncate. It took about 1 minute on a low-powered cluster and only used 18 GB of memory.

We added support for unique single-column keys for columnstore tables in 7.1, about a year ago. We’re adding support for multi-column unique keys in 7.5, still on track to ship around the end of July. Upserts are also supported.

In some cases these eliminate the need to remove duplicates after the fact, since duplicate key values can never get into the table in the first place.
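
A minimal sketch of what that looks like, assuming 7.1 or later and illustrative table and column names (the exact index syntax can vary by version):

create table t_keyed (
  id int,
  data text,
  key (id) using clustered columnstore,
  shard key (id),
  unique key (id) using hash  -- single-column unique key on a columnstore table
);

insert into t_keyed values (1, 'a');
insert into t_keyed values (1, 'b')
  on duplicate key update data = values(data);  -- upsert: updates the existing row instead of creating a duplicate

select * from t_keyed;  -- returns a single row: (1, 'b')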
