Delete duplicate rows in column store tables


#1

Hello,

Is there any way I can avoid to have duplicate values on COLUMNSTORE tables? (I have a unique ID value for each row)

When I create the table, I can’t use primary keys, and when I create a pipeline or load data, it doesn’t let me use the option of DUPLICATE KEY …

Any work around?

Thanks,

Santiago


#2

We are working to enable this feature in the future releases. In the meantime you can create a view that dedups rows in this table

create table t(id int, data text, key(id) using clustered columnstore);
create view dt as select * from t group by id;
insert into t values(1, ‘a’);
insert into t values(1, ‘a’);

select * from dt;


#3

Alternatively, be sure to put a unique ID or GUID on every row. Then you can look for duplicates (ignoring the ID or GUID column) and if you find them, you can delete them with a DELETE statement.

Or stage the data through a row store table to remove duplicates there, then after that, move to a columnstore table.


#4

Hello. I have a follow-up question regarding your solution. Could you please describe the deduplication behavior of “select * from t group by id;”? If there are multiple records with the same key (id) but different non-key fields (data), which record will be returned? Will it be the same record every time, or is there a way to ensure that it is using some kind of sort? For example, “insert t values(1, ‘b’);” It appears to be whatever record is inserted first, although I’d guess that it’s non-deterministic and depends on which leaf finishes first.


#5

You’re right, the results are not guaranteed, and so using a query like that is not recommended in general because the results are not well-defined.

An approach using window functions that ensures it returns a deterministic value based on ordering is:

select * from (select *, row_number() over (partition by id order by columns_to_order_by) as rn from t) v where rn = 1