[binary length for cosine similarity] MemSQL does not allow a binary column length larger than 255? How to do similarity?

I followed the example in Image Recognition at the Speed of Memory Bandwidth, and my feature extractor generates vectors of dimension 1920. However, MemSQL 6 does not seem to support this:

Version 6

CREATE TABLE features (
  id bigint(11) NOT NULL AUTO_INCREMENT,
  features_extracted binary(1920) DEFAULT NULL,
  user_id varchar(256) DEFAULT NULL,
  bbox_id varchar(256) DEFAULT NULL,
  KEY id (id) USING CLUSTERED COLUMNSTORE
);

Error: ERROR 1074 (42000): Column length too big for column 'features_extracted' (max = 255); use BLOB or TEXT instead

Please help.
Steve

I found a solution: use BLOB for the field type and fill it with JSON_ARRAY_PACK, as in the example here:
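For reference, a minimal sketch of the working schema (the extra columns are dropped and the vector shortened here just for readability):

CREATE TABLE features (
  id bigint(11) NOT NULL AUTO_INCREMENT,
  features_extracted blob DEFAULT NULL,
  KEY id (id) USING CLUSTERED COLUMNSTORE
);

-- JSON_ARRAY_PACK turns a JSON array of numbers into the packed
-- 32-bit-float blob that DOT_PRODUCT and EUCLIDEAN_DISTANCE operate on.
INSERT INTO features (features_extracted)
VALUES (JSON_ARRAY_PACK('[0.1, 0.2, 0.3]'));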

Steve

I’m glad you found a solution. What kind of matching are you doing?

Hi Hanson,

I use EUCLIDEAN, but it is not a good fit for my food dataset, so I have to run experiments all over again to find a sweet spot for the threshold (18.0 now). Is COSINE DISTANCE somehow available?

Thanks,
Steve

For COSINE SIMILARITY (CS), use the DOT_PRODUCT() function (SingleStoreDB Cloud · SingleStore Documentation) and normalize all the input vectors to length 1. You can then compute the cosine distance as 1 - CS.
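For example, assuming the stored vectors and the probe vector were both normalized to unit length before being packed (the table and column names here are just illustrative):

-- DOT_PRODUCT of two unit-length vectors is the cosine similarity,
-- so 1 - DOT_PRODUCT is the cosine distance; smallest distance = best match.
select id,
       1 - dot_product(vec, json_array_pack('[0.0, 1.0, 0.0, 0.0]')) as cosine_distance
from images
order by cosine_distance asc
limit 1;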

It would be so cool to have a COSINE_DISTANCE function in MemSQL …

Thanks.
Steve

Why not just use DOT_PRODUCT()? Is the length normalization a problem or something?

Hi, when we use BLOB and save the features in it, we see a decrease in speed: a SELECT over a 6 million row table takes approximately 40 seconds. What can we do to speed this up? Your example in Image Recognition at the Speed of Memory Bandwidth shows the creation of a BINARY(4096) field, but since we got an error creating it, is that not possible?

Sorry, that is a mistake in the blog. I will follow up to get it fixed. Try VARBINARY(4096) instead.

E.g.:

memsql> create table images (id int, vec varbinary(4096));
Query OK, 0 rows affected (0.05 sec)

memsql> insert into images values(1,json_array_pack(concat("[0.5,0.866,0,",rpad('',1020*2,'0,'),"0]")));
Query OK, 1 row affected (0.12 sec)

memsql> insert into images values(2,json_array_pack(concat("[0,1,0,",rpad('',1020*2,'0,'),"0]")));
Query OK, 1 row affected (0.01 sec)

memsql> select t1.id, t2.id, dot_product(t1.vec, t2.vec) as d 
from images t1, images t2 
order by d desc;
+------+------+--------------------+
| id   | id   | d                  |
+------+------+--------------------+
|    2 |    2 |                  1 |
|    1 |    1 | 0.9999560117721558 |
|    1 |    2 | 0.8659999966621399 |
|    2 |    1 | 0.8659999966621399 |
+------+------+--------------------+

Hi Mr. Hanson,

Thanks for your fast reply. We inserted our 512-d feature vectors into varbinary(4096) as you suggested.
We have only 13 million vectors in our table, but the SELECT takes 65-70 seconds.
Let me explain our lifecycle:

  1. We inserted all our 13 million vectors as you suggested with json_array_pack(vector);
  2. We get a new vector from an incoming image (lst) and run:
    "SELECT id, username, DOT_PRODUCT(feature_vector, JSON_ARRAY_PACK(concat('%s'))) as score from FR.our_db order by score desc limit 1;" % (lst)

Maybe we are doing something wrong on our side. The SELECT returns a good result - it shows the row corresponding to our (lst) vector - but it takes 65-70 seconds.

Could you give us any other instructions on how to speed up the process?

Thank you again,

Regards,
Talgat

I tried something similar:

select id, dot_product(t.vec,json_array_pack(concat("[0,1,0,",rpad('',1020*2,'0,'),"0]"))) as d from images t order by d desc limit 1;

and I got about 1 million rows/sec per core processing rate on a laptop, using a rowstore:

create table images(id int, vec varbinary(4096), key(id), shard(id));

Make sure you are sharding in such a way that you get an even distribution of rows across partitions. And make sure you have the same number of partitions per node as you have hardware threads.
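The partition count is fixed at database creation time, e.g. (the database name and the count of 16 here are placeholders; use your own node's thread count):

-- pick (partitions per node) = (hardware threads per node)
create database fr_db partitions 16;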

Also, vectors are interpreted as single-precision, 4 bytes per element, so a 512-d vector needs 2048 bytes. varbinary(2048) might work for you, although it shouldn't change things, since varbinary(4096) can hold 2048 bytes too.
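You can sanity-check the packed size with LENGTH(), which returns the blob's byte count:

-- 3 elements * 4 bytes per element = 12
select length(json_array_pack('[1,2,3]'));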