Any good public resources on how to use MemSQL as a plagiarism detector?

mpskovvang · August 30, 2019, 2:54pm

I would like to learn more about MemSQL’s vector features in my research to build a plagiarism detection tool.

My initial thought is to use some word/sentence embedding like word2vec and store these vectors in a table.

If anyone has some great resources, I would really appreciate a link or just some good pointers. I haven’t been playing with vectors or text embedding yet, but I’m curious to learn.

Thanks!

@Jacky

hanson · August 30, 2019, 5:35pm

You’ve probably seen it, but here’s the documentation for our vector functions: SingleStoreDB Cloud · SingleStore Documentation. It covers the available functions and how to insert vectors from an application as binary fields. For convenience, JSON_ARRAY_PACK can also be used to convert JSON arrays of floats to vectors.
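As a rough illustration of the two insertion routes mentioned above, here is a minimal Python sketch. It assumes the binary vector format is a run of little-endian 32-bit floats, which is my reading of the documentation and should be checked against it; `pack_vector` and `unpack_vector` are hypothetical helper names, not part of any MemSQL client.

```python
import json
import struct

def pack_vector(values):
    """Pack floats as consecutive little-endian 32-bit floats (assumed format)."""
    return struct.pack(f"<{len(values)}f", *values)

def unpack_vector(blob):
    """Inverse of pack_vector: read back 4-byte floats from a binary blob."""
    count = len(blob) // 4
    return list(struct.unpack(f"<{count}f", blob))

vec = [0.5, -1.25, 3.0]

# Route 1: bind these bytes directly to a binary column from the application.
blob = pack_vector(vec)

# Route 2: send JSON text and let the database convert it with JSON_ARRAY_PACK(...).
as_json = json.dumps(vec)
```

The JSON route is easier to debug by eye, while binding the packed bytes avoids parsing JSON on every insert.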

I don’t have anything else more specific to offer. I have heard of one other customer prospect who was going to try fuzzy full text search over a product catalog, using a word vector model from a pre-trained deep neural network like word2vec (not sure which one). The idea was to create a vector for each product description, convert a set of query words to a vector, and then do a cosine similarity match with DOT_PRODUCT. That way, if you searched for “cat beds” you still might find “dog beds” (ranked high) even if no cat beds were for sale. The word vectors for a product description or query were to be averaged together and then normalized to length 1 before the similarity matching.
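The approach described above can be sketched in a few lines of Python with numpy. The 3-dimensional word vectors below are made up purely for illustration (a real setup would load them from a pre-trained model), and `embed` and `rank` are hypothetical helper names.

```python
import numpy as np

# Made-up 3-d word vectors for illustration only; real ones would come
# from a pre-trained model such as word2vec.
WORD_VECTORS = {
    "cat":     np.array([0.9, 0.1, 0.0]),
    "dog":     np.array([0.8, 0.2, 0.0]),
    "beds":    np.array([0.0, 0.9, 0.1]),
    "collars": np.array([0.1, 0.0, 0.9]),
}

def embed(text):
    """Average the word vectors of a text and scale the result to length 1."""
    vectors = [WORD_VECTORS[w] for w in text.lower().split() if w in WORD_VECTORS]
    if not vectors:
        raise ValueError(f"no known words in {text!r}")
    mean = np.mean(vectors, axis=0)
    return mean / np.linalg.norm(mean)

def rank(query, descriptions):
    """Score each description by dot product with the query, best first."""
    q = embed(query)
    scored = [(float(np.dot(q, embed(d))), d) for d in descriptions]
    return sorted(scored, reverse=True)

results = rank("cat beds", ["cat collars", "dog beds"])
```

Because every vector is scaled to length 1, the dot product here is the cosine similarity. With these toy vectors, "dog beds" scores higher than "cat collars" for the query "cat beds".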

Jacky · August 30, 2019, 8:04pm

Hey Martin!

Great seeing you again

I do remember @bvincent worked on something w/r/t word vectors for indexing and searching data in his personal project. I think he could definitely give you some insight there (though perhaps not specific to what you are doing).

Let me keep an ear out on who else I could share your question with too.

bvincent · August 30, 2019, 8:18pm

Hi Jacky, I think you may have me confused with someone else. Unless I have a personal project like that I’m not aware of!

Jacky · August 30, 2019, 8:30pm

Oops! So sorry @bvincent, I meant to tag someone else. Haha.

mpskovvang · August 30, 2019, 9:24pm

Thanks @hanson and @Jacky

I just tried playing around with BERT from Google AI, which lets me embed whole sentences. In the few basic tests I ran, I was able to store the vector arrays with json_array_pack and then do a distance calculation with euclidean_distance. The results already show that sentences closer in meaning get a lower distance, as expected. Great starting point!

I’ll study some more and do some more complex tests another day.

Jacky · August 30, 2019, 9:48pm

That sounds like a cool project. Do share your results when you are ready. I feel like that would make an excellent blog post. We’re always excited to hear about different use cases of MemSQL, especially ones like these!

Good luck, and feel free to email me if you have any specific eng questions.

hanson · August 31, 2019, 12:00am

Can you post a link to BERT documentation?

mpskovvang · August 31, 2019, 3:51pm

The official repository and documentation can be found at GitHub google-research/bert.

However, I used the excellent and easy-to-use hanxiao/bert-as-service. It took only 5 minutes to install and set up with a pre-trained model from the Google AI team.

pbaylies · September 1, 2019, 1:22am

Hi @mpskovvang – although @hanson did a nice job of outlining what I had done with the product, here are a few notes / tips. I used the 100-dimensional GloVe word vectors from the Magnitude project, just got vectors for all the words I was using and put them in a table. Then I’d average word vectors together both for my search queries and for my targets (basically short lists of keywords / attributes). And finally, I’d just run queries using the dot product, aka DOT_PRODUCT(), between the search table with its precomputed vectors and the search query vector passed in via JSON_ARRAY_PACK(). That’s basically all it took; if you need certain words to appear, you could combine it with a full text search or check for the presence of keywords. Good luck!
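A minimal sketch of what such a query could look like when built from Python. The table `targets` and its columns `id`, `keywords`, and `vec` are hypothetical, the `%s` placeholders assume a MySQL-style driver, and the keyword check uses a plain LIKE rather than full text search. This is one possible shape, not the query actually used in the project above.

```python
import json

# Hypothetical schema: table `targets` with an `id`, a `keywords` text
# column, and a `vec` binary column holding a precomputed, normalized vector.
TOP_N_SQL = """
SELECT id, keywords, DOT_PRODUCT(vec, JSON_ARRAY_PACK(%s)) AS score
FROM targets
WHERE keywords LIKE %s
ORDER BY score DESC
LIMIT %s
"""

def build_top_n_query(query_vector, required_word, n):
    """Return the SQL text and the parameters to bind for a top-n search."""
    params = (json.dumps(list(query_vector)), f"%{required_word}%", n)
    return TOP_N_SQL, params

sql, params = build_top_n_query([0.1, 0.2], "beds", 5)
```

The pair can then be handed to a driver's `cursor.execute(sql, params)`. Ordering by score descending and limiting to n rows gives the top-n behaviour without needing a similarity threshold.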

mpskovvang · September 1, 2019, 5:50pm

@pbaylies thanks for sharing! Great idea to combine vectors and full text in a single query to refine the results.

How did you determine the threshold for a good match/candidate? Did you calculate an average score based on samples from the data set beforehand?

Did you compare DOT_PRODUCT against EUCLIDEAN_DISTANCE scoring? I believe I read that DOT_PRODUCT is equivalent to a cosine similarity calculation. In plagiarism detection, Euclidean distance should give better results, but the best distance calculation surely depends on the case.

pbaylies · September 2, 2019, 12:40pm

@mpskovvang In my use case, I was only interested in the top n results, so in practice I didn’t need a threshold. Note that Euclidean distance ranks results the same way as cosine similarity, provided that both vectors are normalized first. Feel free to compare using both, but DOT_PRODUCT worked fine for me.
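To make the relationship concrete: for unit-length vectors a and b, the squared Euclidean distance equals 2 - 2 * (a · b), so the two scores are not numerically equal but sorting by either one gives the same order. A small numpy check on random vectors (the dimension, seed, and helper name are arbitrary choices for illustration):

```python
import numpy as np

def normalize(v):
    """Scale a vector to unit L2 length."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

rng = np.random.default_rng(0)
query = normalize(rng.normal(size=8))
candidates = [normalize(rng.normal(size=8)) for _ in range(5)]

dots = np.array([np.dot(query, c) for c in candidates])
dists = np.array([np.linalg.norm(query - c) for c in candidates])

# For unit vectors: squared distance = 2 - 2 * dot product.
identity_holds = np.allclose(dists ** 2, 2 - 2 * dots)

# Highest dot product first and smallest distance first give the same order.
same_order = list(np.argsort(-dots)) == list(np.argsort(dists))
```

If the vectors are not normalized, this identity no longer holds and the two measures can rank candidates differently.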

mpskovvang · September 3, 2019, 5:27am

@pbaylies thanks for the clarification. I’m not sure if my vectors are normalized or not. I still have a lot to learn about vectors. I’ll try out both distance calculations and see how they compare.

pbaylies · September 3, 2019, 12:22pm

They’re normalized if their squared values sum to 1, i.e. the vector has length 1! The Euclidean (or L2) norm is the square root of the sum of the squared vector values. You’d calculate that number and then divide the vector by it; in numpy you can use numpy.linalg.norm to calculate the norm.
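A short numpy sketch of that check and the normalization step. The helper names and the tolerance are illustrative choices, not anything prescribed by numpy or MemSQL.

```python
import numpy as np

def is_normalized(v, tol=1e-6):
    """True if the L2 norm of v is 1 within a small tolerance."""
    return abs(np.linalg.norm(v) - 1.0) < tol

def normalize(v):
    """Divide a vector by its L2 norm so that it has length 1."""
    v = np.asarray(v, dtype=float)
    norm = np.linalg.norm(v)
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return v / norm

raw = np.array([3.0, 4.0])   # L2 norm is sqrt(9 + 16) = 5
unit = normalize(raw)        # components become 0.6 and 0.8
```

Running `is_normalized` on a few stored embeddings is a quick way to find out whether a model's output already has unit length.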

janetracy · July 8, 2020, 9:47am

Hello everyone,
this question even got mentioned in the article https://memsql38.rssing.com/chan-73464100/all_p8.html, so maybe a solution will come up eventually.