Thorn partners across the tech industry, government and NGOs, leveraging technology to combat predatory behavior, rescue victims, and protect vulnerable children.
About Eric Boutin
Eric leads an engineering team for MemSQL in our Seattle office. This is background information from Eric on our work with Thorn.
How did you first get connected with Thorn?
I was introduced to Federico Gomez Suarez, a volunteer working with Thorn, by a common friend. I was impressed by the work Thorn was doing, and excited about the opportunity to help them.
What specific technical challenges did you see as opportunities?
Thorn was working on face recognition and machine learning to analyze pictures on the internet in order to protect vulnerable children. The main technical challenge, however, was matching the fingerprint of a picture against the fingerprints of an extremely large number of other pictures. Thorn needs to match a very large number of pictures per second, in real time, against a gigantic database of pictures that is constantly being updated.
What connections were you able to draw to MemSQL capabilities?
The fingerprint matching problem seemed like a natural match for MemSQL. The dataset is too large to fit on one machine, and very high parallelism is required to match pictures in real time. While the process of extracting the fingerprint from an image is extremely complex, the process of matching fingerprints consists of linear operations on vectors. The difficulty here is the vast mountain of changing data that has to be processed in real time, and to me, this looked a lot more like a database problem than just a machine learning problem. More specifically, it seemed like a perfect use case for MemSQL.
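To illustrate why matching reduces to linear operations on vectors, here is a minimal sketch in plain Python. It assumes fingerprints are fixed-length lists of floats and uses Euclidean distance as the similarity measure; the interview does not say which measure Thorn uses, and the function names are hypothetical.

```python
import math

def euclidean_distance(a, b):
    """Distance between two fingerprints of equal length."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def best_match(query, stored):
    """Return (image_id, distance) for the stored fingerprint nearest to query.

    `stored` is a hypothetical in-memory dict of image_id -> fingerprint.
    A real system would scan partitions of a distributed table in parallel.
    """
    return min(
        ((image_id, euclidean_distance(query, vec)) for image_id, vec in stored.items()),
        key=lambda pair: pair[1],
    )
```

Every comparison is the same small arithmetic kernel, so the cost is dominated by how many stored fingerprints must be scanned, which is why the problem looks like a database workload.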
Did you have to develop anything for MemSQL so Thorn could succeed?
Overall, the distributed and parallel architecture of MemSQL was a natural fit for the problem that Thorn needed to solve. The only gap was the ability to do linear algebra operations on vectors in order to match image fingerprints. I added database operators to perform the required linear algebra operations. Given the steep performance requirements, I implemented those operations with the AVX2 instruction set to minimize latency. A few hours later I was able to test real-time fingerprint matching at scale.
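The idea of a vector operator that can be called from SQL can be sketched with Python's built-in sqlite3 module. This is only a stand-in: it uses SQLite rather than MemSQL, stores each fingerprint as a JSON array in a text column, and registers a hypothetical `dot_product` function written in Python rather than a native AVX2 operator. The table and column names are also assumptions.

```python
import json
import sqlite3

def dot_product(a_json, b_json):
    """Dot product of two vectors encoded as JSON arrays."""
    a = json.loads(a_json)
    b = json.loads(b_json)
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return float(sum(x * y for x, y in zip(a, b)))

def open_db():
    """Create an in-memory database with the vector function registered."""
    conn = sqlite3.connect(":memory:")
    conn.create_function("dot_product", 2, dot_product)
    conn.execute(
        "CREATE TABLE fingerprints (image_id INTEGER PRIMARY KEY, vec TEXT NOT NULL)"
    )
    return conn

def top_matches(conn, query_vec, limit=3):
    """Rank stored fingerprints by dot product with the query, highest first."""
    return conn.execute(
        "SELECT image_id, dot_product(vec, ?) AS score "
        "FROM fingerprints ORDER BY score DESC LIMIT ?",
        (json.dumps(query_vec), limit),
    ).fetchall()
```

If the fingerprints are normalized to unit length, the dot product equals cosine similarity, so ordering by score in descending order puts the closest matches first.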
What improvements were made possible for Thorn by using MemSQL?
When we started the project, they didn’t have a solution to the problem of matching image fingerprints. Thorn was investigating a number of approaches, but they had not yet found one that would match image fingerprints in real time. Being able to match fingerprints in real time is enabling them to move forward with the project, which will in turn protect children more effectively.
How might this work apply to other use cases or industries?
The key insight from this project is that by adding basic linear algebra operations to the SQL language, any machine learning system whose models can be evaluated with linear algebra (logistic regression, linear regression, k-means or k-NN using Euclidean or cosine distance) could be evaluated directly in a SQL query. For example, click-through rate prediction is a machine learning problem where a website is trying to predict which advertisement has the highest probability of being clicked on. The problem can be modeled as a logistic regression between one user and a large number of ads, and the ad with the highest probability of being clicked on is picked. Logistic regression actually consists of a simple dot product between the ‘ad’ vector and the ‘user’ vector followed by a few scalar operations. We can imagine applications where the same database is being used for click-through prediction, as well as business intelligence and analytics on the real-time stream of clicks and impressions. In a few lines of SQL the user could express ‘select the ad with the highest predicted click-through rate for a given user, from an advertiser that still has enough budget’. In the same transaction, the application could then deduct money from the advertiser’s budget to account for the impression.
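The ad selection described above can be sketched end to end with Python's sqlite3 module. Everything here is an assumption made for illustration: SQLite stands in for MemSQL, the schema (`ads`, `advertisers`), the `predicted_ctr` function, and the flat per-impression `cost` are all hypothetical, and vectors are stored as JSON text.

```python
import json
import math
import sqlite3

def predicted_ctr(ad_json, user_json):
    """Logistic regression score: sigmoid of the ad/user dot product."""
    ad = json.loads(ad_json)
    user = json.loads(user_json)
    z = sum(a * u for a, u in zip(ad, user))
    return 1.0 / (1.0 + math.exp(-z))

def open_db():
    # isolation_level=None leaves transaction control to explicit BEGIN/COMMIT.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.create_function("predicted_ctr", 2, predicted_ctr)
    conn.execute("CREATE TABLE advertisers (advertiser_id INTEGER PRIMARY KEY, budget REAL)")
    conn.execute("CREATE TABLE ads (ad_id INTEGER PRIMARY KEY, advertiser_id INTEGER, weights TEXT)")
    return conn

def serve_ad(conn, user_vec, cost):
    """Pick the best affordable ad and charge its advertiser in one transaction.

    Returns (ad_id, advertiser_id, ctr), or None if no advertiser can pay.
    """
    conn.execute("BEGIN IMMEDIATE")
    try:
        row = conn.execute(
            "SELECT ads.ad_id, ads.advertiser_id, "
            "       predicted_ctr(ads.weights, ?) AS ctr "
            "FROM ads JOIN advertisers "
            "  ON advertisers.advertiser_id = ads.advertiser_id "
            "WHERE advertisers.budget >= ? "
            "ORDER BY ctr DESC LIMIT 1",
            (json.dumps(user_vec), cost),
        ).fetchone()
        if row is not None:
            conn.execute(
                "UPDATE advertisers SET budget = budget - ? WHERE advertiser_id = ?",
                (cost, row[1]),
            )
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise
    return row
```

Because the selection and the budget update share a transaction, an advertiser whose budget runs out stops being eligible on the very next call.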
What do you see next in terms of new innovations in this arena?
I would like this field to innovate from two different directions. On the one hand, I would like to see databases support more and more algebra primitives to allow expressing more complex machine learning models: for example, matrices, vector/matrix operators, aggregation across multiple vectors, and so on. This would allow expressing a growing number of machine learning algorithms in SQL. Even neural networks can be expressed as a sequence of scalar and vector operations. On the other hand, I would like to see machine learning frameworks ‘push down’ algorithms into databases using SQL. Today, business intelligence tools commonly push down joins and filtering into databases to leverage their high-performance query processing engines. We could see machine learning frameworks push down parts of an algorithm (computing the gradient of the error, for example) as a SQL query in the database engine to process data at scale more effectively.
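As a rough sketch of what pushing down the gradient could look like, the example below fits a one-feature linear regression with gradient descent, where each step asks the database for the gradient of the mean squared error as a single aggregate query. It uses SQLite as a stand-in, and the `samples` table, the learning rate, and the step count are assumptions chosen for the demo, not anything taken from an existing framework.

```python
import sqlite3

# Gradient of mean((w*x + b - y)^2) with respect to w and b, computed in SQL.
GRADIENT_SQL = """
SELECT AVG(2.0 * (:w * x + :b - y) * x) AS grad_w,
       AVG(2.0 * (:w * x + :b - y))     AS grad_b
FROM samples
"""

def make_demo_db():
    """Build a tiny table of points that lie exactly on y = 2x + 1."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE samples (x REAL, y REAL)")
    conn.executemany(
        "INSERT INTO samples VALUES (?, ?)",
        [(float(x), 2.0 * x + 1.0) for x in range(5)],
    )
    return conn

def fit(conn, steps=500, lr=0.05):
    """Gradient descent where only two numbers leave the database per step."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        grad_w, grad_b = conn.execute(GRADIENT_SQL, {"w": w, "b": b}).fetchone()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```

The client only keeps the model parameters; the scan over the training rows stays inside the query engine, which is the part that would benefit from a distributed, parallel database.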