Thorn’s child sex trafficking investigations tool, Spotlight, gathers information from escort sites to help law enforcement find trafficked children, fast. (A Special Agent for the Wisconsin Human Trafficking Task Force describes Spotlight this way: “It is the greatest tool we have in the fight against human trafficking.”) And using MemSQL is one of the ways they do it. MemSQL is a powerful solution that meets Thorn’s requirements, including SQL support; fast query response time; support for machine learning and AI; a large and scalable number of simultaneous users; and horizontal scale-out. Also, MemSQL runs just about anywhere, notably including on-premises installations and all the major public clouds.
Still, Thorn had a business problem. As a tech non-profit, they are highly skilled at identifying and making tradeoffs that will allow their small team to deliver the biggest impact. In a constantly shifting digital environment, they know they need to focus on keeping Spotlight agile, to help find victims faster. So they need to keep the operation and maintenance of Spotlight as simple and easy to manage as possible.
In support of this strategy, Thorn is moving to MemSQL Helios, the fully managed, on-demand, and elastic cloud database from MemSQL. Where MemSQL 7.0 meets Thorn’s database needs, MemSQL Helios meets Thorn’s operational needs – removing work from Thorn’s development and operations personnel, and leaving it in the hands of MemSQL.
Peter Parente, data engineer at Thorn, puts it well: “We want to focus our time on building the application for our mission, rather than managing every detail of exactly how the data is going to be stored.” Now, Thorn can focus on growing Spotlight to meet the needs of its users and fulfill its mission: to build technology to defend children from sexual abuse.
What Thorn Delivers
As Thorn describes it, new technologies can be used by abusers to facilitate abuse – and, thankfully, the same new technologies can be leveraged to stop this abuse. Thorn leverages data to find trafficked children faster, building technology to create a world where every child can be safe, curious, and happy.
There are more than 150,000 escort ads posted daily across the US, totaling in the millions of ads a year – and, somewhere in that mountain of data, children are being sold for sex. Thorn’s research shows that 63% of child sex trafficking survivors were advertised online at some point. Harnessing that data, Spotlight is offered for free to users who are involved in actively investigating child sex trafficking cases.
When Thorn started several years ago, they only focused on a few problematic sites and online sources. Now, the number of sites with child sex trafficking content is increasing, and the user base for Spotlight has grown. Thorn is a strong example of the need that so many organizations have for nearly limitless scalability and concurrent access.
“As time passes, we have greater data complexity. More data to store and more users that need to analyze that data,” says Parente. “There are more sites, and some of the sites have added features that increase the data flowing in from them as well.”
But even as the demands increase, so does Thorn’s effectiveness. Spotlight has helped to identify more than 10,000 trafficked children; on average, eight children a day are identified with it. Thorn is also proud of having cut law enforcement investigation time by as much as 63%, or nearly two-thirds. (Thorn also educates people on these topics; more than 3.5 million teens have learned to identify and prevent sextortion – extortion focused on nude images of the victim – through Thorn projects.)
To work effectively, such a system needs to meet a number of technical requirements:
- Fast ingest and fast processing. The system must process a site of interest quickly, find matches in minutes rather than hours, and synthesize results for users so they are easier to analyze.
- Fully scalable. Thorn needs to be able to speed up or extend the system by adding capacity in a horizontal, linear fashion.
- Fast query response time. As with finding matches and reporting, query response time must be fast – seconds, not minutes or hours.
- High concurrency. Thorn needs to be able to support an ever-increasing number of signed-in recipients and interrogators of its data from a small computing footprint, with full scalability to meet new demands.
In addition, Thorn identified two business-oriented requirements, to allow them to fulfill their specific mission most effectively:
- Low-maintenance. Thorn needs to spend as much of their engineering time as possible improving Spotlight by expanding its feature set. No one but Thorn can do this work. Building a reliable, flexible data pipeline to support their solution therefore needs to be as hassle-free and worry-free as possible. By making the core system as low-maintenance as possible, Thorn frees up their technical talent for this vital work.
- Stateless. Thorn quickly identified Kubernetes as a core element of any solution. Kubernetes is very good for managing stateless components; stateful support has recently been added, but it’s still somewhat of a work in progress, and it will always be more complex than managing the stateless parts. So Thorn sought to keep its solution stateless in as many components as possible, if not all of them.
How MemSQL Helios Helps Thorn Succeed
Thorn built a tool that meets all their requirements:
- Thorn finds new or updated content on targeted websites.
- The content is placed in an Amazon Simple Storage Service (S3) bucket.
- A scalable Python data pipeline using the Dramatiq library (similar to Celery) receives notifications of new text and media content in S3 via Amazon Simple Queue Service (SQS) and processes it.
- The data pipeline stores the processed, transformed data in MemSQL Helios for exploration in the Spotlight application.
- Trained investigators look for key details that indicate a child trafficking victim, to build their case, and to locate the most vulnerable victims.
MemSQL Helios sits at the heart of the system. “It’s currently our primary data store,” according to Parente.
Using Machine Learning and AI to Facilitate Identifications
Thorn uses MemSQL’s Euclidean distance function for computing image similarity, resulting in very high throughput rates for image comparisons. The process is described in detail in this blog post from MemSQL co-CEO Nikita Shamgunov: MemSQL as a Data Backbone for Machine Learning and AI.
The slide below shows the use of this function. Thorn has previously worked with MemSQL on advances in machine learning for image recognition.
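As a rough illustration of the idea, the sketch below computes Euclidean distance between image feature vectors in plain Python. It is a stand-in for the database function, not Thorn's or MemSQL's code: the helper names (`pack_vector`, `euclidean_distance`, `nearest`) and the choice to store each vector as a little-endian float32 byte string are assumptions made for this example.

```python
import math
import struct


def pack_vector(values):
    """Pack floats into a little-endian float32 byte string (assumed storage format)."""
    return struct.pack(f"<{len(values)}f", *values)


def unpack_vector(blob):
    """Inverse of pack_vector: bytes back to a tuple of floats."""
    return struct.unpack(f"<{len(blob) // 4}f", blob)


def euclidean_distance(blob_a, blob_b):
    """Straight-line distance between two packed vectors of equal dimension."""
    a, b = unpack_vector(blob_a), unpack_vector(blob_b)
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(query_blob, candidates):
    """Return (image_id, distance) for the candidate closest to the query vector."""
    scored = (
        (image_id, euclidean_distance(query_blob, blob))
        for image_id, blob in candidates.items()
    )
    return min(scored, key=lambda pair: pair[1])
```

A smaller distance means the two feature vectors, and so presumably the two images, are more alike. Running the comparison inside the database rather than in application code avoids moving every candidate vector over the network.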
Using Amazon SQS as a Data Pipeline
Thorn uses Amazon S3 and SQS as the input source for their data pipeline. Many other MemSQL customers have used Kafka in similar situations. (We recently published a case study featuring the Kafka-plus-MemSQL architecture from a major technology services company.) But Thorn finds Amazon SQS easier to maintain and manage.
According to Parente, “Our data is not currently delivered to S3 in a streaming fashion. It’s more a set of micro batches. We don’t currently have a need for the streaming support that you typically see associated with Kafka.”
“We rely on SQS to provide us with the notifications we need, as data is delivered into our S3 buckets,” continues Parente. “When we receive a notification, our data pipeline runs a set of machine learning models and natural language processing annotators before storing the results in MemSQL for use by our application.”
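The notification step Parente describes might look something like the sketch below, which assumes S3 is configured to send its standard event notifications directly to the queue. It uses only the Python standard library, so the SQS polling itself is left out; `annotate` and `store` are hypothetical callables standing in for Thorn's models and for the database write, and all function names here are invented for illustration.

```python
import json
from urllib.parse import unquote_plus


def objects_from_notification(message_body):
    """Extract (bucket, key) pairs for newly created objects from one message body."""
    event = json.loads(message_body)
    found = []
    # Test events sent when a notification is first configured have no "Records".
    for record in event.get("Records", []):
        if not record.get("eventName", "").startswith("ObjectCreated:"):
            continue
        s3 = record["s3"]
        # Object keys arrive URL-encoded in S3 event notifications.
        found.append((s3["bucket"]["name"], unquote_plus(s3["object"]["key"])))
    return found


def process_batch(message_bodies, annotate, store):
    """Run each notified object through annotate() and hand the result to store()."""
    processed = 0
    for body in message_bodies:
        for bucket, key in objects_from_notification(body):
            store(bucket, key, annotate(bucket, key))
            processed += 1
    return processed
```

Because each message names the objects that just arrived, a micro-batch can be handled as soon as its notification is received, without a streaming platform in between.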
MemSQL Helios Helps Thorn Achieve Statelessness
Why has Thorn chosen MemSQL Helios, rather than self-managed MemSQL software, which they could install and run on AWS themselves? The main reason is to focus their technical resources on other areas. Every hour saved in database administration is an hour freed up for work that will speed up an investigator’s process, providing timely insights and aggregating information across time and space to find child victims faster.
The features of MemSQL Helios lend themselves to Thorn’s needs. Thorn has designed their system in such a way as to offload software maintenance and management to the greatest degree possible, using Kubernetes as their management tool for most of the system, and MemSQL Helios – which is built on Kubernetes, and managed using it – as their core database.
Kubernetes was originally developed for stateless services, and Thorn built their data pipeline (above) to be as stateless as possible. Parente says, “The pipeline workers are all stateless. If we fail processing some input data, the pipeline simply retries the input from S3 at some point in the future. Our processing is idempotent.”
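A minimal sketch of what idempotent, stateless processing can mean in practice follows. It is not Thorn's implementation: an in-memory SQLite table stands in for the data store, the table and function names are hypothetical, and the retry happens immediately in a loop, whereas Thorn's pipeline retries at some later point.

```python
import hashlib
import sqlite3


def open_store():
    """In-memory SQLite table standing in for the real data store."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE results (object_key TEXT PRIMARY KEY, digest TEXT NOT NULL)"
    )
    return conn


def process(conn, object_key, payload):
    """Derive a result from the input alone and write it keyed by the input's identity."""
    digest = hashlib.sha256(payload).hexdigest()
    with conn:  # commits on success, rolls back if the write raises
        conn.execute(
            "INSERT OR REPLACE INTO results (object_key, digest) VALUES (?, ?)",
            (object_key, digest),
        )


def process_with_retry(conn, object_key, fetch, attempts=3):
    """Re-read the input and try again on failure; nothing is carried between attempts."""
    for attempt in range(1, attempts + 1):
        try:
            process(conn, object_key, fetch(object_key))
            return attempt
        except OSError:
            if attempt == attempts:
                raise
```

Because the write is keyed on the input object and replaces any earlier row, processing the same input twice leaves the table in the same state as processing it once. That property is what makes it safe for any worker to pick up a retry.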
More recently, Kubernetes has added features for managing stateful software. To make these features work, stateful software such as MemSQL (or any database) requires a Kubernetes Operator, which serves as an interface between the database and Kubernetes. MemSQL has created a Kubernetes Operator and uses it for managing MemSQL Helios. MemSQL customers are also using this Operator in their own development efforts.
Thorn could have used the MemSQL Operator to integrate self-managed MemSQL software into their Kubernetes management framework. Instead, they chose MemSQL Helios. “So in some way,” Parente continues, “the indirect answer to the question, ‘Are we depending on the stateful features of Kubernetes?,’ is ‘Yes – but indirectly, through Helios.’” Thorn maintains their stateless management framework by leaving the management of stateful software – their MemSQL database – to MemSQL, the company, through Helios.
“One of the reasons we’re using MemSQL Helios is to offload having to manage that stateful data store,” continued Parente. “If we weren’t using Helios, and instead hosting our own database, we would be responsible for scaling it on Kubernetes, making sure data is retained after nodes restart, repartitioning data to take advantage of new nodes, and so on.”
Thorn defers to other industry-leading experts for its other data store. “For S3, Amazon is managing the complexity,” says Parente. “The files are written, and then we assume that S3 works as advertised.”
The same questions arise for both technologies, Parente adds: “Are we sure it’s backed up? Is it going to scale? We want to offload that onto other vendors, including AWS and MemSQL. That’s time better spent for our mission-oriented work. We focus more on how we build out our system, or surface the processed information to our users in the best available fashion.”
This approach allows Thorn to work more closely with their users, improve the system to meet user needs, and get data out to them in the way they need it, in the formats and with the timeliness they need to prioritize the identification of child sex trafficking victims.
As Julie Cordua, CEO of Thorn, has said: “MemSQL is delivering a real impact for our organization by making real-time decisions and predictive analytics easier. And, because it easily scales to support our machine learning and AI needs, MemSQL helps us continually build better tools to find victims of trafficking and sexual abuse, faster. It is a true case of technology being applied in a way that will make a real difference in people’s lives.”