Machine Learning (ML) and Artificial Intelligence (AI) have stirred the technology sector into a flurry of activity over the past couple of years.
However, it is important to remember that it all comes back to data. As Hilary Mason, a prominent data scientist, noted in Harvard Business Review*,
…you can’t do AI without machine learning. You also can’t do machine learning without analytics, and you can’t do analytics without data infrastructure.
Over the last year we assembled a number of blogs, videos, and presentations on using ML and AI with MemSQL.
MemSQL with ML and AI
As a foundational datastore, MemSQL incorporates machine learning functions in one of three ways:
- Calculations outside the database
- Calculations on ingest
- Calculations within the database
Read an overall summary in our blog post Machine Learning and MemSQL from Rick Negrin.
Outside the Database
Two popular ways to run ML and AI outside the database are integrations with Spark and TensorFlow.
For Spark, MemSQL offers an open source MemSQL Spark Connector, which delivers high-throughput, bi-directional, highly parallel operations from partition to partition. This connector opens up a wide range of ML and AI possibilities that can be combined with a scalable, durable datastore from MemSQL.
One example of this integration is real-time machine learning scoring. A typical pipeline might be:
IIoT Collection > Kafka > Spark > MemSQL > Queries
For example, some popular statistical software packages can export models as PMML, the Predictive Model Markup Language. These or similar models can be exported into Spark, or even the database itself, to score incoming data in real time. From there, the datapoint and the model’s score for that datapoint can be persisted together for easy analysis.
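To make the idea of persisting a datapoint together with its score concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not MemSQL's or PowerStream's implementation: the logistic model, its weights, the field names, and the table layout are all hypothetical, and the standard-library sqlite3 module stands in for the datastore so the example runs anywhere.

```python
import math
import sqlite3

# Hypothetical model parameters. In practice these would come from a
# trained model, for example one exported as PMML.
WEIGHTS = {"vibration": 1.8, "temperature": 0.04}
BIAS = -6.0


def score(reading):
    """Return a failure probability in (0, 1) from a simple logistic model."""
    z = BIAS + sum(WEIGHTS[name] * reading[name] for name in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))


def persist_scored(conn, reading):
    """Score one reading and store the reading and its score in the same row."""
    failure_score = score(reading)
    conn.execute(
        "INSERT INTO readings (turbine_id, vibration, temperature, failure_score) "
        "VALUES (?, ?, ?, ?)",
        (reading["turbine_id"], reading["vibration"],
         reading["temperature"], failure_score),
    )
    return failure_score


# sqlite3 is only a stand-in datastore for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE readings ("
    "turbine_id TEXT, vibration REAL, temperature REAL, failure_score REAL)"
)

healthy = {"turbine_id": "t-001", "vibration": 0.5, "temperature": 60.0}
stressed = {"turbine_id": "t-002", "vibration": 2.5, "temperature": 85.0}
for reading in (healthy, stressed):
    persist_scored(conn, reading)
conn.commit()
```

Because the score is stored in the same row as the raw reading, a later query can filter or aggregate on failure_score without recomputing the model.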
MemSQL PowerStream is a specific example of combining machine learning with a database to score data in real time and predict the likelihood of an equipment failure.
An interactive demonstration using MemSQL PowerStream simulates 200,000 wind turbines sending sensor information at a rate of approximately 2 million inserts per second to MemSQL. From there, the user interface shows live status on real-time information, and ML scoring predicts the likelihood of turbine failures.
Read more in our blog IoT at Global Scale: PowerStream Wind Farm Analytics with Spark.
TensorFlow, and the results of applying machine learning models in TensorFlow, are shown in the video Scoring Machine Learning Models at Scale from Strata New York.
ML On Ingest
Another place to run machine learning is on ingest, as data arrives in the database. MemSQL enables this with native pipelines from Kafka, including exactly-once semantics, a critical capability for deduplication in event-driven pipelines. Further, the pipeline ingest capability includes the option of executing a custom transformation.
A typical real-time data pipeline in this scenario might be:
Kafka > MemSQL > Query/Visualization
MemSQL UI for Custom Transformations
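A custom transformation can be pictured as a small program that sits between the stream and the table. The Python sketch below assumes the transform receives one JSON event per line and emits one tab-separated row per event; that framing, the field names (sensor_id, temp_f), and the overheating threshold are hypothetical, and the exact input and output format MemSQL expects from a transform may differ.

```python
import io
import json

OVERHEAT_C = 90.0  # hypothetical threshold for flagging a reading


def transform_record(line):
    """Turn one JSON event into a tab-separated row with derived columns."""
    event = json.loads(line)
    temp_c = (float(event["temp_f"]) - 32.0) * 5.0 / 9.0
    overheated = 1 if temp_c > OVERHEAT_C else 0
    return "\t".join([str(event["sensor_id"]), f"{temp_c:.2f}", str(overheated)])


def run(stream_in, stream_out):
    """Read events line by line and write one transformed row per event."""
    for line in stream_in:
        line = line.strip()
        if line:
            stream_out.write(transform_record(line) + "\n")


# In-memory streams stand in for the pipeline's input and output here.
# A standalone transform script would call run(sys.stdin, sys.stdout) instead.
incoming = io.StringIO(
    '{"sensor_id": "t-1", "temp_f": 212}\n'
    '{"sensor_id": "t-2", "temp_f": 68}\n'
)
outgoing = io.StringIO()
run(incoming, outgoing)
```

The same hook is where a lightweight model could be applied, so that each row lands in the table already carrying its derived features or score.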
Within the Database
The third area for ML integration with databases is within the database itself. This occurs with a couple of secret ingredients.
The first enabler for machine learning within a database is extensibility, or more specifically the inclusion of stored procedures, user-defined functions, and user-defined aggregates. These functions allow for customizations that can quickly and easily execute elements of an overall machine learning pipeline.
Another enabler is the inclusion of popular algebraic operations such as DOT_PRODUCT, which can take two vectors and quickly compare them for similarity. DOT_PRODUCT can also easily compare one vector against hundreds of millions of vectors to deliver a degree of similarity.
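The arithmetic behind this kind of comparison is simple. The sketch below is plain Python illustrating the math, not MemSQL's DOT_PRODUCT implementation; the helper names and the toy vectors are hypothetical. It relies on the fact that, for vectors scaled to unit length, the dot product equals the cosine similarity, so a larger value means a closer match.

```python
import math


def dot_product(a, b):
    """Sum of element-wise products of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(dot_product(v, v))
    return [x / norm for x in v]


def most_similar(query, candidates, top_k=1):
    """Rank named candidate vectors by dot product with the query vector."""
    q = normalize(query)
    scored = [
        (dot_product(q, normalize(vec)), name)
        for name, vec in candidates.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return scored[:top_k]


# Toy feature vectors; real image features would have many more dimensions.
candidates = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
best = most_similar([2.0, 0.1], candidates, top_k=2)
```

Comparing one query against many stored vectors is the same calculation repeated per row, which is why it fits naturally into a query that orders rows by similarity.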
Thorn, an organization dedicated to ending child sex trafficking and the sexual exploitation of children, relies on image recognition to help fulfill its mission.
Through the use of DOT_PRODUCT within MemSQL, Thorn was able to dramatically improve its real-time image recognition capabilities. Read more in An Engineering View on Real-Time Machine Learning, or watch the video from AWS re:Invent 2017, Business and Life-Altering Solutions Through AI and Image Recognition.
Beyond image recognition, DOT_PRODUCT has applicability in a wide range of use cases.
For example, we demoed K-means clustering on YouTube tags within this presentation from Gartner Catalyst 2017, The Data Warehouse Blueprint for ML, AI, and Hybrid Cloud. Skip to slide 23 for the demonstration section.
Another example comes in this presentation from Gartner Data and Analytics 2018 in Texas, Building a Machine Learning Recommendation Engine in SQL. Skip ahead to slide 48 for the demonstration.
If you’d like to read more, feel free to check out MemSQL 6 Product Pillars and Machine Learning Approach. This post includes more on:
- Built-in machine learning functions
- Real-time machine learning scoring
- Machine learning in SQL with extensibility
And if you would like to see where 1,600 IT professionals think the industry is headed with Machine Learning and Artificial Intelligence, see the 2018 Outlook: Machine Learning and Artificial Intelligence Survey. Developed in conjunction with O’Reilly Media, this survey captures the general sentiment and industry trends around ML and AI.
Of course, for more information, or to see some of the above demonstrations in action, feel free to book a demo with MemSQL anytime!
*Harvard Business Review: https://hbr.org/2017/07/how-ai-fits-into-your-data-science-team