With the release of MemSQL 6.5, the MemSQL team is announcing our latest innovation in data ingestion: Pipelines to stored procedures. By marrying the speed and exactly once semantics of MemSQL Pipelines with the flexibility and transactional guarantees of stored procedures, we’ve eliminated the need for complex transforms and unlocked the doors to a variety of new, groundbreaking use cases. This blog post will walk you through how to use Pipelines to stored procedures to break data streams up into multiple tables, join them with existing data, and update existing rows, but this is just the tip of the iceberg of what is possible with this exciting new functionality.
How Does it Work?
Pipelines to stored procedures augments the existing MemSQL Pipelines data flow by providing the option to replace the default Pipelines load phase with a stored procedure. The default Pipelines load phase only supports simple insertions into a single table, with the data being loaded either directly after extraction or following an optional transform. Replacing this default loading phase with a stored procedure opens up the possibility for much more complex processing, providing the ability to insert into multiple tables, enrich incoming streams using existing data, and leverage the full power of MemSQL Extensibility.
Pipelines Transforms vs. Stored Procedures
Those familiar with MemSQL Pipelines might be wondering, “I already have a transform in my pipeline….should I use a stored procedure instead?” Leveraging both a transform and a stored procedure in the same Pipeline allows you to combine the third-party library support of a traditional transform alongside the multi-insert and data-enrichment capabilities of a stored procedure.
|Traditional Transform||Stored Procedure|
Example Use Cases
Insert Into Multiple Tables
Pipelines to stored procedures now enables you to insert data from a single stream into multiple tables in MemSQL. Consider the following stored procedure:
CREATE PROCEDURE proc(batch query(tweet json)) AS BEGIN INSERT INTO tweets(tweet_id, user_id, text) SELECT tweet::tweet_id, tweet::user_id, tweet::text FROM batch; INSERT INTO users(user_id, user_name) SELECT tweet::user_id, tweet::user_name FROM batch; END
This procedure takes in tweet data in JSON format, separates out the tweet text from the user information, and inserts them into their respective tables in MemSQL. The entire stored procedure is wrapped in a single transaction, ensuring that data is never inserted into one table but not the other.
Load Into Table and Update Aggregation
There are many use cases where maintaining an aggregation table is a more performant and sensible alternative to running aggregation queries across raw data. Pipelines to stored procedures allows you to both insert raw data and update aggregation counters using data streamed from any Pipelines source. Consider the following stored procedure:
CREATE PROCEDURE proc(batch query(tweet json)) AS BEGIN INSERT INTO tweets(tweet_id, user_id, text) SELECT tweet::tweet_id, tweet::user_id, tweet::text FROM batch; INSERT INTO retweets_counter(user_id, num_retweets) SELECT tweet::retweeted_user_id, 1 FROM batch ON DUPLICATE KEY UPDATE num_retweets = num_retweets + 1 WHERE tweet::retweeted_user_id is not null; END
This procedure takes in tweet data as JSON, inserts the raw tweets into a “tweets” table, and updates a second table which tracks the number of retweets per user. Again, the transactional boundaries of the stored procedure ensures that the aggregations in retweets_counter are always in sync with the raw data in the tweets table.
Use Existing Data to Enrich a Stream
It’s also possible to use a stored procedure to enrich an incoming stream using data that already exists in MemSQL. Consider the following stored procedure, which uses an existing MemSQL table to join an incoming IP address batch with existing geo data about its location:
CREATE PROCEDURE proc(batch query(ip varchar, ...)) AS BEGIN INSERT INTO t SELECT batch.*, ip_to_point_table.geopoint FROM batch JOIN ip_to_point_table ON ip_prefix(ip) = ip_to_point_table.ip; END
These use cases only scratch the surface of what is possible using Pipelines to stored procedures. By joining the speed of Pipelines with the flexibility of stored procedures, MemSQL 6.5 gives you total control over all of your streaming data. For more information on the full capabilities of MemSQL Extensibility, please see our documentation.