Operationalizing Spark with MemSQL

Gary Orenstein

In Short:
Combining the data processing prowess of Spark with a real-time database for transactions and analytics, where both are memory-optimized and distributed, leads to powerful new business use cases. Links to the MemSQL Spark Connector are at the end of this post.

Data Appetite and Evolution

Our generation of, and appetite for, data continues unabated. This drives a critical need for tools to quickly process and transform data. Apache Spark, the new memory-optimized data processing framework, fills this gap by combining performance, a concise programming interface, and easy Hadoop integration, all leading to its rapid popularity.

However, Spark itself does not store data outside of processing operations. This explains why, although respondents to a recent survey of over 2,000 developers chose Spark to replace MapReduce, 62% still load data into Spark from the Hadoop Distributed File System. It is also why Tachyon, a forthcoming memory-centric distributed file system, can be used as storage for Spark.

But what if we could tie Spark’s intuitive, concise, expressive programming capabilities closer to the databases that power our businesses? That opportunity lies in operationalizing Spark deployments, combining the rich advanced analytics of Spark with transactional systems-of-record.

Introducing the MemSQL Spark Connector

To meet enterprise needs for deploying and using Spark, MemSQL introduced the MemSQL Spark Connector for high-throughput, bi-directional data transfer between a Spark cluster and a MemSQL cluster. Since Spark and MemSQL are both memory-optimized, distributed systems, the MemSQL Spark Connector benefits from cluster-wide parallelization for maximum performance and minimal transfer time. The MemSQL Spark Connector is available as open source on GitHub.

MemSQL Spark Connector Architecture

There are two main components of the MemSQL Spark Connector that allow Spark to query from and write to MemSQL.

  • A MemSQLRDD class for loading data from a MemSQL query
  • A saveToMemsql function for persisting results to a MemSQL table
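
The read-then-write round trip these two components enable can be sketched in plain Python. This is a rough illustration only, not the connector's API: sqlite3 stands in for a MemSQL cluster, a Python list stands in for a Spark RDD, and the helper names `load_rows` and `save_rows`, as well as the `pageviews` and `page_counts` tables, are hypothetical.

```python
import sqlite3

# Stand-ins (assumptions): sqlite3 plays the role of the MemSQL cluster and a
# plain Python list plays the role of a Spark RDD. These helper names are
# hypothetical and are not part of the MemSQL Spark Connector.

def load_rows(conn, query):
    """Read side: run a query and return its rows (the role MemSQLRDD plays)."""
    return conn.execute(query).fetchall()

def save_rows(conn, table, rows):
    """Write side: persist (page, views) rows (the role saveToMemsql plays)."""
    conn.execute(f"CREATE TABLE IF NOT EXISTS {table} (page TEXT, views INTEGER)")
    conn.executemany(f"INSERT INTO {table} VALUES (?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pageviews (page TEXT, user_id INTEGER)")
conn.executemany(
    "INSERT INTO pageviews VALUES (?, ?)",
    [("/home", 1), ("/home", 2), ("/pricing", 1)],
)

# 1. Load data from a query.
rows = load_rows(conn, "SELECT page, user_id FROM pageviews")

# 2. Transform it (in a real deployment this step would run in Spark).
counts = {}
for page, _user_id in rows:
    counts[page] = counts.get(page, 0) + 1

# 3. Persist the results to a table so SQL tools can query them.
save_rows(conn, "page_counts", sorted(counts.items()))
result = load_rows(conn, "SELECT page, views FROM page_counts ORDER BY page")
```

In a real deployment the load and save steps would run in parallel across the Spark and MemSQL clusters; the sketch shows only the direction of data flow.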

Figure 1: MemSQL Spark Connector Architecture

This high-performance connection between MemSQL and Spark enables several relevant use cases for today’s Big Data, high-velocity environments.

Spark Use Cases:

  • Operationalize models built in Spark
  • Stream and event processing
  • Extend MemSQL Analytics
  • Live dashboards and automated reports

Operationalizing Spark often involves another system. MemSQL, with its performance, scale, and enterprise fit, provides significant consolidation by supporting several types of Spark deployments.

Operationalize Models Built in Spark

In this use case, data flows into Spark from a specified source, such as Apache Kafka, and models are created in Spark. The result sets of those models can be immediately persisted in MemSQL as one or multiple tables, where an entire ecosystem of SQL-based business intelligence tools can consume them. This rapid process allows data teams to go to production and iterate faster.

Figure 2: Operationalize models built in Spark

Stream and Event Processing

In the same survey of Spark developers, 67% of users need Spark for event stream processing, and streaming can also benefit from a persistent database.

A typical use might be to capture and structure event data on the fly, such as that from a high-traffic website. While the event stream may include a bevy of information about overall site and system health, as well as user behavior, it makes sense to structure and classify that data before passing it to a database like MemSQL in a persistent, queryable format.

Processing the stream in Spark, and passing to MemSQL, enables developers to

  • Use Spark to segment event types
  • Send each event type to a separate MemSQL table
  • Immediately query real-time data across one or multiple tables in MemSQL
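
The three steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not connector code: sqlite3 stands in for MemSQL, a small in-memory list stands in for the event stream, and the event fields, the `events_<type>` table naming, and both helper functions are hypothetical.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample events; a real stream would arrive continuously.
events = [
    {"type": "click", "user_id": "u1", "detail": "/home"},
    {"type": "error", "user_id": "u2", "detail": "500"},
    {"type": "click", "user_id": "u2", "detail": "/pricing"},
]

def segment_by_type(events):
    """Step 1: group events by their type field."""
    buckets = defaultdict(list)
    for event in events:
        buckets[event["type"]].append((event["user_id"], event["detail"]))
    return buckets

def write_segments(conn, buckets):
    """Step 2: write each event type to its own table."""
    for event_type, rows in buckets.items():
        # Table names cannot be bound as SQL parameters, so validate them.
        if not event_type.isidentifier():
            raise ValueError(f"unsafe event type: {event_type!r}")
        table = f"events_{event_type}"
        conn.execute(
            f"CREATE TABLE IF NOT EXISTS {table} (user_id TEXT, detail TEXT)"
        )
        conn.executemany(f"INSERT INTO {table} VALUES (?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")  # stand-in for a MemSQL connection
write_segments(conn, segment_by_type(events))

# Step 3: query one table, or several tables at once.
clicks = conn.execute(
    "SELECT user_id, detail FROM events_click ORDER BY detail"
).fetchall()
totals = conn.execute(
    "SELECT 'click', COUNT(*) FROM events_click "
    "UNION ALL SELECT 'error', COUNT(*) FROM events_error"
).fetchall()
```

Keeping one table per event type lets each table carry only the columns relevant to that type, while a `UNION ALL` or join still allows queries that span several types.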

Figure 3: Stream and event processing

Extend MemSQL Analytics

Another use case brings extended functionality to MemSQL users who need capabilities beyond what can be offered natively with SQL or JSON. As MemSQL is typically the system-of-record for primary applications, it holds the freshest data for analysis. To extend MemSQL analytics, users can

  • Set up a replicated cluster providing clear demarcation between operations and analytics teams
  • Give Spark access to live production data for the most recent and relevant results
  • Allow Spark to write result sets back to the primary MemSQL cluster to put new analyses into production

Figure 4: Extend MemSQL Analytics

Live Dashboards and Automated Reports

Many companies run dashboards using SQL analytics, and the opportunity to do that in real time with the most recent data provides a differentiating advantage.

There are, of course, advanced reports that cannot be easily accomplished with SQL. In these cases, Spark can easily access live production data on the primary operational datastore to deliver custom real-time reports using the most relevant data.

Figure 5: Live dashboards and automated reports

For more information on the MemSQL Spark Connector please visit:

GitHub Site for MemSQL Spark Connector

MemSQL Technical Blog Post

MemSQL free 30-day trial

Get The MemSQL Spark Connector Guide

The 79-page guide covers how to design, build, and deploy Spark applications using the MemSQL Spark Connector. Inside, you will find code samples to help you get started and performance recommendations for your production-ready Apache Spark and MemSQL implementations.