Creating an IoT Kafka Pipeline in Under Five Minutes

Seth Luersen
Seth Luersen

In a recent MemSQL webcast, we discussed how modern enterprises can easily adopt new data management tools to manage data size, growth, and complexity. Then we demonstrated how to use Apache Kafka and MemSQL to build interactive, real-time data pipelines. Data pipelines capture, process, and serve massive amounts of data to millions of users.

During the webcast we also shared how to:

Build new data pipelines with modern tools

Enable data workflows to support machine learning and predictive analytics with data pipelines

Deploy a real-time data pipeline using Apache Kafka and MemSQL Pipelines in under five minutes

We received additional questions from the webcast attendees, and wanted to share the answers publicly.

Question: Is MemSQL a NoSQL database or a Relational Database Management System (RDBMS)?
MemSQL is a modern, in-memory optimized, massively parallel processing, shared-nothing, real-time database. MemSQL stores data in tables, and supports standard SQL data types. Geospatial and JSON data types are first-class citizens in MemSQL. With MemSQL, you can store and query structured, semi-structured, and unstructured data.

Question: What are the minimum memory requirements for MemSQL?
MemSQL is a distributed system consisting of one or more nodes. You can find out more about system and hardware requirements in our documentation.

Question: Is loading JSON into MemSQL similar to MongoDB?
Behind the scenes, MongoDB represents JSON documents in a binary-encoded format called BSON. BSON is also the wire format for MongoDB. JSON is a first-class citizen in MemSQL. The JSON data type validates JSON. Behind the scenes, MemSQL stores validated JSON as text. For those who want to preserve BSON in MemSQL, the supported SQL data type is VARBINARY or any variation thereof: LONGBLOB, MEDIUMBLOB, BLOB, or TINYBLOB.

Question: What is the infrastructure requirement for running MemSQL and Apache Kafka?
Like Apache Kafka, MemSQL is a distributed system consisting of one or more nodes that run as a cluster of nodes. At a minimum, the infrastructure requirement is for a stand-alone Apache Kafka producer and broker, and a stand-alone MemSQL cluster consisting of one master aggregator and one leaf node. In a production environment, there are two clusters: an Apache Kafka cluster and MemSQL cluster.

A MemSQL Pipeline ingests streaming data from an Apache Kafka topic into MemSQL leaf nodes by default. MemSQL leaf nodes contain individual database partitions. Each database partition stores the data from the Kafka stream into its own destination table. Parallelism between the number of MemSQL database partitions and Kafka broker partitions for the given topic determines optimal performance.

Question: In order to digest data from Apache Kafka, are there any consumer concepts in MemSQL?
Apache Kafka follows a more traditional design, shared by most messaging systems, where data is pushed to the broker from the producer and pulled from the broker by the consumer. In a pull-based system, the consumer can catch up when the system falls behind. A MemSQL Pipeline for Apache Kafka uses a pipeline extractor for Kafka. This extractor is a Kafka consumer.

Question: Does a MemSQL Pipeline using the Apache Kafka extractor only support real-time ingestion into a MemSQL “rowstore” table?
With a MemSQL Pipeline, you can massively ingest data in parallel into distributed tables. In MemSQL, a table is either distributed or non-distributed (reference table). There are two storage types for tables: in-memory rowstore and columnstore. All columnstore tables have an unexposed, in-memory rowstore table. MemSQL automatically spills rows from the in-memory rowstore to columnstore. All data, including the hidden rowstore table, is queryable for the columnstore table.

Question: Is it possible to move data from an in-memory rowstore table to a columnstore table?
Yes. It is as simple as:

INSERT INTO columnstore_table  SELECT * FROM rowstore_table;

Question: In this webinar demonstration, where is this data being ingested from and is it pre-generated data?
The webinar demo showcases a MemSQL Pipeline for Apache Kafka. MemSQL hosts the Apache Kafka cluster. A Python program generates data and writes it to a Kafka Producer for the adtech topic. The MemSQL Pipeline consumes this topic from the Kafka broker endpoint.

Question: What happens if you need to adjust or change your data schema?
You can modify a table in MemSQL with a data definition language (DDL) ALTER TABLE… statement. By default, MemSQL supports an online ALTER statement. Since MemSQL Pipelines are written with DDL, you can also alter a MemSQL Pipeline with an ALTER PIPELINE… statement. Typically, the procedure to handle schema changes is simply:

STOP PIPELINE mypipeline;
ALTER TABLE mytable… ;
ALTER PIPELINE mypipeline…;
TEST PIPELINE mypipeline;
START PIPELINE mypipeline;

Question: Will you please provide a code example of how a MemSQL Pipeline for Apache Kafka transforms a JSON message?
Here is a simple example of JSON in a Kafka message:
Written in Python, here is a basic transform script that extracts the id field from JSON:

#!/usr/bin/env python 
import struct
import sys
import json
def transform_records():
    while True:
        byte_len = sys.stdin.read(8) 
        if len(byte_len) == 8:
            byte_len = struct.unpack("L", byte_len)[0] 
            result = sys.stdin.read(byte_len)
            yield result
for l in transform_records():
    parsed_json = json.loads(l)
    sys.stdout.write("%s\t%s\n" % (parsed_json["id"], l))

Question: How is it possible to persist complex master detail records with foreign keys with a MemSQL Pipeline?
MemSQL 6 does not enforce foreign key constraints, does not support triggers, and a MemSQL Pipeline only supports loading data into a single table. Recently, a MemSQL engineer demoed a beta release of MemSQL 6.5 in which a MemSQL Pipeline can load data into a stored procedure! This architecture allows for conditional logic within the stored procedure to manage complex scenarios such as data ingest into related tables. To learn more about streaming data into stored procedures with MemSQL Pipelines, see Recapping An Evening with MemSQL Engineering.

Question: Usually the data format for data in Apache Kafka is in a binary form such as Apache Avro, so how does MemSQL support user-defined decoding?
MemSQL Pipelines support data ingest that is in either a CSV or TSV data format. One way to ingest compressed Avro data from a Kafka topic is to create a data pipeline with Apache Spark. Spark Streaming allows Spark to consume a Kafka topic directly. Using the MemSQL Spark Connector, it is possible to decode the binary formats and save data directly into MemSQL. You can learn more about working with Spark in our MemSQL Spark Connector Guide.

Another approach is to use an Avro to JSON converter. Once converted, the Kafka message is essentially a JSON blob. In a MemSQL Pipeline, you can easily transform JSON using any Linux supported API that efficiently parses JSON. MemSQL Pipelines also supports jq, a lightweight and flexible command-line JSON processor. To learn more, read JSON Streaming and the Future of Data Ingest.

JSON is a first-class citizen in MemSQL. With built-in JSON functions, it is possible to parse JSON key-values into persisted, computed columns. MemSQL supports indexing of computed columns. With MemSQL, you can easily index and parse JSON with standard SQL.

Question: Does MemSQL handle back-pressure automatically?
A large Apache Kafka cluster in production is capable of delivering messages at a very fast rate, in order of millions of messages per second. For use cases where there is high volume and a high rate, many API-driven consumers reach a point where they cannot keep pace and fall behind, a behavior known as back-pressure.

Because MemSQL is a modern, in-memory optimized, massively parallel processing, shared-nothing, real-time database, a MemSQL Pipeline for Apache Kafka can easily consume and ingest messages at very high volume and high rates. A MemSQL Pipeline ingests streaming data from an Apache Kafka topic into MemSQL leaf nodes by default. MemSQL leaf nodes contain individual database partitions. Each database partition consumes the Kafka stream into the designated destination table. By default, tables in MemSQL are in-memory, rowstore tables.

Parallelism between the number of MemSQL database partitions and Kafka broker partitions for the given topic determines optimal performance as this parallelism dictates the total batch size. MemSQL keeps track of the Kafka earliest and latest offsets in the  information_schema.PIPELINES_BATCHES table.

Question: What are the advantages to using Apache Kafka versus Amazon S3?
Apache Kafka is a modern, distributed messaging system. Amazon S3 is cloud object storage built to store and retrieve files. MemSQL Pipelines has extractors for both Apache Kafka and Amazon S3. For both extractors, the number of database partitions in MemSQL determines the degree of parallelism for data ingest.

For S3, the number of database partitions in MemSQL equates to the number of files in a pipeline batch. Each database partition ingests a specific S3 file from a folder in an S3 bucket. The files can be compressed. Amazon S3 has known limits for GET request speeds that start at over 100 requests per second. In addition, the pricing model for S3 is based on the volume of data egress. To learn more about MemSQL S3 Pipelines at scale, check out 1.3 Billion NYC Taxi Rows.

An Apache Kafka cluster is able to support reads and writes in the millions per second. Of course, this is very dependent on the package size. As package sizes increase, messaging throughput decreases. That said, as a distributed system, you can scale out Apache Kafka to meet your requirements.

Question: How can I get the O’Reilly eBook trilogy from MemSQL?
You can download our O’Reilly eBooks covering topics that range from Predictive Analytics to Artificial Intelligence at MemSQL O’Reilly Trilogy.

Question: What do I need to get started with Apache Kafka and MemSQL?
You can deploy a MemSQL cluster on-premises, with Docker, with Amazon Web Services, or with Microsoft Azure. To learn more, check out our Quick Start Guides.

To learn how to quickly setup a MemSQL Pipeline for Apache Kafka, review our Quick Start for Kafka Pipelines. You can watch the recording of the Pipelines in Five Minutes webcast here. If you prefer a quick demo of MemSQL, MemSQL Pipelines, and Apache Kafka, sign up at MemSQL Demo.


blog closer

MemSQL Helios eclipse
MemSQL Helios
The World’s Fastest Cloud Database