Turning Amazon S3 Into a Real-Time Analytics Pipeline

Seth Luersen

MemSQL 5.7 introduces a new pipeline extractor for Amazon Simple Storage Service (S3). Many modern applications use Amazon S3 to store data objects of up to 5 TB each in buckets, which makes S3 a natural foundation for today’s enterprise data lake.

Without analytics, the data is just a bunch of files

For modern enterprise data warehouses, the challenge is to harness the virtually unlimited capacity of S3 for ad-hoc and real-time analytics. For traditional data warehouse applications, extracting data from S3 requires additional services and background jobs that monitor buckets for new objects and then load those objects for reporting and analysis. Eliminating duplicates, handling errors, and applying transformations to the retrieved objects often require extensive coding, middleware, or additional Amazon offerings.

From data lake to real-time data warehouse

A MemSQL S3 Pipeline extracts data from a bucket’s objects, transforms the data as required, and loads the transformed data into columnstore and rowstore tables. MemSQL Pipelines use distributed processing and in-memory computing to extract, transform, and load external data in parallel into each database partition, with exactly-once semantics.
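
To make the idea of exactly-once loading concrete, here is a minimal sketch in Python. It is not how MemSQL implements pipelines internally; it only illustrates one common way to get the guarantee: record which object was loaded in the same transaction as the rows it produced, so a retry after a failure neither skips nor duplicates data. The table names (`books`, `loaded_objects`), the helper functions, and the use of SQLite as a stand-in database are all assumptions made for illustration.

```python
import sqlite3


def make_db():
    """Create an in-memory stand-in database with a data table and a bookkeeping table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE books (title TEXT, author TEXT)")
    conn.execute("CREATE TABLE loaded_objects (object_key TEXT PRIMARY KEY)")
    return conn


def load_object(conn, key, lines):
    """Load one object's rows and mark the object as loaded, atomically.

    Returns True if the object was loaded, False if it had already been loaded.
    """
    with conn:  # commits on success, rolls back if any statement raises
        seen = conn.execute(
            "SELECT 1 FROM loaded_objects WHERE object_key = ?", (key,)
        ).fetchone()
        if seen:
            return False
        # "Transform" step: split each CSV line into a (title, author) tuple.
        rows = [tuple(line.split(",")) for line in lines]
        conn.executemany("INSERT INTO books (title, author) VALUES (?, ?)", rows)
        conn.execute("INSERT INTO loaded_objects (object_key) VALUES (?)", (key,))
        return True
```

Calling `load_object` twice with the same key loads the rows only once, and a malformed object leaves both tables unchanged because the whole transaction is rolled back.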

A MemSQL S3 Pipeline runs continuously, streaming both existing and new S3 objects while the ingested data remains queryable with sub-second response times. Rapid and continuous data ingest for real-time analytic queries is a native component of MemSQL. The constant data ingest allows you to deliver real-time analytics with ANSI SQL and power business intelligence applications like Looker, Zoomdata, or Tableau.

MemSQL Pipelines are first-class database citizens. Database developers and administrators can easily create, test, alter, start, stop, and configure pipelines with basic data definition language (DDL) statements, or use the graphical user interface (GUI) in MemSQL Ops.
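
As a rough illustration of what those DDL statements look like, the Python sketch below assembles a create, test, start, and stop statement for an S3 pipeline. The statement shapes follow the `CREATE PIPELINE ... AS LOAD DATA S3` form as best recalled from the MemSQL documentation and should be checked against the S3 Pipelines Quickstart for your version. The helper `pipeline_statements`, the pipeline name, bucket, table, region, and credentials are all hypothetical placeholders.

```python
import json


def pipeline_statements(name, bucket, table, region, access_key_id, secret_access_key):
    """Build illustrative pipeline DDL strings (hypothetical helper, not a MemSQL API).

    Values are interpolated directly, so pass only trusted identifiers;
    this is a sketch, not an injection-safe query builder.
    """
    config = json.dumps({"region": region})
    credentials = json.dumps({
        "aws_access_key_id": access_key_id,
        "aws_secret_access_key": secret_access_key,
    })
    create = (
        f"CREATE PIPELINE {name} "
        f"AS LOAD DATA S3 '{bucket}' "
        f"CONFIG '{config}' "
        f"CREDENTIALS '{credentials}' "
        f"INTO TABLE {table} "
        "FIELDS TERMINATED BY ','"
    )
    return {
        "create": create,
        "test": f"TEST PIPELINE {name} LIMIT 1",  # preview a sample without loading it
        "start": f"START PIPELINE {name}",        # begin continuous ingest
        "stop": f"STOP PIPELINE {name}",          # pause ingest
    }


stmts = pipeline_statements(
    "library", "my-bucket-name", "classic_books",
    "us-west-1", "your_access_key_id", "your_secret_access_key",
)
for step in ("create", "test", "start", "stop"):
    print(stmts[step] + ";")
```

Because MemSQL speaks the MySQL wire protocol, the resulting strings could be run from any MySQL-compatible client or cursor; the sketch only prints them so that it runs without a cluster.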

Excited to get started with MemSQL S3 Pipelines? Follow these steps:

1) Open an AWS account. AWS offers an AWS Free Tier that includes 5 GB of Amazon S3 Storage, including 20,000 Get Requests and 2,000 Put Requests.

2) Download a 30-day free trial of the MemSQL Enterprise Edition or use the MemSQL Official Docker Image to run MemSQL.

3) Once your cluster is running, create your first MemSQL S3 Pipeline using our S3 Pipelines Quickstart. The guide covers creating S3 buckets, a MemSQL database, and, most importantly, a MemSQL S3 Pipeline.

