MEMSQL 7.0

It’s About Time: Getting More from Your Time-Series Data With MemSQL 7.0

Eric Hanson
Eric Hanson

MemSQL is uniquely suited to real-time analytics, where data is being ingested, updated, and queried concurrently with aggregate queries. Real-time analytics use cases often are based on event data, where each separate event has a timestamp. It’s natural to interpret such a sequence of events as a time series.

Prior to the 7.0 release, MemSQL delivered many capabilities that make it well-suited to time-series data management [Han19]. These include:

  • a scaled-out, shared-nothing architecture that supports transactional and analytical workloads with a standard SQL interface,
  • fast query execution via compilation and vectorization, combined with scale out,
  • ability to load data phenomenally fast using the Pipelines feature, which supports distributed, parallel ingestion,
  • non-blocking concurrency control so readers and writers never make each other wait,
  • window functions for ranking, moving averages, and so on,
  • a highly-compressed columnstore data format suitable for large historical data sets.

Hence, many of our customers are using MemSQL to manage time series data today.

For the MemSQL 7.0 release, we decided to build some special-purpose features to make it even easier to manage time-series data. These include FIRST(), LAST(), TIME_BUCKET(), and the ability to designate a table column as the SERIES TIMESTAMP [Mem19a-d]. Taken together, these allow specification of queries to summarize time series data with far fewer lines of code and fewer complex concepts. This makes expert SQL developers more productive, and opens up the ability to query time series data to less expert developers.

We were motivated to add special time series capability in MemSQL 7.0 for the following reasons:

  • Many customers were using MemSQL for time series data already, as described above.
  • Customers were asking for additional time series capability.
  • Bucketing by time, a common time series operation, was not trivial to do.
  • Use of window functions, while powerful for time-based operations, can be complex and verbose.
  • We’ve seen brief syntax for time bucketing in event-logging data management platforms like Splunk [Mil14] and Azure Data Explorer (Kusto) [Kus19] be enthusiastically used by developers.
  • We believe we can provide better overall data management support for customers who manage time series data than the time series-specific database vendors can. We offer time series-specific capability and also outstanding performance, scalability, reliability, SQL support, extensibility, rich data type support, and so much more.

Designating a Time Attribute in Metadata

To enable simple, brief SQL operations on time series data, we recognized that all our new time series functions would have a time argument. Normally, a table has a single, well-known time attribute. Why not make this attribute explicit in metadata, and an implicit argument of time-based functions, so you don’t have to reference it in every query expression related to time?

So, in MemSQL 7.0 we introduced a special column designation, SERIES TIMESTAMP, that indicates a default time column of a table. This column is then used as an implicit attribute in time series functions. For example, consider this table definition:

CREATE TABLE tick(
  ts datetime(6) series timestamp,
  symbol varchar(5),
  price numeric(18,4));

It defines a table, tick, containing hypothetical stock trade data. The ts column has been designated as the series timestamp. In examples to follow, we’ll show how you can use it to make queries shorter and easier to write.

The Old Way of Querying Time Series

Before we show the new way to write queries briefly using time series functions and the SERIES TIMESTAMP designation in 7.0, consider an example of how MemSQL could process time series data before 7.0. We’ll use the following data for examples:

INSERT INTO tick VALUES
 ('2020-02-18 10:55:36.179760', 'ABC', 100.00),
 ('2020-02-18 10:57:26.179761', 'ABC', 101.00),
 ('2020-02-18 10:59:16.178763', 'ABC', 102.50),
 ('2020-02-18 11:00:56.179769', 'ABC', 102.00),
 ('2020-02-18 11:01:37.179769', 'ABC', 103.00),
 ('2020-02-18 11:02:46.179769', 'ABC', 103.00),
 ('2020-02-18 11:02:59.179769', 'ABC', 102.60),
 ('2020-02-18 11:02:46.179769', 'XYZ', 103.00),
 ('2020-02-18 11:02:59.179769', 'XYZ', 102.60),
 ('2020-02-18 11:03:59.179769', 'XYZ', 102.50);

The following query works in MemSQL 6.8 and earlier. As output, it produces a separate row, for each stock, for each hour it was traded at least once. (So if a stock is traded ten or more times, in ten separate hours, ten rows are produced for that stock. A row will contain either a single trade, if only one trade occurred in that hour, or a summary of the trades – two or more – that occurred during the hour.) Each row shows the time bucket, stock symbol, and the high, low, open, and close for the bucket period. (If only one trade occurred in that hour, the high, low, open, and close will all be the same – the price the stock traded at in that hour.)

WITH ranked AS
(SELECT symbol,
    RANK() OVER w as r,
    MIN(price) OVER w as min_pr,
    MAX(price) OVER w as max_pr,
    FIRST_VALUE(price) OVER w as first,
    LAST_VALUE(price) OVER w as last,
    from_unixtime(unix_timestamp(ts) div (60*60) * (60*60)) as ts
    FROM tick
    WINDOW w AS (PARTITION BY symbol, 
               from_unixtime(unix_timestamp(ts) div (60*60) * (60*60)) 
               ORDER BY ts
               ROWS BETWEEN UNBOUNDED PRECEDING
               AND UNBOUNDED FOLLOWING))
 
SELECT ts, symbol, min_pr, max_pr, first, last
FROM ranked
WHERE r = 1
ORDER BY symbol, ts;

This query produces the following output, which can be used to render a candlestick chart [Inv19], a common type of stock chart.

+---------------------+--------+----------+----------+----------+----------+
| ts                  | symbol | min_pr   | max_pr   | first    | last     |
+---------------------+--------+----------+----------+----------+----------+
| 2020-02-18 10:00:00 | ABC    | 100.0000 | 102.5000 | 100.0000 | 102.5000 |
| 2020-02-18 11:00:00 | ABC    | 102.0000 | 103.0000 | 102.0000 | 102.6000 |
| 2020-02-18 11:00:00 | XYZ    | 102.5000 | 103.0000 | 103.0000 | 102.5000 |
+---------------------+--------+----------+----------+----------+----------+

The query text, while understandable, is challenging to write because it uses a common table expression (CTE), window functions with a non-trivial window definition, a subtle use of ranking to pick one row per group, and a non-obvious divide/multiply trick to group time to a 60*60 second bucket.

New Time-Series Functions in MemSQL 7.0

Here I’ll introduce the new time series functions, and then show an example where we write an equivalent query to the “candlestick” query above using the new functions. I think you’ll be impressed by how concise it is!

Also see the latest documentation for analyzing time series data and for the new time series functions.

FIRST()

The FIRST() function is an aggregate function that takes two arguments, as follows:

FIRST (value[, time]);

Given a set of input rows, it returns the value for the smallest associated time.

The second argument is optional. If it is not specified, it is implicitly the SERIES TIMESTAMP column of the table being queried. It’s an error if there is no SERIES TIMESTAMP available, or if there is more than one available in the context of the query where FIRST is used; in that case, you should specify the time explicitly.

For example, this query gives the symbol of the first stock traded among all stocks in the tick table:

SELECT first(symbol) FROM tick;

The result is ABC, which you can see is the first one traded at 10:55:36.179760 in the rows inserted above.

LAST()

LAST is just like FIRST except it gives the value associated with the latest time.

TIME_BUCKET()

TIME_BUCKET takes a time value and buckets it to a specified width. You can use very brief descriptions of bucket width, like ‘1d’ for one day, ‘5m’ for five minutes, and so on. The function takes these arguments:

TIME_BUCKET (bucket_width [, time [,origin]])

The only required argument is bucket_width. As with FIRST and LAST, the time argument is inferred to be the SERIES TIMESTAMP if it is not specified. The origin argument is used if you want your buckets to start at a non-standard boundary – say, if you want day buckets that begin at 8am every day.

Putting It All Together

Now that we’ve seen FIRST, LAST, TIME_BUCKET, and SERIES TIMESTAMP, let’s see how to use all of them to write the candlestick chart query from above. A new version of the same query is simply:

SELECT time_bucket('1h') as ts, symbol, min(price) as min_pr,
    max(price) as max_pr, first(price) as first, last(price) as last
FROM tick
group by 2, 1
order by 2, 1;

The new version of the query produces this output, which is essentially the same as the output of the original query.

+----------------------------+--------+----------+----------+----------+----------+
| ts                         | symbol | min_pr   | max_pr   | first    | last     |
+----------------------------+--------+----------+----------+----------+----------+
| 2020-02-18 10:00:00.000000 | ABC    | 100.0000 | 102.5000 | 100.0000 | 102.5000 |
| 2020-02-18 11:00:00.000000 | ABC    | 102.0000 | 103.0000 | 102.0000 | 102.6000 |
| 2020-02-18 11:00:00.000000 | XYZ    | 102.5000 | 103.0000 | 103.0000 | 102.5000 |
+----------------------------+--------+----------+----------+----------+----------+

Look how short this query is! It is 5 lines long vs. 18 lines for the previous version. Moreover, it doesn’t use window functions or CTEs, nor require the divide/multiply trick to bucket time. It just uses standard aggregate functions and scalar functions.

Conclusion

MemSQL 7.0 makes it much simpler to specify many time-series queries using special functions and the SERIES TIMESTAMP column designation. For a realistic example, we reduced lines of code by more than three-fold, and eliminated the need to use some more advanced SQL concepts.

Given the high performance, unlimited scalability, and full SQL support of MemSQL, it was a strong platform for time series data in earlier releases. Now, in MemSQL 7.0, we’ve taken that power and added greater simplicity with these new built-in capabilities. How can you apply MemSQL 7.0 to your time-oriented data?

References

[Han19] Eric Hanson, What MemSQL Can Do For Time Series Applications, https://www.memsql.com/blog/what-memsql-can-do-for-time-series-applications/, March 2019.

[Inv19] Understanding Basic Candlestick Charts, Investopedia, https://www.investopedia.com/trading/candlestick-charting-what-is-it/, 2019.

[Kus19] Summarize By Scalar Values, Azure Data Explorer Documentation, https://docs.microsoft.com/en-us/azure/kusto/query/tutorial#summarize-by-scalar-values, 2019.

[Mem19a] FIRST, MemSQL Documentation, https://docs.memsql.com/v7.0/reference/sql-reference/time-series-functions/first/, 2019.

[Mem19b] LAST, MemSQL Documentation, https://docs.memsql.com/v7.0/reference/sql-reference/time-series-functions/last/, 2019.

[Mem19c] TIME_BUCKET, MemSQL Documentation, https://docs.memsql.com/v7.0/reference/sql-reference/time-series-functions/time_bucket/, 2019.

[Mem19d] CREATE TABLE Topic, SERIES TIMESTAMP, https://docs.memsql.com/v7.0/reference/sql-reference/data-definition-language-ddl/create-table/, 2019.

[Mil14] James Miller, Splunk Bucketing, Mastering Splunk, O’Reilly, https://www.oreilly.com/library/view/mastering-splunk/9781782173830/ch03s02.html, 2014.

MemSQL Helios eclipse
Introducing
MemSQL Helios
The World’s Fastest Cloud Database