MemSQL Geospatial

Executive Summary

MemSQL is the real-time database for transactions and analytics. It is a fast, scalable, fully-featured relational database management system (RDBMS). MemSQL converges transactions and analytics for sub-second data processing and reporting. Gartner places MemSQL in the Hybrid Transactional and Analytical Processing (HTAP) market, an emerging area that allows organizations to further operationalize their data and gain increased analytical value.

Gartner Identifies Emerging Category

Hybrid Transactional plus Analytical Processing

"HTAP will enable business leaders to perform...much more advanced and sophisticated real-time analysis of their business data than with traditional architectures."

Why HTAP?

It's clear that the promise of big data still outweighs the reality. Until now, many organizations have found an effective way to store their data, but have had limited means to act on it. Much of the problem lies with organizations using separate systems for data storage and data processing. To close the gap, database technology has focused on realizing instant value from growing and changing data sources, an approach that has manifested itself as hybrid transactional and analytical processing (HTAP) - that is, performing analysis on data directly in an operational data store.

To enable HTAP, the following features must be in place:

In-Memory Computing. Storing data in-memory is especially valuable for running concurrent workloads, whether transactional or analytical, because it eliminates the inevitable bottleneck caused by disk contention. In-memory computing is necessary for HTAP because no purely disk-based system is able to provide the required input/output (I/O) with any reasonable amount of hardware.

Code Generation and Compiled Query Execution Plans. With no disk I/O, queries execute so quickly that dynamic SQL interpretation actually becomes a performance bottleneck. This problem can be addressed by taking SQL statements and generating a compiled query execution plan. This approach is much faster than dynamically interpreting SQL thanks to the inherent performance advantage of executing compiled versus interpreted code, as well as enabling more sophisticated optimizations.

Some organizations place a caching layer on top of their relational database management systems, but this approach breaks down when data is frequently updated. By storing compiled query plans rather than caching one-time query results, MemSQL executes every query directly against the in-memory database, so the user always sees the most up-to-date version of the data.
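As a simple illustration of the idea, consider two statements that differ only in a constant value. The sketch below assumes a hypothetical page_views table; because both statements parameterize to the same template, only the first execution pays the compilation cost. It also assumes the SHOW PLANCACHE command is available for inspecting compiled plans and how often they are reused.

-- Hypothetical table and queries, shown only to illustrate plan reuse.
select count(*) from page_views where user_id = 1001;  -- first run: plan is generated and compiled
select count(*) from page_views where user_id = 2002;  -- same shape: the compiled plan is reused

-- Inspect compiled plans and their reuse counts
-- (assumes SHOW PLANCACHE is available in this version).
show plancache;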

Fault Tolerance and ACID Compliance. Operational data stores cannot lose data, making fault tolerance and Atomicity, Consistency, Isolation, Durability (ACID) compliance prerequisites for any HTAP system. Some important enterprise readiness features to consider include storage of a redundant copy of data with automatic failover for high availability, persistence to disk for in-memory databases, and cross-data center replication. While these features are not tied to HTAP performance per se, they are absolutely necessary for HTAP to be implemented in an enterprise environment.

Benefits of HTAP

HTAP performs transactional and analytical operations in a single database of record, often performing time-sensitive analysis of streaming data. For any data-centric organization, an HTAP-capable database can become the core of its data processing infrastructure, handling day-to-day operational workloads with ease. It serves as a database of record, but is also capable of analytics.

The benefit that this convergence of operations and analytics can bring is dramatic. Three key reasons why enterprises are turning to HTAP are as follows:

Real-time analytics. Many databases promise to speed up applications and analytics. However, there is a fundamental difference between simply speeding up existing infrastructure and actually opening up new opportunities for performance and optimization. True "real-time analytics" does not simply mean faster response times; it means analytics that capture the value of data before a specified time threshold has passed, usually a fraction of a second. Until recently, technological limitations have necessitated maintaining separate workload-specific data stores, which introduces latency and complexity and prevents organizations from capturing the full value of real-time data. Combined, the three key features described above allow companies to see real results from big data.

Reducing administrative and development overhead. With HTAP, data no longer needs to move from operational databases to separate data warehouses or data marts to support analytics. Rather, data is processed in a single system of record, effectively eliminating the need to extract, transform, and load (ETL) data. This provides welcome relief to data analysts and administrators, as ETL often takes hours, and sometimes days, to complete.

Enabling new sources of revenue or products. An example can be found in financial services, where investors must be able to respond to market volatility in an instant. Any delay is money out of their pockets. HTAP makes it possible for these organizations to respond to fluctuating market conditions as they happen, providing more value to investors.

HTAP-enabled databases are filling the void where big data promises have failed to deliver. A hybrid data processing model gives enterprises real-time access to streaming data, providing faster and more targeted insights. The ability to analyze data as it is generated allows organizations to spot operational trends as they develop rather than reacting after the fact. For applications like data center monitoring, this helps reduce or eliminate downtime. For applications that require monitoring a complex dynamic system, such as a shipping network, for instance, it allows analysts to "direct traffic" and make real-time decisions to boost revenue.

Industry Overview

MemSQL was built from the ground up to support the enterprise across many data types and formats. Our core technology is built for high-performance data workloads that require sub-second response times as data is being ingested. We believe that organizations should not have to compromise transactional performance because of traditional architectures that depend on ETL, NoSQL, and non-relational standards. The chart below describes the industry landscape for the types of database technology.

[Figure: Industry landscape]

There are many types of database solutions in the market today; however, very few can handle the changing nature of what is now called modern data. The ability to act on data in real time as it happens, and to merge it with or perform analytics on historical data in a single database, has become greatly desired but remains elusive for many organizations. Traditionally, organizations have addressed this by acquiring a diverse set of database solutions, from open source to proprietary models. In fact, the ability to address many data types, at scale, while leveraging industry standards is paramount to enterprises. MemSQL provides a fully compliant relational database that can address many data types, including geospatial, with cost-effective scalability, ACID compliance, and the industry-standard SQL language.

Technical Overview

Real-time organizations can build modern applications on simple and scalable infrastructure that complements and extends existing data pipelines. MemSQL is the first in-memory database to combine:

  • Analytics in your operational database (lock-free data structure)
  • Sub-second latency across high-volume datasets
  • Simple and scalable distributed architecture
  • Flexible scale-out from commodity hardware to the cloud
  • Easy integration with existing data warehouses and applications

[Figure: Technical overview]

MemSQL supports standard SQL including joins, indexing, ACID transactions, and complex queries. Our database software runs on clusters of cost-effective commodity hardware, and has successfully scaled to many terabytes of data and hundreds to thousands of machines, running millions of transactions per second. Advanced management features include replication, failover, and robust systems monitoring on the cloud or behind your firewall. It has been designed from the ground up as a modern answer to modern big data problems.

As of version 4.0, MemSQL supports geospatial objects as first-class data types. You can create tables that combine time, key attributes, points, paths, and polygons over the Earth's surface, and query them all in real time. Geospatial features include geo-fencing, intersections, and spatial joins between tables, using our geospatial index for maximum performance.
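As a minimal sketch of what that looks like (the schema below is illustrative rather than drawn from a specific application), a single table can combine a timestamp, ordinary key attributes, a point, and a polygon, and be queried with temporal and spatial predicates together:

-- Illustrative schema only: one table combining time, key attributes,
-- a point (current position), and a polygon (assigned service area).
create table vehicle_events (
    vehicle_id bigint unsigned not null,
    event_time datetime not null,
    status varchar(16) not null,
    location GEOGRAPHYPOINT not null,
    service_area GEOGRAPHY not null,
    primary key (vehicle_id, event_time),
    index (location)
);

-- Real-time query mixing a temporal filter with a spatial predicate:
-- which vehicles reported a position inside their service area in the last minute?
select vehicle_id, status
from vehicle_events
where event_time > now() - interval 1 minute
  and geography_contains(service_area, location);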

MemSQL enables companies to process transactions and perform analysis simultaneously in a single operational database with a lock-free data structure. By loading everything into scalable DRAM, queries can be executed across extremely large datasets in real time. With immediate access to both real-time and historical data, MemSQL opens up new opportunities to explore new research areas, accelerate analysis, and create new analytical models.

By storing all data in memory, MemSQL can serve query results from billions of records at the same moment data is being created. Insert latency is less than a millisecond, and MemSQL has been lab-tested to be faster than any other database at processing transactions and analysis in real time.

Our key geospatial capability is the ability to perform relational, temporal, and spatial analysis and transactions in the same system without compromising speed, scale, or simplicity. There are many databases that can perform relational and time-series analysis. There are plenty of spatial databases too. No other database on the market can perform all three, in the same query, at massive scale and performance.

Geospatial Features and Functions

MemSQL makes geospatial data a first-class citizen: just as easy to use, at scale, at great speed and high throughput, as any other kind of relational data. MemSQL geospatial is a partial implementation of the OpenGIS standard for spatial processing.

  • Datatypes: MemSQL supports Points, Linestrings, Polygons and Multipolygons natively using the industry-standard Well-Known Text (WKT) format;
  • Topological filter functions: Contains, Intersects, Approximation, and Within Distance;
  • Measurement functions: Length, Area, Distance;
  • Indexing: MemSQL leverages its fast, massively parallel, lock-free indexing for geospatial objects;
  • Spatial Joins: MemSQL tables can be joined by their spatial relationships.

Use Case Examples

MemSQL Supercar

"Supercar" is a demo we built to exercise our geospatial features and the ability to perform transactions and analytics in real time on the same fast-moving dataset. It is a simulation of 50,000 taxis roaming around the New York metro area, picking up and dropping off passengers. Each vehicle reports its geolocation to the server once a second. A "trips" thread uses real-world NYC taxi data to create requests for pickups and destinations at a clip of several hundred per second. A taxi is chosen by performing a within_distance geospatial query to find the closest 20 available vehicles with the features the rider asks for (e.g., SUV, carseat, limo). A candidate is chosen at random and the taxi starts moving to the pickup point. The price of the ride is determined dynamically based on a geofence query and recent values for supply and demand within that geofence.

Once the rider is dropped off, the taxi performs another query to determine where it is, the price of that area, and the location of another area with a higher price. Having chosen a likely place to wait for another fare, it moves toward it.

The "pricing" thread dynamically adjusts local prices for taxi fares once a second. It looks up recent requests and taxi locations, grouped by the areas they occurred, and bumps the price of each geofence up or down based on the ratio of supply and demand. A web-based user interface plots the state of the system and allows the user to run real time analytical queries against the dataset.

[Figure: Supercar demo]

Forest Cover and Persistent Change

Percentage of change (i.e., finding the area of overlap and difference between the same geospatial entities at two points in time) is important for measuring growth or reduction over many observations of aerial imagery. Supporting this requires geospatial functions that measure area and distance and compute the polygon of intersection, union, or difference. MemSQL currently supports several of these features, with others planned. With these in place, calculating the regrowth rate of a forest becomes a straightforward query:

select
    f.year,
    area(difference(f.shape, baseline.shape)) as new_growth,
    new_growth / area(baseline.shape) * 100 as percent_change
from forests f, forests baseline
where
    f.name = baseline.name
    and f.name = 'Decentralized Park'
    and baseline.year = '2010'
group by f.year;
Year  New Growth  Percent Change
2010           0               0
2011      156713            5.90
2012      112340            4.22
2013       -8907            -0.3

Persistent Change. Datasets to support persistent change detection are large and complex. The initial input is imagery, which is processed to identify edges, contiguous blocks of color, etc. This generates a large dataset of polygons tagged with the date and time of the picture. Comparing these regions over many observations filters out temporary changes. The focus is on long-term changes in natural features (forest cover, glaciation, coastline erosion) as well as human constructs (farming, roads, stripmines, etc.). For example, regrowth after a forest fire or a new railway would qualify as a "persistent change", but a temporary change like an algae bloom, a blizzard, or an outdoor concert would not.

The key dimensions of this process are time, location, region attributes, and the boundaries of the regions being studied. Geospatial databases are often specialized for the task of spatial relationships only, and can be difficult to use for more complex queries that include all of the dimensions.

For example, imagine a query that finds roads that were cut through a forest over the last 5 years in a province, and have not yet been reviewed by an analyst. MemSQL easily executes queries like this, sifting through large amounts of spatial, temporal, and relational data all at once.

select f.id, f.name, c.event_date
from possible_changes c, provinces p, forests f
where
    c.event_date between '2010-04-18' and '2015-04-18'
    and c.type = 'road'
    and intersects(c.shape, f.borders)
    and intersects(f.borders, p.borders)
    and p.name = 'Userland'
    and c.analyst_review_date is NULL;

ID  Name                Event Date
 1  Decentralized Park  2010-05-17
 2  Red-Black Forest    2011-09-23

Conclusion

MemSQL provides a highly performant relational database in what Gartner identifies as the Hybrid Transactional and Analytical Processing (HTAP) category, a technology that will give organizations the ability to perform much more "advanced and sophisticated real-time analysis of their data than with traditional architectures." MemSQL is a leader in HTAP technology. That leadership, combined with our geospatial data types and processing, our ability to scale on commodity hardware in the cloud or on premises, and our adherence to industry standards, makes MemSQL a compelling choice.

Addendum - Geospatial Guide

With version 4, MemSQL has added support for geospatial queries. The term "geospatial" covers a very broad range of features. There is a formal standard for some of those features, known as SQL-MM or OpenGIS. The features in MemSQL are not yet a complete implementation of that standard, but they follow the same structure.

At a basic level, geospatial support requires three things: datatypes, topological functions, and measurement functions.

Geospatial Types

[Figure: Geospatial types]

The three main geospatial object types are polygons, paths, and points. We support a subset of the standard "Well-Known Text" (WKT) syntax to describe them. For example, a POINT is a simple longitude / latitude pair. Note the lack of a comma between the numbers.

"POINT(-74.044514 40.689244)"

A linestring or path object is an ordered list of points. Note the double parenthesis around pairs of numbers, which are separated by commas.

"LINESTRING((0 0, 0 1, 1 1))"

A polygon is similar to a linestring, except that it is "closed", that is to say, it must have at least three unique points, and the first and last point-pairs must be equal:

"POLYGON((0 0, 0 1, 1 1, 0 0))"

You can also describe multiple shapes inside a polygon object, for example, a set of islands, or a lake inside a country.

"POLYGON((0 0, 0 1, 1 1, 0 0), (5 5, 6 5, 5 6, 5 5))"

All geospatial functions can accept either a geospatial object or the equivalent WKT string as arguments. MemSQL does not yet support the MULTIPOLYGON, GEOMETRY, CURVE, or other keywords.

Creating a Geospatial Table

Geospatial objects are just another data type in MemSQL and can be used right alongside numbers, strings, and JSON. To create a column that holds point data, use the GeographyPoint type. For path and polygon data, use the Geography type.

Note: It is possible to store POINTs in the Geography type, but this is not recommended. The GeographyPoint type was designed specifically for point data, and is highly optimized for memory efficiency and speed.

In this example, we'll create a table of neighborhoods. The "centroid" field is the point that is roughly in the middle of the borders of the neighborhood. The "shape" field is the polygon describing those borders. All geospatial data can be indexed for very high query performance.

create table neighborhoods (
    id int unsigned not null primary key,
    name varchar(64) not null,
    population int unsigned not null,
    shape GEOGRAPHY not null,
    centroid GEOGRAPHYPOINT not null,
    index (shape) with (resolution = 8),
    index (centroid)
);

insert into neighborhoods values
    (1, "Hamilton", 12346, "POLYGON((....))", ...),
    (2, "Probability Bay", 263951, "POLYGON((....))", ...),
    (3, "Decentralized Park", 29265, "POLYGON((....))", ...),
    (4, "Axiom Township", 845696, "POLYGON((....))", ...)
    ...

Querying Geospatial Data

[Figure: Querying geospatial data]

You can run queries over the neighborhood data to see how they relate to each other spatially. For example, you can join the table to itself and measure distances between neighborhoods:

memsql> select b.name as town,
    ->   round(GEOGRAPHY_DISTANCE(a.centroid, b.centroid), 0) as distance_from_center,
    ->   round(GEOGRAPHY_DISTANCE(a.shape, b.shape), 0) as distance_from_border
    -> from neighborhoods a, neighborhoods b
    -> where a.id = 2
    -> order by 2;
+---------------------+----------------------+----------------------+
| town                | distance_from_center | distance_from_border |
+---------------------+----------------------+----------------------+
| Probability Bay     |                    0 |                    0 |
| Lower Bound Valley  |                 1924 |                    0 |
| Axiom Township      |                 2907 |                    0 |
| Elegant Island      |                 4551 |                 1935 |
| Upper Bound Hill    |                 7866 |                 5632 |
| Big Endian          |                11374 |                 9016 |
| Little Endian       |                19275 |                17290 |
| Isle Zero           |                19439 |                17163 |
...

You can also find out where you are with GEOGRAPHY_INTERSECTS:

memsql> select id, name from neighborhoods where
    -> GEOGRAPHY_INTERSECTS("POINT(-73.94990499 40.69150746)", shape);
+----+----------+
| id | name     |
+----+----------+
|  1 | Hamilton |
+----+----------+
1 row in set (0.00 sec)

Spatial Joins

Geospatial objects are first-class datatypes in MemSQL, which means that you can use spatial relationships to join tables. For example, to find all of the businesses in your neighborhood, you join the businesses table with the neighborhoods table on whether the business location (a point) is contained by the neighborhood's shape (a polygon).

memsql> select b.id, b.name from businesses b, neighborhoods n
    -> where n.name="Abstract Island" and
    -> GEOGRAPHY_CONTAINS(n.shape, b.location);
+----+---------------------------+
| id | name                      |
+----+---------------------------+
|  2 | TouringMachinesBikeRepair |
| 17 | Random Slice Pizza        |
| 34 | The Philosopher's Diner   |
| 51 | Push & Pop Cafe           |
...

These functions and many more are documented in the MemSQL Geospatial Reference.

Geospatial Model

MemSQL 4 uses a spherical model similar to that used in Google Earth. It assumes a perfectly spherical Earth with a radius of 6,367,444.66 meters. The real Earth is not a sphere. It resembles an ellipsoid slightly wider than it is tall, with many local variations. In a spherical model, measurements of distance and area may vary from reality. For example, a 10,000 meter line drawn vertically over San Francisco would measure about 10,020 meters in this model. Topological functions, which describe the relationships between shapes and points, are less affected by this difference.
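For example, the distance in meters between two points under this spherical model is a single function call; the coordinates below are illustrative:

select geography_distance("POINT(-122.4194 37.7749)",
                          "POINT(-122.4194 37.8649)") as meters;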

Columnstore Geospatial

In MemSQL 4, geospatial data and functions are supported in columnstore (i.e., on-disk) tables. Columnstore tables do not yet support indexes on geospatial data. The native GeographyPoint type is supported, but the Geography type is not. However, you can store path and polygon data in a plain text field and run geospatial functions over it just as you do with rowstore (in-memory) tables.

create COLUMNSTORE table neighborhoods_columnstore (
    id int unsigned not null,
    name varchar(64) not null,
    population int unsigned not null,
    centroid GEOGRAPHYPOINT not null,
    shape TEXT not null
);

insert into neighborhoods_columnstore values
    (1,  "Hamilton", 12346, "POLYGON((....))", ...),
    (2,  "Probability Bay", 263951, "POLYGON((....))", ...),
    (3,  "Decentralized Park", 29265, "POLYGON((....))", ...),
    (4,  "Axiom Township", 845696, "POLYGON((....))", ...)
...

select id, name from neighborhoods_columnstore where GEOGRAPHY_INTERSECTS("POINT(-73.94990499 40.69150746)", shape);

Special Spatial Tuning Parameters

Spatial indices on Geography columns may be given an optional RESOLUTION parameter. You can choose a resolution between 6 and 32. This controls how many pieces the polygons and linestrings are broken up into for indexing. As a general rule, a lower number means less memory consumption, faster inserts and updates, but slower query times. A higher number may increase query performance at a cost of memory and insert performance.

Note: As with all tuning knobs, these parameters are not magic. Knobs should only be twiddled if you have a measurable need for more performance and have worked through the tradeoffs between ingest, computational load and memory consumption.

For example, it may be useful to compare the percentage difference in the number of rows returned by GEOGRAPHY_INTERSECTS and APPROX_GEOGRAPHY_INTERSECTS. This difference is the "false positive" rate of the index for a given resolution. For example, at resolution 16, suppose that 10% more records are returned by the approximate function. At resolution 8, the rate only grows to 11%. In that case you should be able to save memory and increase ingest throughput by using the lower number, without significantly harming query performance. Again, this kind of tuning is highly dependent on your particular dataset and workload. The parameters obtained for one application may not apply to another. Your mileage may vary.
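A sketch of that comparison, using the neighborhoods table created earlier and an illustrative query polygon:

-- Exact matches: index candidates are re-checked against the precise geometry.
select count(*) from neighborhoods
where geography_intersects(shape, "POLYGON(...)");

-- Approximate matches: everything the index produces at the current resolution.
-- The gap between the two counts is the false-positive rate of the index.
select count(*) from neighborhoods
where approx_geography_intersects(shape, "POLYGON(...)");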

You can also change the resolution of polygons passed into geospatial functions at query time. In this example, the resolution of the "shape" column is unchanged, but the POLYGON in the second argument is broken into 16 parts. Each part is queried separately. A lower number will result in fewer index lookup operations, but may return more false positive candidates from the index which are then filtered out of the result set in the second phase of query execution.

select *
from neighborhoods with (index = shape, resolution = 16)
where geography_intersects(shape,"POLYGON(...)");

Using GeoJSON Data

MemSQL as of version 4 does not have native GeoJSON support. However, we have generic JSON support as well as computed columns. Combining these features, you can import a subset of GeoJSON data into MemSQL and convert it into native geospatial types automatically.

For example, given a set of GeoJSON points like this:

{
    "type": "Feature",
    "geometry": {
        "type": "Point",
        "coordinates": [123.456, 78.901]
    },
    "properties": {
        "id": 123456,
        "name": "Probability Bay Naval Observatory"
    }
}

You can construct a table like this:

memsql> create table geostuff (
    ->   id as dataz::properties::%id persisted bigint unsigned,
    ->   name as dataz::properties::$name persisted varchar(128),
    ->   location as geography_point(
    ->     dataz::geometry::coordinates::%`0`,
    ->     dataz::geometry::coordinates::%`1`
    ->   ) persisted geographypoint,
    ->   dataz JSON not null,
    ->   index (location)
    -> );

And then insert the JSON string:

memsql> insert into geostuff (dataz) values ('{"type": "Feature", "geometry": {"type": "Point","coordinates": [123.456, 78.901]},"properties": {"id": 123456, "name": "Probability Bay Naval Observatory" }}');

memsql> select id, name, location from geostuff;

+--------+-----------------------------------+---------------------------------+
| id     | name                              | location                        |
+--------+-----------------------------------+---------------------------------+
| 123456 | Probability Bay Naval Observatory | POINT(123.45600012 78.90100000) |
+--------+-----------------------------------+---------------------------------+