This week at the Esri Developers Summit in Palm Springs, our friends at Esri are previewing upcoming features for the next release of MemSQL, using a huge real-world geospatial dataset.
Esri develops geographic information systems (GIS) that function as an integral component in nearly every type of organization. In a recent report by the ARC Advisory Group, the Geographic Information System Global Market Research Study, the authors stated, “Esri is, without a doubt, the dominant player in the GIS market.”
Everything happens somewhere. But, traditionally, spatial data has been locked away in specialized software that either lacked general database features, or didn’t scale out. With MemSQL we are making geospatial data a first-class citizen: just as easy to use, at scale, at great speed and high throughput, as any other kind of data.
The demonstration uses the “Taxistats” dataset: a compilation of 170 million real-world NYC taxi rides. It includes GPS coordinates of the pickup and dropoff, distance, and travel time. MemSQL is coupled with the new version of Esri’s ArcGIS Server, which has a new feature to translate ArcGIS queries into external database queries. From there we generate heatmaps from the raw data in sub-second time.
Heatmaps are a great way to visualize aggregate geospatial data. The X and Y are the longitude and latitude of “cells” or “pixels” on the map, and the color shows the intensity of the values. From there you can explore the dataset across any number of dimensions: zoom in on an area, filter by time, length of ride, and more.
How it Works
To prepare the data, we simply loaded it into a MemSQL table running on a modest Amazon AWS cluster. At first we thought about pre-processing the geospatial points to calculate hexgrid values for faster querying, but that turned out to be unnecessary. MemSQL is fast enough to aggregate the point data into heatmap “pixels” on the fly. Alternatively, we could have generated a table of hexagons and performed a spatial join between the point table and the pixel table.
The queries are almost absurdly simple. We filter the table to an area on the map using GEOGRAPHY_INTERSECTS, filter further by time and date, round off the GPS coordinates to gather them into rough “pixels” about 100 meters wide, and group them. Spatial data is just another type of data in MemSQL, complete with lock-free indexes and powerful functions for manipulating it. The demo zips through nearly 200 million complex records in under 150 milliseconds:
select count(1) as cnt, round(GEOGRAPHY_LONGITUDE(pickup), 3) as lon, round(GEOGRAPHY_LATITUDE(pickup), 3) as lat from taxistats where day_of_week = ‘Monday’ and hour_of_day = 2 and month_of_year = ‘January’ and GEOGRAPHY_INTERSECTS(pickup, 'POLYGON((-74.96857503 40.79939298,...))') group by 2, 3 having count(1) > 10 limit 10; +-----+---------+--------+ | cnt | lat | lon | +-----+---------+--------+ | 19 | -73.997 | 40.723 | | 19 | -74.003 | 40.730 | | 16 | -74.000 | 40.732 | | 21 | -73.998 | 40.735 | | 13 | -74.007 | 40.735 | | 51 | -73.998 | 40.729 | | 18 | -74.000 | 40.754 | | 20 | -74.004 | 40.740 | | 17 | -73.981 | 40.765 | | 21 | -74.000 | 40.761 | +-----+---------+--------+ 10 rows in set (0.14 sec)
Real-world Use Cases for Geospatial
Slicing by hour of the day, we can calculate the average speed of a taxi ride (distance / time) and find the best and worst places for traffic jams. Slicing by day of week, we can see the ebb and flow of traffic during workdays and weekends. For a city planner, this kind of data can be used to redirect traffic at specific times in effort to unclog traffic congestion. For the taxi business, this data can improve efficiency with supply and demand of cabs during times of high or low traffic for any given region.
Welcoming Geospatial to the World of Relational In-Memory Databases
For too long, geospatial data has been relegated to specific databases and datastores. With the emergence of mobile phones and the coming wave of wearables and sensor data related to the Internet of Things, everything is now logged in both time and space.
Now enterprises have an opportunity to combine geospatial data with the rest of their organizational data in a single database that is in-memory, linearly scalable, and supports a full range of relational SQL and geospatial functions. We look forward to a whole world of applications and opportunities.