Newest Upsert Benchmark showcases critical use case for internet billing with telcos, ISPs, and CDNs
MemSQL achieves 7.9 million upserts per second, 6x faster than Cassandra
Benchmark details and scripts now available on GitHub
The business need for fast updates and live dashboards
Businesses want insights from their data and they want it sooner rather than later. For fast-changing data, companies must rapidly glean insights in order to make the right decisions. Industry applications like IoT telemetry monitoring, mobile network usage, internet service provider (ISP) billing, and content delivery network (CDN) usage tracking depend upon real-time analytics with fast-changing data. Web traffic merits special attention since it continues to grow at an astounding rate. According to Cisco, Global IP traffic will increase nearly threefold over the next 5 years, and will have increased nearly a hundredfold from 2005 to 2020. Overall, IP traffic will grow at a compound annual growth rate (CAGR) of 22 percent from 2015 to 2020. Many businesses face the challenge of monitoring, analyzing, and monetizing large scale web traffic, so we will explore this use case.
Use case example
In particular, we dive into the example of a content delivery or distribution network (CDN). A CDN is a globally distributed network of web servers deployed in multiple data centers across different geographic regions and is relied upon by content providers such as media companies and e-commerce vendors to deliver content to end users. CDNs have a business need to monitor their system in real-time. In addition to logging customer usage for the purpose of billing, they want to be alerted to sudden increases and decreases in their workloads for load balancing as well as for detecting network events like “denial of service attacks”. The sheer volume of web traffic mandates a massive parallel processing (MPP) system that can scale out to support the load. The concurrent need for real-time analytics points to the direction of hybrid transaction/analytical processing, or HTAP. HTAP systems enable high speed ingest and sophisticated analytics simultaneously without data movement or ETL.
Background on the Upsert Benchmark
This benchmark demonstrates the raw horsepower of a database system capturing high volume updates. Update, or upsert, is the operative word here. With a conventional
insert a new row is created for each new database entry. With an upsert, individual rows can be updated in place. This upsert capability allows for a more efficient database table and faster aggregations, and it is particularly useful in areas such as internet billing. For more detail on this workload in use, take a look at this blog post, Turn Up the Volume With High-Speed Counters.
MemSQL delivers efficient upsert performance, achieving up to 8 million upserts per second on a 10 node cluster, using the following parameterized query:
Upsert query for MemSQL
insert into records (customer_code, subcustomer_id, geographic_region, billing_flag, bytes, hits) values on duplicate key update bytes=bytes+VALUES(bytes),hits=hits+VALUES(hits);
Comparing Upsert performance
Legacy databases and data warehousing solutions are optimized for batch loading of data and subsequently are unable to handle fast data insertions along with ad-hoc analysis of freshly generated data. NoSQL databases like Cassandra can handle fast data insertions but have more challenges with upserts, which are critical for web traffic monitoring across end-customer behavior and tracking web requests. More importantly however, Cassandra does not provide native support for analytics and requires users to bring in additional components like SparkSQL in order to support meaningful querying of data.
We created the following query for Cassandra:
Upsert query for Cassandra
update perfdb.records set hits = hits + 1 where timestamp_of_data=1470169743185 and customer_code=25208 and subcustomer_id='ESKWUEYXUKRB' and geographic_region=10 and billing_flag=1 and ip_address='220.127.116.11';
The upsert benchmark is based on a simulated workload that logs web traffic across ten different geographic regions. MemSQL 5.1 runs on a 10 node m4.10xlarge cluster on AWS, at $2.394 per Hour (effective pricing with 1-year reserved instances), and is able to execute up to 8 million upserts per second and simultaneously run live queries on the latest data to provide a real-time window on the changing shape of traffic.
Cassandra running on an identical cluster achieves 1.5 million upserts per second. We tested the most recent 3.0.8 version of Apache Cassandra. In the Cassandra query, update means upsert.
As noted in the following chart, MemSQL scales linearly as we increase the number of machines with a batch size of 500. Cassandra however, does not appear to support large batch sizes well. According to Cassandra,
# Caution should be taken on increasing the size of this threshold as it can lead to node instability. # Fail any batch exceeding this value. 50kb (10x warn threshold) by default.
So we set
batch_size_fail_threshold_in_kb: 5000 to support a 10,000 row batch size, but we encountered numerous errors that prevented the benchmark from running on Cassandra with these settings.
Additional differences between datastores
It is important to note that MemSQL has consistent and durable transactions whereas Cassandra has eventually consistent, non-transactional counters with less stringent semantics. So every insert that MemSQL processes is transactionally consistent and immediately available. Cassandra’s counters instead offer eventual consistency without transactional semantics. In short MemSQL retains ACID compliance and Cassandra does not, and retaining ACID guarantees are an important part of building an application for critical environments like billing.
Upserts are only half the story, analytics are the rest
Of course, the incoming data is only part of the picture. A critical complement is the ability to run sophisticated queries on the data as soon as it is recorded.
Unlike MemSQL, Cassandra does not support analytical queries natively and users are forced to either install SparkSQL on the same cluster or on another cluster. In addition to the extra complexity, users face additional latency due to SparkSQL having to read data into Spark dataframes before running the analytical queries.
Sophisticated queries run natively in MemSQL
Since MemSQL is a fully relational database with a native SQL engine, users can run sophisticated queries immediately after storing data. Here are a few sample queries that could be used in an internet billing application.
The following query lists the top five customers for each of the ten geographic regions based on the number of bytes delivered.
Top 5 customers in each geographic region
select * from ( select *, row_number() over (partition by geographic_region order by bytes desc rows between unbounded preceding and current row) as rank from records where timestamp_of_data > now() - interval 1 minute ) as REGIONS_WINDOWED where rank <= 5;
CDNs need to monitor surges in traffic in order to load balance or provision additional capacity. The following query lists top ten customers in terms of increase in traffic.
Top 10 increases in the last one second span
with RECORDS_AGGREGATED as ( select timestamp_of_data, subcustomer_id, sum(bytes) as bytes, UNIX_TIMESTAMP(timestamp_of_data) as epoch_seconds from records where timestamp_of_data between (now() - interval 2 second) and (now() - interval 1 second) group by timestamp_of_data, subcustomer_id ), RECORDS_WINDOWED as ( select * ,lead(bytes, 1) over (partition by subcustomer_id order by timestamp_of_data desc rows between unbounded preceding and current row) as bytes_prev ,lead(epoch_seconds, 1) over (partition by subcustomer_id order by timestamp_of_data desc rows between unbounded preceding and current row) as epoch_seconds_prev from RECORDS_AGGREGATED ), TRAFFIC_SURGES as ( select timestamp_of_data, subcustomer_id, epoch_seconds, bytes, bytes - if(epoch_seconds_prev = epoch_seconds - 1, bytes_prev, 0) as bytes_change from RECORDS_WINDOWED where epoch_seconds = unix_timestamp(now()) - 1 ) select timestamp_of_data, subcustomer_id, epoch_seconds, bytes, bytes_change from TRAFFIC_SURGES order by bytes_change desc limit 10;
Usually, steep increases in web traffic can be attributed to increased customer interest due to special events or promotions. However, in some cases, the steep increases are caused by Denial of Service attacks. The following query generates a DoS attack alert if the steep increase is caused by large number of requests from a relatively small number of IP addresses. The steep increase is defined as an increase by more than two standard deviations from the mean.
with RECORDS_AGGREGATED as ( select timestamp_of_data, subcustomer_id, sum(bytes) as bytes, UNIX_TIMESTAMP(timestamp_of_data) as epoch_seconds, approx_count_distinct(ip_address) as unique_ip_addrs from records where timestamp_of_data between (now() - interval 2 second) and (now() - interval 1 second) group by timestamp_of_data, subcustomer_id ), RECORDS_WINDOWED as ( select * ,lead(bytes, 1) over (partition by subcustomer_id order by timestamp_of_data desc rows between unbounded preceding and current row) as bytes_prev ,lead(epoch_seconds, 1) over (partition by subcustomer_id order by timestamp_of_data desc rows between unbounded preceding and current row) as epoch_seconds_prev from RECORDS_AGGREGATED ), TRAFFIC_SURGES as ( select timestamp_of_data, subcustomer_id, epoch_seconds, bytes, unique_ip_addrs, bytes - if(epoch_seconds_prev = epoch_seconds - 1, bytes_prev, 0) as bytes_change from RECORDS_WINDOWED where epoch_seconds = unix_timestamp(now()) - 1 ) select timestamp_of_data, subcustomer_id, epoch_seconds, bytes, bytes_change from TRAFFIC_SURGES where unique_ip_addrs < 100 and bytes_change > (select avg(bytes_change) + 2*stddev(bytes_change) from TRAFFIC_SURGES) order by bytes_change desc limit 10;
All together, MemSQL presents a compelling option both in terms of linear scale performance, but also the ability to run sophisticated queries with ease using SQL.
Full benchmark details on GitHub.com
For complete details of the benchmark including scripts, please visit https://github.com/memsql/upsert-benchmark.
Feel free to download MemSQL if you’d like to try it yourself.