High Availability in SingleStore Helios

Brian Finnegan

Principal Cloud Solutions Engineer

SingleStore Helios enables and manages high availability for you, and its Premium edition extends this to multi-AZ high availability. This blog post is intended to help customers understand the high availability features of SingleStore Helios, and to demonstrate failover when a problem occurs.

Availability vs High Availability

I started my career as an Avionics Engineer. I had great exposure to aircraft systems that can safely tolerate component malfunctions. Since then, I’ve maintained what one could arguably describe as an unhealthy interest in systems failure.

This has manifested itself in various ways: from reading NTSB reports and bingeing episodes of Air Crash Investigation, to reading the Post Event Summaries and Post Incident Reviews published by public cloud providers. I also worked for a time with a company that specialised in providing highly available IT systems. But before discussing high availability in depth, we need a definition of “availability”.

The Oxford English Dictionary defines availability as the ability of a system “to be used or obtained”. From a database perspective, it means the database system can serve its clients. Availability is typically measured in terms of uptime, normally expressed as a percentage. For example, a 99.99% availability Service Level Agreement (SLA) means the system can be down for just under 53 minutes in an entire year without breaching that SLA.
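
To make those percentages concrete, here is a small Python sketch of my own (not part of any SingleStore tooling) that converts an availability target into an annual downtime allowance, assuming a 365-day year:

MINUTES_PER_YEAR = 365 * 24 * 60

def annual_downtime_minutes(availability_pct: float) -> float:
    """Minutes of downtime permitted per year at a given availability percentage."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)

for sla in (99.0, 99.9, 99.99, 99.999):
    print(f"{sla}% availability -> {annual_downtime_minutes(sla):.1f} minutes of downtime per year")

# 99.99% works out to roughly 52.6 minutes per year, i.e. just under 53 minutes.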

High availability (commonly abbreviated to HA) is the ability of a system to provide higher uptime than normal. For example, in a dual-server, active-passive cluster, if the active server fails, the passive server takes over the active role. Service continues, and as a result the uptime of the service is higher than if it had been reliant on the single failed server. The clustering technology provides the high availability functionality.
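
The uptime gain from redundancy can be estimated too. Under the idealised assumptions that the two servers fail independently and that failover is instantaneous, the availability of the pair is one minus the probability that both are down at the same time. A rough Python sketch:

def pair_availability(single_server_availability: float) -> float:
    """Availability of an active-passive pair, assuming independent failures."""
    p_down = 1 - single_server_availability
    return 1 - p_down ** 2

single = 0.99  # a single server that is up 99% of the time
print(f"Single server: {single:.2%}")
print(f"Active-passive pair: {pair_availability(single):.4%}")
# A 99% server becomes roughly a 99.99% pair under these idealised assumptions.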

Any infrastructure component that forms part of a database management system can fail. Failure modes vary, but generally fall into these categories:

  • Failure of a single machine. My own experience of these failures ranges from hardware faults, such as RAID controller, NIC and RAM failures, to software faults, such as a memory leak in endpoint protection software causing a server to cease all network communications. High availability technologies at the single-machine level include dual power supplies, memory mirroring, dual-port HBAs, RAID and NIC teaming. Note that localised failures can have a broad effect, including the NIC failure that brought down an airport’s radar system.
  • Failure of more than one machine. This can include the failure of a shared device or service, such as a storage array, a top-of-rack switch, or the power or cooling to one or more racks in a data hall, or an infrastructure software bug like the one that affected VMware virtual machines. High availability can be attained by implementing “N”-level redundancy, e.g. 2N or N+1 servers and switches, powering racks with dual electrical feeds, etc.
  • Loss of an entire data centre. While loss of a data centre is less common than other failure modes, it can and does happen (examples include Northgate, OVHcloud and Twitter).
  • Loss of an entire region. While it’s unlikely that all data centres in a broad geographical region would be lost, it is possible due to natural disasters, including large earthquakes, tsunamis or heat waves. A regional outage can also be caused by civil unrest or war. Appropriate mitigations depend on availability requirements. Constraints may limit some mitigations, e.g. if data must reside in a particular geographical region. SingleStoreDB’s high availability architecture is limited to failure modes within regions (out-of-region disaster recovery is a topic for another day).

High Availability in SingleStore Helios

One of the design principles of SingleStoreDB is “no single point of failure,” and SingleStore Helios is highly available by default. It is important to know that there are two types of nodes in SingleStoreDB: aggregator nodes and leaf nodes.

Aggregators route queries to leaf nodes, aggregate results, and send results to clients. Aggregator high availability in SingleStore Helios is achieved using cloud provider load balancers, which distribute traffic to aggregators, and detect and recover from aggregator failure. This method of attaining high availability is common, is not unique to SingleStore Helios, and is not the focus of this blog post.
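
As a rough mental model of what such a load balancer does (my own simplified sketch, not how the cloud providers or SingleStore actually implement it), think of health-checked routing across a set of aggregator endpoints; the endpoint names below are made up for illustration:

import random

AGGREGATORS = ["agg-1:3306", "agg-2:3306", "agg-3:3306"]

def is_healthy(endpoint: str) -> bool:
    """Stand-in for a real health check (e.g. a TCP or SQL-level probe)."""
    return endpoint != "agg-2:3306"  # pretend agg-2 has failed

def pick_aggregator() -> str:
    """Route a new connection to any healthy aggregator endpoint."""
    healthy = [e for e in AGGREGATORS if is_healthy(e)]
    if not healthy:
        raise RuntimeError("no healthy aggregators available")
    return random.choice(healthy)

print(pick_aggregator())  # never returns the 'failed' agg-2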

A leaf node is responsible for a subset of a cluster’s data. There is always more than one leaf node in a SingleStore Helios Workspace. SingleStoreDB automatically distributes data across leaf nodes. High availability for leaf nodes is attained by replicating data across two availability groups. An availability group is a group of leaf nodes. Each availability group has a copy of every data partition, some as masters and some as replicas, so if a single node fails there is still a copy of its data in the system.

Master partitions are evenly distributed across a workspace’s leaf nodes, and replicas are spread evenly across the opposite availability group. This ensures that when a failover occurs, the additional load is balanced evenly across the workspace, rather than a single node taking the entire load of the failed node. Should a node fail, SingleStore Helios keeps the databases online by automatically promoting the relevant replica partitions to master partitions.

The following diagrams show partition distribution before and after a failover event. Master partitions are evenly dispersed across nodes in the first diagram. Replicas of master partitions in an availability group are spread evenly across leaf nodes in the other availability group. For example, “db_0 master” on Node 1 has a replica “db_0 replica” on Node 2. Similarly, db_1 has a replica on Node 4.

If Node 1 should fail, the replica of db_0 on Node 2 and the replica of db_1 on Node 4 are both promoted to masters. This provides the database with high availability.
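
To make the promotion step concrete, here is a toy Python model of the layout just described. The placement of db_0 and db_1 follows the diagrams; db_2 and db_3 are my own additions for illustration, and the real placement and promotion logic of course lives inside SingleStoreDB:

# Masters for db_0 and db_1 live on Node 1, as in the diagrams; db_2 and db_3
# live on Node 3, with replicas spread across the other availability group
# (Nodes 2 and 4).
placement = {
    "db_0": {"master": "Node 1", "replica": "Node 2"},
    "db_1": {"master": "Node 1", "replica": "Node 4"},
    "db_2": {"master": "Node 3", "replica": "Node 2"},
    "db_3": {"master": "Node 3", "replica": "Node 4"},
}

def fail_node(node: str) -> None:
    """Promote the replica of every partition whose master was on the failed node."""
    for partition, copies in placement.items():
        if copies["master"] == node:
            copies["master"], copies["replica"] = copies["replica"], None
            print(f"{partition}: replica on {copies['master']} promoted to master")

fail_node("Node 1")
# db_0: replica on Node 2 promoted to master
# db_1: replica on Node 4 promoted to master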

Note that with the SingleStore Helios Premium and Dedicated editions, the two availability groups are placed in different cloud provider availability zones. This ensures that your SingleStore Helios Workspace remains available in the event of an availability zone failure.

Demo Time

I can’t demonstrate a failover event on SingleStore Helios itself, so I’ll spin up a test environment using SingleStoreDB Self-Managed that I have on AWS. SingleStoreDB Self-Managed and SingleStore Helios have the same architecture, so while this demo is performed on SingleStoreDB Self-Managed, the result is the same on SingleStore Helios.

This was the state of the cluster when I started it:

You can see that while everything is healthy, high availability is off. This can also be seen by running sdb-admin show-cluster (output truncated for readability):

[ec2-user@ip-10-0-0-30 ~]$ sdb-admin show-cluster
✓ Successfully ran 'memsqlctl show-cluster'
+------------+------------+------+--------------------+
|    Role    |    Host    | Port | Availability Group |
+------------+------------+------+--------------------+
| Leaf       | 10.0.0.246 | 3306 | 1                  |
| Leaf       | 10.0.0.236 | 3306 | 1                  |
| Leaf       | 10.0.0.136 | 3306 | 1                  |
| Leaf       | 10.0.0.77  | 3306 | 1                  |
| Aggregator | 10.0.0.30  | 3306 |                    |
+------------+------------+------+--------------------+

You can see there is only a single availability group: ‘1’. I enable high availability by running sdb-admin enable-high-availability on the master aggregator:

[ec2-user@ip-10-0-0-30 ~]$ sdb-admin enable-high-availability
Toolbox will perform the following actions to upgrade the cluster to
redundancy 2
  · Configure the cluster to enable redundancy

  · Remove leaf nodes that need to be re-assigned to availability group 2
    - remove leaf node 10.0.0.77:3306
    - remove leaf node 10.0.0.136:3306

  · Add back existing leaf nodes and reassign them to availability group 2
    - add leaf node 10.0.0.77:3306 into availability group 2
    - add leaf node 10.0.0.136:3306 into availability group 2

  · Restore redundancy on each database
    - Restore redundancy on database "adtech"
    - Restore redundancy on database "test"

Would you like to continue? [y/N]: y
✓ Set cluster redundancy level to 2
✓ Removed leaf 10.0.0.77:3306 from cluster
✓ Removed leaf 10.0.0.136:3306 from cluster
✓ Re-added leaf 10.0.0.77:3306 to cluster
✓ Re-added leaf 10.0.0.136:3306 to cluster
✓ Executed RESTORE REDUNDANCY ON "adtech"
Operation completed successfully

You can see the two databases are adtech and test. I only need one database for my purposes, so I keep adtech and drop test. More information on the adtech database is available in our SingleStore documentation.

I check high availability status once more (output truncated again for readability):

[ec2-user@ip-10-0-0-30 ~]$ sdb-admin show-cluster
✓ Successfully ran 'memsqlctl show-cluster'
+------------+------------+------+--------------------+
|    Role    |    Host    | Port | Availability Group |
+------------+------------+------+--------------------+
| Leaf       | 10.0.0.246 | 3306 | 1                  |
| Leaf       | 10.0.0.236 | 3306 | 1                  |
| Leaf       | 10.0.0.77  | 3306 | 2                  |
| Leaf       | 10.0.0.136 | 3306 | 2                  |
| Aggregator | 10.0.0.30  | 3306 |                    |
+------------+------------+------+--------------------+

There are now two availability groups: ‘1’ and ‘2’.

One of the benefits of SingleStore Helios is that high availability is enabled by default, meaning you don’t need to concern yourself with the preceding low-level details to enable it. However, as I’m using a self-managed environment for these tests, I had no choice but to set it up myself.

Checking SingleStore Studio again, I can see my nodes and database are all healthy and that high availability is on:

Incidentally, you can see the version of SingleStoreDB Self-Managed I’m using is 7.3. This is an environment I built some time ago, and I haven’t upgraded it to SingleStoreDB 8.0 yet. SingleStore Helios version upgrades are managed for you by SingleStore, which is another good reason to use it instead of SingleStoreDB Self-Managed. The high-availability features I need for this demo are available in 7.3, so I’ll continue.

Options for simulating failure include:

  • Shutting down a leaf node. This should work, but I’m seeking something less graceful.
  • Forcibly detaching the network interface from a leaf node. This won’t work because I only have a single primary Elastic Network Interface (ENI) per EC2 instance. By design, AWS EC2 primary ENIs can’t be detached or force detached.
  • Changing the security group of a leaf node to one that denies necessary communications. This won’t work because existing connections are not immediately interrupted.
  • Forcibly killing the SingleStoreDB processes running on a leaf node. This should work, so it’s what I’ll use for this test.

Incidentally, when I asked our SRE team what they’ve observed in production, I was informed that failover for this type of hard failure mode occurs in approximately three seconds. That gives me a baseline; let’s see what my own tests show.

I’ll use dbbench for testing, following the getting started instructions. The only change I have to make is to run go install github.com/memsql/dbbench@latest instead of go get github.com/memsql/dbbench.

I create a query.sql file with the following contents:

SELECT
    events.advertiser AS `events.advertiser`,
    COUNT(*) AS `events.count`
FROM adtech.events AS events
WHERE
    (events.advertiser LIKE '%Subway%' OR events.advertiser LIKE '%McDonals%'
     OR events.advertiser LIKE '%Starbucks%' OR events.advertiser LIKE '%Dollar General%'
     OR events.advertiser LIKE '%YUM! Brands%' OR events.advertiser LIKE '%Dunkin Brands Group%')
GROUP BY 1
ORDER BY 2 DESC;

I create a simple dbbench configuration file called demo_time.ini with the following contents:

[failover test]
query-file=./query.sql

I start dbbench on my master aggregator as follows:

[ec2-user@ip-10-0-0-30 go]$ dbbench --host=127.0.0.1 --port=3306 --username root
--password <redacted> --database=adtech ./demo_time.ini

That will run continuously until I stop it — which is what I need for my purposes. I test my configuration when the cluster is healthy, and manually stop it after 10 runs:

[ec2-user@ip-10-0-0-30 go]$ dbbench --host=127.0.0.1 --port=3306 --username root
--password <redacted> --database=adtech ./demo_time.ini
2023/03/07 10:11:00 Connecting to
root:<redacted>@tcp(127.0.0.1:3306)/adtech?allowAllFiles=true&interpolateParams=true
2023/03/07 10:11:00 Connected
2023/03/07 10:11:00 starting failover test
2023/03/07 10:11:01 failover test: latency 5.785496ms±142.164µs; 172 transactions (172.466
TPS); 860 rows (862.330 RPS)
2023/03/07 10:11:02 failover test: latency 5.753536ms±130.945µs; 173 transactions (173.320
TPS); 865 rows (866.601 RPS)
2023/03/07 10:11:03 failover test: latency 5.768018ms±134.791µs; 173 transactions (172.936
TPS); 865 rows (864.682 RPS)
2023/03/07 10:11:04 failover test: latency 5.733945ms±129.323µs; 174 transactions (174.034
TPS); 870 rows (870.171 RPS)
2023/03/07 10:11:05 failover test: latency 5.777818ms±152.785µs; 173 transactions (172.725
TPS); 865 rows (863.627 RPS)
2023/03/07 10:11:06 failover test: latency 5.712914ms±127.154µs; 175 transactions (174.701
TPS); 875 rows (873.503 RPS)
2023/03/07 10:11:07 failover test: latency 5.650113ms±123.938µs; 176 transactions (176.618
TPS); 880 rows (883.092 RPS)
2023/03/07 10:11:08 failover test: latency 5.734043ms±142.509µs; 174 transactions (174.059
TPS); 870 rows (870.293 RPS)
2023/03/07 10:11:09 failover test: latency 5.715959ms±133.074µs; 175 transactions (174.531
TPS); 875 rows (872.657 RPS)
2023/03/07 10:11:10 failover test: latency 5.735491ms±142.657µs; 174 transactions (173.983
TPS); 870 rows (869.916 RPS)
^C2023/03/07 10:11:10 stopping failover test
2023/03/07 10:11:10 failover test: latency 5.731991ms±41.647µs; 1825 transactions (174.073
TPS); 9125 rows (870.364 RPS)
  4.194304ms -   8.388608ms [  1809]: ██████████████████████████████████████████████████
  8.388608ms -  16.777216ms [    16]: ▍
[ec2-user@ip-10-0-0-30 go]$

Now I run dbbench again. During the run, I simulate a failure by forcibly killing the SingleStoreDB processes on leaf 1 with sudo pkill '^memsql' (memsql being the heritage name for SingleStore). Here’s what happened in dbbench:

[ec2-user@ip-10-0-0-30 go]$ dbbench --host=127.0.0.1 --port=3306 --username root --password <redacted> 
--database=adtech ./demo_time.ini
2023/03/07 11:17:23 Connecting to
root:<redacted>@tcp(127.0.0.1:3306)/adtech?allowAllFiles=true&interpolateParams=true
2023/03/07 11:17:23 Connected
2023/03/07 11:17:23 starting failover test
2023/03/07 11:17:24 failover test: latency 5.691413ms±137.508µs; 175 transactions (175.299 TPS); 875 rows
(876.495 RPS)
2023/03/07 11:17:25 failover test: latency 5.765873ms±156.533µs; 173 transactions (173.051 TPS); 865 rows
(865.254 RPS)
2023/03/07 11:17:26 failover test: latency 5.754822ms±148.561µs; 173 transactions (173.402 TPS); 865 rows
(867.008 RPS)
2023/03/07 11:17:27 error for query SELECT
    events.advertiser  AS events.advertiser,
    COUNT(*) AS events.count
    FROM adtech.events  AS events
    WHERE
         (events.advertiser LIKE '%Subway%' OR events.advertiser LIKE '%McDonals%' OR events.advertiser
LIKE '%Starbucks%' OR events.advertiser LIKE '%Dollar General%' OR events.advertiser LIKE '%YUM!
Brands%' OR events.advertiser LIKE '%Dunkin Brands Group%')
    GROUP BY 1
    ORDER BY 2 DESC in failover test: Error 2004 (HY000): Leaf Error (10.0.0.246:3306): Error reading from
socket. Errno=104 (Connection reset by peer)

The error above occurred when I killed all memsql processes on leaf 1. I immediately started dbbench again:

[ec2-user@ip-10-0-0-30 go]$ dbbench --host=127.0.0.1 --port=3306 --username root --password <redacted>
--database=adtech ./demo_time.ini
2023/03/07 11:17:30 Connecting to
root:<redacted>@tcp(127.0.0.1:3306)/adtech?allowAllFiles=true&interpolateParams=true
2023/03/07 11:17:30 Connected
2023/03/07 11:17:30 starting failover test
2023/03/07 11:17:31 failover test: latency 10.59886ms±218.439µs; 94 transactions (94.227 TPS); 470 rows
(471.136 RPS)
2023/03/07 11:17:32 failover test: latency 10.746505ms±215.934µs; 93 transactions (92.937 TPS); 465 rows
(464.686 RPS)
2023/03/07 11:17:33 failover test: latency 10.836879ms±196.614µs; 92 transactions (92.168 TPS); 460 rows
(460.840 RPS)
2023/03/07 11:17:34 failover test: latency 10.660179ms±200.68µs; 94 transactions (93.698 TPS); 470 rows
(468.489 RPS)
^C2023/03/07 11:17:35 stopping failover test
2023/03/07 11:17:35 failover test: latency 10.7299ms±94.468µs; 433 transactions (93.083 TPS); 2165 rows
(465.415 RPS)
  8.388608ms -  16.777216ms [   433]:
██████████████████████████████████████████████████
[ec2-user@ip-10-0-0-30 go]$ 

As you can see, “immediately” for this human means I retried at 11:17:30, three seconds after the initial error occurred. My dbbench test again proceeded as normal because the failover had successfully taken place.

This is what the dashboard looked like:


Note that while one of five nodes is showing as critical (and licence capacity now shows three units allocated of four available), the database is only showing a warning. That’s because high availability is on, failover has occurred and clients can still be served.

I can deduce the failover time by having a look at the memsql.log file. As you might imagine, forcibly shutting down the SingleStoreDB processes on the leaf node caused a lot of logging, so some pertinent extracts are presented below.

Initially, we can see heartbeat error detection, with the last line in this block indicating that failover has been triggered:

6351167184 2023-03-07 11:17:27.490  ERROR: ProcessNetworkEvents Heartbeat
connection error reading heartbeat response header to 10.0.0.246:3306 (2 = End
of file)
6351167446 2023-03-07 11:17:27.490  ERROR: ProcessNetworkEvents Heartbeat
connection error connecting to 10.0.0.246:3306 (111 = Connection refused)
6351167464 2023-03-07 11:17:27.490  ERROR: ProcessNetworkEvents Failure during
heartbeat connecting to 10.0.0.246:3306
6352520622 2023-03-07 11:17:28.844   INFO: ProcessTransactions Node
10.0.0.246:3306 heartbeat failure summary. Initial heartbeat failure at
2023-03-07 11:17:27. 211 Consecutively Missed heartbeats. Failover was
triggered after 200 missed heartbeats

As the failed leaf is unreachable, we see that it is transitioned to offline:

6352517302 2023-03-07 11:17:28.840   INFO: ProcessTransactions Thread 115046:
HandleLeafFailure: Leaf 10.0.0.246:3306 is no longer reachable and will
transition from 'online' to 'offline'

The promotion of replica partitions commences (showing the logging for just one partition, for brevity):

6352552538 2023-03-07 11:17:28.876   INFO: Partition action PROMOTE PARTITION
WITH OFFLINE REPOINT partition: adtech:11 on node: '10.0.0.77':3306 starting

6352560287 2023-03-07 11:17:28.883   INFO: Pointing partition adtech:11[2000]
at adtech on 10.0.0.77:3306

6352563910 2023-03-07 11:17:28.887   INFO: Partition action PROMOTE PARTITION
WITH OFFLINE REPOINT partition: adtech:11 on node: '10.0.0.77':3306 SUCCESS

And here we can see that failover has completed:

6352564070 2023-03-07 11:17:28.887   INFO: Thread 115043: FailoverLeaf: Node 2
has been failed over.

We can see that in this instance, the time from failure detection to recovery was under two seconds: from the first heartbeat error at 11:17:27.490 to failover completion at 11:17:28.887 is roughly 1.4 seconds.
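
If you want to check that arithmetic, a few lines of Python against the two timestamps quoted from the log are enough:

from datetime import datetime

# Timestamps taken from the memsql.log extracts above.
first_error = datetime.strptime("2023-03-07 11:17:27.490", "%Y-%m-%d %H:%M:%S.%f")
failover_done = datetime.strptime("2023-03-07 11:17:28.887", "%Y-%m-%d %H:%M:%S.%f")

print(failover_done - first_error)  # 0:00:01.397000 -- roughly 1.4 seconds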

Summary

  • SingleStoreDB high availability keeps the service up when a component fails, providing higher uptime than reliance on that single component would.
  • My test demonstrated SingleStoreDB’s high availability functionality detecting and recovering from a node failure in under two seconds. That correlates well with what our SRE team has experienced in production. Note that failover for this hard-down failure mode can take up to three seconds.
  • During a failover event, there is a brief interruption to service. This is expected. As with any other service your application relies on, it is important to include retry logic with exponential backoff (a minimal sketch follows this list).
  • SingleStore Helios is highly available by default. You do not need to enable it or manage it: SingleStore looks after that for you.
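
Here is a minimal sketch of that retry pattern in Python. It is deliberately generic: run_query is a placeholder for whatever your database driver exposes, and the retry count and delays are arbitrary examples rather than recommendations.

import random
import time

def run_query():
    """Placeholder for a real call through your database driver."""
    raise NotImplementedError

def with_retries(operation, max_attempts: int = 5, base_delay: float = 0.1):
    """Retry an operation with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception as exc:  # in real code, catch your driver's specific error types
            if attempt == max_attempts:
                raise
            delay = base_delay * (2 ** (attempt - 1)) + random.uniform(0, base_delay)
            print(f"attempt {attempt} failed ({exc}); retrying in {delay:.2f}s")
            time.sleep(delay)

# Usage: with_retries(run_query)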

Interested in high availability for your workload? Get started with a free trial of SingleStoreDB today.

