Recovering from / avoiding partial commits?

Hi -

I’m using MemSQL 6.7.1. One of the things we’ve noticed is that, under certain circumstances, a transaction can be partially committed and not properly rolled back. This happens when we’re:

  • Inserting multiple records in a single transaction
  • The records are distributed across multiple partitions / leaf nodes
  • One of the leaf nodes loses connection / has a failure

The error we would see looks like this:

Lost connection to a leaf while committing transactions in the cluster. This transaction may be partially committed. Run SHOW WARNINGS for more details

With a little bit of setup, you can reproduce this locally using some scripts to simulate a lost server together with a network connection failure. The sql commands end up executing like so:

memsql> start transaction;
Query OK, 0 rows affected (0.00 sec)

memsql> insert into facts_test_metrics values
    -> (1, 1, 1, 1547406001, 1),
    -> (2, 2, 2, 1547406002, 1),
    -> (3, 3, 3, 1547406003, 1),
    -> (4, 4, 4, 1547406004, 1),
    -> (5, 5, 5, 1547406005, 1),
    -> (6, 6, 6, 1547406006, 1),
    -> (7, 7, 7, 1547406007, 1),
    -> (8, 8, 8, 1547406008, 1),
    -> (9, 9, 9, 1547406009, 1),
    -> (10, 10, 10, 1547406010, 1),
    -> (11, 11, 11, 1547406011, 1),
    -> (12, 12, 12, 1547406012, 1),
    -> (13, 13, 13, 1547406013, 1),
    -> (14, 14, 14, 1547406014, 1),
    -> (15, 15, 15, 1547406015, 1),
    -> (16, 16, 16, 1547406016, 1),
    -> (17, 17, 17, 1547406017, 1),
    -> (18, 18, 18, 1547406018, 1),
    -> (19, 19, 19, 1547406019, 1),
    -> (20, 20, 20, 1547406020, 1),
    -> (21, 21, 21, 1547406021, 1),
    -> (22, 22, 22, 1547406022, 1),
    -> (23, 23, 23, 1547406023, 1),
    -> (24, 24, 24, 1547406024, 1),
    -> (25, 25, 25, 1547406025, 1),
    -> (26, 26, 26, 1547406026, 1),
    -> (27, 27, 27, 1547406027, 1),
    -> (28, 28, 28, 1547406028, 1),
    -> (29, 29, 29, 1547406029, 1),
    -> (30, 30, 30, 1547406030, 1),
    -> (31, 31, 31, 1547406031, 1),
    -> (32, 32, 32, 1547406032, 1),
    -> (33, 33, 33, 1547406033, 1),
    -> (34, 34, 34, 1547406034, 1),
    -> (35, 35, 35, 1547406035, 1),
    -> (36, 36, 36, 1547406036, 1),
    -> (37, 37, 37, 1547406037, 1),
    -> (38, 38, 38, 1547406038, 1),
    -> (39, 39, 39, 1547406039, 1),
    -> (40, 40, 40, 1547406040, 1);
Query OK, 40 rows affected (2.01 sec)
Records: 40  Duplicates: 0  Warnings: 0

memsql> commit;

ERROR 1856 (HY000): Lost connection to a leaf while committing transactions in the cluster. This transaction may be partially committed. Run SHOW WARNINGS for more details.
memsql> show warnings;
+-------+------+----------------------------------------------------------------------------------------------------------------------------------------------------------+
| Level | Code | Message                                                                                                                                                  |
+-------+------+----------------------------------------------------------------------------------------------------------------------------------------------------------+
| Error | 2004 | Leaf Error (127.0.0.1:3313): Error reading from socket. Errno=104 (Connection reset by peer)                                                             |
| Error | 2004 | Leaf Error (127.0.0.1:3313): Error reading from socket. Errno=104 (Connection reset by peer)                                                             |
| Error | 2004 | Leaf Error (127.0.0.1:3313): Error reading from socket. Errno=104 (Connection reset by peer)                                                             |
| Error | 1856 | Lost connection to a leaf while committing transactions in the cluster. This transaction may be partially committed. Run SHOW WARNINGS for more details. |
+-------+------+----------------------------------------------------------------------------------------------------------------------------------------------------------+
4 rows in set (0.00 sec)

memsql> select count(1) from facts_test_metrics;
ERROR 1777 (HY000): Partition reporting:4 has no master instance.

memsql> rollback;
Query OK, 0 rows affected (0.00 sec)

memsql> select count(1) from facts_test_metrics;
+----------+
| count(1) |
+----------+
|       37 |
+----------+

We’re wondering if there’s anything that can be done to recover in a scenario like this. The only things we can think of are:

  • Stop inserting multiple records in a single transaction (effectively stop using transactions).
  • Stop inserting records across multiple partitions in a single transaction (requires knowledge of how records get distributed)
  • Delete existing records that match some custom identifier before attempting the transaction again (if we detect a partial commit error from memsql)

Any other thoughts? Has this scenario been improved in newer versions of MemSQL? Any better way to approach this?

Thanks for any advice!

Hi Aaron,

This scenario is being improved. Stronger transaction support is currently slated for MemSQL 7.5 later this year (its being built as we type). This error msg won’t be possible at that point. In the meantime I think you have summarized the options pretty well. Sometimes it may also be possible to switch to a reference table (if the table is small and doesn’t need the parallelism of being sharded). Otherwise, MemSQL doesn’t do a lot to help you cleanup a partial transaction today. Sorry, I don’t better news on this front.

-Adam

1 Like

Thanks for the follow-up – looking forward to 7.5!