Ran Out of Disk Space & Unable to Connect to the Master Partition to Drop Tables

Hi,

It appears one of our queries used far more disk than expected overnight, and our hard drives are now nearly full. The database & nodes appear to be in a critical state according to the MemSQL Studio dashboard, and restarting those nodes via memsql-admin hasn’t helped.

Is there a way to connect to this node “offline” and drop tables while it’s in a critical state? The current hard disk is fairly large, so we’re unable to simply copy everything to an even larger RAID array. I don’t see any other unused files we could possibly delete, as 100% of this RAID array is dedicated to MemSQL.

Here’s the error I see when attempting to drop a table:

ERROR 1772 ER_DISTRIBUTED_PARTITION_MOVE_FAILED: Failed to find a master partition for db_data

Hi ymonye,

Which version of memsql are you running?

There are two things you can try to free up disk space.

  1. If running MemSQL 7.0 or higher, set snapshots_to_keep to 1 on all nodes. This should free up a bunch of disk space almost right away and hopefully allow you to delete some data. This doesn’t work in MemSQL 6.X.

  2. Manually drop slave partitions via DROP PARTITION db_name:X ON 'host':port. You’ll need to do this very carefully (don’t accidentally drop a master partition). You can see which hosts have slave partitions via select * from information_schema.distributed_partitions where ROLE='Slave';. You can restore the slave partitions via RESTORE REDUNDANCY when you’re done freeing up disk space.
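
For example, the sequence might look something like this (here partition 3 and a leaf at '10.0.0.2':3306 are placeholders for your own values; double-check the ROLE column before dropping anything):

-- placeholders: partition 3 of db_data, slave copy on leaf '10.0.0.2':3306
select * from information_schema.distributed_partitions where ROLE='Slave';
DROP PARTITION db_data:3 ON '10.0.0.2':3306;
RESTORE REDUNDANCY ON db_data;

The last statement recreates the dropped slave partitions once you’ve freed up enough space and deleted the extra data.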

I’m on 7.1.3. I will try the first option, then the second. Thanks

Is there an elegant way of setting the snapshots_to_keep variable to 1? I searched the documentation online and only saw it referenced via the database call memsql> SHOW VARIABLES;

Thanks

I assume you’re using memsql tools and not ops? You can use this if so (ops has equivalent functionality if you’re using ops):

memsql-admin update-config --all --set-global --key "snapshots_to_keep" --value "1"

memsql-admin restart-node --all

Thank you, that did the trick & I was able to restart the nodes to then delete the extra tables.

Hi all,

I’m having this same problem again, only I’d created a new RowStore table and apparently filled it beyond my disk / RAM capacity with an “INSERT INTO <rowstore_table> SELECT * FROM <old_columnstore_table>” query.

I am on MemSQL 7.1.3, and this system absolutely refuses to come back online; it’s stuck in a recovering state.

It is showing that no tables exist, which is not good. Here is the current node state:

+------------+------------+-----------+------+---------------+--------------+---------+----------------+--------------------+--------------+
| MemSQL ID  | Role       | Host      | Port | Process State | Connectable? | Version | Recovery State | Availability Group | Bind Address |
+------------+------------+-----------+------+---------------+--------------+---------+----------------+--------------------+--------------+
| 172BECC21C | Master     | 127.0.0.1 | 3307 | Running       | True         | 7.1.3   | Online         |                    | 0.0.0.0      |
| D783110BFB | Aggregator | 127.0.0.1 | 3308 | Running       | True         | 7.1.3   | Online         |                    | 0.0.0.0      |
| 5014EE9650 | Leaf       | 127.0.0.1 | 3309 | Running       | True         | 7.1.3   | Recovering     | 1                  | 0.0.0.0      |
+------------+------------+-----------+------+---------------+--------------+---------+----------------+--------------------+--------------+

So I’m not sure whether this database is unable to fully recover because I’d created a RowStore table that exceeded my available total RAM (is there a way I can directly delete this?), or because my database entered a corrupted state, which I hope is not the case. This error did, however, start as soon as I’d attempted to copy an existing ColumnStore table into a new RowStore table using the “INSERT INTO <new_table> SELECT * FROM <old_table>” command.

This database is currently on a trial (development) version of MemSQL 7.1.3, while our production data is on the standard version of MemSQL on another machine. However, this current database has the more up-to-date data, and I would very much like to salvage whatever I can, as losing it would set our work back several months.

If this database is truly unable to recover due to my RAM limitations (I currently have 128 GB on this Ubuntu 18.04.4 LTS machine), would it be possible to create a temporary swap file of, for example, 1 TB to fool MemSQL into thinking I have more than 128 GB, and then recover the database?
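
For reference, I imagine the swap file itself would be created along these lines (with /swapfile as a placeholder path on a volume that still has free space), though I’m not certain MemSQL would count it toward available memory:

# hypothetical 1 TB swap file; /swapfile is just a placeholder path
sudo fallocate -l 1T /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile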

I ran the following command across my master, aggregator, and leaf nodes with no success:

memsql-admin update-config --key maximum_memory --value 500000 --memsql-id
memsql-admin update-config --key maximum_memory --value 500000 --memsql-id
memsql-admin update-config --key maximum_memory --value 500000 --memsql-id

I’d also attempted the same directly in mysql:

set @@global.maximum_memory=500000;
set @@global.maximum_table_memory=500000;

However, MemSQL still sees 128 GB, and I’m not sure whether that would have helped anyway.

All 24 partitions are still showing as “Recovering” 12 hours later. I know exactly which RowStore table caused this issue, as well as its name… if only I could delete it directly from my hard disk…

I’d highly appreciate any help with this matter, as, again, being unable to recover this database would set us back several months.

Thanks,
Nick

Does your company have access to MemSQL support? They will be able to help troubleshoot this for you. We can work through it here too, but it may take a bit longer.

Is this cluster running with redundancy 2? If it is, we can drop replica partitions to free up space and let things recover.
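
If you’re not sure of the redundancy level, it should be visible on the master aggregator with something like:

SHOW VARIABLES LIKE 'redundancy_level';

A value of 2 means each partition has a replica that could be temporarily dropped.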

Can you send a cluster-report to bug-reports@memsql.com? From that we can tell exactly why the leaf is not coming back online.

-Adam

Hey Adam,

Looks like after a little patience, these nodes finally came back online hours later. I was then able to delete the trouble-maker RowStore table.

Thanks

Sounds good. Recovery shouldn’t typically take that long (12 hours you said?).

Can you share the SHOW CREATE TABLE of the new big rowstore table? Send us over a cluster report if you want us to check on why it took so long to recover.

-Adam