Creating Thousands of Tables vs. One Table for Improved Query Execution Performance: Leaf Errors & Max RAM Utilization

I’m sure this has been asked before, but I couldn’t find anything (via Google) dealing specifically with MemSQL.

Scenario:


Let's say I'd like to index historical (multi-year) transactional data from various files into MemSQL, using NodeJS & 'for loops'. The method I'm using today writes that data into one large columnstore table, which reaches several million rows over time. To keep it simple, the table has the following layout:

Table Name: tbl_transactions
Columns: transaction_id (Primary Key) | client_from | client_to | amount | timestamp

I have queries executed from a front-end website to display client-specific information when given a client_id. For example, with client_id = 100:

SELECT * FROM tbl_transactions WHERE (client_from = '100' OR client_to = '100');

The Problem:


These queries often take anywhere from several seconds to a few minutes to execute, simply because of the sheer amount of data in that table. Adding specific columns to the shard key hasn't really helped either ('client_from' & 'client_to' are part of the shard key).

I’d like to greatly reduce the time it takes to query this several-million-row table for specific client information. Storing the table as a rowstore was not an option, since that would exceed the machine’s total available RAM.

Attempted Solution:


I then had the idea of creating a separate table for each client and populating it whenever my NodeJS script finds data specific to that client (instead of writing to one large table), for example:

tbl_transactions_client_100
tbl_transactions_client_101

Before I could begin, I had to increase both the maximum pooled connections & maximum MemSQL / MySQL connections:

memsql-admin update-config --key max_pooled_connections --value 100000 --all
memsql-admin restart-node --all

<login to memsql / mysql to increase the max_connections & verify changes>

SET GLOBAL max_connections = 100000;
show variables like 'max%connections';


NodeJS Code Snippet:

[FOR LOOP CODE]

client_id = i;

//==============================
// Create New Table If New Client ID
//==============================

var script_01 =
    'CREATE TABLE IF NOT EXISTS tbl_transactions_client_' + client_id +
    ' (transaction_id VARCHAR(10), client_from VARCHAR(10), client_to VARCHAR(10),' +
    ' amount NUMERIC(11, 2), timestamp DATETIME,' +
    ' KEY (transaction_id) USING CLUSTERED COLUMNSTORE);';

con.query(script_01, function (err) {
    if (err) throw err;

    //==============================
    // Insert Data Into Table
    //==============================

    var values = …; // Fill these in
    var script_02 = 'INSERT INTO tbl_transactions_client_' + client_id + ' VALUES (?, ?, ?, ?, ?);';

    con.query(script_02, values, function (err) {
        if (err) throw err;
    });

});
[CLOSE FOR LOOP]

Resulting Error:


As you can imagine, I end up with several thousand tables. This worked as a temporary solution, but over time MemSQL fully utilizes all of the system's available RAM (128 GB) due to the table creations and displays the following error in the terminal:

Error: ER_UNKNOWN_PARTITION: Unable to connect to leaf @127.0.0.1:3309 with user distributed, using password YES: [2004] Cannot connect to '127.0.0.1':3309. Errno=111 (Connection refused)

Follow Up:


Two questions here:
  1. Is there a way to circumvent this error and reduce the amount of RAM MemSQL utilizes?
  2. Is creating one table per client a correct approach for improving overall query performance & execution time, or is there a better solution?

Again, the end goal is to greatly reduce the amount of time it takes to query my several-million-row table for specific client information.

Thanks

One table is generally the more elegant solution here. Creating a table per client id will limit the ways in which you can efficiently query your data, and thousands of tables also carry significant metadata overhead.

It’s unexpected that a columnstore table with only a few million rows takes multiple seconds to execute a query with a couple filters. Can you share the profile of your query?
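For reference, one way to capture a profile in MemSQL is to prefix the query with PROFILE and then run SHOW PROFILE:

PROFILE SELECT * FROM tbl_transactions WHERE (client_from = '100' OR client_to = '100');
SHOW PROFILE;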

Anyway, this sounds like a good use case for columnstore hash indices, a new feature in MemSQL 7.0. Try creating a hash index on each of client_from and client_to - this should speed up your equality filters.
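For example, a minimal sketch of the single-table schema with hash indices added (column types copied from the NodeJS snippet above; adjust to your real schema):

CREATE TABLE tbl_transactions (
    transaction_id VARCHAR(10),
    client_from VARCHAR(10),
    client_to VARCHAR(10),
    amount NUMERIC(11, 2),
    timestamp DATETIME,
    KEY (transaction_id) USING CLUSTERED COLUMNSTORE,
    KEY (client_from) USING HASH,
    KEY (client_to) USING HASH
);

The equality filters in the original query (client_from = '100' OR client_to = '100') should then be able to seek via these indices instead of scanning every row.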

By the way, is there a reason that client_from and client_to are varchar instead of int? You may find that you get better performance with int.

Oooo, very nice! Thanks @cwatts, I’m checking out MemSQL 7.0 now.

The reason ‘client_from’ & ‘client_to’ are varchars is that they are KECCAK-256 hashes. Would there be a more efficient way of storing this data?

My actual columnstore table has additional columns that are also part of the shard key:

  • status (‘successful’ OR ‘failed’)
  • address
  • client_from
  • client_to
  • date (YYYY-MM-DD)
  • update_bool
  1. My main query only filters on ‘client_from’ & ‘client_to’ (the SELECT statement from my initial post), but my application has more advanced features that let users run sophisticated queries against that table using the other shard-key fields. Will my main query take a performance hit if it only filters on ‘client_from’ & ‘client_to’ while additional, unreferenced columns exist in the shard key? For example: would it make sense to shard one table only on ‘client_from’ & ‘client_to’, then create a separate table whose shard key includes those plus the other columns, and have the sophisticated queries hit that table? This question also applies to MemSQL 7.0’s hash indices & whether I’m better off with two separate tables.

  2. I have a particular shard-key field, ‘update_bool’: a boolean I periodically update (using SQL UPDATE) whenever I change any fields in a particular row. I may or may not introduce this field as a filter in future SELECT queries (SELECT * … WHERE update_bool = 1). Will MemSQL properly keep this column in the shard key if its value changes via the UPDATE command?

  3. For MemSQL 7.0, would I have to completely replace the shard key in my ‘CREATE TABLE’ statement with hash indices, or can shard keys & hash indices co-exist?

I believe I already have the syntax correct, and I'm currently running test cases to compare performance across two tables with identical fields:

TABLE 1:

… KEY (transaction_id) USING CLUSTERED COLUMNSTORE, SHARD KEY (status, address, client_from, client_to))

TABLE 2:

… KEY (transaction_id) USING CLUSTERED COLUMNSTORE, KEY (status) USING HASH, KEY (address) USING HASH, KEY (client_from) USING HASH, KEY (client_to) USING HASH)

The reason ‘client_from’ & ‘client_to’ are varchars is that they are KECCAK-256 hashes. Would there be a more efficient way of storing this data?

varchar is fine here.

  1. The main way queries can take advantage of shard keys is by executing a query only on the relevant partition (aka shard). Filtering (or joining) on columns that are a strict subset of your shard key does not allow us to do this because we use hash partitioning.
    This means that if your shard key includes more columns than client_from, client_to and you only filter on those columns, we need to send the query to all partitions. Note that in general we don’t expect this to hurt query latency, but overall query cost increases. I recommend testing query performance and concurrent query throughput with different shard keys to see what works best for you. Only sharding on client_from, client_to can introduce data skew if there are “hot” client pairs. You may find our documentation on optimizing shard keys helpful: Optimizing Table Data Structures · SingleStore Documentation
    As for hash indices, you can add a hash index to any (eligibly-typed) column and we will use it for equality filters. They generally work best on columns with high cardinality.

  2. Updates to columns that are part of the shard key are not supported.

  3. You should be able to add hash indices to columns that you want to equality filter on without changing the rest of your table schema.
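To make the pruning behavior from item 1 concrete, here is a sketch (client ids are illustrative; assumes SHARD KEY (client_from, client_to)):

-- Filtering on the full shard key hashes to exactly one partition,
-- so the query can be routed there alone:
SELECT * FROM tbl_transactions WHERE client_from = '100' AND client_to = '200';

-- Filtering on a strict subset of the shard key (like this query, or the
-- OR filter from the original post) must be sent to every partition:
SELECT * FROM tbl_transactions WHERE client_from = '100';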

Thanks,

Looks like I have no choice but to create multiple test tables with different shard-key combinations & compare performance from there.

So I ran into an error while trying to apply UPDATE commands to my update_bool column, which is part of the shard key, and learned that's unsupported. No biggie, but then would it even make sense to add this particular column as a hash index? Again, I may run additional queries with the filter (SELECT * FROM ... WHERE update_bool = 1) to find rows that were modified at some point. By default these are set to 0, and I only apply a value of 1 once all data is fully ingested & processed on my server (so all rows are eventually updated with update_bool = 1).

I did some more reading on hash indices.

One thing I don’t quite understand… does this feature replace shard keys on these columns? Would it make sense to apply both a shard key AND a hash index to my columns of interest, in this case client_from & client_to? I still don’t understand the advantage if my columns of interest are already in the shard key.

One thing I don’t quite understand… does this feature replace shard keys on these columns?

Shard keys are primarily a tool to evenly distribute your data across partitions (and nodes), increasing the parallelism of your query. When you filter on a superset of the shard key, they have the secondary effect of allowing partition pruning: we can search for the matching rows only on the partition that contains them. Pruning does not generally make an individual query faster, since without it the query would simply execute on every partition in parallel (unless queries are getting queued due to resource scarcity). Instead, it makes the query cheaper from a resource-utilization perspective, which allows you to run more concurrent queries.

On the other hand, hash indices are used for seeking directly to specific rows, speeding up equality filters. This improves query latency - instead of scanning the whole table, we seek to the relevant part.

To put it another way, filtering on a superset of the shard key gives us a general idea of where the relevant rows are. Filtering on hash-indexed columns tells us exactly where they are.
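Putting the two together, a shard key and hash indices can coexist on the same columns. A sketch (whether these shard-key columns are a good choice depends on the skew caveats discussed earlier):

CREATE TABLE tbl_transactions (
    transaction_id VARCHAR(10),
    client_from VARCHAR(10),
    client_to VARCHAR(10),
    amount NUMERIC(11, 2),
    timestamp DATETIME,
    SHARD KEY (client_from, client_to),
    KEY (transaction_id) USING CLUSTERED COLUMNSTORE,
    KEY (client_from) USING HASH,
    KEY (client_to) USING HASH
);

The shard key decides which partition each row lives on; the hash indices let each partition seek straight to the matching rows.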

then would it even make sense to add this particular column as a hash index?

Columns with low cardinality will not benefit much from hash indices. If half of all rows have update_bool = 1, we’re going to have to scan the whole table anyway.


Thanks, got it; it makes sense to me now. My test cases are coming along well (shard-key tables, hash-index tables, and a combination of both), so I can see the performance of my queries & CPU utilization in real time.

What is the best practice for tables with dates as column data? For example, I have one table with a timestamp per transaction (YYYY-MM-DD HH:mm:ss), and none of that data will be sharded. However, there's a second column I created manually with just the date and no time (YYYY-MM-DD), and I plan to apply filters on it every now & then:

(SELECT * ... WHERE date = '2019-12-15')

or

(SELECT * ... WHERE date >= '2019-12-01' AND date < '2020-01-01').

Would it make sense to add the date column as both a shard key and a hash index? We're talking about 5 years of historical data.

Remember that we can only take advantage of the shard key for partition pruning in a filter query if the shard key is a subset of that query's filter columns. So be careful about adding many columns to your shard key if your primary concern is partition pruning. Also, we generally cannot do partition pruning for range filters - this is because we do hash partitioning rather than range partitioning.

We also currently do not take advantage of hash indices for range filters. If the date column is the main column you'll be doing range filters on, I suggest experimenting with making it your sort (clustered columnstore) key instead of transaction_id (you would then probably want to add a hash index on transaction_id).
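For example, a sketch of that layout (same abbreviated schema as earlier in the thread, with the extra date column):

CREATE TABLE tbl_transactions (
    transaction_id VARCHAR(10),
    client_from VARCHAR(10),
    client_to VARCHAR(10),
    amount NUMERIC(11, 2),
    timestamp DATETIME,
    date DATE,
    SHARD KEY (client_from, client_to),
    KEY (date) USING CLUSTERED COLUMNSTORE,
    KEY (transaction_id) USING HASH
);

-- Range filters can then take advantage of the date-sorted segments:
SELECT * FROM tbl_transactions WHERE date >= '2019-12-01' AND date < '2020-01-01';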
