How to configure for bulk loading

andrea.woo · May 14, 2019, 9:03am

Hello,

I would like to know how to raise the performance for massive data loading.
As online documentation, Memsql support two types of parallel bulk loading.
I have tried LOAD DATA INFILE and FS PIPELINE. However, I am not sure those worked well.
Please check my HW information in the below and recommend how to ingest data much faster.

Server : (28 CPUs, 192GB RAM, 1TB DISK) per node
Cluster : 1 master, 1 child, 2 leaf node
Test data : 100GB

Jacky · May 15, 2019, 6:42pm

Hi Andrew

I am curious where is your data source currently residing – Kafka, AWS, Azure, etc? Or is it NFS?

JoYo · May 15, 2019, 6:46pm

FS Pipeline ought to be the fastest way to load your data, assume its split into many files. Do you have a single large file or many small files?
You can use PROFILE PIPELINE to understand the bottlenecks in your pipeline performance, but it might help to start by sharing your CREATE PIPELINE statement.
FWIW, the two aggs won’t really help with pipelines, so that’s 56 wasted CPUs (for ingest). With large nodes like this, it might be better to install leaves co-located with your aggregators, since leaves most of the heavy lifting (in most workloads).

andrea.woo · May 16, 2019, 12:31am

Hi Jacky,

I have tried FS pipeline through NFS. So, NFS is mounted for each leaf node. I also checked pipeline worked well as adding the new file to the NFS.

Thanks.

andrea.woo · May 16, 2019, 1:35am

Hi JoYo,

As you mentioned, I am sure that FS Pipeline should be much faster than Load data infile by only using leaf node resources. I’ve just confirmed some factor regarding FS Pipeline. However, I want to know how to scale the factor for the best optimization. Please check the option in the below.

test data set : 287,997,024 rows / total 50GB GB( each file 500MB )

2 leaves -> 3 leaves
How many partitions? now I set 32 partitions.
HA async
timestamp shard key

And one more question.
Is it possible to install aggregator and leaf on the same host? or I need to organize VM instance to split resource?

Thanks,
Andrea

JoYo · May 16, 2019, 5:38pm

An even number of leaves is required for HA. You can have multiple memsql nodes on a host, and its a good idea to have one memsql per numa node if your machines have multiple sockets. If you’re using ops, you can use SingleStoreDB Cloud · SingleStore Documentation.
1 partition per 2 cores is a pretty good default configuration, you’re mileage may vary.
Cool!
Timestamp is usually a bad shard key. If the timestamp is a literal timestamp of insert column, all your data for a single batch (many files) will end up on a single partition, which is not particularly useful. No shard key is probably better than a timestamp shard key, but timestamps make excelent clustered columnstore indexes.

And yes, with large hosts its a good idea to install multiple memsql nodes per host. No need to split the VMs!

andrea.woo · May 17, 2019, 4:17am

Hi JoYo,

thanks a lot. I got many clue from your feedback.

Regards,
Andrea