How to configure for bulk loading


#1

Hello,

I would like to know how to raise the performance for massive data loading.
As online documentation, Memsql support two types of parallel bulk loading.
I have tried LOAD DATA INFILE and FS PIPELINE. However, I am not sure those worked well.
Please check my HW information in the below and recommend how to ingest data much faster.

Server : (28 CPUs, 192GB RAM, 1TB DISK) per node
Cluster : 1 master, 1 child, 2 leaf node
Test data : 100GB


#2

Hi Andrew

I am curious where is your data source currently residing – Kafka, AWS, Azure, etc? Or is it NFS?


#3

FS Pipeline ought to be the fastest way to load your data, assume its split into many files. Do you have a single large file or many small files?
You can use PROFILE PIPELINE to understand the bottlenecks in your pipeline performance, but it might help to start by sharing your CREATE PIPELINE statement.
FWIW, the two aggs won’t really help with pipelines, so that’s 56 wasted CPUs (for ingest). With large nodes like this, it might be better to install leaves co-located with your aggregators, since leaves most of the heavy lifting (in most workloads).


#4

Hi Jacky,

I have tried FS pipeline through NFS. So, NFS is mounted for each leaf node. I also checked pipeline worked well as adding the new file to the NFS.

Thanks.


#5

Hi JoYo,

As you mentioned, I am sure that FS Pipeline should be much faster than Load data infile by only using leaf node resources. I’ve just confirmed some factor regarding FS Pipeline. However, I want to know how to scale the factor for the best optimization. Please check the option in the below.

test data set : 287,997,024 rows / total 50GB GB( each file 500MB )

  1. 2 leaves -> 3 leaves
  2. How many partitions? now I set 32 partitions.
  3. HA async
  4. timestamp shard key

And one more question.
Is it possible to install aggregator and leaf on the same host? or I need to organize VM instance to split resource?

Thanks,
Andrea


#6
  1. An even number of leaves is required for HA. You can have multiple memsql nodes on a host, and its a good idea to have one memsql per numa node if your machines have multiple sockets. If you’re using ops, you can use https://docs.memsql.com/memsql-ops-cli-reference/v6.7/memsql-optimize/.
  2. 1 partition per 2 cores is a pretty good default configuration, you’re mileage may vary.
  3. Cool!
  4. Timestamp is usually a bad shard key. If the timestamp is a literal timestamp of insert column, all your data for a single batch (many files) will end up on a single partition, which is not particularly useful. No shard key is probably better than a timestamp shard key, but timestamps make excelent clustered columnstore indexes.

And yes, with large hosts its a good idea to install multiple memsql nodes per host. No need to split the VMs!


#7

Hi JoYo,

thanks a lot. I got many clue from your feedback.

Regards,
Andrea