Hello haj! Thank you for trying out HDFS pipelines.
The pipeline workload usually grabs a list of HDFS files from the NameNode and then assigns one file per MemSQL partition. All partitions then download their files from HDFS in parallel. Multiple partitions may end up reading from the same HDFS DataNode, but in general the load should be spread more or less evenly on the HDFS side. After the file download completes, MemSQL transactionally inserts the data into a table or runs a stored procedure. As long as there are enough files to load at once, pipeline performance is expected to scale with the number of partitions across the cluster.
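For reference, a pipeline along these lines can be sketched roughly as follows. The NameNode host/port, path, and the table/pipeline names here are placeholders, and the exact load options depend on your file format:

```sql
-- Hypothetical pipeline: pulls delimited files from HDFS into an existing table.
-- Replace the NameNode host/port and path with your own.
CREATE PIPELINE hdfs_example AS
LOAD DATA HDFS 'hdfs://namenode-host:8020/path/to/files'
INTO TABLE my_table
FIELDS TERMINATED BY ',';

-- Start background ingestion; each partition downloads its assigned files.
START PIPELINE hdfs_example;
```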
There are several sizing factors in this process. One is the speed of the file download (file size, compression, bandwidth). Another is how much work the database transaction requires (number of columns, table type, shard key, stored procedure complexity). To provide sizing recommendations, we would need a few details about the specific workload and hardware.
In general, MemSQL can ingest millions of rows per second. The simplest use case would be gzipped files from HDFS with one integer per line, ingested into a rowstore table with a single bigint column.
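If it helps, that simplest benchmark setup might look roughly like this (the table, pipeline, and HDFS location are placeholders; gzipped files are typically recognized by their .gz extension):

```sql
-- One-column rowstore table for the benchmark.
CREATE TABLE ints (n BIGINT NOT NULL);

-- Hypothetical pipeline loading gzipped files of one integer per line.
CREATE PIPELINE load_ints AS
LOAD DATA HDFS 'hdfs://namenode-host:8020/benchmark/'
INTO TABLE ints;

START PIPELINE load_ints;
```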
If you go ahead with your benchmark, could you share results with us?
Thanks in advance!