Strange pipeline phenomenon

I’m testing large IoT workloads with a MemSQL filesystem pipeline.

The target table is configured as a columnstore.

The specifications of the application server generating the data and the DB server processing it are as follows:

Test server spec (application server)
Shape: OCI (VM.DenseIO2.16)
OCPU: 16 (vCPU 32)
MEM: 128 GB
N/W: 16.4 Gbps

Test server spec (DB server, MemSQL 7.0)
Shape: OCI (VM.Standard2.16)
OCPU: 16 (vCPU 32)
MEM: 128 GB
N/W: 16.4 Gbps

Let me explain the test scenario. On the application server, each piece of equipment generates a sensor reading every 0.1 seconds into a single CSV file.

As we increase the number of these devices, we measure CPU utilization and the data loading rate.

The partitions of the target table have a 1:1 relationship with the number of devices on the application server.
We also ran the test while increasing the amount of sensor data generated in a single CSV file, which increases the file size.
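For context, here is a minimal sketch of the kind of generator this scenario implies; the CSV schema (equipment ID, timestamp, value) and the file-naming convention are assumptions for illustration, not the actual test harness:

```python
import csv
import time
from pathlib import Path

def generate_sensor_file(out_dir: Path, equipment_id: int,
                         rows: int, interval_s: float = 0.1) -> Path:
    """Write one CSV file of simulated sensor readings for one device.

    Assumed schema: (equipment_id, timestamp, value). One row per
    0.1-second tick; values are placeholders.
    """
    path = out_dir / f"equipment_{equipment_id}_{int(time.time())}.csv"
    with path.open("w", newline="") as f:
        writer = csv.writer(f)
        t = time.time()
        for i in range(rows):
            writer.writerow([equipment_id, f"{t + i * interval_s:.3f}", i % 100])
    return path
```

Increasing `rows` per file corresponds to the larger-file variants of the test, and running one generator per device corresponds to scaling the device count.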
However, if you look at the attached verification table, CPU utilization and loading time are not linear.

In particular, look at the indicators when processing 30,000 sensor readings.
I can’t understand the CPU utilization and loading time when processing CSV files that are smaller than the comparison target.

I want to understand the correlation between the filesystem pipeline and CPU utilization, network, file size, and partitions.

In addition, the filesystem pipeline was not able to process all of the data even though CPU utilization was not high.

Please refer to the following.

I need a description of the filesystem pipeline mechanism behind this phenomenon.

The same thing happens when I add a leaf node.


Hello chaeyoung.ko, and thank you for trying out filesystem pipelines.

There are a number of factors to consider when profiling the performance of a pipeline. Although it does not appear likely in your case, as the size of each file increases, it is possible the leaves have to repartition more data to the correct partitions. That repartitioning happens over the network. However, at the repartition stage the data is already in table format and may not correspond directly to the original file sizes.
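As a rough illustration of that repartition step, here is a minimal sketch; the hash function and the local-partition set are hypothetical placeholders, not MemSQL’s actual internals:

```python
# Conceptual sketch: a leaf extracts rows from a file, but each row's
# shard key determines its target partition. Rows whose target partition
# lives on another leaf must travel over the network.

def target_partition(shard_key: str, num_partitions: int) -> int:
    # Illustrative deterministic hash -> partition index.
    return sum(shard_key.encode()) % num_partitions

def split_local_vs_remote(rows, local_partitions, num_partitions):
    """Split extracted rows into those that stay on this leaf and those
    that must be repartitioned over the network."""
    local, remote = [], []
    for key, payload in rows:
        p = target_partition(key, num_partitions)
        (local if p in local_partitions else remote).append((p, payload))
    return local, remote
```

The point is that the network cost scales with the number of misplaced rows, not with the raw CSV byte count, which is why it may not track file size.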

Perhaps counterintuitively, due to the per-file processing overhead, the smallest files can be slower to load than medium-sized ones.

Another thing to consider is the behavior of inserts into a columnstore. When you insert a relatively small number of rows, they land in a special rowstore segment of each columnstore partition. That data is eventually merged into on-disk columnstore segments by a background merger process. However, if you insert a large number of rows at once, the write to disk may be kicked off as part of the same transaction, based on the resulting number of rows in that segment.
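That segment behavior can be sketched as a toy model; the threshold value and the class are illustrative assumptions, not the engine’s real logic or tuning defaults:

```python
# Toy model of columnstore ingest: small inserts accumulate in an
# in-memory rowstore segment and are flushed later by a background
# merger. A large insert that pushes the segment past a threshold
# triggers the disk write inside the same transaction, so that insert
# pays the I/O cost itself.

FLUSH_THRESHOLD = 1_000_000  # rows; illustrative value only

class PartitionSegment:
    def __init__(self):
        self.in_memory_rows = 0
        self.synchronous_flushes = 0

    def insert(self, rows: int) -> None:
        self.in_memory_rows += rows
        if self.in_memory_rows >= FLUSH_THRESHOLD:
            # Large batch: the flush happens as part of this transaction.
            self.synchronous_flushes += 1
            self.in_memory_rows = 0
```

This is one reason loading time per row can jump rather than grow linearly as batch sizes cross a threshold.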

You mentioned that the pipeline was not able to process all of the data. Can you expand on what you observed? Did you experience errors in pipeline processing? Did the pipeline pick up only some of the files in a batch? Did the pipeline catch up on the rest of the files in subsequent batches?

Additionally, can you share the formula you use for CPU rate?

I measured CPU utilization with dstat on Linux.
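For reference, here is a minimal sketch of how tools like dstat derive that number on Linux, by sampling `/proc/stat` twice; the exact field accounting differs slightly between tools:

```python
import time

def cpu_utilization(sample_s: float = 0.5) -> float:
    """Approximate total CPU utilization (%) from /proc/stat.

    Samples the aggregate 'cpu' line twice and compares how many
    jiffies were idle (idle + iowait) versus the total. Linux-only.
    """
    def read():
        with open("/proc/stat") as f:
            fields = [int(x) for x in f.readline().split()[1:]]
        idle = fields[3] + fields[4]  # idle + iowait
        return idle, sum(fields)

    idle1, total1 = read()
    time.sleep(sample_s)
    idle2, total2 = read()
    dt = total2 - total1
    return 100.0 * (1 - (idle2 - idle1) / dt) if dt else 0.0
```

Note that a low aggregate percentage can still hide saturation of a few cores, which matters when partitions map 1:1 to data streams.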

There was no error in pipeline processing, but it did not meet one of the conditions of our test:
when a file is created on the application server in real time, it must be stored in the database through the pipeline within one second.

Of course, subsequent batches catch up on the rest of the files, but that does not satisfy the condition above.
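A post-hoc check of that one-second condition can be sketched like this; the tuple layout is an assumption, and the load-completion timestamps would have to come from somewhere such as pipeline batch metadata:

```python
def files_missing_sla(file_events, sla_s: float = 1.0):
    """Return the names of files whose pipeline load finished more than
    `sla_s` seconds after the file was created.

    file_events: iterable of (filename, created_at, loaded_at) tuples,
    with timestamps in epoch seconds.
    """
    return [name for name, created, loaded in file_events
            if loaded - created > sla_s]
```

Tracking which files miss the budget per batch would show whether the lag is steady-state or builds up as files back up.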

Also, backlogged files picked up in subsequent batches do not seem to be processed in the order in which they were created.

Once I find the optimal conditions under which files are created and stored within one second, without backlogged batches, I plan to run a test that scales data throughput linearly by adding leaf nodes.
Please give me some advice.