Some questions about pipelines

With the solution above, I can see a performance improvement from deleting the files periodically.

However, when loading is delayed for some reason (network delay, etc.), the pipeline stops if a file that has not been loaded yet is deleted.

To solve this, it would be better to distinguish between the pipeline's loaded and unloaded files and delete only the loaded ones.

However, as you can see in the second question below, reading the file list takes a long time,
so this doesn’t seem to be a good idea. Can you suggest a plan?

As sensor data is generated continuously during actual operation, the “PIPELINES_FILES” table will keep growing as the pipeline processes these files.

As you can see below, sensor data was generated for about 12 hours, and selecting the files took a long time (53.97 s).
I’d like to know whether checking for new files slows down as these files pile up.

Then, is there a setting that effectively manages the “PIPELINES_FILES” table so that it doesn’t grow?

In addition, the pipeline-related ‘PIPELINES_CURSORS’ and ‘PIPELINES_OFFSETS’ tables also continue to grow. How can they be managed effectively?

image

Whenever a new file is created, it appears that the new file is recognized by the filename as the pipeline loads the file.

Is it possible to set the criteria to time of file creation?

In addition,
What is the meaning of ‘batch time’ and ‘batch interval’?
I would like to know how these relate to the time data was last loaded into the database.
If I infinitely increase leaf nodes for a lot of data loads at a given time, will the amount of network traffic increase infinitely because of reshuffle?

Thanks in advance.


Thanks for your questions. Let me see if I can answer them.

However, when loading is delayed for some reason (network delay, etc.), the pipeline stops if a file that has not been loaded yet is deleted.

This is because we cannot distinguish between the user having deleted a file and something having gone wrong. However, if you want to tell MemSQL to ignore a file, you can run:

alter pipeline <name> drop file 'file_name'

and MemSQL will forget about that file. Make sure 'file_name' has already been deleted from the source when you run this: if the file was already loaded and is still there, MemSQL will load it again (drop file causes us to “forget” about it, so when we see it again, it’s like seeing it for the first time). If the file was not loaded yet, MemSQL will no longer try to load it.
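To make the "forgetting" behaviour concrete, here is a toy model in Python. It is not how MemSQL is implemented; the class and method names are made up for illustration, and it only mirrors the rule described above: a file the pipeline does not remember is treated as new the next time it is seen in the source.

```python
class ToyPipelineMetadata:
    """Toy model only: tracks which file names the pipeline remembers."""

    def __init__(self):
        self.known = {}  # file name -> "Loaded"

    def scan_and_load(self, source_files):
        """Load every file in the source that is not remembered as Loaded."""
        loaded_now = []
        for name in sorted(source_files):
            if self.known.get(name) != "Loaded":
                self.known[name] = "Loaded"
                loaded_now.append(name)
        return loaded_now

    def drop_file(self, name):
        """Forget the file, mimicking the described effect of DROP FILE."""
        self.known.pop(name, None)


meta = ToyPipelineMetadata()
first = meta.scan_and_load({"a.csv"})   # file is new, so it is loaded
second = meta.scan_and_load({"a.csv"})  # already remembered, nothing to do
meta.drop_file("a.csv")                 # pipeline forgets the file
third = meta.scan_and_load({"a.csv"})   # still in the source: loaded again
meta.drop_file("a.csv")
fourth = meta.scan_and_load(set())      # removed from the source: nothing loaded
```

In this model the double load in `third` happens only because the file was still in the source when it was dropped, which is why deleting the file first matters.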

As sensor data is generated continuously during actual operation, the “PIPELINES_FILES” table will keep growing as the pipeline processes these files.

You will need to take manual action to delete these files after you’re done with them. The best thing to do would be, in this order:

1. check pipelines_files for 'Loaded' files
2. delete them from disk
3. run alter pipeline <name> drop file for each of them
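The steps above could be scripted roughly as follows. This is a sketch under assumptions, not an official tool: the column names in the query (`file_name`, `pipeline_name`, `file_state`) are what I would expect in `information_schema.pipelines_files` and should be checked against your version, the `%s` placeholder style depends on your database driver, and `execute` is a hypothetical callable that runs one SQL statement.

```python
import os

# Assumed query for step 1; verify the column names on your version.
LOADED_FILES_QUERY = (
    "SELECT file_name FROM information_schema.pipelines_files "
    "WHERE pipeline_name = %s AND file_state = 'Loaded'"
)


def drop_file_statement(pipeline_name, file_name):
    """Build the ALTER PIPELINE ... DROP FILE statement for one file."""
    # Minimal escaping: double any single quote inside the string literal.
    escaped = file_name.replace("'", "''")
    return f"ALTER PIPELINE {pipeline_name} DROP FILE '{escaped}'"


def cleanup_loaded_files(loaded_file_names, pipeline_name, source_dir, execute):
    """Delete each loaded file from disk, then make the pipeline forget it.

    `loaded_file_names` is the result of step 1 (for example, the rows
    returned by LOADED_FILES_QUERY). `execute` is a hypothetical callable
    that runs a single SQL statement.
    """
    cleaned = []
    for name in loaded_file_names:
        # If `name` is already an absolute path, os.path.join returns it as is.
        path = os.path.join(source_dir, name)
        if os.path.exists(path):
            os.remove(path)  # step 2: remove the file from the source first
        execute(drop_file_statement(pipeline_name, name))  # step 3
        cleaned.append(name)
    return cleaned
```

Deleting the file before issuing the drop keeps the ordering described above, so a dropped file is never still sitting in the directory where the pipeline could pick it up again.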

I’d like to know whether checking for new files slows down as these files pile up.

The time to check for new files is proportional to the size of the directory, not the size of pipelines_files. Separately, querying pipelines_files is somewhat slower than it needs to be because we haven’t optimized this specific system view. We intend to do so in the very near future, so stay tuned.

Then, is there a setting that effectively manages the “PIPELINES_FILES” table so that it doesn’t grow?

I’m sorry, no. Currently MemSQL considers itself a consumer of those files, not an owner of them.

In addition, the pipeline-related ‘PIPELINES_CURSORS’ and ‘PIPELINES_OFFSETS’ tables also continue to grow. How can they be managed effectively?

Actually, all three of these tables are views of the same underlying metadata table.

Whenever a new file is created, it appears that the new file is recognized by the filename as the pipeline loads the file.
Is it possible to set the criteria to time of file creation?

I don’t follow, sorry…

In addition,
What is the meaning of ‘batch time’ and ‘batch interval’?
I would like to know how these relate to the time data was last loaded into the database.
If I infinitely increase leaf nodes for a lot of data loads at a given time, will the amount of network traffic increase infinitely because of reshuffle?

batch_time is the amount of time the batch took, and batch_interval is the maximum amount of time to wait between batches. If the batch time is greater than the batch interval, we don’t wait.
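As a rough illustration of that rule, here is a small Python sketch. It assumes one particular reading, namely that the interval is counted from the start of a batch, so the wait is whatever is left of the interval once the batch has finished. The function name is made up, and the actual scheduling inside MemSQL may differ.

```python
def wait_before_next_batch(batch_time, batch_interval):
    """Seconds to wait after a batch finishes, under the assumed reading.

    batch_time: how long the batch took, in seconds.
    batch_interval: maximum time to wait between batches, in seconds.
    """
    # If the batch already used up the whole interval, do not wait at all.
    return max(0.0, batch_interval - batch_time)


short_batch = wait_before_next_batch(1.0, 2.5)  # batch finished early
long_batch = wait_before_next_batch(3.0, 2.5)   # batch overran the interval
```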

If I infinitely increase leaf nodes for a lot of data loads at a given time, will the amount of network traffic increase infinitely because of reshuffle?

Only if you also tried to load an infinite amount of data :) The network traffic is proportional to the amount of data loaded at a time. We will load up to max_partitions_per_batch files at a time, which by default is the number of partitions in your database, which is related to the number of nodes. However, if you don’t have that many files available, we’ll load what you have.
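A toy calculation of the batch sizes this implies, in Python. It only models the cap described above (at most max_partitions_per_batch files per batch, fewer if fewer are waiting) and assumes no new files arrive in the meantime; the function name is made up for illustration.

```python
def plan_batches(available_files, max_partitions_per_batch):
    """Return the number of files loaded in each consecutive batch."""
    if max_partitions_per_batch < 1:
        raise ValueError("max_partitions_per_batch must be at least 1")
    batches = []
    remaining = available_files
    while remaining > 0:
        # Each batch takes the cap, or whatever is left if that is less.
        size = min(remaining, max_partitions_per_batch)
        batches.append(size)
        remaining -= size
    return batches


backlog = plan_batches(20, 8)  # more files than the cap: several batches
trickle = plan_batches(3, 8)   # fewer files than the cap: load what exists
```

Adding nodes raises the cap per batch, but the total amount of data moved is still bounded by how many files are actually waiting.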

Hope this helps!

Oh, one more thing:

alter pipeline <name> drop orphan files

is a useful command that will remove all unloaded files from metadata.