Data Communication between Leaf Nodes in FS Pipeline

haj · November 5, 2019, 12:43am

Hello,
When running FS PIPELINE, there is a transmission between Leaf Nodes.
I would appreciate it if you could tell me which data is exchanged between Leaf Nodes.

Thanks in advance!

sasha · November 6, 2019, 7:17pm

The transmission you’re seeing is most likely a row reshuffle, the same thing that happens when you run INSERT INTO dest SELECT * FROM src and dest’s shard key doesn’t match src’s.

“Normal” (“non-aggregator”) pipelines execute by periodically assigning “batch partitions” of the incoming data to individual leaves and having the leaves download and parse their batch partitions in parallel, as part of one shared distributed transaction. Since the rows parsed from a batch partition don’t necessarily belong on the leaf doing the parsing, the act of actually inserting those rows requires forwarding them to the appropriate leaf.

The definition of a batch partition differs between pipeline types. For FS pipelines, a “batch partition” is defined as a single file in the file tree being monitored. So individual leaves will own individual files, parse them in parallel, and reshuffle rows from those files. (That means, for instance, that it’s better to have more files, even if they’re smaller, so that you can saturate the leaves.)
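
To make the reshuffle concrete, here is a rough Python sketch of one batch. It is only a toy model, not how the engine is implemented: the leaf count, the round-robin file assignment, and the CRC32-based owner_leaf placement rule are all assumptions I made up for illustration. It counts how many parsed rows land on a leaf other than the one that parsed them, which is the leaf-to-leaf traffic in question.

```python
import zlib
from collections import defaultdict

NUM_LEAVES = 4  # hypothetical cluster size


def owner_leaf(shard_key, num_leaves=NUM_LEAVES):
    """Hypothetical placement rule: hash the shard key, modulo the leaf count."""
    return zlib.crc32(shard_key.encode("utf-8")) % num_leaves


def run_batch(files, num_leaves=NUM_LEAVES):
    """Simulate one batch in which every file is one batch partition.

    `files` maps a file name to the list of shard-key values parsed from it.
    Returns (rows stored per leaf, number of rows forwarded between leaves).
    """
    stored = defaultdict(list)
    forwarded = 0
    # Assumption: files are handed out to leaves round-robin.
    for i, (name, rows) in enumerate(sorted(files.items())):
        parsing_leaf = i % num_leaves
        for key in rows:
            dest = owner_leaf(key, num_leaves)
            if dest != parsing_leaf:
                forwarded += 1  # this row crosses the network to another leaf
            stored[dest].append(key)
    return dict(stored), forwarded


files = {
    "part-0.csv": ["a", "b", "c"],
    "part-1.csv": ["d", "e", "f"],
}
stored, forwarded = run_batch(files)
total = sum(len(keys) for keys in stored.values())
print(f"{forwarded} of {total} rows were forwarded to another leaf")
```

With a single leaf the forwarded count is always zero; with more leaves, most rows will typically need to move, because the leaf that parses a file is chosen without regard to the shard keys inside it.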

I say this applies to “normal” pipelines because adding the AGGREGATOR clause to a pipeline causes all downloading and parsing to happen on a single aggregator node. There’s not much reason to prefer this mode apart from cases where it’s mandatory: since normal pipelines bypass the aggregators and download data directly to leaves, both formulations involve the same amount of network traffic and CPU work but normal pipelines can do it in parallel.
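
A back-of-the-envelope model of the parallelism argument, again in Python. The cost units are invented, files are assumed to be assigned round-robin, the aggregator is assumed to handle files one after another, and network and transaction overhead are ignored. The total work is the same in both modes; only the elapsed time differs.

```python
def wall_time_parallel(file_costs, num_leaves):
    """Leaves work concurrently; the batch ends when the busiest leaf is done."""
    per_leaf = [0.0] * num_leaves
    for i, cost in enumerate(file_costs):
        per_leaf[i % num_leaves] += cost  # assumed round-robin assignment
    return max(per_leaf)


def wall_time_aggregator(file_costs):
    """A single node downloads and parses every file in turn."""
    return sum(file_costs)


costs = [1.0] * 8  # eight small files, one cost unit each (made-up numbers)

print(wall_time_aggregator(costs))    # 8.0
print(wall_time_parallel(costs, 4))   # 2.0
print(wall_time_parallel([8.0], 4))   # 8.0: one big file keeps three leaves idle
```

The last line is the reason for preferring many smaller files: a single file can only be parsed by one leaf, so it gains nothing from the other three.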