Data Communication between Leaf Nodes in FS Pipeline, called 'reshuffle'

chaeyoung.ko · November 11, 2019, 9:13am

I need more explanation about ‘reshuffle’.

Does it mean that the leaf nodes are doing ‘sharding’?

sasha · November 12, 2019, 1:01am

If I understand your question correctly: yes, data in non-reference MemSQL tables is sharded among leaves. A given row will be stored on only a single leaf, rather than being replicated to all leaves. On the other hand, reference tables are replicated to all leaves and aggregators.

There’s an overview of the architecture here:

As with any other insert into a sharded table, the rows inserted via a pipeline must wind up on the correct leaf node according to the table’s shard key. Leaves download data and parse it into rows in parallel and, since newly parsed rows don’t necessarily belong on the leaf doing the parsing, they also forward rows to each other over the network as they parse. We call this a “reshuffle” operation.
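To make the idea concrete, here is a minimal Python sketch of that routing step. It is only an illustration under stated assumptions, not MemSQL's actual implementation: the leaf count, the `owner_leaf` and `reshuffle` helpers, the `id` shard key, and the CRC32 hash are all hypothetical, and in a real cluster rows map to partitions that are in turn hosted on leaves.

```python
import zlib
from collections import defaultdict

NUM_LEAVES = 3  # hypothetical cluster size, for illustration only


def owner_leaf(shard_key_value, num_leaves=NUM_LEAVES):
    """Pick the leaf that owns a row, based on its shard key value.

    CRC32 is a stand-in hash chosen for this sketch; it is not the
    hash function MemSQL uses.
    """
    return zlib.crc32(str(shard_key_value).encode("utf-8")) % num_leaves


def reshuffle(parsed_by_leaf, num_leaves=NUM_LEAVES):
    """Route freshly parsed rows to the leaf that owns them.

    parsed_by_leaf maps a leaf index to the rows that leaf parsed.
    Returns (stored_by_leaf, forwarded_count), where forwarded_count is
    the number of rows that had to move to a different leaf.
    """
    stored = defaultdict(list)
    forwarded = 0
    for parsing_leaf, rows in parsed_by_leaf.items():
        for row in rows:
            target = owner_leaf(row["id"], num_leaves)
            if target != parsing_leaf:
                forwarded += 1  # this row would travel over the network
            stored[target].append(row)
    return dict(stored), forwarded


# Each leaf parses whatever chunk of the file it happened to download.
parsed_by_leaf = {
    0: [{"id": "a1"}, {"id": "b2"}, {"id": "c3"}],
    1: [{"id": "d4"}, {"id": "e5"}],
    2: [{"id": "f6"}, {"id": "g7"}, {"id": "h8"}],
}

stored_by_leaf, forwarded_count = reshuffle(parsed_by_leaf)
```

After the call, every row sits on the leaf chosen by its shard key, regardless of which leaf parsed it, and `forwarded_count` tells you how many rows had to be sent to a different leaf.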

Edit: also worth noting that “a given row is stored on a single leaf” above doesn’t take into account high availability features, where that row will in fact be replicated to another leaf. That replica is only a backup copy, however, and isn’t queryable.

chaeyoung.ko · November 12, 2019, 2:14am

I understood it perfectly.

Thanks.