What hashing algorithm is used for hashing the shard key to determine the partition?


#1

I’m wondering what algorithm is used to determine which partition data belongs to?

I’m making heavy use of Kafka, and have aligned our various producer clients (Scala, golang/sarama) to use the same partitioning algorithm, and I would very much like to have tables that are partitioned in the exact same way.


#2

The hash is MemSQL-specific and depends on the datatypes, e.g. for some string datatypes MemSQL uses CityHash. It is unlikely to match up with Kafka.

However, one thing you can consider is using keyless sharded tables, defined with an empty shard key, SHARD KEY (), which allows data to reside on any partition. Ingesting into a keyless sharded table through pipelines is optimized to move data directly from a Kafka partition to a MemSQL partition.
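As a sketch, a keyless sharded table might be declared like this (the table and column names are illustrative, not from the thread):

```sql
-- Illustrative only: SHARD KEY () declares a keyless sharded table,
-- so rows may land on any partition and pipeline ingest can move data
-- straight from a Kafka partition to a MemSQL partition.
CREATE TABLE events (
    id BIGINT NOT NULL,
    payload JSON,
    SHARD KEY ()
);
```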


#3

Thanks Jack. I’m not necessarily interested in using MemSQL to ingest data so much as using it as a KV store to enrich data as we process it:

Imagine that we have an input topic and an output topic used by a service. Data comes in, we look it up in the store, possibly modify the data in the store, and write to the output topic. By partitioning in the same way as the Kafka topics, we can then scale the instance count of that service up/down and be certain that no two instances of the service would ever use the same partition in the table at the same time. That’s what I’m looking to do.

Am I able to use the keyless sharded table to accomplish this by specifying the partition to use directly in calls to the database?


#4

Keyless sharding does not guarantee that the data from a given Kafka partition always goes to a given MemSQL partition.

What kind of performance impact are you seeing with multiple instances using the same partition at the same time? Because there are multiple partitions on each leaf, and MemSQL handles concurrent access well, partitioning data in this way usually doesn’t matter that much.

MemSQL doesn’t have a way to directly specify the partition to insert data into. However, you could use the Kafka partition id as the shard key to at least ensure that all data for a given Kafka partition resides in the same MemSQL partition, although a MemSQL partition could contain the data for multiple Kafka partitions. For a given number of MemSQL partitions, you could precompute a mapping of ids in the shard key that makes the partitions line up one-to-one, just by trying random ids and seeing which partition they get mapped to. It is also possible to have your application access the leaf partitions directly to input/output data, but this is not generally recommended.
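The precompute step could be sketched roughly like this. Note that `partition_of` below is a stand-in for whatever probe you run against the database (e.g. inserting a candidate id into a scratch table and asking MemSQL which partition it landed on); the modular hash here is only a placeholder, not MemSQL's real shard hash:

```python
# Hypothetical sketch: find one shard-key id per MemSQL partition so
# that Kafka partition i can be routed to MemSQL partition i.
import itertools

NUM_PARTITIONS = 8  # assumed partition count for the sketch


def partition_of(shard_id: int) -> int:
    # Placeholder for a real probe against MemSQL -- NOT the actual
    # shard hash, which is MemSQL-specific and datatype-dependent.
    return shard_id % NUM_PARTITIONS


def build_mapping(num_partitions: int) -> dict[int, int]:
    """Map each MemSQL partition -> a shard-key id that lands on it."""
    mapping: dict[int, int] = {}
    for candidate in itertools.count():
        p = partition_of(candidate)
        if p not in mapping:
            mapping[p] = candidate
            if len(mapping) == num_partitions:
                return mapping
    raise RuntimeError("unreachable")


mapping = build_mapping(NUM_PARTITIONS)
```

With the mapping in hand, a producer for Kafka partition `i` would always write `mapping[i]` into the shard-key column, pinning that Kafka partition's data to one MemSQL partition.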


#5

I see. I have yet to actually try this out, so I’ll go ahead without trying to force this separation myself.