Kafka BATCHES and Transactions

buddha · March 30, 2019, 8:32pm

i have 2 questions about what the docs have to say regarding when MEMSQL connects to Kafka:
“Offsets are ingested in batches, and the maximum number per batch is specified in the engine variables. When an extracted batch has successfully read its offset data, the batch is then optionally transformed and finally loaded into the destination table”

if the batch size is set to 2500, does MEMSQL wait until 2500 offsets are ingested before doing ANYTHING? or, if a bunch of messages are ingested from KAFKA and then there is a “pause” in messages being produced, does MEMSQL begin it’s processing?

The reason i ask is that i have a program that produces about 600 messages and then waits 20 seconds and repeats all day long. When are those messages actually processed by MEMSQL?

if my pipeline is sending the batch to a Stored Procedure and the input parameter to the Stored Proc is:
procName(batch query(‘kafkaMsg’ JSON))

does the stored proc get the contents of the entire batch in one call or does the stored proc get called for each message within the batch?

The reason i ask is that i would like to have a transaction for each one of these ~600 messages - so for example, in my stored proc, could i:

begin
START A TRANSACTION
REPLACE table_name (fields) select
COMMIT TRANSACTION
end

–>or, would that be 1 transaction for every message?

hanson · April 1, 2019, 8:34pm

Regarding part 2 of your question, it should be possible to do a database transaction for each row of input in the batch. You can use COLLECT on the input batch and iterate through the results in a loop that begins/commits a transaction.

JoYo · April 1, 2019, 9:24pm

There are two variables that you might be conflating here: the BATCH_INTERVAL and pipelines_max_offsets_per_batch_partition (as of 6.7). If the former is set to 2500, every 2.5 seconds we will load whatever is currently in kafka, up to a maximum of pipelines_max_offsets_per_batch_partition (per memsql partition). If there is only one new offset in kafka, then so be it, we will load it.

The stored proc will be called once on the entire batch. The variable batch in the stored proc represents all data for that batch.

buddha · April 1, 2019, 11:54pm

thank you both for your replies!

arun.pradeep · June 14, 2019, 7:12am

Will Memsql allows ‘START TRANSACTION or COMMITS or ROLLBACK’ in the stored procedures while it called from PIPELINE?Please confirm. Received an error message saying “Memsql distribution won’t support it”.

JoYo · June 14, 2019, 5:27pm

Nope, because we automatically wrap the entire transaction (and cursor updates) in a distributed transaction, so we cannot allow the user to mess with that.