Kafka Avro - Default null columns

dlee · April 4, 2019, 9:42pm

We’re creating a REPLACE INTO TABLE pipeline with avro messages from kafka and our schema contains default null fields such as

{"name": "testField", "type": ["null","string"],  "default": null }

If a string value is specified we’re seeing the testField column set to {“string”:“Test”} and if null is specified it’s being set as a string “null” instead of NULL. Any help is appreciated.

Here’s the full pipeline definition:

CREATE OR REPLACE PIPELINE TestPipeline
AS LOAD DATA KAFKA 'kfk/TestQueue'
BATCH_INTERVAL 2500
REPLACE INTO TABLE TestTable
FORMAT AVRO
(
TestTable.test_field <- testField
)
SCHEMA
'
{
    "name": "TestEvent",
    "type": "record",
    "fields": [
        {
            "name": "metadata",
            "type": {
                "type": "fixed",
                "size": 5,
                "name": "unused_metadata"
            }
        },
        {"name": "testField",   "type": ["null","string"],  "default": null }
    ]
}
'
;

sasha · April 4, 2019, 10:48pm

You likely saw {"string":"Test"} and null because CREATE PIPELINE / LOAD DATA converts union-type avro values to the JSON encoding of the entire union. However, if you specify a particular union branch in the subvalue_path clause of the query - e.g. “testField::string” - MemSQL will load the converted value of that particular branch for every record where it’s the selected branch. Where it’s not the selected branch, MemSQL will load NULL instead.

It’s admittedly annoying if you just want to use unions to represent optional values, but it’s a simple rule that works for more complicated unions too.

The reference for Avro conversion semantics is here:

dlee · April 5, 2019, 2:11pm

Ahh I missed that in the docs! Much appreciated and works like a charm!