Understanding and Controlling Changes in Collation of JSON_EXTRACT_STRING() Output

In the MemSQL 7.0 general availability (GA) release, we fixed a bug related to the collation of JSON objects, which affects the results of the JSON_EXTRACT_STRING function.

You can use JSON_EXTRACT_STRING as a function, and also using the ::$ notation, e.g. json_field::$x is the same as JSON_EXTRACT_STRING(json_field, 'x'). Prior to 7.0, the collation of a JSON_EXTRACT_STRING result could be different depending on whether it was called on the field of a row, or on the result of a JSON expression. This inconsistency was fixed in 7.0 GA so that the result of JSON_EXTRACT_STRING always has collation utf8_bin, the standard collation for JSON data.

The fix changed behavior in a way that was noticeable to some application developers, who may have relied on the previous behavior in some cases.

In MemSQL 7.1 GA, and simultaneously in the 7.0.18 patch release, we added a global variable (json_extract_string_collation) to give additional control to application developers over the collation of their JSON_EXTRACT_STRING results.

Default Behavior and New Global Variable to Change Collation Behavior

The default behavior for JSON_EXTRACT_STRING output collation in MemSQL 7.0 and 7.1 is described below. The default behavior in 7.0 and 7.1 differ. You will see a change in behavior when you upgrade from 7.0 to 7.1.

  • In MemSQL version 7.1, the JSON_EXTRACT_STRING output will by default match the collation of the server (i.e., the collation_server variable value), which is typically utf8_general_ci.

  • In MemSQL version 7.0, the JSON_EXTRACT_STRING output will by default be the collation used by JSON, which is binary (in MemSQL this is called utf8_bin).

If your application requires different behavior than the defaults in these respective versions, you can use a new global variable json_extract_string_collation to control this behavior. The variable is a global sync variable and the allowed settings for it are json, server, or auto. By default, MemSQL will use the auto value, which will be interpreted as json in 7.0 and server in 7.1.

The server setting means the result of JSON_EXTRACT_STRING will be the value of collation_server.

The json setting means the result of JSON_EXTRACT_STRING will be the standard collation used for JSON (binary), which in MemSQL is called utf8_bin.

The collation that matches the server collation is the most consistent with what developers will expect when developing SQL applications since other string expressions have this collation. That’s why the default behavior in 7.1 was chosen to be the same as the default server collation.

The 7.0 GA release uses the JSON collation (binary) for the output of JSON_EXTRACT_STRING. We chose not to change this behavior by default in a patch release, to avoid potential breaking changes.

Changing The Default Collation Behavior

If you need to change the value of this variable from the default of auto, you can update the variable via SET GLOBAL. For example, the below command updates the variable to use a collation of json:

SET GLOBAL json_extract_string_collation = "json";

Controlling JSON_EXTRACT_STRING collation at the expression level

If your application requires finer-grain control than the variable allows, you can use a cast to change the collation of JSON_EXTRACT_STRING expression results. For example, suppose that in a particular query, you wanted to use a binary collation for one JSON_EXTRACT_STRING expression but a case-insensitive collation for another JSON_EXTRACT_STRING expression. That can be done like so:

select …
from t
where t.json_field::$x :> text collate utf8_bin = "string1"
and t.json_field::$y :> text collate utf8_general_ci = "string2";

The syntax

    :> text collate CollationName

is a type cast that includes a collation. This allows you to cast any JSON_EXTRACT_STRING result to any desired collation.