JSON Pipeline Only Loads Last Record

Hi there. I am having trouble loading a JSON file into a table via an FS pipeline. Using the script below, everything seems to run fine but only the last record in the JSON file ends up in the table. I have also added how the JSON file is structured at the bottom. Any ideas? Thanks in advance.

SQL
USE Utility_Demo;

DROP TABLE IF EXISTS Utility_Workorders;

CREATE TABLE Utility_Workorders (
Workorder bigint(20) DEFAULT NULL,
WorkorderDescription varchar(117),
WorkorderDate date,
DEVICE_ID bigint(20) DEFAULT NULL,
WorkorderCategory varchar(117),
SHARD KEY (Workorder)
);

CREATE OR REPLACE PIPELINE Workorders
AS LOAD DATA FS '/home/ubuntu/UtilityDemo/Workorders/WorkOrders.json'
INTO TABLE Utility_Workorders
(Workorder <- Workorder,
WorkorderDescription <- WorkorderDescription,
@datevar <- WorkorderDate,
DEVICE_ID <- DEVICE_ID,
WorkorderCategory <- WorkorderCategory)
FORMAT JSON
SET WorkorderDate = str_to_date(@datevar, '%m/%e/%Y')
ON DUPLICATE KEY UPDATE Workorder = VALUES(Workorder);

START PIPELINE Workorders;

JSON

[
{
"Workorder": 1,
"WorkorderDescription": "Vendor Repair",
"WorkorderDate": "5/17/2018",
"DEVICE_ID": 122,
"WorkorderCategory": "Critical"
},
{
"Workorder": 2,
"WorkorderDescription": "Vendor Repair",
"WorkorderDate": "5/17/2018",
"DEVICE_ID": 333,
"WorkorderCategory": "Critical"
},

The input format is currently required to be whitespace-delimited JSON, with one row produced for each whitespace-delimited JSON object. Loading a row per array element from a file structured as one array object, or from subarrays of whitespace-delimited objects, is unsupported. I believe an error message is simply missing here, and the result is a row initialized with data from the last element of the array, as you mention. We'll be adding support in 7.1, or a dot release shortly after, so this pipeline will indeed do what you want, as written.

In the meantime, are you able to produce whitespace-delimited JSON? Alternatively, you could use a TRANSFORM clause to rewrite the file as whitespace-delimited on the fly. There may also be an option to do this via stored procedure logic (you can load the entire array into a JSON or TEXT column with one row), though it'll likely be less efficient than a transform that can skip actual JSON parsing in favor of simply translating a few known character sequences, e.g. `},` -> `}`.
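If it helps, here is a minimal sketch of that rewrite as a standalone Python script (the file paths and function name are illustrative, not part of any pipeline API). It parses the top-level array once and emits one JSON object per line, which is a valid form of whitespace-delimited JSON:

```python
import json

def array_to_whitespace_delimited(src_path, dst_path):
    """Rewrite a file containing one top-level JSON array as
    whitespace-delimited JSON objects, one object per line."""
    with open(src_path) as src:
        records = json.load(src)  # parse the whole array up front
    with open(dst_path, "w") as dst:
        for record in records:
            # newline between objects serves as the whitespace delimiter
            dst.write(json.dumps(record) + "\n")
```

You could run this once over the converter's output, or adapt the same logic (reading stdin, writing stdout) for use inside a TRANSFORM clause.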

Thank you for the reply, Sasha. I am trying to build a demo to showcase this, so I simply took a CSV file and used an online CSV-to-JSON converter to get the JSON file. I am not the most versed in JSON, so can you provide an example of whitespace-delimited JSON so I can try to convert the original CSV to it?

Thanks,

{
"Workorder": 1,
"WorkorderDescription": "Vendor Repair",
"WorkorderDate": "5/17/2018",
"DEVICE_ID": 122,
"WorkorderCategory": "Critical"
}

{
"Workorder": 2,
"WorkorderDescription": "Vendor Repair",
"WorkorderDate": "5/17/2018",
"DEVICE_ID": 333,
"WorkorderCategory": "Critical"
}
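To make the format concrete: it's just JSON objects concatenated with whitespace between them, like the two objects above, with no surrounding array or commas. As an illustration (not how the pipeline itself parses; the function name is made up), here is how such a stream can be decoded with only the Python standard library:

```python
import json

def parse_whitespace_delimited(text):
    """Decode a string of whitespace-delimited JSON objects into a list."""
    decoder = json.JSONDecoder()
    objs, idx = [], 0
    while idx < len(text):
        # skip any whitespace separating the objects
        while idx < len(text) and text[idx].isspace():
            idx += 1
        if idx >= len(text):
            break
        obj, idx = decoder.raw_decode(text, idx)  # idx advances past the object
        objs.append(obj)
    return objs
```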

I’m in no position to vouch for the tool, but it looks like that’s what you get with the “mongodb mode” output option of https://www.convertcsv.com/csv-to-json.htm.

That worked. The converter has a MongoDB mode where it removes the commas. Once I did that it loaded all the records. Thanks Sasha!