Columnstore JSON type fetch performance


#1

Hey,

I’m well aware that fetching a JSON column is more involved and can take longer than a regular column, since it uses a different, optimized storage engine. However, the numbers I’m seeing don’t add up.
I have a JSON column with well over 40 fields in the JSON data blob.

Fetching 1 field out of the JSON column:
select payload::columnX from … limit 1;
takes 15-20 ms.

Fetching 4 fields out of the JSON column:
select payload::column1, payload::column2, payload::column3, payload::column4 from … limit 1;
takes about 70-80ms.

Fetching 40 fields out of the JSON column takes about 650-700 ms.

So I checked and profiled the query using MemSQL Studio. During the 650-700 ms period, CPU usage was only about 6%, which would be equivalent to roughly 60 ms of CPU time.

It makes me wonder whether the underlying implementation uses multiple threads to fetch from the JSON storage, or reads it linearly, column by column, with the delay being mostly disk access that doesn’t get parallelized?


#2

We use a third-party library called Parquet to store the JSON column and extract JSON keys.

Query execution is single-threaded (per partition), the same as for the other data types. The thread is responsible for extracting all the requested data in that partition. It does this block by block (a block being 4096 rows in one segment), and column by column within each block. There is nothing fancy there, so I would expect the run time to grow linearly with the number of projections in your query.

I am not sure where the 650-700ms is spent. You need to collect a flamegraph to know exactly what’s going on.

6% CPU utilization sounds suspicious, but I wouldn’t 100% trust Studio, especially for such a short-running query (it runs for only 0.7 s according to what you said). If you can confirm the CPU usage by logging into the leaf directly, we can investigate further by collecting a flamegraph.

If your query’s working set fits into memory, an easy way to rule out disk-related complications is to run the identical query twice. The second run should not require any hard disk reads, as the data should already be in the Linux page cache.


#3

Hey,

All the numbers here were taken after running the same query 5-10 consecutive times.


#4

I don’t see your runtime as suspicious by itself, as in your specific case the runtime is expected to grow linearly with the number of projections. Your one-projection query takes 15-20 ms, so your 40-projection query should take 600-800 ms, which is indeed what you get.

The only suspicious point is your claim that the CPU usage is 6%. That’s why I asked you to confirm this by logging into the leaf directly. I don’t have much knowledge about Studio, but I don’t think it samples CPU usage at a high frequency, so it is likely not going to show short peaks of CPU usage correctly.

I suggest you use a stored procedure (or simply a SQL script made by copy-pasting your query 100 times) to run your query repeatedly, then log into the leaf and run htop to get the real CPU usage.
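
As a minimal sketch of the SQL-script approach (the file name, table name, and column names below are hypothetical placeholders for your own):

-- run_many.sql: the same projection-heavy query pasted many times, so the
-- load lasts long enough to observe with htop on the leaf
select payload::column1, payload::column2, payload::column3, payload::column4 from t limit 1;
select payload::column1, payload::column2, payload::column3, payload::column4 from t limit 1;
-- ... repeat the same statement ~100 times, then feed the file to your SQL
-- client against the aggregator while watching htop on the leaf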


#5

Usually the Parquet format is meant to be used with parallel column processing.

I created a file with a bunch of these queries.
Each query takes about 800 ms, and the total run time for the script, running them one after another, is about 20 seconds, during which the CPU does indeed spike to 100-110%.

Also, it is the same query run consecutively, and it still takes almost a whole second. Where is the caching in this case?


#6

I agree with you that this could be better, but unfortunately we just haven’t implemented that at this time.

Just to confirm: you logged into the leaf and found that the sum of CPU usage across all CPUs is 100-110%, which is how you concluded that on average there is only ~6% usage per CPU. Is my understanding correct?

I agree with you that if the CPU usage is indeed 6% then it’s very suspicious. We would need to collect an off-CPU flamegraph to see what it is blocking on. But I just want to make sure that we didn’t hit a configuration issue before doing that (for example, you checked the aggregator’s CPU usage instead of the leaf’s, or the CPU usage was somehow interpreted incorrectly).


#7

Yes, when running the query the CPU goes from 15-20% to 100-110%. The machine has 16 cores.

So the 6% usage is just a MemSQL Studio mistake.

The question is why there is no caching, and why the same query keeps taking the same ~800 ms run after run. On regular columns the same query would take 10-20 ms, even with many columns selected.


#8

I don’t understand how CPU usage can be higher than 100%, but as long as the CPU is fully utilized while the query is running, things should be fine.

This is a bit of a technical detail, but in case you are interested: we first compute the list of table columns needed to evaluate all your projections. For example, if your table has integer columns a, b, c and you run select a+1, a+b, b+1, b+2, then columns a and b are needed but c is not. We extract the data from columns a and b, then compute all your projections. Even if a and b are used multiple times in your projections, they are only extracted from disk once.

However, things are a little different for JSON key extraction. Each JSON key you ask for is treated as a special “fake JSON column” of your table. This means that if you select js::a and js::b, even though both come from the same JSON column js, they are internally treated as two different “fake JSON columns” for the purposes of MemSQL query execution, so column js is extracted twice. This is not ideal, of course, but unfortunately this is currently how the code works.
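
To make the two cases concrete, here is a minimal sketch (the table and key names are hypothetical, not from this thread):

-- Hypothetical table used only for illustration
create table t (a int, b int, c int, js json, key (a) using clustered columnstore);

-- Regular columns: only a and b are needed by the projections below, and each
-- is extracted from disk once even though both appear several times; c is
-- never touched.
select a + 1, a + b, b + 1, b + 2 from t;

-- JSON key extraction: js::x and js::y are treated as two separate
-- "fake JSON columns", so the underlying js column is extracted once per key.
select js::x, js::y from t;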


#9

On Linux systems, each 100% of CPU utilization in top corresponds to one core in use; if you see 800%, eight cores are being used. So in this case it was using one core out of 16.

If you ever plan, product-wise, for MemSQL to be a drop-in replacement for document storage etc., this is something that must be handled, otherwise performance will be very poor.

Can you please explain what happens if I simply select js without extracting any keys? That one also performs poorly.


#10

That’s actually problematic. We have one thread per partition, and usually multiple partitions reside on a single node (and also multiple nodes on a physical server if your server has multiple NUMA nodes). So it’s weird that you are only getting a total of 100% CPU utilization (you are supposed to get 100% x #partitions on that leaf). How many partitions are on the leaf you are checking? (Try attaching the output of SHOW PARTITIONS.)
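
For reference, a quick way to check (the database name is hypothetical, and the exact output columns may vary by version):

show partitions on mydb;
-- Expect one row per partition with its host, port, and role. The number of
-- master partitions on the leaf you are watching is roughly the number of
-- cores you would expect to see busy when the query is not pruned to a
-- single partition.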

Probably you should talk to a PM about your use case. @hanson

Parquet only needs to extract part of the JSON blob if you just want to read a key. So if you are reading js::a and js::b, only part of the JSON blob needs to be parsed and extracted. But if you select js, the whole blob has to be parsed, extracted, and handed back to you. That’s probably why it’s slower.
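
A minimal illustration of the difference (names are hypothetical):

-- Only the requested keys are parsed out of the stored blob:
select js::a, js::b from t limit 1;
-- The whole JSON document has to be parsed, materialized, and returned:
select js from t limit 1;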


#11

As part of the query predicates I use the sharding key.


#12

Then it sounds reasonable: if the predicate matches a single shard key value, the query is routed to one partition, so only one thread (roughly one core) on the leaf does the work.
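
A minimal sketch of that situation (the schema and values are hypothetical):

-- The table is sharded on id, so a query filtering on one id value is routed
-- to exactly one partition and runs on a single thread there.
create table events (
  id bigint,
  payload json,
  shard key (id),
  key (id) using clustered columnstore
);

select payload::column1, payload::column2
from events
where id = 12345;  -- one shard key value -> one partition -> ~100% of one core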