Question about performance of huge file ingestion using FS pipeline

Hello.

I have one huge file (300M lines) containing CSV data, and I want to insert it into a table using an FS pipeline.
We measured the time required for different split sizes.
The result depends on the table schema.
I have two questions; please advise.

The data in this file is as follows:
2020-01-02 12:47:42.798,eq98958_0001,c1,z4xgnn79,80afkh0f64,8xxo2nq70kwhdc3,jdoyrn,qj2p_xmh,l3nj,run,1,kafkaExample.Sensor@54248cd8,10.516321182250977,1

Field Description

  • field 1: timestamp; each value repeats 100,000 times, and the file is sorted in ascending order by this field.
  • field 2: equipment name; the same value in every row.
  • field 12 (e.g., kafkaExampl …): sensor; 100,000 unique values are repeated. There is exactly one row per sensor for each timestamp.

Performance Measurement Procedure

  1. Split the huge file using the unix command `split -l`.
  2. Create a pipeline and insert the data of the files into the table (sensor_data_table_in or sensor_data_table_ti).
  3. Measure elapsed time and CPU usage.
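A minimal sketch of step 1, using a small generated stand-in for the real 300M-line CSV (all file names here are illustrative, not the actual ones we used):

```shell
# Generate a small stand-in for the real CSV (1,000 lines) and split it
# into 250-line chunks, mirroring the "split -l" step above.
seq 1 1000 | sed 's/^/2020-01-02 12:47:42.798,eq98958_0001,row/' > sensor_data.csv
split -l 250 sensor_data.csv chunk_

# 1,000 lines at 250 lines per chunk yields 4 chunk files.
ls chunk_* | wc -l
```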

Table schema:

– 1st table –

CREATE TABLE sensor_data_table_in (
    timestamp   timestamp(6),
    name        char(13),
    chid        char(2),
    lotid       char(15),
    ppid        char(15),
    recipeid    char(15),
    chstep      char(15),
    stepseq     char(13),
    partid      char(15),
    status      char(3),
    slotno      tinyint,
    sensor      char(28),
    value       double,
    lcl         double,
    ucl         double,
    inserttime  timestamp(6) DEFAULT CURRENT_TIMESTAMP,
    SHARD KEY (sensor),
    KEY (inserttime) USING CLUSTERED COLUMNSTORE
);

– 2nd table –

CREATE TABLE sensor_data_table_ti (
    timestamp   timestamp(6),
    name        char(13),
    chid        char(2),
    lotid       char(15),
    ppid        char(15),
    recipeid    char(15),
    chstep      char(15),
    stepseq     char(13),
    partid      char(15),
    status      char(3),
    slotno      tinyint,
    sensor      char(28),
    value       double,
    lcl         double,
    ucl         double,
    SHARD KEY (sensor),
    KEY (timestamp) USING CLUSTERED COLUMNSTORE
);
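For reference, the FS pipeline from step 2 of the procedure looks roughly like the following; the pipeline name and file path here are simplified placeholders, not the exact statements we ran:

```sql
-- Sketch of the FS pipeline used for loading; name and path are illustrative.
CREATE PIPELINE sensor_pipeline_ti AS
LOAD DATA FS '/data/chunks/chunk_*'
INTO TABLE sensor_data_table_ti
FIELDS TERMINATED BY ',';

START PIPELINE sensor_pipeline_ti;
```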

There are two differences between the two tables:

  • The 1st table has an extra `inserttime` column (DEFAULT CURRENT_TIMESTAMP).
  • Different columnstore keys (1st table: `inserttime`, 2nd table: `timestamp`).

Performance measurement result: insert into 1st table (sensor_data_table_in)

| Leaves | Partitions | Rows per file | Size per file | Files | Batches | Elapsed time (sec) | Elapsed time (mm:ss) | Rows per sec | CPU utilization |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 24 | 100,000 | 16 MB | 3,000 | 125.0 | 184.4 | 03:04 | 1,627,255 | 43% |
| 1 | 24 | 520,834 | 84 MB | 576 | 24.0 | 148.0 | 02:27 | 2,027,041 | 61% |
| 1 | 24 | 2,083,334 | 337 MB | 144 | 6.0 | 132.2 | 02:12 | 2,268,778 | 73% |
| 1 | 24 | 6,250,000 | 1,010 MB | 48 | 2.0 | 130.1 | 02:10 | 2,305,325 | 75% |
| 1 | 24 | 12,500,000 | 2,020 MB | 24 | 1.0 | 127.6 | 02:07 | 2,351,044 | 75% |
| 1 | 48 | 100,000 | 16 MB | 3,000 | 62.5 | 199.0 | 03:18 | 1,507,910 | 47% |
| 1 | 48 | 520,834 | 84 MB | 576 | 12.0 | 138.4 | 02:18 | 2,167,730 | 72% |
| 1 | 48 | 2,083,334 | 337 MB | 144 | 3.0 | 125.0 | 02:05 | 2,399,122 | 84% |
| 1 | 48 | 6,250,000 | 1,010 MB | 48 | 1.0 | 120.1 | 02:00 | 2,497,858 | 88% |
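The derived columns in the table above can be cross-checked from the raw numbers. Taking the first row as an example (the small gap versus the reported 1,627,255 rows/sec presumably comes from rounding of the elapsed time):

```python
# First row of the results table for sensor_data_table_in.
files = 3_000
rows_per_file = 100_000
partitions = 24
elapsed_sec = 184.4

total_rows = files * rows_per_file   # 300M rows in total
batches = files / partitions         # each batch loads one file per partition
rows_per_sec = total_rows / elapsed_sec

print(batches)                       # 125.0, matching the "batches" column
print(round(rows_per_sec))           # close to the reported 1,627,255
```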

Performance measurement result: insert into 2nd table (sensor_data_table_ti)

| Leaves | Partitions | Rows per file | Size per file | Files | Batches | Elapsed time (sec) | Elapsed time (mm:ss) | Data size per sec | Rows per sec | CPU utilization |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 24 | 100,000 | 16 MB | 3,000 | 125.0 | 226.4 | 03:46 | 214 MB | 1,325,302 | 40% |
| 1 | 24 | 200,000 | 32 MB | 1,500 | 62.5 | 214.2 | 03:34 | 226 MB | 1,400,510 | 46% |
| 1 | 24 | 250,000 | 40 MB | 1,200 | 50.0 | 206.8 | 03:26 | 234 MB | 1,450,960 | 53% |
| 1 | 24 | 520,834 | 84 MB | 576 | 24.0 | 211.1 | 03:31 | 230 MB | 1,421,280 | 58% |
| 1 | 24 | 1,041,667 | 168 MB | 288 | 12.0 | 229.6 | 03:49 | 211 MB | 1,306,447 | 58% |
| 1 | 24 | 2,083,334 | 337 MB | 144 | 6.0 | 243.0 | 04:03 | 199 MB | 1,234,449 | 48% |
| 1 | 24 | 6,250,000 | 1,010 MB | 48 | 2.0 | 265.1 | 04:25 | 183 MB | 1,131,734 | 47% |
| 1 | 24 | 12,500,000 | 2,020 MB | 24 | 1.0 | | | | | |
| 1 | 48 | 100,000 | 16 MB | 3,000 | 62.5 | 231.1 | 03:51 | 210 MB | 1,298,229 | 48% |
| 1 | 48 | 200,000 | 32 MB | 1,500 | 31.3 | 206.7 | 03:26 | 235 MB | 1,451,489 | 60% |
| 1 | 48 | 250,000 | 40 MB | 1,200 | 25.0 | 198.2 | 03:18 | 245 MB | 1,513,947 | 65% |
| 1 | 48 | 520,834 | 84 MB | 576 | 12.0 | 186.5 | 03:06 | 260 MB | 1,608,380 | 76% |
| 1 | 48 | 1,041,667 | 168 MB | 288 | 6.0 | 206.1 | 03:26 | 235 MB | 1,455,902 | 82% |
| 1 | 48 | 2,083,334 | 337 MB | 144 | 3.0 | 218.2 | 03:38 | 222 MB | 1,374,915 | 87% |
| 1 | 48 | 6,250,000 | 1,010 MB | 48 | 1.0 | 232.5 | 03:52 | 209 MB | 1,290,358 | 88% |

Questions

  1. When inserting into the first table, the larger the file size, the better the elapsed time.
    However, the second table does not behave this way. Why?
    It looks related to the different columnstore keys, but it is strange that elapsed time first decreases and then increases as file size grows.
    Note: elapsed time was obtained with the following query.
    select sum(BATCH_TIME) from information_schema.pipelines_batches_summary;

  2. When restoring a CSV file backed up from a database, the second table's layout (no `inserttime` column) is the more realistic case.
    Is there a way to improve its performance?

  • System CPU/memory: 24 cores (48 vCPUs) / 192 GB
  • Average line size: 165 to 175 bytes

Note: `LOAD DATA INFILE …` performs worse than the FS pipeline.

Thank you very much.