Question about performance of huge file ingestion using FS pipeline

Hello.

I have one huge file (300M lines) containing CSV data, and I want to insert it into a table using an FS pipeline.
We measured the time required for different split sizes.
The result depends on the table schema.
I have two questions; please advise.

The data in this file is as follows:
2020-01-02 12:47:42.798,eq98958_0001,c1,z4xgnn79,80afkh0f64,8xxo2nq70kwhdc3,jdoyrn,qj2p_xmh,l3nj,run,1,kafkaExample.Sensor@54248cd8,10.516321182250977,1

Field Description

  • field 1: timestamp; each value repeats 100,000 times, and the file is sorted in ascending order by this field.
  • field 2: equipment name; the same value in every row.
  • field 12 (e.g., kafkaExampl …): sensor; 100,000 unique values are repeated. There is exactly one row per sensor for each timestamp.

Performance Measurement Procedure

  1. Split the huge file using the unix command `split -l`.
  2. Create a pipeline and insert the data of the files into the table (sensor_data_table_in or sensor_data_table_ti).
  3. Measure elapsed time and CPU usage.
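A minimal sketch of step 1, using a small generated stand-in for the real 300M-line CSV (all file names here are illustrative, not the actual ones we used):

```shell
# Generate a small stand-in for the real CSV (1,000 lines) and split it
# into 250-line chunks, mirroring the "split -l" step above.
seq 1 1000 | sed 's/^/2020-01-02 12:47:42.798,eq98958_0001,row/' > sensor_data.csv
split -l 250 sensor_data.csv chunk_

# 1,000 lines at 250 lines per chunk yields 4 chunk files.
ls chunk_* | wc -l
```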

Table schema:

– 1st table –

CREATE TABLE sensor_data_table_in (
    timestamp   timestamp(6),
    name        char(13),
    chid        char(2),
    lotid       char(15),
    ppid        char(15),
    recipeid    char(15),
    chstep      char(15),
    stepseq     char(13),
    partid      char(15),
    status      char(3),
    slotno      tinyint,
    sensor      char(28),
    value       double,
    lcl         double,
    ucl         double,
    inserttime  timestamp(6) DEFAULT CURRENT_TIMESTAMP,
    SHARD KEY (sensor),
    KEY (inserttime) USING CLUSTERED COLUMNSTORE
);

– 2nd table –

CREATE TABLE sensor_data_table_ti (
    timestamp   timestamp(6),
    name        char(13),
    chid        char(2),
    lotid       char(15),
    ppid        char(15),
    recipeid    char(15),
    chstep      char(15),
    stepseq     char(13),
    partid      char(15),
    status      char(3),
    slotno      tinyint,
    sensor      char(28),
    value       double,
    lcl         double,
    ucl         double,
    SHARD KEY (sensor),
    KEY (timestamp) USING CLUSTERED COLUMNSTORE
);
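For reference, the FS pipeline from step 2 of the procedure looks roughly like the following; the pipeline name and file path here are simplified placeholders, not the exact statements we ran:

```sql
-- Sketch of the FS pipeline used for loading; name and path are illustrative.
CREATE PIPELINE sensor_pipeline_ti AS
LOAD DATA FS '/data/chunks/chunk_*'
INTO TABLE sensor_data_table_ti
FIELDS TERMINATED BY ',';

START PIPELINE sensor_pipeline_ti;
```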

There are two differences between the two tables:

  • The 1st table has an extra `inserttime` column (DEFAULT CURRENT_TIMESTAMP).
  • Different columnstore keys (1st table: `inserttime`, 2nd table: `timestamp`).

Performance measurement result: insert into 1st table (sensor_data_table_in)

| Leaves | Partitions | Rows per file | Size per file | Files | Batches | Elapsed time (sec) | Elapsed time (mm:ss) | Rows per sec | CPU utilization |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 24 | 100,000 | 16 MB | 3,000 | 125.0 | 184.4 | 03:04 | 1,627,255 | 43% |
| 1 | 24 | 520,834 | 84 MB | 576 | 24.0 | 148.0 | 02:27 | 2,027,041 | 61% |
| 1 | 24 | 2,083,334 | 337 MB | 144 | 6.0 | 132.2 | 02:12 | 2,268,778 | 73% |
| 1 | 24 | 6,250,000 | 1,010 MB | 48 | 2.0 | 130.1 | 02:10 | 2,305,325 | 75% |
| 1 | 24 | 12,500,000 | 2,020 MB | 24 | 1.0 | 127.6 | 02:07 | 2,351,044 | 75% |
| 1 | 48 | 100,000 | 16 MB | 3,000 | 62.5 | 199.0 | 03:18 | 1,507,910 | 47% |
| 1 | 48 | 520,834 | 84 MB | 576 | 12.0 | 138.4 | 02:18 | 2,167,730 | 72% |
| 1 | 48 | 2,083,334 | 337 MB | 144 | 3.0 | 125.0 | 02:05 | 2,399,122 | 84% |
| 1 | 48 | 6,250,000 | 1,010 MB | 48 | 1.0 | 120.1 | 02:00 | 2,497,858 | 88% |
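The derived columns in the table above can be cross-checked from the raw numbers. Taking the first row as an example (the small gap versus the reported 1,627,255 rows/sec presumably comes from rounding of the elapsed time):

```python
# First row of the results table for sensor_data_table_in.
files = 3_000
rows_per_file = 100_000
partitions = 24
elapsed_sec = 184.4

total_rows = files * rows_per_file   # 300M rows in total
batches = files / partitions         # each batch loads one file per partition
rows_per_sec = total_rows / elapsed_sec

print(batches)                       # 125.0, matching the "batches" column
print(round(rows_per_sec))           # close to the reported 1,627,255
```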

Performance measurement result: insert into 2nd table (sensor_data_table_ti)

| Leaves | Partitions | Rows per file | Size per file | Files | Batches | Elapsed time (sec) | Elapsed time (mm:ss) | Data size per sec | Rows per sec | CPU utilization |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 24 | 100,000 | 16 MB | 3,000 | 125.0 | 226.4 | 03:46 | 214 MB | 1,325,302 | 40% |
| 1 | 24 | 200,000 | 32 MB | 1,500 | 62.5 | 214.2 | 03:34 | 226 MB | 1,400,510 | 46% |
| 1 | 24 | 250,000 | 40 MB | 1,200 | 50.0 | 206.8 | 03:26 | 234 MB | 1,450,960 | 53% |
| 1 | 24 | 520,834 | 84 MB | 576 | 24.0 | 211.1 | 03:31 | 230 MB | 1,421,280 | 58% |
| 1 | 24 | 1,041,667 | 168 MB | 288 | 12.0 | 229.6 | 03:49 | 211 MB | 1,306,447 | 58% |
| 1 | 24 | 2,083,334 | 337 MB | 144 | 6.0 | 243.0 | 04:03 | 199 MB | 1,234,449 | 48% |
| 1 | 24 | 6,250,000 | 1,010 MB | 48 | 2.0 | 265.1 | 04:25 | 183 MB | 1,131,734 | 47% |
| 1 | 24 | 12,500,000 | 2,020 MB | 24 | 1.0 | | | | | |
| 1 | 48 | 100,000 | 16 MB | 3,000 | 62.5 | 231.1 | 03:51 | 210 MB | 1,298,229 | 48% |
| 1 | 48 | 200,000 | 32 MB | 1,500 | 31.3 | 206.7 | 03:26 | 235 MB | 1,451,489 | 60% |
| 1 | 48 | 250,000 | 40 MB | 1,200 | 25.0 | 198.2 | 03:18 | 245 MB | 1,513,947 | 65% |
| 1 | 48 | 520,834 | 84 MB | 576 | 12.0 | 186.5 | 03:06 | 260 MB | 1,608,380 | 76% |
| 1 | 48 | 1,041,667 | 168 MB | 288 | 6.0 | 206.1 | 03:26 | 235 MB | 1,455,902 | 82% |
| 1 | 48 | 2,083,334 | 337 MB | 144 | 3.0 | 218.2 | 03:38 | 222 MB | 1,374,915 | 87% |
| 1 | 48 | 6,250,000 | 1,010 MB | 48 | 1.0 | 232.5 | 03:52 | 209 MB | 1,290,358 | 88% |

Questions

  1. When inserting into the first table, the larger the file size, the better the elapsed time.
    However, the second table does not behave this way. Why?
    It looks related to the different columnstore keys, but it is strange that elapsed time first decreases and then increases as file size grows.
    Note: elapsed time was obtained with the following query.
    select sum(BATCH_TIME) from information_schema.pipelines_batches_summary;

  2. When restoring a CSV file backed up from a database, the second table's layout (no `inserttime` column) is the more realistic case.
    Is there a way to improve its performance?

  • System CPU/memory: 24 cores (48 vCPUs) / 192 GB
  • Average line size: 165 to 175 bytes

Note: `LOAD DATA INFILE …` performs worse than the FS pipeline.

Thank you very much.