Skipping Header and Trailer records in FS Pipeline

anilee.tech · June 23, 2020, 5:25pm

Hi,
Currently, I receive files with header and trailer records.

How can I setup pipeline and have pipeline skip header and trailer and parse remaining records and load it into table?
If we have to use Stored Procedure, can you please provide a sample that can detect header and trailer. I might prefer header, trailer and file name inserted as a record in a separate table for auditing purposes and actual data into Target table.

Thanks!

m_k · June 25, 2020, 4:31am

That’s an interesting feature. How consistent are your headers and trailers? Can they be assumed to be just one line or are they more complex?

The simplest option for on-premise clusters is to employ a transform which strips them out. SingleStoreDB Cloud · SingleStore Documentation

You can get the filename in the process, which can be passed into the stored procedure via pipeline_batch_id() or pipeline_source_file() builtins. Here’s an example:

MySQL [db]> create or replace procedure proc(q query(a int, batch int)) as begin insert into t select batch from q; end //
Query OK, 1 row affected (0.015 sec)

MySQL [db]> create or replace pipeline p as load data fs "/tmp/*.csv" into procedure proc(a) set batch = pipeline_batch_id()//
Query OK, 0 rows affected (0.196 sec)

Note that you can always find the source for the pipeline batch by looking at the relevant tables in INFORMATION_SCHEMA, e.g. PIPELINES_BATCHES.

However, the transform may need to do something extra for headers and trailer, e.g. tag the line with a code so that you can properly route the row in a stored procedure later.

anilee.tech · June 25, 2020, 5:51pm

Header and Trailer records are simple and has consistent format - 1 line each

For example:
HDR 20200625
TRL 1500000

In the above example header has today’s date and Trailer has total number of records.

Thanks!