Load CSV data with a Pipeline


#1

Hi,

I want to load data to a MemSQL Table, using a pipeline, from a HDFS folder.
Each day one or two csv files are generated in this folder, but not all the files have the same schema, for example, one file might have 12 columns and other 6 and other 18 (it depends on what happened that day).

Is there any way to load only the columns I am interested? (3 columns that always appear in the csv file)

For example, csv file1 has:

"code","number","year","C"
"blab","15885","2016","Y"
"aeea","15883","1982","E"
"xpto","15884","1986","B"

csv file2 has:

"code","number","year","C","M","type"
"geom","48956","1991","C","A","1"
"goog","77895","1999","E","R,"2""
"ecua","22354","1988","B","G","1"

Then I created the following table:

CREATE TABLE my_table (
     code VARCHAR(10),
     number INT,
);

And I only want to load data from the first two columns of each csv file, so I create a pipeline:

CREATE PIPELINE my_pipeline
AS LOAD DATA HDFS 'hdfs://path/output_folder/'
INTO TABLE `my_table`
FIELDS TERMINATED BY ','
(code, number);

But gives me error, as the csv file contains more columns than available columns in the table (of course)… is there any way to tell the pipeline to only load this two specific columns and ignore any other? (Take into consideration that each csv file may contain different column names each time)

Thanks!!


#2

You can use a pipeline with a transform to eliminate the columns you don’t want. See https://docs.memsql.com/sql-reference/v6.7/create-pipeline/#with-transform

Or use pipelines to stored procedures, treat the whole line of input as one column in the input batch, and have the stored procedure transform pull off the desired columns and insert them into the target table. See https://docs.memsql.com/sql-reference/v6.7/create-pipeline/#into-procedure


#3

Great I will look into it, thanks Hanson!


#4

I learned that my prior response might have been overkill. Load data and pipelines have native syntax to skip fields in the input, without transforms or stored procedures

load data infile "foo" into table t (col1, @, col2)

will parse a 3 column csv file and ignore the second column.


#5

In particular, that example above is skipping columns by loading unused columns of CSV into unused @-vars, as in the second example here:


#6

Great, works as expected, but require that (all) the source csv files to have exact the same amount of columns… for future reference I solved creating the pipeline like this:

CREATE PIPELINE my_pipeline
AS LOAD DATA HDFS 'hdfs://path/output_folder'
INTO TABLE `my_table`
(code, number, @, @, @, @)
FIELDS TERMINATED BY ','
IGNORE 1 LINES;

Regards,