Need help with HDFS pipeline

I am very much new to memsql.
I am getting error " ERROR 1933 ER_EXTRACTOR_EXTRACTOR_GET_LATEST_OFFSETS: Cannot get source metadata for pipeline. could not walk folder /landingzone/memsql: stat /landingzone/memsql: permission denied "

Here is the syntax I executed in MemSQL Studio.

CREATE PIPELINE my_database.test_pipeline
AS LOAD DATA HDFS ‘hdfs://my-name-node:8020/landingzone/memsql/test_tbl’
INTO TABLE my_database.test_tbl
FIELDS TERMINATED BY ‘\t’;

/landingzone/memsql/test_tbl has one text file with fields separated by “\t”.

Here are my questions.

  1. Do I need to configure anything in hdfs server for memsql to work?
  2. What user does memsql’s above syntax use to extract data from hdfs?
  3. How does memsql understand the delimiter in hdfs files?
  4. How does memsql read metadata just from hdfs files?
  5. Is name-node the only way to connect to hdfs?

Thanks in advance.

Hi Sanraj, thanks for trying out HDFS pipelines.

  1. I think you only need to provide the correct user to talk to HDFS. You can add it to the CREATE PIPELINE statement.
...
CREDENTIALS {"user":"<username>"}
INTO TABLE my_database.test_tbl
...
  1. By default, MemSQL will use "user":"memsql".

  2. All pipelines statements use the following syntax for specifying delimiters. If the file is gzipped, it will be unzipped first, then parsed with delimiter options.
    https://docs.memsql.com/v7.0/reference/sql-reference/pipelines-commands/create-pipeline/#syntax

    [{FIELDS | COLUMNS}
     TERMINATED BY 'string'
       [[OPTIONALLY] ENCLOSED BY 'char']
       [ESCAPED BY 'char']
    ]
    [LINES
      [STARTING BY 'string']
      [TERMINATED BY 'string']
    ]
  1. MemSQL will fetch file metadata from the HDFS namenode, then files will get downloaded to the leaf nodes from the HDFS datanodes.

  2. Yes, most HDFS operations should start with talking to the namenode, and only afterward connecting to data nodes.

Thanks @mkobyakov. I tried below but I got the same permission error. Could this be due to my HDFS being setup with Kerberos authentication? If yes, is there any easy way other than following the steps mentioned in https://docs.memsql.com/v7.0/reference/configuration-reference/engine-variables-overview/engine-variables-overview/ ?

Also can we directly copy a table from Teradata database to MemSQL?

CREATE PIPELINE my_database.test_pipeline
AS LOAD DATA HDFS ‘hdfs://my-name-node:8020/landingzone/memsql/test_tbl’
CREDENTIALS ‘{“user”:"<my_user_nm>"}’
INTO TABLE my_database. test_tbl
FIELDS TERMINATED BY ‘\t’;

I see. For Kerberos support, you’ll need to additionally configure the cluster to use JRE and krb5-client, and turn on advanced_hdfs_pipelines. The steps are outlined here:

Don’t skip the section down the page, “Authenticating with Kerberos”.

@mkobyakov Is there a direct way of copying a table from Teradata to Mem SQL?

Could you please post a separate question? The answer is yes :slight_smile: