HDFS Pipeline Error

Hi, all.

I am trying to create an HDFS pipeline, but I’m getting an error.

CREATE PIPELINE pi_hdfs
AS LOAD DATA HDFS ‘hdfs://~~:8020/~~’
CONFIG ‘{“disable_gunzip”: true}’
INTO TABLE ‘~~’(data_line)
SET file_path = pipeline_source_file();

========================================

ERROR 1993 ER_EXTRACTOR_GET_LASTEST_OFFSETS: 
Cannot get source metadata for pipeline. could not walk folder /CTTM/TEST/~~/:stat /CTTM/TEST/~~/:getFileInfo call failed with FATAL_UNAUTHORIZED (org.apache.hadoop.security.AuthorizationException)

Actually, Hadoop’s access account is ‘sysadmin’, so I ran this query, but I got the same error again.

CREATE PIPELINE pi_hdfs
AS LOAD DATA HDFS ‘hdfs://~~:8020/~~’
CONFIG ‘{“disable_gunzip”: true}’
CREDENTIALS '{"user":"sysadmin"}'
INTO TABLE ‘~~’(data_line)
SET file_path = pipeline_source_file();

Is there any other way to change the access account when the pipeline accesses Hadoop?

hi minkyung.kang, and thanks for trying out HDFS pipelines!

could you run the hdfs dfs -ls command, which shows the user that owns your hdfs:// path, and post the output here?

i just noticed you have mismatched (curly vs. straight) double quotes in the CONFIG and CREDENTIALS sections. can you double-check them?

hi mkobyakov

First, there were several typos in my question. The Hadoop access account is ‘gpadmin’, not ‘sysadmin’, and it’s the same in the query. Using different double quotes in the query was a typing error, too.
I am sorry for the confusion.

Now, I will answer your request.

  1. The user that owns the hdfs:// path is ‘gpadmin’.
  2. The different double quotes were a typo. :pensive:

Here is the second query, corrected:

CREATE PIPELINE pi_hdfs
AS LOAD DATA HDFS 'hdfs://~~:8020/~~'
CONFIG '{"disable_gunzip": true}'
CREDENTIALS '{"user":"gpadmin"}'
INTO TABLE '~~'(data_line)
SET file_path = pipeline_source_file();

Additionally, I tried to run the query with these syntaxes, too, but the same error occurred.

CREDENTIALS '{"user":'gpadmin'}'

CREDENTIALS '{"user":gpadmin}'

The error mentions this path. Could you check that the gpadmin user has permission to access all subfolders of the top path recursively?

neither of these is valid JSON syntax. the key and value must both be double-quoted
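The double-quote point can be checked with any JSON parser before touching the pipeline. Here is a minimal Python sketch (assuming only that the CREDENTIALS payload must be valid JSON, which the rejections above suggest) comparing the variants tried earlier:

```python
import json

# Variants of the CREDENTIALS payload, mirroring the ones tried above.
candidates = [
    '{"user":"gpadmin"}',   # double-quoted key and value: valid JSON
    "{'user':'gpadmin'}",   # single quotes: not valid JSON
    '{"user":gpadmin}',     # bare (unquoted) value: not valid JSON
]

for text in candidates:
    try:
        json.loads(text)
        print(f"{text} -> parsed OK")
    except json.JSONDecodeError as err:
        print(f"{text} -> rejected: {err.msg}")
```

Only the first variant parses; the other two are rejected before the value ever reaches HDFS.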

Thanks for your reply, mkobyakov
I checked all subfolders with the hdfs dfs -ls command, and ‘gpadmin’ has permission to access all of these directories. Is there anything else I need to check?

hmm, those are most of the things i would suspect on the memsql side. can you describe your Hadoop cluster in more detail? do you by any chance use an authentication module like Kerberos?

I’m not using an authentication module in Hadoop.
All subfolders of the top path grant r-x permissions to other users, so the ‘memsql’ account has access to the files and directories.
Can pipeline problems occur depending on the version? I’m using MemSQL version 7.1.10.
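As background on the r-x point above: as on POSIX filesystems, listing an HDFS directory requires the r bit and descending into it requires the x bit, so every ancestor directory of the source path must grant both to the user running the engine. A small illustrative Python sketch of how those "other" mode bits decompose (this is just the permission arithmetic, not an HDFS API):

```python
import stat

def other_perms(mode: int) -> str:
    """Render the 'other' permission triplet of a mode, e.g. 0o755 -> 'r-x'."""
    return (
        ("r" if mode & stat.S_IROTH else "-")
        + ("w" if mode & stat.S_IWOTH else "-")
        + ("x" if mode & stat.S_IXOTH else "-")
    )

def other_can_walk(mode: int) -> bool:
    """A directory is listable and traversable by 'other' users only with both r and x."""
    return bool(mode & stat.S_IROTH) and bool(mode & stat.S_IXOTH)

print(other_perms(0o755), other_can_walk(0o755))  # r-x True
print(other_perms(0o750), other_can_walk(0o750))  # --- False
```

A directory tree where even one ancestor is 0o750 (no bits for "other") would block the walk for a non-owner, non-group user, which matches the "could not walk folder" shape of the error above.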

hi minkyung.kang.

HDFS pipelines also support an “advanced” mode, where the pipeline is run via native Java libraries.

if you are not using encryption or authentication, you do not need those extra CONFIG elements mentioned in the guide.

you would still need to make sure the folders and files are readable by the user that runs the engine, in your case memsql.

would you be able to try this and let us know if it worked?