Is it possible to transform a parquet file from S3 using a pipeline?

Hi guys,
I have buckets in S3 containing parquet files with JSON content, and an EC2 instance where MemSQL is installed.
I would like to know whether I can create a pipeline whose transform is a Python script stored on the EC2 instance, so that it decrypts the parquet files and processes the JSON content before loading it into a MemSQL table.

Hello,

Yes, it is possible to use our Transform feature to convert parquet files. Also note that we have released the MemSQL 7.0 beta, which has native parquet ingest capabilities.

https://www.memsql.com/7-beta-1
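
At a high level, a transform is just an executable that receives each file's raw bytes on stdin and writes delimited rows to stdout, which the pipeline then loads. Here is a minimal, unofficial sketch of the kind of script you describe, assuming pyarrow is installed on every node that runs the transform and a hypothetical schema with an `id` column and a `payload` column holding a JSON string:

```python
#!/usr/bin/env python3
# Minimal pipeline-transform sketch (not official guidance): the pipeline
# pipes each parquet file's raw bytes to stdin, and whatever we print to
# stdout is parsed as tab-separated rows.
# Assumes: pyarrow is installed on the nodes running the transform;
# hypothetical columns `id` and `payload` (a JSON string).
import io
import json
import sys

import pyarrow.parquet as pq

raw = sys.stdin.buffer.read()
# If your files are encrypted, decrypt `raw` here before parsing.
table = pq.read_table(io.BytesIO(raw))
cols = table.to_pydict()

for rec_id, payload in zip(cols["id"], cols["payload"]):
    doc = json.loads(payload)
    # Example "treatment" of the JSON: keep only the fields you need.
    cleaned = json.dumps({"user": doc.get("user"), "event": doc.get("event")})
    sys.stdout.write(f"{rec_id}\t{cleaned}\n")
```

You would then point the pipeline at the script via the WITH TRANSFORM clause of CREATE PIPELINE.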

I am testing parquet support with version 7-beta2 and I am receiving the following error for the unsupported parquet type INT96. What is the expected parquet support for MemSQL 7?

Failed to parse parquet metadata: "Not yet implemented: INT96 support not yet implemented."
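
For reference, the offending column shows up with physical type INT96 in the file metadata. This is how I checked with pyarrow (file name hypothetical):

```python
# List any legacy INT96 columns in a parquet file (pyarrow assumed).
import pyarrow.parquet as pq

schema = pq.ParquetFile("part-00000.parquet").schema  # parquet-level schema
for i, name in enumerate(schema.names):
    if schema.column(i).physical_type == "INT96":
        print(f"{name}: INT96 (legacy nanosecond timestamp)")
```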

INT96 is the legacy parquet encoding that Hive, Impala, and older Spark writers use for nanosecond timestamps. Its support didn't make it into beta 2, but it will in fact be supported in the final release of 7.0.

@mbhojani I checked the documentation for the 7.0 beta and found no mention of Parquet support. Can you please send me the link to the section that describes how to import parquet?

We just released the MemSQL 7 release candidate, and it supports Parquet ingestion, including INT96.

@nikita Can you please provide some documentation on LOAD DATA or CREATE PIPELINE for parquet? We are evaluating the beta right now. Thanks.

@nikita I have downloaded the 7.0 beta and tested creating an FS pipeline, and I still get the INT96 parse error: Failed to parse parquet metadata: "Not yet implemented: INT96 support not yet implemented."

I am using cluster-in-a-box: dev:147346b8-ead9-4067-ab83-6dd0480ed9ee

Sorry @sanjeev.mishra, it didn't quite make it in for the first release. Look for it in the release notes of a v7.0 dot-release later this month.

Any update on this? @alec

We have built this feature and it’s expected to land in 7.1 around April. Does this timeline work for you?
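
In the meantime, one unofficial workaround is to leave the files in the lake untouched and decode INT96 inside a pipeline transform: pyarrow already reads INT96 columns as timestamps, so the transform can emit strings that load into a DATETIME(6) column. A rough sketch, with hypothetical columns `event_time` (INT96) and `value`:

```python
#!/usr/bin/env python3
# Unofficial transform sketch: pyarrow decodes legacy INT96 values to
# timestamps, and we print tab-separated rows for the pipeline to load.
# Hypothetical schema: `event_time` (INT96) and `value`.
import io
import sys

import pyarrow.parquet as pq

table = pq.read_table(io.BytesIO(sys.stdin.buffer.read()))
cols = table.to_pydict()

for ts, value in zip(cols["event_time"], cols["value"]):
    # `ts` arrives as a Python datetime; format it for DATETIME(6).
    sys.stdout.write(f"{ts.strftime('%Y-%m-%d %H:%M:%S.%f')}\t{value}\n")
```

Throughput will depend on the transform process rather than the engine, so this is a stopgap, not a substitute for native support.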

This is now a red flag in our shortlisting project, as many of our existing data lake files use this format to store timestamped data. Would it be possible to get a whitepaper on an official, performant runtime workaround that would not involve transforming everything in the lake?

Thanks

We just added support for INT96 in parquet.