Is it possible to transform a parquet file from S3 using a pipeline?

Hi guys,
I have buckets in S3 containing parquet files with JSON content, and an EC2 instance where MemSQL is installed.
I would like to know whether I can create a pipeline whose transform is a Python script stored on the EC2 instance, so that it decrypts the parquet files and processes the JSON content before loading it into a MemSQL table.

Hello,

Yes, it is possible to use our Transform feature to convert parquet files. Also note that we have released the MemSQL 7.0 beta, which has native parquet ingest capabilities.

https://www.memsql.com/7-beta-1
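
At a high level, a transform is just an executable that receives each file's raw bytes on stdin and writes delimited rows to stdout, which the pipeline then loads. Here is a minimal, unofficial sketch of the kind of script you describe, assuming pyarrow is installed on every node that runs the transform and a hypothetical schema with an `id` column and a `payload` column holding a JSON string:

```python
#!/usr/bin/env python3
# Minimal pipeline-transform sketch (not official guidance): the pipeline
# pipes each parquet file's raw bytes to stdin, and whatever we print to
# stdout is parsed as tab-separated rows.
# Assumes: pyarrow is installed on the nodes running the transform;
# hypothetical columns `id` and `payload` (a JSON string).
import io
import json
import sys

import pyarrow.parquet as pq

raw = sys.stdin.buffer.read()
# If your files are encrypted, decrypt `raw` here before parsing.
table = pq.read_table(io.BytesIO(raw))
cols = table.to_pydict()

for rec_id, payload in zip(cols["id"], cols["payload"]):
    doc = json.loads(payload)
    # Example "treatment" of the JSON: keep only the fields you need.
    cleaned = json.dumps({"user": doc.get("user"), "event": doc.get("event")})
    sys.stdout.write(f"{rec_id}\t{cleaned}\n")
```

You would then point the pipeline at the script via the WITH TRANSFORM clause of CREATE PIPELINE.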

I am testing parquet support with version 7-beta2 and I am receiving the following error for the unsupported parquet type INT96. What is the expected parquet support for MemSQL 7?

Failed to parse parquet metadata: "Not yet implemented: INT96 support not yet implemented."
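
For reference, the offending column shows up with physical type INT96 in the file metadata. This is how I checked with pyarrow (file name hypothetical):

```python
# List any legacy INT96 columns in a parquet file (pyarrow assumed).
import pyarrow.parquet as pq

schema = pq.ParquetFile("part-00000.parquet").schema  # parquet-level schema
for i, name in enumerate(schema.names):
    if schema.column(i).physical_type == "INT96":
        print(f"{name}: INT96 (legacy nanosecond timestamp)")
```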

INT96 is the legacy parquet encoding that Hive, Impala, and older Spark writers use for nanosecond timestamps. Its support didn't make it into beta 2, but it will in fact be supported in the final release of 7.0.

@mbhojani I checked the documentation for the 7.0 beta and found no mention of Parquet support. Can you please send me the link to the section that describes how to import parquet?

We just released the MemSQL 7 release candidate, and it supports Parquet ingestion, including INT96.

@nikita Can you please provide some documentation on LOAD DATA or CREATE PIPELINE for parquet? We are evaluating the beta right now. Thanks.

@nikita I have downloaded the 7.0 beta and tested creating an FS pipeline, and I still get the INT96 parse error: Failed to parse parquet metadata: "Not yet implemented: INT96 support not yet implemented."

I am using cluster-in-a-box: dev:147346b8-ead9-4067-ab83-6dd0480ed9ee

Sorry @sanjeev.mishra, it didn't quite make it in for the first release. Look for it in the release notes of a v7.0 dot-release later this month.

Any update on this? @alec

We have built this feature and it’s expected to land in 7.1 around April. Does this timeline work for you?
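
In the meantime, one unofficial workaround is to leave the files in the lake untouched and decode INT96 inside a pipeline transform: pyarrow already reads INT96 columns as timestamps, so the transform can emit strings that load into a DATETIME(6) column. A rough sketch, with hypothetical columns `event_time` (INT96) and `value`:

```python
#!/usr/bin/env python3
# Unofficial transform sketch: pyarrow decodes legacy INT96 values to
# timestamps, and we print tab-separated rows for the pipeline to load.
# Hypothetical schema: `event_time` (INT96) and `value`.
import io
import sys

import pyarrow.parquet as pq

table = pq.read_table(io.BytesIO(sys.stdin.buffer.read()))
cols = table.to_pydict()

for ts, value in zip(cols["event_time"], cols["value"]):
    # `ts` arrives as a Python datetime; format it for DATETIME(6).
    sys.stdout.write(f"{ts.strftime('%Y-%m-%d %H:%M:%S.%f')}\t{value}\n")
```

Throughput will depend on the transform process rather than the engine, so this is a stopgap, not a substitute for native support.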

This is now a red flag in our shortlisting project, as many of our existing data lake files use this format to store timestamped data. Would it be possible to get a whitepaper on an official, performant runtime workaround that would not involve transforming everything in the lake?

Thanks

We just added support for INT96 in parquet.