Hi Ömer! We are working on this right now and validated just a couple of days ago that our MemSQL Spark Connector can be used directly with AWS Glue. You can get the connector from GitHub (memsql/singlestore-spark-connector). Here are the rough steps to get started:
Create an S3 bucket and folder.
Add the Spark Connector files to the folder.
Create another folder in the same bucket to be used as the Glue temporary directory in later steps.
Switch to the AWS Glue Service.
Click on Jobs on the left panel under ETL.
Provide a name for the job.
Select an IAM role. Create a new IAM role if one doesn’t already exist and be sure to add all Glue policies to this role.
Select the option for A new script to be authored by you.
Give the script a name.
Set the temporary directory to the one you created.
Expand Script libraries and job parameters:
Under the Dependent jars path, add entries for the Spark Connector .jar files.
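In case it is easier to script than to click through the console, the steps above could be sketched roughly as follows. This is only an illustration: the helper name, bucket, role ARN, and paths are made up, and it assumes the console's "Dependent jars path" corresponds to the `--extra-jars` job argument and the temporary directory to `--TempDir`.

```python
def build_glue_job_definition(job_name, role_arn, script_location, temp_dir, jar_paths):
    """Return keyword arguments that could be passed to boto3's glue.create_job().

    Hypothetical helper: nothing here talks to AWS, it only assembles the dict.
    """
    return {
        "Name": job_name,
        "Role": role_arn,
        "Command": {
            "Name": "glueetl",                  # Spark ETL job type
            "ScriptLocation": script_location,  # script authored by you, stored in S3
        },
        "DefaultArguments": {
            "--TempDir": temp_dir,                 # Glue temporary directory
            "--extra-jars": ",".join(jar_paths),   # comma-separated S3 jar paths
        },
    }


# Placeholder values only; replace with your own bucket, role, and jar names.
job = build_glue_job_definition(
    job_name="memsql-etl",
    role_arn="arn:aws:iam::123456789012:role/MyGlueRole",
    script_location="s3://my-bucket/scripts/memsql_etl.py",
    temp_dir="s3://my-bucket/glue-temp/",
    jar_paths=["s3://my-bucket/connector/memsql-spark-connector.jar"],
)

# To actually create the job (requires boto3 and AWS credentials):
#   import boto3
#   boto3.client("glue").create_job(**job)
```

Keeping the definition as a plain dict makes it easy to inspect the jar list before anything is sent to AWS.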
Please let us know if you are able to get this working.
Hi Ömer, it seems like you have an old version of MemSQL (3.1). Currently, the latest version of MemSQL is 7.1. An update to this version may solve your problem.
In addition, you can use the MariaDB driver in AWS Glue by adding a MariaDB jar to the Dependent jars path of your ETL job.
Please let us know if it works for you.
Best Regards!
Hi,
We are actually using the most recent version of MemSQL (7.1).
I believe the shown error is a result of the MySQL drivers that were used, or some function that MemSQL does not support.
The MariaDB driver is not even able to connect to the database, at all.
We have also received feedback from AWS that native MemSQL support is not planned (as expected, tbh).
We will try to implement Carl’s solution and reach out again. Thanks.
Kind regards,
Ömer
Thanks for the description. We were not able to get it to work, though. We are now receiving the following error: “Command failed with exit code 1”.
Could you please provide a more detailed description of what needs to be done?
To give you a better understanding of what we want to do:
We would like to run ETL tasks within Glue that let us push and pull data from various sources and destinations supported by Glue (e.g. S3, SQL Server) and transform data within Glue. Ideally, we’d be able to run SQL statements in MemSQL that are triggered from Glue.
(We have read though, that this is not possible and we’d need to use Scala to run SQL queries on dataframes/datasets that are in Glue, and not in MemSQL).
At the moment we can’t even push data from S3 to MemSQL using Glue (we know this can be easily done using MemSQL pipelines - our final goal is different here).
If you could provide us any help, that’d be very much appreciated.
Hi,
Thanks for the answer.
I can’t see any MemSQL configuration properties in your script (such as the host and port of the MemSQL instance, the user, etc.). Could you please describe how you configured MemSQL in your Glue job?
Hi,
That is exactly my point. I guess we are missing or misunderstanding something. Can you provide us with some example code, please?
We configured it using the JDBC driver, but I assume this is incorrect when using the MemSQL Spark Connector.
Thanks,
Ömer
Edit your Glue job and add the following to the Jar lib path field: s3://<your_bucket>/spray-json_2.11-1.3.5.jar,s3://<your_bucket>/memsql-spark-connector_2.11-3.0.4-spark-2.4.4.jar,s3://<your_bucket>/mariadb-java-client-2.7.0.jar
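Once the jars are on the job, the connection settings would go into the script as connector options rather than a JDBC URL. Here is a minimal sketch, assuming the 3.0.x connector's `memsql` data source and its `ddlEndpoint`, `user`, and `password` options; the helper names, host, credentials, and table names are placeholders, so please check them against the connector README for your version.

```python
def memsql_connector_options(host, port, user, password):
    """Build the option map for the connector (hypothetical helper)."""
    return {
        "ddlEndpoint": f"{host}:{port}",  # master aggregator, host[:port]
        "user": user,
        "password": password,
    }


def read_table(spark, options, table):
    """Load a MemSQL table ("database.table") into a Spark DataFrame."""
    return spark.read.format("memsql").options(**options).load(table)


def write_table(df, options, table, mode="append"):
    """Write a Spark DataFrame to a MemSQL table ("database.table")."""
    df.write.format("memsql").options(**options).mode(mode).save(table)


# Placeholder connection details, not real values.
options = memsql_connector_options("memsql-master.example.internal", 3306, "etl_user", "secret")

# Inside a Glue job the session would typically come from the GlueContext:
#   spark = glueContext.spark_session
#   df = read_table(spark, options, "mydb.source_table")
#   write_table(df, options, "mydb.target_table")
```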
We weren’t really able to understand what we did wrong here. We assume “java.io.FileNotFoundException: File --spark-event-logs-path does not exist” points toward some issue, but did not know what to check.
Hi,
I’m not sure this is the cause, but could you check whether the path you pass for --spark-event-logs-path exists? You run your job with this job parameter, and it seems the path couldn’t be found.
Also, if the path does exist, could you please try running your job without this parameter?
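One possible reading of the error (a guess, not a confirmed diagnosis) is that the parameter name itself ended up being treated as a path, which could happen if its value is empty or shifted. A small check like the one below, run over a copy of the job's parameters, might help spot that. The function name and the sample values are made up for illustration.

```python
def find_suspicious_arguments(arguments):
    """Return human-readable notes for job parameters that look misconfigured.

    Hypothetical helper: it flags empty values and values that look like
    another parameter name, nothing more.
    """
    problems = []
    for key, value in arguments.items():
        text = "" if value is None else str(value).strip()
        if text == "":
            problems.append(f"{key}: empty value")
        elif text.startswith("--"):
            problems.append(f"{key}: value looks like another parameter name")
    return problems


# Sample parameters with a deliberately empty event-log path.
sample = {
    "--enable-spark-ui": "true",
    "--spark-event-logs-path": "",
}
for note in find_suspicious_arguments(sample):
    print(note)
```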
Hi,
It’s hard to say without seeing your whole code, but I suspect this line needs to be indented: this is Python, where indentation is significant, so the script only validates if the indentation is right.
Hi,
Is it possible for us to receive direct support in this specific instance?
Also, do you see any way to run SQL code from within AWS Glue, so that the data transformation happens in MemSQL rather than in Glue?
Kind regards,
Ömer
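On running the transformation inside MemSQL rather than in Glue: one approach that may be worth trying is handing the connector a SQL statement instead of a table name, so the query is executed by MemSQL and only the result comes back to Spark. The sketch below assumes the connector accepts a `query` read option alongside `ddlEndpoint`, `user`, and `password`; please verify that against the connector documentation for your version. The helper names, table, and column names are invented for the example.

```python
def build_aggregation_sql(table, group_column, value_column):
    """Compose an example aggregation statement to be run by MemSQL."""
    return (
        f"SELECT {group_column}, SUM({value_column}) AS total "
        f"FROM {table} GROUP BY {group_column}"
    )


def read_query_result(spark, options, sql):
    """Ask the connector to run `sql` in MemSQL and return the result as a DataFrame.

    Assumes the "memsql" data source supports a "query" option.
    """
    return (
        spark.read.format("memsql")
        .options(**options)
        .option("query", sql)
        .load()
    )


# Placeholder connection options and table/column names.
options = {
    "ddlEndpoint": "memsql-master.example.internal:3306",
    "user": "etl_user",
    "password": "secret",
}
sql = build_aggregation_sql("mydb.orders", "customer_id", "amount")

# Inside a Glue job:
#   spark = glueContext.spark_session
#   totals = read_query_result(spark, options, sql)
```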