Https://docs.memsql.com/v6.8/guides/client-and-application/third-party-integrations/spark-connector/

puliyathsmitha · March 24, 2020, 11:11am

I am trying to load data to memsql using the query dataframe.format('com.spark.memsql.connector").mode('overwrite").save(“database.table_name”)

The problem is its not overwriting the records,but act like append.That is the same data is loaded again and again.Any solution to overwrite records or to avoid duplicates?

roxannapourzand · March 24, 2020, 8:38pm

Hello puliyathsmitha,

Thanks for reaching out and welcome to the MemSQL community!

We strongly recommend you use the Beta of the 3.0 Spark connector, which we are preparing for release soon, and it will resolve the issue you are encountering.

The connector is published on Github (https://github.com/memsql/memsql-spark-connector/tree/3.0.0-beta) and it is also on Maven Central (Maven Central Repository Search)

To overwrite existing records to avoid duplicates, you can use the 3.0 connector to specify the Overwrite and truncate options. See the examples below.

If you want to drop/recreate the table, you can use Overwrite mode and truncate = false, which indicates that you will drop/recreate the table to overwrite existing records.

df.write
    .format("memsql")
    .option("loadDataCompression", "LZ4")
    .option("truncate", "false")
    .mode(SaveMode.Overwrite)
    .save("foo.bar")

Alternatively, you can truncate the table instead of recreating to overwrite records:

df.write
    .format("memsql")
    .option("loadDataCompression", "LZ4")
    .option("truncate", "true")
    .mode(SaveMode.Overwrite)
    .save("foo.bar")

For additional resources on connecting and setting up the 3.0 version of the connector, please see:

The Github ReadME: https://github.com/memsql/memsql-spark-connector/blob/3.0.0-beta/README.md
The public documentation on the 3.0 connector: SingleStoreDB Cloud · SingleStore Documentation

Note that the API to write data to MemSQL is slightly different in the 3.0 version as compared to 2.0. It integrates with the JDBC reader/writer API. The documentation and ReadMe examples should have the information you need to get started, but please don’t hesitate to reach out for any further questions.

Thanks!
Roxanna

puliyathsmitha · March 26, 2020, 2:34am

Hello Roxanna,

Thanks for the reply.

I tried your solution.

df.write
.format(“memsql”)
.option(“loadDataCompression”, “LZ4”)
.option(“truncate”, “false”)
.mode(SaveMode.Overwrite)
.save(“foo.bar”) ,

But I received two errors.
Failed to find data source: memsql. Please find packages at Third-Party Projects | Apache Spark

Exception in thread “main” java.util.NoSuchElementException: No value found for ‘LZ4’

I am not able to change maven dependency to 3.0.0-beta-spark-2.3.4 from 2.0.1 because its throwing exception that dependency not found.

So I used current 2.0.1 and modified code in this way:

df.write
.format(“com.memsql.spark.connector”)
.option(“truncate”, “false”)
.mode(SaveMode.Overwrite)
.save(“foo.bar”) ,

but the duplicates are still there.Its not overwritten.

My memsql dependency in maven is:

com.memsql memsql-connector_2.11 2.0.1

I created a data frame
val df1 = Seq(empTable(1,“Tim”,“dev”),empTable(2,“Tom”,“dev”),empTable(3,“Fank”,“hr”)).toDF

and wrote to memsql:
df1.write
.format(“com.memsql.spark.connector”)
.option(“truncate”, “false”)
.mode(SaveMode.Overwrite)
.save(“foo.bar”)

Again created a data frame and wrote to the same memsql table.
val df2 = Seq(empTable(2,“Tom”,“dev”),empTable(3,“Fank”,“hr”),empTable(4,“kim”,“hr”)).toDF

df2.write
.format(“com.memsql.spark.connector”)
.option(“truncate”, “false”)
.mode(SaveMode.Overwrite)
.save(“foo.bar”)
Ideally now memsql table should contain 4 records,but it have 6, ie, duplicates are created.
But there is an additional column ’ memsql insert time’ automatically created ,which is different for each record.Is that the reason for it?

Regards
Smitha

roxannapourzand · April 2, 2020, 1:44am

Hi puliyathsmitha,

Thank you for your reply!

The 2.0 connector does not operate with the options I suggested in the 3.0 connector, so it makes sense why you saw the same behavior when changing the code.

Let’s try and get you up and running with the 3.0 connector. Since this is the beta version, you will have to explicitly specify which beta version you are running in your project dependency.

Can you try adding this to the dependency and see if it is recognized? This will pull our beta8 version of the connector (latest). I realized our documentation is slightly misleading on this, so I’ve created a task to update it.

<dependency>
  <groupId>com.memsql</groupId>
  <artifactId>memsql-spark-connector_2.11</artifactId>
  <version>3.0.0-beta8-spark-2.3.4</version>
</dependency>

For reference, here is a link to the Maven coordinates. If you click into these, you can get the respective code to insert the dependencies: Maven Central Repository Search

Please let me know if this works for you. You should be able to run the code above using the 3.0 connector and should not see duplicates.

Best,
Roxanna