Ingestion from MongoDB

Is there any popular MongoDB adapter available to replicate data from a MongoDB cluster to a MemSQL cluster? I have a use case where I would like to listen to MongoDB collection updates and replicate them for later analytics. Is there an out-of-the-box solution, or should I go for a custom implementation?

Currently we don't have native replication from MongoDB into MemSQL. That said, if I were to build such a thing I would look into the following approach (a rough sketch of what steps 2 and 3 could look like follows the list):

  1. Find a way to replicate a Mongo replica set into Kafka (from a brief Google search, there appear to be some options here).
  2. Consume the Kafka stream of changes using MemSQL Pipelines (docs here: SingleStoreDB Cloud · SingleStore Documentation)
  3. Stream the data into a Columnstore table, storing each of the raw updates using our native JSON column type: SingleStoreDB Cloud · SingleStore Documentation
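
For step 3, here is a minimal sketch of what the MemSQL side could look like. The broker address, topic name, table name, and the assumption that each Kafka message is a JSON document with a top-level _id field are all hypothetical:

    -- Hypothetical columnstore table that keeps each raw change document as JSON.
    CREATE TABLE mongo_events_raw (
      _id TEXT,                 -- document id, pulled out to use as the sort key
      doc JSON NOT NULL,        -- the raw change document from Kafka
      KEY (_id) USING CLUSTERED COLUMNSTORE
    );

    -- Pipeline that consumes the Kafka topic and stores each message verbatim.
    -- The special mapping % assigns the entire JSON value to the column.
    CREATE PIPELINE mongo_events_pipeline AS
      LOAD DATA KAFKA 'kafka-host:9092/mongo.mydb.events'
      INTO TABLE mongo_events_raw
      FORMAT JSON
      (_id <- _id,
       doc <- %);

    START PIPELINE mongo_events_pipeline;

From there you can query into the raw JSON with our JSON functions, or reshape it into more structured tables later.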

Would love to hear whether this solution works for you; if so, it would be great if you could report back with your experience, including which pieces of software you ended up using.

I am trying to do the same (i.e. sync mongo to memsql).
I have Debezium dumping the Mongo oplog to Kafka, but could not find a good, simple transform to load that oplog into MemSQL using MemSQL Pipelines.
Does anyone know if something already exists or if there is a better solution?
Thanks.

i.e. Looking for help with Step 3 in carl's comment above. I do not want to store the raw JSON, but instead want to save only parts of the raw data.

If you know exactly what keys you want, create pipeline ... format json (memsql_col <- path::to::json_key, ...) is the way to go.
https://docs.memsql.com/sql-reference/v6.8/load-data/#json-load-data (this reference is for load data but the syntax is the same in pipelines)
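
For example, here is a minimal sketch of mapping a few specific keys (the table, topic, column names, and key paths below are all made up for illustration):

    -- Hypothetical target table holding only the fields we care about.
    CREATE TABLE customers (
      customer_id TEXT,
      email TEXT,
      city TEXT,
      KEY (customer_id) USING CLUSTERED COLUMNSTORE
    );

    -- Nested keys are addressed with ::, and DEFAULT supplies a value
    -- when a key is missing from a given message.
    CREATE PIPELINE customers_pipeline AS
      LOAD DATA KAFKA 'kafka-host:9092/dbserver1.mydb.customers'
      INTO TABLE customers
      FORMAT JSON
      (customer_id <- _id,
       email <- contact::email,
       city <- address::city DEFAULT NULL);

    START PIPELINE customers_pipeline;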

Thank you for the response, YoYo.

Although I do know exactly what fields I'd like to extract and can write my own transform, the JSON being posted is fairly complex, and I am hoping someone has already written a transform that covers the different scenarios (schema changes, invalid data, error handling, etc.).

Just wanted to let everyone know what ended up working out for us…

mongo oplog --> kafka (using the debezium connector) --> memsql pipelines --> memsql tables

Here are some tips:

  1. We ended up going with "transforms.unwrap.type": "io.debezium.connector.mongodb.transforms.UnwrapFromMongoDbEnvelope" because that was the easiest to integrate with the fewest changes on our side.

  2. We split the syncing of each collection into its own connector. Although maybe a bit inefficient, this was more robust: if there was a problem with one collection, the others could continue to succeed.

  3. The JSON for one of our collections was very complex. The only workaround was to use
    "transforms.unwrap.array.encoding": "document" and then parse it on the MemSQL side (a sketch of that parsing follows the list). Since we split our collections into separate connectors, only the complex one had to be altered; the others followed the default encoding (array).

  4. We relied heavily on "field.blacklist". "field.whitelist" (which could have saved us a ton of time) is currently not available.
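
To give an idea of the MemSQL-side parsing mentioned in tip 3: with "array.encoding": "document", Debezium encodes an array as a document keyed by index (e.g. {"_0": ..., "_1": ...}), so known positions can be pulled out with MemSQL's JSON functions. A rough sketch, where the raw_collection table, its doc JSON column, and the phone_numbers field are all hypothetical:

    -- Pull fixed positions out of an array that Debezium re-encoded as a document.
    SELECT
      JSON_EXTRACT_STRING(doc, '_id')                 AS customer_id,
      JSON_EXTRACT_STRING(doc, 'phone_numbers', '_0') AS primary_phone,
      JSON_EXTRACT_STRING(doc, 'phone_numbers', '_1') AS secondary_phone
    FROM raw_collection;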

Good luck!

Hey @geet

It's Jacky, product manager at MemSQL.

Thanks so much for following up on ingesting from Mongo!

I think the community would love to read about how you did it, seeing how many views this thread has gotten.

Would you mind posting a new thread with what you wrote and sharing the guide with the MemSQL community? I'm sure there will be a lot of value in it. :slight_smile: