Stemming and stopwords support with FULLTEXT index

memsql

#1

Does MemSQL support stemming and stopwords with FULLTEXT index out of the box? The MATCH documentation mentions stopwords and one of your blog posts mentions stemming.

If yes, which languages are supported, and how does MemSQL determine which language is being used?

Otherwise, I’ll just stick to an application-side solution.


#2

Great question. Regarding stopwords, we do have a list of stopwords that we use. The list is hardcoded. We do need to document them though, which I will make sure gets done. Stemming is something we decided not to implement for the moment but something we are considering for the future. We currently only support English.


#3

@rick thanks for the clarification.

I’ll go on with an application-side solution. This will probably also give us better control.

Perhaps the hardcoded stopwords could be made optional in a future release? They might conflict with other languages. E.g. the English word “have” (probably a stopword?) means “garden” in Danish.
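
To illustrate the application-side approach, here is a minimal Python sketch of per-language stopword removal done before the text reaches the database. The stopword lists are tiny, hypothetical placeholders (MemSQL's actual hardcoded list is not documented in this thread), and the function names are made up for the example.

```python
import re

# Hypothetical, deliberately tiny stopword lists for illustration only.
# These are NOT MemSQL's built-in list.
STOPWORDS = {
    "en": {"the", "a", "an", "and", "have", "of"},
    "da": {"og", "en", "et", "den", "det", "af"},
}

def tokenize(text):
    """Lowercase the text and split it into word tokens (Unicode-aware)."""
    return re.findall(r"\w+", text.lower())

def remove_stopwords(text, lang):
    """Drop only the tokens that are stopwords in the given language."""
    stop = STOPWORDS[lang]
    return [tok for tok in tokenize(text) if tok not in stop]

# "En have og en and" is Danish for "A garden and a duck".
danish = remove_stopwords("En have og en and", "da")
english = remove_stopwords("En have og en and", "en")
```

With the Danish list, “have” and “and” survive as ordinary words; with the English list, the same two words are dropped, which is the conflict described above.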


#4

Any chance a future version will support both wildcards and logic operators (AND, OR, etc.) together?

In Scandinavian languages we have many compound words like “underkjole”, so a search query in our case could look like “hvid AND *kjole” (white underdress, similar to “white AND *dress”).

Also, a way to escape operators would be useful. E.g. “and” means “duck” in Danish. Even though the operator has to be uppercase per the documentation, no results are returned when searching for the lowercase word, whereas a query using LIKE or an equals operator does return results.
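
As a rough illustration of the semantics being asked for, the Python sketch below evaluates “hvid AND *kjole” on the application side: a document matches if it contains the exact word and at least one word ending in the given suffix. This is only an assumption about the intended behavior, not a description of how MemSQL evaluates MATCH; the function names are hypothetical.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens (Unicode-aware)."""
    return re.findall(r"\w+", text.lower())

def matches_word_and_suffix(text, word, suffix):
    """Return True if text contains `word` AND some token ending in `suffix`."""
    tokens = tokenize(text)
    has_word = word in tokens
    has_suffix = any(tok.endswith(suffix) for tok in tokens)
    return has_word and has_suffix

def contains_literal(text, word):
    """Treat `word` as plain text, even if it looks like an operator (e.g. "and")."""
    return word.lower() in tokenize(text)
```

For example, `matches_word_and_suffix("Hvid underkjole i silke", "hvid", "kjole")` is true because “underkjole” ends in “kjole”, and `contains_literal("En and i haven", "and")` treats “and” as the Danish word rather than as an operator.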


#5

It sounds like the issues arise because you are trying to use this with non-English languages. We only support English right now. Once we support other languages, there would be a different set of stopwords for each language (and other behavioral differences).

We do support wildcard and boolean logic operators. Also, I am not sure what you mean by “a way to escape operators”. It sounds like you are trying to do something specific and it isn’t working. If you send me a description of what you are trying to do, along with the query and the results, we can see if there is a way to make it work.

Rick


#6

Indeed, we’re using non-English languages.

I’ll get back to you tomorrow with an example of the wildcard and logic limitations, and an example illustrating the drawbacks of not having a way to escape operators.