Stemming and stopwords support with FULLTEXT index

Does MemSQL support stemming and stopwords with FULLTEXT index out of the box? The MATCH documentation mentions stopwords and one of your blog posts mentions stemming.

If yes, which languages are supported, and how does MemSQL determine which language is being used?

Otherwise, I’ll just stick to an application-side solution.

Great question. Regarding stopwords, we do have a list of stopwords that we use. The list is hardcoded. We do need to document them though, which I will make sure gets done. Stemming is something we decided not to implement for the moment but something we are considering for the future. We currently only support English.

@rick thanks for the clarification.

I’ll go on with an application-side solution. This will probably also give us better control.

Perhaps the hardcoded stopwords should be optional in a future release? They might conflict with other languages. E.g. the English word “have” (probably a stopword?) means “garden” in Danish.

Any chance a future version will support both wildcards and logic operators (AND, OR etc.) together?

In Scandinavian languages we have many compound words like “underkjole”, so a search query in our case could look like “hvid AND *kjole” (white underdress, similar to “white AND *dress”).

Also, a way to escape operators would be useful. E.g. “and” means “duck” in Danish. Even though the documentation says the operator has to be uppercase, a search for lowercase “and” returns no results, while a LIKE or equals comparison does.

Sounds like the issues are because you are trying to use this on non-English languages. We only support English right now. Once we support other languages, there would be a different set of stopwords for each language (and other behavioral differences).

We do support wildcard and boolean logic operators. Also, I am not sure what you mean by “a way to escape operators”. It sounds like you are trying to do something specific and it isn’t working. If you send me a description of what you are trying to do, the query, and the results, we can see if there is a way to make it work.

Rick

Indeed, we’re using non-English languages.

I’ll get back to you tomorrow with an example of the wildcard and logic limitations and an example illustrating the drawbacks of not having a way to escape operators.

We’re still hitting some edge cases, primarily because of non-English languages.

Any plans to make English stopwords optional, @rick?

Also, a public list of your hardcoded stopwords would be really helpful. Then we might be able to implement some workarounds in the meantime.

The forced stopwords also cause trouble for English searches. When a user searches for e.g. “games for kids”, we transform this into “games* AND for* AND kids*” to ensure all terms are present. Because “for” is a stopword and we perform a boolean expression, no results are returned.

This is just one example of why optional stopwords would be useful.

I received this list from their support team in March. I agree that perhaps they could create a system table with the stopwords in it, so someone could update or delete some of them. We have some industry terms that occur so often that they do nothing to enrich the search and probably lengthen the index.

“a”, “an”, “and”, “are”, “as”, “at”, “be”, “but”, “by”,

“for”, “if”, “in”, “into”, “is”, “it”,

“no”, “not”, “of”, “on”, “or”, “such”,

“that”, “the”, “their”, “then”, “there”, “these”,

“they”, “this”, “to”, “was”, “will”, “with”
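
Given that list, one application-side workaround for the “games for kids” case described earlier in the thread is to drop stopwords before building the boolean expression. Here is a minimal Python sketch, assuming the list above is complete and still current (it came from support, not from the documentation). The names `STOPWORDS` and `build_boolean_query` are hypothetical and not part of MemSQL.

```python
# Stopword list as quoted in this thread (reported by support; may change between releases).
STOPWORDS = frozenset({
    "a", "an", "and", "are", "as", "at", "be", "but", "by",
    "for", "if", "in", "into", "is", "it",
    "no", "not", "of", "on", "or", "such",
    "that", "the", "their", "then", "there", "these",
    "they", "this", "to", "was", "will", "with",
})


def build_boolean_query(user_input: str) -> str:
    """Build a 'term* AND term*' expression, skipping stopwords.

    Stopwords are compared case-insensitively; the remaining terms
    keep their original spelling.
    """
    terms = [t for t in user_input.split() if t.lower() not in STOPWORDS]
    return " AND ".join(f"{t}*" for t in terms)


print(build_boolean_query("games for kids"))  # games* AND kids*
```

If every term is a stopword, the function returns an empty string, so the caller would need to handle that case (for example by falling back to a LIKE query).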


Thanks for the feedback. We will consider better support for non-English languages in our next release.

Rick
MemSQL VP Product Management


Thanks @rick, I really appreciate you considering this for the next release!

Let me know if you need more examples/cases when you discuss this internally. I’ll be happy to collaborate.