Today, we are sharing that MemSQL now has Full-Text Search, a highly requested feature, built into the product. Thanks to customer feedback, we are delighted to make it available for all companies building real-time applications.
What is Full-Text Search?
You might be thinking, “MemSQL is pretty fast at searching things and they already support large strings, so why do they need to add anything?” So let’s start with a description of Full-Text Search (FTS). Full-Text Search is different than querying for a string value in two ways.
The first is a performance difference. Sometimes when you search for text you know exactly what you are looking for. For example, if you want to see a list of mystery books you would ask the database to find all the books with a subject of “Mystery”, easy! But what if you have many documents, each with a lot of words in it, and you don’t know what information the documents contain. Specifically, suppose you wanted to find all the documents that talk about “Canada”. You could use a LIKE query or a regular expression, but that is going to be pretty slow with a regular database index, no matter how fast your database is. What you need is a different kind of index that keeps track of all the words in the document and how often they appear. So when you search for Canada, the system can quickly determine if the word Canada appears in the document and how many times. This is called an inverted index.
Secondly, your search might have a lot of matches. FTS includes the notion of relevance. Relevance tells you not just whether there is a match but also how good the match is. This allows you to sort the results showing the most relevant results first.
The combination of the inverted index data structure, stemming capabilities and relevance scoring are why Full-Text Searching is advantageous over searching strings using regular database queries.
How is Full-Text Search integrated with MemSQL?
MemSQL is a system that understands the notion of an index, so integrating FTS into MemSQL was relatively straightforward. We just had to teach MemSQL to understand this new kind of index. Like most technologies that support full-text indexing, we took advantage of the open-source Lucene technology. Lucene is a robust existing technology that is essentially the standard for full-text searching in the industry. It has been around a long time, so we didn’t have to reinvent the wheel.
To create an index, a user simply has to specify which columns they want to be full-text indexed when creating the table. Under the covers this causes a full-text index to be created. The index is partitioned using the same sharding pattern as the table, allowing for a distributed index that can also participate efficiently in local joins. The lifecycle of the index is governed by the lifecycle of the table, so you don’t have to do any extra work. All the management is handled by the system (failover, rebalance, node addition and removal, backup/restore, etc.). In addition, the index is bound to the same security mechanisms as the table, so the data is protected by the role-based access control features in MemSQL. This ensures the data in your index is only available to the users who have been granted permission to see the data. So all the management of the full-text capability is transparent to the user.
To make use of the index, you simply include the MATCH statement directly in your SQL query. This is the great part about the integration with MemSQL. You can mix and match your full-text search along with structured search in your filter. This makes the process very easy for someone familiar with SQL to take advantage of the full-text index. Because the index is automatically sharded, users get to take advantage of all the machines in the cluster to execute the lookup in the index resulting in blazing fast results. The full-text query itself is the standard Lucene syntax, which will be familiar to those who have used other full-text systems.
How would I use it?
Utilizing FTS in conjunction with a regular database application is helpful across a number of different use cases. Here are a few examples.
The simplest case is doing something such as an autocomplete functionality in a search box for your application. In this case, there are no documents in the application, it simply wants to quickly identify which results of a string column match a subset of characters. A simple query against the table using the match statement and returning the full-text of the column accomplishes the task.
Mixing Structured and Unstructured Query
A more complex example shows up in industries such as insurance or real-estate, which are inherently document-centric. In these use cases, the documents are a core part of the data set. Let’s take a real estate transaction as an example. There is a lot of data that is well understood (and therefore easy to structure) such as location, number of bedrooms and bathrooms, year built, etc. So you can imagine if someone wants to query the system using a structured query to limit a specific location and age of house, but also look for agreements that have a particular phrase in the doc. In this case, the user constructs a query with a structured search filter for location and age and adds a match statement in the filter to look for the phrase.
First let’s create a table. This looks like a regular CREATE TABLE statement. The only thing added is the FULLTEXT statement with the column names “body” and “title” added. This means we want a full-text index on the body column and the title column. Note you can only do full-text on string types. That’s it, your full-text index is ready to use. Just insert some data and you can start writing queries.
CREATE TABLE books( id INT UNSIGNED, title VARCHAR(100), publish_year INT UNSIGNED, body TEXT, KEY (id) USING CLUSTERED COLUMNSTORE, FULLTEXT (title, body));
Now let’s put some data into the table.
INSERT INTO books VALUES (1, 'Typing Test', 1981, 'The quick brown fox jumped over the lazy dog'); INSERT INTO books VALUES(2, 'Ylvis', 2014, 'What does the fox say'); INSERT INTO books VALUES(3, 'Gone With the Wind', 1936, 'In her face were too sharply blended the delicate features of her mother, a Coast aristocrat of French descent, and the heavy ones of her florid Irish father. But it was an arresting face, pointed of chin, square of jaw. Her eyes were pale green without a touch of hazel, starred with bristly black lashes and slightly tilted at the ends. Above them, her thick black brows slanted upward, cutting a startling oblique line in her magnolia-white skin--that skin so prized by Southern women and so carefully guarded with bonnets, veils and mittens against hot Georgia suns.');
Now we run optimized tables to ensure the index is updated. (Inserts are flushed asynchronously to the index). You don’t need to do this in practice as it will flush reasonably quickly, but for the purposes of this example you want to make sure the inserts are in the index.
OPTIMIZE TABLE books FLUSH;
Here is an example of a basic full-text query. The query looks for all rows that have a title or body with the string “fox” in it.
SELECT * FROM books WHERE MATCH (title, body) AGAINST ('fox');
This example uses a structured predicate (matching only books published in 2014) with a full-text predicate (books with the word fox in it).
SELECT count(*) FROM books WHERE publish_year = 2014 AND MATCH (body) AGAINST ('fox');
We also support returning a relevance score, which is a score for how successful the match is. You can put the relevance as an output column in your query.
SELECT id, title, MATCH (body) AGAINST ('Fox') relevance FROM books WHERE MATCH (body) AGAINST ('Fox');
You can also use the relevance score in the WHERE clause to filter the results.
SELECT id, title, MATCH (body) AGAINST ('fox') FROM books WHERE MATCH (body) AGAINST ('fox') >= 0.12;
Doing FTS with MemSQL is that simple. But don’t take my word for it. Download the beta, and start playing with the Full-Text Index feature right now.