Hybrid Search
Note
This notebook can be run on a Free Starter Workspace. To create a Free Starter Workspace navigate to Start using the left nav. You can also use your existing Standard or Premium workspace with this Notebook.
Source: OpenAI Cookbook
Hybrid search integrates keyword-based search and semantic search to combine the strengths of both and give users a more comprehensive and efficient search experience. This notebook is an example of how to perform hybrid search with SingleStore's database and notebooks.
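To illustrate the idea before touching the database, here is a minimal, self-contained sketch of a hybrid score: a toy keyword-match score combined with a cosine-similarity score by simple averaging (the same combination strategy used later in this notebook). The helper names are hypothetical, not part of any library.

```python
import numpy as np

def keyword_score(query: str, text: str) -> float:
    # Toy keyword match: fraction of query terms that appear in the text
    terms = query.lower().split()
    return sum(t in text.lower() for t in terms) / len(terms)

def semantic_score(query_vec: np.ndarray, doc_vec: np.ndarray) -> float:
    # Cosine similarity between two embedding vectors
    return float(np.dot(query_vec, doc_vec) /
                 (np.linalg.norm(query_vec) * np.linalg.norm(doc_vec)))

def hybrid_score(query: str, text: str,
                 query_vec: np.ndarray, doc_vec: np.ndarray) -> float:
    # Average the keyword and semantic signals into one score
    return (keyword_score(query, text) + semantic_score(query_vec, doc_vec)) / 2
```

In practice the keyword side is handled by a full-text index and the semantic side by a vector dot product, both inside the database, as shown below.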
Setup
Let's first install the necessary libraries.
In [1]:
%pip install wget openai==1.3.3 --quiet
In [2]:
import json
import os

import pandas as pd
import wget
In [3]:
# Import the library for vectorizing the data (Up to 2 minutes)
!pip install sentence-transformers --quiet
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('flax-sentence-embeddings/all_datasets_v3_mpnet-base')
Import data from CSV file
This CSV file holds the title, summary, and category of approximately 2000 news articles.
In [4]:
# download the news articles csv file
csv_file_path = 'https://raw.githubusercontent.com/openai/openai-cookbook/main/examples/data/AG_news_samples.csv'
file_path = 'AG_news_samples.csv'

if not os.path.exists(file_path):
    wget.download(csv_file_path, file_path)
    print('File downloaded successfully.')
else:
    print('File already exists in the local file system.')
In [5]:
df = pd.read_csv('AG_news_samples.csv')
df
In [6]:
data = df.to_dict(orient='records')
data[0]
Action Required
If you have a Free Starter Workspace deployed already, select the database from the drop-down menu at the top of this notebook. It updates the connection_url to connect to that database.
Set up the database
Set up the SingleStoreDB database that will hold your data.
In [7]:
shared_tier_check = %sql show variables like 'is_shared_tier'
if not shared_tier_check or shared_tier_check[0][1] == 'OFF':
    %sql DROP DATABASE IF EXISTS news;
    %sql CREATE DATABASE news;
Action Required
Make sure to select a database from the drop-down menu at the top of this notebook. It updates the connection_url to connect to that database.
In [8]:
%%sql
DROP TABLE IF EXISTS news_articles;
CREATE TABLE IF NOT EXISTS news_articles (
    title TEXT,
    description TEXT,
    genre TEXT,
    embedding BLOB,
    FULLTEXT (title, description)
);
Get embeddings for every row based on the description column
In [9]:
# Will take around 3.5 minutes to get embeddings for all 2000 rows
descriptions = [row['description'] for row in data]
all_embeddings = model.encode(descriptions)
all_embeddings.shape
Merge embedding values into data rows
In [10]:
for row, embedding in zip(data, all_embeddings):
    row['embedding'] = embedding
Here's an example of one row of the combined data.
In [11]:
data[0]
Populate the database
In [12]:
%sql TRUNCATE TABLE news_articles;

import sqlalchemy as sa
from singlestoredb import create_engine

# Use create_engine from singlestoredb since it uses the notebook connection URL
conn = create_engine().connect()

statement = sa.text('''
    INSERT INTO news.news_articles (
        title,
        description,
        genre,
        embedding
    )
    VALUES (
        :title,
        :description,
        :label,
        :embedding
    )
''')

conn.execute(statement, data)
Semantic search
Connect to OpenAI
In [13]:
import openai

EMBEDDING_MODEL = 'text-embedding-ada-002'
GPT_MODEL = 'gpt-3.5-turbo'
In [14]:
import getpass

openai.api_key = getpass.getpass('OpenAI API Key: ')
Run semantic search and get scores
In [15]:
search_query = 'Articles about Aussie captures'
search_embedding = model.encode(search_query)

# Create the SQL statement.
query_statement = sa.text('''
    SELECT
        title,
        description,
        genre,
        DOT_PRODUCT(embedding, :embedding) AS score
    FROM news.news_articles
    ORDER BY score DESC
    LIMIT 10
''')

# Execute the SQL statement.
results = pd.DataFrame(conn.execute(query_statement, dict(embedding=search_embedding)))
results
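The `DOT_PRODUCT` score above equals cosine similarity whenever the stored embeddings are unit-normalized (for sentence-transformers models this can be requested with `normalize_embeddings=True` in `model.encode`; whether a given model normalizes by default is model-specific). A quick NumPy check of that equivalence:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=768)
b = rng.normal(size=768)

# Scale each vector to unit length
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)

cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
dot_of_units = np.dot(a_unit, b_unit)

# For unit vectors, the plain dot product is the cosine similarity
assert np.isclose(cosine, dot_of_units)
```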
Hybrid search
This search averages the score from the semantic search with the score from the keyword search, then sorts the news articles by this combined score to perform an effective hybrid search.
In [16]:
hyb_query = 'Articles about Aussie captures'
hyb_embedding = model.encode(hyb_query)

# Create the SQL statement.
hyb_statement = sa.text('''
    SELECT
        title,
        description,
        genre,
        DOT_PRODUCT(embedding, :embedding) AS semantic_score,
        MATCH(title, description) AGAINST (:query) AS keyword_score,
        (semantic_score + keyword_score) / 2 AS combined_score
    FROM news.news_articles
    ORDER BY combined_score DESC
    LIMIT 10
''')

# Execute the SQL statement.
hyb_results = pd.DataFrame(conn.execute(hyb_statement, dict(embedding=hyb_embedding, query=hyb_query)))
hyb_results
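One caveat with the plain average: the two scores live on different scales (the full-text `MATCH ... AGAINST` score is unbounded, while the dot product of unit vectors lies in [-1, 1]), so one signal can dominate. A common refinement is to min-max normalize each score before combining them. The sketch below does this in pandas on the `semantic_score` and `keyword_score` columns returned above; the function name and `alpha` weight are illustrative, not part of any library.

```python
import pandas as pd

def combine_scores(df: pd.DataFrame, alpha: float = 0.5) -> pd.DataFrame:
    # Min-max normalize each score column to [0, 1] so neither dominates
    out = df.copy()
    for col in ('semantic_score', 'keyword_score'):
        lo, hi = out[col].min(), out[col].max()
        out[col + '_norm'] = (out[col] - lo) / (hi - lo) if hi > lo else 0.0
    # Weighted combination; alpha is the weight on the semantic signal
    out['combined_score'] = (alpha * out['semantic_score_norm']
                             + (1 - alpha) * out['keyword_score_norm'])
    return out.sort_values('combined_score', ascending=False)
```

Normalizing inside a single SQL query is also possible with window functions, but doing it client-side on the top results keeps the query simple.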
Clean up
Action Required
If you created a new database in your Standard or Premium Workspace, you can drop the database by running the cell below. Note: this will not drop your database for Free Starter Workspaces. To drop a Free Starter Workspace, terminate the Workspace using the UI.
In [17]:
shared_tier_check = %sql show variables like 'is_shared_tier'
if not shared_tier_check or shared_tier_check[0][1] == 'OFF':
    %sql DROP DATABASE IF EXISTS news;
Details
License
This Notebook has been released under the Apache 2.0 open source license.