Semantic Search with OpenAI QA
Note
This notebook can be run on a Free Starter Workspace. To create a Free Starter Workspace, navigate to Start using the left nav. You can also use your existing Standard or Premium workspace with this Notebook.
In this Notebook you will use a combination of Semantic Search and a Large Language Model (LLM) to build a basic Retrieval Augmented Generation (RAG) application. For a great introduction to what RAG is, please read A Beginner's Guide to Retrieval Augmented Generation (RAG).
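As a rough mental model before any real services are involved, here is a minimal, self-contained sketch of that flow: embed the question, retrieve the most related text, and hand both to the model as context. The toy_embed and toy_retrieve helpers below are hypothetical stand-ins; the real OpenAI embedding, SingleStoreDB vector search, and chat completion calls are built step by step in the cells that follow.

def toy_embed(text: str) -> list[float]:
    # Toy stand-in for an embedding model: two crude text statistics.
    return [float(len(text)), float(text.count(' '))]

def toy_retrieve(query_vec: list[float], corpus: dict, k: int = 1) -> list[str]:
    # Toy stand-in for a vector search: return the k corpus texts closest to the query vector.
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return sorted(corpus, key=lambda text: squared_distance(corpus[text], query_vec))[:k]

corpus_texts = [
    'Curling was contested at the 2022 Winter Olympics.',
    'The opening ceremony took place in Beijing.',
]
corpus = {text: toy_embed(text) for text in corpus_texts}

question = 'Who won the curling gold medal?'
context = toy_retrieve(toy_embed(question), corpus)
prompt = f'Answer using only this context:\n{context}\n\nQuestion: {question}'
print(prompt)  # In the rest of this notebook, a prompt like this is sent to the LLM.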
Prerequisites for interacting with ChatGPT
Install OpenAI package
Let's start by installing the openai Python package.
In [1]:
!pip install openai==1.3.3 --quiet
Connect to ChatGPT and display the response
In [2]:
import openai

EMBEDDING_MODEL = "text-embedding-ada-002"
GPT_MODEL = "gpt-3.5-turbo"
You will need an OpenAI API key in order to use the openai Python library.
In [3]:
import getpass
import os

os.environ['OPENAI_API_KEY'] = getpass.getpass('OpenAI API Key: ')

client = openai.OpenAI()
Test the connection.
In [4]:
response = client.chat.completions.create(
    model=GPT_MODEL,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who won the gold medal for curling in Olympics 2022?"},
    ],
)

print(response.choices[0].message.content)
Get the data about the Winter Olympics and provide the information to ChatGPT as context
1. Install and import libraries
In [5]:
!pip install tabulate tiktoken wget --quiet
In [6]:
import json
import numpy as np
import os
import pandas as pd
import wget
2. Fetch the CSV data and read it into a DataFrame
Download pre-chunked text and pre-computed embeddings. This file is ~200 MB, so it may take a minute to download depending on your connection speed.
In [7]:
embeddings_url = "https://cdn.openai.com/API/examples/data/winter_olympics_2022.csv"
embeddings_path = "winter_olympics_2022.csv"

if not os.path.exists(embeddings_path):
    wget.download(embeddings_url, embeddings_path)
    print("File downloaded successfully.")
else:
    print("File already exists in the local file system.")
Here we are using the converters= parameter of the pd.read_csv function to convert the JSON array in the CSV file to numpy arrays.
In [8]:
def json_to_numpy_array(x: str | None) -> np.ndarray | None:
    """Convert JSON array string into numpy array."""
    return np.array(json.loads(x)) if x else None

df = pd.read_csv(embeddings_path, converters=dict(embedding=json_to_numpy_array))
df
In [9]:
df.info(show_counts=True)
3. Set up the database
Action Required
If you have a Free Starter Workspace deployed already, select the database from the drop-down menu at the top of this notebook. This updates the connection_url to connect to that database.
Create the database.
In [10]:
shared_tier_check = %sql show variables like 'is_shared_tier'
if not shared_tier_check or shared_tier_check[0][1] == 'OFF':
    %sql DROP DATABASE IF EXISTS winter_wikipedia;
    %sql CREATE DATABASE winter_wikipedia;
Action Required
Make sure to select the winter_wikipedia database from the drop-down menu at the top of this notebook. This updates the connection_url, which is used by the %%sql magic command and SQLAlchemy to connect to the selected database.
In [11]:
%%sql
CREATE TABLE IF NOT EXISTS winter_olympics_2022 (
    id INT PRIMARY KEY,
    text TEXT CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci,
    embedding BLOB
);
4. Populate the table with our DataFrame
Create a SQLAlchemy connection.
In [12]:
import singlestoredb as s2

conn = s2.create_engine().connect()
Use the to_sql method of the DataFrame to upload the data to the requested table.
In [13]:
df.to_sql(
    'winter_olympics_2022',
    con=conn,
    index=True,
    index_label='id',
    if_exists='append',
    chunksize=1000,
)
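As an optional sanity check (a minimal sketch, assuming the conn connection and table name used above), you can confirm how many rows were uploaded before running any searches:

import sqlalchemy as sa

row_count = conn.execute(sa.text('SELECT COUNT(*) FROM winter_olympics_2022')).scalar()
print(f'Rows uploaded: {row_count}')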
5. Do a semantic search with the same question from above and send the results to OpenAI as context
In [14]:
import sqlalchemy as sa
from scipy import spatial

def get_embedding(text: str, model: str = 'text-embedding-ada-002') -> list[float]:
    """Return the embedding for the given text."""
    return [x.embedding for x in client.embeddings.create(input=[text], model=model).data][0]

def strings_ranked_by_relatedness(
    query: str,
    df: pd.DataFrame,
    table_name: str,
    relatedness_fn=lambda x, y: 1 - spatial.distance.cosine(x, y),
    top_n: int = 100,
) -> tuple:
    """Returns a list of strings and relatednesses, sorted from most related to least."""
    # Get the embedding of the query.
    query_embedding_response = get_embedding(query, EMBEDDING_MODEL)

    # Create the SQL statement.
    stmt = sa.text(f"""
        SELECT
            text,
            DOT_PRODUCT_F64(JSON_ARRAY_PACK_F64(:embedding), embedding) AS score
        FROM {table_name}
        ORDER BY score DESC
        LIMIT :limit
    """)

    # Execute the SQL statement.
    results = conn.execute(stmt, dict(embedding=json.dumps(query_embedding_response), limit=top_n))

    strings = []
    relatednesses = []

    for row in results:
        strings.append(row[0])
        relatednesses.append(row[1])

    # Return the results.
    return strings[:top_n], relatednesses[:top_n]
In [15]:
from tabulate import tabulate

strings, relatednesses = strings_ranked_by_relatedness(
    "curling gold medal",
    df,
    "winter_olympics_2022",
    top_n=5,
)

for string, relatedness in zip(strings, relatednesses):
    print(f"{relatedness=:.3f}")
    print(tabulate([[string]], headers=['Result'], tablefmt='fancy_grid'))
    print('\n\n')
In [16]:
import tiktoken

def num_tokens(text: str, model: str = GPT_MODEL) -> int:
    """Return the number of tokens in a string."""
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))

def query_message(
    query: str,
    df: pd.DataFrame,
    model: str,
    token_budget: int,
) -> str:
    """Return a message for GPT, with relevant source texts pulled from SingleStoreDB."""
    strings, relatednesses = strings_ranked_by_relatedness(query, df, "winter_olympics_2022")
    introduction = 'Use the below articles on the 2022 Winter Olympics to answer the subsequent question. If the answer cannot be found in the articles, write "I could not find an answer."'
    question = f"\n\nQuestion: {query}"
    message = introduction
    for string in strings:
        next_article = f'\n\nWikipedia article section:\n"""\n{string}\n"""'
        if num_tokens(message + next_article + question, model=model) > token_budget:
            break
        else:
            message += next_article
    return message + question

def ask(
    query: str,
    df: pd.DataFrame = df,
    model: str = GPT_MODEL,
    token_budget: int = 4096 - 500,
    print_message: bool = False,
) -> str:
    """Answers a query using GPT and a table of relevant texts and embeddings in SingleStoreDB."""
    message = query_message(query, df, model=model, token_budget=token_budget)
    if print_message:
        print(message)
    messages = [
        {"role": "system", "content": "You answer questions about the 2022 Winter Olympics."},
        {"role": "user", "content": message},
    ]
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=0,
    )
    response_message = response.choices[0].message.content
    return response_message
In [17]:
print(ask('Who won the gold medal for curling in Olympics 2022?'))
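To inspect exactly which Wikipedia article sections were retrieved and packed into the prompt, you can pass print_message=True, which the ask function defined above supports:

print(ask('Who won the gold medal for curling in Olympics 2022?', print_message=True))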
Clean up
Action Required
If you created a new database in your Standard or Premium Workspace, you can drop the database by running the cell below. Note: this will not drop your database for Free Starter Workspaces. To drop a Free Starter Workspace, terminate the Workspace using the UI.
In [18]:
shared_tier_check = %sql show variables like 'is_shared_tier'
if not shared_tier_check or shared_tier_check[0][1] == 'OFF':
    %sql DROP DATABASE IF EXISTS winter_wikipedia;
License
This Notebook has been released under the Apache 2.0 open source license.