Generate vector embeddings with Azure OpenAI

To perform semantic searches, you must first generate embedding vectors from a model, store them in a vector database, and then query the embeddings. You’ll create a database, populate it with sample data, and run semantic searches against those listings.

By the end of this exercise, you’ll have an Azure Database for PostgreSQL flexible server instance with the vector and azure_ai extensions enabled. You’ll generate embeddings for the Seattle Airbnb Open Data dataset’s listings table. You’ll also run semantic searches against these listings by generating a query’s embedding vector and performing a vector cosine distance search.

Before you start

You need an Azure subscription with administrative rights.

Deploy resources into your Azure subscription

This step guides you through using Azure CLI commands from the Azure Cloud Shell to create a resource group and run a Bicep script to deploy the Azure services necessary for completing this exercise into your Azure subscription.

Note: If you are doing multiple modules in this learning path, you can share the Azure environment between them. In that case, you only need to complete this resource deployment step once.

Open a web browser and navigate to the Azure portal.
Select the Cloud Shell icon in the Azure portal toolbar to open a new Cloud Shell pane at the bottom of your browser window.

If prompted, select the required options to open a Bash shell. If you have previously used a PowerShell console, switch it to a Bash shell.
At the Cloud Shell prompt, enter the following to clone the GitHub repo containing exercise resources:
```
 git clone https://github.com/MicrosoftLearning/mslearn-postgresql.git
```
Next, you run three commands to define variables to reduce redundant typing when using Azure CLI commands to create Azure resources. The variables represent the name to assign to your resource group (RG_NAME), the Azure region (REGION) into which resources will be deployed, and a randomly generated password for the PostgreSQL administrator login (ADMIN_PASSWORD).

In the first command, the region assigned to the corresponding variable is eastus, but you can also replace it with a location of your preference. However, if replacing the default, you must select another Azure region that supports abstractive summarization to ensure you can complete all of the tasks in the modules in this learning path.
```
 REGION=eastus
```
The following command assigns the name to be used for the resource group that will house all the resources used in this exercise. The resource group name assigned to the corresponding variable is rg-learn-postgresql-ai-$REGION, where $REGION is the location you specified above. However, you can change it to any other resource group name that suits your preference.
```
 RG_NAME=rg-learn-postgresql-ai-$REGION
```
The final command randomly generates a password for the PostgreSQL admin login. Make sure you copy it to a safe place to use later to connect to your PostgreSQL flexible server.
```
 a=()
 for i in {a..z} {A..Z} {0..9}; 
     do
     a[$RANDOM]=$i
     done
 ADMIN_PASSWORD=$(IFS=; echo "${a[*]::18}")
 echo "Your randomly generated PostgreSQL admin user's password is:"
 echo $ADMIN_PASSWORD
```
If you have access to more than one Azure subscription, and your default subscription is not the one in which you want to create the resource group and other resources for this exercise, run this command to set the appropriate subscription, replacing the <subscriptionName|subscriptionId> token with either the name or ID of the subscription you want to use:
```
 az account set --subscription <subscriptionName|subscriptionId>
```
Run the following Azure CLI command to create your resource group:
```
 az group create --name $RG_NAME --location $REGION
```
Finally, use the Azure CLI to execute a Bicep deployment script to provision Azure resources in your resource group:
```
 az deployment group create --resource-group $RG_NAME --template-file "mslearn-postgresql/Allfiles/Labs/Shared/deploy.bicep" --parameters restore=false adminLogin=pgAdmin adminLoginPassword=$ADMIN_PASSWORD
```
The Bicep deployment script provisions the Azure services required to complete this exercise into your resource group. The resources deployed include an Azure Database for PostgreSQL - Flexible Server, Azure OpenAI, and an Azure AI Language service. The Bicep script also performs some configuration steps, such as adding the azure_ai and vector extensions to the PostgreSQL server’s allowlist (via the azure.extensions server parameter), creating a database named rentals on the server, and adding a deployment named embedding using the text-embedding-ada-002 model to your Azure OpenAI service. Note that the Bicep file is shared by all modules in this learning path, so you may only use some of the deployed resources in some exercises.

The deployment typically takes several minutes to complete. You can monitor it from the Cloud Shell or navigate to the Deployments page for the resource group you created above and observe the deployment progress there.
Close the Cloud Shell pane once your resource deployment is complete.

Troubleshooting deployment errors

You may encounter a few errors when running the Bicep deployment script.

If you previously ran the Bicep deployment script for this learning path and subsequently deleted the resources, you may receive an error message like the following if you are attempting to rerun the script within 48 hours of deleting the resources:

  {"code": "InvalidTemplateDeployment", "message": "The template deployment 'deploy' is not valid according to the validation procedure. The tracking id is '4e87a33d-a0ac-4aec-88d8-177b04c1d752'. See inner errors for details."}
    
  Inner Errors:
  {"code": "FlagMustBeSetForRestore", "message": "An existing resource with ID '/subscriptions/{subscriptionId}/resourceGroups/rg-learn-postgresql-ai-eastus/providers/Microsoft.CognitiveServices/accounts/{accountName}' has been soft-deleted. To restore the resource, you must specify 'restore' to be 'true' in the property. If you don't want to restore existing resource, please purge it first."}

If you receive this message, modify the azure deployment group create command above to set the restore parameter equal to true and rerun it.

If the selected region is restricted from provisioning specific resources, you must set the REGION variable to a different location and rerun the commands to create the resource group and run the Bicep deployment script.

  {"status":"Failed","error":{"code":"DeploymentFailed","target":"/subscriptions/{subscriptionId}/resourceGroups/{resourceGrouName}/providers/Microsoft.Resources/deployments/{deploymentName}","message":"At least one resource deployment operation failed. Please list deployment operations for details. Please see https://aka.ms/arm-deployment-operations for usage details.","details":[{"code":"ResourceDeploymentFailure","target":"/subscriptions/{subscriptionId}/resourceGroups/{resourceGrouName}/providers/Microsoft.DBforPostgreSQL/flexibleServers/{serverName}","message":"The resource write operation failed to complete successfully, because it reached terminal provisioning state 'Failed'.","details":[{"code":"RegionIsOfferRestricted","message":"Subscriptions are restricted from provisioning in this region. Please choose a different region. For exceptions to this rule please open a support request with Issue type of 'Service and subscription limits'. See https://review.learn.microsoft.com/en-us/azure/postgresql/flexible-server/how-to-request-quota-increase for more details."}]}]}}

If the script is unable to create an AI resource due to the requirement to accept the responsible AI agreement, you may experience the following error; in which case use the Azure Portal user interface to create an Azure AI Services resource, and then re-run the deployment script.

  {"code": "InvalidTemplateDeployment", "message": "The template deployment 'deploy' is not valid according to the validation procedure. The tracking id is 'f8412edb-6386-4192-a22f-43557a51ea5f'. See inner errors for details."}
     
  Inner Errors:
  {"code": "ResourceKindRequireAcceptTerms", "message": "This subscription cannot create TextAnalytics until you agree to Responsible AI terms for this resource. You can agree to Responsible AI terms by creating a resource through the Azure Portal then trying again. For more detail go to https://go.microsoft.com/fwlink/?linkid=2164190"}

Connect to your database using psql in the Azure Cloud Shell

In this task, you connect to the rentals database on your Azure Database for PostgreSQL server using the psql command-line utility from the Azure Cloud Shell.

In the Azure portal, navigate to your newly created Azure Database for PostgreSQL - Flexible Server.
In the resource menu, under Settings, select Databases select Connect for the rentals database. Note that selecting Connect does not actually connect you to the database; it simply provides instructions for connecting to the database using various methods. Review the instructions to Connect from browser or locally and use those to connect using the Azure Cloud Shell.
At the “Password for user pgAdmin” prompt in the Cloud Shell, enter the randomly generated password for the pgAdmin login.

Once logged in, the psql prompt for the rentals database is displayed.
Throughout the remainder of this exercise, you continue working in the Cloud Shell, so it may be helpful to expand the pane within your browser window by selecting the Maximize button at the top right of the pane.

Setup: Configure extensions

To store and query vectors, and to generate embeddings, you need to allow-list and enable two extensions for Azure Database for PostgreSQL Flexible Server: vector and azure_ai.

To allow-list both extensions, add vector and azure_ai to the server parameter azure.extensions, as per the instructions provided in How to use PostgreSQL extensions.
Run the following SQL command to enable the vector extension. For detailed instructions, read How to enable and use pgvector on Azure Database for PostgreSQL - Flexible Server.
```
 CREATE EXTENSION vector;
```

To enable the azure_ai extension, run the following SQL command. You’ll need the endpoint and API key for the Azure OpenAI resource. For detailed instructions, read Enable the azure_ai extension.

 CREATE EXTENSION azure_ai;
 SELECT azure_ai.set_setting('azure_openai.endpoint', 'https://<endpoint>.openai.azure.com');
 SELECT azure_ai.set_setting('azure_openai.subscription_key', '<API Key>');

Populate the database with sample data

Before you explore the azure_ai extension, add a couple of tables to the rentals database and populate them with sample data so you have information to work with as you review the extension’s functionality.

Run the following commands to create the listings and reviews tables for storing rental property listing and customer review data:

 DROP TABLE IF EXISTS listings;
    
 CREATE TABLE listings (
     id int,
     name varchar(100),
     description text,
     property_type varchar(25),
     room_type varchar(30),
     price numeric,
     weekly_price numeric
 );

 DROP TABLE IF EXISTS reviews;
    
 CREATE TABLE reviews (
     id int,
     listing_id int, 
     date date,
     comments text
 );

Next, use the COPY command to load data from CSV files into each table you created above. Start by running the following command to populate the listings table:
```
 \COPY listings FROM 'mslearn-postgresql/Allfiles/Labs/Shared/listings.csv' CSV HEADER
```
The command output should be COPY 50, indicating that 50 rows were written into the table from the CSV file.
Finally, run the command below to load customer reviews into the reviews table:
```
 \COPY reviews FROM 'mslearn-postgresql/Allfiles/Labs/Shared/reviews.csv' CSV HEADER
```
The command output should be COPY 354, indicating that 354 rows were written into the table from the CSV file.

To reset your sample data, you can execute DROP TABLE listings, and repeat these steps.

Create and store embedding vectors

Now that we have some sample data, it’s time to generate and store the embedding vectors. The azure_ai extension makes calling the Azure OpenAI embedding API easy.

Add the embedding vector column.

The text-embedding-ada-002 model is configured to return 1,536 dimensions, so use that for the vector column size.
```
 ALTER TABLE listings ADD COLUMN listing_vector vector(1536);
```
Generate an embedding vector for the description of each listing by calling Azure OpenAI through the create_embeddings user-defined function, which is implemented by the azure_ai extension:
```
 UPDATE listings
 SET listing_vector = azure_openai.create_embeddings('embedding', description, max_attempts => 5, retry_delay_ms => 500)
 WHERE listing_vector IS NULL;
```
Note that this may take several minutes, depending on the available quota.

See an example vector by running this query:

 SELECT listing_vector FROM listings LIMIT 1;

You will get a result similar to this, but with 1536 vector columns:

 postgres=> SELECT listing_vector FROM listings LIMIT 1;
 -[ RECORD 1 ]--+------ ...
 listing_vector | [-0.0018742813,-0.04530062,0.055145424, ... ]

Perform a semantic search query

Now that you have listing data augmented with embedding vectors, it’s time to run a semantic search query. To do so, get the query string embedding vector, then perform a cosine search to find the listings whose descriptions are most semantically similar to the query.

Generate the embedding for the query string.

 SELECT azure_openai.create_embeddings('embedding', 'bright natural light');

You will get a result like this:

 -[ RECORD 1 ]-----+-- ...
 create_embeddings | {-0.0020871465,-0.002830255,0.030923981, ...}

Use the embedding in a cosine search (<=> represents cosine distance operation), fetching the top 10 most similar listings to the query.

 SELECT id, name FROM listings ORDER BY listing_vector <=> azure_openai.create_embeddings('embedding', 'bright natural light')::vector LIMIT 10;

You’ll get a result similar to this. Results may vary, as embedding vectors are not guaranteed to be deterministic:

     id    |                name                
 ----------+-------------------------------------
  6796336  | A duplex near U district!
  7635966  | Modern Capitol Hill Apartment
  7011200  | Bright 1 bd w deck. Great location
  8099917  | The Ravenna Apartment
  10211928 | Charming Ravenna Bungalow
  692671   | Sun Drenched Ballard Apartment
  7574864  | Modern Greenlake Getaway
  7807658  | Top Floor Corner Apt-Downtown View
  10265391 | Art filled, quiet, walkable Seattle
  5578943  | Madrona Studio w/Private Entrance

You may also project the description column to be able to read the text of the matching rows whose descriptions were semantically similar. For example, this query returns the best match:

 SELECT id, description FROM listings ORDER BY listing_vector <=> azure_openai.create_embeddings('embedding', 'bright natural light')::vector LIMIT 1;

Which prints something like:

    id    | description
 ---------+------------
  6796336 | This is a great place to live for summer because you get a lot of sunlight at the living room. A huge living room space with comfy couch and one ceiling window and glass windows around the living room.

To intuitively understand semantic search, observe that the description doesn’t actually contain the terms “bright” or “natural.” But it does highlight “summer” and “sunlight,” “windows,” and a “ceiling window.”

Check your work

After performing the above steps, the listings table contains sample data from Seattle Airbnb Open Data on Kaggle. The listings were augmented with embedding vectors to execute semantic searches.

Confirm the listings table has four columns: id, name, description, and listing_vector.

 \d listings

It should print something like:

                         Table "public.listings"
       Column    |         Type           | Collation | Nullable | Default 
 ----------------+------------------------+-----------+----------+---------
   id            | integer                |           | not null | 
   name          | character varying(255) |           | not null | 
   description   | text                   |           | not null | 
  listing_vector | vector(1536)           |           |          | 
  Indexes:
     "listings_pkey" PRIMARY KEY, btree (id)

Confirm that at least one row has a populated listing_vector column.
```
 SELECT COUNT(*) > 0 FROM listings WHERE listing_vector IS NOT NULL;
```
The result must show a t, meaning true. An indication that there’s at least one row with embeddings of its corresponding description column:
```
 ?column? 
 ----------
 t
 (1 row)
```
Confirm the embedding vector has 1536 dimensions:
```
 SELECT vector_dims(listing_vector) FROM listings WHERE listing_vector IS NOT NULL LIMIT 1;
```
Yielding:
```
 vector_dims 
 -------------
         1536
 (1 row)
```

Confirm that semantic searches return results.

Use the embedding in a cosine search, fetching the top 10 most similar listings to the query.

 SELECT id, name FROM listings ORDER BY listing_vector <=> azure_openai.create_embeddings('embedding', 'bright natural light')::vector LIMIT 10;

You’ll get a result like this, depending on which rows were assigned embedding vectors:

  id |                name                
 --------+-------------------------------------
  315120 | Large, comfy, light, garden studio
  429453 | Sunny Bedroom #2 w/View: Wallingfrd
  17951  | West Seattle, The Starlight Studio
  48848  | green suite seattle - dog friendly
  116221 | Modern, Light-Filled Fremont Flat
  206781 | Bright & Spacious Studio
  356566 | Sunny Bedroom w/View: Wallingford
  9419   | Golden Sun vintage warm/sunny
  136480 | Bright Cheery Room in Seattle House
  180939 | Central District Green GardenStudio

Clean up

Once you have completed this exercise, delete the Azure resources you created. You are charged for the configured capacity, not how much the database is used. Follow these instructions to delete your resource group and all resources you created for this lab.

Open a web browser and navigate to the Azure portal, and on the home page, select Resource groups under Azure services.
In the filter for any field search box, enter the name of the resource group you created for this lab, and then select your resource group from the list.
On the Overview page of your resource group, select Delete resource group.
In the confirmation dialog, enter the resource group name you are deleting to confirm and then select Delete.