Use speech-capable generative AI models
Increasingly, generative AI model capabilities are evolving beyond text-based language completion to support content in other formats - including audible speech.
In this exercise, you'll use generative AI models to support two common scenarios:
- Speech synthesis (text-to-speech) - generating speech output.
- Speech recognition (speech-to-text) - transcribing speech input.
While this exercise is based on Python, you can develop generative AI speech applications using SDKs for multiple languages.
This exercise takes approximately 30 minutes.
Prerequisites
Before starting this exercise, ensure you have:
- An active Azure subscription
- Visual Studio Code installed
- Python version 3.13 or higher installed
- Git installed and configured
Create a Microsoft Foundry project
Microsoft Foundry uses projects to organize models, resources, data, and other assets used to develop an AI solution.
- In a web browser, open the Microsoft Foundry portal at https://ai.azure.com and sign in using your Azure credentials. Close any tips or quick start panes that are opened the first time you sign in, and if necessary use the Foundry logo at the top left to navigate to the home page.
- If it is not already enabled, in the toolbar at the top of the page, enable the New Foundry option. Then, if prompted, create a new project with a unique name; expanding the Advanced options area to specify the following settings for your project:
- Foundry resource: Use the default name for your resource (usually {project_name}-resource)
- Subscription: Your Azure subscription
- Resource group: Create or select a resource group
- Region: Select any of the AI Foundry recommended regions
Tip: Make a note of the region you selected. You'll need it later!
- Select Create. Wait for your project to be created.
Deploy models
To develop speech-enabled apps, we're going to need speech-enabled models. Specifically, we need a model that can generate speech, and a model that can process speech input.
Deploy a speech-generation model
- On the project home page, in the Start building menu, select Browse models.
- In the model catalog, search for gpt-4o-mini-tts.
- Review the model card, and then deploy it using the default settings.
- When the model has been deployed, view its details, noting that the Target URI and Key required to use it are available here (you'll need these later).
Deploy a speech-recognition model
- In the Foundry portal menu bar, select Build; and then view the Models page. Note that the gpt-4o-mini-tts model you deployed is listed.
- Select **Deploy a base model**, and search the catalog for gpt-4o-mini-transcribe.
- Deploy a gpt-4o-mini-transcribe model using the default settings.
- Return to the Models page and verify that both of the models you deployed are listed.
Get the application files from GitHub
The initial application files you'll need to develop speech applications are provided in a GitHub repo.
- Open Visual Studio Code.
- Open the command palette (Ctrl+Shift+P) and use the Git: Clone command to clone the https://github.com/microsoftlearning/mslearn-ai-language repo to a local folder (it doesn't matter which one). Then open it. You may be prompted to confirm you trust the authors.
- In Visual Studio Code, view the Extensions pane; and if it is not already installed, install the Python extension.
- In the Command Palette, use the Python: Select Interpreter command. Then select an existing environment if you have one, or create a new Venv environment based on your Python 3.1x installation.

  Tip: If you are prompted to install dependencies, you can install the ones in the requirements.txt file in the /Labfiles/03-gen-ai-speech/Python/generate-speech folder; but it's OK if you don't - we'll install them later!
Create a speech-generation app
- After the repo has been cloned, in the Explorer pane, navigate to the folder containing the application code files at /Labfiles/03-gen-ai-speech/Python/generate-speech. The application files include:
- .env (the application configuration file)
- requirements.txt (the Python package dependencies that need to be installed)
- generate-speech.py (the code file for the application)
Configure your application
- In the Explorer pane, right-click the generate-speech folder containing the application files, and select Open in integrated terminal (or open a terminal from the Terminal menu and navigate to the /Labfiles/03-gen-ai-speech/Python/generate-speech folder.)
Note: Opening the terminal in Visual Studio Code will automatically activate the Python environment. You may need to enable running scripts on your system.
- Ensure that the terminal is open in the generate-speech folder with the prefix (.venv) to indicate that the Python environment you created is active.
- Install the OpenAI SDK package and other required packages by running the following command:

  ```
  pip install -r requirements.txt openai
  ```

- In the Explorer pane, in the generate-speech folder, select the .env file to open it. Then update the configuration values to include the Target URI (endpoint) and key for your gpt-4o-mini-tts model.
Tip: Copy the Target URI and key from the model details page in the Foundry portal.
Save the modified configuration file.
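For reference, a .env file for this kind of app typically contains entries along these lines. The variable names and placeholder values shown here are illustrative assumptions - keep the names that are already defined in the lab's .env file, and paste in your own Target URI and key:

```
ENDPOINT="https://<your-resource>.cognitiveservices.azure.com/"
API_KEY="<your-key>"
MODEL_DEPLOYMENT="gpt-4o-mini-tts"
```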
Write code to use the model for speech-generation
- In the Explorer pane, in the generate-speech folder, select the generate-speech.py file to open it.
- Review the existing code. You will add code to use the OpenAI SDK to access your model.
Tip: As you add code to the code file, be sure to maintain the correct indentation.
- At the top of the code file, under the existing namespace references, find the comment Import namespaces and add the following code to import the namespace you will need to use the OpenAI SDK:

  ```python
  # Import namespaces
  from openai import AzureOpenAI
  ```

- In the main function, note that code to load the endpoint and key from the configuration file has already been provided. Then find the comment Create the Azure OpenAI client, and add the following code to create a client for the OpenAI API:

  ```python
  # Create the Azure OpenAI client
  client = AzureOpenAI(
      azure_endpoint=endpoint,
      api_key=key,
      api_version="2025-03-01-preview"
  )
  ```

- Find the comment Generate speech and save to file, and add the following code to submit a prompt to the speech-generation model and save the response as a file.

  ```python
  # Generate speech and save to file
  with client.audio.speech.with_streaming_response.create(
      model=model_deployment,
      voice="alloy",
      input="My voice is my passport!",
      instructions="Speak in a serious tone.",
  ) as response:
      response.stream_to_file(speech_file_path)
  ```

- Save the changes to the code file.
Run the application
- In the terminal pane, enter the following command to run the program (you can maximize or resize the terminal pane to see more output):

  ```
  python generate-speech.py
  ```

- Observe the output as the code generates the requested speech and saves it in a file. The code should also play the generated audio file.
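If you don't hear any audio, a quick way to troubleshoot is to verify that the output file was actually written. The following sanity check is illustrative and not part of the lab code; the file name shown is an assumption, so check the lab code for the actual path it writes to:

```python
# Illustrative sanity check (not part of the lab code): confirm the
# generated audio file exists and is non-empty before trying to play it.
from pathlib import Path

def audio_ready(path: str) -> bool:
    """Return True if the file at `path` exists and contains data."""
    p = Path(path)
    return p.exists() and p.stat().st_size > 0

# Example usage (the output file name is an assumption):
print(audio_ready("speech.mp3"))
```

If this prints False, the request to the model probably failed; re-check the endpoint and key values in your .env file.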
Create a speech-transcription app
- In the Explorer pane, navigate to the folder containing the application code files at /Labfiles/03-gen-ai-speech/Python/transcribe-speech. The application files include:
- .env (the application configuration file)
- requirements.txt (the Python package dependencies that need to be installed)
- transcribe-speech.py (the code file for the application)
Configure your application
- In the Explorer pane, right-click the transcribe-speech folder containing the application files, and select Open in integrated terminal (or in the existing terminal, navigate to the /Labfiles/03-gen-ai-speech/Python/transcribe-speech folder.)
Note: Opening the terminal in Visual Studio Code will automatically activate the Python environment. You may need to enable running scripts on your system.
- Ensure that the terminal is open in the transcribe-speech folder with the prefix (.venv) to indicate that the Python environment you created previously is active.
- Install the OpenAI SDK package and other required packages by running the following command:

  ```
  pip install -r requirements.txt openai
  ```

  Note: This step isn't actually necessary if you completed the previous part of this exercise, as both apps use the same environment and have the same dependencies - but it won't do any harm!
- In the Explorer pane, in the transcribe-speech folder, select the .env file to open it. Then update the configuration values to include the Target URI (endpoint) and key for your gpt-4o-mini-transcribe model.
Tip: Copy the Target URI and key from the model details page in the Foundry portal.
Save the modified configuration file.
Write code to use the model for speech-transcription
- In the Explorer pane, in the transcribe-speech folder, select the transcribe-speech.py file to open it.
- Review the existing code. You will add code to use the OpenAI SDK to access your model.
Tip: As you add code to the code file, be sure to maintain the correct indentation.
- At the top of the code file, under the existing namespace references, find the comment Import namespaces and add the following code to import the namespace you will need to use the OpenAI SDK:

  ```python
  # Import namespaces
  from openai import AzureOpenAI
  ```

- In the main function, note that code to load the endpoint and key from the configuration file has already been provided. Then find the comment Create the Azure OpenAI client, and add the following code to create a client for the OpenAI API:

  ```python
  # Create the Azure OpenAI client
  client = AzureOpenAI(
      azure_endpoint=endpoint,
      api_key=key,
      api_version="2025-03-01-preview"
  )
  ```

- Find the comment Call model to transcribe audio file, and add the following code to submit an audio file to the speech-transcription model and generate a transcript.

  ```python
  # Call model to transcribe audio file
  audio_file = open(file_path, "rb")
  transcription = client.audio.transcriptions.create(
      model=model_deployment,
      file=audio_file,
      response_format="text"
  )
  print(transcription)
  ```

- Save the changes to the code file.
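As an aside, the snippet above opens the audio file but never explicitly closes it. A slightly tidier variant (a sketch, not the lab's exact code) wraps the call in a function and uses a `with` block so the file is closed automatically after the request; `client` and `model_deployment` are assumed to be defined as in the lab app:

```python
# Sketch: the same transcription call, with the audio file managed by a
# `with` block so it is closed automatically after the request completes.
def transcribe(client, model_deployment, file_path):
    with open(file_path, "rb") as audio_file:
        return client.audio.transcriptions.create(
            model=model_deployment,
            file=audio_file,
            response_format="text",
        )
```

You could then call `print(transcribe(client, model_deployment, file_path))` in place of the original block; the behavior is the same.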
Run the application
- In the terminal pane, enter the following command to run the program (you can maximize or resize the terminal pane to see more output):

  ```
  python transcribe-speech.py
  ```

- Observe the output as the code submits the audio file to the model for transcription and displays the results. The code should also play the audio file.
Clean up
If you've finished exploring speech-enabled models in Foundry Tools, you should delete the resources you have created in this exercise to avoid incurring unnecessary Azure costs.
- Open the Azure portal and view the contents of the resource group where you deployed the resources used in this exercise.
- On the toolbar, select Delete resource group.
- Enter the resource group name and confirm that you want to delete it.