Explore AI speech

In this exercise, you’ll interact with a generative AI model using speech, exploring speech-to-text (STT) and text-to-speech (TTS) functionality along the way.

This exercise should take approximately 15 minutes to complete.

Open the Chat Playground app

Let’s start by chatting with a generative AI model. In this exercise, we’ll use a browser-based application to chat with a small language model that is useful for general chat solutions in low-bandwidth scenarios. The app also uses the Web Speech API for speech recognition and synthesis.

Note: If your browser supports WebGPU, the chat playground uses the Microsoft Phi 3 Mini model running on your computer’s GPU. If not, it uses the SmolLM2 model running on the CPU, with lower response quality. Performance for either model may vary depending on the available memory in your computer and the network bandwidth available to download the model. After opening the app, use the ? (About this app) icon in the chat area to find out more.
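The GPU-or-CPU choice described in the note can be sketched as a simple feature check. This is a minimal illustration only, not the app's actual code: the `chooseModel` function, the `NavigatorLike` shape, and the model labels are hypothetical names introduced here. In a real browser, the presence of `navigator.gpu` is only a first signal; an app would typically also call `navigator.gpu.requestAdapter()` and fall back if it resolves to `null`.

```typescript
// Hypothetical labels for illustration; the app's real model identifiers
// are not given in this exercise.
type ModelChoice = "Phi 3 Mini (GPU)" | "SmolLM2 (CPU)";

// The only part of the browser's navigator object this sketch looks at.
interface NavigatorLike {
  gpu?: unknown;
}

// Picks the GPU model when the WebGPU entry point is present,
// otherwise the CPU fallback.
function chooseModel(nav: NavigatorLike): ModelChoice {
  const hasWebGpu = nav.gpu !== undefined && nav.gpu !== null;
  return hasWebGpu ? "Phi 3 Mini (GPU)" : "SmolLM2 (CPU)";
}

// In a browser you would pass the global object: chooseModel(navigator)
console.log(chooseModel({ gpu: {} })); // Phi 3 Mini (GPU)
console.log(chooseModel({}));          // SmolLM2 (CPU)
```

Taking the navigator-like object as a parameter keeps the check easy to try outside a browser.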

In a web browser, open the Chat Playground at https://aka.ms/chat-playground.

The app initializes by downloading a language model.

Tip: The first time you download a model, it may take a few minutes. Subsequent downloads will be faster. If your browser or operating system does not support WebGPU models, the app selects the fallback CPU-based model, which is slower and produces lower-quality responses.
View the Chat Playground app.

Configure Voice mode

The Chat Playground application supports voice mode, in which you can interact with a generative AI model using speech.

Note: Voice mode depends on browser support for the Web Speech API and access to voices for speech synthesis. The app should work in most modern browsers. If your browser configuration is not compatible, you may experience errors, and voice mode may not work for you.
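As a rough sketch of what "browser support" means here, the check below looks for the two Web Speech API entry points: a speech recognition constructor (exposed with a `webkit` prefix in some browsers) and the `speechSynthesis` object. The `detectVoiceSupport` function and the `SpeechGlobals` and `VoiceModeSupport` types are hypothetical names for illustration; this is not how the Chat Playground app is known to perform its check.

```typescript
// The browser globals this sketch inspects. In a real page these live
// on the window object.
interface SpeechGlobals {
  SpeechRecognition?: unknown;
  webkitSpeechRecognition?: unknown;
  speechSynthesis?: unknown;
}

interface VoiceModeSupport {
  recognition: boolean; // speech-to-text is available
  synthesis: boolean;   // text-to-speech is available
}

// Reports which halves of the Web Speech API appear to be present.
function detectVoiceSupport(scope: SpeechGlobals): VoiceModeSupport {
  return {
    recognition:
      scope.SpeechRecognition !== undefined ||
      scope.webkitSpeechRecognition !== undefined,
    synthesis: scope.speechSynthesis !== undefined,
  };
}

// Example: a browser that can speak but cannot listen.
console.log(detectVoiceSupport({ speechSynthesis: {} }));
```

Presence of these objects does not guarantee that recognition works, since it can also depend on microphone permissions and network access.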

In the pane on the left, under the selected model, enable Voice mode.

If the Configuration pane is not displayed automatically on the right, open it using the Configuration (⚙) button above the Chat pane.
In the configuration pane, view the voices in the Voice drop-down list.

Text-to-speech solutions use voices to control the cadence, pronunciation, timbre, and other aspects of generated speech. The available voices depend on your browser and operating system, and can include local voices installed in the operating system as well as online voices available for your browser.
Select any of the available voices, and use the Preview selected voice (▷) button to hear a sample of the voice.

Note: Online voices are downloaded on demand, which may take a few seconds. The app verifies that they are loaded successfully, and displays an error if not.
When you have selected the voice you want to use, close the Configuration pane.
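The distinction between local and online voices mentioned in the steps above maps onto the `localService` property that the Web Speech API exposes on each voice. The sketch below groups a list of voices by that flag. The `VoiceInfo` type, the `groupVoices` function, and the sample voice names are hypothetical; in a browser, the input would come from `speechSynthesis.getVoices()`, which can return an empty list until the `voiceschanged` event has fired.

```typescript
// A subset of the fields found on a browser SpeechSynthesisVoice object.
interface VoiceInfo {
  name: string;
  lang: string;
  localService: boolean; // true when the voice is installed locally
}

// Splits voice names into locally installed and online (remote) groups.
function groupVoices(voices: VoiceInfo[]): { local: string[]; online: string[] } {
  const local: string[] = [];
  const online: string[] = [];
  for (const voice of voices) {
    (voice.localService ? local : online).push(voice.name);
  }
  return { local, online };
}

// Made-up sample data; real names depend on your browser and operating system.
console.log(
  groupVoices([
    { name: "Example Local Voice", lang: "en-US", localService: true },
    { name: "Example Online Voice", lang: "en-GB", localService: false },
  ])
);
```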

Use speech to interact with the model

The app supports both speech recognition and speech synthesis, enabling you to have a voice-based conversation with the model.

In the Chat pane, use the Start session button to start a conversation with the model. If prompted, allow access to the system microphone.
When the app status is Listening…, say something like “What is speech recognition?” and wait for a response.

Tip: If an error occurs or the app can’t detect any speech input, check your microphone settings and try again. Ambient noise may cause failures, though in some cases the issue may be that your browser setup does not support the Web Speech API for speech recognition.
Verify that the app status changes to Processing…. The app will process the spoken input, using speech-to-text to convert your speech to text and submit it to the model as a prompt.

Tip: Processing speech and retrieving a response from the model may take some time in this browser-based sample app - especially when using the CPU-based model. Be patient!
When the status changes to Speaking…, the app uses text-to-speech to vocalize the response from the model. When it’s finished, the original prompt and the response will be shown as text.

Tip: If no voices are available in your browser, only the text response will be shown.
To continue the conversation, use the Start session button to submit a second spoken prompt, such as “What is speech synthesis?”, and review the response.
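The status sequence you observed in the steps above (Listening, then Processing, then Speaking) can be modeled as a small state machine for a single spoken turn. This is a sketch under assumptions: the `Idle` status, the event names, and the `nextStatus` and `runTurn` functions are hypothetical and are not taken from the app. The `voicesAvailable` flag reflects the tip that only text is shown when no voices are available.

```typescript
type AppStatus = "Idle" | "Listening" | "Processing" | "Speaking";
type AppEvent =
  | "startSession"     // user selects Start session
  | "speechRecognized" // speech-to-text produced a prompt
  | "responseReady"    // the model returned a response
  | "speechEnded"      // text-to-speech finished
  | "error";           // anything went wrong

// Returns the next status for one event; unrelated events leave it unchanged.
function nextStatus(status: AppStatus, event: AppEvent, voicesAvailable: boolean = true): AppStatus {
  if (event === "error") return "Idle";
  if (status === "Idle" && event === "startSession") return "Listening";
  if (status === "Listening" && event === "speechRecognized") return "Processing";
  if (status === "Processing" && event === "responseReady") {
    // Without a voice, skip speaking and show only the text response.
    return voicesAvailable ? "Speaking" : "Idle";
  }
  if (status === "Speaking" && event === "speechEnded") return "Idle";
  return status;
}

// Replays a list of events from Idle and records each status reached.
function runTurn(events: AppEvent[], voicesAvailable: boolean = true): AppStatus[] {
  const seen: AppStatus[] = [];
  let current: AppStatus = "Idle";
  for (const e of events) {
    current = nextStatus(current, e, voicesAvailable);
    seen.push(current);
  }
  return seen;
}

console.log(
  runTurn(["startSession", "speechRecognized", "responseReady", "speechEnded"]).join(" -> ")
); // Listening -> Processing -> Speaking -> Idle
```

Because each turn ends back at Idle, a second prompt needs another Start session, which matches the single-turn behavior of this app.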

Summary

In this exercise, you explored the use of speech-to-text and text-to-speech with a generative AI model in a simple playground app.

The app used in this lab is based on a simplified version of the agent playground in the Microsoft Foundry portal, in which Azure Speech in Foundry Tools Voice Live capabilities can be added to an agent. While the app in this lab is limited to “single-turn” spoken interactions, the Azure Speech Voice Live capabilities in Microsoft Foundry include multi-turn real-time conversations with support for interruptions and background noise suppression.